Grafana and Prometheus: A Monitoring Stack That Scales Down

The dashboard pairing that powers data centres, running happily on a single Pi

By Smarc, 17-10-2023

There is a particular flavour of homelab anxiety that strikes at 3am: is the NAS still alive? Did the disk fill up while I slept? Is that container restarting in a loop, quietly burning through SD-card writes? You can answer these questions by SSHing in and squinting at df and top, or you can answer them with a graph. I am firmly in the graph camp, and for the better part of a decade the graph has come from Grafana fed by Prometheus.

The interesting thing about this pairing is that it’s the same stack large companies use to watch thousands of nodes, yet it scales all the way down to a single Raspberry Pi watching itself. You don’t outgrow it, and you don’t have to be Google to justify it.

Two tools, two jobs

People say “Grafana and Prometheus” as if they were one product, but they do quite different things and it pays to keep them straight in your head.

Prometheus is the database and the collector. It does one trick extremely well: every fifteen seconds (or whatever interval you set) it reaches out to a list of HTTP endpoints, scrapes a page of plain-text metrics, and stores them as time series. It’s a pull model — Prometheus goes and fetches; the things being monitored don’t push anything. Each metric is a number with labels, like node_filesystem_avail_bytes{mountpoint="/"}, and Prometheus keeps the history.

Grafana is the eyes. It doesn’t store anything of consequence; it queries Prometheus (and dozens of other data sources) and draws the pretty dashboards. The split matters because you can restart Grafana, break a dashboard, or upgrade it recklessly without losing a single data point — your history lives safely in Prometheus.

This separation is the single most important thing to internalise, because it’s where the architecture’s resilience comes from. Grafana holds no state you can’t recreate: dashboards are JSON, data sources are config, and both can be backed up as files or, better, provisioned from disk so a fresh Grafana container comes up already configured. Prometheus holds the only thing that’s genuinely irreplaceable — the time-series history — and you can give that a volume and a backup routine and otherwise treat the rest of the stack as cattle. I’ve blown away and rebuilt the Grafana container on a whim more than once and lost nothing. Try that with a monolithic monitoring appliance and you’ll be restoring from backup for an afternoon.

The pull model deserves a word too, because it’s unusual and it’s the source of both Prometheus’s elegance and its occasional awkwardness. Most older monitoring systems push: the agent on each box sends data up to a central server. Prometheus inverts that: the server reaches out and scrapes. The upside is that scraping doubles as a health check. If Prometheus can’t scrape a target, it records that target as down (the built-in up metric drops to 0), and you get that signal for free. The downside is the firewall problem mentioned later, and the fact that short-lived jobs (a backup script that runs for ninety seconds) never live long enough to be scraped. For those, the Prometheus project offers the Pushgateway, a small buffer the job pushes to and Prometheus scrapes, which neatly sidesteps the one case the pull model handles badly.
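
As a rough sketch of how the Pushgateway slots into the pull model, the scrape job below assumes a Pushgateway container reachable as pushgateway:9091. The service name is my assumption; 9091 is the Pushgateway’s default port. Treat it as an illustration rather than a complete config.

```yaml
# prometheus.yml fragment: scrape the Pushgateway like any other target.
scrape_configs:
  - job_name: 'pushgateway'
    # Keep the job/instance labels that the batch job pushed, instead of
    # overwriting them with the Pushgateway's own target labels.
    honor_labels: true
    static_configs:
      - targets: ['pushgateway:9091']   # assumed service name
```

The short-lived job then pushes its final numbers over HTTP to a path of the form /metrics/job/<job_name> on the Pushgateway just before it exits, and Prometheus picks them up on its next scrape.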

The thing that produces the numbers

Prometheus scrapes endpoints, but something has to expose those endpoints. That something is an exporter. The one you’ll install first is node_exporter, which turns a Linux box’s CPU, memory, disk, network and load into Prometheus metrics. For Docker stats there’s cAdvisor; for a hundred other things there’s an exporter on GitHub maintained by someone who also couldn’t sleep at 3am.

Here’s a minimal stack that monitors the host it runs on:

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom_data:/prometheus
    ports:
      - "9090:9090"

  node-exporter:
    image: prom/node-exporter:latest
    pid: host
    volumes:
      - /:/host:ro,rslave
    command:
      - '--path.rootfs=/host'

  grafana:
    image: grafana/grafana:latest
    volumes:
      - graf_data:/var/lib/grafana
    ports:
      - "3000:3000"

volumes:
  prom_data:
  graf_data:

And the prometheus.yml that tells it what to scrape:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

That’s the whole thing. Bring it up, open Grafana on port 3000 (default login admin/admin, which it will prompt you to change), add Prometheus (http://prometheus:9090) as a data source, and you have monitoring. Note the pid: host line and the bind mount with --path.rootfs=/host on node-exporter: without them the exporter reports the container’s view of the world, not the host’s, and your disk graphs will be quietly, confusingly wrong. That small detail catches a lot of people.
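
The click-through data source step can also be provisioned from disk, as mentioned earlier. Here is a minimal sketch, assuming you mount a local ./grafana/provisioning directory (a hypothetical path of my choosing) at /etc/grafana/provisioning inside the grafana container:

```yaml
# ./grafana/provisioning/datasources/prometheus.yml  (hypothetical host path)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend makes the requests
    url: http://prometheus:9090   # Compose service name from the stack above
    isDefault: true
```

With an extra volume entry such as ./grafana/provisioning:/etc/grafana/provisioning:ro on the grafana service, a freshly created container should come up with the data source already in place.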

If you want to monitor more than the one box, you add more exporters and list them under scrape_configs. The mental model never changes: something exposes a /metrics page, Prometheus scrapes it on a timer, Grafana draws it. Once that clicks, the whole ecosystem stops being intimidating. This is the same lego-brick philosophy I lean on in “Docker Compose demystified: a full stack in a single file”: small services, each doing one job, wired together declaratively.
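
To make that concrete, here is what a slightly larger scrape_configs might look like. The second host’s address, the host label and the cadvisor service name are all placeholders I made up for illustration; 9100 and 8080 are the default ports for node_exporter and cAdvisor.

```yaml
# prometheus.yml: the same file as before, with more targets listed.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']    # the local exporter from the stack above
        labels:
          host: 'pi'                       # hypothetical label
      - targets: ['192.168.1.20:9100']     # placeholder address of a second machine
        labels:
          host: 'nas'

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']         # assumed Compose service name
```

Each new machine only needs to run node_exporter and be reachable from Prometheus on port 9100; nothing on the remote box needs to know where Prometheus lives.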

Dashboards you don’t have to build

The fear with Grafana is that you’ll spend a weekend dragging panels around. You won’t, because the community has already done it. Grafana’s dashboard library has thousands of pre-built dashboards you import by pasting a numeric ID. The famous Node Exporter Full dashboard (ID 1860) gives you a wall of beautifully laid-out graphs for everything node_exporter produces. Import it, point it at your data source, done. You can build your own later when you have opinions, but you’ll start with something that looks like a NASA control room for zero effort.

A word of warning on imported dashboards: they assume specific metric names and label conventions, and if your exporter version differs you’ll get panels reading “No data” while others work fine. That’s not a broken dashboard, it’s a naming drift — open the panel, read the query, and you’ll usually spot a renamed metric. The fix is a five-minute edit, and doing it once teaches you more about PromQL than any tutorial. Once you’ve outgrown the borrowed dashboards, build a single overview with the four numbers you actually check at a glance — CPU, memory, disk, and whatever your most important service exposes — and pin it as your home dashboard. That one screen is what turns monitoring from a thing you set up into a thing you use.

Alerts, because graphs you don’t look at are useless

A dashboard only helps if you happen to be staring at it. The real payoff is alerting: get told before the disk fills. Prometheus uses PromQL, a query language that’s genuinely pleasant once it clicks. A rule that fires when a root filesystem drops below 10% free looks like this:

groups:
  - name: disk
    rules:
      - alert: DiskAlmostFull
        expr: node_filesystem_avail_bytes{mountpoint="/"}
              / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Root filesystem under 10% free"

The for: 10m clause is the unsung hero here — it stops a momentary blip from waking you. Pair Prometheus with Alertmanager to route those alerts to email, a webhook, or your phone, and the 3am anxiety finally has somewhere to go that isn’t your own imagination.
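
Neither the rule file nor Alertmanager is wired into the minimal stack shown earlier, so here is a hedged sketch of the missing glue. It assumes the rule file is mounted under /etc/prometheus/rules/ and that an Alertmanager container is reachable as alertmanager:9093; the paths, the service name, the receiver name and the webhook URL are placeholders, not values from my setup.

```yaml
# --- prometheus.yml (additions) ---
rule_files:
  - /etc/prometheus/rules/*.yml            # assumed mount point for the rule file

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # assumed Compose service name
---
# --- alertmanager.yml ---
route:
  receiver: homelab            # hypothetical receiver name
  group_by: ['alertname']      # bundle alerts that share a name into one notification
  repeat_interval: 4h          # re-notify while an alert keeps firing
receivers:
  - name: homelab
    webhook_configs:
      - url: 'http://notifier.example/alert'   # placeholder URL
```

A single catch-all route is enough for a homelab; more elaborate routing trees only start to pay off once different alerts need to reach different people.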

A few PromQL patterns earn their keep in a homelab. rate(node_network_receive_bytes_total[5m]) turns an ever-climbing counter into a sensible bytes-per-second line — counters in Prometheus only go up, and rate() is how you make them readable. 100 - (avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) gives you CPU-busy percentage, which is non-obvious because node_exporter reports idle time, not busy time. And predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0 will warn you that a disk is on track to fill within four hours, which is a far more civilised thing to learn than that it already has.
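
Two of those patterns can be written down as rules so you don’t have to retype them. This is a sketch: the group name, the recording rule name, the alert name and the 30m hold time are my own illustrative choices.

```yaml
groups:
  - name: homelab-patterns          # hypothetical group name
    rules:
      # Precompute CPU-busy percentage per instance so dashboards can
      # query one short series name instead of the full expression.
      - record: instance:node_cpu_busy:percent
        expr: '100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'

      # Fire when the last 6h trend says the root filesystem will hit
      # zero bytes free within four hours.
      - alert: DiskWillFillSoon
        expr: 'predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0'
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Root filesystem projected to fill within 4 hours"
```

The expressions are single-quoted because they contain double quotes of their own, which keeps the YAML unambiguous.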

Troubleshooting: when the graphs lie or vanish

The failures here are nearly always the same handful, and they look more mysterious than they are.

Target shows as DOWN in Prometheus. Open the targets page first (http://localhost:9090/targets from the Docker host, given the port mapping in the compose file); it tells you exactly why. Nine times in ten it’s a name-resolution or networking issue: inside Docker Compose, services reach each other by service name (node-exporter:9100), not localhost, because each container has its own loopback. If you wrote localhost:9100 in prometheus.yml, Prometheus is knocking on port 9100 inside its own container, where nothing is listening.

Grafana shows “No data” but the target is up. This is almost always a time-range or data-source mismatch: the dashboard is querying a Prometheus that has only been running for two minutes while the panel asks for the last 24 hours, or you imported a dashboard whose data-source variable points at a source that doesn’t exist. Check the dashboard’s data-source dropdown at the top before you doubt your queries.

Disk usage in Grafana doesn’t match df. Back to that pid: host / rootfs bind mount. Without --path.rootfs=/host, node_exporter measures the container’s overlay filesystem. Fix the compose file, restart the exporter, and the numbers reconcile.

Prometheus is eating RAM or restarting. Each unique combination of labels is a separate time series, and high-cardinality labels (a label whose value is a user ID, a request path, a timestamp) explode the series count and the memory with it. If memory climbs without bound, find the offending metric with topk(10, count by (__name__)({__name__=~".+"})) and drop or relabel it. When even that isn’t enough, when you genuinely want years of history on modest hardware, that’s the moment to look at VictoriaMetrics, a near drop-in replacement with a far smaller footprint; I cover it in “VictoriaMetrics: When Prometheus Gets Too Hungry for Your Hardware”.
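
Dropping or relabelling happens at scrape time with metric_relabel_configs. The sketch below uses a made-up metric prefix and a made-up label name; substitute whatever the topk query actually turns up.

```yaml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
    metric_relabel_configs:
      # Drop whole metrics whose name matches the pattern.
      - source_labels: [__name__]
        regex: 'noisy_metric_.*'     # hypothetical metric prefix
        action: drop
      # Remove a single high-cardinality label from every scraped series.
      - regex: 'request_path'        # hypothetical label name
        action: labeldrop
```

Be a little careful with labeldrop: if removing the label leaves two series with identical label sets, they collide, so dropping the whole metric is often the safer first move.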

The honest costs

It’s not all free. Prometheus stores everything locally and is not built for years of retention — by default it keeps fifteen days, and long-term storage means bolting on extra machinery. PromQL has a learning curve; your first few queries will be cargo-culted from Stack Overflow. And the pull model means Prometheus needs network reach to everything it watches, which gets fiddly across firewalls and NAT.

For a homelab none of that bites hard. Fifteen days of history is plenty, the queries you need are mostly already written, and everything lives on one network anyway.
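
If fifteen days does turn out to be too short, retention is a start-up flag, not a prometheus.yml setting. A minimal sketch for the compose file follows; the 30d and 5GB values are arbitrary examples, and because setting command replaces the image’s default arguments, the config and storage paths are restated.

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      # Restated defaults, since `command` overrides the image's own arguments.
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      # Keep data for 30 days, or until the store reaches 5GB, whichever comes first.
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=5GB'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom_data:/prometheus
    ports:
      - "9090:9090"

volumes:
  prom_data:
```

The size cap is the more useful of the two on an SD card or a small SSD, since it bounds disk use no matter how many series you end up collecting.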

It is also fair to ask whether you need the whole stack at all. If you have one or two machines and just want a green tick that turns red when something falls over, this pairing is genuinely overkill: a single-binary tool will get you there with a tenth of the moving parts. I’ve written about exactly those lighter options in “Beszel: lightweight server monitoring without the Grafana overhead”, and they’re the right call for a lot of small setups. Grafana and Prometheus earn their keep when you want history and queries, when “the disk is full” matters less than “the disk has been filling at 2GB a day for a week and will be full on Thursday.”

The verdict

If you run more than one always-on machine and you’ve ever been surprised by an outage you could have seen coming, this stack is worth the afternoon. It’s the rare piece of “enterprise” software that genuinely scales down: light enough for a Pi, capable enough that you never need to replace it. I’ve run it for years, it sips resources, and it has turned more than one would-be 3am disaster into a graph I noticed at a civilised hour. That’s exactly what I want from monitoring.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

