Loki: Log Aggregation for People Who Can't Afford Splunk

Grep-able logs from every box, indexed by labels instead of by every word

By Smarc

23-01-2024 1841 words 9 min read

There are two kinds of homelabber: the ones who SSH into each box and run journalctl when something breaks, and the ones who got tired of doing that around the fourth machine. I crossed that line a while ago, and I remember the exact incident that pushed me over it — a service falling over at 11pm, and me hopping between four terminals trying to work out which of them held the log line that mattered. Once you have a handful of hosts and a stack of Docker containers, “which log, on which box, from which container?” becomes a small archaeological dig every single time, invariably conducted in a hurry while something is actively on fire.

The grown-up answer to this is log aggregation: ship every log line to one central place, and search them all at once from a single bar. The grown-up price for that has historically been Splunk, which is genuinely superb and costs roughly the GDP of a small island nation once your volume gets serious. Loki, from the Grafana people, is the answer for the rest of us — and the trick that makes it cheap is worth understanding before you install it, because it also explains the one way you can badly misconfigure it.

Loki’s clever, slightly weird idea

The reason traditional log systems are expensive is that they index everything. Every word in every line gets put into a full-text index so you can search for it later, and that index is enormous — often bigger than the logs themselves. It’s powerful and it’s why your wallet hurts.

Loki does something deliberately different and a bit cheeky: it doesn’t index the log content at all. It only indexes a small set of labels (which host, which container, which job), exactly the way Prometheus indexes metrics. The actual log text is just compressed into chunks and dumped into object storage, or onto local disk in a small single-box setup. When you search, you first narrow down by labels to a small set of streams, then Loki brute-force greps through only those chunks.

The trade-off is explicit and worth stating plainly, because it dictates everything about how you should use the tool: you get cheap storage and cheap ingestion, in exchange for searches that are fast if you narrow by label first and slow if you ask it to grep blindly across everything you’ve ever ingested. Splunk indexes the world so any query is fast; Loki indexes almost nothing so ingestion is cheap, and pushes the cost onto the queries you write carelessly. For a homelab, where “everything” is gigabytes rather than terabytes and you usually know roughly which service you’re chasing, this is a brilliant bargain. For a security team doing forensic full-text hunts across an entire estate with no starting point, it is the wrong tool, and no amount of tuning will change that. Match the tool to the shape of your searches.
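To make the trade-off concrete, here is a toy model in Python. It is a sketch of the idea only, not how Loki is implemented: the class name, the sample hosts and the log lines are all invented for illustration. Labels select whole streams cheaply, and the text search then scans only the lines in the streams that survived.

```python
from collections import defaultdict


class ToyLogStore:
    """Toy model of label-indexed log storage (illustrative, not Loki's code)."""

    def __init__(self):
        # The "index": one entry per unique label set, i.e. one per stream.
        self.streams = defaultdict(list)

    def push(self, labels, line):
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector, needle=""):
        """Return (matching lines, number of lines scanned)."""
        wanted = set(selector.items())
        hits, scanned = [], 0
        for key, lines in self.streams.items():
            if not wanted <= key:
                continue  # label mismatch: stream skipped without reading a line
            for line in lines:
                scanned += 1  # the brute-force grep step
                if needle in line:
                    hits.append(line)
        return hits, scanned


store = ToyLogStore()
store.push({"host": "nas", "container": "caddy"}, "GET /index.html 200")
store.push({"host": "nas", "container": "caddy"}, "error: upstream timed out")
store.push({"host": "pi", "container": "postgres"}, "error: connection reset")
store.push({"host": "pi", "container": "postgres"}, "checkpoint complete")

narrow_hits, narrow_scanned = store.query({"container": "caddy"}, "error")
blind_hits, blind_scanned = store.query({}, "error")
print(narrow_hits, narrow_scanned)
print(blind_hits, blind_scanned)
```

The narrowed query reads two lines; the blind one reads all four. Scale those numbers up to months of logs and you have the whole performance model.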

The three pieces

A Loki setup has three moving parts, and it helps to name them:

Loki itself — the server that stores chunks and answers queries.
Promtail (or, increasingly, the Grafana Alloy agent) — the thing that runs on each host, tails log files, attaches labels, and ships lines to Loki.
Grafana — the same Grafana you already run for metrics, which gets a “Logs” view and a query language called LogQL.

That last point is the killer feature: logs and metrics live in the same Grafana, so you can spot a spike on a graph and pivot straight to the log lines from that exact minute. No context switch, no second tool, no re-deriving the timestamp by hand. If you already run Grafana for metrics, Loki slots in beside your existing dashboards rather than becoming a whole new thing to maintain. And if you also do image or media work in the same lab, the “one dashboard for everything” instinct is the same one that makes tools like ComfyUI’s node-based control satisfying — you can see the whole pipeline in one place instead of guessing.

A minimal stack

Loki ships sensible single-binary defaults now, so you don’t need to understand its internal microservices. Here’s a compose file that runs Loki and a Promtail that scrapes Docker container logs:

services:
  loki:
    image: grafana/loki:latest
    command: -config.file=/etc/loki/config.yml
    volumes:
      - ./loki-config.yml:/etc/loki/config.yml
      - loki_data:/loki
    ports:
      - "3100:3100"

  promtail:
    image: grafana/promtail:latest
    command: -config.file=/etc/promtail/config.yml
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro

volumes:
  loki_data:

A word on the agent: Promtail was the traditional shipper and is what most existing guides show, but Grafana has moved development to Alloy, its unified collector, and Promtail is now in long-term-support-only mode. For a new build I’d reach for Alloy; the configuration concepts below are identical, only the file syntax differs. Whichever you pick, the config is where you decide what labels exist — and this is the part that actually matters, because your labels are your whole search index:

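# Where to ship lines: the loki service from the compose file
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: docker
    # Discover containers through the Docker socket mounted in the compose file
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      # Docker reports names with a leading slash ("/caddy"); the regex strips it
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container'
      # stdout or stderr
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: 'stream'
      # Compose project the container belongs to
      - source_labels: ['__meta_docker_container_label_com_docker_compose_project']
        target_label: 'project'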

Those three labels — container, stream, project — are a sensible starting set for a Docker host: enough to slice by, few enough not to explode. Notice what’s not here: no request IDs, no user IDs, no timestamps. That restraint is the whole game, and I’ll come back to why in a moment.

LogQL, and the cardinality trap

In Grafana you add Loki as a data source and start querying. LogQL looks pleasantly like a hybrid of grep and PromQL. A query to find server errors from a specific container, with the time range (say, the last hour) set in Grafana’s time picker rather than in the query:

{container="caddy"} |= "error" | json | status >= 500

That {container="caddy"} part picks the stream by label — the cheap, indexed step; |= "error" greps the resulting lines — the brute-force step, now operating on a tiny slice rather than the whole dataset; and the rest parses the JSON payload and filters on a field. The ordering is the performance model made visible: narrow hard with labels, then grep the remainder. Get in the habit of always leading with a tight label selector and Loki stays fast; lead with a bare |= "error" across everything and you’ll wait. You can even turn logs into metrics on the fly — count_over_time({container="caddy"} |= "error" [5m]) graphs an error rate straight from raw log lines — which is genuinely magic the first time you do it, and quietly removes the need to instrument a metric for every single thing you might one day want to count.
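The same LogQL works outside Grafana too, because Grafana is only calling Loki's HTTP API. As a rough sketch, this builds a request URL for the query_range endpoint without sending anything. The base URL assumes the port published in the compose file above, and the helper name and the timestamps are made up for the example.

```python
from urllib.parse import urlencode

# Assumption: Loki is reachable on the port published by the compose file.
LOKI_URL = "http://localhost:3100"


def query_range_url(logql, start_s, end_s, limit=100):
    """Build a query_range URL; start and end are Unix seconds."""
    params = {
        "query": logql,
        # Loki accepts start/end as nanosecond Unix epoch values.
        "start": start_s * 1_000_000_000,
        "end": end_s * 1_000_000_000,
        "limit": limit,
    }
    return f"{LOKI_URL}/loki/api/v1/query_range?{urlencode(params)}"


# One hour of caddy errors; the label selector still leads the query.
url = query_range_url('{container="caddy"} |= "error"', 1_700_000_000, 1_700_003_600)
print(url)
```

Paste the printed URL into curl or a browser and you get the matching lines back as JSON, which is handy for scripts and cron jobs.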

The one rule you must internalise, the rule that separates a happy Loki from a miserable one: never put high-cardinality values in labels. A label is high-cardinality when it can take many distinct values — a user ID, a request ID, a session token, a raw timestamp, an IP address. Loki creates a separate stream for every unique combination of label values, and putting a request ID in a label means one stream per request. That number climbs into the millions and brings the whole system to its knees. It is, without competition, the single most common way people wreck their Loki installation, and the failure is nasty because it works fine in testing and falls over under real traffic. Labels are for the handful of low-cardinality dimensions you slice by — host, container, environment, level. Everything else stays in the log line, where LogQL’s grep can still find it perfectly well; you just filter with |= instead of a label selector.
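The arithmetic is easy to see in a few lines of Python. This is a sketch with invented sample data (three containers, a thousand requests each), counting how many unique label combinations, and therefore streams, the same events produce with and without a request ID in the label set.

```python
def stream_count(events, label_names):
    """Count unique label-value combinations, i.e. the number of streams."""
    return len({tuple(event[name] for name in label_names) for event in events})


# Hypothetical traffic: three containers, a thousand requests each.
containers = ["caddy", "postgres", "grafana"]
events = [
    {"container": c, "stream": "stdout", "request_id": f"{c}-req-{i}"}
    for c in containers
    for i in range(1000)
]

low = stream_count(events, ["container", "stream"])
high = stream_count(events, ["container", "stream", "request_id"])
print(low)   # 3
print(high)  # 3000
```

Three streams become three thousand after a few minutes of pretend traffic, and the count keeps growing with every new request.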

Troubleshooting: the things that actually go wrong

A short field guide to the failures you’ll hit, roughly in order of how often I’ve hit them:

Queries are slow and Loki eats RAM. You have a cardinality problem. Run logcli series '{}' or check the loki_ingester_memory_streams metric; if your stream count is in the tens of thousands for a homelab, a label is carrying something it shouldn’t. Find it and move that value out of the labels into the line body.
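“too many outstanding requests” / “maximum active stream limit exceeded”. The second is the same root cause, now hitting a limit rather than just being slow: Loki is protecting itself from your labels, so fix the labelling rather than raising the limit and papering over it. The first means the query queue is full, usually because a wide query has fanned out into more sub-queries than Loki will hold at once; a bloated stream count makes that worse, so check your labels and tighten the selector and time range before you reach for bigger limits.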
No logs appear at all. Nine times out of ten the agent can’t read the socket or the log path. On a Docker setup, confirm the container has /var/run/docker.sock mounted read-only and the agent has permission. docker logs <agent-container> usually says exactly what it can’t reach.
Timestamps are wrong / logs land in the future. Clock skew between hosts, or a parsing stage grabbing the wrong field as the timestamp. Loki rejects lines too far out of order by default. Sort your host clocks with NTP first.
Old logs won’t delete / disk fills up. Retention is off unless you enable it. Set retention_enabled: true in the compactor block and a retention_period under limits_config, then confirm the compactor is actually running, or your “cheap” object storage grows forever.
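For the cardinality case at the top of that list, finding the guilty label is mostly counting. The sketch below takes a list of label sets shaped like the "data" array that Loki's /loki/api/v1/series endpoint returns and ranks label names by how many distinct values each carries. The sample data and the function name are invented for illustration.

```python
from collections import defaultdict


def label_cardinality(series):
    """Return (label name, distinct value count) pairs, highest count first."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    counts = {name: len(vals) for name, vals in values.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample, shaped like the "data" list of a series response.
sample = [
    {"container": "caddy", "stream": "stdout", "request_id": "a1"},
    {"container": "caddy", "stream": "stdout", "request_id": "b2"},
    {"container": "caddy", "stream": "stderr", "request_id": "c3"},
    {"container": "postgres", "stream": "stdout", "request_id": "d4"},
]

ranking = label_cardinality(sample)
for name, count in ranking:
    print(name, count)
```

Whatever lands at the top with a count close to your total stream count is the label to move into the line body.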

Where it falls short

Loki is not Splunk, and it is refreshingly honest about that. If your daily job is genuine needle-in-a-haystack full-text search across terabytes with no idea which service produced the line, Loki’s brute-force-within-a-stream model will feel slow, and you’d honestly be better served by something with a full inverted index. LogQL, while it improves with every release, still has rough edges compared to a mature search language. And early Loki setups earned a reputation for fiddly configuration that scared a lot of people off — that’s much better now with the single-binary defaults, but the folklore lingers and colours people’s expectations.

The verdict

For a self-hoster who already runs Grafana and Prometheus, Loki is close to a no-brainer, and I’d point most people straight at it. It’s cheap to run, it puts every log from every box behind one search bar, and it lives inside the dashboard you already have open all day. The one discipline it demands — keep your labels low-cardinality — is the very same discipline that keeps Prometheus healthy, so if you’ve run metrics for any length of time you already have the right instincts baked in. That overlap is not a coincidence; Loki was deliberately built to feel like Prometheus for logs, and it’s the same instinct behind keeping any self-hosted stack’s running costs honest rather than letting storage quietly balloon.

Who is it not for? If you have exactly one server, journalctl -f is still fine and Loki is overkill — come back when you hit machine number three or four and the archaeology starts. But for anyone past that point, I added Loki to my stack expecting a weekend of pain and had a working “search all my logs” box by lunchtime. The first time an outage took two minutes to diagnose instead of twenty, it had already paid for itself, and it has kept paying since.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #self-hosting #monitoring #loki #logging
