Automated Chaos: Using Fault Injection to Build Resilience Before Your Users Notice

Break things on purpose so they don't break for real

28-04-2025 · 1908 words · 9 minutes

The first time a single dead Redis container took down my entire homelab dashboard stack, it was 11pm on a Sunday and I had no idea why. Grafana was throwing 502s, the reverse proxy was timing out, and the actual fault — one cache container that had OOM-killed itself — was three layers down from anything I was looking at. I spent forty minutes chasing the symptom before I found the cause. The annoying part is that I could have found that fragility on a Tuesday afternoon, in daylight, with a coffee, if I had simply killed that container on purpose and watched what happened.

That is the whole pitch for chaos engineering. You do not wait for a 3am page to discover that your “redundant” setup has a single point of failure baked into it. You go looking for the weakness deliberately, in a controlled window, while you are awake and paying attention. The faults you inject are the same ones reality will eventually inject for free — a process dies, a disk fills, a network link gets slow and lossy instead of cleanly failing — except you get to choose the timing.

1 Why inject faults at all

The instinct most people have is that their system is fine because it has not fallen over yet. That is survivorship bias wearing a hard hat. “Has not failed” and “will not fail” are very different claims, and the gap between them is exactly where outages live. Real distributed systems — even a modest homelab with a reverse proxy, a database, a cache and a handful of app containers — have failure modes nobody designed on purpose. A retry storm here, a missing timeout there, a health check that returns 200 while the thing behind it is wedged.

The point of fault injection is to surface those interactions while the cost is low. A killed container in a test window costs you nothing. The same container dying during the one evening you actually wanted to use the service costs you your evening. The discipline is sometimes dressed up in grand language, but at its root it is just this: stop assuming, start measuring, and prefer to learn about your weak points on your own schedule.

There is a hard prerequisite, though, and it is worth saying plainly before any of the fun stuff. If you cannot observe the system, you cannot do chaos engineering; you can only do chaos. You need to know what “healthy” looks like in numbers before you start breaking things, or you will inject a fault, see nothing obvious, and wrongly conclude that everything is fine. Get a baseline first. If you do not have monitoring in place, set up something lightweight before you read another line of this.

2 Start small: the kitchen-table experiment

You do not need a platform to begin. The cheapest possible chaos experiment is a single command run on one box, and Linux ships the tools already. The classic starting point is network impairment, because slow-and-lossy is far nastier than cleanly-down: a dead process gets its connections refused immediately, but a host that answers in 800ms with 10% packet loss will quietly exhaust your connection pools and time out your callers in ways that cascade.

tc with the netem queueing discipline does this in one line. To add 150ms of latency with realistic jitter to a single interface:

# add 150ms ± 30ms latency, normally distributed, to outbound traffic on eth0
sudo tc qdisc add dev eth0 root netem delay 150ms 30ms distribution normal

# watch your service's latency and error rate react, then clean up:
sudo tc qdisc del dev eth0 root

Packet loss is the other big one. Real loss is bursty, not uniform, so a correlation factor makes it more honest than a flat percentage:

# 10% loss with 25% correlation (losses tend to clump) on eth0
sudo tc qdisc add dev eth0 root netem loss 10% 25%
sudo tc qdisc del dev eth0 root   # always have the undo ready before you start

The rule I follow religiously: type the cleanup command before you type the fault command. netem rules persist until you remove them or reboot, and the failure mode of forgetting is “why is everything on this box slow for the next two days.” More than once I have left a delay rule on a test VM and rediscovered it weeks later while debugging something completely unrelated.
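One way to make that rule mechanical is to let the shell hold the undo for you. The wrapper below is a sketch, not a standard tool: inject_delay, cleanup, IFACE, HOLD_SECONDS and the TC override are names invented for illustration. It registers the cleanup as an exit trap before it adds the netem rule, so Ctrl-C or a failed command should still remove the delay.

```shell
#!/usr/bin/env bash
# Sketch only: IFACE, HOLD_SECONDS, TC and both function names are illustrative.
IFACE="${IFACE:-eth0}"
HOLD_SECONDS="${HOLD_SECONDS:-60}"
TC="${TC:-sudo tc}"   # set TC=echo to print the commands instead of running them

cleanup() {
  # Remove the root qdisc; ignore the error if it is already gone.
  $TC qdisc del dev "$IFACE" root 2>/dev/null || true
}

inject_delay() {
  # Register the undo first, then inject.
  trap cleanup EXIT
  trap 'exit 130' INT TERM   # exiting runs the EXIT trap, so Ctrl-C reverts too
  $TC qdisc add dev "$IFACE" root netem delay 150ms 30ms distribution normal
  sleep "$HOLD_SECONDS"
  # Normal path: revert explicitly, then drop the traps.
  cleanup
  trap - EXIT INT TERM
}
```

Run it from a script rather than pasting it into an interactive shell, because the exit trap belongs to whichever shell calls inject_delay. A dry run with TC=echo HOLD_SECONDS=0 prints the add and del commands without touching the interface.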

For resource exhaustion rather than network faults, stress-ng is the equivalent one-liner — spin up CPU load, eat memory, or hammer the disk and watch whether your alerts fire and your service degrades gracefully or just falls over:

# 4 workers spinning the CPU, 2 eating 256MB each, for 60 seconds
stress-ng --cpu 4 --vm 2 --vm-bytes 256M --timeout 60s

Run these against a non-critical service first, observe, document, undo. That loop — hypothesise, inject, measure, revert — is the entire methodology. Everything fancier is just automating it.
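That loop is small enough to write down as a function. This is a minimal sketch under stated assumptions: run_experiment and its argument order are made up for illustration, and each of the first three arguments is the name of a command or shell function you supply yourself.

```shell
#!/usr/bin/env bash
# Sketch only: run_experiment and its arguments are illustrative names.
# check, inject and revert are each the name of a command or shell function.
run_experiment() {
  local check="$1" inject="$2" revert="$3" hold_seconds="${4:-30}"
  local result=0

  # Hypothesise: refuse to start unless the system is healthy right now.
  if ! "$check"; then
    echo "steady state not met before injection; aborting" >&2
    return 2
  fi

  # Inject: if the fault fails to apply cleanly, revert immediately.
  if ! "$inject"; then
    "$revert"
    return 3
  fi

  # Measure: give the fault time to bite, then re-run the same check.
  sleep "$hold_seconds"
  "$check" || result=1

  # Revert: always runs, whatever the measurement said.
  "$revert"
  return "$result"
}
```

A call might look like run_experiment check_dashboard add_latency remove_latency 60, where those three function names are hypothetical. Exit status 0 means the hypothesis held under the fault, 1 means it did not, and 2 means the system was not healthy enough to start.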

3 Designing an experiment that actually teaches you something

A fault injection is only useful if it tests a hypothesis. “Let me kill a thing and see what happens” is a fine way to spend ten minutes, but you learn far more by writing down, before you touch anything, exactly what you expect. The shape I use is borrowed from the practitioners who formalised this: state the steady state, form a hypothesis, inject the fault, and compare reality to the prediction.

Concretely, for my dashboard stack, an experiment reads like this. Steady state: the dashboard responds in under 300ms and the error rate is below 0.1%, as measured by my monitoring over the last hour. Hypothesis: if I kill the cache container, the dashboard should keep serving — slower, because it now hits the database directly — but it should not return errors, because the app is supposed to treat the cache as optional. Inject: kill the cache. Observe: does the error rate stay flat, or does it spike?

The first time I ran exactly that experiment, the error rate did not stay flat. It spiked to 100%, because the app had been written to treat the cache as required, not optional — a connection failure to Redis threw an unhandled exception instead of falling through to the database. That is the entire value proposition in one sentence: a belief I held about my own system (“the cache is optional”) was false, and I found out on a Tuesday for free instead of at 11pm under duress. The fix was a few lines of fall-through logic and a sensible timeout. I would never have written those lines without the experiment, because I genuinely believed the code already did the right thing.

Write the hypothesis down. The experiments where you are surprised are the ones that pay for the whole practice, and you cannot be surprised if you never said what you expected.
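Writing it down works best when the steady state is something a script can check. The sketch below turns the numbers from the dashboard example (under 300ms, error rate below 0.1%) into a pass or fail. The function names, the choice of mean latency, and the decision to count only 5xx responses and failed connections as errors are my assumptions, not the article's monitoring setup.

```shell
#!/usr/bin/env bash
# Sketch only: sample, check_steady_state and their defaults are illustrative.

# Print one "http_code seconds" line per request.
sample() {
  local url="$1" n="${2:-50}" i
  for ((i = 0; i < n; i++)); do
    # curl reports code 000 when no HTTP response arrived at all.
    curl -s -o /dev/null --max-time 5 -w '%{http_code} %{time_total}\n' "$url" || true
    sleep 0.2
  done
}

# Read "code seconds" lines on stdin; exit 0 only if the steady state holds.
check_steady_state() {
  local max_ms="${1:-300}" max_err_pct="${2:-0.1}"
  awk -v max_ms="$max_ms" -v max_err="$max_err_pct" '
    # Count 5xx responses and failed connections (000) as errors.
    { n++; total += $2; if ($1 < 200 || $1 >= 500) errors++ }
    END {
      if (n == 0) { print "no samples"; exit 2 }
      mean_ms = total / n * 1000
      err_pct = errors / n * 100
      printf "requests=%d mean_ms=%.0f error_pct=%.1f\n", n, mean_ms, err_pct
      exit !(mean_ms < max_ms && err_pct < max_err)
    }'
}
```

Usage would be something like sample http://dashboard.example/ 50 | check_steady_state 300 0.1, with the URL standing in for your own. With only 50 samples a single error already breaks a 0.1% budget, so for a threshold that tight a longer window from your monitoring is the better source; this is for quick before-and-after comparisons.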

4 Levelling up: Chaos Mesh on Kubernetes

Once you are running things on Kubernetes, hand-rolling tc inside pods gets old fast, and this is where a proper tool earns its keep. Chaos Mesh is a CNCF project that injects faults as Kubernetes custom resources, which means your experiments are declarative YAML you can version-control alongside everything else — the same instinct that makes GitOps-free automated deployments pleasant applies here.

The simplest useful experiment is killing a pod and watching the deployment recover. This PodChaos resource kills one pod matching a label, once:

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-api-pod
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: my-api

Apply it with kubectl apply -f kill-one-api-pod.yaml, then watch whether the ReplicaSet brings a replacement up before any request actually fails. If requests fail, your readiness probe or your client retry logic has a gap.
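To see whether any request actually fails during the kill, something has to be sending requests. A rough probe like the one below is enough for a first pass; probe_failures, CURL and INTERVAL are illustrative names, and the URL you pass is whatever endpoint fronts your service.

```shell
#!/usr/bin/env bash
# Sketch only: probe_failures, CURL and INTERVAL are illustrative names.
CURL="${CURL:-curl}"
INTERVAL="${INTERVAL:-1}"   # seconds between probes

# Probe URL COUNT times; log each failure to stderr and print the failure total.
probe_failures() {
  local url="$1" count="${2:-60}" failures=0 code i
  for ((i = 0; i < count; i++)); do
    # -w prints the status code; curl reports 000 when no response arrived.
    code="$($CURL -s -o /dev/null --max-time 2 -w '%{http_code}' "$url" || true)"
    case "$code" in
      2??|3??) ;;
      *)
        failures=$((failures + 1))
        echo "$(date +%T) failed request: status ${code:-none}" >&2
        ;;
    esac
    sleep "$INTERVAL"
  done
  echo "$failures"
}
```

Start the probe in one terminal, apply the PodChaos in another, and read the total at the end. A result of 0 suggests the replacement came up in time; the timestamps on stderr show how long the gap lasted if it did not. One probe per second can miss a short blip, so treat a clean run as encouraging rather than conclusive.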

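Network faults use the same declarative form. This NetworkChaos adds latency to traffic leaving the my-api pods for five minutes, then removes it automatically; the auto-revert is the killer feature over raw tc. As written it delays everything those pods send, and mode: all means every matching pod is affected. To slow only the link to the database, add a direction and a target selector that matches the database pods.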

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: slow-db-link
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: my-api
  delay:
    latency: "200ms"
    correlation: "50"
    jitter: "40ms"
  duration: "5m"

That duration field is the safety net. Even if you walk away, the fault lifts itself. Chaos Mesh also supports StressChaos (CPU/memory), IOChaos (disk latency and errors) and TimeChaos (clock skew, which exposes a surprising number of token-expiry bugs), and you can chain them into Workflows for multi-stage scenarios.
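As a rough Kubernetes counterpart to the stress-ng one-liner from earlier, a StressChaos can be applied straight from a heredoc. Treat this as a sketch: the resource name and sizes are made up, the selector simply reuses the labels from the examples above, and the field names follow the chaos-mesh.org/v1alpha1 schema as I understand it, so check them against the version you have installed.

```shell
#!/usr/bin/env bash
# Sketch only: apply_memory_stress, the resource name and the sizes are
# illustrative; verify the StressChaos fields against your Chaos Mesh version.
KUBECTL="${KUBECTL:-kubectl}"

apply_memory_stress() {
  $KUBECTL apply -f - <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: squeeze-one-api-pod
  namespace: chaos-testing
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: my-api
  stressors:
    memory:
      workers: 2
      size: "256MB"
  duration: "60s"
EOF
}
```

As far as I can tell from the documentation, size is the total to occupy across the workers rather than a per-worker figure like stress-ng's --vm-bytes, which is worth confirming before you size an experiment close to a pod's memory limit.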

5 What goes wrong, and how to not ruin your weekend

Chaos engineering has its own failure modes, and they are predictable enough to list.

The fault never reverts. This is the big one with raw tc and stress-ng. Always pair injection with a timeout or a pre-written cleanup. With Chaos Mesh, set duration on every experiment and never rely on remembering to delete the resource.

You inject blind because monitoring is not ready. If you kill a pod and your dashboard shows nothing, you have not proven resilience — you have proven your monitoring has a blind spot. Fix observability first, then break things.

The blast radius is bigger than you thought. mode: all in Chaos Mesh hits every matching pod, not one, and a label selector that is broader than you intended can take out a whole tier. Start with mode: one, aim your first experiments at a dedicated test namespace rather than production, keep label selectors as narrow as you can, and read the selector twice before applying.

You experiment during peak hours by accident. Pick a genuinely quiet window for the first runs. “Low usage” for a homelab might be the middle of a workday; for a home media server it is the opposite. Know your own traffic shape.

A health check lies. The most common discovery is that a service reports healthy while being functionally dead — the HTTP 200 comes from a thin handler that does not actually touch the database. When chaos reveals this, the fix is a deeper readiness probe, not a quieter alert.
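For the blast-radius problem in particular, reading the selector twice can be backed by asking the cluster what it matches before anything is applied. The helper below is a sketch with invented names (preview_targets and the KUBECTL override); it lists the pods a namespace and label pair would select and fails if there are more than you said you expected.

```shell
#!/usr/bin/env bash
# Sketch only: preview_targets and KUBECTL are illustrative names.
KUBECTL="${KUBECTL:-kubectl}"

# List the pods a namespace + label selector matches; fail if more than MAX.
preview_targets() {
  local ns="$1" labels="$2" max="${3:-3}" pods count
  pods="$($KUBECTL get pods -n "$ns" -l "$labels" -o name)"
  # grep -c exits non-zero on zero matches, so swallow that status.
  count="$(printf '%s\n' "$pods" | grep -c . || true)"
  if [ -n "$pods" ]; then
    printf '%s\n' "$pods"
  fi
  echo "selector '$labels' in namespace '$ns' matches $count pod(s)" >&2
  [ "$count" -le "$max" ]
}
```

For the experiments above that would be preview_targets production app=my-api 3, where app=my-api is the kubectl form of the labelSelectors entry. If the count surprises you, the experiment would have too.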

6 Is it worth it, and who is this for

For a single static website on one box, no — chaos engineering is solving a problem you do not have. The honest cutoff is roughly: if your system has more than one moving part that the others depend on, and you would be genuinely annoyed if it went down unexpectedly, the exercise pays for itself the first time it finds a single point of failure you did not know about.

For a homelabber running a real stack — a reverse proxy, a database, a cache, several interdependent services — it is absolutely worth it, and you can start tonight with tc and stress-ng for the cost of a free afternoon. You do not need Chaos Mesh, Gremlin, or any platform to begin; you need one non-critical service, a baseline from your monitoring, and the discipline to write the undo command first. The platforms are worth adopting once you are on Kubernetes and want repeatable, version-controlled, auto-reverting experiments rather than artisanal one-liners.

The mindset is the real deliverable. Once you have deliberately killed a few things and watched your system either shrug or fall over, you stop trusting “it has not failed yet” as evidence of anything. You build with timeouts, retries with backoff, and deeper health checks because you have seen what happens without them. Your users never know you spent Tuesday afternoon breaking things — they just notice the Sunday-night outage that never comes.
