Why Your Kubernetes Cluster Crashes at 2 a.m. and How to Stop It

Diagnosing and fixing common stability pitfalls

The first time my home cluster fell over at 02:14, I assumed it was a fluke. The second time, three nights later, at almost exactly the same hour, I stopped believing in coincidence and started believing in cron. Because that is what a 2 a.m. crash almost always is: not random cosmic bad luck, but a scheduled job — a backup, a log rotation, a certificate renewal, a docker image prune you set up and forgot — colliding with a cluster that has no headroom to absorb it. The outage feels mysterious because you are asleep and the trigger fires on a timer you no longer remember writing.

I have debugged this pattern on a three-node k3s cluster in my rack more times than I would like to admit, and the good news is that once you understand why clusters die at night specifically, the fixes are boring, cheap, and permanent. This is not a story about exotic failures. It is a story about resource exhaustion, badly tuned probes, and the difference between a cluster that degrades gracefully and one that falls off a cliff.

Daytime clusters are noisy but self-correcting. You are awake, you notice a slow pod, you restart something, traffic is spread across the day. The small hours are different for three reasons that stack on top of each other.

First, that is when the batch jobs run. Almost every default schedule — logrotate, apt unattended-upgrades, database backups, restic snapshots, container image garbage collection — is pinned to the quiet hours precisely because nobody is meant to be watching. So the exact moment your cluster gets a sudden spike in disk I/O and memory pressure is the moment there is no human around to catch the early warning.

Second, memory that leaks slowly all day crosses its threshold at night. A pod that grows 50 MB an hour looks fine at 6 p.m. and is 400 MB heavier by 2 a.m. The crash time is not when the bug happened; it is when the slow accumulation finally hit the ceiling.
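
To make that arithmetic concrete, here is a minimal sketch. Only the 50 MB/hour rate comes from the example above; the 112 MB starting footprint and the 512 MB ceiling are made-up numbers, and hours_until_limit is a hypothetical helper rather than part of any tool.

```shell
# hours_until_limit START_MB LIMIT_MB RATE_MB_PER_HOUR
# Whole hours before a linearly growing process reaches its ceiling.
# Integer arithmetic only; a rough estimate, not a forecast.
hours_until_limit() {
  echo $(( ($2 - $1) / $3 ))
}

# Hypothetical pod: 112 MB at 18:00, 512 MB ceiling, leaking 50 MB/hour.
hours_until_limit 112 512 50
```

With those assumed numbers the answer is 8 hours, which puts the ceiling at 02:00 for a pod that looked healthy at 18:00.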

Third, the kubelet and the container runtime are far less forgiving of a full disk or exhausted memory than most people expect. When a node’s root filesystem crosses the eviction threshold, the kubelet starts evicting pods to reclaim space. If the thing filling the disk is container logs from the pods themselves, you get a feedback loop: pods get evicted, get rescheduled, and immediately start logging again; the disk stays full, and more evictions follow. That is the classic nightly death spiral, and it is entirely preventable.

The single biggest mistake is to reboot the node and move on. You have just destroyed the evidence and guaranteed a repeat performance. Before anything else, find out what the kubelet believed was wrong. Node conditions are the fastest tell:

kubectl get nodes -o wide
kubectl describe node node-01 | grep -A15 Conditions

You are looking for DiskPressure, MemoryPressure, or PIDPressure set to True. Then read the events in chronological order — this is the flight recorder:

# cluster-wide events, oldest first
kubectl get events -A --sort-by=.lastTimestamp | tail -40

# evicted pods are left in phase Failed (OOMKilled containers usually restart in place and will not show up here)
kubectl get pods -A --field-selector=status.phase=Failed

The words that matter are Evicted, OOMKilled, and BackOff. An OOMKilled container hit its memory limit (or the node ran out entirely). An Evicted pod was thrown off a node under pressure. CrashLoopBackOff means the container keeps dying and Kubernetes keeps restarting it with exponential delay — which, if it happens at 2 a.m., is usually a probe problem, covered below.
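
An OOMKilled container is normally restarted in place, so its pod tends to stay in phase Running and a Failed-phase selector will not list it. One way to surface those, sketched under assumptions: list each pod's last termination reason and filter for OOM kills. The helper name oom_killed_pods is hypothetical; the custom-columns path uses the standard pod status fields.

```shell
# Print "namespace pod reason" rows whose last container termination was an
# OOM kill. Reads a three-column listing on stdin, so it can be tried on
# saved output as well as on a live cluster.
oom_killed_pods() {
  awk '$3 ~ /OOMKilled/ { print $1, $2, $3 }'
}

# Assumed usage against a live cluster (requires kubectl access):
# kubectl get pods -A --no-headers \
#   -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,LAST:.status.containerStatuses[*].lastState.terminated.reason' \
#   | oom_killed_pods
```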

If the node is so wedged you cannot reach the API, drop to the host and ask systemd directly:

journalctl -u k3s -u k3s-agent -u kubelet --since "02:00" --until "02:30"   # agent nodes log under k3s-agent
df -h /var/lib   # is the disk actually full?

That thirty-minute window around the crash time is where the truth lives. I keep a note of the exact minute each outage started precisely so I can scope journalctl tightly instead of scrolling through gigabytes.

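A disk filled by container logs is the most common nightly killer I see, and the root cause is usually one line long: nothing is capping those logs tightly enough for the disk they live on. The kubelet's default rotation (10Mi per file, five files per container) applies per container rather than per node, and older Docker-based nodes using the json-file driver do not rotate at all unless configured to. Left alone, the k3s log directories under /var/log/pods and /var/lib/rancher/k3s can outgrow a small disk, and a single chatty pod in a restart loop can write gigabytes overnight.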

The fix is to cap log size at the runtime level so no pod can ever fill the disk, no matter how much it screams. For k3s, set the kubelet log limits:

# /etc/rancher/k3s/config.yaml
kubelet-arg:
  - "container-log-max-size=10Mi"
  - "container-log-max-files=3"

That caps every container at roughly 30 MiB of logs (3 files of 10 MiB each), rotated. Restart k3s and the ceiling is in place. On a full kubeadm cluster the equivalent lives in the kubelet configuration (containerLogMaxSize and containerLogMaxFiles) or, on Docker-based nodes, in the runtime’s daemon.json. Either way, the principle is the same: a runaway log stream should cost you a truncated log file, not a dead node.
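
It helps to turn the per-container cap into a per-node number before trusting it. A rough sketch, where worst_case_log_mib is a hypothetical helper and the count of 40 containers is an assumption about a small node:

```shell
# worst_case_log_mib MAX_SIZE_MIB MAX_FILES CONTAINERS
# Approximate upper bound on disk used by rotated container logs on one node.
worst_case_log_mib() {
  echo $(( $1 * $2 * $3 ))
}

# 10 MiB per file, 3 files, and a hypothetical 40 containers on the node.
worst_case_log_mib 10 3 40
```

That works out to 1200 MiB in the worst case. Rotation happens on a periodic check, so a very chatty container can overshoot a little between checks; leave some slack rather than sizing the disk to the exact figure.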

Here is the uncomfortable truth about most home and small-team clusters: the pods have no memory limits at all. Without a limit, a single leaking pod can consume all the RAM on a node, at which point the Linux OOM killer starts shooting processes — and it does not politely target your buggy app. It may kill the kubelet itself, taking the whole node offline.

Set requests and limits on everything that matters. Requests tell the scheduler how much to reserve; limits are the hard ceiling that contains a leak:

resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "256Mi"
    # note: deliberately no CPU limit — see below

Two hard-won opinions. First, always set a memory limit, because an unbounded memory leak takes down a node while an OOMKilled pod merely restarts; a contained failure beats a spreading one every time. Second, I usually omit the CPU limit and set only a CPU request. CPU is compressible: a pod that cannot get all the CPU it wants just slows down; it does not crash. A too-aggressive CPU limit throttles a healthy pod into missing its liveness probe, which restarts it, which is a self-inflicted 2 a.m. outage. If you take one thing from this post: limit memory, request CPU.

To stop one namespace from starving the rest, add a ResourceQuota and a LimitRange so anything deployed without limits inherits a sane default instead of running unbounded.
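
A minimal sketch of that pair, written as a shell function that prints the two manifests so they can be reviewed before being applied. The function name, the object names, the default namespace "apps", and every number are placeholders chosen for illustration. In line with the opinion above, the LimitRange supplies a default memory limit and default requests but no default CPU limit.

```shell
# Emit a LimitRange and a ResourceQuota for one namespace.
# The namespace name and every number are placeholders to adjust.
print_namespace_guardrails() {
  ns=${1:-apps}
  cat <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: ${ns}
spec:
  limits:
    - type: Container
      defaultRequest:      # applied when a container sets no requests
        memory: 128Mi
        cpu: 100m
      default:             # applied when a container sets no limits
        memory: 256Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-budget
  namespace: ${ns}
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.memory: 4Gi
EOF
}

# Assumed usage: print_namespace_guardrails apps | kubectl apply -f -
```

The two objects work together: once a quota covers limits.memory, pods without a memory limit are rejected in that namespace, and the LimitRange default is what keeps ordinary deployments from hitting that rejection.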

The second great nightly killer is a liveness probe that is too impatient. A liveness probe that fails restarts the container. If your app pauses for eight seconds during a nightly backup or a garbage-collection sweep, and your probe gives it three, Kubernetes helpfully restarts a perfectly healthy application — mid-backup — and now you have both a crash and a corrupted job.

The rule I follow: liveness probes should answer one question only — is this process wedged beyond recovery? — and should be generous. Use readiness probes, which pull a pod out of the load balancer without restarting it, for transient slowness. And use startupProbe for anything slow to boot, so the liveness clock does not even start ticking until the app is up.

startupProbe:
  httpGet: { path: /healthz, port: 8080 }
  failureThreshold: 30      # up to 30 × 5s = 150s to boot
  periodSeconds: 5
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3       # only restart after ~45s of hard failure
readinessProbe:
  httpGet: { path: /ready, port: 8080 }
  periodSeconds: 5
  failureThreshold: 2       # drop from LB fast, but don't restart

The distinction between “slow” (readiness) and “dead” (liveness) is the whole game. Most CrashLoopBackOff mysteries dissolve the moment you widen the liveness thresholds and move impatience over to the readiness probe where it belongs.
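
The thresholds in the probe block reduce to simple multiplication, which is worth doing before deploying. A rough sketch, with probe_budget_seconds as a hypothetical helper; it ignores timeoutSeconds and any initial delay, so the results are approximate.

```shell
# probe_budget_seconds PERIOD_SECONDS FAILURE_THRESHOLD
# Approximate time a container can keep failing before the probe acts.
# Ignores timeoutSeconds and initialDelaySeconds, so treat it as a rough figure.
probe_budget_seconds() {
  echo $(( $1 * $2 ))
}

probe_budget_seconds 5 30   # startupProbe: time allowed to boot
probe_budget_seconds 15 3   # livenessProbe: before a restart
probe_budget_seconds 5 2    # readinessProbe: before leaving the load balancer
```

The three calls print 150, 45, and 10. An eight-second pause during a backup sits comfortably inside the 45-second liveness budget, while the 10-second readiness budget still takes a struggling pod out of rotation quickly.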

When the pager (or, in my case, the Uptime Kuma push notification) fires, work this list in order rather than flailing:

  1. Is it the disk? df -h on every node. A full /var explains most evictions instantly.
  2. What got killed and why? kubectl get events -A --sort-by=.lastTimestamp | grep -Ei 'evict|oom|backoff'.
  3. Which resource is pressured? kubectl describe node <n> | grep -A15 Conditions.
  4. Did something scheduled fire? Check for a CronJob or host cron in the crash window: kubectl get cronjobs -A and journalctl --since "02:00".
  5. Is the control plane itself healthy? On a single-server k3s the answer to “why did everything vanish” is sometimes “the one control-plane node ran out of memory”. A single control-plane node is a single point of failure, and 2 a.m. is when it exercises that fact.
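
The checklist above can be sketched as a small script so the order is not left to a half-awake memory. Both function names are hypothetical, the 80% disk threshold is an assumption, and the 02:00 window is just the example used throughout this post. Note that triage checks the disk only on the machine it runs on, so step 1 still has to be repeated per node.

```shell
# Print mount points at or above THRESHOLD percent used.
# Reads `df -P` output on stdin so it can be tested on saved output.
full_filesystems() {
  awk -v t="$1" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= t) print $6, $5 }'
}

# Run the checklist in order. Needs kubectl access and, for df and
# journalctl, a shell on the node being inspected.
triage() {
  df -P | full_filesystems 80
  kubectl get events -A --sort-by=.lastTimestamp | grep -Ei 'evict|oom|backoff'
  kubectl describe nodes | grep -A15 Conditions
  kubectl get cronjobs -A
  journalctl --since "02:00" --until "02:30" | tail -50
  kubectl get nodes   # if this answers at all, the API server is up
}
```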

If the culprit is a CronJob, give it its own resource limits and, ideally, a concurrencyPolicy: Forbid so a slow run cannot overlap the next one and double the load. If the culprit is the storage layer misbehaving under the nightly backup, that is its own rabbit hole — persistent storage on Kubernetes is a genuinely hard problem, and I have written separately about why picking the right storage backend matters.
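
For the CronJob case, a minimal sketch of what that looks like, again as a function that prints the manifest for review. The job name, image, schedule, and resource numbers are all placeholders, not a recommendation for any particular backup tool.

```shell
# Emit a CronJob that cannot overlap itself and has its own memory ceiling.
print_backup_cronjob() {
  cat <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup          # hypothetical name
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid     # skip a run if the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: example.invalid/backup:latest   # placeholder image
              resources:
                requests:
                  memory: 256Mi
                  cpu: 100m
                limits:
                  memory: 512Mi
EOF
}

# Assumed usage: print_backup_cronjob | kubectl apply -f -
```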

The reason these crashes feel sudden is that nobody was watching the slope. Memory that climbs linearly all evening is completely legible on a graph and completely invisible if you only look after the node is already dead. Run Prometheus and Grafana (k3s’s bundled metrics-server is enough for kubectl top, but it keeps no history and sends no alerts) and alert on the approach, not the event: node memory above 85%, disk above 80%, a pod’s restart count climbing. An alert that fires at 85% disk at 23:00 lets you fix it before the 02:00 spiral; an alert that fires when the node is NotReady just tells you what you already know.
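
Those three alerts could look roughly like the Prometheus rule file below. This is a sketch that assumes node_exporter and kube-state-metrics are being scraped, since that is where these metric names come from. The 85% and 80% thresholds are the ones named above; the rule names, the "for" durations, and the restart count of 3 per hour are my own placeholders.

```shell
# Emit a Prometheus rule group that alerts on the approach, not the event.
print_trend_alerts() {
  cat <<'EOF'
groups:
  - name: approach-not-event
    rules:
      - alert: NodeMemoryHigh
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.85
        for: 10m
      - alert: NodeDiskHigh
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) > 0.80
        for: 10m
      - alert: PodRestartsClimbing
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
EOF
}

# Assumed usage: print_trend_alerts > nightly-trends.rules.yml
```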

The other half of resilience is making the failure survivable when it does happen. A PodDisruptionBudget keeps a minimum number of replicas alive during voluntary disruption; anti-affinity rules stop all your replicas landing on the same node that is about to die; and honestly, a tested backup is what turns a bad night into a shrug. If your cluster state and persistent volumes are backed up and you have actually restored them once, a total node loss is an inconvenience rather than a catastrophe — which is exactly the argument I make in more detail about why every project needs a backup plan you have actually tested.
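
A minimal sketch of the PodDisruptionBudget piece, assuming a workload whose pods carry an app label (the label value "web" and the function name are hypothetical). It covers voluntary disruptions such as a node drain only; it does nothing for a node that simply dies, which is why the backup matters more.

```shell
# Emit a PodDisruptionBudget that keeps at least one replica running
# during voluntary disruptions such as `kubectl drain`.
print_pdb() {
  app=${1:-web}
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${app}-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: ${app}
EOF
}

# Assumed usage: print_pdb web | kubectl apply -f -
```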

If you run a single hobby app on a single node and you genuinely do not care if it is down until you wake up, most of this is over-engineering. Set container log limits and a memory limit and stop there; that alone prevents the ugliest death spiral for about ten minutes of work. But if anything depends on that cluster staying up overnight (a home automation setup, a family photo server, anything you would be embarrassed to have down when someone else needs it), then log rotation, memory limits, sane probes, and a single trend alert add up to the highest-leverage hour you will spend on your cluster all year.

The deeper lesson is that Kubernetes does not crash at 2 a.m. because it is fragile. It crashes at 2 a.m. because you gave it no limits, no headroom, and no early warning, and then a timer you forgot about pulled the trigger while you slept. Fix those three things and the small hours get very quiet — which, for a piece of infrastructure, is the highest compliment there is.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.