Kubernetes CronJobs: Scheduled Tasks That Don't Silently Fail

How to schedule recurring work in a cluster and actually know when it breaks

By Smarc

27-08-2024

Every cluster ends up with a graveyard of scheduled tasks. Backups, certificate renewals, cache warmers, the nightly script that reconciles a database with some upstream nonsense. On a single box you’d reach for crontab -e and move on with your life. In Kubernetes you reach for a CronJob, and if you’re not careful you reach for a thing that fails quietly at 3am and tells nobody.

I have lost data to silently failing cron tasks. Not a CronJob specifically — an old-fashioned line in a crontab that had been emailing its errors into the void for eight months, because the mail relay it wrote to had been decommissioned and nobody noticed the bounce. The nightly backup had been failing since roughly the spring; I found out in the autumn, the hard way, when I needed to restore. The lesson stuck. A scheduled task you don’t monitor isn’t automation, it’s a slow-motion incident waiting for a calendar invite. Kubernetes gives you better tooling for this than a plain crontab ever did, but only if you opt in.

What a CronJob actually is

A CronJob is a thin wrapper. On a schedule, it creates a Job. The Job creates one or more Pods, runs them to completion, and records whether they succeeded. The CronJob itself does no work — it’s a factory for Jobs, nothing more. Understanding that layering matters, because almost every confusing behaviour comes from one of those three objects, not from “cron” as a vague concept. When a task misbehaves, the first question is always: is this a CronJob problem (scheduling), a Job problem (retries, deadlines), or a Pod problem (the container itself)? The answer tells you where to look.

Here’s a backup task that does the sensible things:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup
spec:
  schedule: "17 2 * * *"        # 02:17 daily, not on the busy hour
  timeZone: "Europe/London"
  concurrencyPolicy: Forbid     # don't stack runs if one is slow
  startingDeadlineSeconds: 300  # give up if missed by >5 min
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5     # keep failures around to inspect
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 1800  # kill it after 30 min, no zombies
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: dump
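              # Note: stock postgres:16 has pg_dump but no aws CLI; use an image that ships both.
              image: postgres:16
              # bash + pipefail: otherwise a failed pg_dump still uploads an empty archive and exits 0
              command: ["/bin/bash", "-c"]
              args: ["set -euo pipefail; pg_dump \"$DATABASE_URL\" | gzip | aws s3 cp - s3://backups/pg/$(date +%F).sql.gz"]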
              envFrom:
                - secretRef:
                    name: pg-backup-env

Every field above is doing real work, and the defaults for most of them are wrong for a task you care about. Let’s go through the ones that actually save you.

The fields that save you

Most CronJob disasters trace to two or three defaults nobody changed.

concurrencyPolicy defaults to Allow. If your task occasionally runs long, the next scheduled tick fires anyway, and now two backups are fighting over the same lock, or two identical pg_dumps are hammering the same database. Set it to Forbid for anything stateful, or Replace if a newer run should supersede a stale one. Forbid is almost always what a backup wants.

startingDeadlineSeconds is the quiet hero. If the controller was down, or the cluster was busy, a scheduled run can be missed. Without a deadline, the controller counts every run missed since the last one it scheduled; it may try to backfill, or, worse, if more than 100 schedules were missed it stops scheduling entirely and logs a cryptic message about too many missed start times. This is a real trap for frequent schedules: at one run every half hour, 100 misses is only about two days, so a cluster that was down over a long weekend can come back and refuse to schedule your CronJob at all, silently, until you notice. (A daily job like the backup above would need more than three months of misses to get there.) Setting a deadline says “if you can’t start within five minutes of the scheduled time, skip it and wait for the next one,” and the controller then only counts misses inside that window, which sidesteps the whole backfill mess.

backoffLimit and activeDeadlineSeconds govern the Job. The first limits retries before the Job is marked failed; the second is a hard wall-clock cap so a wedged Pod doesn’t run forever, eating a node and never finishing. A backup that hangs on a network stall will otherwise sit there consuming resources — and, with concurrencyPolicy: Forbid, block every subsequent run — until someone kills it by hand. The deadline turns that into a clean failure you can alert on.

Two more worth setting deliberately: restartPolicy: Never, so that a crashed Pod is recorded as a failure and replaced by a fresh Pod under backoffLimit rather than the kubelet silently restarting the container under you, and the history limits, which control how many old Jobs stick around. Keep failedJobsHistoryLimit non-zero, because those failed Jobs are your evidence.

Time zones, and the trap that came before them

For a long time CronJob schedules ran in the time zone of the kube-controller-manager, which on most managed clusters meant UTC, whether you’d noticed or not. Your carefully chosen “2am, off-peak” backup ran at whatever 2am UTC translated to locally, which is the middle of the working day in some regions. The timeZone field (stable since Kubernetes 1.27) fixed that: set it explicitly and stop doing daylight-saving arithmetic in your head. If your cluster predates that field, assume UTC and write your schedules accordingly, because “2am” to you is not “2am” to the controller. One caution: the timeZone value must be a valid IANA name (Europe/London, not BST). A typo is normally rejected by API validation when you apply the manifest; if a name later stops resolving, for example after a change to the host’s time zone database, the controller stops creating Jobs and records an UnknownTimeZone event, which you have to go looking for.

Making failure loud

This is the part people skip, and it’s the whole point. A CronJob that fails leaves a failed Job behind, and unless you’re staring at kubectl get jobs, you’ll never see it. You need failure to come and find you.

# A failed Job sticks around if failedJobsHistoryLimit > 0:
kubectl get jobs --sort-by=.status.startTime

# Wire an alert on the metric your monitoring already scrapes:
#   kube_job_status_failed > 0
# Fire it through Alertmanager to wherever you actually look.

The pattern that works: keep failedJobsHistoryLimit non-zero so failures persist for inspection, and alert on kube_job_status_failed via kube-state-metrics. That single metric, scraped by Prometheus and wired to Alertmanager, converts “silent 3am failure” into a notification. If you’re running a leaner stack without a full Prometheus deployment, even something lightweight like Beszel plus a health-check ping is miles better than nothing.
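As a rough sketch of that alert, here is what a Prometheus rule file might look like. The group name, alert name, severity label and the five-minute hold are all assumptions for illustration; the selector relies on the fact that Jobs created by a CronJob are named after it with a timestamp suffix, and on kube-state-metrics exposing the job_name and namespace labels.

```yaml
# Sketch of a Prometheus alerting rule; names and thresholds are illustrative.
groups:
  - name: cronjob-failures            # hypothetical group name
    rules:
      - alert: BackupJobFailed        # hypothetical alert name
        # Matches any Job spawned by the pg-backup CronJob that has failed Pods.
        expr: kube_job_status_failed{job_name=~"pg-backup-.*"} > 0
        for: 5m
        labels:
          severity: page              # assumption: route this however your Alertmanager expects
        annotations:
          summary: "Job {{ $labels.namespace }}/{{ $labels.job_name }} has failed"
```

One thing to expect: the metric stays non-zero for as long as the failed Job object exists, so the alert keeps firing until the Job is deleted or rotated out by failedJobsHistoryLimit.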

For tasks that must run — a backup, a billing job — go further and alert on absence too: a “dead man’s switch” that complains if the success signal hasn’t arrived within the expected window. A backup that didn’t run is just as dangerous as one that ran and failed, and only the dead man’s switch catches the first case. The classic implementation is a hosted check-in URL (Healthchecks.io and similar) that your Job curls on success; if the ping doesn’t arrive on schedule, the monitor alerts you. That’s the piece that would have caught my eight-months-of-nothing backup.
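A minimal sketch of the check-in, assuming the container image ships curl alongside pg_dump and the aws CLI (stock postgres:16 does not), and assuming the pg-backup-env secret also carries a HEALTHCHECK_URL variable holding the ping URL. Both of those are assumptions, not something the manifest above provides.

```yaml
# Fragment of the Pod template's containers list; not a complete manifest.
containers:
  - name: dump
    image: postgres:16   # assumption: swap for an image that also ships the aws CLI and curl
    command: ["/bin/bash", "-c"]
    args:
      - |
        set -euo pipefail
        pg_dump "$DATABASE_URL" | gzip | aws s3 cp - "s3://backups/pg/$(date +%F).sql.gz"
        # Reached only if every command above exited 0.
        curl -fsS --retry 3 --max-time 10 -o /dev/null "$HEALTHCHECK_URL"
    envFrom:
      - secretRef:
          name: pg-backup-env   # assumed to also carry HEALTHCHECK_URL
```

The ping sits last on purpose: with set -euo pipefail, any failure earlier in the script exits before curl runs, so silence at the monitor means either the run failed or it never started.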

Idempotency, because retries will happen

Here’s a subtlety that bites people who come from single-host cron: in Kubernetes, your task will occasionally run more than once. A node dies mid-run and the Job reschedules the Pod. backoffLimit retries a transient failure. A concurrencyPolicy: Replace kills a slow run and starts a fresh one. None of this is a bug — it’s the resilience you signed up for — but it means your task must be idempotent or safe to re-run. A backup that writes to s3://backups/pg/$(date +%F).sql.gz is fine, because a second run just overwrites today’s file with an identical dump. A task that increments a counter, sends an email, or charges a card is dangerous the moment it runs twice.

The fix is to design for at-least-once execution. Make the task check whether the work is already done before doing it, key side-effects to an idempotency token, or wrap the critical bit in a database transaction that’s safe to replay. If you genuinely cannot make a task idempotent, concurrencyPolicy: Forbid plus a short startingDeadlineSeconds narrows the window, but it never fully closes it — distributed scheduling and exactly-once delivery don’t coexist. Assume at-least-once and build accordingly; it’s the single most important habit that separates a CronJob you can trust from one that quietly does the wrong thing under load.
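One way to sketch the “check whether the work is already done” guard is a per-day marker object. Everything named here is hypothetical: the image, the /app/reconcile binary and the marker key are stand-ins, and the image is assumed to contain bash and the aws CLI.

```yaml
# Fragment of a Pod template's containers list; every name here is hypothetical.
containers:
  - name: reconcile
    image: example.com/reconcile:1.0   # hypothetical image with bash, the aws CLI and the task
    command: ["/bin/bash", "-c"]
    args:
      - |
        set -euo pipefail
        KEY="markers/reconcile-$(date +%F).done"   # one marker per day
        # An earlier attempt today already finished: do nothing, report success.
        if aws s3api head-object --bucket backups --key "$KEY" >/dev/null 2>&1; then
          echo "already done for $(date +%F), skipping"
          exit 0
        fi
        /app/reconcile                 # hypothetical task binary
        # Written only after the work above exited 0.
        echo done | aws s3 cp - "s3://backups/$KEY"
```

This narrows the window rather than closing it: a Pod killed after the work finishes but before the marker is written will run again, and any head-object error (not just “not found”) is treated as “not done yet”. For side effects that must never repeat, the idempotency key has to live in the same transaction as the work itself.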

Resource requests, so the scheduler cooperates

Set resources.requests on the Job’s Pod template. Without them the scheduler treats your task as needing nothing and will place it on any node, including one that is already full, where it is first in line to be evicted or OOM-killed halfway through a backup. With an honest request, the scheduler either finds a node with real headroom or leaves the Pod Pending; if it stays Pending past activeDeadlineSeconds, the run fails visibly instead of dying mid-dump. A modest request and a sane limit (enough memory for pg_dump plus gzip, a fraction of a CPU) makes scheduling deterministic and stops a heavy nightly job from starving your interactive workloads. It’s the same discipline that applies to any pod; a scheduled task is not exempt just because it only runs at 2am.
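For the backup container, that might look something like the fragment below. The numbers are placeholders, not measurements; size them from what pg_dump and gzip actually use against your database.

```yaml
# Fragment of the Pod template's containers list; values are illustrative only.
containers:
  - name: dump
    image: postgres:16
    resources:
      requests:
        cpu: 250m        # what the scheduler reserves on the node
        memory: 256Mi
      limits:
        memory: 512Mi    # hard ceiling; exceeding it gets the container OOM-killed
```

Leaving the CPU limit unset here is a deliberate choice in this sketch: the request guarantees a share, and the job can use idle CPU at 2am without being throttled.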

Troubleshooting: the failures you’ll actually hit

“My CronJob stopped scheduling.” Almost always the too-many-missed-starts trap. Check the CronJob’s events (kubectl describe cronjob pg-backup) for a message about missed start times exceeding 100. Set startingDeadlineSeconds so the controller only counts misses inside that window instead of everything since the last scheduled run, and scheduling resumes.

Jobs pile up and never clear. You’ve left concurrencyPolicy: Allow on a task that runs long, so runs stack. Switch to Forbid or Replace, and add activeDeadlineSeconds so a hung Pod can’t block the queue indefinitely.

The Pod runs but the task didn’t do anything. Check restartPolicy and exit codes. A script that exits 0 on a caught error looks like success to Kubernetes — the Job is only as honest as your container’s exit code. Make your task exit non-zero on failure, or Kubernetes will cheerfully record a “successful” no-op.

Schedule fires at the wrong time. Time zone. Confirm timeZone is set to a valid IANA name and that you’re not reading UTC as local.

This same “fail loudly, catch the silent case” discipline is really just incident-response hygiene applied to scheduled work; if that’s a rabbit hole you want to go down, I’ve written about where automated incident response helps and where it doesn’t.

Is it worth it?

If you’re running a single application on a single VM, a CronJob is overkill — use a real crontab and a monitoring ping. But if you already have a cluster, scheduling work as CronJobs is the right call: you get retries, history, resource limits, and the same observability stack as everything else. The catch is that none of the safety is automatic. The defaults will let a task fail in silence for months. Spend ten minutes setting the concurrency policy, the deadlines, the time zone, and an alert on kube_job_status_failed plus a dead man’s switch on anything critical, and you turn a liability into genuine automation. Skip that step and you’ve just reinvented the crontab that lost my data, only with more YAML — and the extra YAML will make you feel like you’ve done something robust, which is arguably worse than the honest crontab that never pretended to be.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #kubernetes #devops #automation #reliability
