Kubernetes CronJobs: Scheduled Tasks That Don't Silently Fail

How to schedule recurring work in a cluster and actually know when it breaks

Every cluster ends up with a graveyard of scheduled tasks. Backups, certificate renewals, cache warmers, the nightly script that reconciles a database with some upstream nonsense. On a single box you’d reach for crontab -e and move on with your life. In Kubernetes you reach for a CronJob, and if you’re not careful you reach for a thing that fails quietly at 3am and tells nobody.

I have lost data to silently failing cron tasks. Not a CronJob specifically — an old-fashioned line in a crontab that had been emailing its errors into the void for eight months. The lesson stuck. A scheduled task you don’t monitor isn’t automation, it’s a slow-motion incident waiting for a calendar invite. Kubernetes gives you better tooling for this than a plain crontab ever did, but only if you opt in.

A CronJob is a thin wrapper. On a schedule, it creates a Job. The Job creates one or more Pods, runs them to completion, and records whether they succeeded. The CronJob itself does no work — it’s a factory for Jobs, nothing more. Understanding that layering matters, because almost every confusing behaviour comes from one of those three objects, not from “cron” as a vague concept.

Here’s a backup task that does the sensible things:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup
spec:
  schedule: "17 2 * * *"        # 02:17 daily, not on the busy hour
  timeZone: "Europe/London"
  concurrencyPolicy: Forbid     # don't stack runs if one is slow
  startingDeadlineSeconds: 300  # give up if missed by >5 min
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5     # keep failures around to inspect
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 1800  # kill it after 30 min, no zombies
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: dump
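              # The stock postgres image ships pg_dump but not the aws CLI; use an image with both.
              image: postgres:16
              command: ["/bin/bash", "-c"]
              # pipefail: without it, a failed pg_dump is masked by a successful upload
              args: ["set -euo pipefail; pg_dump \"$DATABASE_URL\" | gzip | aws s3 cp - s3://backups/pg/$(date +%F).sql.gz"]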
              envFrom:
                - secretRef:
                    name: pg-backup-env

Most CronJob disasters trace to two or three defaults nobody changed.

concurrencyPolicy defaults to Allow. If your task occasionally runs long, the next scheduled tick fires anyway, and now two backups are fighting over the same lock. Set it to Forbid for anything stateful, or Replace if a newer run should supersede a stale one.
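As a contrast to the Forbid setting in the backup above, here is a minimal sketch of a task where Replace is the better fit. The cache-warmer name, its schedule, and the image are hypothetical and only there to make the fragment complete; the point is the one concurrencyPolicy line.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cache-warmer                # hypothetical task, not from the backup example
spec:
  schedule: "*/10 * * * *"          # every ten minutes
  concurrencyPolicy: Replace        # a run still going at the next tick is replaced by the new one
  jobTemplate:
    spec:
      activeDeadlineSeconds: 540    # cap each run below the ten-minute interval
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: warm
              image: registry.example.com/cache-warmer:1.0   # placeholder image
```

Replace only makes sense when a half-finished run can be thrown away safely. A backup that is killed mid-upload is a worse outcome than a skipped one, which is why the backup uses Forbid.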

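startingDeadlineSeconds is the quiet hero. If the controller was down, or the cluster was busy, a scheduled run can be missed. Without a deadline there is no expiry on a missed run, so the controller may start it long after its slot has passed. Worse, the Kubernetes documentation describes a limit of 100 missed schedules: past that, the controller does not start the Job and logs an error that is easy to overlook. Setting a deadline says “if you can’t start within five minutes of the scheduled time, skip it and wait for the next one.” Per the same documentation, it also limits the missed-schedule count to that five-minute window, which keeps you well clear of the limit.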

backoffLimit and activeDeadlineSeconds govern the Job. The first limits retries before the Job is marked failed (6 by default). The second is a hard wall-clock cap on the whole Job, retries included, so a wedged Pod doesn’t run forever, eating a node and never finishing.

For a long time CronJob schedules ran in the time zone of the kube-controller-manager, which on most managed clusters meant UTC, whether you’d noticed or not. The timeZone field fixed that — set it explicitly and stop doing daylight-saving arithmetic in your head. If your cluster predates that field, assume UTC and write your schedules accordingly, because “2am” to you is not “2am” to the controller.
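To make the mismatch concrete, here is a sketch of the same schedule on a cluster without the timeZone field. It assumes the controller runs in UTC, which you would need to confirm for your own cluster.

```yaml
# Fragment of a CronJob spec on a cluster that predates timeZone.
# Assumption: the kube-controller-manager runs in UTC.
spec:
  schedule: "17 2 * * *"   # 02:17 UTC
  # In London this fires at 02:17 in winter (GMT = UTC)
  # and at 03:17 in summer (BST = UTC+1).
```

If the local time matters more than the UTC time, the schedule has to be edited twice a year, which is exactly the arithmetic the timeZone field removes.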

This is the part people skip, and it’s the whole point. A CronJob that fails leaves a failed Job behind, and unless you’re staring at kubectl get jobs, you’ll never see it. You need failure to come and find you.

# A failed Job sticks around if failedJobsHistoryLimit > 0:
kubectl get jobs -l job-name --sort-by=.status.startTime

# Wire an alert on the metric your monitoring already scrapes:
#   kube_job_status_failed > 0
# Fire it through Alertmanager to wherever you actually look.

The pattern that works: keep failedJobsHistoryLimit non-zero so failures persist for inspection, and alert on kube_job_status_failed via kube-state-metrics. For tasks that must run — a backup, a billing job — go further and alert on absence too: a “dead man’s switch” that complains if the success signal hasn’t arrived within the expected window. A backup that didn’t run is just as dangerous as one that ran and failed, and only the dead man’s switch catches the first case.
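Both alerts can be sketched as a Prometheus rules file. This is a starting point under stated assumptions, not a drop-in config: the alert names, the severity label, and the 26-hour window are mine, and the second rule assumes your kube-state-metrics version exports kube_cronjob_status_last_successful_time.

```yaml
groups:
  - name: cronjob-alerts              # hypothetical group name
    rules:
      # Fires while any failed Job object still exists in the cluster.
      - alert: KubeJobFailed
        expr: kube_job_status_failed > 0
        for: 5m
        labels:
          severity: page              # assumption: route this however your Alertmanager expects
        annotations:
          summary: "Job {{ $labels.namespace }}/{{ $labels.job_name }} failed"

      # Absence check: no successful pg-backup run in the last 26 hours
      # (a daily schedule plus some slack), or no success ever recorded.
      - alert: PgBackupMissing
        expr: |
          time() - kube_cronjob_status_last_successful_time{cronjob="pg-backup"} > 26 * 3600
          or
          absent(kube_cronjob_status_last_successful_time{cronjob="pg-backup"})
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "pg-backup has not completed successfully within the expected window"
```

With the Prometheus Operator, the same groups list goes under spec in a PrometheusRule object. The first alert keeps firing until the failed Job is deleted or rotated out by failedJobsHistoryLimit, so clearing it is part of the cleanup. The second rule only approximates a dead man’s switch, because it depends on the same cluster and Prometheus being healthy; a stricter version has the backup ping an external service that alerts when the ping stops.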

If you’re running a single application on a single VM, a CronJob is overkill — use a real crontab and a monitoring ping. But if you already have a cluster, scheduling work as CronJobs is the right call: you get retries, history, resource limits, and the same observability stack as everything else. The catch is that none of the safety is automatic. The defaults will let a task fail in silence for months. Spend ten minutes setting the concurrency policy, the deadlines, and an alert on kube_job_status_failed, and you turn a liability into genuine automation. Skip that step and you’ve just reinvented the crontab that lost my data, only with more YAML.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.