Healthchecks.io (Self-Hosted): Making Sure Your Cron Jobs Actually Ran

Dead-man's-switch monitoring for the backups and scripts you never think about

By Smarc

21-11-2023, 2038 words, 10 min read

Here is a story that has happened to almost everyone who has ever written a cron job. You set up a nightly backup. You test it once, it works, you feel responsible and adult. Eight months later you actually need that backup, and you discover it stopped running in March because of a full disk, an expired token, or a typo you made while “tidying up.” The cron job didn’t fail loudly. It failed silently, which is the worst way for anything to fail, and nobody told you because there was nobody to tell.

The problem with monitoring cron jobs is that you can’t watch for the thing going wrong — you have to watch for the thing not happening. That’s a dead man’s switch, and Healthchecks is the cleanest implementation of one I’ve found. There’s a hosted version at healthchecks.io, but the project is open source, and self-hosting it is both easy and faintly appropriate: monitoring your own infrastructure on your own infrastructure has a pleasing symmetry to it.

The pattern: a ping you expect

The idea is almost embarrassingly simple. You create a “check,” which gives you a unique URL. Your cron job, at the end of its run, makes an HTTP request to that URL. Healthchecks knows roughly how often it should hear from you, and if a ping doesn’t arrive on schedule, that silence is what triggers the alert.

So instead of trying to detect failure, you detect the absence of success. A backup that crashes never sends its ping, the expected window passes, and Healthchecks emails you to say “I haven’t heard from your backup in 25 hours.” It’s the opposite of normal monitoring and it’s exactly right for this job.
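To make the timing concrete, here is a minimal sketch of the decision a dead-man's switch makes each time it looks at a check. It is not Healthchecks' actual code: the function name check_status and its arguments are hypothetical, and the three states are a simplification of the period-plus-grace model (the grace time is the allowance for lateness discussed further down).

```shell
#!/usr/bin/env bash
# check_status LAST_PING NOW PERIOD GRACE   (hypothetical helper, all values in seconds)
# Prints "up" while the next ping is not yet due, "late" while the job is
# overdue but still inside the grace time, and "down" once even that has passed.
check_status() {
  local last_ping=$1 now=$2 period=$3 grace=$4
  local elapsed=$(( now - last_ping ))

  if (( elapsed <= period )); then
    echo "up"
  elif (( elapsed <= period + grace )); then
    echo "late"
  else
    echo "down"      # silence has lasted too long: this is what triggers the alert
  fi
}
```

With a one-day period and a one-hour grace, `check_status 0 90001 86400 3600` prints down: just over 25 hours of silence, which is the situation in the alert quoted above.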

This is worth dwelling on because it’s the bit people get wrong when they roll their own. The naïve instinct is to make the backup script email you when it fails. But think about how backups actually die: the disk fills and the script can’t write its log, let alone send mail; the box is powered off; cron itself is misconfigured and the script never runs at all; a DNS change quietly breaks the SMTP relay the script was going to use. In every one of those cases, a failure-triggered email is exactly the email that never arrives. The dead-man’s-switch is immune to all of them, because it asks the question from the outside: did the expected signal show up, yes or no? If your monitoring depends on the failing thing to report its own failure, you don’t have monitoring, you have a comforting story.

It pairs neatly with active monitoring rather than replacing it. A tool like Uptime Kuma watches things that are meant to be up — websites, APIs, ports — by poking them on a timer. Healthchecks watches things that are meant to happen — scheduled jobs — by waiting for them to check in. Most homelabs want both, and they answer genuinely different questions.

Adding it to a script

Wiring it in is a one-liner. The crudest version just curls the URL at the end of the crontab line:

30 2 * * * /usr/local/bin/backup.sh && curl -fsS -m 10 --retry 3 https://hc.example.com/ping/your-uuid-here

The && is doing real work: the ping only fires if backup.sh exits successfully. If the script fails, no ping, and you get told.

But the better pattern signals start, success and failure explicitly, which gives you timing data and immediate failure alerts rather than waiting for the silence window:

#!/usr/bin/env bash
URL="https://hc.example.com/ping/your-uuid-here"
curl -fsS -m 10 "$URL/start"            # mark the run as started

if /usr/local/bin/backup.sh; then
  curl -fsS -m 10 "$URL"                # success
else
  curl -fsS -m 10 "$URL/fail"           # explicit failure, alert now
fi

That’s the whole integration. Anything that can make an HTTP request — a shell script, a Python job, a Kubernetes CronJob, a Windows task — can report to Healthchecks.
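If several jobs need the same treatment, the start/success/fail pattern can be pulled into a small reusable wrapper. This is one possible sketch, not something Healthchecks ships: the names hc_ping and hc_run are made up for illustration. It differs from the script above in two ways: a failed ping never breaks the job, and the job's own exit code is passed back to cron.

```shell
#!/usr/bin/env bash
# hc_ping URL  (hypothetical helper)
# Sends one ping. Any curl failure is swallowed so that a monitoring
# hiccup can never turn a healthy job into a failed one.
hc_ping() {
  curl -fsS -m 10 --retry 3 -o /dev/null "$1" || true
}

# hc_run URL COMMAND [ARGS...]  (hypothetical helper)
# Pings URL/start, runs the command, then pings URL on success or
# URL/fail on failure. Returns the command's own exit code.
hc_run() {
  local url=$1
  shift
  hc_ping "$url/start"
  "$@"
  local rc=$?
  if (( rc == 0 )); then
    hc_ping "$url"
  else
    hc_ping "$url/fail"
  fi
  return "$rc"
}
```

A job script would then source this file and call something like `hc_run "https://hc.example.com/ping/your-uuid-here" /usr/local/bin/backup.sh`.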

There’s one more refinement worth adopting from the start: capture the job’s output and ship it with the ping. Healthchecks accepts a request body with each ping and stores it against that run, up to a configurable size limit, so when an alert fires at 3am you don’t just learn that the backup failed, you can read what it said as it died. The pattern is one extra variable:

#!/usr/bin/env bash
URL="https://hc.example.com/ping/your-uuid-here"
m=$(/usr/local/bin/backup.sh 2>&1)          # capture stdout+stderr
curl -fsS -m 10 --data-raw "$m" "$URL/$?"   # body = output, /$? = exit code

That $? trick is neat: Healthchecks treats a ping to /0 as success and any non-zero exit code (/1, /2, …) as failure, so the script’s own exit status drives the result with no if statement at all. The captured output turns “something broke” into “the volume /mnt/data was not mounted,” which is the difference between fixing it before coffee and spending the morning reproducing it.
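One caveat with capturing everything: a very chatty job can produce more output than the server keeps, or more than fits comfortably in a command-line argument. A hedged sketch of one way round it is to keep only the tail of the output. The helper name run_and_capture is hypothetical, and the 10000-byte cap in the usage comment is an assumption; check what body limit your own instance is configured with. The subtle part is the exit code: after a pipeline, $? belongs to tail, so the sketch reads bash's PIPESTATUS instead.

```shell
#!/usr/bin/env bash
# run_and_capture LIMIT COMMAND [ARGS...]  (hypothetical helper)
# Runs the command, prints only the last LIMIT bytes of its combined
# stdout+stderr, and returns the command's exit code (not tail's).
run_and_capture() {
  local limit=$1
  shift
  "$@" 2>&1 | tail -c "$limit"
  return "${PIPESTATUS[0]}"
}

# Usage sketch (URL as in the script above; the 10000 is an assumed cap):
#   m=$(run_and_capture 10000 /usr/local/bin/backup.sh); rc=$?
#   curl -fsS -m 10 --data-raw "$m" "$URL/$rc"
```

Keeping the tail rather than the head is a deliberate choice here: when a job dies, the interesting lines are usually the last ones it printed.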

Self-hosting it

The application is a Django project, so the moving parts are the app and a database. Postgres is the sensible choice for anything you care about:

services:
  healthchecks:
    image: healthchecks/healthchecks:latest
    environment:
      ALLOWED_HOSTS: hc.example.com
      DB: postgres
      DB_HOST: db
      DB_NAME: healthchecks
      DB_USER: hc
      DB_PASSWORD: changeme
      SITE_ROOT: https://hc.example.com
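      SECRET_KEY: change-me-to-a-long-random-string   # placeholder: set your own secret
      DEFAULT_FROM_EMAIL: healthchecks@example.com    # placeholder sender address
      EMAIL_HOST: smtp.example.com
      EMAIL_PORT: "587"                               # placeholder SMTP settings below:
      EMAIL_HOST_USER: healthchecks@example.com       # match whatever your relay expects
      EMAIL_HOST_PASSWORD: changeme
      EMAIL_USE_TLS: "True"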
    ports:
      - "8000:8000"
    depends_on: [db]

  db:
    image: postgres:16
    environment:
      POSTGRES_DB: healthchecks
      POSTGRES_USER: hc
      POSTGRES_PASSWORD: changeme
    volumes:
      - hc_db:/var/lib/postgresql/data

volumes:
  hc_db:

The one bit of configuration you must not skip is email — get the SMTP settings right and test them, because an alerting system that can’t deliver alerts is an elaborate way of feeling falsely safe. Healthchecks also speaks to a long list of other channels: Slack, Telegram, webhooks, Pushover, and others, so you can route the “your backup is missing” message wherever you’ll actually see it.

Two environment variables earn special mention. ALLOWED_HOSTS and SITE_ROOT must both match the hostname you actually reach the app on, but they take different forms: ALLOWED_HOSTS is the bare hostname (hc.example.com), while SITE_ROOT is the full base URL including the scheme (https://hc.example.com). Django is strict about ALLOWED_HOSTS and will throw a 400 “Bad Request” if the Host header doesn’t match, which is a confusing error if you don’t know to expect it. And SITE_ROOT is what Healthchecks bakes into the ping URLs it shows you; get it wrong and you’ll copy ping URLs that point nowhere. Put the app behind a reverse proxy with TLS (the pings themselves may carry nothing sensitive, but the dashboard has your whole infrastructure’s heartbeat on it), and derive both variables from the public HTTPS address.
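Because a mismatch between the two settings is easy to introduce and confusing to debug, a quick pre-flight check can be worth having. The following is only a sketch: host_from_url and hosts_match are hypothetical helper names, and it assumes ALLOWED_HOSTS is a comma-separated list of bare hostnames (or a single * wildcard) with no spaces.

```shell
#!/usr/bin/env bash
# host_from_url URL  (hypothetical helper)
# Reduces "https://hc.example.com:8443/path" to "hc.example.com".
host_from_url() {
  local url=$1
  url=${url#*://}    # drop the scheme
  url=${url%%/*}     # drop any path
  url=${url%%:*}     # drop any port
  echo "$url"
}

# hosts_match SITE_ROOT ALLOWED_HOSTS  (hypothetical helper)
# Succeeds if the hostname in SITE_ROOT appears in the comma-separated
# ALLOWED_HOSTS list, or if the list contains the "*" wildcard.
hosts_match() {
  local host
  host=$(host_from_url "$1")
  case ",$2," in
    *",$host,"* | *",*,"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `hosts_match "https://hc.example.com" "hc.example.com"` succeeds, while putting the scheme into ALLOWED_HOSTS by mistake makes it fail.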

A first sensible job to point at it is your backup itself. If you’ve followed something like Firefly III self-hosted or any stateful service, the database dump that protects it is precisely the kind of silent, scheduled job that needs this.

There’s an obvious bootstrapping problem worth solving on day one: who monitors the monitor? If Healthchecks itself goes down, silently, you lose visibility into every job it was watching, and the failure looks identical from the outside to “everything is fine, nothing has pinged yet.” The fix is cheap — point an external dead-man’s-switch at Healthchecks’ own health, or simply have a second, independent host ping a check that alerts if that check goes silent, so the monitoring tool is watched by something that isn’t itself. And back up the Healthchecks database the same way you’d back up anything else that matters: a nightly pg_dump of the healthchecks database, piped straight to the same off-box storage as everything else, with its own check pinging on success. Monitoring infrastructure that can’t survive losing its own disk isn’t infrastructure, it’s a single point of failure wearing a reassuring dashboard.
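The nightly dump described above might look something like the sketch below. Several things in it are assumptions rather than facts about any particular setup: the helper names (dump_db, hc_ping, backup_hc_db), the use of docker compose with the service, user and database names from the compose file earlier, and the idea that the destination path sits on off-box storage that is already mounted.

```shell
#!/usr/bin/env bash
# dump_db  (hypothetical helper)
# Assumes the compose file above: service "db", user "hc", database "healthchecks".
dump_db() {
  docker compose exec -T db pg_dump -U hc healthchecks
}

# hc_ping URL  (hypothetical helper): one ping, failures ignored.
hc_ping() {
  curl -fsS -m 10 --retry 3 -o /dev/null "$1" || true
}

# backup_hc_db DEST URL  (hypothetical helper)
# Writes a gzipped dump to DEST and pings URL, or pings URL/fail and
# returns 1 if the dump command itself failed.
backup_hc_db() {
  local dest=$1 url=$2
  dump_db | gzip > "$dest.tmp"
  local rc=${PIPESTATUS[0]}          # exit code of dump_db, not of gzip
  if (( rc == 0 )); then
    mv "$dest.tmp" "$dest"           # only a complete dump replaces the old one
    hc_ping "$url"
  else
    rm -f "$dest.tmp"
    hc_ping "$url/fail"
    return 1
  fi
}
```

Writing to a temporary file first means a dump that dies halfway never overwrites last night's good copy. The check this pings should ideally live somewhere other than the instance being backed up, for the reason given above.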

Troubleshooting: when checks misbehave

A check goes red even though the job ran. The usual cause is the schedule, not the job. A check only goes down once the period plus the grace time has passed without a ping. If you set a simple “period” of one day with a short grace time but the job runs at slightly different times, a run at 02:05 followed by one at 02:35 the next day leaves a gap of 24 hours 30 minutes; with, say, a 15-minute grace that is enough to trip the alert, and the check flaps. Either widen the grace period or, better, switch the check to a cron expression so Healthchecks knows the intended schedule precisely.

Pings work from your laptop but not from the server. Almost always egress: the cron host can’t reach the Healthchecks URL. If you self-host on an internal hostname, a job on a different network segment may not resolve or route to it. Test with the exact command the cron line uses — curl -fsS -m 10 <url> — from the actual host, as the actual user. The -f flag matters: without it, curl exits 0 even on an HTTP error, so a misconfigured URL looks like it “worked.”
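When that test fails, curl's exit code usually says which layer is at fault. The lookup below is a small sketch covering the codes most likely to turn up in this situation; the function name explain_curl_exit is hypothetical, and the wording of each hint is my own gloss on curl's documented exit codes.

```shell
#!/usr/bin/env bash
# explain_curl_exit CODE  (hypothetical helper)
# Maps a curl exit code to the most likely cause of a failed ping.
explain_curl_exit() {
  case $1 in
    0)     echo "ok" ;;
    6)     echo "DNS: could not resolve host" ;;
    7)     echo "network: could not connect to host" ;;
    22)    echo "HTTP error status (only reported because of -f)" ;;
    28)    echo "timeout: the -m limit was reached" ;;
    35|60) echo "TLS: handshake or certificate problem" ;;
    *)     echo "other curl error: $1" ;;
  esac
}

# Usage sketch, run on the cron host as the cron user:
#   curl -fsS -m 10 -o /dev/null "https://hc.example.com/ping/your-uuid-here"
#   explain_curl_exit $?
```

A 6 or 7 from the server but a 0 from your laptop points at name resolution or routing between network segments, which matches the egress problem described above.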

Curl succeeds but the check never updates. Check you’re pinging the right UUID and that SITE_ROOT is correct. Don’t count on an error code to catch a wrong UUID: depending on the deployment, a ping to a non-existent check can come back as a perfectly cheerful 200 rather than a 404 (confirm what your own instance returns with curl -i), which means a typo’d URL can fail utterly silently. It is the one place where Healthchecks’ own failure mode is itself silent. Copy the URL from the dashboard; never type it.

Alerts don’t arrive. Send a test notification from the integration’s settings page. If email specifically is dead, it’s the SMTP config nine times in ten — wrong port, missing TLS, or a relay that needs authentication you didn’t supply.

The features that make it more than curl

Beyond the basic schedule, two things earn their keep. Grace periods let you say “this job should run hourly, but don’t panic until it’s 15 minutes late,” which stops a slightly slow run from paging you. And cron expressions mean you can describe a complicated schedule — “weekdays at 6am” — and Healthchecks understands exactly when to expect the next ping, rather than guessing from a simple interval.

There’s also a tidy dashboard showing every check’s status at a glance, the last few pings, and how long each run took. The first time you see a job that’s quietly drifted from “30 seconds” to “11 minutes” over a few months, you’ll understand why having the history matters — it’s an early warning that a dataset has grown, a disk is slowing, or a network mount is flaking, long before the job actually times out.

Two more features pay for themselves once you have more than a handful of checks. Auto-provisioning via the management API lets you create and configure checks programmatically, so the same script that deploys a new service can register its monitoring in the same breath: no clicking around a UI, and no “I forgot to set up monitoring for that one” six months later. And the start/success pattern (the /start ping in the better script above) gives Healthchecks a measured run duration and lets it alert when a started job fails to report success within its grace time, so it catches a job that runs too long as well as one that never runs at all. A backup that normally takes ten minutes and is suddenly still going after two hours is often a sign of trouble that pure success-or-silence monitoring would miss entirely.
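As a rough sketch of what auto-provisioning can look like from a deploy script: the helper names check_payload and create_check and the HC_SITE and HC_API_KEY variables are hypothetical, the payload builder assumes its inputs contain no quotes or backslashes (it does no JSON escaping), and the API version in the path may need adjusting for your instance. The "unique" field asks the API to update an existing check with the same name rather than create a duplicate, which makes the call safe to repeat on every deploy.

```shell
#!/usr/bin/env bash
# check_payload NAME TAGS SCHEDULE GRACE_SECONDS  (hypothetical helper)
# Builds the JSON body for creating a cron-scheduled check.
# Assumes the arguments need no JSON escaping.
check_payload() {
  printf '{"name": "%s", "tags": "%s", "schedule": "%s", "tz": "UTC", "grace": %d, "unique": ["name"]}' \
    "$1" "$2" "$3" "$4"
}

# create_check NAME TAGS SCHEDULE GRACE_SECONDS  (hypothetical helper)
# HC_SITE (e.g. https://hc.example.com) and HC_API_KEY are assumed to be
# set in the environment; the key comes from the project's settings page.
create_check() {
  curl -fsS -m 10 -X POST "$HC_SITE/api/v3/checks/" \
    -H "X-Api-Key: $HC_API_KEY" \
    -H "Content-Type: application/json" \
    --data-raw "$(check_payload "$@")"
}
```

A deploy script might call `create_check nightly-backup "prod backup" "30 2 * * *" 3600` and read the ping URL out of the JSON response.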

It’s also worth grouping checks with tags from day one. A flat list of three checks is fine; a flat list of forty is a mess. Tag them by host, by criticality, or by what wakes you versus what can wait until morning, and the dashboard stays legible as your collection grows — and it will grow, because once you have this, you start wanting it on everything.

The verdict

This is one of those rare tools where the effort-to-payoff ratio is almost insulting. An afternoon to self-host it, a one-line change per cron job, and you have closed off an entire category of silent, catastrophic failure — the backup that wasn’t, the certificate renewal that didn’t, the sync that stopped. If you run literally anything on a schedule that you’d be upset to discover had stopped, you want this. I added it to my backups years ago and the single most reassuring sound in my homelab is the one I never hear: the alert that doesn’t fire, because everything actually ran. Set it up before you need it, because by the time you need it, it’s already too late.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #self-hosting #monitoring #cron #backups
