Edge Computing vs. Cloud: Choosing the Right Architecture for Mission-Critical IoT

Where your data gets processed is an architecture decision, not a default — and getting it wrong is expensive

The first time I sent every sensor reading from a home-lab project straight to a cloud endpoint, the architecture looked clean on a whiteboard and fell apart in a week. The broadband had a thirty-second wobble one evening, a dozen sensors queued their readings, the reconnect storm hammered the endpoint, and an automation that was supposed to cut power to a load when a temperature crossed a threshold simply didn’t fire — because the decision lived in the cloud, and the cloud wasn’t reachable. Nobody got hurt; it was a hobby rig. But it taught me the single most important question in IoT architecture, which has nothing to do with which cloud you like: where does the decision get made, and what happens to that decision when the network is gone?

That question is the whole game. “Edge vs. cloud” gets framed as a religious choice, and it isn’t. It’s three boring engineering trade-offs — latency budget, failure behaviour, and the cost of moving bytes — and once you’ve quantified those for your workload, the architecture mostly designs itself.

Sending a reading to a cloud region and getting a response back rarely takes less than 20–50ms of round-trip time, even on a good connection, and that’s before any processing. For a dashboard that updates every few seconds, nobody cares. For a control loop that has to react inside, say, 10ms (a safety interlock, a motor controller, a closed-loop industrial process), the cloud is physically disqualified by the speed of light and the realities of the public internet, no matter how much you spend.

So write the number down first. What is the maximum acceptable time between “sensor observes a thing” and “system reacts to the thing”? If that budget is comfortably larger than your network round-trip plus jitter plus a safety margin, the cloud is on the table. If it isn’t, the decision logic must live at the edge, near the sensor, and that’s not a preference — it’s a constraint. Everything else in the design follows from honestly answering that one question.
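As a rough sketch of that budget check, assuming you have measured round-trip times to your own cloud endpoint: the function name, the sample values, and the 2x safety factor are all illustrative assumptions, not figures from any standard.

```python
import statistics


def cloud_is_viable(budget_ms, rtt_samples_ms, processing_ms=0.0, safety_factor=2.0):
    """Return True if the reaction budget covers worst-case RTT plus jitter, with margin.

    rtt_samples_ms: measured round-trip times to the cloud endpoint, in ms.
    safety_factor: hypothetical margin multiplier; choose your own.
    """
    worst_rtt = max(rtt_samples_ms)
    jitter = statistics.pstdev(rtt_samples_ms)  # spread of the measured samples
    required = (worst_rtt + jitter + processing_ms) * safety_factor
    return budget_ms >= required


# Illustrative numbers only: five RTT measurements in milliseconds.
samples = [24.0, 31.0, 28.0, 47.0, 26.0]

print(cloud_is_viable(budget_ms=2000, rtt_samples_ms=samples))  # dashboard refresh
print(cloud_is_viable(budget_ms=10, rtt_samples_ms=samples))    # safety interlock
```

With these made-up samples the dashboard case passes and the 10ms interlock fails, which is the whole point: the answer comes from the measured numbers, not from preference.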

This is the bit my whiteboard diagram quietly assumed away: it drew a solid line from sensor to cloud and never asked what the line does when it breaks. For anything you’d call mission-critical, “the link is down” is not an edge case, it’s a Tuesday. WAN links flap, ISPs have outages, a digger finds a fibre.

An edge node keeps deciding when the link is gone. A cloud-dependent design stops. That single difference is usually what pushes a genuinely critical control path to the edge regardless of latency: you want the local box to enforce the safety rule autonomously, buffer its telemetry, and reconcile with the cloud when connectivity returns. The cloud becomes the place you do analysis and long-term storage, not the place you make time-or-safety-critical decisions.
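To make the buffer-and-reconcile idea concrete, here is a minimal store-and-forward sketch. The class name, the drop-oldest policy, and the `send` callback are assumptions made for illustration; this is not how any particular broker implements its queue.

```python
from collections import deque


class TelemetryBuffer:
    """Bounded store-and-forward queue: the oldest reading is dropped on overflow."""

    def __init__(self, max_messages):
        self._queue = deque()
        self._max = max_messages
        self.dropped = 0  # count overflow so gaps are visible, not silent

    def enqueue(self, reading):
        if len(self._queue) >= self._max:
            self._queue.popleft()
            self.dropped += 1
        self._queue.append(reading)

    def drain(self, send):
        """Send queued readings oldest-first; stop at the first failed send."""
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break  # link went away again; keep the rest queued
            self._queue.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._queue)


buf = TelemetryBuffer(max_messages=3)
for value in [20.1, 20.4, 20.9, 21.3]:  # link is down: four readings arrive
    buf.enqueue(value)

delivered = []
buf.drain(lambda r: delivered.append(r) or True)  # link is back: every send succeeds
print(delivered, buf.dropped)
```

Dropping the oldest reading is only one possible policy; if the earliest data from an outage matters most, reject new readings instead and count those.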

If you’ve ever lost data because you assumed a remote service would always be there, the discipline here is the same one I bang on about in why every side project needs a backup plan you can actually restore from: design for the failure first, because the failure is the only part guaranteed to happen.

In practice the answer for most mission-critical IoT is neither pure edge nor pure cloud — it’s a split where the time-sensitive loop runs locally and the heavy, latency-tolerant work runs centrally. The clean way to wire that is a local MQTT broker at the edge that buffers when the uplink is down and forwards when it’s up. Mosquitto does exactly this with a bridge:

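# /etc/mosquitto/conf.d/bridge.conf: edge broker bridging to the cloud
connection cloud-uplink
address mqtt.example.com:8883

# only forward telemetry up; commands come down, control decisions stay local
topic sensors/# out 1
topic commands/# in 1

# Mosquitto spells MQTT 5 as mqttv50 (other valid values: mqttv31, mqttv311)
bridge_protocol_version mqttv50

# TLS on the bridge is only enabled once a CA is configured.
# The path below is an assumption; point it at your own CA file.
bridge_cafile /etc/mosquitto/ca_certificates/ca.crt
bridge_insecure false

# publish bridge connection state; reconnect with a jittered backoff (base 10s, cap 30s)
notifications true
restart_timeout 10 30

# THESE are the lines that save you: queue while the link is down
# (both are broker-wide settings; the queue limit applies per client)
max_queued_messages 100000
queue_qos0_messages true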

The edge broker holds up to 100,000 queued messages while the WAN is unreachable and drains them when the bridge reconnects, so an outage costs you a delay in the dashboards, not a hole in the data. Crucially, the local automations subscribe to the local broker, so they keep firing throughout — the cloud being unreachable never blocks a local decision. A tiny rule engine alongside the broker (Node-RED, or a few lines of Python) handles the “if temperature > threshold, cut the relay” logic without ever consulting anything off-box.
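The “few lines of Python” version of that rule might look like the sketch below. It deliberately leaves out the MQTT wiring: in a real deployment `on_reading` would be called from whatever client subscribes to the local broker, and `set_relay` is a hypothetical callback standing in for the code that drives your relay. The 70.0 threshold is an arbitrary example value.

```python
def relay_should_be_on(temperature_c, threshold_c):
    """Local safety rule: cut the relay once the temperature exceeds the threshold."""
    return temperature_c <= threshold_c


class LocalRule:
    """Applies the rule to each reading and calls set_relay only on state changes."""

    def __init__(self, threshold_c, set_relay):
        self.threshold_c = threshold_c
        self.set_relay = set_relay  # hypothetical callback that drives the relay
        self.relay_on = None        # unknown until the first reading arrives

    def on_reading(self, temperature_c):
        desired = relay_should_be_on(temperature_c, self.threshold_c)
        if desired != self.relay_on:
            self.relay_on = desired
            self.set_relay(desired)


actions = []
rule = LocalRule(threshold_c=70.0, set_relay=actions.append)
for reading in [65.0, 69.5, 71.2, 72.0, 68.0]:
    rule.on_reading(reading)
print(actions)
```

Nothing in this path touches the network, so it behaves the same whether the uplink is up or down. This sketch re-enables the relay as soon as the temperature falls back under the threshold; a real interlock may want hysteresis or a manual reset instead.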

This is the same instinct behind keeping services close to where they’re used rather than centralising everything by reflex — the kind of bare-metal, no-cloud-dependency thinking I lean on when running load balancing on a Kubernetes cluster with MetalLB instead of a cloud LB. The edge node is just that principle applied to the place the data is born.

None of this is anti-cloud. The cloud is the right home for everything the edge is bad at: training models on aggregated history, fleet-wide dashboards, long-term storage, software updates, and correlating across hundreds of sites. A constrained edge node with 4GB of RAM is never going to retrain a model or hold a year of high-resolution history. Push the summaries and the cold data up; keep the hot path down. If your workload has no hard latency budget and tolerates connectivity gaps gracefully, a cloud-first design is simpler to operate and you should prefer it — fewer boxes in the field is fewer boxes to maintain.

The temptation, once you’ve decided on edge, is to put a beefy box in the field “to be safe.” Resist it. The edge node’s job is narrow — run the control loop, buffer telemetry, do light pre-processing — and every watt and every gigabyte you add is something that has to survive heat, dust, power cuts, and the fact that you can’t easily reach it to fix it. A fanless industrial mini PC or a Raspberry Pi-class board with industrial-grade storage handles a startling amount of real work. Match the silicon to the latency budget and the data rate, not to your aspirations.

Two hardware traps catch people repeatedly. The first is storage: consumer SD cards and cheap SSDs die fast under the constant small writes that telemetry buffering generates, and they die in the field where replacing them is painful. Use industrial or high-endurance storage, mount logs and the message queue with write-minimising options, and assume the storage is the part most likely to fail. The second is power: an edge node that can’t ride out a brief brownout and come back cleanly will corrupt its queue at the worst moment. A small UPS or supercapacitor hat, plus a filesystem that survives sudden power loss, is cheap insurance.

For orchestration, decide honestly whether you need it. A single edge node running a couple of containers via docker compose and a systemd unit is gloriously simple and almost never the wrong answer for one site. The moment you have a fleet — dozens of nodes you need to update, monitor, and keep in sync — a lightweight Kubernetes distribution starts earning its keep, and the patterns from adding a second machine to a k3s cluster map neatly onto a small edge fleet: a control plane somewhere central, lightweight agents on the field nodes, rolling updates you can stage and roll back. But don’t reach for orchestration to run one container on one box. That’s complexity you’ll maintain forever to solve a problem you don’t have yet.

Edge nodes live in physically exposed places — a cabinet in a remote building, a pole, a factory floor — so “someone can walk up to it” is part of the threat model in a way a locked cloud region isn’t. That means disk encryption at rest, no secrets in plaintext config, signed firmware, and an assumption that the device may be stolen or tampered with. Against that, the cloud concentrates everything into one juicy target; segment it badly and a single breach reaches the lot.

The non-negotiables for the field side: encrypt data in transit, encrypt at rest, and give each device its own credential so a compromised node can be revoked without re-keying the fleet. For transit, that means TLS on the MQTT bridge: bridge_insecure false above keeps certificate hostname verification on, and Mosquitto also needs a CA configured (bridge_cafile or bridge_capath) before the bridge will use TLS at all. The same defence-in-depth habits from hardening a Linux server from zero to SSH hero apply directly to an edge node. It is a small Linux server, just one you can’t physically trust.

The failure modes of a buffered edge/cloud split are specific and worth knowing before they bite.

Queued messages silently dropped. If max_queued_messages is too low for the length of your outages, the broker discards the overflow and you get a gap you never notice until you go looking. Size the queue for your worst realistic outage times your message rate, and monitor the queue depth as a first-class metric.
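The sizing arithmetic is simple enough to write down. In this sketch the sensor count, publish interval, outage length, and 1.5x headroom are illustrative assumptions; substitute your own measurements.

```python
import math


def required_queue_depth(outage_seconds, sensors, interval_seconds, headroom=1.5):
    """Messages that accumulate during an outage, padded by a headroom factor."""
    messages = outage_seconds * sensors / interval_seconds
    return math.ceil(messages * headroom)


# Assumed workload: 12 sensors, one reading each every 5 seconds,
# and a worst realistic outage of 4 hours.
depth = required_queue_depth(outage_seconds=4 * 3600, sensors=12, interval_seconds=5)
print(depth)             # 51840
print(depth <= 100_000)  # True: fits within a max_queued_messages of 100000
```

Doubling the sensor count in this example would push the requirement past 100,000, which is exactly the kind of change that turns a comfortable queue into a silent gap.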

The reconnect storm. When a long outage ends, every edge node reconnects and floods the cloud endpoint at once, which is the exact thing that took down my early rig. Stagger reconnects with a randomised restart_timeout backoff (the 10 30 above sets the base and the cap, in seconds) so the fleet doesn’t synchronise into a thundering herd.
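For intuition about why jitter breaks up the herd, here is a sketch of the general “decorrelated jitter” backoff pattern. It illustrates the idea only and is not taken from Mosquitto’s source; the function name and the multiplier of three are assumptions.

```python
import random


def decorrelated_jitter_delays(base, cap, attempts, rng=None):
    """Yield reconnect delays in seconds, each within [base, cap].

    Each delay is drawn uniformly between base and three times the previous
    delay, then clamped to cap, so nodes that drop together drift apart.
    """
    rng = rng or random.Random()
    delay = base
    for _ in range(attempts):
        delay = min(cap, rng.uniform(base, delay * 3))
        yield delay


# Two differently seeded generators stand in for two field devices.
node_a = list(decorrelated_jitter_delays(10, 30, 5, random.Random(1)))
node_b = list(decorrelated_jitter_delays(10, 30, 5, random.Random(2)))
print([round(d, 1) for d in node_a])
print([round(d, 1) for d in node_b])
```

The two nodes lose the link at the same instant but pick different retry times, so the reconnects reach the cloud endpoint spread out rather than as a single spike.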

Clock skew. Edge nodes that lose power or drift away from NTP will timestamp buffered readings wrongly, and your “real-time” dashboard ends up with events out of order. Run a local NTP source and timestamp at the edge, not on cloud arrival.
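A minimal sketch of timestamping at the edge, assuming JSON payloads. The field names (`sensor`, `value`, `observed_at`) are hypothetical, and the `now` parameter exists only so the example is reproducible.

```python
import json
import time


def stamp_reading(sensor_id, value, now=None):
    """Attach the observation time at the edge, as UTC epoch seconds."""
    observed_at = time.time() if now is None else now
    return json.dumps({"sensor": sensor_id, "value": value, "observed_at": observed_at})


def order_by_observation(payloads):
    """Cloud side: sort by when readings were observed, not when they arrived."""
    return sorted((json.loads(p) for p in payloads), key=lambda r: r["observed_at"])


# A buffered reading (observed at t=100) arrives after a live one (t=160).
arrived = [stamp_reading("t1", 22.5, now=160.0), stamp_reading("t1", 21.9, now=100.0)]
print([r["observed_at"] for r in order_by_observation(arrived)])  # [100.0, 160.0]
```

This only helps if the edge clock is trustworthy, which is why the local time source matters as much as the field in the payload.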

The “it works until the link drops” trap. Test the disconnected path on purpose. Pull the WAN cable and confirm the local automations still fire and the queue fills and drains cleanly. An edge design you’ve never tested offline is just a cloud design with extra hardware.

If you have a hard latency budget, a safety-critical control loop, or a site where connectivity is unreliable and downtime is unacceptable, the decision logic belongs at the edge — full stop — with the cloud as the analysis and storage tier behind it. If your workload tolerates seconds of latency and the occasional gap, and you’d rather not maintain a fleet of field hardware, go cloud-first and don’t apologise for it; the operational simplicity is worth real money.

For most genuinely mission-critical IoT, you’ll land on the hybrid: autonomous edge decisions, buffered telemetry, cloud-side intelligence. It’s more moving parts than either extreme, and you pay for that in operational complexity. But it’s the only shape that survives a network outage without either melting down or going silent — and surviving the outage is, in the end, the entire definition of “mission-critical.” Decide where the decision is made, test what happens when the link dies, and the rest is just plumbing.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.