Building Psychological Safety in DevOps: Lessons from Flight Decks and Firefighting

Why the team that admits its mistakes fastest is the one that stays up

26-05-2025, 1824 words, 9 minutes

On 28 December 1978, United Airlines Flight 173 ran out of fuel and crashed into a Portland suburb while the crew was preoccupied with a landing-gear light. The flight engineer knew the fuel was critically low. He mentioned it. He did not push, because the captain outranked him and the cockpit culture of the era did not encourage a junior officer to insist. Ten people died a few miles short of the runway. That accident, more than any management theory, is why modern aviation adopted the discipline called crew resource management: the deliberate construction of a cockpit where the most junior person can say “we are about to crash” and be heard.

I think about Flight 173 every time a DevOps team has a quiet incident review where everyone already knows what went wrong but nobody says it. The cost of silence in an outage is measured in money and sleep rather than lives, which is exactly why we tolerate it — but the mechanism is identical. The engineer who saw the disk filling up, the contractor who knew the deploy was risky, the new hire who didn’t understand the runbook and was too embarrassed to ask: every unspoken concern is a fuel gauge nobody acted on.

Psychological safety, the term Amy Edmondson established in her 1999 study of work teams (building on her earlier research into hospital medication errors), is the shared belief that you won’t be punished or humiliated for speaking up with questions, concerns, mistakes, or bad news. Google’s Project Aristotle, which spent years dissecting what made its teams effective, found it to be the most important factor, ahead of who was on the team or how senior they were. For systems that page you at 2am, it isn’t a nicety. It’s a reliability control.

1 The flight-deck lesson: rank must not silence information

Crew resource management worked because it changed who was allowed to talk and when. The breakthrough wasn’t a new checklist; it was making it expected that a first officer challenges a captain, that any crew member can call a go-around, that “I’m not comfortable with this” is a complete and respected sentence regardless of stripes.

The DevOps translation is uncomfortable for a lot of organisations, because our hierarchies are quieter but just as real. The principal engineer’s design “obviously” works. The manager’s deadline is “obviously” fixed. The vendor’s appliance is “obviously” configured correctly. Every one of those obviouslys is a rank-silences-information moment waiting to cause an outage. The fix is the same one aviation reached for: make challenge routine and low-cost, so that nobody has to be brave to raise a concern — they just have to be doing their job.

Concretely, that means pre-mortems before risky changes (“assume this deploy fails — what killed it?”), it means an on-call engineer can halt a release without escalating to a VP, and it means the loudest person in the room actively soliciting the quietest one’s view. None of that is soft. It is the cheapest reliability investment available, and most teams under-spend on it wildly.
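One way to make “anyone can halt a release” concrete is a deploy gate that any engineer can hold with nothing more than a reason. The sketch below is only an illustration under stated assumptions: `ReleaseGate`, its method names, and the in-memory storage are hypothetical, and a real pipeline would persist holds and wire the check into its own deploy step.

```python
from dataclasses import dataclass, field


@dataclass
class ReleaseGate:
    """Hypothetical deploy gate: any engineer can place a hold, no approval needed."""

    # Maps the engineer who raised the concern to the reason they gave.
    holds: dict[str, str] = field(default_factory=dict)

    def hold(self, engineer: str, reason: str) -> None:
        # Deliberately no role or seniority check: raising a concern is part of the job.
        self.holds[engineer] = reason

    def lift(self, engineer: str) -> None:
        # Removing a hold that does not exist is a no-op rather than an error.
        self.holds.pop(engineer, None)

    def may_deploy(self) -> bool:
        # The release proceeds only when nobody has an open concern.
        return not self.holds


gate = ReleaseGate()
gate.hold("new-hire", "runbook references a dashboard that no longer exists")
blocked = not gate.may_deploy()   # True while the hold is open
gate.lift("new-hire")
clear = gate.may_deploy()         # True again once the concern is resolved
```

The design choice that matters is the missing permission check in `hold`: the cost of raising a concern is one function call, not an escalation.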

There’s a measurable tell for whether you’ve achieved this. Watch what happens in the first ten minutes of an incident. On a team without safety, those minutes are spent establishing that it wasn’t anybody’s fault, hedging, and waiting for someone senior to commit to a diagnosis first. On a team with it, someone says “I think my deploy did this, rolling back now” within ninety seconds and the room moves on to recovery. The second team isn’t smarter — it has simply removed the social cost of being the person who broke production, so the information that ends the outage surfaces immediately instead of after everyone has finished protecting themselves. That gap, measured in minutes of mean-time-to-recovery, is psychological safety showing up directly on a graph you already track.
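If you want to see that gap rather than guess at it, you can record two timestamps per incident in addition to the recovery time you already track. The following is a minimal sketch, assuming you note when someone first names a concrete suspected cause; the `Incident` fields and the sample data are hypothetical, not taken from any particular incident tool.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    detected: datetime      # first alert fired
    cause_named: datetime   # first time someone stated a concrete suspected cause
    recovered: datetime     # service back to normal


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def summarize(incidents: list[Incident]) -> dict[str, float]:
    """Average time to a named cause and to recovery, both in minutes."""
    return {
        "mean_minutes_to_named_cause": mean(
            minutes_between(i.detected, i.cause_named) for i in incidents
        ),
        "mean_minutes_to_recovery": mean(
            minutes_between(i.detected, i.recovered) for i in incidents
        ),
    }


# Made-up sample data: one incident where the cause was named quickly, one where it was not.
incidents = [
    Incident(datetime(2025, 5, 1, 14, 9), datetime(2025, 5, 1, 14, 11), datetime(2025, 5, 1, 14, 41)),
    Incident(datetime(2025, 5, 8, 9, 0), datetime(2025, 5, 8, 9, 20), datetime(2025, 5, 8, 10, 0)),
]
summary = summarize(incidents)
print(summary)
```

Tracking time to a named cause next to time to recovery shows whether the delay sits in the fixing or in the talking.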

2 The firefighting lesson: rehearse the roles before the fire

Fireground command works because the roles are assigned and drilled long before anyone is on a ladder. Nobody debates who runs the incident while the building burns. The incident commander commands, others execute, and crucially the structure makes it safe to report bad news up the chain because reporting is the job, not an admission of failure.

The DevOps equivalent is the incident command structure that mature teams adopt: a single incident commander who coordinates rather than fixes, a communications lead who handles stakeholders, and responders who own specific systems. The point isn’t the org chart — it’s that when the structure is rehearsed, nobody is improvising authority during the worst hour, and a responder can say “I broke this trying to fix it” without it becoming a career event.

You rehearse this the way fire crews do: deliberately. Game days, chaos drills, and tabletop exercises where you walk through a plausible failure and discover that your runbook references a dashboard that no longer exists. Those rehearsals do double duty — they harden the system and they normalise calm, honest communication under pressure, so it’s available when you need it. The same instinct that makes a good real-time SIEM pipeline valuable — having the information in front of you before you’re in crisis — applies to your people, not just your logs.

3 The blameless postmortem, done properly

Everyone says “blameless postmortem” and a depressing number of teams run a blame ritual with the word “blameless” stapled to the top. The difference is structural, not tonal. A real blameless postmortem assumes that everyone acted reasonably given what they knew at the time, and treats every human error as a signal that a system allowed that error to happen and to matter.

If your conclusion is “Dave should have been more careful,” you have failed. The honest version is “an engineer could run this command against production with no confirmation, no canary, and no easy rollback, so we fix the system that made Dave’s careful-or-not irrelevant.” The test I apply: if an action item only makes sense when the same person is involved next time, it’s blame in disguise. Good action items would have prevented the incident no matter who was at the keyboard, because they harden the system rather than exhort the human.

Here’s a template I actually use. It is deliberately boring, because boring is repeatable:

# Incident: <short title>     Date: <yyyy-mm-dd>     Severity: <S1–S4>

## Impact
- Who/what was affected, for how long, and how we measured it.

## Timeline (UTC, factual, no blame)
- 14:02  Deploy of v1.8 begins
- 14:09  Error rate climbs to 30%; first alert fires
- 14:18  On-call acknowledges, begins investigation
- 14:41  Rollback to v1.7 completes; error rate recovers

## Root cause(s)
- The technical chain. Use "5 whys" until you hit a SYSTEM, not a person.

## What went well
- The alert fired correctly. Rollback worked. (Name what saved us.)

## Action items (owner + due date + tracked)
- [ ] Add a pre-deploy schema-compat check        @alex   by 06-05
- [ ] Require canary stage for prod deploys        @sam    by 06-12

Two parts make or break it. The timeline is strictly factual — times and events, no adjectives, no “unfortunately Dave.” And the action items are owned, dated, and tracked, because a postmortem whose recommendations evaporate is just expensive theatre that quietly teaches people their honesty changed nothing — the fastest way to destroy psychological safety.
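The “owned and dated” rule is easy to check mechanically. Here is a rough sketch of a lint pass over a postmortem written in the template above. It assumes action items are checkbox lines with an `@owner` and a `by MM-DD` (or `by YYYY-MM-DD`) due date; the function name and the regular expressions are my own illustration, not part of any existing tool.

```python
import re

# Assumed conventions, taken from the template: "- [ ] text  @owner  by MM-DD".
CHECKBOX = re.compile(r"^- \[[ x]\] (.+)$")
OWNER = re.compile(r"@\w+")
DUE = re.compile(r"\bby (\d{4}-)?\d{2}-\d{2}\b")


def find_untracked_items(postmortem: str) -> list[str]:
    """Return a message for every action item missing an owner or a due date."""
    problems = []
    for line in postmortem.splitlines():
        match = CHECKBOX.match(line.strip())
        if not match:
            continue  # not an action item line
        text = " ".join(match.group(1).split())  # collapse alignment spaces
        missing = []
        if not OWNER.search(text):
            missing.append("owner")
        if not DUE.search(text):
            missing.append("due date")
        if missing:
            problems.append(f"{text}: missing {', '.join(missing)}")
    return problems


sample = """\
## Action items (owner + due date + tracked)
- [ ] Add a pre-deploy schema-compat check        @alex   by 06-05
- [ ] Require canary stage for prod deploys        @sam
- [ ] Be more careful with production
"""
problems = find_untracked_items(sample)
for problem in problems:
    print(problem)
```

A check like this only catches missing owners and dates. It cannot tell whether an item hardens the system or merely exhorts a person, so the “would this work with someone else at the keyboard” test still needs a human reader.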

4 Troubleshooting your culture: the failure modes

Cultural change goes wrong in predictable ways, and it’s worth naming them so you can catch yourself.

- Blamelessness mistaken for no accountability. Safety to report mistakes is not freedom from consequences for negligence or repeated recklessness. The line is: we don’t punish honest reporting of mistakes; we absolutely still address competence and conduct. Conflating the two is how leaders talk themselves out of the whole idea.
- The performative postmortem. Action items get written and never done. Fix this by tracking them like any other work, in the same backlog, with the same visibility. If they don’t get prioritised, say so honestly rather than pretending.
- Safety that stops at the manager’s door. If the team is candid but leadership punishes the bad-news messenger, the team learns fast and reverts to silence. Psychological safety is built top-down or not at all; the most senior person in the room sets the actual rules, regardless of the posters.
- One loud sceptic poisoning the well. A single person who responds to admissions with contempt can undo months of work, because safety is fragile and asymmetric: easy to break, slow to build. Address it directly; tolerating it signals that the stated values are negotiable.
- Mistaking comfort for safety. A team that never disagrees isn’t safe, it’s silent or conflict-avoidant. Psychological safety is what lets people have productive disagreement, such as challenging a design or pushing back on a deadline, without it turning personal. If your postmortems are uniformly pleasant and your design reviews never surface a hard objection, that’s not harmony; it’s the same silence that doomed Flight 173, wearing a friendlier face. Healthy teams argue about the work precisely because they trust the relationships underneath enough to risk it.

This is the same lesson self-hosting teaches at human scale: the consequences are yours, and pretending otherwise just delays the reckoning — a point I keep relearning while running my own infrastructure in public, where every outage is a small, honest postmortem with an audience of one.

5 Is it worth it — and who is this for?

If you run anything that can wake someone at 2am, this is not optional culture work — it is uptime work by another name. The teams that recover fast from incidents are the ones where information moves freely under pressure, and that only happens when people have learned, through repeated experience, that speaking up is safe and acting on it is rewarded. You cannot buy that with a tool. You build it the way aviation and fire services did: by deliberately structuring how people communicate, rehearsing it before the crisis, and proving over and over that honesty is met with fixes rather than blame.

It is genuinely hard, it is mostly leadership behaviour rather than process, and it is slow — you’re building trust, and trust compounds quietly over months. But the alternative is a team that already knows what’s wrong and won’t tell you until it’s a smoking crater near the runway. Flight 173 had all the information it needed in the cockpit. The flight engineer had the fuel figures, said them aloud, and was not heard — not because anyone was incompetent, but because the culture made insistence feel like insubordination. The only thing missing was an environment where saying it again, louder, felt safe rather than career-limiting. Aviation spent decades and real money engineering that environment into existence, and the accident rate fell accordingly. Build that culture first; the runbooks, the dashboards, and the clever tooling all come second, because none of them help if the person who already knows the answer won’t say it out loud.
