Whisper: Self-Hosted Speech-to-Text That Runs on a Raspberry Pi

Private, offline transcription on hardware that costs less than a month of cloud STT

By Smarc, 26-03-2024

I have a drawer full of Raspberry Pis that I bought with grand plans and then quietly retired. So when OpenAI released Whisper as an open model — actual weights, MIT licence, no API key required — my first thought was not “this will revolutionise transcription”. It was “can I make the saddest Pi in the drawer earn its keep”. The answer, with some caveats I’ll be honest about, is yes.

Whisper is a speech-to-text model that turns audio into text, and it’s genuinely good — multilingual, robust to accents and background noise, and trustworthy enough that I now run all my voice memos and meeting recordings through it instead of paying a cloud service to read my private conversations. The catch most tutorials gloss over is that the original Python implementation is a pig on modest hardware. The fix is a faster runtime.

The motivation, for me, is the same one behind most of the things I self-host: I would rather my private audio never leave the building. A cloud transcription API means uploading recordings of meetings, family voice notes, and half-formed ideas to someone else’s servers under someone else’s retention policy. Running the model locally is the same instinct that leads me to keep my own photos on my own hardware — I made the full argument for that in self-hosting your photo library with Immich — applied to the microphone instead of the camera.

The model sizes, and which one your Pi can stomach

Whisper ships in sizes from tiny (39M parameters) up through base, small, medium, and large (1.5B). On a desktop with a GPU you’d reach straight for large and never think about it. On a Raspberry Pi 4 or 5, that’s not happening — you’ll wait minutes per sentence, if you don’t run out of memory first.

The sweet spot on a Pi is base or small. The tiny model is fast but makes the kind of transcription errors that turn a meeting summary into surrealist poetry. base is the lowest size I’d actually trust, and small is noticeably better if you can spare the patience. Forget medium and large on the Pi entirely; that’s desktop territory.

Two numbers make the choice concrete. base.en is roughly 140 MB on disk in the standard ggml download and needs a few hundred megabytes of RAM while running; small.en is about three times the size on disk and needs correspondingly more memory. On a 4 GB Pi that share of memory is the real constraint, not the CPU, because if you push the board into swap the whole thing crawls. If the Pi does nothing but transcribe, small.en is comfortable. If it is also running a couple of containers, stay on base.en and leave yourself headroom.
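
One way to make that choice mechanical is to look at how much memory the kernel reports as available before a run. What follows is a sketch under loud assumptions, not anything that ships with whisper.cpp: the function names are invented for illustration, and the 1500 MB threshold is a guess chosen to leave headroom, not a measured figure. Tune it on your own board.

```shell
#!/bin/sh
# Hypothetical helper: choose a Whisper model from available memory.
# The 1500 MB threshold is an assumption, not a measured figure.

# Print MemAvailable in whole megabytes, read from a meminfo-style file.
mem_available_mb() {
    awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Print the model name to use for a given amount of free memory (MB).
pick_model() {
    if [ "$1" -ge 1500 ]; then
        echo small.en
    else
        echo base.en
    fi
}

# Only report when run on a Linux box that exposes /proc/meminfo.
if [ -r /proc/meminfo ]; then
    mb=$(mem_available_mb)
    echo "suggested model: $(pick_model "${mb:-0}")"
fi
```

Run it while your usual containers are up; the answer on an idle board tells you very little.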

The ggml model format is worth a word, because compact weights are a large part of why this works on a Pi at all. Instead of full 32-bit floating-point weights, the standard whisper.cpp downloads store 16-bit floats, and the download script also offers integer-quantised variants (for example base.en-q5_1) that are smaller still. Fewer bits per weight shrinks the file, cuts the memory footprint, and speeds up the arithmetic, at a cost in accuracy so small that on base.en and above you will struggle to notice it in ordinary speech. On a resource-starved board this is not an optimisation you take if you have time; it is the entire reason the model fits.
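
The arithmetic behind that is simple enough to sketch. A weight file is roughly the parameter count times the bits stored per weight, divided by eight. The function name below is made up for illustration, the 74M parameter count for base comes from the Whisper model card and not from anything measured here, and the estimate ignores per-block scaling data and file headers, so treat the results as ballpark figures only.

```shell
#!/bin/sh
# Back-of-envelope model size: parameters (in millions) x bits per weight / 8
# gives megabytes. Ignores quantisation block overhead and file metadata.
approx_size_mb() {
    echo $(( $1 * $2 / 8 ))
}

for bits in 32 16 8; do
    echo "base (74M params) at ${bits}-bit: ~$(approx_size_mb 74 "$bits") MB"
done
```

Each halving of the bits per weight halves the file and the memory needed to hold it, which is the whole trick.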

Use whisper.cpp, not the Python original

The single most important decision is your runtime. The reference Python implementation pulls in PyTorch, which is heavy, slow to start, and not built with a quad-core ARM board in mind. Instead, use whisper.cpp, a C++ reimplementation (now maintained under the ggml-org organisation on GitHub) that compiles to a tight native binary, loads compact ggml model files (optionally quantised), and runs the small sizes on a Pi at a speed that’s actually usable.

One thing to know before you copy an old tutorial: whisper.cpp deprecated the plain make build and renamed its binaries. The build is now driven by CMake, and the command-line tool that used to be ./main is whisper-cli. If you follow a 2023 guide you will hit a deprecation warning at best and a missing target at worst, so use the current invocation:

git clone https://github.com/ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build
cmake --build build -j --config Release
# download the base.en model (the English-only variants tend to do better on English)
sh ./models/download-ggml-model.sh base.en

That leaves the tool at build/bin/whisper-cli. You need cmake and a C++ toolchain first (sudo apt install build-essential cmake on Raspberry Pi OS), but nothing exotic. The .en English-only variants are the same size as their multilingual counterparts but tend to be more accurate on English, particularly at tiny and base, so if you only ever transcribe English, take the win.
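
Before kicking off a build that takes a while on a Pi, it is worth confirming the tools are actually there. This is a minimal sketch: missing_tools is a hypothetical helper name, g++ is assumed to be the compiler that build-essential provides, and ffmpeg is included only because the conversion step needs it later.

```shell
#!/bin/sh
# Print the name of every command from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# git, cmake and a C++ compiler for the build; ffmpeg for the conversion step.
missing=$(missing_tools git cmake g++ ffmpeg)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "missing:" $missing
else
    echo "all build prerequisites found"
fi
```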

Transcribing something

whisper.cpp expects 16 kHz mono 16-bit WAV input, so you’ll usually run recordings through ffmpeg first to normalise whatever the recorder produced:

# convert any audio to the format whisper.cpp wants
ffmpeg -i meeting.m4a -ar 16000 -ac 1 -c:a pcm_s16le meeting.wav

# transcribe, with all four cores, output to SRT subtitles
./build/bin/whisper-cli \
       -m models/ggml-base.en.bin \
       -f meeting.wav \
       -t 4 \
       --output-srt

The -t 4 tells it to use all four cores, which on the Pi matters enormously. It’ll print the transcript live as it goes, which is oddly satisfying to watch scroll past.
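
If you run that pair of commands often, it is worth wrapping them. The following is only a sketch: wav_path_for and transcribe_file are hypothetical names, the WHISPER_CLI and WHISPER_MODEL variables are a convention invented for this example, and it assumes you run it from the whisper.cpp checkout with ffmpeg installed. It uses nproc in place of a hard-coded 4 so the same script works on a board with a different core count.

```shell
#!/bin/sh
# Hypothetical wrapper around the convert-then-transcribe steps.
WHISPER_CLI="${WHISPER_CLI:-./build/bin/whisper-cli}"
WHISPER_MODEL="${WHISPER_MODEL:-models/ggml-base.en.bin}"

# Map any input path to a sibling WAV, e.g. a/memo.m4a -> a/memo.16k.wav,
# so an input that is already a .wav is never overwritten in place.
wav_path_for() {
    dir=$(dirname -- "$1")
    base=$(basename -- "$1")
    printf '%s/%s.16k.wav\n' "$dir" "${base%.*}"
}

# Convert one recording and transcribe it to SRT and plain text.
transcribe_file() {
    src=$1
    wav=$(wav_path_for "$src")
    ffmpeg -nostdin -loglevel error -y -i "$src" \
        -ar 16000 -ac 1 -c:a pcm_s16le "$wav" || return 1
    "$WHISPER_CLI" -m "$WHISPER_MODEL" -f "$wav" \
        -t "$(nproc)" --output-srt --output-txt
}

# Transcribe every file named on the command line, if any.
for f in "$@"; do
    transcribe_file "$f" || echo "failed: $f" >&2
done
```

Saved as, say, transcribe.sh, it would be called as sh transcribe.sh meeting.m4a memo.mp3.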

What the performance actually feels like

Let me be straight with you, because the benchmarks people post are often run on quietly overclocked Pi 5s with active cooling. On a stock Raspberry Pi 4 with the base.en model, transcription runs slower than real time — roughly, a ten-minute recording takes longer than ten minutes to process. That sounds like a dealbreaker until you remember that almost nobody needs live transcription. You drop a file in, walk away, and the text is waiting when you come back.

A Raspberry Pi 5 is meaningfully quicker and brings base.en close to real time. If transcription is more than an occasional thing, the 5 is worth the upgrade. Either way, you want a heatsink; the Pi will happily peg all four cores for the entire job and thermal-throttle itself if you let it cook.

It helps to reframe what “slower than real time” actually costs you, because the phrase sounds worse than it is. If a ten-minute recording takes twelve minutes to transcribe, that is twelve minutes during which you are doing something else entirely — the Pi is a background appliance, not something you sit and watch. The only workloads where the wall-clock speed genuinely matters are the ones where a human is waiting on the output right now: live captioning, a voice assistant, an interactive dictation tool. For the far more common case — a queue of recordings that need to become searchable text by tomorrow morning — throughput per hour is the number that matters, and a Pi churning quietly through a backlog overnight is perfectly adequate. Measure the thing you actually care about (files processed per day) rather than the benchmark everyone quotes (seconds per minute of audio), and the Pi looks a great deal more capable.
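
To put a number on that, here is a small worked calculation. It assumes the board transcribes continuously around the clock and reuses the illustrative ten-minutes-in-twelve figure from above; the function name is invented for the example.

```shell
#!/bin/sh
# Minutes of audio a board can clear in 24 hours of continuous work, given
# how long it took to process a sample recording.
#   $1 = length of the sample recording in minutes
#   $2 = wall-clock minutes the transcription took
audio_minutes_per_day() {
    echo $(( 1440 * $1 / $2 ))
}

# The illustrative case: 10 minutes of audio in 12 minutes of compute.
echo "$(audio_minutes_per_day 10 12) minutes of audio per day"
```

That works out to 1200 minutes, or twenty hours of recordings a day, from a board that is nominally slower than real time.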

Wiring it into something useful

The natural next step is a watched-folder service: drop audio into a directory, a small script picks it up, transcribes it, writes the text alongside, and moves on. A handful of lines of shell plus a systemd path unit turns the Pi into a quiet little transcription appliance that asks nothing of you. I’ve got mine fed by a synced folder, so a voice memo recorded on my phone becomes a searchable text file by the time I’m back at my desk.

The mechanism is a systemd .path unit that watches a directory and triggers a service whenever a file lands:

# /etc/systemd/system/transcribe.path
[Path]
PathExistsGlob=/srv/inbox/*.m4a
[Install]
WantedBy=multi-user.target

# /etc/systemd/system/transcribe.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/transcribe-inbox.sh

The script it calls converts each new file, runs whisper-cli, drops the .txt next to a copy of the audio, and removes the original from the inbox so the same file is never processed twice. Because it is a oneshot service, systemd handles the queueing for you — land three files at once and they transcribe in turn without stepping on each other. Feed the inbox from whatever you already run: a synced folder, an scp from your phone, or a downstream hook from a monitoring stack. If you want to know whether the appliance is quietly keeping up, point a health check at it — a stalled transcription queue is exactly the kind of silent failure that a self-hosted uptime monitor like Uptime Kuma is good at catching.
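
For the curious, a script along those lines might look like the sketch below. It is not the exact script running on my Pi: every path other than /srv/inbox is an assumption, the function names are hypothetical, and the failed/ folder is an addition of this example so that a recording that will not transcribe cannot retrigger the path unit forever.

```shell
#!/bin/sh
# Hypothetical /usr/local/bin/transcribe-inbox.sh. Paths other than
# /srv/inbox are assumptions; adjust them to your own layout.
INBOX="${INBOX:-/srv/inbox}"
DONE_DIR="${DONE_DIR:-/srv/transcripts}"
WHISPER_CLI="${WHISPER_CLI:-/opt/whisper.cpp/build/bin/whisper-cli}"
WHISPER_MODEL="${WHISPER_MODEL:-/opt/whisper.cpp/models/ggml-base.en.bin}"

# Convert one recording to 16 kHz mono WAV and write <out_base>.txt.
transcribe_one() {
    src=$1
    out_base=$2
    wav="$out_base.wav"
    ffmpeg -nostdin -loglevel error -y -i "$src" \
        -ar 16000 -ac 1 -c:a pcm_s16le "$wav" &&
    "$WHISPER_CLI" -m "$WHISPER_MODEL" -f "$wav" -t 4 \
        --output-txt --output-file "$out_base"
    status=$?
    rm -f "$wav"            # the intermediate WAV is not worth keeping
    return "$status"
}

# Work through every .m4a currently in the inbox, one at a time.
process_inbox() {
    mkdir -p "$DONE_DIR" "$INBOX/failed" || return 1
    for src in "$INBOX"/*.m4a; do
        [ -e "$src" ] || continue       # the glob matched nothing
        name=$(basename -- "$src" .m4a)
        if transcribe_one "$src" "$DONE_DIR/$name"; then
            # Keep a copy of the audio beside the transcript, clear the inbox.
            mv -- "$src" "$DONE_DIR/$name.m4a"
        else
            # Park failures so the path unit does not retrigger forever.
            mv -- "$src" "$INBOX/failed/"
        fi
    done
}

# Run only where the inbox exists, so sourcing this elsewhere is harmless.
if [ -d "$INBOX" ]; then
    process_inbox
fi
```

Make the script executable, then run sudo systemctl daemon-reload followed by sudo systemctl enable --now transcribe.path to start watching the folder.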

Troubleshooting the things that actually go wrong

A few failure modes recur often enough to be worth naming.

Gibberish or empty output. Almost always a sample-rate problem. whisper.cpp wants 16 kHz mono 16-bit WAV and will produce nonsense from anything else, so if the transcript is garbage, check that ffmpeg actually re-sampled — ffprobe meeting.wav should report 16000 Hz, mono. Skipping the ffmpeg step and feeding it a 48 kHz stereo file is the single most common mistake.
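
That check is easy to script so you are not squinting at ffprobe's banner. A minimal sketch, with hypothetical function names, that asks ffprobe for just the three fields that matter and compares them with what whisper.cpp wants:

```shell
#!/bin/sh
# Print codec, sample rate and channel count of the first audio stream
# as key=value lines.
probe_audio() {
    ffprobe -v error -select_streams a:0 \
        -show_entries stream=codec_name,sample_rate,channels \
        -of default=noprint_wrappers=1 "$1"
}

# Succeed only if the probed text describes 16 kHz mono 16-bit PCM.
format_ok() {
    printf '%s\n' "$1" | grep -qx 'codec_name=pcm_s16le' &&
    printf '%s\n' "$1" | grep -qx 'sample_rate=16000' &&
    printf '%s\n' "$1" | grep -qx 'channels=1'
}

# Check each file given on the command line.
for f in "$@"; do
    if format_ok "$(probe_audio "$f")"; then
        echo "ok: $f"
    else
        echo "needs converting: $f"
    fi
done
```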

It stalls or gets killed mid-file. That is memory. A stock 4 GB Pi 4 running small alongside a browser or a container stack can be pushed into swap, and if the kernel OOM-killer fires, whisper-cli just vanishes. Drop to base.en, close everything else, and watch free -h during a run. If it is the only workload on the Pi, it will be fine; if the Pi is doing five other jobs, give Whisper its own board.
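
If you suspect the OOM-killer but were not watching at the time, the kernel log keeps the evidence. The helper below is a sketch with an invented name; it assumes the usual "Killed process <pid> (<name>)" wording that Linux kernels log, so adjust the pattern if your kernel phrases it differently.

```shell
#!/bin/sh
# Read kernel log lines on stdin and succeed if any of them records the
# OOM-killer terminating the named process (default: whisper-cli).
was_oom_killed() {
    grep -Eq "Killed process [0-9]+ \(${1:-whisper-cli}\)"
}

# On the Pi, feed it the kernel log, for example:
#   journalctl -k --no-pager | was_oom_killed && echo "whisper-cli was OOM-killed"
```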

Thermal throttling halfway through. A long recording pegs all four cores for minutes, the SoC hits ~80 °C, and clock speed drops, so a job that started fast crawls to the finish. vcgencmd measure_temp during a run tells you instantly. A cheap heatsink or a fan fixes it; without one, expect the back half of a long file to run slower than the front.
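
A small check built on that command saves you doing the comparison by eye. This is a sketch with hypothetical function names; it assumes vcgencmd prints readings in the form temp=81.2'C and uses the ~80 °C figure above as its default limit.

```shell
#!/bin/sh
# Extract whole degrees from vcgencmd output such as "temp=81.2'C".
temp_degrees() {
    t=${1#temp=}              # strip the "temp=" prefix
    echo "${t%%[!0-9]*}"      # keep only the leading digits
}

# Succeed if the reading is at or above the limit (default 80).
too_hot() {
    [ "$(temp_degrees "$1")" -ge "${2:-80}" ]
}

# One-shot check on a Pi; does nothing where vcgencmd is absent.
if command -v vcgencmd >/dev/null 2>&1; then
    reading=$(vcgencmd measure_temp)
    if too_hot "$reading"; then
        echo "throttle risk: $reading"
    else
        echo "ok: $reading"
    fi
fi
```

Run it under watch -n 10 in a second terminal while a long file transcribes and you will see exactly when the board starts to cook.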

Wrong language detected. The multilingual models auto-detect language and occasionally guess wrong on the first few seconds. Pin it with -l en (or the relevant code) rather than letting it improvise, and the problem disappears.

Is it worth it?

For private, occasional, batch transcription on hardware you already own — completely. Nothing leaves your network, there are no per-minute charges, and the accuracy at base or small is good enough for meeting notes, voice memos, and podcast drafts. The Pi makes it almost free, and the smugness of running a serious ML model on a £50 board is included at no extra cost.

For real-time captioning, high-volume professional transcription, or anything where speed is the whole point, the Pi is the wrong tool and you should run Whisper on a proper GPU or, grudgingly, pay someone else. But for the rest of us, with a drawer full of neglected single-board computers and a healthy distrust of cloud microphones, this is one of the most genuinely useful things you can self-host. Go rescue a Pi.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #machine-learning #whisper #raspberry-pi #self-hosting
