Whisper: Self-Hosted Speech-to-Text That Runs on a Raspberry Pi

Private, offline transcription on hardware that costs less than a month of cloud STT

I have a drawer full of Raspberry Pis that I bought with grand plans and then quietly retired. So when OpenAI released Whisper as an open model — actual weights, MIT licence, no API key required — my first thought was not “this will revolutionise transcription”. It was “can I make the saddest Pi in the drawer earn its keep”. The answer, with some caveats I’ll be honest about, is yes.

Whisper is a speech-to-text model, and it’s genuinely good: multilingual, robust to accents and background noise, and trustworthy enough that I now run all my voice memos and meeting recordings through it instead of paying a cloud service to read my private conversations. The catch most tutorials gloss over is that the original Python implementation is a pig on modest hardware. The fix is a faster runtime.

Whisper ships in sizes from tiny (39M parameters) up through base, small, medium, and large (1.5B). On a desktop with a GPU you’d reach straight for large and never think about it. On a Raspberry Pi 4 or 5, that’s not happening — you’ll wait minutes per sentence, if you don’t run out of memory first.

The sweet spot on a Pi is base or small. The tiny model is fast but makes the kind of transcription errors that turn a meeting summary into surrealist poetry. base is the lowest size I’d actually trust, and small is noticeably better if you can spare the patience. Forget medium and large on the Pi entirely; that’s desktop territory.

The single most important decision is your runtime. The reference Python implementation pulls in PyTorch, which is heavy, slow to start, and not built with a quad-core ARM board in mind. Instead, use whisper.cpp, a C++ reimplementation that compiles to a tight native binary, supports quantised models, and runs the small sizes on a Pi at a speed that’s actually usable.

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j4
# download the English-only base model in whisper.cpp's ggml format
bash ./models/download-ggml-model.sh base.en

The build is plain make with no exotic dependencies, which on a Pi is a blessing. The .en English-only variants are the same size as the multilingual ones but tend to be more accurate on English, most noticeably at tiny and base, so if you only ever transcribe English, take the win.
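If you swap between sizes while you find your own patience threshold, a small helper makes the choice explicit. This is a sketch rather than anything whisper.cpp ships: model_name and ensure_model are names I made up, the size list is limited to the ones worth trying on a Pi, and it assumes you run it from the whisper.cpp checkout so the download script and the models/ggml-<name>.bin layout apply.

```shell
# Hypothetical helpers; not part of whisper.cpp.

# model_name SIZE [en]: print the model name the download script expects.
# Only the sizes that are practical on a Pi are accepted in this sketch.
model_name() (
    size=$1
    lang=${2:-}
    case $size in
        tiny|base|small) ;;
        *) echo "not a Pi-friendly size: $size" >&2; exit 1 ;;
    esac
    if [ "$lang" = "en" ]; then
        printf '%s.en\n' "$size"
    else
        printf '%s\n' "$size"
    fi
)

# ensure_model NAME: download the model only if it is missing, then print its path.
# Assumes the current directory is the whisper.cpp checkout.
ensure_model() (
    name=$1
    file="models/ggml-$name.bin"
    if [ ! -f "$file" ]; then
        # Send the downloader's chatter to stderr so stdout is just the path.
        bash ./models/download-ggml-model.sh "$name" >&2 || exit 1
    fi
    printf '%s\n' "$file"
)

# Example:
# model=$(ensure_model "$(model_name small en)")
```

Printing the path on stdout means the result can be dropped straight into the -m argument used in the next step.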

whisper.cpp expects 16-bit, 16 kHz mono WAV input, so you’ll usually pipe through ffmpeg first to normalise whatever the recorder produced:

# convert any audio to the format whisper.cpp wants
ffmpeg -i meeting.m4a -ar 16000 -ac 1 -c:a pcm_s16le meeting.wav

# transcribe, with all four cores, output to SRT subtitles
./main -m models/ggml-base.en.bin \
       -f meeting.wav \
       -t 4 \
       --output-srt

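The -t 4 tells it to use all four cores, which on the Pi matters enormously. It’ll print the transcript live as it goes, which is oddly satisfying to watch scroll past. If your checkout has no ./main, look for build/bin/whisper-cli instead; recent whisper.cpp releases renamed the binary.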
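Typing those two commands for every recording gets old quickly, so here is a minimal wrapper. Treat it as a sketch under assumptions: wav_path and transcribe are hypothetical names, WHISPER_BIN and WHISPER_MODEL are override variables I invented for the example, and it assumes you run it from the whisper.cpp checkout with input files that have an extension.

```shell
# Hypothetical helpers wrapping the two commands above; not part of whisper.cpp.

# wav_path FILE: where the temporary 16 kHz mono copy of FILE will live.
# The .16k.wav suffix avoids clobbering an input that is already a .wav.
# Assumes FILE has an extension.
wav_path() (
    printf '%s.16k.wav\n' "${1%.*}"
)

# transcribe FILE: normalise FILE with ffmpeg, then write an .srt next to it.
# Assumes the current directory is the whisper.cpp checkout.
transcribe() (
    src=$1
    bin=${WHISPER_BIN:-./main}
    model=${WHISPER_MODEL:-models/ggml-base.en.bin}
    wav=$(wav_path "$src")

    ffmpeg -nostdin -loglevel error -y -i "$src" \
        -ar 16000 -ac 1 -c:a pcm_s16le "$wav" || exit 1

    # -of sets the output path without its extension: meeting.m4a gives meeting.srt
    "$bin" -m "$model" -f "$wav" -t 4 --output-srt -of "${src%.*}"
    status=$?

    # Remove the intermediate WAV whether or not transcription succeeded.
    rm -f "$wav"
    exit "$status"
)

# Example: transcribe every .m4a in the current directory
# for f in ./*.m4a; do transcribe "$f"; done
```

The intermediate WAV is deleted afterwards because, for my purposes, the original recording plus the subtitle file is all that’s worth keeping.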

Let me be straight with you, because the benchmarks people post are often run on quietly overclocked Pi 5s with active cooling. On a stock Raspberry Pi 4 with the base.en model, transcription runs slower than real time: a ten-minute recording takes longer than ten minutes to process. That sounds like a dealbreaker until you remember that almost nobody needs live transcription. You drop a file in, walk away, and the text is waiting when you come back.
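Rather than trusting anyone’s numbers, mine included, you can measure your own board. The usual yardstick is the real-time factor: processing time divided by audio length, so anything above 1.00 is slower than real time. The helpers below are a sketch with made-up names (realtime_factor, audio_seconds, time_job); the ffprobe call assumes ffprobe was installed alongside ffmpeg.

```shell
# Hypothetical helpers for measuring speed; not part of whisper.cpp or ffmpeg.

# realtime_factor PROCESSING_SECONDS AUDIO_SECONDS
# Prints processing time divided by audio length, to two decimal places.
# Fails if the audio length is zero or missing.
realtime_factor() (
    awk -v p="$1" -v a="$2" 'BEGIN {
        if (a <= 0) exit 1
        printf "%.2f\n", p / a
    }'
)

# audio_seconds FILE: duration of FILE in seconds, as reported by ffprobe.
audio_seconds() (
    ffprobe -v error -show_entries format=duration \
        -of default=noprint_wrappers=1:nokey=1 "$1"
)

# time_job AUDIO COMMAND...: run COMMAND, then print its real-time factor for AUDIO.
# Timing uses whole seconds from date, which is plenty for multi-minute jobs.
time_job() (
    audio=$1
    shift
    start=$(date +%s)
    "$@" || exit 1
    end=$(date +%s)
    realtime_factor "$(( end - start ))" "$(audio_seconds "$audio")"
)

# Example:
# time_job meeting.wav ./main -m models/ggml-base.en.bin -f meeting.wav -t 4
```

As a purely illustrative figure, not a measurement: a job that took 900 seconds on a 600-second recording would print 1.50.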

A Raspberry Pi 5 is meaningfully quicker and brings base.en close to real time. If transcription is more than an occasional thing, the 5 is worth the upgrade. Either way, you want a heatsink; the Pi will happily peg all four cores for the entire job and thermal-throttle itself if you let it cook.
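To find out whether your Pi is cooking itself mid-job, Raspberry Pi OS includes vcgencmd, which reports the temperature and a bitmask of throttle flags. The sketch below decodes only the two flags that matter here; decode_throttled and pi_thermal_report are names I made up, and the status strings are my own wording rather than anything the tool prints.

```shell
# Hypothetical helpers around vcgencmd, which ships with Raspberry Pi OS.

# decode_throttled VALUE: interpret the hex value from `vcgencmd get_throttled`.
# Bit 2 (0x4) means throttled right now; bit 18 (0x40000) means throttling
# has happened at some point since boot.
decode_throttled() (
    value=$(( $1 ))
    status="ok"
    if [ $(( value & 0x40000 )) -ne 0 ]; then status="throttled earlier"; fi
    if [ $(( value & 0x4 )) -ne 0 ]; then status="throttled now"; fi
    printf '%s\n' "$status"
)

# pi_thermal_report: print the current temperature, then the throttle status.
pi_thermal_report() (
    raw=$(vcgencmd get_throttled) || exit 1   # prints a line like throttled=0x0
    vcgencmd measure_temp || exit 1            # prints a line like temp=...'C
    decode_throttled "${raw#throttled=}"
)

# Example: check once a minute while a long transcription runs
# while sleep 60; do pi_thermal_report; done
```

If this reports throttling during a transcription, the job will still finish, just more slowly, which is a decent argument for spending a few pounds on cooling.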

The natural next step is a watched-folder service: drop audio into a directory, a small script picks it up, transcribes it, writes the text alongside, and moves on. A handful of lines of shell plus a systemd path unit turns the Pi into a quiet little transcription appliance that asks nothing of you. I’ve got mine fed by a Nextcloud folder, so a voice memo recorded on my phone becomes a searchable text file by the time I’m back at my desk.
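Here is roughly what that looks like, as a sketch rather than my exact setup. Everything named here is a placeholder: the functions, the /srv/transcribe and /opt/whisper.cpp paths, the whisper-inbox unit names, and the /usr/local/bin/whisper-inbox wrapper, which is assumed to be a small script that loads these functions and calls process_inbox. It also assumes files arrive in the inbox complete; if your sync tool writes partial files in place, you would want it to upload elsewhere and move them in.

```shell
# Hypothetical watched-folder sketch; all paths and unit names are placeholders.

# transcribe_one AUDIO OUTSTEM: convert to 16 kHz mono WAV, then write OUTSTEM.txt.
# Assumes the working directory is the whisper.cpp checkout.
transcribe_one() (
    tmp=$(mktemp -d) || exit 1
    trap 'rm -rf "$tmp"' EXIT
    ffmpeg -nostdin -loglevel error -y -i "$1" \
        -ar 16000 -ac 1 -c:a pcm_s16le "$tmp/audio.wav" || exit 1
    ./main -m models/ggml-base.en.bin -f "$tmp/audio.wav" -t 4 \
        --output-txt -of "$2" || exit 1
)

# process_inbox INBOX DONE FAILED: handle every regular file currently in INBOX.
process_inbox() (
    inbox=$1 done_dir=$2 failed_dir=$3
    mkdir -p "$done_dir" "$failed_dir" || exit 1
    for audio in "$inbox"/*; do
        [ -f "$audio" ] || continue
        name=$(basename "$audio")
        if transcribe_one "$audio" "$done_dir/${name%.*}"; then
            mv "$audio" "$done_dir/"
        else
            # Move failures aside so a non-empty inbox does not retrigger forever.
            mv "$audio" "$failed_dir/"
        fi
    done
)

# write_units DIR: write the path unit and its service into DIR.
write_units() (
    dir=$1
    cat > "$dir/whisper-inbox.path" <<'EOF'
[Unit]
Description=Watch the transcription inbox

[Path]
DirectoryNotEmpty=/srv/transcribe/inbox

[Install]
WantedBy=multi-user.target
EOF
    cat > "$dir/whisper-inbox.service" <<'EOF'
[Unit]
Description=Transcribe new audio with whisper.cpp

[Service]
Type=oneshot
WorkingDirectory=/opt/whisper.cpp
ExecStart=/usr/local/bin/whisper-inbox
EOF
)
```

With the units written to /etc/systemd/system, systemctl daemon-reload followed by systemctl enable --now whisper-inbox.path starts the watcher. A path unit triggers the service of the same name by default, which is why the two files share the whisper-inbox prefix.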

Is it worth it? For private, occasional, batch transcription on hardware you already own: completely. Nothing leaves your network, there are no per-minute charges, and the accuracy at base or small is good enough for meeting notes, voice memos, and podcast drafts. The Pi makes it almost free, and the smugness of running a serious ML model on a £50 board is included at no extra cost.

For real-time captioning, high-volume professional transcription, or anything where speed is the whole point, the Pi is the wrong tool and you should run Whisper on a proper GPU or, grudgingly, pay someone else. But for the rest of us, with a drawer full of neglected single-board computers and a healthy distrust of cloud microphones, this is one of the most genuinely useful things you can self-host. Go rescue a Pi.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.