Voice Assistants Without the Cloud: Whisper, Piper, and Home Assistant

A talking house that doesn't phone home

By Smarc

22-06-2025 1930 words 10 min read

I have always found something faintly insulting about commercial smart speakers. You buy a microphone, put it on your kitchen counter, and then pay a subscription to a company that would dearly love to know what brand of coffee you argue about at breakfast. The convenience is real, but the bargain is rotten. So a couple of years ago I tore the cloud out of my voice setup entirely, and the technology has matured to the point where I can finally recommend it without a disclaimer the length of a mortgage agreement.

The stack is three open-source pieces glued together by Home Assistant: Whisper for speech-to-text, Piper for text-to-speech, and an intent engine that decides what “turn off the lounge lamp” actually means. None of it touches the internet. The microphone audio never leaves the house. That is the whole point, and it is the thing no commercial speaker can offer you at any price.

Why local, beyond the obvious privacy line

The privacy argument is the headline, but there are two quieter reasons I have come to value more.

The first is resilience. A cloud speaker is a brick when the internet is down or the vendor’s servers hiccup. A local assistant keeps working through an outage, because everything it needs is a metre away on your own hardware. When the fibre goes down and I can still say “kitchen lights on,” that reliability stops feeling like a hobbyist quirk and starts feeling like the correct way to build home infrastructure.

The second is permanence. Cloud voice products get discontinued, features get paywalled, and the box you bought quietly loses functionality through no fault of its own. A local stack does exactly what it did on the day you set it up, indefinitely. Nobody can reach in and take a feature away. For anyone who has watched a smart-home ecosystem get abandoned by its manufacturer, that alone justifies the effort.

The pipeline, end to end

When you say the wake word, audio is captured and streamed to Whisper, which returns text. That text goes to the intent recogniser, which matches it against your devices and areas. The matched intent fires an action (flipping a switch, reading a sensor), and the response text is handed to Piper, which speaks it back. Home Assistant calls this whole arrangement an Assist pipeline, and the brilliant decision was making each stage pluggable. You can run everything locally, or offload the heavy bits to a beefier machine, and swap any stage without rebuilding the rest.
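The pluggable-stage idea is easier to see in code. What follows is a toy sketch in Python, not Home Assistant's actual implementation: the class name `AssistPipeline` and the three `fake_*` stages are hypothetical stand-ins I made up so the example runs without any models installed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AssistPipeline:
    """Hypothetical stand-in for an Assist pipeline: three swappable stages."""

    stt: Callable[[bytes], str]              # audio -> text (Whisper's role)
    intent: Callable[[str], Optional[str]]   # text -> response text, or None if no match
    tts: Callable[[str], bytes]              # response text -> audio (Piper's role)

    def run(self, audio: bytes) -> bytes:
        text = self.stt(audio)
        response = self.intent(text)
        if response is None:
            response = "Sorry, I didn't understand that."
        return self.tts(response)


# Toy stages so the sketch runs on its own.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio is already its transcript


def fake_intent(text: str) -> Optional[str]:
    known = {"turn off the lounge lamp": "Turned off the lounge lamp."}
    return known.get(text.lower().strip())


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend these bytes are synthesised speech


pipeline = AssistPipeline(stt=fake_stt, intent=fake_intent, tts=fake_tts)
print(pipeline.run(b"Turn off the lounge lamp"))
```

Swapping a stage means passing a different callable. Nothing else in the pipeline has to know, which is the property that lets you move Whisper to a beefier machine later.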

The two STT/TTS engines ship as add-ons if you’re running Home Assistant OS, or as plain Docker containers otherwise. Here’s the container approach, which is what I use because my Home Assistant lives in a VM and I refuse to run the supervised install:

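services:
  # Speech-to-text: Whisper behind the Wyoming protocol
  whisper:
    image: rhasspy/wyoming-whisper:latest
    command: --model small-int8 --language en
    volumes:
      - ./whisper-data:/data   # keeps downloaded models across restarts
    ports:
      - "10300:10300"          # port Home Assistant connects to
    restart: unless-stopped

  # Text-to-speech: Piper behind the Wyoming protocol
  piper:
    image: rhasspy/wyoming-piper:latest
    command: --voice en_GB-alba-medium
    volumes:
      - ./piper-data:/data     # keeps downloaded voices across restarts
    ports:
      - "10200:10200"          # port Home Assistant connects to
    restart: unless-stopped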

Both speak the Wyoming protocol, a lightweight scheme Home Assistant uses to talk to voice services over a socket. Once the containers are up, you add them under Settings → Devices & Services → Add Integration → Wyoming Protocol, point at the host and port, and they appear as selectable engines in your pipeline. If you already run other add-ons, this pattern will feel familiar — it is the same building-block approach behind ESPHome custom sensors: small services that each do one job and plug into the central hub.

Choosing models, and the CPU vs GPU question

Whisper comes in sizes from tiny to large-v3. The int8 quantised variants are the ones you want for CPU inference: roughly half the memory and a noticeable speed-up for a small accuracy cost. On a mid-range CPU, small-int8 transcribes a short command in well under a second, which is fast enough that the latency feels conversational. The base model is faster still but starts mangling anything with an accent or a slightly unusual device name, and if half your entities are named after rooms and appliances it will meet those constantly.
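The "roughly half the memory" figure is simple arithmetic if you assume the comparison is with 16-bit weights, which is my assumption here. The sketch below uses the approximate parameter counts published for the Whisper family and counts weights only. It ignores activations, buffers and runtime overhead, and quantised builds often keep some layers at higher precision, so treat the numbers as a floor and not a measurement.

```python
# Approximate parameter counts published for the Whisper model family.
WHISPER_PARAMS = {
    "tiny": 39_000_000,
    "base": 74_000_000,
    "small": 244_000_000,
    "medium": 769_000_000,
}

# Bytes needed to store one weight at each precision.
BYTES_PER_WEIGHT = {"float16": 2, "int8": 1}


def weight_megabytes(model: str, dtype: str) -> float:
    """Weights-only estimate in MB; ignores activations and runtime overhead."""
    return WHISPER_PARAMS[model] * BYTES_PER_WEIGHT[dtype] / 1_000_000


for name in WHISPER_PARAMS:
    fp16 = weight_megabytes(name, "float16")
    int8 = weight_megabytes(name, "int8")
    print(f"{name:>6}: float16 ~{fp16:.0f} MB, int8 ~{int8:.0f} MB")
```

On this estimate the jump from small to medium roughly triples the weight footprint, which is why medium is easier to justify when a GPU is sitting idle.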

If you have a GPU lying idle, the medium model becomes practical and the accuracy jump is real, especially for longer or noisier utterances. There is a genuine trade-off here: a bigger model is more forgiving of mumbling and background noise, but costs latency and memory. My advice is to start with small-int8, live with it for a week, and only reach for a larger model if you find yourself repeating commands. Most people never need to.

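You can confirm the engine's port is accepting connections with a quick netcat poke at the Wyoming port. This proves something is listening, not that transcription works, but it rules out the most common failure: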

$ nc -z -v localhost 10300
Connection to localhost 10300 port [tcp/*] succeeded!
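An open port only tells you something is listening. For a slightly deeper check, here is a Python sketch that asks the service to describe itself. It rests on my reading of the Wyoming wire format, so treat the details as assumptions: each event is a single JSON header line, optionally followed by `data_length` bytes of JSON and `payload_length` bytes of binary payload, and a `describe` event is answered with an `info` event. The function names are my own.

```python
import io
import json
import socket
from typing import BinaryIO, Optional


def encode_event(event_type: str, data: Optional[dict] = None, payload: bytes = b"") -> bytes:
    """Serialise one event: JSON header line, then optional data and payload bytes."""
    header: dict = {"type": event_type}
    data_bytes = json.dumps(data).encode("utf-8") if data else b""
    if data_bytes:
        header["data_length"] = len(data_bytes)
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode("utf-8") + b"\n" + data_bytes + payload


def read_event(stream: BinaryIO) -> dict:
    """Read one event from a binary stream and return its type, data and payload."""
    line = stream.readline()
    if not line:
        raise ConnectionError("stream closed before an event arrived")
    header = json.loads(line)
    data = dict(header.get("data") or {})      # tolerate data sent inline in the header
    data_length = header.get("data_length") or 0
    if data_length:
        data.update(json.loads(stream.read(data_length)))
    payload = stream.read(header.get("payload_length") or 0)
    return {"type": header["type"], "data": data, "payload": payload}


def describe(host: str, port: int, timeout: float = 5.0) -> dict:
    """Send a describe event and return the reply (expected type: info)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with sock.makefile("rwb") as stream:
            stream.write(encode_event("describe"))
            stream.flush()
            return read_event(stream)


# Round trip through an in-memory stream, so the helpers can be tried without a server.
roundtrip = read_event(io.BytesIO(encode_event("info", {"asr": []})))
print(roundtrip["type"])
```

With the containers running, `describe("localhost", 10300)` should come back with an event of type `info` if those assumptions hold. If it hangs or raises, the port is open but the service is not speaking the protocol you expect.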

Piper voices are the underrated half. The British “alba” and “northern_english” voices sound human enough that house guests stop asking why the kitchen is talking to me. Each voice is a small ONNX file; you can audition them on the Piper samples page before committing. The medium quality voices strike the best balance — the high ones sound marginally better but take longer to synthesise, and for one-line responses that latency is more noticeable than the quality gain.

Wake words and the satellite problem

Whisper handles transcription, but something has to listen for the wake word first, and you do not want a multi-gigabyte model spinning constantly for every device in the house. A dedicated wake-word model fills that gap. openWakeWord is small enough to run continuously on the server or a Pi-class satellite, and microWakeWord is smaller still and runs on the ESP32-S3 itself. Either way, the heavy pipeline only wakes when the detector hears “Hey Jarvis” or whatever phrase you’ve chosen. The cheap detection runs all the time; the expensive transcription runs centrally, only when triggered.
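The gating idea is worth seeing in miniature. This Python sketch is purely illustrative: `gate_on_wake_word` and `toy_score` are hypothetical names, the scores are invented, and a real detector decides when a command has ended by listening for silence, where this one just grabs a fixed number of chunks.

```python
from typing import Callable, Iterable, Iterator, List


def gate_on_wake_word(
    chunks: Iterable[bytes],
    score: Callable[[bytes], float],
    threshold: float = 0.5,
    command_chunks: int = 3,
) -> Iterator[List[bytes]]:
    """Yield the chunks that follow each wake-word hit; drop everything else."""
    it = iter(chunks)
    for chunk in it:
        if score(chunk) < threshold:
            continue  # cheap path: nothing reaches the heavy pipeline
        command: List[bytes] = []
        for _ in range(command_chunks):
            nxt = next(it, None)
            if nxt is None:
                break
            command.append(nxt)
        yield command


def toy_score(chunk: bytes) -> float:
    """Made-up detector scores: the real phrase scores high, TV chatter middling."""
    return {b"hey-jarvis": 0.9, b"tv-chatter": 0.6}.get(chunk, 0.1)


stream = [
    b"noise", b"tv-chatter", b"noise", b"noise", b"noise",
    b"hey-jarvis", b"kitchen", b"lights", b"on",
]

print(list(gate_on_wake_word(stream, toy_score, threshold=0.8)))
```

At a threshold of 0.8 only the real wake phrase opens the gate. Drop it to 0.5 and the television gets through as well, which is the false-trigger problem in a nutshell.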

The cleanest hardware is an ESP32-S3 flashed with ESPHome’s voice assistant firmware, which Home Assistant sells pre-built as the Voice Preview Edition. One sits in each room as a satellite, all the inference happens on the central box, and the satellites cost about the price of a takeaway. This architecture — dumb, cheap satellites feeding a single capable brain — is what makes a whole-house voice setup affordable. You are not buying an Alexa for every room; you are buying a microphone and a speaker for every room.

For a DIY satellite, the ESPHome config fragment is mercifully short:

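voice_assistant:
  microphone: mic          # id of a microphone defined elsewhere in the config
  speaker: spk             # id of a speaker defined elsewhere in the config
  use_wake_word: true      # enable wake-word detection in the Assist pipeline
  on_wake_word_detected:
    - light.turn_on: status_led   # status_led is a light defined elsewhere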

Troubleshooting: the failure modes you’ll actually hit

The satellite hears the wake word but nothing happens. Almost always the pipeline can’t reach Whisper or Piper. Re-run the nc -z -v check against both ports. If the containers are on a different host or a segmented network, confirm the firewall lets the satellite reach them — a common trip-up if you have split your network into VLANs the way I described in VLAN segmentation at home, because your IoT satellites may be walled off from the server running the models.

Transcription is slow. If a command takes several seconds to register, the model is too big for the CPU. Drop to small-int8 or base-int8, or move inference to a machine with more headroom. Watch the container’s CPU during a command — if it pins a core for seconds, that’s your answer.

It mishears device names constantly. Rename entities to distinct, phonetically clear words. “Lounge lamp” and “lounge lights” will get confused; “reading lamp” and “ceiling lights” won’t. This is far cheaper than upgrading the model.
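If you have a lot of entities, you can get a rough list of rename candidates with a few lines of Python. This is a crude sketch: it compares spelling with the standard library's `difflib`, which is only a proxy for how alike two names sound, and the function name and the 0.6 cut-off are arbitrary choices of mine.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Tuple


def confusable_pairs(names: List[str], limit: float = 0.6) -> List[Tuple[str, str, float]]:
    """Return pairs of names whose character-level similarity is at or above limit."""
    pairs = []
    for a, b in combinations(names, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= limit:
            pairs.append((a, b, round(ratio, 2)))
    return pairs


print(confusable_pairs(["lounge lamp", "lounge lights"]))
print(confusable_pairs(["reading lamp", "ceiling lights"]))
```

The first pair is flagged and the second is not, which matches what my ears tell me. Anything it flags deserves a second look, but names that are spelled differently and sound alike will slip straight past it.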

No audio comes back. Piper is up but the speaker is silent — usually a media-player target problem, not a Piper one. Confirm the pipeline’s TTS output is routed to a speaker entity that actually plays, and test that entity with a manual tts.speak call.

Wake-word false triggers. If the house keeps waking at the television, either raise the detection threshold or switch to a less common wake phrase. Generic words trigger on ambient speech; a distinctive one does not.

Tuning for the latency that makes or breaks it

The single thing that decides whether people in the house actually use a voice assistant is latency. If there is a two-second pause between finishing a sentence and the lights coming on, everyone reverts to the wall switch within a week. The whole pipeline has to feel instant, and getting there is a matter of shaving delay out of each stage.

The biggest lever is the Whisper model, as covered above — small-int8 on a capable CPU is the sweet spot for most people. The second is where the inference runs relative to where the audio is captured: keep the satellites and the model host on a fast, low-latency link. Wired gigabit between the satellite’s access point and the server beats a congested Wi-Fi hop every time. The third is Piper’s voice quality setting; a medium voice synthesises a one-line response fast enough to feel immediate, where a high voice can add a perceptible beat.

A useful trick is to keep responses short. A verbose confirmation — “I have turned on the reading lamp in the living room for you” — takes noticeably longer to synthesise and speak than a curt “done.” Configure terse responses and the whole thing feels snappier, quite apart from being less irritating to live with. The goal is for the house to feel responsive, not chatty.
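It helps to think of all this as a latency budget. The Python sketch below adds up per-stage timings and points at the stage worth attacking first. The numbers are hypothetical placeholders, not measurements from my setup, and the stage names are my own; put in whatever your own logs show.

```python
from typing import Dict

# Hypothetical per-stage timings in milliseconds; replace with your own measurements.
EXAMPLE_STAGES_MS: Dict[str, float] = {
    "network_to_server": 20,
    "speech_to_text": 600,
    "intent_matching": 30,
    "text_to_speech": 250,
    "network_to_satellite": 20,
}


def total_latency_ms(stages: Dict[str, float]) -> float:
    """Sum of all stage timings."""
    return sum(stages.values())


def slowest_stage(stages: Dict[str, float]) -> str:
    """Name of the stage that contributes the most delay."""
    return max(stages, key=lambda name: stages[name])


def within_budget(stages: Dict[str, float], budget_ms: float = 2000.0) -> bool:
    """True if the whole pipeline fits inside the budget."""
    return total_latency_ms(stages) <= budget_ms


print(total_latency_ms(EXAMPLE_STAGES_MS), slowest_stage(EXAMPLE_STAGES_MS))
```

The default budget of two seconds is the point at which, as described above, people drift back to the wall switch. In practice you want to land well under it.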

Adding a brain: the local LLM option

If you want the assistant to answer open-ended questions rather than just flip switches, you can wire a local large language model into the pipeline via Ollama, which Home Assistant supports as a conversation agent. This turns “make it cosy in here” from a baffled refusal into something the model can interpret and act on. The catch is resources: a usefully capable model wants a GPU with real VRAM, and every query now costs LLM inference time on top of transcription. It is a genuine upgrade, but it roughly doubles the hardware ask, and for a house whose main job is turning lights on and off it is overkill. Add it only if you actually want conversation, not just control.
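One way to keep the cost down is to let the cheap intent matcher have first go and only wake the model when it comes up empty. The Python sketch below shows that routing with toy stand-ins. It does not call Home Assistant's or Ollama's real APIs, and `make_agent`, `toy_intent` and `toy_llm` are hypothetical names of mine.

```python
from typing import Callable, List, Optional


def make_agent(
    match_intent: Callable[[str], Optional[str]],
    ask_llm: Callable[[str], str],
) -> Callable[[str], str]:
    """Build a handler that tries local intents first and falls back to the LLM."""

    def handle(text: str) -> str:
        response = match_intent(text)
        if response is not None:
            return response       # fast path: no LLM inference needed
        return ask_llm(text)      # slow path: only for what the intents can't handle

    return handle


llm_calls: List[str] = []  # records which requests reached the expensive path


def toy_intent(text: str) -> Optional[str]:
    known = {"turn on the desk lamp": "Done."}
    return known.get(text.lower().strip())


def toy_llm(text: str) -> str:
    llm_calls.append(text)
    return f"(model answer to: {text})"


agent = make_agent(toy_intent, toy_llm)
print(agent("Turn on the desk lamp"))
print(agent("make it cosy in here"))
print(len(llm_calls))
```

Only the second request touches the model, so plain switch-flipping stays as fast as it was before you added the brain.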

Where it falls short

This is not Alexa, and you should not pretend it is. Open-ended questions — “what’s the capital of Peru” — get a polite refusal unless you add the Ollama layer above, which adds latency and a whole new pile of VRAM requirements. The intent matching is otherwise literal: it understands “turn on the desk lamp” but trips over anything metaphorical unless you define that mapping yourself. Multi-turn conversations are clumsy. And the initial setup is genuinely a weekend, not an afternoon.

The verdict

If you want a thing that answers trivia and orders dog food, buy the commercial speaker and make your peace with the surveillance. But if your ambition is “control my house by voice, reliably, without a microphone in my kitchen reporting to a data centre,” this stack is now genuinely good. I trust it with my smart home precisely because I can see every packet it doesn’t send.

Who is this for? Privacy-minded tinkerers with a spare machine and a free weekend, and anyone who has been burned by a discontinued smart-home product and wants infrastructure that can’t be taken away. For everyone else, the cloud speaker is still easier — it just costs you something you can’t get back. For me, it’s the most satisfying project I’ve shipped in years, and the only voice assistant in my house I’d actually trust with a microphone.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #home-assistant #whisper #self-hosting #privacy