Voice Assistants Without the Cloud: Whisper, Piper, and Home Assistant

A talking house that doesn't phone home

I have always found something faintly insulting about commercial smart speakers. You buy a microphone, put it on your kitchen counter, and then pay a subscription to a company that would dearly love to know what brand of coffee you argue about at breakfast. The convenience is real, but the bargain is rotten. So a couple of years ago I tore the cloud out of my voice setup entirely, and the technology has matured to the point where I can finally recommend it without a disclaimer the length of a mortgage agreement.

The stack is three open-source pieces glued together by Home Assistant: Whisper for speech-to-text, Piper for text-to-speech, and an intent engine that decides what “turn off the lounge lamp” actually means. None of it touches the internet. The microphone audio never leaves the house. That is the whole point.

When you say the wake word, audio is captured and streamed to Whisper, which returns text. That text goes to the intent recogniser, which matches it against your devices and areas. The matched intent fires an action, such as flipping a switch or reading a sensor, and the response text is handed to Piper, which speaks it back. Home Assistant calls this whole arrangement an Assist pipeline, and the brilliant decision was making each stage pluggable. You can run everything on the Home Assistant box itself, or offload the heavy bits to a beefier machine on your network.
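
To make "pluggable" concrete, here is a minimal Python sketch of the idea. It is not Home Assistant's code: the class, the type aliases and the stub stages are all hypothetical names I made up for illustration. Each stage is just a callable, so any one of them can be swapped without touching the others.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases: one callable per pipeline stage.
SpeechToText = Callable[[bytes], str]
IntentHandler = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class Pipeline:
    """Toy stand-in for an Assist pipeline with swappable stages."""
    stt: SpeechToText
    intent: IntentHandler
    tts: TextToSpeech

    def run(self, audio: bytes) -> bytes:
        text = self.stt(audio)       # the role Whisper plays
        reply = self.intent(text)    # intent matching plus the action
        return self.tts(reply)       # the role Piper plays


# Stub stages standing in for the real engines.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_intent(text: str) -> str:
    if text == "turn off the lounge lamp":
        return "Turned off the lounge lamp"
    return "Sorry, I didn't understand that"


def fake_tts(reply: str) -> bytes:
    return reply.encode("utf-8")


pipeline = Pipeline(stt=fake_stt, intent=fake_intent, tts=fake_tts)
print(pipeline.run(b"turn off the lounge lamp"))
```

Swapping `fake_stt` for a function that calls a remote machine changes nothing else in the pipeline, which is the property that lets you move Whisper to a GPU box later.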

The two STT/TTS engines ship as add-ons if you’re running Home Assistant OS, or as plain Docker containers otherwise. Here’s the container approach, which is what I use because my Home Assistant lives in a VM and I refuse to run the supervised install:

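services:
  # Speech-to-text: Whisper behind a Wyoming server
  whisper:
    image: rhasspy/wyoming-whisper:latest
    command: --model small-int8 --language en
    volumes:
      - ./whisper-data:/data   # downloaded models are kept here
    ports:
      - "10300:10300"
    restart: unless-stopped

  # Text-to-speech: Piper behind a Wyoming server
  piper:
    image: rhasspy/wyoming-piper:latest
    command: --voice en_GB-alba-medium
    volumes:
      - ./piper-data:/data     # downloaded voices are kept here
    ports:
      - "10200:10200"
    restart: unless-stopped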

Both speak the Wyoming protocol, a lightweight protocol Home Assistant uses to talk to voice services over a TCP socket. Once the containers are up, you add each one under Settings → Devices & Services → Add Integration → Wyoming Protocol, enter its host and port (10300 for Whisper, 10200 for Piper in the file above), and they appear as selectable engines in your pipeline.
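
For the curious, the wire format is simple enough to sketch in a few lines of Python. As I read the published Wyoming spec, each event is one line of JSON (the header), optionally followed by extra JSON data and a binary payload whose lengths are given in the header. The function names below are my own, and this is an illustration of the framing rather than a replacement for the official `wyoming` library.

```python
import io
import json
from typing import BinaryIO, Optional, Tuple


def encode_event(event_type: str, data: Optional[dict] = None, payload: bytes = b"") -> bytes:
    """Frame one event: JSON header line, then optional data and payload bytes."""
    header: dict = {"type": event_type}
    data_bytes = json.dumps(data).encode("utf-8") if data else b""
    if data_bytes:
        header["data_length"] = len(data_bytes)
    if payload:
        header["payload_length"] = len(payload)
    return json.dumps(header).encode("utf-8") + b"\n" + data_bytes + payload


def decode_event(stream: BinaryIO) -> Tuple[str, dict, bytes]:
    """Read one framed event back from a binary stream."""
    header = json.loads(stream.readline().decode("utf-8"))
    data = dict(header.get("data") or {})  # data may also sit inline in the header
    data_length = header.get("data_length", 0)
    if data_length:
        data.update(json.loads(stream.read(data_length).decode("utf-8")))
    payload = stream.read(header.get("payload_length", 0))
    return header["type"], data, payload


# Round trip: two bytes of fake audio with the format described alongside.
frame = encode_event("audio-chunk", {"rate": 16000, "width": 2, "channels": 1}, b"\x00\x01")
print(decode_event(io.BytesIO(frame)))
```

The length-prefixed payload is what lets raw audio travel on the same socket as the JSON control messages without any escaping.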

Whisper comes in sizes from tiny to large-v3. The int8 quantised variants are the ones you want for CPU inference: roughly half the memory of the float16 weights and a noticeable speed-up, for a small accuracy cost. On a mid-range CPU, small-int8 transcribes a short command in well under a second, which is fast enough that the latency feels conversational. The base model is faster still but starts mangling anything with an accent or a slightly unusual device name.
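
The "half the memory" figure is plain arithmetic on bytes per weight. Here is a back-of-envelope sketch in Python, using the approximate parameter counts OpenAI publishes for Whisper. It counts the weights only, so treat the results as a floor: real quantised models may keep some layers at higher precision, and the runtime needs working memory on top.

```python
# Approximate parameter counts, in millions, as published for Whisper.
PARAMS_MILLIONS = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1550}

# Storage cost of one weight in each numeric format.
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def weight_size_mb(model: str, dtype: str) -> int:
    """Approximate size of the weights alone, in megabytes (10**6 bytes)."""
    return PARAMS_MILLIONS[model] * BYTES_PER_WEIGHT[dtype]


for model in ("base", "small", "medium"):
    fp16 = weight_size_mb(model, "float16")
    int8 = weight_size_mb(model, "int8")
    print(f"{model}: about {fp16} MB as float16, about {int8} MB as int8")
```

For the small model that works out at roughly 488 MB against 244 MB, which is why small-int8 sits comfortably on a machine where the unquantised model would feel tight.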

If you have a GPU lying idle, the medium model becomes practical and the accuracy jump is real, especially for longer or noisier utterances. You can check that the container is at least listening with a quick netcat poke at the Wyoming port (this confirms the port is open, not that the model has finished loading):

$ nc -z -v localhost 10300
Connection to localhost 10300 port [tcp/*] succeeded!
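
A slightly stronger check is to ask the service to describe itself. The Python sketch below assumes, per my reading of the Wyoming spec, that a server answers a `describe` event with an `info` event. The function name is my own invention, and it only inspects the header line of the reply.

```python
import json
import socket


def wyoming_answers(host: str, port: int, timeout: float = 5.0) -> bool:
    """Send a describe event and report whether an info event comes back."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b'{"type": "describe"}\n')
            with sock.makefile("rb") as stream:
                header_line = stream.readline()  # the reply's JSON header
        header = json.loads(header_line)
    except (OSError, ValueError):
        # Refused, timed out, or the reply was not valid JSON.
        return False
    return isinstance(header, dict) and header.get("type") == "info"
```

Calling `wyoming_answers("localhost", 10300)` should return `True` once Whisper is up and talking, and `False` if the port is closed or something other than a Wyoming service is sitting on it.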

Piper voices are the underrated half. The British “alba” and “northern_english” voices sound human enough that house guests stop asking why the kitchen is talking to me. Each voice is a small ONNX file; you can audition them on the Piper samples page before committing.

Whisper handles transcription, but something has to listen for the wake word first, and you do not want a multi-gigabyte model spinning constantly. openWakeWord fills that gap: a small model, run as another Wyoming service on the central box, that only wakes the heavy pipeline when it hears “Hey Jarvis” or whatever phrase you trained. (Its sibling, microWakeWord, is the one compact enough to run on the microcontroller itself.) The cleanest hardware is an ESP32-S3 running ESPHome’s voice assistant firmware, which Home Assistant sells pre-built as the Voice Preview Edition. One sits in each room as a satellite, the speech-to-text and text-to-speech inference happens on the central box, and the satellites cost about the price of a takeaway.

For a DIY satellite, the ESPHome config fragment is mercifully short:

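voice_assistant:
  microphone: mic          # id of a microphone defined elsewhere in the config
  speaker: spk             # id of a speaker defined elsewhere in the config
  use_wake_word: true      # let the Assist pipeline's wake word engine listen
  on_wake_word_detected:
    - light.turn_on: status_led   # id of a light defined elsewhere in the config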

This is not Alexa, and you should not pretend it is. Open-ended questions — “what’s the capital of Peru” — get a polite refusal unless you wire in a local LLM via Ollama, which adds latency and a whole new pile of VRAM requirements. The intent matching is literal: it understands “turn on the desk lamp” but trips over “make it cosy in here” unless you define that yourself. Multi-turn conversations are clumsy. And the initial setup is genuinely a weekend, not an afternoon.
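
To show what “literal” means in practice, here is a toy matcher in Python. It is emphatically not Home Assistant’s real matcher, which is more capable. The templates, the device list and the `SetCosyScene` intent are hypothetical, though `HassTurnOn` and `HassTurnOff` borrow the names of built-in intents. A sentence either fits a known template with a known device, or it matches nothing.

```python
import re
from typing import Optional

# Hypothetical sentence templates; {name} is a slot filled from known devices.
TEMPLATES = {
    "turn on the {name}": "HassTurnOn",
    "turn off the {name}": "HassTurnOff",
}
DEVICES = {"desk lamp", "lounge lamp"}


def match_intent(text: str, templates: dict = TEMPLATES) -> Optional[tuple]:
    """Return (intent, device name) for an exact template match, else None."""
    cleaned = text.lower().strip(" .!?")
    for template, intent in templates.items():
        # Turn the template into a regex with a named group for the slot.
        pattern = re.escape(template).replace(re.escape("{name}"), "(?P<name>.+)")
        found = re.fullmatch(pattern, cleaned)
        if found is None:
            continue
        name = found.groupdict().get("name")
        if name is None or name in DEVICES:
            return intent, name
    return None


print(match_intent("Turn on the desk lamp"))
print(match_intent("make it cosy in here"))

# Teaching it the phrase is just one more template.
custom = {**TEMPLATES, "make it cosy in here": "SetCosyScene"}
print(match_intent("make it cosy in here", custom))
```

The first call matches, the second returns `None`, and the third matches only because the exact sentence was added by hand. In Home Assistant the equivalent is defining a custom sentence and wiring it to an action.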

If you want a thing that answers trivia and orders dog food, buy the commercial speaker and make your peace with the surveillance. But if your ambition is “control my house by voice, reliably, without a microphone in my kitchen reporting to a data centre,” this stack is now genuinely good. I trust it with my smart home precisely because I can see every packet it doesn’t send. For privacy-minded tinkerers with a spare machine and a free weekend, it’s the most satisfying project I’ve shipped in years. For everyone else, the cloud speaker is still easier — it just costs you something you can’t get back.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.