Running Gemma 3 Locally: Google's Small Model on Consumer Hardware

A capable open-weights model that fits on the GPU you already own

Every few months a new open-weights model lands and the homelab forums fill with breathless claims that this one finally dethrones the cloud. Most of the time it’s hype. Gemma 3, which Google released in March 2025, is one of the rare cases where the claims are roughly fair: not because it beats the frontier models, but because it’s the first small model I’ve run that I actually leave switched on.

The interesting trick Gemma 3 pulls is the size ladder: it comes in 1B, 4B, 12B, and 27B parameter variants, all sharing the same architecture, so you can pick whichever one your hardware tolerates without learning a new prompting style. The 4B fits comfortably on an 8 GB card. The 12B is the sweet spot for a 16 GB GPU. The 27B wants 24 GB or some patience. And from the 4B up, they’re natively multimodal — you can feed them an image, which the previous Gemma generation couldn’t do.
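If you want that rule of thumb in a provisioning script, here is a minimal sketch. The thresholds are just the figures from the paragraph above, the function name `pick_gemma_tag` is made up for illustration, and the 1B fallback for cards under 8 GB is my assumption rather than something tested here.

```python
# Rule-of-thumb VRAM thresholds from the text above (default 4-bit builds).
# The 1B fallback for cards under 8 GB is an assumption, not a measured figure.
SIZE_LADDER = [
    (24, "gemma3:27b"),
    (16, "gemma3:12b"),
    (8, "gemma3:4b"),
    (0, "gemma3:1b"),
]


def pick_gemma_tag(vram_gb: float) -> str:
    """Return the largest variant whose VRAM threshold the card meets."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    for min_vram_gb, tag in SIZE_LADDER:
        if vram_gb >= min_vram_gb:
            return tag
    return SIZE_LADDER[-1][1]  # unreachable, kept for clarity


print(pick_gemma_tag(16))  # → gemma3:12b
```

Treat the result as a starting point: long contexts and other GPU tenants eat into the same memory, so a card that sits exactly on a threshold may still need the next size down.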

There are fancier ways to serve a model, but for “I just want it answering on my LAN,” nothing beats Ollama. It handles the GGUF download, quantisation selection, and an OpenAI-compatible API in one binary.

$ curl -fsSL https://ollama.com/install.sh | sh
$ ollama pull gemma3:12b
pulling manifest
pulling 4a7c... 100% ▕████████████▏ 8.1 GB
verifying sha256 digest
success

$ ollama run gemma3:12b "Explain ZFS scrub vs resilver in two sentences."
A scrub reads every block in the pool and verifies it against its
checksum, repairing silent corruption while all disks are healthy.
A resilver rebuilds data onto a replacement disk after a failure,
reconstructing only the missing copies from redundancy.

The default gemma3:12b tag pulls a 4-bit quantised build (Q4_K_M), which is what makes it fit. If you want the unquantised weights for maximum fidelity and have the VRAM, gemma3:12b-it-fp16 exists, but honestly the quality gap at Q4 is small enough that I never bother.

This is the part nobody explains properly. “Quantisation” means storing the model’s weights at lower precision, 4 bits instead of 16, so the model fits in roughly a quarter of the memory. The naive assumption is that you’re throwing away three-quarters of the model’s intelligence. You aren’t. The Q4_K_M scheme keeps the most sensitive tensors at higher precision and squeezes the rest, and the measurable quality loss on most tasks is a couple of percent. What you lose is mostly at the margins: it’s slightly more likely to fumble a tricky reasoning chain or hallucinate a citation.
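The memory side of that trade is plain arithmetic, and it is worth doing once by hand. A rough sketch, assuming a flat 12 billion parameters and counting only the weights (the helper name `weight_memory_gb` is hypothetical):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


for bits in (16, 4):
    print(f"12B at {bits}-bit: ~{weight_memory_gb(12, bits):.0f} GB")
# → 12B at 16-bit: ~24 GB
# → 12B at 4-bit: ~6 GB
```

The real numbers come out higher than the nominal 6 GB: the pull above downloaded 8.1 GB, presumably because Q4_K_M keeps some tensors above 4 bits and the file carries more than the bare text weights, and the loaded model needs room for its context cache on top of that.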

You can watch the memory pressure directly while it runs:

$ ollama ps
NAME          ID            SIZE     PROCESSOR    UNTIL
gemma3:12b    a9f8c2e1      9.8 GB   100% GPU     4 minutes from now

If you see “CPU” or a split there, the model spilled out of VRAM and you’ll get a brutal slowdown — drop to the next size down or a tighter quant.
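If you would rather have a script nag you than eyeball the table, the check is easy to automate. This is a sketch that assumes the column layout shown in the sample output above (columns separated by two or more spaces, PROCESSOR in the fourth column); the function name is made up, and the layout may differ between Ollama versions.

```python
import re


def models_not_fully_on_gpu(ps_output: str) -> list[str]:
    """Return names of loaded models whose PROCESSOR column is not '100% GPU'."""
    lines = [ln for ln in ps_output.strip().splitlines() if ln.strip()]
    spilled = []
    for line in lines[1:]:  # skip the header row
        # Assumption: columns are separated by runs of two or more spaces.
        cols = re.split(r"\s{2,}", line.strip())
        if len(cols) < 4:
            continue  # not a row we understand, ignore it
        name, processor = cols[0], cols[3]
        if processor != "100% GPU":
            spilled.append(name)
    return spilled


SAMPLE = """\
NAME          ID            SIZE     PROCESSOR    UNTIL
gemma3:12b    a9f8c2e1      9.8 GB   100% GPU     4 minutes from now
"""
print(models_not_fully_on_gpu(SAMPLE))  # → []
```

To run it against a live box, feed it the output of `subprocess.run(["ollama", "ps"], capture_output=True, text=True, check=True).stdout` and alert on any non-empty result.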

Because Ollama speaks the OpenAI API, anything that talks to OpenAI talks to Gemma with a base-URL swap. That’s the whole appeal. Here’s a minimal Python call that runs entirely on your own metal:

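from openai import OpenAI

# Point the stock OpenAI client at the local Ollama server.
# The client insists on an API key, but Ollama ignores its value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="gemma3:12b",  # must match a tag you have already pulled
    messages=[{"role": "user", "content": "Draft a one-line cron comment for a nightly borg backup."}],
)
print(resp.choices[0].message.content)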

The same endpoint plugs into Open WebUI for a ChatGPT-style front end, into Home Assistant’s conversation agent, or into editor plugins. I run a 12B instance behind my reverse proxy and point a handful of tools at it. The 128k context window, large for a model this size, means it can chew through a long config file or a meeting transcript in one pass. One caveat: Ollama’s default context length is far smaller than 128k, so raise num_ctx first (and budget extra memory for the larger cache), or long inputs get truncated anyway.
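Since the 4B and larger variants take images, the same chat endpoint can carry a picture as well. A minimal sketch of the request payload, assuming Ollama's OpenAI-compatible endpoint accepts base64 data URLs in `image_url` parts; the helper name `build_image_messages` and the file `rack.jpg` are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def build_image_messages(image_path: str, prompt: str) -> list[dict]:
    """Build an OpenAI-style user message carrying one image as a base64 data URL."""
    path = Path(image_path)
    # Guess the MIME type from the file extension; fall back to PNG if unknown.
    mime = mimetypes.guess_type(path.name)[0] or "image/png"
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }]


# Usage against the local server (needs a running Ollama and the openai package):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
# resp = client.chat.completions.create(
#     model="gemma3:12b",
#     messages=build_image_messages("rack.jpg", "Describe this photo in one sentence."),
# )
# print(resp.choices[0].message.content)
```

Keeping the payload builder separate from the network call makes it easy to reuse with any tool that speaks the same API.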

Let’s be honest about the ceiling. Gemma 3 is not going to refactor a large codebase or reason through a genuinely hard problem the way the big hosted models do. The 4B variant in particular is fine for summarising and classifying but wobbly on anything requiring multi-step logic. Tool-calling support exists but is finicky compared to purpose-built agentic models. And the multimodal vision is competent, not spectacular — good enough to describe a photo, not good enough to read a dense scanned document reliably.

It’s also worth saying the licence is Google’s own Gemma terms, not a clean Apache or MIT, so read it if you’re shipping a commercial product on top.

For local, private, “good enough” inference, Gemma 3 at 12B is the model I’d hand someone setting up their first homelab AI box. It’s small enough to fit real hardware, multimodal, fast, and the quality is genuinely useful for summarising, drafting, classifying, and answering. If your benchmark is the frontier cloud models, you’ll be disappointed — that’s not the game here. But if your benchmark is “useful AI that never sends my data anywhere,” this is the easiest recommendation I’ve made all year.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.