Local LLMs: A Practical Comparison of Llama, Mistral, and Gemma for Real Work
Three open model families, one GPU, and an honest look at what each is good for

There is a particular flavour of disappointment unique to running a local LLM for the first time. You’ve read the benchmarks, you’ve seen the leaderboard, you spin up a model on your own GPU, ask it something real, and it produces confidently structured nonsense. Then you try a different model and it nails the same task. The benchmarks didn’t lie exactly — they just don’t tell you which model is good at your work. After months of using all three of the big open families for actual tasks, here’s what I’ve learned about Llama, Mistral, and Gemma when the novelty wears off.
1 The contenders, briefly
Llama is Meta’s family, and the Llama 3.1 release earlier this year is the one to care about. It comes in 8B, 70B, and a frankly silly 405B size. For self-hosters, the 8B is the workhorse and the 70B is the “I have a serious GPU” upgrade.
Mistral is the French outfit that punched far above its weight with a 7B model that embarrassed bigger ones, and the Mixtral mixture-of-experts models that followed. Their reputation is efficiency: more competence per parameter than seems fair.
Gemma is Google’s open family, released back in February. The Gemma 2 models — 9B and 27B — are the current ones, and the 27B in particular is the most capable thing I can comfortably run on a single consumer card.
2 How I actually run them
All three live happily under Ollama, which has become the path of least resistance for local models. One tool, one command, and it handles the quantisation and memory juggling for you:
# pull and run each, swapping models is one command
ollama run llama3.1:8b
ollama run mistral:7b
ollama run gemma2:9b
# or hit the API directly, which is how real work happens
curl http://localhost:11434/api/generate -d '{
"model": "gemma2:9b",
"prompt": "Summarise this changelog into release notes:\n...",
"stream": false
}'Quantisation matters more than people admit. A 4-bit quantised model fits where the full-precision one won’t, and the quality loss for most tasks is smaller than the size difference suggests. On a 12GB card I run the 8–9B models comfortably; the 27B Gemma needs a 24GB card or some patience with offloading.
3 Where each one actually shines
After enough real tasks, clear personalities emerge.
Llama 3.1 8B is my default for general chat, summarisation, and following multi-step instructions. It’s the best-rounded of the small models, rarely surprises me badly, and has the largest ecosystem of fine-tunes if I want a specialised variant. When I don’t know which model to reach for, I reach for this one.
Mistral 7B is the speed demon. For high-volume, lower-stakes work — classifying a pile of support tickets, extracting fields from documents, churning through a batch job overnight — its throughput-to-quality ratio is excellent. It’s a touch more willing to go off-script than Llama, which is occasionally a feature and occasionally a menace.
Gemma 2 is the surprise of the year for me. The 9B is sharp, but the 27B is the first locally-runnable model that consistently handles nuanced writing and longer reasoning chains without falling apart. For drafting prose, rewriting in a particular tone, or anything where the quality of the language matters, Gemma 27B is what I open. It’s slower, and it’s hungrier, and it’s worth it.
4 The boring truths nobody puts in the comparison table
Context window is where local models still bite you. The advertised numbers are generous, but quality degrades long before you hit the limit, and stuffing a huge document in often produces worse results than chunking it. Test with your real document lengths, not a toy prompt.
Prompt format is non-negotiable. Each family was trained with specific chat templates, and Ollama mostly handles this, but if you’re hitting the raw API you can get markedly worse output simply by formatting the conversation wrong. When a model seems dumber than it should be, suspect the template before you suspect the weights.
And licences differ. Llama and Gemma carry their own community licences with acceptable-use terms; Mistral’s smaller models are Apache 2.0, which is the most permissive of the lot. If you’re building something commercial, read them — properly, not the summary.
5 Is it worth it?
For real work, on your own hardware, yes — with realistic expectations. None of these will match the frontier hosted models on the hardest tasks, and pretending otherwise sets you up for that first-disappointment moment. But for summarisation, drafting, classification, extraction, and the great mass of everyday text-shuffling, a local 8–27B model is more than good enough, costs only electricity, and keeps your data on your own machine.
My actual setup, if you want a starting point: Llama 3.1 8B as the daily driver, Mistral 7B for batch jobs where speed wins, and Gemma 2 27B for anything where the writing has to be good. Install Ollama, pull all three, and spend an afternoon throwing your own real tasks at each. The leaderboards will tell you which is “best”; only your own work will tell you which is best for you.




