Local LLMs: A Practical Comparison of Llama, Mistral, and Gemma for Real Work

Three open model families, one GPU, and an honest look at what each is good for

Smarc Included in

24-09-2024 1797 words 9 min read

Local LLMs: A Practical Comparison of Llama, Mistral, and Gemma for Real Work

Contents

There is a particular flavour of disappointment unique to running a local LLM for the first time. You’ve read the benchmarks, you’ve stared at the leaderboard, you spin a model up on your own GPU, ask it something real — and it produces confidently structured nonsense. Then you try a different model and it nails the same task on the first go. The benchmarks didn’t lie, exactly. They just don’t tell you which model is good at your work, on your prompts, in your format. I’ve spent well over a year now using all three of the big open families as everyday tools rather than toys, and this is what I’ve learned about Llama, Mistral and Gemma once the novelty has thoroughly worn off.

The short version: there is no single winner, the leaderboard rankings barely predict which one you’ll actually reach for, and the differences that matter in daily use are almost never the ones the benchmarks measure. Let me explain why, and give you a setup you can copy.

The contenders, briefly

Llama is Meta’s family, and the Llama 3.x line is the one to care about for self-hosting. It comes in an 8B workhorse, a 70B “I own a serious GPU” tier, and a frankly silly 405B that exists mostly to make a point. For anyone running on a single desktop card, the 8B is the one you’ll live with.

Mistral is the French outfit that made its name with a 7B model that embarrassed larger ones, followed by the Mixtral mixture-of-experts models. Its reputation is efficiency: more competence per parameter than seems entirely fair, and famously permissive Apache 2.0 licensing on the smaller weights.

Gemma is Google’s open family, derived from the same research as Gemini. The mid-sized Gemma models — roughly 9B and 27B in the Gemma 2 generation, with Gemma 3 pushing further — are the current ones, and the ~27B in particular is the most capable thing I can comfortably run on a single 24GB consumer card. If you want a deeper look at that one on its own, I wrote up running Gemma locally on consumer hardware separately.

How I actually run them

All three live happily under Ollama, which has become the path of least resistance for local inference. One tool, one command, and it handles pulling the weights, quantisation, and the memory juggling of loading and unloading models. If you’re starting from scratch, my guide to running LLMs on your own metal with Ollama covers the install; here I’ll assume it’s already there.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
# pull and run each; swapping models is one command
ollama run llama3.1:8b
ollama run mistral:7b
ollama run gemma2:9b

# real work happens against the API, not the chat prompt
curl http://localhost:11434/api/generate -d '{
  "model": "gemma2:9b",
  "prompt": "Summarise this changelog into release notes:\n...",
  "stream": false
}'

Quantisation matters more than people like to admit, and it’s the first thing to understand before you judge any of these models. A 4-bit quantised model fits where the full-precision one won’t, and for most everyday tasks the quality loss is far smaller than the size reduction suggests — you’re trading a barely-perceptible drop in fidelity for the difference between “runs on my card” and “doesn’t.” On a 12GB card I run the 8–9B models comfortably at their default quant with headroom to spare; the ~27B Gemma wants a 24GB card, or a good deal of patience while surplus layers offload to system RAM and inference slows to a crawl. Know your VRAM budget before you fall in love with a model that won’t fit, because there is nothing more deflating than pulling a 40GB model onto a 12GB card and watching it grind.

One more practical note on running them: Ollama keeps a model resident in VRAM for a few minutes after use and unloads it to make room for the next, which means switching between all three during a working session is seamless but the first prompt to a freshly-loaded model always pays a loading tax. If you’re benchmarking, warm the model up and discard the first response, or you’ll be timing disk reads rather than the model.

Where each one actually shines

Run enough real tasks through all three and clear personalities emerge — the sort of thing no single benchmark number captures.

Llama 3.1 8B is my default for general chat, summarisation and following multi-step instructions. It’s the best-rounded of the small models, rarely surprises me badly, and has by far the largest ecosystem of community fine-tunes if I want a specialised variant. When I genuinely don’t know which model to reach for, I reach for this one, and I’m rarely sorry.

Mistral 7B is the speed demon. For high-volume, lower-stakes work — classifying a pile of support tickets, extracting fields from a stack of documents, grinding through a batch job overnight — its throughput-to-quality ratio is excellent, and the permissive licence means you can build on it commercially without a lawyer in the room. It’s a touch more willing to go off-script than Llama, which is occasionally a feature and occasionally a menace.

Gemma 2 27B is the one that changed how I work. The 9B is sharp for its size, but the 27B is the first locally-runnable model that consistently handles nuanced writing and longer reasoning chains without quietly falling apart halfway through. For drafting prose, rewriting in a specific tone, or anything where the quality of the language is the whole point, the 27B Gemma is what I open. It’s slower, it’s hungrier, and for that class of task it’s worth every second.

The uncomfortable truth those three paragraphs hide is that the leaderboard ordering barely predicts any of it. On a given day one model will top an aggregate benchmark by a point or two, and that ranking tells you almost nothing about which one will draft a decent changelog summary in your house style or reliably extract the fields you need from your messy documents. The only benchmark that has ever helped me is a folder of a dozen real tasks I care about, run through each candidate, read with my own eyes. Build that folder once and re-run it whenever a new model drops; it’ll save you weeks of chasing leaderboard positions that don’t map to your work.

The boring truths nobody puts in the comparison table

Context window is where local models still bite you. The advertised numbers are generous, but quality degrades long before you reach the stated limit, and stuffing a huge document into the prompt frequently produces worse answers than chunking it into pieces and asking about each. Test with your real document lengths, not a two-line toy prompt, or you’ll ship something that works in the demo and fails in production. This is the single most common way a “great” local model disappoints in practice.

Prompt format is non-negotiable. Each family was trained with a specific chat template — the exact tokens that mark where a system message, a user turn and an assistant reply begin and end. Ollama handles this for you when you use its chat endpoint, but the moment you hit the raw completion API and hand-assemble the conversation, a mismatched template will make a capable model look stupid. When a model seems dumber than it should be, suspect the template before you suspect the weights. This same gotcha comes back to bite people when they fine-tune with LoRA and serve with a different format than they trained on.

Licences genuinely differ, and it matters if you build. Llama and Gemma ship under their own community licences with acceptable-use terms and, historically, clauses that kick in at large user counts. Mistral’s smaller models are Apache 2.0, the most permissive of the lot. If you’re putting a local model into a product, read the actual licence text rather than a summary — the summary is not the licence, and “open” is not a legal term.

Troubleshooting: when a good model behaves badly

A few failure modes recur often enough to be worth naming, because the fix is rarely “use a bigger model.”

Output is garbled or repeats itself. Almost always a broken chat template or a wrong stop token. Switch back to Ollama’s chat endpoint and confirm the model runs correctly there before blaming the weights.
The model is unbearably slow. Check whether it actually fits in VRAM. ollama ps shows how a loaded model is split between GPU and CPU; anything spilling to system RAM will drag. Drop to a smaller quant or a smaller model.
Answers are worse than the same model gives elsewhere. Compare quant levels. A heavily quantised build on your machine is not the same model as the full-precision one behind a hosted API, and for reasoning-heavy tasks the gap widens. Bump up one quant level if you have the VRAM headroom before concluding the model is simply weak.
Long documents give vague, hedge-everything answers. You’ve overrun the useful context. Chunk the input and summarise in passes; don’t trust the advertised window.
The model refuses reasonable requests or moralises. Different families have different safety tuning, and the small open ones vary a lot. If one is being precious about a legitimate task, another family often simply isn’t — this is one of the clearest day-to-day differences between them, and worth keeping all three around for.
Output quality swings wildly between runs. Check your sampling settings. A high temperature turns a competent model into a creative-writing generator; for extraction and classification, drop the temperature low and the erratic answers usually settle down.

Is it worth it?

For real work, on your own hardware, yes — with honest expectations. None of these small open models will match the frontier hosted systems on the hardest reasoning tasks, and pretending otherwise sets you up for exactly that first-disappointment moment. But for summarisation, drafting, classification, extraction and the vast everyday mass of text-shuffling, a local 8–27B model is more than good enough, costs only electricity, and keeps your data on a machine you physically own. If you’re weighing that trade-off in pounds and hours, I’ve done the sums in the real cost of self-hosting.

My actual setup, if you want a starting point: Llama 3.1 8B as the daily driver, Mistral 7B for batch jobs where speed wins, and Gemma 2 27B for anything where the writing has to be good. Install Ollama, pull all three, and spend one focused afternoon throwing your own real tasks at each of them rather than reading someone else’s benchmark. The leaderboard will tell you which model is “best” in the abstract. Only your own work will tell you which is best for you — and those two answers are almost never the same.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged#machine-learning #llm #self-hosting #ollama

Contents

Local LLMs: A Practical Comparison of Llama, Mistral, and Gemma for Real Work

Three open model families, one GPU, and an honest look at what each is good for

The contenders, briefly

How I actually run them

Where each one actually shines

The boring truths nobody puts in the comparison table

Troubleshooting: when a good model behaves badly

Is it worth it?

Related Content

Running Gemma 3 Locally: Google's Small Model on Consumer Hardware

llama.cpp vs Ollama vs vLLM: Picking a Local Inference Server

Local AI on Your Own Metal: Running LLMs Offline with Ollama

Label Studio: Self-Hosted Data Annotation for Training Your Own Models