Fine-Tuning vs Prompting vs RAG: Picking the Right Tool Without Wasting GPU Hours

A decision guide for grounding your model

09-06-2026

I once watched someone spend the better part of a weekend, and roughly £40 of rented A100 time, fine-tuning a 7B model to “answer questions about our internal wiki.” The result was a model that confidently invented policy numbers that had never existed. The fix took twenty minutes: paste the relevant wiki page into the prompt. That was a retrieval problem the whole time, and no amount of training was ever going to solve it. It is the single most common and most expensive mistake I see people make with local LLMs, and it comes from a simple confusion about what each tool actually does.

When a model misbehaves there is a powerful temptation to reach straight for the heaviest implement in the shed. People hear “fine-tuning,” picture a model retrained on their data, and book a pile of GPU hours before they have worked out what the actual problem is. More often than not the result is wasted money and a model that is no better. Prompting, retrieval, and fine-tuning solve genuinely different problems, and choosing well saves you both effort and grief. This is the framework I wish I had been handed before that wasted weekend.

The Three Tools, Defined

Start by being precise about what each approach actually changes.

Prompting is everything you can do by writing better instructions. You clarify the task, give examples of the output you want (few-shot prompting), specify the format, set the tone, and provide guardrails — all in the text you send the model. You change nothing about the model itself; you simply ask better. It is the cheapest, fastest lever and the one most people under-use.
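To make few-shot prompting concrete, here is a minimal Python sketch of how a system instruction, a couple of worked examples, and the real question can be assembled into one chat message list. The helper name `build_few_shot_messages` and the two example tickets are invented for illustration; nothing here comes from a particular library.

```python
import json


def build_few_shot_messages(system_instruction, examples, user_input):
    """Assemble a chat message list: system rule, worked examples, then the real input."""
    messages = [{"role": "system", "content": system_instruction}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        # Each example answer is serialised exactly as we want the model to reply.
        messages.append({"role": "assistant", "content": json.dumps(example_output)})
    messages.append({"role": "user", "content": user_input})
    return messages


# Hypothetical worked examples, made up for this sketch.
FEW_SHOT_EXAMPLES = [
    ("I was charged twice this month.", {"category": "billing", "urgency": 3}),
    ("The whole dashboard is down for our team.", {"category": "outage", "urgency": 5}),
]

messages = build_few_shot_messages(
    "You are a support triage bot. Reply ONLY with valid JSON.",
    FEW_SHOT_EXAMPLES,
    "My invoice export is failing with a 500 error.",
)
print(json.dumps(messages, indent=2))
```

Showing the examples as earlier user and assistant turns, rather than describing them in the instruction, lets the model see the exact output shape you expect.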

RAG, or Retrieval-Augmented Generation, gives the model access to external knowledge at query time. You store your documents, retrieve the relevant passages when a question arrives, and include them in the prompt so the model answers from real, current, private data rather than from memory. It changes what the model knows in the moment, without changing the model.

Fine-tuning adjusts the model’s own weights by training it further on examples of the behaviour you want. It changes how the model behaves by default — its style, its format, its tone, its instinct for a particular kind of task. It is the only one of the three that permanently alters the model, and the only one that demands real training compute.

What Each Is Good At

Prompting shines when the model is capable but you have not asked clearly. If a few well-chosen examples and a precise instruction fix the output, you are done — for the price of writing a paragraph. It is also the right first move for almost any new problem, because it tells you how much of the gap is simply communication.

RAG shines when the issue is knowledge: the model needs facts it does not have. Internal documentation, the latest figures, customer-specific details, anything that postdates training or was never public. RAG keeps that knowledge fresh, because updating a document updates the answers, and it keeps answers citable, because you know which sources you supplied.

Fine-tuning shines when the issue is behaviour, format, or style, and prompting alone cannot make it consistent. If you need every response in a rigid JSON shape, or in a very specific house voice, or following a specialised classification scheme that no amount of instruction reliably enforces, fine-tuning bakes that behaviour in. It is also useful for squeezing strong performance from a smaller, cheaper model on a narrow task.

What Each Is Bad At

Each tool has a sharp edge, and knowing it is half the battle.

Prompting hits a ceiling. Stuffing dozens of examples into every request is wasteful and eventually unreliable, and there are behaviours no instruction can pin down firmly enough. When you find yourself with a sprawling, brittle mega-prompt, prompting is telling you it has run out of road.

RAG is only as good as its retrieval. Fetch the wrong passages and the model grounds its answer in the wrong facts, confidently. It adds moving parts — an embedding model, a vector store, a chunking strategy — and the latency of a lookup before every answer. It is the wrong tool for changing how the model writes; pasting documents into a prompt does not teach it your house style.

Fine-tuning is the costliest to get wrong, and the cardinal error is reaching for it to add knowledge. It is slow, it needs a curated dataset, it consumes GPU hours, and the moment your facts change the model is stale again with no easy fix. Fine-tuning teaches patterns of behaviour, not a reliable store of facts, and confusing the two is the single most expensive mistake in this whole field.

Cost and Effort, Compared

The three approaches sit on a clear ladder of cost and effort.

Prompting is cheapest by a wide margin: minutes of work, no infrastructure, no training, instant iteration. Change your mind and you simply rewrite the text.
RAG is moderate: you build and maintain a retrieval pipeline and pay a little extra latency and token cost per query, but there is no training run and updates are as easy as editing a document.
Fine-tuning is the heaviest: you must assemble a quality dataset, run training on real hardware, evaluate the result, and repeat when it disappoints. Every meaningful change means another training cycle.

The sensible instinct is to climb this ladder only as far as you must, and no further. Most problems are solved on the bottom two rungs.

What This Looks Like in Practice

To make the difference concrete, here is the same task tackled three ways against a locally hosted model. First, prompting, where you change nothing but the text you send:

curl http://192.168.1.50:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [
    {"role": "system", "content": "You are a support triage bot. Reply ONLY with valid JSON: {\"category\": ..., \"urgency\": 1-5}."},
    {"role": "user", "content": "My invoice export is failing with a 500 error."}
  ]
}'
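If you would rather script the same request than paste curl commands, a rough Python equivalent using only the standard library might look like the sketch below. It assumes the server speaks the Ollama chat API, as the port and path in the curl call suggest. The function names are my own, and `"stream": False` is an addition so the reply arrives as one JSON body rather than a stream of chunks. The demo lines build the request and parse a canned response without touching the network.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://192.168.1.50:11434/api/chat"  # host taken from the curl example

SYSTEM_PROMPT = (
    "You are a support triage bot. Reply ONLY with valid JSON: "
    '{"category": ..., "urgency": 1-5}.'
)


def build_chat_request(user_message, model="llama3.1:8b", url=OLLAMA_CHAT_URL):
    """Build (but do not send) the POST request for one chat turn."""
    payload = {
        "model": model,
        "stream": False,  # one JSON body instead of a stream of chunks
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a non-streaming chat response."""
    return json.loads(response_body)["message"]["content"]


def ask(user_message):
    """Send the request to the server and return the model's reply text."""
    with urllib.request.urlopen(build_chat_request(user_message), timeout=60) as response:
        return extract_reply(response.read())


# Offline demonstration: inspect the request and parse a canned response.
request = build_chat_request("My invoice export is failing with a 500 error.")
canned_response = json.dumps(
    {"message": {"role": "assistant", "content": '{"category": "export", "urgency": 3}'}}
)
print(request.full_url)
print(extract_reply(canned_response))
```

Calling `ask("My invoice export is failing with a 500 error.")` would send the real request, provided a server is listening at that address.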

Second, RAG — before the model answers, you retrieve the relevant docs and inject them. The pattern is always retrieve-then-stuff:

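# vectordb, embed and llm are placeholders for your own vector store, embedding function and chat client
chunks = vectordb.query(embed(user_question), top_k=4)  # the four passages closest to the question
context = "\n\n".join(c.text for c in chunks)  # stitch the passages into one block
prompt = f"Answer using ONLY the context below.\n\nContext:\n{context}\n\nQuestion: {user_question}"
answer = llm.chat(prompt)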
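The four lines above lean on a vector store and an embedding model that are not shown. For a version you can run end to end, here is a toy sketch that swaps in a bag-of-words count vector and cosine similarity for the real embedding model. That swap is purely an assumption to keep the example self-contained, and the three wiki snippets are invented. The retrieve-then-stuff shape is the same; only the quality of the "embedding" differs.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, documents, top_k=2):
    """Return the top_k documents most similar to the question."""
    question_vector = embed(question)
    ranked = sorted(documents, key=lambda doc: cosine(question_vector, embed(doc)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, documents, top_k=2):
    """Retrieve, then stuff the passages into the prompt."""
    context = "\n\n".join(retrieve(question, documents, top_k))
    return f"Answer using ONLY the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Hypothetical wiki snippets, invented for this sketch.
WIKI_DOCS = [
    "Annual leave: full-time staff receive 25 days of annual leave per year.",
    "Expenses: claims must be submitted within 30 days with a receipt.",
    "VPN: connect to the office VPN before opening the wiki.",
]

rag_prompt = build_prompt("How many days of annual leave do staff receive?", WIKI_DOCS, top_k=1)
print(rag_prompt)
```

The resulting string is what you would send to the model in place of the bare question.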

Third, fine-tuning — you assemble a dataset of input/output pairs and train. Even a lightweight LoRA run is a different order of effort:

# a JSONL training file: hundreds of {"prompt": ..., "completion": ...} examples
llamafactory-cli train \
  --model_name_or_path meta-llama/Llama-3.1-8B \
  --dataset triage_examples \
  --finetuning_type lora \
  --output_dir ./out/triage-lora
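The dataset is where most of the effort hides, so here is a minimal sketch of producing the JSONL file mentioned in the comment above. The `prompt` and `completion` field names follow that comment; the training tool you use may expect different keys or a separate dataset registration step, so treat the layout as an assumption and check its documentation. The two rows are invented, and a real set would need hundreds.

```python
import json
import os
import tempfile


def write_jsonl(examples, path):
    """Write one JSON object per line, refusing incomplete examples."""
    written = 0
    with open(path, "w", encoding="utf-8") as handle:
        for example in examples:
            if not example.get("prompt") or not example.get("completion"):
                raise ValueError(f"incomplete example: {example!r}")
            handle.write(json.dumps(example, ensure_ascii=False) + "\n")
            written += 1
    return written


# Hypothetical rows for illustration only.
TRIAGE_ROWS = [
    {
        "prompt": "My invoice export is failing with a 500 error.",
        "completion": json.dumps({"category": "export", "urgency": 3}),
    },
    {
        "prompt": "I was charged twice this month.",
        "completion": json.dumps({"category": "billing", "urgency": 3}),
    },
]

jsonl_path = os.path.join(tempfile.mkdtemp(), "triage_examples.jsonl")
row_count = write_jsonl(TRIAGE_ROWS, jsonl_path)
print(f"wrote {row_count} examples to {jsonl_path}")
```

Rejecting incomplete rows at write time is a cheap guard against the thin or messy dataset that undermines a training run.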

Notice the gradient of commitment. The first is a one-line edit; the second is a pipeline you build and maintain; the third is a dataset, a training run, and an evaluation loop. If you want either side spelled out properly, I have walked through a full retrieval build in my practical RAG pipeline guide, and the LoRA route in my piece on LoRA fine-tuning on consumer hardware.

A Decision Flow

When you are stuck, walk through these questions in order:

1. Does the model need fresh, private, or frequently changing facts? If yes, reach for RAG. This is the knowledge problem, and retrieval is its answer. Do not fine-tune facts in.
2. Do you need a consistently enforced output format, a specific domain tone, or a narrow specialised behaviour that prompting cannot pin down? If yes, consider fine-tuning. This is the behaviour problem.
3. Do you mostly just need clearer instructions or better examples? If yes (and this covers more cases than people admit), improve your prompt first. It is free, fast, and frequently sufficient.

Whichever answer fits, start at the bottom of the cost ladder. Try prompting before RAG, and RAG before fine-tuning, because each step up costs more and the cheaper rungs solve a surprising share of problems outright.
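The questions above are simple enough to write down as a function. This is only a sketch of the flow as I have described it: the function name and the two boolean inputs are my own shorthand, and real projects rarely reduce to two yes-or-no answers.

```python
def choose_tools(needs_external_facts, behaviour_survives_strict_prompt):
    """Return the approaches to try, cheapest first.

    needs_external_facts: the model lacks fresh, private, or changing facts.
    behaviour_survives_strict_prompt: format or tone is still inconsistent
        after a genuinely strict prompt with worked examples.
    """
    plan = ["prompting"]  # always the first rung of the ladder
    if needs_external_facts:
        plan.append("rag")  # knowledge problem: retrieve, do not train
    if behaviour_survives_strict_prompt:
        plan.append("fine-tuning")  # behaviour problem prompting could not fix
    return plan


for facts, behaviour in [(False, False), (True, False), (False, True), (True, True)]:
    print(facts, behaviour, "->", choose_tools(facts, behaviour))
```

Prompting appears in every plan because it is the cheap experiment that tells you whether the other two are needed at all.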

Combining Them

These are not rivals; they are layers, and the strongest systems use all three together. A typical production setup prompts well as a baseline, uses RAG to feed in the relevant private knowledge, and fine-tunes the underlying model so it reliably produces the house format and tone.

The division of labour is clean. Fine-tuning sets the model’s default behaviour — how it writes, what shape its output takes. RAG supplies the knowledge it reasons over for each specific query. Prompting orchestrates the whole exchange and handles per-request nuance. Far from competing, they cover different gaps, and a thoughtful combination beats any one of them used alone.

Common Mistakes

A few errors recur often enough to name plainly. The biggest, already flagged, is fine-tuning to add facts. It is expensive, it goes stale the moment the facts change, and the model still hallucinates around the edges. Use RAG for knowledge, every time.

The mirror-image mistake is reaching for RAG when the real problem is behaviour. If responses come out in the wrong format or the wrong voice, no quantity of retrieved documents will fix it; that is a fine-tuning or prompting job. Another classic is skipping prompting entirely and jumping to heavy machinery before discovering the issue was simply an unclear instruction — a costly way to learn you needed one good sentence. And finally, fine-tuning on a thin or messy dataset: the technique is only as good as its examples, and a small, noisy training set produces a model that is confidently worse than where you started.

Troubleshooting: How to Tell Which Problem You Actually Have

When you are genuinely unsure, run these diagnostics before spending anything.

Symptom: the model gets facts wrong or makes things up. Paste the correct facts directly into the prompt by hand and ask again. If the answer is now right, you have a knowledge problem and the answer is RAG, not training. If it is still wrong even with the facts in front of it, the model cannot reason over them — a larger model or better chunking is your lever, still not fine-tuning.

Symptom: the output format drifts. Sometimes it returns JSON, sometimes prose, sometimes JSON wrapped in an apology. First tighten the prompt: a hard system instruction plus one or two worked examples fixes this far more often than people expect. Only if the drift survives a genuinely strict prompt across dozens of trials is it a fine-tuning candidate. A common trap here is testing once, seeing failure, and concluding prompting “doesn’t work” — measure across a batch, not a single roll of the dice.
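Measuring across a batch is easy to automate. The sketch below scores a list of replies against the triage format used earlier (a JSON object with a string `category` and an integer `urgency` from 1 to 5). The sample replies are invented stand-ins for what you would collect by sending the same prompt a few dozen times, and the function names are my own.

```python
import json


def is_valid_triage(reply):
    """True if the reply is exactly one JSON object with the expected fields."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return False  # prose, or JSON wrapped in an apology
    if not isinstance(parsed, dict):
        return False
    urgency = parsed.get("urgency")
    return (
        isinstance(parsed.get("category"), str)
        and isinstance(urgency, int)
        and not isinstance(urgency, bool)  # bool is a subclass of int in Python
        and 1 <= urgency <= 5
    )


def format_pass_rate(replies):
    """Fraction of replies that match the required format."""
    if not replies:
        raise ValueError("need at least one reply to measure")
    return sum(is_valid_triage(reply) for reply in replies) / len(replies)


# Hypothetical replies collected from repeated runs of one prompt.
SAMPLE_REPLIES = [
    '{"category": "billing", "urgency": 3}',
    'Sorry about that! {"category": "billing", "urgency": 3}',
    '{"category": "outage", "urgency": 9}',
    '{"category": "export", "urgency": 4}',
]

print(f"pass rate: {format_pass_rate(SAMPLE_REPLIES):.0%}")
```

Run it before and after tightening the prompt; the change in pass rate tells you whether the prompt work was enough.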

Symptom: latency or token cost is unacceptable. If a sprawling few-shot prompt is correct but expensive, fine-tuning a smaller model to internalise that behaviour can pay off, because you stop paying for the examples on every call. This is one of the few cases where you climb the ladder for economics rather than capability.
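The economics come down to simple arithmetic: a one-off training cost set against the prompt tokens you stop sending on every call. Here is a rough break-even sketch. Every figure in the example is a hypothetical placeholder (the 40 borrows the training bill from my opening anecdote), and on your own hardware the "price" per thousand tokens is something you would have to estimate from time or electricity.

```python
import math


def calls_to_break_even(training_cost, price_per_1k_prompt_tokens,
                        few_shot_prompt_tokens, short_prompt_tokens):
    """Calls needed before shorter prompts repay the one-off training cost."""
    if price_per_1k_prompt_tokens <= 0:
        raise ValueError("price per 1k tokens must be positive")
    tokens_saved = few_shot_prompt_tokens - short_prompt_tokens
    if tokens_saved <= 0:
        raise ValueError("the fine-tuned prompt must be shorter than the few-shot one")
    saving_per_call = tokens_saved / 1000 * price_per_1k_prompt_tokens
    return math.ceil(training_cost / saving_per_call)


# Hypothetical figures, chosen only to show the calculation.
break_even_calls = calls_to_break_even(
    training_cost=40.0,
    price_per_1k_prompt_tokens=0.002,
    few_shot_prompt_tokens=2000,
    short_prompt_tokens=200,
)
print(f"break-even after {break_even_calls} calls")
```

If your real call volume is nowhere near the break-even figure, the sprawling prompt is still the cheaper option.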

Symptom: retrieval returns rubbish. Before blaming the model, log what your retriever actually fetched. More often than not the embedding model or chunk size is wrong, and the LLM is faithfully grounding its answer in irrelevant passages. Fix retrieval first; the generation almost always follows.
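Logging the fetched passages takes only a thin wrapper. The sketch below assumes your retriever is a callable that takes a question and `top_k` and returns `(score, text)` pairs. That interface, the `fake_retriever`, and its canned results are all hypothetical, so adapt the unpacking to whatever your vector store really returns.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(name)s %(message)s")
logger = logging.getLogger("rag.retrieval")


def logged_retrieve(retriever, question, top_k=4):
    """Call the retriever and log every fetched passage before generation."""
    results = retriever(question, top_k)
    logger.info("question=%r fetched=%d", question, len(results))
    for rank, (score, text) in enumerate(results, start=1):
        logger.info("rank=%d score=%.3f text=%r", rank, score, text[:80])
    return results


def fake_retriever(question, top_k):
    """Hypothetical retriever returning (score, text) pairs, best first."""
    canned = [
        (0.82, "Expenses: claims must be submitted within 30 days with a receipt."),
        (0.31, "VPN: connect to the office VPN before opening the wiki."),
        (0.12, "Annual leave: full-time staff receive 25 days of annual leave per year."),
    ]
    return canned[:top_k]


fetched = logged_retrieve(fake_retriever, "How many days of annual leave do I get?", top_k=2)
```

In this made-up run the log shows an expenses passage ranked first for a question about leave, which is exactly the kind of failure you want to spot before touching the model.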

The meta-lesson: every one of these starts with a cheap experiment that tells you which problem you have, so you never commit GPU hours on a hunch. If you are weighing up whether a home GPU is even worth it for this kind of work, that is a separate calculation I have made elsewhere about adding a desktop GPU to a mini PC.

Conclusion

Prompting, RAG, and fine-tuning are three distinct answers to three distinct questions: am I asking clearly, does the model have the right facts, and does it behave the right way by default? Match the tool to the problem — prompt for instructions, retrieve for knowledge, fine-tune for behaviour — and climb the cost ladder only as far as you genuinely need. Do that, and you will solve more problems, combine the tools where it counts, and never again burn a stack of GPU hours teaching a model facts you could simply have handed it. The right tool, chosen deliberately, is almost always the cheaper one.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #ai #finetuning #rag #guide