RAG Explained: How AI Stops Making Things Up

Giving language models a library card

07-04-2026

Imagine a brilliant colleague who has read most of the internet, speaks with unshakeable confidence, and occasionally invents a fact so smoothly that you only catch it because you happen to know the truth. That is a large language model on a bad day. It is not lying, exactly; it simply does not know what it does not know. Retrieval-Augmented Generation, or RAG, is the technique that hands that colleague a library card and a quiet instruction: before you answer, go and look it up. The result is an AI that grounds its words in real documents rather than in the foggy recollections of its training data.

Why Models Make Things Up

To understand the fix, you first need to understand the failure. A language model does not store facts the way a database stores rows. During training it absorbs enormous quantities of text and learns the statistical shape of language: which words tend to follow which, how arguments are structured, what a plausible sentence looks like. When you ask it a question, it does not retrieve an answer from a filing cabinet. It predicts, token by token, the most likely continuation of your prompt.
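To make "predicting the most likely continuation" concrete, here is a deliberately tiny sketch. It is an assumption-laden toy, not how a real language model works internally: a real model is a neural network over tokens, whereas this one just counts which word follows which. The names `train_bigrams` and `continue_prompt` and the three-sentence corpus are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(text: str) -> dict[str, Counter]:
    """Count, for each word, which words follow it in the training text."""
    follows: dict[str, Counter] = defaultdict(Counter)
    words = text.lower().split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def continue_prompt(follows: dict[str, Counter], prompt: str, max_words: int = 5) -> str:
    """Greedily append the most frequent next word, one word at a time."""
    words = prompt.lower().split()
    for _ in range(max_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # nothing ever followed this word in training
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical training text: Paris appears twice, Berlin only once.
corpus = "paris is in france . berlin is in germany . paris is in france ."
model = train_bigrams(corpus)

print(continue_prompt(model, "berlin is", max_words=2))
# prints: berlin is in france
```

The toy has seen the right answer for Berlin, yet "in france" is the more frequent continuation of "in", so that is what it writes. Fluent, plausible, and wrong.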

This is a marvellous trick for fluency and a terrible one for accuracy. If the model has seen the answer many times during training, the most plausible continuation usually is the correct one. But if you ask about something obscure, recent, or private — your company’s internal policy, last week’s news, the specifics of a niche library — the model has no relevant memory to draw on. Rather than shrug, it does what it always does: it generates the most plausible-sounding text. Plausible and true are not the same thing, and that gap is where hallucinations live.

There are two structural reasons for this. First, the model has no live knowledge. Its training was frozen at some cutoff date, and it is blissfully unaware of events since. Second, it has no concept of citation. It cannot point to where it learned something, because it did not learn discrete things; it learned patterns. Asking it to cite the source of a fact is a bit like asking someone to remember exactly which conversation taught them that Paris is in France.

The Library Card Analogy

Picture a sharp graduate student sitting an exam with no notes. Faced with a question they half-remember, they will write something confident and mostly right, padding the gaps with reasonable guesses. Now picture the same student sitting an open-book exam in a well-stocked library. Before answering, they walk to the right shelf, pull the relevant book, find the passage, and write an answer anchored to the page in front of them. Same student, same brain, dramatically better answers.

RAG turns the closed-book exam into an open-book one. The model’s reasoning ability stays exactly the same — what changes is that you place the relevant source material directly in front of it before it speaks. The intelligence was never the problem; the missing ingredient was access to the right facts at the right moment.

What RAG Actually Does

Retrieval-Augmented Generation is, despite the imposing name, a fairly simple sandwich of three actions. First you retrieve: given a user’s question, you search a collection of documents for the passages most likely to contain the answer. Then you augment: you take those passages and paste them into the prompt, alongside the original question and an instruction such as “answer using only the context below.” Finally you generate: the model produces its answer, now grounded in the supplied text rather than in its hazy internal memory.
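The augment step is mostly string assembly, which a short sketch makes plain. This is one possible layout, not a canonical one: the function name `build_prompt`, the numbered `[1]`, `[2]` labels, and the extra "say you do not know" instruction are my own assumptions, and the example passages are made up.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Paste retrieved passages into a prompt alongside the user's question."""
    # Number each passage so an answer can refer back to its source.
    numbered = "\n\n".join(
        f"[{i}] {passage.strip()}" for i, passage in enumerate(passages, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved passages for a hypothetical question.
prompt = build_prompt(
    "How many days of annual leave do staff get?",
    [
        "Staff receive 25 days of holiday entitlement per year.",
        "Unused holiday may be carried over until the end of March.",
    ],
)
print(prompt)
```

The resulting string is what you would send to whichever model you use; the model call itself is left out because it depends entirely on your provider.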

The elegance is that nothing about the model itself has to change. You are not retraining it, not editing its weights, not teaching it new facts in any permanent sense. You are simply being a good librarian — fetching the right book and laying it open at the right page — and letting the model do what it is already excellent at: reading and summarising.

Embeddings and Vector Search, Gently

The clever part is the retrieval step. Old-fashioned keyword search would look for documents containing the exact words in your question, but language is slippery. A user might ask about “annual leave” while the document says “holiday entitlement,” and a keyword search would miss the match entirely.

This is where embeddings come in. An embedding model reads a chunk of text and converts it into a list of numbers — a vector — that captures its meaning. Texts about similar ideas end up with similar vectors, sitting close together in a high-dimensional space, even when they share no words. You can think of it as a map where every passage gets a coordinate, and passages about the same topic are neighbours regardless of vocabulary.

To answer a question, you embed the question into the same space and then look for the document vectors nearest to it. That nearest-neighbour lookup is what a vector database is built to do quickly. “Annual leave” and “holiday entitlement” land in roughly the same neighbourhood, so the right passage surfaces even though the wording differs. Meaning, not spelling, drives the match.
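A small sketch of that nearest-neighbour lookup, using cosine similarity as the measure of closeness (a common choice, though not the only one). The three-dimensional vectors below are hand-made stand-ins I invented for illustration; a real embedding model would produce them and they would have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # assumes neither vector is all zeros


def nearest(query: list[float], vectors: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the k passage names whose vectors are closest to the query."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up coordinates: pretend the axes mean "time off", "money", "IT".
toy_vectors = {
    "holiday entitlement": [0.9, 0.1, 0.0],
    "expense claims": [0.1, 0.9, 0.1],
    "office wifi password": [0.0, 0.1, 0.9],
}
annual_leave_query = [0.8, 0.2, 0.1]  # pretend embedding of "annual leave"

print(nearest(annual_leave_query, toy_vectors))
# prints: ['holiday entitlement']
```

A vector database does the same job, but with indexes that avoid comparing the query against every stored vector.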

The Basic Pipeline

A working RAG system follows a tidy five-step pipeline, and it helps to hold the whole shape in your head:

1. Chunk. Break your documents into bite-sized passages, perhaps a few hundred words each. Whole documents are too large to embed meaningfully and too big to stuff into a prompt.
2. Embed. Run every chunk through the embedding model to produce its vector.
3. Store. Save those vectors, along with the original text, in a vector database.
4. Retrieve. When a question arrives, embed it and fetch the handful of nearest chunks.
5. Generate. Hand those chunks to the language model as context and ask for a grounded answer.
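The chunking step is the easiest of the five to show in code. Below is a minimal sketch that splits on whitespace into fixed-size word windows. The overlap between neighbouring chunks is my own addition, not something the pipeline requires; it is one common way to avoid cutting a sentence's context clean in half. The name `chunk_words` and the default sizes are assumptions.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words so that a passage cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))  # "w0 w1 ... w9"
print(chunk_words(sample, chunk_size=4, overlap=1))
# prints: ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Real systems often split on sentence or heading boundaries instead of raw word counts, which tends to give more coherent chunks at the cost of more code.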

Steps one to three happen ahead of time, and again whenever your documents change. Steps four and five happen live, every time a user asks something. The offline indexing is the slow, heavy lifting; the online query is fast.
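The offline/online split can be sketched as a single class with two methods, one for each phase. Everything here is a stand-in: `TinyIndex` is a hypothetical in-memory list rather than a vector database, and `embed` is a plain word-count vector rather than a real embedding model, so it matches shared words only and would not connect "annual leave" to "holiday entitlement".

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: a sparse word-count vector, NOT a semantic one."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TinyIndex:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.entries: list[tuple[Counter, str]] = []

    def add_chunks(self, chunks: list[str]) -> None:
        """Offline phase: embed each chunk and store vector plus text."""
        for chunk in chunks:
            self.entries.append((embed(chunk), chunk))

    def search(self, question: str, k: int = 2) -> list[str]:
        """Online phase: embed the question and return the k nearest chunks."""
        query = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]


# Offline: run once, and again when documents change.
index = TinyIndex()
index.add_chunks([
    "Staff receive 25 days of holiday entitlement per year.",
    "Expense claims must be filed within 30 days.",
    "The office wifi password rotates every month.",
])

# Online: run for every question.
top = index.search("How many days of holiday do staff get?", k=1)
print(top)
# prints: ['Staff receive 25 days of holiday entitlement per year.']
```

The returned chunks would then go into the prompt for the generate step, which is omitted because it depends on the model you call.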

What You Gain

The payoff is substantial. Your answers become fresh: update the documents, re-index the changed chunks, and the system reflects the change with no retraining required. They become citable: because you know exactly which chunks you fed the model, you can show users the sources and let them verify, which transforms a black box into something auditable. And they can draw on private data: your contracts, your wiki, your support tickets, all material the model never saw in training and never will. You keep your knowledge in your own store and merely lend it to the model one query at a time.

This combination is why RAG has become the default approach for building AI over proprietary or fast-moving information. It is cheaper than retraining, faster to update than fine-tuning, and far more trustworthy than asking a model to recall things from memory. If you are weighing it against the alternatives, I lay out the trade-offs in detail in “Fine-tuning vs prompting vs RAG: picking the right tool”. The short version is that fine-tuning teaches a model how to behave, while RAG gives it what to know, and most of the time the thing you actually lack is knowledge, not behaviour.

Where the Documents Come From

There is a step upstream of everything above that quietly decides how good your system can be: the quality of the documents you feed it. RAG can only retrieve what you have collected, and it can only ground answers in text that is accurate in the first place. Point it at a stale wiki full of contradictions and it will confidently cite the contradictions. The unglamorous work of curating, cleaning, and structuring your source material is where much of the real quality lives, and it is easy to skip because it feels like janitorial work rather than AI.

For some applications the documents need labelling or structuring before they are useful: deciding what counts as a distinct answerable unit, tagging passages by topic or freshness, marking which sources supersede which. That kind of human-in-the-loop preparation is a discipline of its own; I have written about self-hosting the tooling for it in “Label Studio: self-hosted data annotation”. The lesson that carries over is the same in both worlds: the model is only ever as good as the data you put in front of it, and no amount of clever retrieval rescues a corpus you have not tended.
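One way to picture that preparation is as metadata attached to each passage, which the indexing step can then filter on. The sketch below is purely illustrative: the `Passage` fields, the `superseded_by` convention, and the sample records are all assumptions of mine, not a schema from any particular tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Passage:
    passage_id: str
    text: str
    topic: str
    updated: date
    superseded_by: Optional[str] = None  # id of the newer passage, if any


def current_passages(passages: list[Passage], topic: Optional[str] = None) -> list[Passage]:
    """Keep only passages that nothing supersedes, optionally for one topic."""
    return [
        p for p in passages
        if p.superseded_by is None and (topic is None or p.topic == topic)
    ]


# Hypothetical labelled corpus: the 2023 policy was replaced by the 2025 one.
corpus = [
    Passage("leave-2023", "Staff receive 23 days of holiday.", "leave",
            date(2023, 1, 10), superseded_by="leave-2025"),
    Passage("leave-2025", "Staff receive 25 days of holiday.", "leave",
            date(2025, 1, 8)),
    Passage("wifi-2025", "The wifi password rotates monthly.", "it",
            date(2025, 3, 2)),
]

print([p.passage_id for p in current_passages(corpus, topic="leave")])
# prints: ['leave-2025']
```

Filtering before indexing means the stale policy can never be retrieved, which is sturdier than hoping the model picks the newer of two contradictory passages.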

Where It Strains

RAG is not magic, and its weak point is the retrieval step. If the search fetches the wrong passages, the model will faithfully ground its answer in irrelevant text and confidently mislead you. Garbage in, grounded garbage out. Retrieval quality is therefore the single most important thing to get right.

Chunking is a quiet art. Cut passages too small and you sever the context that gives them meaning; cut them too large and you dilute the relevant sentence among padding, or blow past the model’s limits. There is also the matter of the context window — the maximum amount of text a model can read at once. You cannot simply pour in fifty documents and hope; you must select the most relevant few, which loops you straight back to the quality of your retrieval. RAG shifts the hard problem from “what does the model know” to “can you find the right page,” and that is a problem worth having, because it is one you can actually solve.
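The context-window constraint turns into a simple budgeting loop: take chunks in ranked order until the budget runs out. In this sketch the token count is approximated by counting whitespace-separated words, which is a crude assumption; a real system would use the tokenizer that belongs to its model. The function name and the stop-at-first-misfit policy are also my own choices.

```python
def select_within_budget(ranked_chunks: list[str], max_tokens: int) -> list[str]:
    """Keep best-ranked chunks, in order, stopping at the first that will not fit."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())  # crude stand-in for a real token count
        if used + cost > max_tokens:
            break  # budget exhausted; lower-ranked chunks are dropped
        selected.append(chunk)
        used += cost
    return selected


# Chunks are assumed to arrive already sorted, most relevant first.
ranked = [
    "Staff receive 25 days of holiday entitlement per year.",   # 9 words
    "Unused holiday may be carried over until the end of March.",  # 11 words
    "The office wifi password rotates every month.",            # 7 words
]
print(len(select_within_budget(ranked, max_tokens=20)))
# prints: 2
```

Because the loop trusts the incoming order, a poor ranking wastes the budget on the wrong passages, which is exactly why retrieval quality matters so much.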

The usual remedies are worth knowing, because you will reach for them the moment a real system disappoints you. A reranker is a second, more careful model that re-scores the top candidates retrieval returned, promoting the genuinely relevant passages above the merely similar-sounding ones. Hybrid search blends the semantic vector match with old-fashioned keyword matching, which rescues the cases where an exact term — a product code, a name, a statute number — matters more than meaning. And you cannot improve what you do not measure: build a small evaluation set of real questions with known-good answers, and check what fraction of the time your retrieval actually surfaces the right passage. That single number tells you more about your system’s future accuracy than any amount of prompt-tweaking, because it measures the step where RAG succeeds or fails.
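Two of those remedies fit in a few lines each. For hybrid search, the sketch below merges a vector ranking and a keyword ranking with reciprocal rank fusion, which is one well-known way to blend rankings and my choice here, not the only option. For measurement, `hit_rate` computes the fraction of evaluation questions whose known-good passage appears in the retrieved results. The document ids, the fake retriever, and the constant `k=60` (a commonly used default) are assumptions for illustration.

```python
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; items ranked high in any list score well."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def hit_rate(
    eval_set: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
) -> float:
    """Fraction of (question, expected_id) pairs where retrieval finds expected_id."""
    if not eval_set:
        return 0.0
    hits = sum(1 for question, expected in eval_set if expected in retrieve(question))
    return hits / len(eval_set)


# Hypothetical rankings of passage ids from two different searches.
vector_ranking = ["a", "b", "c"]
keyword_ranking = ["c", "a", "d"]
print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))
# prints: ['a', 'c', 'b', 'd']

# Hypothetical retriever and a two-question evaluation set.
fake_results = {"q1": ["a", "b"], "q2": ["c"]}
score = hit_rate([("q1", "b"), ("q2", "x")], lambda q: fake_results.get(q, []))
print(score)
# prints: 0.5
```

Swap the fake retriever for your real one and the same `hit_rate` call gives you a number to track as you change chunk sizes, add a reranker, or switch on hybrid search.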

Conclusion

Retrieval-Augmented Generation does not make language models smarter; it makes them honest. By fetching real documents and laying them open before the model generates a word, RAG swaps confident guesswork for grounded, citable answers drawn from data you control. The model brings the reading and reasoning; you bring the library. Get the retrieval right and you turn a fluent improviser into a reliable researcher — one that, refreshingly, can show its working and point you at the page it worked from.

Frequently asked questions

What is RAG (Retrieval-Augmented Generation)?

RAG is a technique that has a language model look up real documents before it answers. It retrieves the passages most likely to contain the answer, pastes them into the prompt, and then asks the model to generate a response grounded in that supplied text rather than its internal memory.

Why do language models hallucinate or make things up?

A model does not store facts like a database; it predicts the most plausible continuation of your prompt token by token. When asked about something obscure, recent or private, it has no relevant memory, so it generates plausible-sounding text instead, and plausible is not the same as true.

How does RAG reduce hallucinations?

It turns a closed-book exam into an open-book one by placing the relevant source material in front of the model before it speaks. The model’s reasoning stays the same; it simply answers from real documents you control rather than from hazy training data.

What is the main weakness of RAG?

Its weak point is the retrieval step. If the search fetches the wrong passages, the model will faithfully ground its answer in irrelevant text and confidently mislead you, so retrieval quality is the single most important thing to get right.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #ai #rag #llm #explainer