RAG Explained: How AI Stops Making Things Up
Giving language models a library card

Imagine a brilliant colleague who has read most of the internet, speaks with unshakeable confidence, and occasionally invents a fact so smoothly that you only catch it because you happen to know the truth. That is a large language model on a bad day. It is not lying, exactly; it simply does not know what it does not know. Retrieval-Augmented Generation, or RAG, is the technique that hands that colleague a library card and a quiet instruction: before you answer, go and look it up. The result is an AI that grounds its words in real documents rather than in the foggy recollections of its training data.
1 Why Models Make Things Up
To understand the fix, you first need to understand the failure. A language model does not store facts the way a database stores rows. During training it absorbs enormous quantities of text and learns the statistical shape of language: which words tend to follow which, how arguments are structured, what a plausible sentence looks like. When you ask it a question, it does not retrieve an answer from a filing cabinet. It predicts, token by token, the most likely continuation of your prompt.
This is a marvellous trick for fluency and a terrible one for accuracy. If the model has seen the answer many times during training, the most plausible continuation usually is the correct one. But if you ask about something obscure, recent, or private — your company’s internal policy, last week’s news, the specifics of a niche library — the model has no relevant memory to draw on. Rather than shrug, it does what it always does: it generates the most plausible-sounding text. Plausible and true are not the same thing, and that gap is where hallucinations live.
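To make the gap between plausible and true concrete, here is a deliberately tiny sketch. It assumes a toy "model" that only counts which word follows which in a three-sentence corpus, which is far cruder than a real neural language model; the function names and the corpus are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    words = text.lower().split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def continue_text(follows, prompt, max_words=5):
    """Greedily append the most frequent follower of the last word."""
    words = prompt.lower().split()
    for _ in range(max_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # the toy model has never seen this word, so it stops
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

# Toy corpus (an assumption): "france" follows "in" more often than "italy".
corpus = "paris is in france . rome is in italy . paris is in france ."
model = train_bigrams(corpus)

print(continue_text(model, "rome is", max_words=2))  # rome is in france
```

The toy model has seen "in france" more often than "in italy", so it completes "rome is" with the statistically likelier phrase and gets the fact wrong. Real models are vastly more sophisticated, but the failure has the same flavour.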
There are two structural reasons for this. First, the model has no live knowledge. Its training was frozen at some cutoff date, and it has been blissfully unaware of events since. Second, it has no concept of citation. It cannot point to where it learned something, because it did not learn discrete things; it learned patterns. Asking it where it learned a fact is a bit like asking someone to remember exactly which conversation taught them that Paris is in France.
2 The Library Card Analogy
Picture a sharp graduate student sitting an exam with no notes. Faced with a question they half-remember, they will write something confident and mostly-right, padding the gaps with reasonable guesses. Now picture the same student sitting an open-book exam in a well-stocked library. Before answering, they walk to the right shelf, pull the relevant book, find the passage, and write an answer anchored to the page in front of them. Same student, same brain, dramatically better answers.
RAG turns the closed-book exam into an open-book one. The model’s reasoning ability stays exactly the same — what changes is that you place the relevant source material directly in front of it before it speaks. The intelligence was never the problem; the missing ingredient was access to the right facts at the right moment.
3 What RAG Actually Does
Retrieval-Augmented Generation is, despite the imposing name, a fairly simple sandwich of three actions. First you retrieve: given a user’s question, you search a collection of documents for the passages most likely to contain the answer. Then you augment: you take those passages and paste them into the prompt, alongside the original question and an instruction such as “answer using only the context below.” Finally you generate: the model produces its answer, now grounded in the supplied text rather than in its hazy internal memory.
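The augment step is mostly string assembly. The sketch below shows one way it might look; `build_prompt` is a hypothetical helper, the numbered passage labels are a convention chosen here, and the extra instruction to admit when the context lacks the answer is an assumption rather than part of any standard.

```python
def build_prompt(question, passages):
    """Paste retrieved passages into the prompt ahead of the question."""
    # Number each passage so an answer can refer back to it.
    context = "\n\n".join(
        f"[{i}] {passage}" for i, passage in enumerate(passages, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Made-up passage, purely for illustration.
prompt = build_prompt(
    "How much holiday do I get?",
    ["Holiday entitlement is 25 days per year."],
)
print(prompt)
```

The string that comes out is what the model actually sees: instruction, context, question, in that order.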
The elegance is that nothing about the model itself has to change. You are not retraining it, not editing its weights, not teaching it new facts in any permanent sense. You are simply being a good librarian — fetching the right book and laying it open at the right page — and letting the model do what it is already excellent at: reading and summarising.
4 Embeddings and Vector Search, Gently
The clever part is the retrieval step. Old-fashioned keyword search would look for documents containing the exact words in your question, but language is slippery. A user might ask about “annual leave” while the document says “holiday entitlement,” and a keyword search would miss the match entirely.
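The miss is easy to reproduce. This sketch assumes the simplest possible keyword matcher, one that requires every query word to appear in the document; the policy sentence is made up for the example.

```python
def keyword_match(query, document):
    """Return True only if every query word appears in the document."""
    doc_words = set(document.lower().split())
    return all(word in doc_words for word in query.lower().split())

# Made-up policy text for illustration.
policy = "Holiday entitlement is 25 days per year."

print(keyword_match("annual leave", policy))         # False
print(keyword_match("holiday entitlement", policy))  # True
```

Both queries mean the same thing to a human reader, yet only the one that happens to reuse the document's wording finds it.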
This is where embeddings come in. An embedding model reads a chunk of text and converts it into a list of numbers — a vector — that captures its meaning. Texts about similar ideas end up with similar vectors, sitting close together in a high-dimensional space, even when they share no words. You can think of it as a map where every passage gets a coordinate, and passages about the same topic are neighbours regardless of vocabulary.
To answer a question, you embed the question into the same space and then look for the document vectors nearest to it. That nearest-neighbour lookup is what a vector database is built to do quickly. “Annual leave” and “holiday entitlement” land in roughly the same neighbourhood, so the right passage surfaces even though the wording differs. Meaning, not spelling, drives the match.
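Here is a minimal sketch of that nearest-neighbour lookup. It assumes cosine similarity as the closeness measure, which is one common choice among several, and it uses hand-written three-number vectors in place of a real embedding model, whose vectors would have hundreds or thousands of dimensions. All the texts and numbers are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, index, k=1):
    """Return the k texts whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Pretend embeddings (assumptions): the first axis loosely means "time off",
# the second "money", the third "office logistics".
index = [
    ("Holiday entitlement is 25 days per year.", [0.9, 0.1, 0.0]),
    ("Expenses must be filed within 30 days.", [0.1, 0.9, 0.1]),
    ("The office closes at 6 pm.", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "annual leave"

print(nearest(query_vec, index))  # ['Holiday entitlement is 25 days per year.']
```

This version scans every stored vector, which is fine for three passages. A vector database exists to make the same lookup fast across millions of them.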
5 The Basic Pipeline
A working RAG system follows a tidy five-step pipeline, and it helps to hold the whole shape in your head:
- Chunk. Break your documents into bite-sized passages, perhaps a few hundred words each. Whole documents are too large to embed meaningfully and too big to stuff into a prompt.
- Embed. Run every chunk through the embedding model to produce its vector.
- Store. Save those vectors, along with the original text, in a vector database.
- Retrieve. When a question arrives, embed it and fetch the handful of nearest chunks.
- Generate. Hand those chunks to the language model as context and ask for a grounded answer.
Steps one to three happen offline, ahead of time, and are repeated whenever your documents change. Steps four and five happen live, every time a user asks something. The offline indexing is the slow, heavy lifting; the online query is fast.
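The five steps can be sketched end to end in a few dozen lines. Everything here is a stand-in: `embed` is a toy word-count vector rather than a real embedding model (so it matches on shared words, not meaning), the "vector database" is a plain Python list, and `stub_llm` simply returns the prompt instead of calling a model. The shape of the pipeline is the point, not the components.

```python
import math
import re
from collections import Counter

def chunk(text, max_words=8):
    """Step 1: split a document into passages of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Step 2 (toy stand-in): a bag-of-words count vector, not a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(documents):
    """Steps 1-3, run offline: chunk, embed, and store text next to its vector."""
    return [(passage, embed(passage)) for doc in documents for passage in chunk(doc)]

def retrieve(index, question, k=1):
    """Step 4, run per query: embed the question and fetch the k nearest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: similarity(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def generate(question, passages, llm):
    """Step 5: hand the chunks to the model as context."""
    context = "\n".join(passages)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)

def stub_llm(prompt):
    """Hypothetical placeholder for a real model call; echoes the prompt."""
    return prompt

# Made-up documents for illustration.
documents = [
    "Holiday entitlement is 25 days per year.",
    "Expenses must be filed within 30 days.",
]
index = build_index(documents)  # offline, repeated when documents change

question = "How many days of holiday entitlement do I get?"
passages = retrieve(index, question)  # online, per query
print(generate(question, passages, stub_llm))
```

Swapping `embed` for a real embedding model, the list for a vector database, and `stub_llm` for a real model call would leave the five functions and their order unchanged.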
6 What You Gain
The payoff is substantial. Your answers become fresh: update the documents, re-index the changed chunks, and the system reflects the change on the next query, no retraining required. They become citable: because you know exactly which chunks you fed the model, you can show users the sources and let them verify, which transforms a black box into something auditable. And they can draw on private data such as your contracts, your wiki, and your support tickets, material the model never saw in training and never will. You keep your knowledge in your own store and merely lend it to the model one query at a time.
This combination is why RAG has become the default approach for building AI over proprietary or fast-moving information. It is cheaper than retraining, faster to update than fine-tuning, and far more trustworthy than asking a model to recall things from memory.
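Citability falls out almost for free if each chunk carries a note of where it came from. The sketch below assumes a hypothetical `Chunk` record with a `source` field and a helper that appends a de-duplicated source list to the answer; the file name and answer text are invented.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str  # hypothetical identifier, such as a file name or page title

def with_citations(answer, used_chunks):
    """Append the distinct sources of the chunks that were fed to the model."""
    sources = []
    for chunk in used_chunks:
        if chunk.source not in sources:  # keep first-seen order, drop repeats
            sources.append(chunk.source)
    lines = [answer, "", "Sources:"] + [f"- {source}" for source in sources]
    return "\n".join(lines)

# Made-up chunks and answer for illustration.
used = [
    Chunk("Holiday entitlement is 25 days per year.", "hr-handbook.txt"),
    Chunk("Unused holiday may not be carried over.", "hr-handbook.txt"),
]
print(with_citations("You get 25 days of holiday per year.", used))
```

Note what this does and does not prove: the list shows which passages the model was given, not that it used them faithfully, so the user's ability to check the source is still the real safeguard.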
7 Where It Strains
RAG is not magic, and its weak point is the retrieval step. If the search fetches the wrong passages, the model will faithfully ground its answer in irrelevant text and confidently mislead you. Garbage in, grounded garbage out. Retrieval quality is therefore the single most important thing to get right.
Chunking is a quiet art. Cut passages too small and you sever the context that gives them meaning; cut them too large and you dilute the relevant sentence among padding, or blow past the model’s limits. There is also the matter of the context window — the maximum amount of text a model can read at once. You cannot simply pour in fifty documents and hope; you must select the most relevant few, which loops you straight back to the quality of your retrieval. RAG shifts the hard problem from “what does the model know” to “can you find the right page,” and that is a problem worth having, because it is one you can actually solve.
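Two of these pressures can be sketched in code. The first function assumes one common mitigation for severed context, overlapping chunks, which the discussion above does not prescribe; the second picks the best-ranked chunks that fit a size budget. Both count words as a rough stand-in for tokens, which is an assumption: real systems measure the context window in tokens using the model's own tokenizer.

```python
def chunk_with_overlap(words, size, overlap):
    """Split a word list into windows of `size` words sharing `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    if not words:
        return []
    step = size - overlap
    # Stop once the remaining words are already covered by the previous window.
    return [words[i:i + size] for i in range(0, max(len(words) - overlap, 1), step)]

def select_within_budget(ranked_chunks, max_words):
    """Take chunks in ranked order until the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        length = len(chunk.split())
        if used + length > max_words:
            break
        selected.append(chunk)
        used += length
    return selected

words = "a b c d e f g h i j".split()
print(chunk_with_overlap(words, size=4, overlap=1))
# [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]

ranked = ["one two three", "four five", "six seven eight"]
print(select_within_budget(ranked, max_words=5))
# ['one two three', 'four five']
```

The budget function only works as well as the ranking it is handed: if the best chunk is ranked fourth, it may never make the cut.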
8 Conclusion
Retrieval-Augmented Generation does not make language models smarter; it makes them honest. By fetching real documents and laying them open before the model generates a word, RAG swaps confident guesswork for grounded, citable answers drawn from data you control. The model brings the reading and reasoning; you bring the library. Get the retrieval right and you turn a fluent improviser into a reliable researcher — one that, refreshingly, can show its working.