Talking to Your Documents: A Practical RAG Pipeline with Open-Source Tools

Ask your PDFs anything

27-05-2026

There is a particular kind of frustration in knowing that the answer you need is somewhere in a forty-page PDF, and that finding it means reading all forty pages. Retrieval-Augmented Generation turns that pile of documents into something you can simply talk to. Ask a question in plain English, and the system finds the relevant passages and answers from them. The very best part is that you can build a working version yourself, on your own machine, using only open-source tools and a modest Python script. This guide walks through exactly that — a small but complete RAG pipeline that lets you interrogate your own documents.

The Shape of What You Are Building

Before any code, hold the plan in your head. A RAG system has an offline phase and an online phase. Offline, you read your documents, slice them into manageable chunks, convert each chunk into a numerical vector that captures its meaning, and store those vectors. Online, you take a user’s question, convert it into a vector too, find the stored chunks whose vectors sit nearest to it, paste those chunks into a prompt, and ask a language model to answer using them.
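To make "nearest" concrete, here is a minimal sketch of the online lookup in plain Python, ranking stored chunks by cosine similarity. The three-dimensional vectors and the chunk texts are invented for illustration; real embeddings have hundreds of dimensions, and a vector store runs this search for you using its own distance metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_chunks(query_vector, stored, top_k=2):
    """stored is a list of (chunk_text, vector) pairs; return the top_k texts."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]

# Toy data, made up for this example.
toy_index = [
    ("Remote work is allowed two days a week.", [0.9, 0.1, 0.0]),
    ("The canteen opens at noon.", [0.0, 0.2, 0.9]),
    ("Home-office equipment can be expensed.", [0.8, 0.3, 0.1]),
]
toy_query = [1.0, 0.2, 0.0]  # stands in for an embedded question about remote work

print(nearest_chunks(toy_query, toy_index))
```

The two work-from-home chunks rank above the canteen one because their vectors point in nearly the same direction as the query's.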

Every tool here is local and free. Ollama serves both an embedding model and a chat model on your own machine. Chroma is a lightweight open-source vector store that runs in-process with no separate server to babysit. And a short Python script ties the whole thing together. Nothing leaves your laptop, which is rather the point when your documents are private.

If you want the conceptual grounding before the code — what an embedding actually is, why vectors capture meaning, and where the whole approach hits its ceiling — I dug into that separately in semantic search on your own documents. And if you’re still deciding whether RAG is even the right tool, as opposed to fine-tuning a model or just writing a better prompt, fine-tuning vs prompting vs RAG is the argument I’d read first. This guide assumes you’ve landed on RAG and want something running today.

Installing the Pieces

Get your dependencies in place first:

pip install chromadb ollama pypdf

Then pull the two models you need from Ollama — one specialised for turning text into vectors, one for generating answers:

ollama pull nomic-embed-text
ollama pull llama3.1:8b

The embedding model nomic-embed-text is small and fast and exists purely to produce good vectors. The chat model does the actual answering. With both pulled and Ollama running, you have everything the script needs.

Loading and Chunking the Documents

A PDF is no use to the model as a single forty-page blob. You need to extract its text and cut it into chunks small enough to embed meaningfully and to fit several at a time into a prompt. Several hundred words per chunk is a sensible starting point (the code below uses 800), with a little overlap between consecutive chunks so that a sentence split across a boundary is not lost.

from pypdf import PdfReader

def load_pdf(path):
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return text

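def chunk_text(text, chunk_size=800, overlap=150):
    # A step of zero or less would never advance, so reject it up front.
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            # This window already reached the last word; stopping here avoids
            # a tiny tail chunk made only of words the previous chunk contains.
            break
        start += chunk_size - overlap
    return chunks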

This is deliberately simple: split on whitespace, take windows of 800 words, and step forward by 650 so that each chunk shares 150 words with the previous one. The overlap is your insurance against an important sentence being sliced clean in half at a boundary.
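To check the arithmetic, here is a small sketch that computes only the word offsets each window covers, without touching any text. The helper name chunk_spans is hypothetical, and it assumes the windowing stops as soon as a window reaches the last word.

```python
def chunk_spans(n_words, chunk_size=800, overlap=150):
    """Return the (start, end) word offsets a sliding window would cover."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    spans = []
    start = 0
    while start < n_words:
        end = min(start + chunk_size, n_words)
        spans.append((start, end))
        if end == n_words:
            break  # the window reached the final word
        start += step
    return spans

# A 2,000-word document with the defaults used in this guide.
print(chunk_spans(2000))
# [(0, 800), (650, 1450), (1300, 2000)]
```

Each start is 650 words after the previous one, and each window begins 150 words before the previous window ends, which is the overlap at work.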

Embedding and Storing

Now turn each chunk into a vector and put it in Chroma. Ollama exposes an embedding endpoint, and Chroma gives you a collection to store vectors alongside their original text.

import chromadb
import ollama

client = chromadb.PersistentClient(path="./vectordb")
collection = client.get_or_create_collection("documents")

def embed(text):
    # ollama.embed() is the current API; the old ollama.embeddings(prompt=...)
    # is deprecated. embed() takes `input` and returns a list under `embeddings`.
    response = ollama.embed(model="nomic-embed-text", input=text)
    return response["embeddings"][0]

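def index_document(path):
    chunks = chunk_text(load_pdf(path))
    for i, chunk in enumerate(chunks):
        # upsert rather than add: re-running the script overwrites entries
        # that already exist under the same id instead of colliding with them.
        collection.upsert(
            ids=[f"{path}-{i}"],
            embeddings=[embed(chunk)],
            documents=[chunk],
        )
    print(f"Indexed {len(chunks)} chunks from {path}")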

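Because the client is persistent, your index survives between runs: you index a document once, and it stays ready to query until the underlying document changes. Embedding is the slow step, so once a document is indexed you can drop or guard the index_document call in the script's main block rather than re-embedding on every run.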
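For larger documents, one call per chunk adds up. Here is a minimal sketch of batching, assuming a collection object with a Chroma-style upsert method and an embed_many callable that turns a list of strings into a list of vectors. Both names, and the batch size of 32, are illustrative choices rather than anything the libraries prescribe.

```python
def batched(items, size):
    """Yield (offset, slice) pairs covering items in steps of size."""
    for start in range(0, len(items), size):
        yield start, items[start:start + size]

def index_chunks(collection, embed_many, chunks, doc_id, batch_size=32):
    """Embed and store chunks in batches instead of one at a time.

    collection: any object with upsert(ids=..., embeddings=..., documents=...)
    embed_many: callable taking a list of strings, returning a list of vectors
    """
    for offset, batch in batched(chunks, batch_size):
        collection.upsert(
            ids=[f"{doc_id}-{offset + i}" for i in range(len(batch))],
            embeddings=embed_many(batch),
            documents=batch,
        )
    return len(chunks)
```

With the setup in this guide, embed_many could be a thin wrapper around ollama.embed, which accepts a list as its input and returns one vector per item under embeddings.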

Retrieving and Generating

With documents indexed, answering a question is the satisfying part. Embed the question, ask Chroma for the nearest chunks, build a prompt that includes them, and hand it to the chat model.

def answer(question, top_k=4):
    q_vector = embed(question)
    results = collection.query(
        query_embeddings=[q_vector],
        n_results=top_k,
    )
    context = "\n\n---\n\n".join(results["documents"][0])

    prompt = f"""Answer the question using only the context below.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

    response = ollama.chat(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["message"]["content"]

if __name__ == "__main__":
    index_document("handbook.pdf")
    print(answer("What is the policy on remote work?"))

That is the entire pipeline. Setting top_k=4 fetches the four nearest chunks; the prompt instructs the model to stick to that context and to admit ignorance rather than invent. The instruction to say “I don’t know” is not optional politeness: a vector search always returns its nearest chunks, relevant or not, so this line is what keeps a grounded system honest when nothing useful comes back.
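When you want to see what retrieval actually returned, or let the model cite its sources, it helps to label each chunk. The sketch below assumes the query result is a dict with ids, documents and distances, each a list holding one inner list per query; the helper name format_context and the sample values are made up for illustration.

```python
def format_context(results):
    """Label each retrieved chunk with its id and distance."""
    ids = results["ids"][0]
    docs = results["documents"][0]
    distances = results["distances"][0]
    blocks = []
    for chunk_id, doc, distance in zip(ids, docs, distances):
        blocks.append(f"[source: {chunk_id} | distance: {distance:.3f}]\n{doc}")
    return "\n\n---\n\n".join(blocks)

# Invented sample shaped like a single-query result.
sample_results = {
    "ids": [["handbook.pdf-3", "handbook.pdf-7"]],
    "documents": [[
        "Staff may work remotely two days a week.",
        "Requests go through your line manager.",
    ]],
    "distances": [[0.2134, 0.4871]],
}

print(format_context(sample_results))
```

The distance is mainly a debugging aid: a large gap between the first and last chunk suggests the later ones are padding. Drop it from the string if you only want source labels in the prompt.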

Tuning Chunk Size and Overlap

The numbers above are starting points, not gospel, and tuning them is where a mediocre RAG system becomes a good one. Chunk size is a balancing act. Small chunks are precise — they let retrieval pinpoint a single relevant fact — but they can lack the surrounding context that makes that fact meaningful. Large chunks carry more context but dilute the relevant sentence among padding, which both confuses retrieval and wastes room in your prompt.

Overlap guards against boundary effects. With no overlap, a definition that begins at the end of one chunk and finishes at the start of the next is split and half-lost. A 10-to-20 per cent overlap usually patches this cheaply. As for top_k, retrieving more chunks improves the odds of catching the answer but spends more of your context window and risks burying the signal in noise. Start with four, and adjust based on whether answers feel under- or over-informed.
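A quick back-of-the-envelope check shows how chunk size and top_k spend the context window. This is a rough sketch: the ratio of about 130 tokens per 100 English words is a rule-of-thumb assumption, not a measurement, and the overhead and answer reserve are arbitrary illustrative figures.

```python
def estimate_prompt_tokens(top_k, chunk_size_words,
                           tokens_per_100_words=130, overhead_tokens=100):
    """Rough token count for a prompt holding top_k full-size chunks.

    tokens_per_100_words is an assumed rule of thumb for English text;
    overhead_tokens stands in for the instructions and the question.
    """
    context_tokens = top_k * chunk_size_words * tokens_per_100_words // 100
    return context_tokens + overhead_tokens

def fits_context(top_k, chunk_size_words, context_window, reserve_for_answer=512):
    """True if the estimated prompt plus room for the answer fits the window."""
    needed = estimate_prompt_tokens(top_k, chunk_size_words) + reserve_for_answer
    return needed <= context_window

for window in (2048, 4096, 8192):
    print(window, fits_context(top_k=4, chunk_size_words=800, context_window=window))
# 2048 False
# 4096 False
# 8192 True
```

If the estimate does not fit, shrink chunk_size or top_k, or raise the window. With Ollama the window is set through the num_ctx option, and its default depends on your Ollama version, so it is worth checking rather than assuming.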

Choosing the Models

The two models do very different jobs, and picking them well matters more than any single line of code. The embedding model decides retrieval quality — it maps text to vectors, and if that mapping is poor, no chat model can rescue an answer built on the wrong chunks. nomic-embed-text is a sensible default: small, fast, and good enough that retrieval is rarely the bottleneck for general documents. If your corpus is specialised — legal, medical, code — it is worth trying a couple of embedding models and measuring which retrieves better on your questions, because general-purpose embeddings can miss domain jargon.

The chat model decides how good the final answer reads once the right context is in front of it. Here the trade-off is size against speed: llama3.1:8b runs comfortably on a machine with a modest GPU or even a capable CPU, and for grounded question-answering — where the model is summarising supplied context rather than reasoning from scratch — an 8-billion-parameter model is often plenty. Reach for a larger model only when answers are correct-but-clumsy despite good retrieval. One practical note: keep the embedding model fixed once you’ve indexed. Change it and every stored vector is now in a different space from your queries, and you must re-index the entire corpus. Swapping the chat model, by contrast, is free — it never touches the index.
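One cheap safeguard is to record which embedding model built the index and compare it before querying. The sketch below works on a plain metadata dict; the key name embedding_model and the helper itself are hypothetical, and where you keep that metadata (Chroma collections accept a metadata dict at creation, or a small file beside the index would do) is up to you.

```python
def check_embedding_model(collection_metadata, current_model):
    """Compare the model recorded at indexing time with the one in use now.

    Returns "ok" on a match and "unrecorded" if nothing was stored;
    raises RuntimeError on a mismatch, since query vectors would then live
    in a different space from the stored ones.
    """
    recorded = (collection_metadata or {}).get("embedding_model")
    if recorded is None:
        return "unrecorded"
    if recorded != current_model:
        raise RuntimeError(
            f"Index was built with {recorded!r} but queries use "
            f"{current_model!r}; re-index before querying."
        )
    return "ok"

print(check_embedding_model({"embedding_model": "nomic-embed-text"}, "nomic-embed-text"))
# ok
```

Failing loudly here is deliberate: a mismatch does not crash anything by itself, it just returns plausible-looking noise.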

Evaluation and the Pitfalls

How do you know it works? Assemble a small set of questions for which you already know the correct answers, run them through, and read the results critically. When an answer is wrong, diagnose why. Did retrieval fetch the wrong chunks? Then the problem is your embedding or chunking, not the language model. Did it fetch the right chunks but the model still answered poorly? Then look at your prompt or reach for a stronger chat model.
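A first pass at that diagnosis can be automated by checking retrieval on its own, before the chat model is involved. This is a minimal sketch: retrieval_hit_rate is a hypothetical helper, and it assumes you can name a phrase that must appear in a correct chunk for each question, which is a crude but useful proxy.

```python
def retrieval_hit_rate(retrieve, test_cases, top_k=4):
    """Measure how often retrieval surfaces a chunk containing a known phrase.

    retrieve: callable (question, top_k) -> list of chunk strings
    test_cases: list of (question, expected_phrase) pairs
    Returns (hit_rate, list_of_missed_questions).
    """
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    misses = []
    for question, expected_phrase in test_cases:
        chunks = retrieve(question, top_k)
        found = any(expected_phrase.lower() in chunk.lower() for chunk in chunks)
        if not found:
            misses.append(question)
    hit_rate = (len(test_cases) - len(misses)) / len(test_cases)
    return hit_rate, misses
```

With the pipeline in this guide, retrieve could wrap the embed-and-query steps from answer() and return results["documents"][0]. Questions that land in the missed list point at chunking or embedding; questions that hit but still get a bad answer point at the prompt or the chat model.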

The cardinal pitfall is the oldest one in computing: garbage in, garbage out. If your PDFs extract as mangled text — scanned images with no real text layer, tables flattened into nonsense, headers and footers polluting every chunk — no amount of clever retrieval will save you. Clean extraction matters more than people expect. A scanned document with no text layer needs OCR (Tesseract, or a document-parsing library) before it ever reaches pypdf, and a table that matters is often better pulled out and stored as structured text than left to flatten into word-salad.
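Two of these problems can be caught with a few lines before anything is embedded. The sketch below works on a list of per-page strings, such as the values returned by extract_text for each page; the helper names and thresholds (50 characters, 60 per cent of pages) are illustrative guesses to tune, not recommended values.

```python
from collections import Counter

def pages_needing_ocr(page_texts, min_chars=50):
    """Return 1-based page numbers whose extracted text is nearly empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]

def strip_repeated_lines(page_texts, min_fraction=0.6):
    """Drop lines that recur on most pages, which are usually headers or footers."""
    counts = Counter()
    for text in page_texts:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in (text or "").splitlines() if line.strip()})
    threshold = max(2, int(len(page_texts) * min_fraction))
    repeated = {line for line, seen in counts.items() if seen >= threshold}
    return [
        "\n".join(
            line for line in (text or "").splitlines() if line.strip() not in repeated
        )
        for text in page_texts
    ]
```

A page flagged by pages_needing_ocr is probably a scan with no text layer. The header filter only catches lines that repeat verbatim, so numbered footers such as "Page 3" need a separate pattern.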

The second great pitfall is assuming retrieval is solved. It is the part that most rewards patient tuning, and the part most likely to be quietly letting you down when answers disappoint. Treat retrieval quality as your prime suspect whenever the system underwhelms.

A few concrete failure modes and their fixes:

Every answer is “I don’t know”, even for questions clearly covered. Retrieval is fetching the wrong chunks. Print results["documents"][0] and read what actually came back — if it’s irrelevant, your embedding model or chunking is at fault, not the LLM. Check you’re embedding the query with the same model you embedded the documents; a mismatch produces vectors in incompatible spaces and retrieval returns noise.
Answers are confidently wrong and cite nothing. The model is ignoring your context and answering from its own training. Strengthen the prompt’s instruction to use only the supplied context, lower the temperature, and consider asking the model to quote the passage it relied on so you can spot when it’s improvising.
The index seems stale after you edit a document. Chroma doesn’t know your PDF changed. You must delete the old chunks for that document (they share an id prefix; f"{path}-{i}" above makes them easy to find) and re-index. Re-indexing on top of the old entries is not enough: if the edited document yields fewer chunks, the leftover higher-numbered chunks from the old version stay in the index and poison retrieval.
First query is painfully slow. Ollama loads each model into memory on first use. Send one throwaway embed call and one throwaway chat call at startup to warm both models, and subsequent queries skip the load time. Ollama unloads idle models after a while, so the delay can return after a long pause.
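For the stale-index case, the clean-up can be sketched as follows. It assumes a Chroma-style collection whose get(include=[]) returns a dict with an ids list and whose delete accepts ids; the helper name delete_document_chunks is hypothetical, and fetching every id is only reasonable for a small, personal index.

```python
def delete_document_chunks(collection, path):
    """Remove every chunk whose id starts with the document's prefix.

    Returns the number of chunks deleted.
    """
    prefix = f"{path}-"
    # include=[] asks for ids only, skipping documents and embeddings.
    all_ids = collection.get(include=[])["ids"]
    stale = [chunk_id for chunk_id in all_ids if chunk_id.startswith(prefix)]
    if stale:
        collection.delete(ids=stale)
    return len(stale)
```

Call it just before index_document(path) when a file has changed. Prefix matching can over-match if one filename is itself a prefix of another, so storing the source path in each chunk's metadata and deleting by a metadata filter is the sturdier option once the index grows.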

Where This Goes Next

The script above is the honest core of RAG, and it is deliberately bare so you can see every moving part. Once it works, the obvious next steps are a nicer front end and better retrieval. Rather than build a chat UI from scratch, you can bolt this logic onto an existing local-LLM interface — I wire retrieval into a chat interface via Open WebUI pipelines, which turns “a Python script I run in a terminal” into “a chat box the household can use”. From there, adding a re-ranking step, source citations in the answer, and hybrid keyword-plus-vector search are each an afternoon’s work that measurably improves results.
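To give a flavour of the hybrid step, here is a sketch of reciprocal rank fusion, one common way to merge a vector ranking with a keyword ranking. It is an illustration of the idea rather than the method any particular tool uses; the chunk ids are invented, and k=60 is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one.

    Each id scores 1 / (k + rank) in every list it appears in; ids ranked
    well by more than one retriever rise to the top. Ties break by id.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

vector_ranking = ["c3", "c1", "c7"]   # nearest chunks by embedding
keyword_ranking = ["c1", "c9", "c3"]  # best matches from a keyword search

print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))
# ['c1', 'c3', 'c9', 'c7']
```

Chunks c1 and c3 appear in both lists and so outrank c9 and c7, which only one retriever found.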

Conclusion

You have built a complete RAG pipeline: load and chunk PDFs, embed them with a local model, store the vectors in Chroma, retrieve the most relevant chunks for any question, and generate a grounded answer with a local LLM — all on your own machine, with open-source tools and a script you can read in one sitting. The skeleton here is genuinely usable, and it is also a foundation. Improve the chunking, swap in better models, tune top_k, add source citations, and you steadily move from a weekend toy towards something you would trust with real work. The hardest part was never the code; it is the patient tuning of retrieval. Get that right and your documents will finally answer back.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #ai #rag #embeddings #howto