Talking to Your Documents: A Practical RAG Pipeline with Open-Source Tools
Ask your PDFs anything

There is a particular kind of frustration in knowing that the answer you need is somewhere in a forty-page PDF, and that finding it means reading all forty pages. Retrieval-Augmented Generation turns that pile of documents into something you can simply talk to. Ask a question in plain English, and the system finds the relevant passages and answers from them. The very best part is that you can build a working version yourself, on your own machine, using only open-source tools and a modest Python script. This guide walks through exactly that — a small but complete RAG pipeline that lets you interrogate your own documents.
1 The Shape of What You Are Building
Before any code, hold the plan in your head. A RAG system has an offline phase and an online phase. Offline, you read your documents, slice them into manageable chunks, convert each chunk into a numerical vector that captures its meaning, and store those vectors. Online, you take a user’s question, convert it into a vector too, find the stored chunks whose vectors sit nearest to it, paste those chunks into a prompt, and ask a language model to answer using them.
Every tool here is local and free. Ollama serves both an embedding model and a chat model on your own machine. Chroma is a lightweight open-source vector store that runs in-process with no separate server to babysit. And a short Python script ties the whole thing together. Nothing leaves your laptop, which is rather the point when your documents are private.
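The two phases can be sketched in a few lines before any real model is involved. The example below is a toy under stated assumptions: `toy_embed` is a hypothetical stand-in that counts words instead of calling an embedding model, and the three chunks are invented sample text. It shows only the shape of the pipeline: index once, then rank stored chunks against each question.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Offline phase: embed every chunk once and keep (vector, text) pairs.
chunks = [
    "remote work is allowed two days per week",
    "expenses must be filed within thirty days",
    "the office closes at six in the evening",
]
index = [(toy_embed(c), c) for c in chunks]

# Online phase: embed the question and rank stored chunks by similarity.
def retrieve(question, top_k=1):
    q = toy_embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [text for _, text in ranked[:top_k]]

print(retrieve("how many days of remote work"))
```

A real embedding model replaces word counts with vectors that capture meaning, so a question can match a chunk even when the two share no words, but the index-then-rank structure stays the same.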
2 Installing the Pieces
Get your dependencies in place first:

    pip install chromadb ollama pypdf

Then pull the two models you need from Ollama, one specialised for turning text into vectors and one for generating answers:

    ollama pull nomic-embed-text
    ollama pull llama3.1:8b

The embedding model nomic-embed-text is small and fast and exists purely to produce good vectors. The chat model does the actual answering. With both pulled and Ollama running, you have everything the script needs.
3 Loading and Chunking the Documents
A PDF is no use to the model as a single forty-page blob. You need to extract its text and cut it into chunks small enough to embed meaningfully and to fit several at a time into a prompt. Several hundred words per chunk is a sensible starting point (the code below uses 800), with a little overlap between consecutive chunks so that a sentence split across a boundary is not lost.
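
    from pypdf import PdfReader

    def load_pdf(path):
        """Extract the text of every page and join it into one string."""
        reader = PdfReader(path)
        # extract_text() may yield nothing for image-only pages, hence the `or ""`
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        return text

    def chunk_text(text, chunk_size=800, overlap=150):
        """Split text into overlapping windows of whitespace-separated words."""
        if not 0 <= overlap < chunk_size:
            # Otherwise the window would never advance and the loop would not end
            raise ValueError("overlap must be smaller than chunk_size")
        words = text.split()
        chunks = []
        start = 0
        while start < len(words):
            end = start + chunk_size
            chunks.append(" ".join(words[start:end]))
            if end >= len(words):
                break  # this window already reached the end of the text
            start += chunk_size - overlap
        return chunks

This is deliberately simple: split on whitespace, take windows of 800 words, and step forward by 650 so that each chunk shares 150 words with the previous one. The loop stops as soon as a window reaches the end of the text, and the guard at the top rejects an overlap that would keep the window from advancing. The overlap is your insurance against an important sentence being sliced clean in half at a boundary.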
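To make the stepping arithmetic concrete, here is a small sketch that computes only the window boundaries for a document of a given length. `window_bounds` is a hypothetical helper written for illustration, not part of the script, and the 1,500-word document is an invented example.

```python
def window_bounds(n_words, chunk_size=800, overlap=150):
    """Return (start, end) word indices for each overlapping window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    bounds = []
    start = 0
    while start < n_words:
        end = min(start + chunk_size, n_words)
        bounds.append((start, end))
        if end == n_words:
            break  # the text is fully covered
        start += step
    return bounds

# Print start, end, and length of each window for a 1,500-word document.
for start, end in window_bounds(1500):
    print(start, end, end - start)
```

For 1,500 words this yields windows starting at words 0, 650, and 1300. The second window begins 150 words before the first one ends, which is exactly the shared region the overlap is meant to provide.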
4 Embedding and Storing
Now turn each chunk into a vector and put it in Chroma. Ollama exposes an embedding endpoint, and Chroma gives you a collection to store vectors alongside their original text.
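
    import chromadb
    import ollama

    # PersistentClient writes the index to disk so it survives between runs
    client = chromadb.PersistentClient(path="./vectordb")
    collection = client.get_or_create_collection("documents")

    def embed(text):
        """Return the embedding vector for a piece of text."""
        response = ollama.embeddings(model="nomic-embed-text", prompt=text)
        return response["embedding"]

    def index_document(path):
        """Chunk a PDF, embed each chunk, and store it in the collection."""
        chunks = chunk_text(load_pdf(path))
        for i, chunk in enumerate(chunks):
            # upsert rather than add: indexing the same file again
            # overwrites existing entries instead of colliding on their IDs
            collection.upsert(
                ids=[f"{path}-{i}"],
                embeddings=[embed(chunk)],
                documents=[chunk],
            )
        print(f"Indexed {len(chunks)} chunks from {path}")

Because the client is persistent, your index survives between runs: you index a document once, and it stays ready to query until the underlying documents change. The script uses upsert rather than add so that indexing the same file a second time overwrites the existing entries instead of colliding on their IDs.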
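If you later want answers to cite their sources, it helps to record where each chunk came from at indexing time. Chroma collections accept an optional `metadatas` list alongside `ids`, `embeddings`, and `documents`. The sketch below only builds those parallel lists; `build_records` and the metadata keys `source` and `chunk` are hypothetical names chosen for illustration.

```python
def build_records(source, chunks):
    """Build parallel lists of ids, documents, and metadata for a set of chunks."""
    ids = [f"{source}-{i}" for i in range(len(chunks))]
    metadatas = [{"source": source, "chunk": i} for i in range(len(chunks))]
    return {"ids": ids, "documents": list(chunks), "metadatas": metadatas}

records = build_records("handbook.pdf", ["first chunk", "second chunk"])
print(records["ids"])
print(records["metadatas"][1])
```

The resulting dictionary can be passed to the collection in the same call that stores the embeddings, which also replaces one call per chunk with a single call per document.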
5 Retrieving and Generating
With documents indexed, answering a question is the satisfying part. Embed the question, ask Chroma for the nearest chunks, build a prompt that includes them, and hand it to the chat model.

    def answer(question, top_k=4):
        """Retrieve the top_k nearest chunks and ask the chat model to answer from them."""
        q_vector = embed(question)
        results = collection.query(
            query_embeddings=[q_vector],
            n_results=top_k,
        )
        # query() returns one list of chunks per query vector; we sent one
        context = "\n\n---\n\n".join(results["documents"][0])
        prompt = f"""Answer the question using only the context below.
    If the answer is not in the context, say you don't know.
    Context:
    {context}
    Question: {question}
    Answer:"""
        response = ollama.chat(
            model="llama3.1:8b",
            messages=[{"role": "user", "content": prompt}],
        )
        return response["message"]["content"]

    if __name__ == "__main__":
        # Indexing embeds every chunk; once the index exists this call can be skipped
        index_document("handbook.pdf")
        print(answer("What is the policy on remote work?"))

That is the entire pipeline. Setting top_k=4 fetches the four most relevant chunks; the prompt instructs the model to stick to that context and to admit ignorance rather than invent. The instruction to say “I don’t know” is not optional politeness: it is the line that keeps a grounded system honest when retrieval returns nothing relevant.
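One caveat is worth spelling out: a nearest-neighbour query returns the closest chunks however far away they are, so in practice retrieval rarely comes back literally empty; it comes back irrelevant. A possible safeguard, sketched below, is to discard chunks beyond a distance cutoff before building the prompt. This assumes the query result also carries a `distances` list parallel to `documents`, as Chroma's query output does by default. `filter_by_distance`, the sample values, and the cutoff of 1.0 are illustrative assumptions; a sensible cutoff depends on the embedding model and the collection's distance metric.

```python
def filter_by_distance(documents, distances, max_distance):
    """Keep only the chunks whose distance to the question is within the cutoff."""
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]

# Hypothetical query output: a smaller distance means a closer match.
documents = ["remote work policy text", "parking rules text", "cafeteria menu text"]
distances = [0.42, 0.97, 1.31]

kept = filter_by_distance(documents, distances, max_distance=1.0)
print(kept)
```

If nothing survives the cutoff, the script can return "I don't know" directly without calling the chat model at all.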
6 Tuning Chunk Size and Overlap
The numbers above are starting points, not gospel, and tuning them is where a mediocre RAG system becomes a good one. Chunk size is a balancing act. Small chunks are precise — they let retrieval pinpoint a single relevant fact — but they can lack the surrounding context that makes that fact meaningful. Large chunks carry more context but dilute the relevant sentence among padding, which both confuses retrieval and wastes room in your prompt.
Overlap guards against boundary effects. With no overlap, a definition that begins at the end of one chunk and finishes at the start of the next is split and half-lost. A 10-to-20 per cent overlap usually patches this cheaply. As for top_k, retrieving more chunks improves the odds of catching the answer but spends more of your context window and risks burying the signal in noise. Start with four, and adjust based on whether answers feel under- or over-informed.
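The trade-offs are easier to weigh with numbers in front of you. The sketch below estimates, for a document of a given length, how many chunks a setting produces, what fraction of each chunk is overlap, and how many words of context top_k chunks can occupy in the prompt. The helper names, the 20,000-word document, and the three settings are hypothetical; the count assumes windows advance by chunk size minus overlap until the text is covered.

```python
import math

def chunk_count(n_words, chunk_size, overlap):
    """Number of windows needed to cover n_words with the given overlap."""
    if n_words <= chunk_size:
        return 1 if n_words > 0 else 0
    step = chunk_size - overlap
    return 1 + math.ceil((n_words - chunk_size) / step)

def summarise(n_words, chunk_size, overlap, top_k):
    """Collect the figures that matter when comparing chunking settings."""
    return {
        "chunks": chunk_count(n_words, chunk_size, overlap),
        "overlap_pct": round(100 * overlap / chunk_size, 1),
        "context_words": chunk_size * top_k,  # upper bound on words pasted into the prompt
    }

# A hypothetical 20,000-word document under three settings.
for chunk_size, overlap in [(200, 30), (800, 150), (1500, 200)]:
    print(chunk_size, overlap, summarise(20_000, chunk_size, overlap, top_k=4))
```

With the defaults used in this guide, four chunks of 800 words put up to 3,200 words of context into every prompt, which is worth checking against the context window of whichever chat model you run.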
7 Evaluation and the Pitfalls
How do you know it works? Assemble a small set of questions for which you already know the correct answers, run them through, and read the results critically. When an answer is wrong, diagnose why. Did retrieval fetch the wrong chunks? Then the problem is your embedding or chunking, not the language model. Did it fetch the right chunks but the model still answered poorly? Then look at your prompt or reach for a stronger chat model.
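The retrieval half of that diagnosis can be automated with a very small harness. The sketch below checks, for each test question, whether a phrase you expect to see appears in any retrieved chunk. `evaluate_retrieval` and `fake_retrieve` are hypothetical names, and the canned retriever exists only so the example runs without a vector store; substring matching is a crude proxy, not a full relevance measure.

```python
def evaluate_retrieval(cases, retrieve):
    """Return the hit rate and the questions whose expected phrase was not retrieved."""
    hits = 0
    misses = []
    for question, expected_phrase in cases:
        chunks = retrieve(question)
        if any(expected_phrase.lower() in chunk.lower() for chunk in chunks):
            hits += 1
        else:
            misses.append(question)
    hit_rate = hits / len(cases) if cases else 0.0
    return hit_rate, misses

# Stand-in retriever with canned results, used only for this illustration.
def fake_retrieve(question):
    canned = {
        "What is the policy on remote work?": ["Staff may work remotely two days a week."],
        "How do I claim expenses?": ["The office closes at six."],
    }
    return canned.get(question, [])

cases = [
    ("What is the policy on remote work?", "work remotely"),
    ("How do I claim expenses?", "expense form"),
]
hit_rate, misses = evaluate_retrieval(cases, fake_retrieve)
print(hit_rate, misses)
```

In the real pipeline, `retrieve` would wrap the collection query and return the list of chunk texts. Questions that land in the misses list point at chunking or embedding; questions that hit but still get poor answers point at the prompt or the chat model.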
The cardinal pitfall is the oldest one in computing: garbage in, garbage out. If your PDFs extract as mangled text — scanned images with no real text layer, tables flattened into nonsense, headers and footers polluting every chunk — no amount of clever retrieval will save you. Clean extraction matters more than people expect. The second great pitfall is assuming retrieval is solved. It is the part that most rewards patient tuning, and the part most likely to be quietly letting you down when answers disappoint. Treat retrieval quality as your prime suspect whenever the system underwhelms.
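Two of those extraction problems can be caught with simple checks before anything is embedded. The sketch below works on a list of per-page strings, such as the result of calling extract_text() on each page. It drops lines that recur on most pages, which are likely headers or footers, and flags pages with almost no text, which may be scanned images. The function names and the thresholds `min_fraction=0.6` and `min_chars=20` are illustrative assumptions to adjust for your own documents.

```python
from collections import Counter

def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that recur on at least min_fraction of pages (likely headers/footers)."""
    if not pages:
        return []
    counts = Counter()
    for page in pages:
        # Count each distinct non-blank line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    # Require at least two occurrences so a single-page document is left alone.
    threshold = max(2, min_fraction * len(pages))
    repeated = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned

def empty_pages(pages, min_chars=20):
    """Indices of pages with almost no text, a hint that they may be scanned images."""
    return [i for i, page in enumerate(pages) if len(page.strip()) < min_chars]

pages = [
    "ACME Handbook\nRemote work is allowed two days per week.\nPage footer",
    "ACME Handbook\nExpenses must be filed within thirty days.\nPage footer",
    "",
]
print(strip_repeated_lines(pages))
print(empty_pages(pages))
```

Pages flagged as empty usually need OCR before they can contribute anything to the index, and tables generally need a dedicated extraction step that this sketch does not attempt.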
8 Conclusion
You have built a complete RAG pipeline: load and chunk PDFs, embed them with a local model, store the vectors in Chroma, retrieve the most relevant chunks for any question, and generate a grounded answer with a local LLM, all on your own machine, with open-source tools and a script you can read in one sitting. The skeleton here is genuinely usable, and it is also a foundation. Improve the chunking, swap in better models, tune top_k, add source citations, and you steadily move from a weekend toy towards something you would trust with real work. The hardest part is never the code; it is the patient tuning of retrieval. Get that right and your documents will finally answer back.