Semantic Search on Your Own Documents: Embeddings, Vector DBs, and Practical Limits

How to search by meaning, and where it quietly lets you down

By Smarc

28-06-2025 2053 words 10 min read

I once spent twenty minutes hunting for a note I knew I had written about backing up a database. I searched “how to back up the database” and got nothing, because the note was titled “nightly Postgres dump cron” and shared not a single word with my query. Keyword search has that glaring weakness baked in: it only finds documents that literally contain the words you typed. Semantic search fixes this by matching on meaning rather than spelling, and — this is the part that surprised me — you can run the whole stack on your own hardware over your own documents, with no cloud API in sight. I built exactly that over a few thousand markdown notes on a modest home server, and it has genuinely changed how I find things. It has also taught me, painfully and repeatedly, where the approach quietly breaks.

The idea, mechanically

The trick is embeddings: a model converts each chunk of text into a vector — a list of, typically, a few hundred numbers — positioned so that texts with similar meaning land near each other in that high-dimensional space. “Back up the database” and “nightly Postgres dump” produce vectors that sit close together even with zero shared words, because the model has learned, from vast amounts of text, that those phrases tend to appear in the same contexts. To search, you embed the query the same way and find the nearest document vectors by distance. That is the entire concept. Everything else — the storage, the chunking, the ranking — is plumbing, and the plumbing is where most of the real engineering effort ends up going.

It helps to be concrete about “nearest”. The standard measure is cosine similarity: the cosine of the angle between two vectors, running from 1 (identical direction, i.e. very similar) down through 0 (unrelated) to -1 (opposite). It ignores magnitude and cares only about direction, which is exactly what you want when comparing meaning. Every vector database offers it, and for text it is almost always the right default over raw Euclidean distance. Keep that in the back of your mind, because when results look wrong, the cosine scores are the first thing you will stare at.
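
To make the arithmetic concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up toy values rather than real embeddings, and the function name is only illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, not real embeddings).
v = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # v scaled by 2: magnitude differs, direction does not
orthogonal = [0.0, 0.0, 3.0]       # shares no component with v
opposite = [-1.0, -2.0, 0.0]       # v negated

print(cosine_similarity(v, same_direction))   # approximately 1
print(cosine_similarity(v, orthogonal))       # 0
print(cosine_similarity(v, opposite))         # approximately -1
```

The scaled vector scores the same as an identical one would, which is the "ignores magnitude" property in action.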

If you have read my walkthrough on building a practical RAG pipeline, the retrieval half of that system is exactly this: semantic search is the engine that decides which chunks get fed to a language model. But semantic search is useful entirely on its own, without any generative model bolted on top — as a better way to find things. That is the use I care about here.

You don’t need an API for the embedding model. Small sentence-transformer models run fine on a CPU:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["Nightly Postgres dump via cron", "Annual leave policy", "ZFS scrub schedule"]
embeddings = model.encode(docs)   # shape: (3, 384)
print(embeddings.shape)

all-MiniLM-L6-v2 is the workhorse: 384 dimensions, fast, and good enough for most homelab-scale corpora. It embeds thousands of short passages a minute on an ordinary CPU, so you do not need a GPU to get started. Larger models — bge-base-en-v1.5, gte-base, or the multilingual e5 family — give measurably better retrieval at a speed and memory cost, and they output larger vectors (768 dimensions is common), which your storage layer needs to know about. My advice is to start with MiniLM, get the pipeline working end to end, and only reach for a heavier model once you can measure that retrieval quality is the bottleneck rather than your chunking. Swapping the model later is a one-line change plus a re-index; over-engineering the model choice on day one is a classic way to spend a weekend and learn nothing.

One non-obvious constraint: whatever model you embed your documents with, you must embed your queries with the same model. The vectors only live in a shared space if they came from the same encoder. Change the model and you have to re-embed the entire corpus, or your search silently returns garbage — the query vector and the stored vectors no longer speak the same language.
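
One way to guard against this, sketched here under assumptions of my own choosing, is to record the model name and vector dimension next to the index and refuse to search when they no longer match. The sidecar file name and both function names are hypothetical; nothing here is part of sentence-transformers or Qdrant.

```python
import json
import tempfile
from pathlib import Path

def write_index_meta(path, model_name, dim):
    """Record which encoder produced the stored vectors."""
    Path(path).write_text(json.dumps({"model": model_name, "dim": dim}))

def check_index_meta(path, model_name, dim):
    """Raise if the query-side encoder differs from the one used at index time."""
    meta = json.loads(Path(path).read_text())
    if meta["model"] != model_name or meta["dim"] != dim:
        raise RuntimeError(
            f"index was built with {meta['model']} ({meta['dim']} dims), "
            f"but queries use {model_name} ({dim} dims): re-embed the corpus"
        )

# Hypothetical sidecar file stored next to the index.
meta_path = Path(tempfile.mkdtemp()) / "index_meta.json"
write_index_meta(meta_path, "all-MiniLM-L6-v2", 384)

check_index_meta(meta_path, "all-MiniLM-L6-v2", 384)   # same encoder: passes silently
try:
    check_index_meta(meta_path, "bge-base-en-v1.5", 768)
except RuntimeError as err:
    print("refusing to search:", err)
```

A loud failure at startup is cheaper than an afternoon spent wondering why every result looks random.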

Storing and searching the vectors

For a few thousand documents you genuinely don’t need a database — hold every vector in a NumPy array in memory, and a single cosine-similarity computation against the whole set is effectively instant. A brute-force scan of ten thousand 384-dimension vectors is a few milliseconds; you will not notice it. The temptation to reach straight for infrastructure is strong and usually wrong at small scale. But once you’re past tens of thousands of chunks, or you want persistence across restarts, metadata filtering, or approximate-nearest-neighbour indexing so search stays fast as the corpus grows, a proper vector database earns its place. I run Qdrant, which is a single Rust binary, self-hostable, and refreshingly free of the operational ceremony that some of its competitors demand. It stores to disk, exposes a clean HTTP and gRPC API, and does not need a cluster of supporting services just to hold a few million vectors.
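
For the small-scale case, the in-memory scan might look something like the sketch below. It uses random vectors as stand-ins for real embeddings, and `top_k` is a name I made up for illustration. The point is only that the whole search is one matrix-vector product.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Return (index, cosine score) pairs for the k rows of doc_vecs nearest the query."""
    # Normalise to unit length so that a dot product equals cosine similarity.
    docs_n = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_n = query_vec / np.linalg.norm(query_vec)
    scores = docs_n @ query_n              # one matrix-vector product over the corpus
    order = np.argsort(-scores)[:k]        # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Random stand-ins for real embeddings: 10,000 vectors of 384 dimensions.
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(10_000, 384)).astype(np.float32)
# A query that is document 42 plus a little noise, so 42 should rank first.
query_vec = doc_vecs[42] + 0.01 * rng.normal(size=384).astype(np.float32)

for idx, score in top_k(query_vec, doc_vecs):
    print(idx, round(score, 3))
```

In a real script you would normalise the document matrix once at load time rather than on every query. Once the corpus outgrows this, or you need persistence and filtering, Qdrant takes over.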

Standing it up is a docker-compose file and nothing more:

services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - ./qdrant-data:/qdrant/storage
    restart: unless-stopped

Inserting and querying is a few lines through its client:

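from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

client = QdrantClient(url="http://localhost:6333")

# Start from an empty collection; size must match the embedding model (384 for MiniLM).
# recreate_collection is deprecated in recent qdrant-client releases.
if client.collection_exists("notes"):
    client.delete_collection("notes")
client.create_collection(
    "notes",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# One point per document: the vector plus the original text as payload.
client.upsert("notes", points=[
    PointStruct(id=i, vector=v.tolist(), payload={"text": d})
    for i, (v, d) in enumerate(zip(embeddings, docs))
])

# Embed the query with the same model, then fetch the three nearest points.
# query_points replaces the deprecated search method.
query = model.encode("how do I back up the db").tolist()
hits = client.query_points("notes", query=query, limit=3).points
for h in hits:
    print(round(h.score, 3), h.payload["text"])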

0.612 Nightly Postgres dump via cron
0.241 ZFS scrub schedule
0.118 Annual leave policy

Note the top hit shares no words with the query. That’s the whole payoff.

Chunking, the part everyone underestimates

You don’t embed whole documents — you embed chunks, because a single vector can only represent so much meaning before it turns to mush. A 4,000-word page averaged into one vector retrieves badly: the one paragraph you actually wanted is diluted by everything else on the page, and its distinctive signal disappears into an averaged fog. So you split documents into passages of a few hundred tokens, often with a little overlap so a sentence isn’t cut mid-thought, and embed each one separately. Get chunking wrong and your search quality collapses regardless of how good the model is. This is the single biggest lever in the whole system, and it is almost never the model’s fault when results are poor — it’s the chunks.

A few hard-won rules. First, chunk on semantic boundaries where you can (paragraphs, markdown headings, list items) rather than blindly slicing every N characters. A chunk that straddles two unrelated sections embeds to a muddled vector that is close to nothing. Second, keep an overlap of maybe 10–20 per cent between adjacent chunks so a thought split across a boundary still appears whole somewhere. Third, size chunks to the model: all-MiniLM-L6-v2 was trained on short, sentence-length input and by default truncates anything past 256 word pieces, so text beyond that limit never reaches the vector at all. Keep chunks under that limit for MiniLM; 200–400 tokens is a sane range for models with a 512-token window. Here is a deliberately simple splitter that respects paragraph boundaries and adds overlap:

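def chunk(text, target_tokens=300, overlap=50):
    """Split text into paragraph-aligned chunks of roughly target_tokens words."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buf, count = [], [], 0
    fresh = False  # True while buf holds a paragraph not yet emitted
    for p in paras:
        buf.append(p)
        count += len(p.split())
        fresh = True
        if count >= target_tokens:
            chunks.append("\n\n".join(buf))
            # carry the last `overlap` words forward as overlap
            tail = " ".join(buf).split()[-overlap:] if overlap > 0 else []
            buf, count = ([" ".join(tail)] if tail else []), len(tail)
            fresh = False
    if fresh:  # skip a trailing buffer that holds only carried-over overlap
        chunks.append("\n\n".join(buf))
    return chunks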

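It counts words rather than true tokens, which is close enough for prose and keeps the dependency list short. Bear in mind that a word usually becomes more than one token, so with all-MiniLM-L6-v2, which truncates input past 256 word pieces by default, pass a lower target_tokens (around 150) to stay inside the limit. Attach the source path and a heading to each chunk’s payload so that when a result comes back you can actually locate it in the original document.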
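
As a rough illustration of the payload idea, the sketch below splits a markdown note on its headings and builds one record per section, carrying the source path and heading with the text. The function names, the payload keys, and the sample note are all hypothetical. The regex treats any line starting with `#` as a heading, so a comment inside a code block would be misread as one.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def sections(markdown_text):
    """Yield (heading, body) pairs, one per markdown heading."""
    heading, lines = "", []
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            # Emit the section collected so far, if it has any real text.
            if any(l.strip() for l in lines):
                yield heading, "\n".join(lines).strip()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        yield heading, "\n".join(lines).strip()

def payloads(source_path, markdown_text):
    """Build one payload dict per section, ready to store next to its vector."""
    return [
        {"source": source_path, "heading": heading, "text": body}
        for heading, body in sections(markdown_text)
    ]

# Hypothetical note; the path and contents are made up for illustration.
note = "# Backups\n\nNightly Postgres dump via cron.\n\n## Restore\n\nUse pg_restore.\n"
for p in payloads("notes/postgres.md", note):
    print(p["source"], "|", p["heading"], "|", p["text"])
```

Long sections can then be passed through the `chunk` function, with every resulting chunk inheriting the same source and heading.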

Where it quietly lets you down

Now the honest part, because the demos never show this.

Semantic search is bad at exact matches. Search for an error code, a part number, or a specific function name, and embeddings will cheerfully return things that are thematically similar while missing the literal string you needed. The fix is hybrid search: run a keyword (BM25) search alongside the vector search and merge the results. Qdrant and friends support this, but it’s more plumbing, and most “just use embeddings” tutorials skip it. For anything with identifiers, names, or codes, pure semantic search will betray you.
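
The merging step is less mysterious than it sounds. One common technique is reciprocal rank fusion, which ignores the raw scores (BM25 and cosine are not comparable anyway) and combines results by rank position. The sketch below is my own minimal version, not Qdrant's implementation; the document ids are invented, and k=60 is just the conventional default for this method.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); documents found by both lists add up.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists for a query such as "error E4021 on backup".
keyword_hits = ["e4021-runbook", "backup-cron", "release-notes"]
vector_hits = ["backup-cron", "zfs-scrub", "e4021-runbook"]

print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# -> ['backup-cron', 'e4021-runbook', 'zfs-scrub', 'release-notes']
```

Documents that both searches agree on float to the top, while the exact-match runbook that the vector search ranked last is pulled back up to second place.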

It’s also only as good as the embedding model’s training. A general model knows little about your domain’s jargon, so two terms that are synonyms in your world may sit far apart in vector space. And there’s no notion of recency or authority — the most semantically similar chunk might be an outdated note you wrote three years ago. Embeddings have no idea which document is correct, only which is similar.

Finally, it can feel slightly opaque. When keyword search misses, you know exactly why — the word wasn’t there. When semantic search returns something baffling, debugging means staring at cosine scores and second-guessing your chunk boundaries, with no crisp explanation of why the model thought two things were similar.

Troubleshooting: when the results are bad

Because the failures are quiet rather than loud — you get plausible-looking wrong answers, not an error — you need a method for diagnosing them. Here is the order I work through.

Results are irrelevant across the board. Almost always a model mismatch: you embedded documents with one model and queries with another, or you changed the model and forgot to re-index. Confirm both paths call the identical model name, and if in doubt, re-embed the corpus from scratch. This one wastes an afternoon if you don’t check it first.

One topic returns nonsense, the rest is fine. Suspect chunking for that document — a giant unbroken chunk, or a chunk that merged two unrelated sections. Print the raw text of the top-scoring chunks and read them. Nine times out of ten the chunk boundary is the culprit, not the model.

Exact identifiers, codes, and names go missing. This is the semantic-search blind spot, not a bug. Add keyword search alongside it, as described in the hybrid search discussion above.

Search is slow at scale. A brute-force NumPy scan is linear in corpus size and will crawl past a few hundred thousand vectors. That is precisely when the approximate index inside a real vector database pays off — Qdrant’s HNSW index turns a linear scan into something sub-linear. Migrate before it hurts, not after.

Everything scores suspiciously high, around 0.7–0.9. Some models compress all cosine scores into a narrow band, so the absolute number is meaningless and only the ranking matters. Don’t set an absolute score threshold without first checking what “unrelated” actually scores on your model; calibrate against a query you know has no good answer.
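
For that last case, calibration can be as simple as the sketch below: collect the top score for a handful of queries you know have no good answer, and place the cutoff a little above the highest of them. The function names, the sample scores, and the 0.05 margin are all assumptions for illustration; pick a margin by looking at your own data.

```python
def noise_floor(unrelated_top_scores, margin=0.05):
    """Suggest a cutoff from the top scores of queries known to have no good answer."""
    if not unrelated_top_scores:
        raise ValueError("need at least one calibration score")
    return max(unrelated_top_scores) + margin

def filter_hits(hits, cutoff):
    """Keep only (score, text) hits that score above the calibrated cutoff."""
    return [(score, text) for score, text in hits if score > cutoff]

# Hypothetical top-1 scores for queries that should match nothing in the corpus.
unrelated = [0.71, 0.74, 0.69]
cutoff = noise_floor(unrelated)

# Hypothetical hits from a model that squeezes its scores into a narrow band.
hits = [(0.88, "Nightly Postgres dump via cron"), (0.76, "ZFS scrub schedule")]
print(round(cutoff, 2), filter_hits(hits, cutoff))
```

A score of 0.76 would look like a strong match on paper, yet here it sits barely above the noise and is dropped.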

The verdict

For searching a personal knowledge base (notes, documentation, a wiki, anything where you remember the gist but not the words), self-hosted semantic search is a genuine upgrade, and the whole stack runs on a modest machine with no cloud dependency and no per-query bill. Build it if your problem is “I can never find the note I know I wrote.” The same corpus, incidentally, is what you would point a retrieval-augmented pipeline at if you later want a model to answer from your notes rather than merely surface them. It also lives nicely alongside a self-hosted file store like the one in my Nextcloud on Linux guide: point the indexer at the files directory and your documents become searchable by meaning.

But go in clear-eyed. Combine it with keyword search for anything involving exact identifiers, sweat the chunking harder than you think you need to, and don’t expect it to know which answer is right — only which is related. It has no sense of recency, authority, or truth. Treat it as a better way to find documents rather than a source of truth in itself, budget a weekend for the chunking and hybrid-search work that the tutorials skip, and it will serve you well for years on hardware you already own.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #embeddings #rag #self-hosting #ai

