Self-Hosted AI Search: Replacing Google with Perplexica and a Local Model

An answer engine that runs on your own box and doesn't sell your queries

Smarc Included in

19-03-2025 1817 words 9 min read

Self-Hosted AI Search: Replacing Google with Perplexica and a Local Model

Contents

Searching the web has become a chore. You type a question, scroll past a screen of ads, then past four articles that are themselves just AI-generated SEO sludge, and somewhere on the second page you find the actual answer — if it’s there at all. The cloud “answer engines” fix the experience but trade away your privacy: every query goes to someone else’s server to be logged, profiled, and monetised, which for the kind of questions I actually search (health, finances, half-formed project ideas) is not a trade I want to make. I wanted the good bit — a model that reads the web and answers the question with sources — without the surveillance. That’s where Perplexica comes in.

Perplexica is an open-source project, an answer engine you host yourself. Point it at a local language model and a self-hosted search backend, and you get a private little research assistant that reads live web results and writes you a cited answer. I’ve been running it for a few weeks, and it’s earned its place in my browser’s keyword bar. This is squarely in the same territory as the rest of my self-hosting write-ups — if you’ve already stood up something like self-hosted monitoring with Uptime Kuma or Immich for your photos, the Docker Compose pattern here will feel familiar.

What it actually is

Perplexica is not a search engine of its own and not a chatbot. It’s the glue between three things: a metasearch engine that fetches real web results, a language model that reads and synthesises them, and an embedding model that helps rank which results are actually relevant to your question.

The flow is essentially retrieval-augmented generation pointed at the live web rather than at a fixed document store. You ask a question; it queries the search backend; it pulls the top results; it ranks them with embeddings; it feeds the relevant chunks to the language model; and the model writes an answer with citations back to the sources. Those citations are the difference between a useful tool and a confident liar — you can click through and check, which you absolutely should. A local model with 8 billion parameters does not “know” as much as a frontier cloud model, and left to its own devices it will hallucinate cheerfully. The retrieval step is what keeps it honest: it isn’t recalling facts from training, it’s summarising text you can see it was handed.

This is also why the architecture matters more than the raw intelligence of the model. Perplexica offers a couple of modes — a fast one that leans on the search results directly, and a deeper one that does more retrieval and reasoning per query. The deep mode is slower but noticeably better on the messy questions, and it’s where a self-hosted setup earns its keep, because you’re paying in your own GPU seconds rather than someone else’s per-token bill.

The pieces you need

The search half is provided by SearXNG, a self-hosted metasearch engine that queries dozens of upstream sources — Google, Bing, DuckDuckGo, Wikipedia, and many more — aggregates the results, and returns them without tracking you or fingerprinting your browser. It’s a genuinely useful thing to run on its own even without the AI layer on top; plenty of people use SearXNG as their day-to-day search box. Perplexica leans on its JSON API to pull structured results it can rank and feed to the model.

The model half can point at a cloud provider like OpenAI if you want, but the whole appeal here is keeping it local, so I point Perplexica at Ollama running on the same network. Ollama is the path of least resistance for local models — one ollama pull and the model is ready, with an OpenAI-compatible API that Perplexica speaks natively. One model does the chat synthesis; a small embedding model does the relevance ranking. Keeping both local means the entire pipeline, from the search query to the final cited paragraph, happens on hardware you own.

The cleanest way to stand it all up is Docker Compose, with Perplexica and SearXNG side by side and Ollama reachable on the host. The most robust arrangement is to run Ollama in the same Compose stack so everything talks over the Docker network by service name — no host-gateway trickery required:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
services:
  searxng:
    image: searxng/searxng:latest
    ports:
      - "4000:8080"
    volumes:
      - ./searxng:/etc/searxng

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ./ollama:/root/.ollama

  perplexica:
    image: itzcrazykns1337/perplexica:main
    ports:
      - "3000:3000"
    environment:
      - SEARXNG_API_URL=http://searxng:8080
    depends_on:
      - searxng
      - ollama

With Ollama in the stack, Perplexica reaches it at http://ollama:11434 — the service name resolves over Docker’s internal network, and nothing needs to know the host’s address. In Perplexica’s own config you set that Ollama API base and choose your chat and embedding models from the dropdown. (If you instead keep Ollama on the bare host — say because it needs direct GPU access without the container runtime in the way — point Perplexica at the host’s LAN IP, e.g. http://192.168.1.20:11434, rather than fishing for a special gateway hostname.)

Choosing the model

This is where expectations need managing. The quality of the answer is bounded by the model doing the synthesis, and a small local model is not going to match a frontier cloud model at complex reasoning. But — and this is the pleasant surprise — the task here is easier than open-ended chat. The model isn’t being asked to know things; it’s being handed relevant text and asked to summarise and cite it. That’s a job a modest local model does perfectly well.

I run a 7-to-8-billion-parameter instruction model for synthesis and a dedicated small embedding model for ranking. On a machine with a mid-range GPU, answers come back in a handful of seconds. The embedding model is doing quiet but important work: get the ranking wrong and the synthesiser is summarising irrelevant pages, so don’t skip it in favour of letting the chat model do everything. A dedicated embedding model like nomic-embed-text is a few hundred megabytes and pays for itself immediately in answer relevance.

If you’re VRAM-constrained, this is a friendlier workload than open-ended chat because the context is short and bounded — you’re feeding the model a handful of retrieved snippets, not asking it to hold a long conversation. A quantised 7B model at Q4 will run comfortably on an 8GB card, which is the same budget-GPU territory I cover in running Stable Diffusion below 8GB of VRAM; the constraints and the mindset are identical.

Troubleshooting the setup

Almost every problem I’ve hit lands in one of a few buckets.

Empty or terrible answers with no citations. Nine times out of ten SearXNG isn’t returning results. Its JSON API is disabled by default in some builds — check that search.formats in SearXNG’s settings.yml includes json, then restart. Test it directly with curl 'http://192.168.1.20:4000/search?q=test&format=json' before blaming Perplexica.
Perplexica can’t reach the model. If you kept Ollama on the host rather than in the stack, a container cannot reach localhost — that’s the container’s own loopback, not the host’s. Use the host’s LAN IP. Confirm the model is actually pulled with ollama list; Perplexica won’t pull it for you.
Painfully slow first response. The model is loading into VRAM on the first request. It’ll be cached afterwards. If it’s slow every time, the model is being evicted between requests because something else wants the GPU — give Ollama a longer keep-alive or stop the competing workload.
Answers ignore recent events. That’s expected and correct: Perplexica answers from live search results, so its “knowledge” is only as fresh as what SearXNG surfaces. If results are stale, the problem is upstream in the search engines SearXNG is querying, not the model.

The honest limitations

It is slower than Google, and there is simply no getting around that. Fetching results, embedding them, ranking them and then generating whole paragraphs of prose takes meaningfully longer than serving you a cached search page ever will. For a quick “what’s the capital of Peru” lookup, it’s overkill and you’ll feel the latency.

Where it shines is the messy, multi-source question — “compare these three approaches,” “what changed between these two versions,” “summarise the current thinking on X” — the sort of query where you’d otherwise open eight tabs, skim each one, and hold the synthesis in your head. There, the cited synthesis genuinely saves time, and the fact that every claim is footnoted means you can spot-check the two or three that actually matter instead of reading everything.

And it’s only as good as the underlying search; if SearXNG’s sources are thin on a topic, the answer is thin too. Garbage in, confident garbage out. SearXNG can also get rate-limited or temporarily blocked by upstream engines if you hammer it, which shows up as suddenly empty results — spreading queries across more engines in its config helps. The local model is another ceiling: an 8B model summarises well but reasons poorly, so for a question that needs the model to connect facts rather than just relay them, you’ll feel the gap against a frontier model. Knowing which questions fall on which side of that line is most of the skill in using it well.

There’s also the honest matter of upkeep. This is three moving parts — Perplexica, SearXNG, and Ollama — each with its own update cadence, and Perplexica in particular is a young project that changes quickly. That’s real maintenance time, and it’s worth being clear-eyed that self-hosting a stack like this is a hobby cost as much as a privacy win, not a free lunch.

Is it worth it?

If you value your search privacy and already self-host, this is one of the more rewarding things you can stand up. The whole stack runs on hardware you own, no query ever leaves your network, and the cited-answer format is genuinely better than wading through SEO sludge for the right kind of question. It won’t replace a quick reflexive search — keep a plain SearXNG bookmark for those — but as a private research assistant for the questions that actually need thinking about, it’s excellent.

For the privacy-indifferent who just want the fastest, smartest possible answer and don’t mind who reads their queries, the cloud engines are faster and smarter and you should just use them — there’s no shame in it. For everyone who’d rather their curiosity wasn’t a data point in someone else’s ad model, and who already has the GPU and the tolerance for a bit of upkeep, Perplexica plus a local model is a quietly brilliant bit of kit. It won’t be the last self-hosted service you add, but it’s one of the few that changes a daily habit rather than just replacing a subscription line item.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged#AI #self-hosting #search #privacy

Contents

Self-Hosted AI Search: Replacing Google with Perplexica and a Local Model

An answer engine that runs on your own box and doesn't sell your queries

What it actually is

The pieces you need

Choosing the model

Troubleshooting the setup

The honest limitations

Is it worth it?

Related Content

LibreTranslate: Private, Offline Translation at Home

Running Embeddings Locally for Self-Hosted Search

Stable Diffusion on Consumer Hardware: The Home Image Lab

Paperless + a Local LLM: Auto-Tagging Scans Offline