Open WebUI Pipelines: Chaining Local Models with Tools and RAG

Where the chat UI stops and your own Python takes over

By Smarc

05-06-2025 1912 words 9 min read

Open WebUI is the front end most people slap in front of Ollama and call it a day — a tidy ChatGPT-alike that talks to local models. That’s fine right up until you want the model to do something: hit your internal docs, call an API, run a query, or chain a couple of models together. I hit that wall the week I wanted my local assistant to answer from a folder of PDFs instead of making things up. The built-in RAG and function features cover some of this, but the real escape hatch is Pipelines: a mechanism that lets you insert arbitrary Python into the request flow. It’s the difference between “chat with a model” and “wire a model into your systems.”

This post is about why you’d reach for Pipelines, the two shapes they come in, and the handful of things that will trip you up, because they tripped me up first.

Why the built-ins run out of road

Open WebUI ships with document upload and a knowledge feature that does perfectly respectable retrieval-augmented generation for the common case. If all you need is “let me chat with these five PDFs”, use it — don’t add moving parts you don’t need. The built-ins fall short the moment your logic stops fitting a form field: you want to query a specific vector store you already run, call an internal API mid-conversation, route cheap questions to a small model and expensive ones to a big model, or redact secrets before they ever reach the LLM. None of that is expressible in a settings panel. It’s code, and Pipelines is where the code goes. If you’re still deciding whether RAG is even the right approach versus fine-tuning or plain prompting, my breakdown of picking the right tool is the place to start before you write a line of this.

Two flavours: filters and pipes

Pipelines come in two shapes and the distinction matters more than the docs let on.

A filter sits beside an existing model and hooks the request on the way in (inlet) and the response on the way out (outlet). The model still does the talking; you’re modifying context or post-processing around it. Filters are the right tool for injecting RAG context, redacting secrets, logging conversations to a monitoring system, enforcing a token budget, or blocking toxic input before it costs you a generation. The user still picks a normal model in the dropdown; your filter runs invisibly on top.

A pipe is the model, from Open WebUI’s point of view. It shows up in the model dropdown as if it were an Ollama model, but behind it is your code. The pipe decides everything: which real model to call, whether to call several, whether to hit a tool first. This is how you chain models or build an agent. When you want a named “Docs Assistant” in the dropdown that does its own retrieval, that’s a pipe.

A rule of thumb: if you’re decorating an existing model, write a filter. If you’re replacing it with your own orchestration, write a pipe.

The two ways to run them

There are two deployment models, and it’s worth knowing which you’re using because the older tutorials assume the first and the newer ones assume the second.

The original design is a standalone Pipelines container that Open WebUI connects to as an OpenAI-compatible endpoint. You run a separate service, drop Python files into its volume, and register it as a connection. More recently, both pipes and filters have in-process equivalents built directly into Open WebUI (“Functions”), which support the same Valves configuration and need no separate worker container. For a small homelab I now prefer the in-process route — one fewer container to patch. But the standalone server is still the right call when your pipeline is heavy (large dependencies, its own GPU work) and you’d rather it not share the web UI’s process. I’ll show the standalone setup here because it generalises to both.

Running the pipelines server

The standalone Pipelines server is a container Open WebUI talks to over the OpenAI API. Compose keeps it tidy:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports: ["3000:8080"]
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - webui:/app/backend/data

  pipelines:
    image: ghcr.io/open-webui/pipelines:main
    ports: ["9099:9099"]
    volumes:
      - pipelines:/app/pipelines

volumes:
  webui:
  pipelines:

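Then in Open WebUI under Settings → Connections, add an OpenAI API connection pointing at http://pipelines:9099 with the default API key 0p3n-w3bu!. Change that key straight away, because it is public and identical on every install: set the PIPELINES_API_KEY environment variable on the pipelines service and enter the same value in the connection. Any pipe you drop into the pipelines volume now appears as a selectable model; filters don’t get their own entry, because they attach to existing models.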

A pipe that does RAG and tool use

Here’s the shape of a pipe that retrieves context from a vector store, then answers with a local model. The Pipelines server loads any file in its volume that exposes a Pipeline class with a pipe method:

from typing import List

import requests


class Pipeline:
    def __init__(self):
        self.name = "Docs Assistant (RAG)"  # label shown in the model dropdown
        self.ollama = "http://ollama:11434"

    def retrieve(self, query: str) -> str:
        # Hit a local vector DB and return the top chunks as one string
        r = requests.post("http://qdrant-api:8000/search",
                          json={"q": query, "k": 4}, timeout=10)
        r.raise_for_status()  # surface a failed search instead of a KeyError below
        return "\n\n".join(c["text"] for c in r.json()["chunks"])

    def pipe(self, user_message: str, model_id: str,
             messages: List[dict], body: dict) -> str:
        context = self.retrieve(user_message)
        prompt = (
            "Answer using ONLY the context below. "
            "If it isn't there, say you don't know.\n\n"
            f"Context:\n{context}\n\nQuestion: {user_message}"
        )
        # Generous timeout: a cold model may need to load before it answers
        r = requests.post(f"{self.ollama}/api/generate", json={
            "model": "llama3.1:8b",
            "prompt": prompt,
            "stream": False,
        }, timeout=120)
        r.raise_for_status()
        return r.json()["response"]

Drop that file in, select “Docs Assistant (RAG)” from the model menu, and every question now gets grounded in your own documents before the model sees it. Swap the retrieve call for a requests.get against any internal API and you’ve got tool use; call two different models in sequence and you’ve got a chain — a small fast model to classify intent, then route to a larger one only when needed.
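
The classify-then-route chain could be sketched roughly like this. It is a sketch under assumptions, not a drop-in pipe: the model pair, the SIMPLE/COMPLEX labels and the ask_ollama helper are placeholders of mine, and I've used the standard library instead of requests so the routing logic can be run on its own. The ask callable is injected so you can exercise the routing without a running Ollama.

```python
import json
import urllib.request
from typing import Callable

OLLAMA = "http://ollama:11434"                # same host the pipe above assumes
SMALL, LARGE = "llama3.2:3b", "llama3.1:8b"   # hypothetical model pair


def ask_ollama(model: str, prompt: str) -> str:
    """Blocking, non-streaming call to Ollama's /api/generate."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{OLLAMA}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["response"]


def route(question: str, ask: Callable[[str, str], str] = ask_ollama) -> str:
    """Let the small model label the question, then pick who answers it."""
    label = ask(
        SMALL,
        "Reply with exactly one word, SIMPLE or COMPLEX, for this question:\n"
        + question,
    )
    # Anything that is not clearly SIMPLE goes to the larger model.
    model = SMALL if label.strip().upper().startswith("SIMPLE") else LARGE
    return ask(model, question)
```

Defaulting to the larger model when the label is unclear is a deliberate choice: a misrouted easy question costs a little latency, while a misrouted hard one costs a bad answer.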

Note the strict instruction in the prompt: answer only from the context and admit ignorance otherwise. That one line does most of the work in keeping a local model honest — without it, an 8B model will cheerfully hallucinate over thin retrieval. Grounding is only as good as both the chunks you fetch and the discipline you impose on how they’re used. If your retrieval quality is the weak link, the plumbing that feeds it matters as much as the pipe; my walkthrough of a practical RAG pipeline covers chunking, embeddings and the vector store this code assumes already exists.

A filter, for the other half of the job

Not everything wants to be a pipe. When you want the user to keep picking a normal model but you need to decorate the conversation — strip secrets, cap tokens, log to your monitoring stack — a filter is the lighter tool. A filter exposes inlet (runs on the incoming message before the model sees it) and outlet (runs on the model’s reply before the user sees it):

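import re
from typing import List, Optional

from pydantic import BaseModel


class Pipeline:
    class Valves(BaseModel):
        # Model ids this filter attaches to; ["*"] means every model
        # (the convention used by the Pipelines project's example filters)
        pipelines: List[str] = ["*"]
        priority: int = 0

    def __init__(self):
        self.type = "filter"
        self.name = "Redactor"
        self.valves = self.Valves()

    async def inlet(self, body: dict, user: Optional[dict] = None) -> dict:
        # Scrub anything that looks like an API key before it reaches the model
        for m in body.get("messages", []):
            # Multimodal messages carry a list of parts; only scrub plain strings
            if isinstance(m.get("content"), str):
                m["content"] = re.sub(r"sk-[A-Za-z0-9]{20,}", "[REDACTED]", m["content"])
        return body

    async def outlet(self, body: dict, user: Optional[dict] = None) -> dict:
        # Nothing to change on the way out; pass the reply through untouched
        return body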

Attach that filter to a model and every message flowing through it gets scrubbed, invisibly, without the user choosing a different model. That separation — pipes replace, filters decorate — is the whole mental model, and getting it straight up front saves you rewriting a pipe you should have made a filter.
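
The same inlet hook covers the "cap tokens" case. Here is a minimal sketch that keeps the system prompt plus as many of the newest messages as fit a budget. The budget is counted in characters as a crude stand-in for tokens (roughly four characters per token for English text), and both the History Cap name and the max_chars figure are placeholders of mine. It shows only the trimming logic; how the filter is attached to models is left out.

```python
from typing import Optional


class Pipeline:
    """Filter sketch: keep the system prompt plus the newest messages that fit a rough budget."""

    def __init__(self):
        self.type = "filter"
        self.name = "History Cap"
        self.max_chars = 8000  # hypothetical budget, about 2,000 tokens at ~4 chars each

    def trim(self, messages: list) -> list:
        system = [m for m in messages if m.get("role") == "system"]
        rest = [m for m in messages if m.get("role") != "system"]
        kept, used = [], 0
        # Walk backwards so the most recent turns survive.
        for m in reversed(rest):
            content = m.get("content")
            size = len(content) if isinstance(content, str) else 0
            # Always keep the newest message, even if it alone exceeds the budget.
            if kept and used + size > self.max_chars:
                break
            kept.append(m)
            used += size
        return system + kept[::-1]

    async def inlet(self, body: dict, user: Optional[dict] = None) -> dict:
        body["messages"] = self.trim(body.get("messages", []))
        return body
```

Dropping whole messages from the oldest end is the bluntest possible policy; summarising the dropped turns would preserve more, at the cost of an extra model call.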

Streaming and parameters

Returning a string works, but for a responsive UI you want to stream. Yield strings instead of returning, and set stream=True on the Ollama call, forwarding each chunk:

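import json  # at the top of the file, next to "import requests"

def pipe(self, user_message, model_id, messages, body):
    # Method of the Pipeline class; yields text as Ollama produces it
    with requests.post(f"{self.ollama}/api/generate", json={
        "model": "llama3.1:8b", "prompt": user_message, "stream": True,
    }, stream=True, timeout=120) as r:
        r.raise_for_status()
        # Ollama streams one JSON object per line; forward only the text
        for line in r.iter_lines():
            if line:
                yield json.loads(line).get("response", "")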

You can also expose tunables to the UI via a Valves class, so things like the model name or retrieval k become editable settings rather than hard-coded constants. Valves are runtime configuration — you can change them from the UI without redeploying the pipeline — which is exactly what you want when you’re tuning k against a wall of test questions:

from pydantic import BaseModel

class Pipeline:
    class Valves(BaseModel):
        MODEL: str = "llama3.1:8b"
        TOP_K: int = 4

    def __init__(self):
        self.name = "Docs Assistant (RAG)"
        self.valves = self.Valves()

Now MODEL and TOP_K show up as editable fields in the pipeline’s settings, no redeploy required.
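
Defining valves is only half of it; the pipe has to read them. A small sketch of that, assuming the server replaces self.valves when you save new settings, so reading the values at call time picks up the edit. The search_payload and generate_payload helpers are names of mine, introduced only to show where the values end up.

```python
from pydantic import BaseModel


class Pipeline:
    class Valves(BaseModel):
        MODEL: str = "llama3.1:8b"
        TOP_K: int = 4

    def __init__(self):
        self.name = "Docs Assistant (RAG)"
        self.valves = self.Valves()

    def search_payload(self, query: str) -> dict:
        # Read k per request rather than caching it in __init__.
        return {"q": query, "k": self.valves.TOP_K}

    def generate_payload(self, prompt: str) -> dict:
        # Same idea for the model name sent to Ollama.
        return {"model": self.valves.MODEL, "prompt": prompt, "stream": False}
```

The thing to avoid is copying valve values into plain attributes in __init__, because those copies would go stale after the first change in the UI.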

Where it goes wrong

A pipeline that “does nothing” is almost always one of these.

The pipeline doesn’t appear in the dropdown. The connection is misconfigured — wrong URL, or the API key wasn’t changed on both sides. Check that Open WebUI can actually reach http://pipelines:9099 from inside its container (docker compose exec open-webui curl -s http://pipelines:9099), not from your laptop.
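
To tell a wrong URL from a wrong key, it can help to ask the server for its model list yourself, using the key Open WebUI is configured with. A sketch using only the standard library, assuming the Pipelines server answers /v1/models the way OpenAI-compatible servers do and takes the key as a Bearer token; the function names are mine.

```python
import json
import urllib.request


def models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build the GET an OpenAI-style client would send to list models."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def model_ids(response_body: bytes) -> list:
    """Pull the ids out of an OpenAI-style {"data": [{"id": ...}, ...]} listing."""
    return [m["id"] for m in json.loads(response_body).get("data", [])]


def list_pipelines(base_url: str, api_key: str) -> list:
    # urlopen raises HTTPError when the server rejects the request (a bad key,
    # for example) and URLError when the host cannot be reached at all.
    with urllib.request.urlopen(models_request(base_url, api_key), timeout=10) as resp:
        return model_ids(resp.read())
```

Run list_pipelines("http://pipelines:9099", "<your key>") from inside the Open WebUI container: an empty list means the connection is fine and no pipe has loaded, which points at the pipeline files and not the network.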

A syntax error in one file kills the model silently. The Pipelines server loads every file in the volume; one broken import and your pipe just won’t register, with the reason buried in the container logs. docker compose logs -f pipelines is your friend — tail it while you save the file.

Timeouts on the first request. A cold Ollama model can take longer to load than your timeout allows, so the very first call fails and every one after works. Set a generous timeout (120s) on the generate call, or warm the model with a throwaway request on startup.
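
The warm-up can live in the pipeline's on_startup hook. This sketch relies on Ollama loading a model when /api/generate receives a model name and no prompt; the keep_alive value is an arbitrary choice of mine, and the call is a plain blocking one, which is tolerable at startup but not something to copy into a request path.

```python
import json
import urllib.request


def warmup_request(base_url: str, model: str, keep_alive: str = "30m") -> urllib.request.Request:
    """POST with a model but no prompt, which asks Ollama to load it and return."""
    payload = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


class Pipeline:
    def __init__(self):
        self.name = "Docs Assistant (RAG)"
        self.ollama = "http://ollama:11434"

    async def on_startup(self):
        # Best effort: a failed warm-up should not stop the pipeline from loading.
        try:
            req = warmup_request(self.ollama, "llama3.1:8b")
            urllib.request.urlopen(req, timeout=300).close()
        except OSError as exc:  # URLError and socket timeouts are both OSError
            print(f"warm-up skipped: {exc}")
```

Ollama unloads idle models after a while, so the first request after a long quiet spell can still be slow; keep_alive only stretches that window.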

Retrieval returns nothing, so the model refuses to answer. With the strict “only from context” prompt, empty retrieval produces “I don’t know” — which looks like the pipe is broken when it’s actually the vector store that’s empty or the query embedding that’s wrong. Log the retrieved chunks while developing.
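
One way to make that failure obvious is to log what came back and refuse to build a prompt from nothing. A sketch, with a logger name and a grounded_prompt helper that are mine; it takes the chunks as a list of strings, before they are joined:

```python
import logging
from typing import Optional

log = logging.getLogger("docs_assistant")  # hypothetical logger name


def grounded_prompt(chunks: list, question: str) -> Optional[str]:
    """Return the strict prompt, or None when retrieval came back empty."""
    usable = [c.strip() for c in chunks if c and c.strip()]
    log.info("retrieved %d chunk(s) for %r", len(usable), question)
    for i, chunk in enumerate(usable):
        log.debug("chunk %d: %s", i, chunk[:200])
    if not usable:
        return None
    context = "\n\n".join(usable)
    return (
        "Answer using ONLY the context below. "
        "If it isn't there, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

In the pipe, a None result can short-circuit to a message such as "No matching documents were retrieved", which tells you where to look and saves a generation.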

Streaming shows raw JSON. You forgot to parse each line and yielded the whole payload. The iter_lines loop above extracts just the response field per chunk.

Is it worth it, and who’s it for?

For plain chat against a local model, no — you don’t need Pipelines at all, and the extra container is just maintenance. Reach for the built-in knowledge feature first and only move on when it genuinely can’t express what you need. This is the same “don’t over-build” instinct behind giving models hands and eyes with MCP servers: pick the lightest mechanism that solves the actual problem.

The value appears the instant you want grounding in private data, a real tool call, or model chaining the built-ins can’t express cleanly. The cost is real Python you have to maintain and, in the standalone setup, an extra service to keep patched — but you get the full programmability of a backend with the polish of a finished chat UI in front of it. For a homelab where you want local models actually wired into your stuff rather than answering trivia, it’s the right tool. For a single user who just wants to chat with a model, it’s overkill you’ll regret patching. Know which one you are before you write the first class Pipeline.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #llm #rag #selfhosted