Open WebUI Pipelines: Chaining Local Models with Tools and RAG

Where the chat UI stops and your own Python takes over

Open WebUI is the front end most people slap in front of Ollama and call it a day — a tidy ChatGPT-alike that talks to local models. That’s fine until you want the model to do something: hit your internal docs, call an API, run a query, or chain a couple of models together. The built-in features cover some of this, but the real escape hatch is Pipelines: a separate plugin server that lets you insert arbitrary Python into the request flow. It’s the difference between “chat with a model” and “wire a model into your systems.”

Pipelines come in two shapes and the distinction matters.

A filter sits beside an existing model and hooks the request on the way in (inlet) and the response on the way out (outlet). The model still does the talking; you’re modifying context or post-processing. Good for injecting RAG context, redacting secrets, logging, or enforcing a token budget.

A pipe is the model from Open WebUI’s point of view — it shows up in the model dropdown as if it were an Ollama model, but behind it is your code. The pipe decides everything: which real model to call, whether to call several, whether to hit a tool first. This is how you chain models or build an agent.
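
To make the filter shape concrete, here is a minimal sketch of a redacting filter. It assumes the conventions used by the example filters in the Pipelines repository (a type of "filter" plus async inlet and outlet methods that take and return the request body), so check them against the version you run. The SECRET pattern and the name are made up for illustration, and a real filter would normally also declare valves saying which models it attaches to.

```python
import asyncio
import re
from typing import Optional

# Hypothetical pattern: anything shaped like "sk-" plus 16 or more characters
SECRET = re.compile(r"sk-[A-Za-z0-9]{16,}")


class Pipeline:
    def __init__(self):
        self.type = "filter"  # assumed marker for a filter rather than a pipe
        self.name = "Secret Redactor"

    async def inlet(self, body: dict, user: Optional[dict] = None) -> dict:
        # Runs before the request reaches the model: scrub each message in place
        for msg in body.get("messages", []):
            if isinstance(msg.get("content"), str):
                msg["content"] = SECRET.sub("[REDACTED]", msg["content"])
        return body

    async def outlet(self, body: dict, user: Optional[dict] = None) -> dict:
        # Runs after the model replies; this sketch passes the response through
        return body
```

Because inlet is plain Python, you can exercise it without the server: asyncio.run(Pipeline().inlet(body)) returns the scrubbed body.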

Pipelines is a standalone container that Open WebUI connects to as an OpenAI-compatible endpoint. Compose makes it tidy:

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports: ["3000:8080"]
    environment:
      # Assumes an Ollama service named "ollama" on the same Compose network
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - webui:/app/backend/data

  pipelines:
    image: ghcr.io/open-webui/pipelines:main
    ports: ["9099:9099"]
    volumes:
      - pipelines:/app/pipelines

volumes:
  webui:
  pipelines:

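Then in Open WebUI under Admin Panel → Settings → Connections, add an OpenAI API connection pointing at http://pipelines:9099 with the API key 0p3n-w3bu! (that is the default, so change it). Any pipe you drop into the pipelines volume appears as a selectable model once the Pipelines server has loaded it; filters attach to existing models rather than showing up in the dropdown.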
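
If the connection doesn’t seem to take, a quick check from the host can tell you whether the Pipelines server is answering at all. This is a sketch that assumes the server exposes the usual OpenAI-style /v1/models route with bearer auth, and that port 9099 is published as in the Compose file above. It uses only the standard library so it runs on the host without extra packages; model_ids and list_pipelines are illustrative names.

```python
import json
import urllib.request

PIPELINES_URL = "http://localhost:9099"  # host-side port from the Compose file
API_KEY = "0p3n-w3bu!"                   # the default key; use yours if changed


def model_ids(payload: dict) -> list[str]:
    """Pull the ids out of an OpenAI-style model list response."""
    return [m["id"] for m in payload.get("data", [])]


def list_pipelines(base_url: str = PIPELINES_URL, key: str = API_KEY) -> list[str]:
    """Ask the Pipelines server which pipelines it currently exposes."""
    req = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": f"Bearer {key}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return model_ids(json.load(resp))
```

Calling list_pipelines() should return one id per loaded pipeline. An empty list suggests the server is up but has nothing in its volume yet, while an HTTP error usually points at the key or the URL.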

Here’s the shape of a pipe that retrieves context from a vector store, then answers with a local model. The Pipelines server loads any file in its volume that exposes a Pipeline class with a pipe method:

import requests

class Pipeline:
    def __init__(self):
        # Name shown in Open WebUI's model dropdown
        self.name = "Docs Assistant (RAG)"
        self.ollama = "http://ollama:11434"

    def retrieve(self, query: str) -> str:
        # Hit a local vector DB and return the top chunks
        r = requests.post("http://qdrant-api:8000/search",
                          json={"q": query, "k": 4}, timeout=10)
        r.raise_for_status()  # fail loudly instead of parsing an error body
        return "\n\n".join(c["text"] for c in r.json()["chunks"])

    def pipe(self, user_message, model_id, messages, body):
        # Ground the question in retrieved context before the model sees it
        context = self.retrieve(user_message)
        prompt = (
            "Answer using ONLY the context below. "
            "If it isn't there, say you don't know.\n\n"
            f"Context:\n{context}\n\nQuestion: {user_message}"
        )
        r = requests.post(f"{self.ollama}/api/generate", json={
            "model": "llama3.1:8b",
            "prompt": prompt,
            "stream": False,  # wait for the whole answer, return one string
        }, timeout=120)
        r.raise_for_status()
        return r.json()["response"]

Drop that file in, select “Docs Assistant (RAG)” from the model menu, and every question now gets grounded in your own documents before the model sees it. Swap the retrieve call for a requests.get against any internal API and you’ve got tool use; call two different models in sequence and you’ve got a chain — say, a small fast model to classify intent, then route to a larger one only when needed.
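
The classify-then-route chain might look roughly like this. Treat it as a sketch under assumptions: the model tags are placeholders for whatever you have pulled, the SIMPLE/COMPLEX labels and the route helper are invented for illustration, and the ask parameter exists only so the routing logic can be tried without a running Ollama.

```python
import requests

OLLAMA = "http://ollama:11434"
SMALL, LARGE = "llama3.2:1b", "llama3.1:8b"  # assumed tags; substitute your own


def generate(model: str, prompt: str) -> str:
    """One non-streaming call to Ollama's /api/generate endpoint."""
    r = requests.post(
        f"{OLLAMA}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["response"]


def route(question: str, ask=generate) -> str:
    """Classify with the small model, then answer with the model it picks."""
    label = ask(
        SMALL,
        "Reply with one word, SIMPLE or COMPLEX, for this question:\n" + question,
    )
    # Anything that isn't clearly COMPLEX stays on the cheap model
    model = LARGE if "COMPLEX" in label.upper() else SMALL
    return ask(model, question)
```

Inside a pipe you would just return route(user_message). Defaulting to the small model when the label is unclear is a design choice, not a rule: flip it if wrong answers cost you more than latency.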

Note the strict instruction in the prompt: answer only from the context and admit ignorance otherwise. That one line does most of the work in keeping a local model honest — without it, an 8B model will cheerfully hallucinate over thin retrieval. Grounding is only as good as both the chunks you fetch and the discipline you impose on how they’re used.
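
One way to back that instruction up in code is to refuse to call the model at all when retrieval comes back thin. The sketch below is an assumption-laden illustration rather than anything Pipelines provides: build_prompt and the min_chars threshold are hypothetical, and the right cut-off depends on your chunk sizes.

```python
from typing import Optional


def build_prompt(question: str, chunks: list[str], min_chars: int = 40) -> Optional[str]:
    """Return a grounded prompt, or None when there is too little context to trust."""
    # Drop empty or whitespace-only chunks before judging how much we have
    context = "\n\n".join(c.strip() for c in chunks if c.strip())
    if len(context) < min_chars:
        return None
    return (
        "Answer using ONLY the context below. "
        "If it isn't there, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

In the pipe, a None result can short-circuit to a fixed “I don’t know” reply, which saves a model call and removes the temptation to improvise.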

Returning a string works, but for a responsive UI you want to stream. Yield strings instead of returning one, set "stream": True in the Ollama request body, pass stream=True to requests so the response isn’t buffered, and forward each chunk:

import json  # goes at the top of the file, next to requests

def pipe(self, user_message, model_id, messages, body):
    r = requests.post(f"{self.ollama}/api/generate", json={
        "model": "llama3.1:8b", "prompt": user_message, "stream": True,
    }, stream=True, timeout=120)
    r.raise_for_status()
    # Ollama streams one JSON object per line; forward only the text
    for line in r.iter_lines():
        if line:
            yield json.loads(line).get("response", "")
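
The streaming pipe above sends only the latest user message, so the model never sees earlier turns. If you want the conversation carried along, one option is Ollama’s /api/chat endpoint, which accepts the messages list directly. The helper below is a sketch of the parsing half, assuming each streamed line is a JSON object with the text under message.content and a done flag on the last one; chat_chunks is an illustrative name.

```python
import json
from typing import Iterable, Iterator


def chat_chunks(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text from each streamed /api/chat line, stopping at the last."""
    for line in lines:
        if not line:
            continue  # skip keep-alive blank lines
        event = json.loads(line)
        text = event.get("message", {}).get("content", "")
        if text:
            yield text
        if event.get("done"):
            break

# Inside a pipe method, usage would look something like:
#   r = requests.post(f"{self.ollama}/api/chat",
#                     json={"model": "llama3.1:8b", "messages": messages,
#                           "stream": True},
#                     stream=True, timeout=120)
#   r.raise_for_status()
#   yield from chat_chunks(r.iter_lines())
```

Keeping the parser separate from the HTTP call means you can test it against canned lines before pointing it at a live model.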

You can also expose tunables to the UI via a Valves class, so things like the model name or retrieval k become editable settings rather than hard-coded constants — handy when you don’t want to redeploy to change a number.
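
As a rough idea of what that looks like, the sketch below nests a Valves class inside the pipeline, assuming (as the Pipelines examples appear to) that it is a pydantic model instantiated in __init__. The field names OLLAMA_MODEL and TOP_K and the search_payload helper are invented here; name them whatever suits your pipeline.

```python
from pydantic import BaseModel


class Pipeline:
    class Valves(BaseModel):
        # Each field is meant to surface as an editable setting in the UI
        OLLAMA_MODEL: str = "llama3.1:8b"
        TOP_K: int = 4

    def __init__(self):
        self.name = "Docs Assistant (RAG)"
        self.valves = self.Valves()  # start from the defaults above

    def search_payload(self, query: str) -> dict:
        # Read the tunable at call time so a changed valve takes effect
        return {"q": query, "k": self.valves.TOP_K}
```

The point of reading self.valves.TOP_K inside the method, rather than copying it in __init__, is that a value changed in the UI is picked up on the next request.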

For plain chat against a local model you don’t need Pipelines at all, and the extra container is just maintenance. The value appears the instant you want grounding in private data, a tool call, or model chaining that Open WebUI’s built-in RAG and function features can’t express cleanly. Those built-ins handle the common cases well, so reach for Pipelines only when you hit their edges. The cost is real Python you have to maintain and an extra service to keep patched, but you get the full programmability of a backend with the polish of a finished chat UI in front of it. For a homelab where you want local models actually wired into your stuff rather than just answering trivia, it’s the right tool.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.