Label Studio: Self-Hosted Data Annotation for Training Your Own Models

The unglamorous tool that decides whether your model is any good

By Smarc

25-12-2025 1855 words 9 min read

There’s a comforting lie in machine learning circles that the model is the hard part. It isn’t. The model is the bit with the nice papers and the GitHub stars. The hard part — the part that determines whether your classifier works or quietly humiliates you in production — is the labels. Garbage labels, garbage model, no exceptions. And labelling is tedious, error-prone, and almost always done in some horror of a spreadsheet that loses your work when the browser crashes.

Label Studio is the open-source antidote. It’s a web app for annotating data — text, images, audio, video, time series, the lot — built by people who clearly suffered through bad labelling tools first. The community edition is free, self-hostable, and good enough that I’ve never reached for the paid tier. If you’re training your own models on your own data, this is the workbench you’ve been missing.

I’ll be blunt about why self-hosting this specifically matters: your training data is often the most sensitive thing you own. It might be customer records, proprietary documents, medical images, or internal correspondence — precisely the material you cannot paste into a third-party SaaS annotation tool without a legal review and a leap of faith. Running the annotation workbench on your own hardware means the raw data never leaves your control, the labels live in a database you back up, and there’s no per-seat pricing pushing you to cut corners on how many people review each item. For anything involving data you’d be uncomfortable emailing to a stranger, self-hosted annotation isn’t a preference; it’s the only defensible option.

Standing it up

It’s a single Docker container with a Postgres database behind it. The all-in-one image ships with an embedded SQLite database, and it’s tempting to just run that — don’t, beyond a first ten-minute look. SQLite will fall over on any project with real concurrency or volume, and the whole point of an annotation tool is that several people hammer it at once. Wire up Postgres from the start. For a serious project you’ll also want that database on a volume that survives a container rebuild, because the one thing you cannot afford to lose is the labels.

services:
  label-studio:
    image: heartexlabs/label-studio:latest
    ports:
      - "8080:8080"
    environment:
      - LABEL_STUDIO_USERNAME=admin@example.com  # placeholder address
      - LABEL_STUDIO_PASSWORD=change-this-now
      - DJANGO_DB=default
      - POSTGRE_NAME=labelstudio
      - POSTGRE_USER=labelstudio
      - POSTGRE_PASSWORD=change-this-too
      - POSTGRE_HOST=db
      - POSTGRE_PORT=5432
    volumes:
      - ./ls-data:/label-studio/data
    depends_on:
      - db

  db:
    image: postgres:16
    environment:
      - POSTGRES_DB=labelstudio
      - POSTGRES_USER=labelstudio
      - POSTGRES_PASSWORD=change-this-too
    volumes:
      - ./pg-data:/var/lib/postgresql/data

Bring it up with docker compose up -d, log in at http://localhost:8080, and you’re looking at the project list. Each project gets a labelling config — a small XML dialect that defines what annotators see and what they produce. This is the clever bit: the same tool does sentiment tagging, bounding boxes, and named-entity recognition, just with a different config. There’s no separate “image tool” and “text tool”; you compose the interface you need from a library of tags, and the same export pipeline works underneath all of them.

One deployment note that will save you a support ticket to yourself: if you put Label Studio behind a reverse proxy on a subpath or a different hostname, set LABEL_STUDIO_HOST to the full external URL. Get it wrong and the login page loads but the API calls behind it point at the wrong origin, so annotations silently fail to save — the single most common “it’s broken and I can’t see why” report from anyone self-hosting this.

The labelling config

Here’s a config for named-entity recognition over text — the sort of thing you’d use to train a model to pull names and organisations out of documents:

<View>
  <Labels name="label" toName="text">
    <Label value="PERSON" background="#FFA39E"/>
    <Label value="ORG" background="#D4380D"/>
    <Label value="LOCATION" background="#FFC069"/>
  </Labels>
  <Text name="text" value="$text"/>
</View>

Annotators drag-select a span, click a label, and Label Studio records the character offsets. Import a JSON file containing a list of tasks, each with a text field (a CSV with a text column also works), and the tasks populate automatically. The exported annotations come back in a clean JSON shape with the spans, labels, and offsets, which is exactly what a tokeniser-based training pipeline wants once the character offsets are aligned to tokens.
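
To make the import and export shapes concrete, here is a minimal sketch in Python. It writes tasks as a plain JSON list and assumes the export follows Label Studio's standard JSON layout, where each task carries an annotations list whose result entries hold value.start, value.end, and value.labels. The helper names and the sample export record are illustrative and hand-written, so check them against a real export from your own version.

```python
import json

def make_tasks(texts):
    """Wrap raw strings as tasks; the "text" key matches $text in the config."""
    return [{"data": {"text": t}} for t in texts]

def extract_spans(exported_task):
    """Return (start, end, label) tuples from one exported task.

    Assumes the standard JSON export layout; verify against your own export.
    """
    spans = []
    for annotation in exported_task.get("annotations", []):
        for item in annotation.get("result", []):
            if item.get("type") != "labels":
                continue  # skip anything that is not a span label
            value = item["value"]
            for label in value["labels"]:
                spans.append((value["start"], value["end"], label))
    return spans

sentence = "Ada Lovelace worked with Charles Babbage in London."
tasks = make_tasks([sentence])
tasks_json = json.dumps(tasks, indent=2)  # save as a .json file and import it

# Hand-written example of what one exported task might look like.
sample_export = {
    "data": {"text": sentence},
    "annotations": [{
        "result": [
            {"from_name": "label", "to_name": "text", "type": "labels",
             "value": {"start": 0, "end": 12, "text": "Ada Lovelace",
                       "labels": ["PERSON"]}},
            {"from_name": "label", "to_name": "text", "type": "labels",
             "value": {"start": 44, "end": 50, "text": "London",
                       "labels": ["LOCATION"]}},
        ]
    }],
}
spans = extract_spans(sample_export)
```

The offsets are character positions into the original string, so sentence[44:50] gives back the annotated text and is a cheap sanity check before the spans go anywhere near a tokeniser.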

The mental model to hold is that a config always pairs an object tag (the thing being shown — Text, Image, Audio) with a control tag (the thing the annotator produces — Labels, Choices, RectangleLabels). The toName attribute wires a control to its object. Once that clicks, building a config for a new task type is mechanical: swap Text for Image, swap Labels for RectangleLabels, and you’ve turned a text-tagging project into a bounding-box project. Here’s that same NER project reshaped for image object detection:

<View>
  <Image name="image" value="$image"/>
  <RectangleLabels name="label" toName="image">
    <Label value="Cat" background="#FFA39E"/>
    <Label value="Dog" background="#D4380D"/>
  </RectangleLabels>
</View>

Nothing else about the workflow changes — same import, same export, same review loop. That reuse is the quiet reason to standardise on one tool rather than a different bespoke labelling script per project.
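
Because the object/control pairing is so regular, a config can even be generated. The sketch below is illustrative rather than anything Label Studio ships: build_config is a hypothetical helper that uses Python's standard xml.etree.ElementTree to emit a View holding one object tag and one control tag wired together by toName.

```python
import xml.etree.ElementTree as ET

def build_config(object_tag, control_tag, labels, data_key):
    """Build a labelling config pairing one object tag with one control tag."""
    view = ET.Element("View")
    # Object tag: what the annotator sees, filled from the task's data_key field.
    ET.SubElement(view, object_tag, name=data_key, value=f"${data_key}")
    # Control tag: what the annotator produces; toName wires it to the object.
    control = ET.SubElement(view, control_tag, name="label", toName=data_key)
    for value in labels:
        ET.SubElement(control, "Label", value=value)
    ET.indent(view)  # pretty-print; requires Python 3.9 or newer
    return ET.tostring(view, encoding="unicode")

ner_config = build_config("Text", "Labels", ["PERSON", "ORG", "LOCATION"], "text")
bbox_config = build_config("Image", "RectangleLabels", ["Cat", "Dog"], "image")
```

This version always places the object tag first and leaves out the background colours, so the rendered layout differs slightly from the hand-written configs above; the pairing itself is the same.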

Pre-labelling, which is the actual point

Manual labelling from scratch is soul-destroying at scale. The feature that makes Label Studio worth the setup is the ML backend: you connect a model that pre-fills predictions, and your annotators correct rather than create. Correcting a label takes a fraction of the time of producing one, and the quality goes up because people are reviewing rather than grinding. In practice I’ve seen this cut annotation time by more than half on a mature project, and the savings compound as each round of corrections trains a sharper pre-labeller for the next.

The mechanism is a small HTTP service you write and run alongside Label Studio. You subclass LabelStudioMLBase, implement a predict method that takes tasks and returns predictions in Label Studio’s result format, and register the service’s URL against your project. From then on, Label Studio calls your backend as new tasks arrive and shows whatever it returns as pre-filled, editable annotations. Because it’s just an HTTP endpoint, your model can be anything — a spaCy pipeline, a fine-tuned transformer, a rules engine, even a call out to a hosted LLM — as long as it emits the right JSON shape.

Stripped to its skeleton, such a backend looks like this; run_my_model stands in for whatever inference code you already have:

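from label_studio_ml.model import LabelStudioMLBase

def run_my_model(text):
    # Placeholder: swap in real inference. Each returned span must use Label
    # Studio's result format, with from_name/to_name matching the config.
    return []

class NERBackend(LabelStudioMLBase):
    def predict(self, tasks, **kwargs):
        results = []
        for task in tasks:
            # "text" is the data key referenced as $text in the labelling config
            spans = run_my_model(task["data"]["text"])
            # 0.9 is a hard-coded stand-in; report your model's real confidence
            results.append({"result": spans, "score": 0.9})
        return results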

Even a mediocre first model is useful here — it gets the obvious cases right, your humans fix the rest, and the corrected data trains a better model that pre-labels the next batch. That feedback loop is the whole game in applied ML, and Label Studio makes it concrete.
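
The part that usually trips people up is the shape of each item in result. As a hedged sketch for the NER config shown earlier: to_ls_result is a hypothetical helper that converts (start, end, label) tuples into span results, and toy_model is a deliberately silly stand-in for a real model. The from_name and to_name values assume the control is named label and the object is named text, as in that config.

```python
def to_ls_result(text, entities, from_name="label", to_name="text"):
    """Convert (start, end, label) tuples into Label Studio span results."""
    results = []
    for start, end, label in entities:
        results.append({
            "from_name": from_name,  # name of the <Labels> control tag
            "to_name": to_name,      # name of the <Text> object tag
            "type": "labels",
            "value": {
                "start": start,
                "end": end,
                "text": text[start:end],
                "labels": [label],
            },
        })
    return results

def toy_model(text):
    """Hypothetical stand-in for a real NER model: tags one hard-coded name."""
    name = "Ada Lovelace"
    start = text.find(name)
    return [(start, start + len(name), "PERSON")] if start >= 0 else []

text = "Notes by Ada Lovelace, 1843."
prediction = {"result": to_ls_result(text, toy_model(text)), "score": 0.5}
```

If the names in from_name and to_name do not match the config, the predictions tend not to show up in the editor at all, which makes this the first thing to check when pre-labels go missing.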

There’s a subtlety worth flagging so you don’t sabotage yourself: watch out for the model’s confident mistakes. A pre-label that’s wrong but plausible is more dangerous than a blank field, because a tired annotator will rubber-stamp it. Two habits guard against this. First, surface the prediction score and sort the review queue by low confidence, so humans spend their attention where the model is unsure. Second, sample a slice of the high-confidence predictions for manual audit anyway — if the model is quietly wrong in a consistent way, that’s exactly the bias you want to catch before it launders itself into your training set as “human-approved” ground truth.
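
Both habits fit in a few lines of Python. This is a sketch under stated assumptions, not a Label Studio feature: it takes predictions as plain dicts with task_id and score keys, and the 0.9 threshold and 10% audit fraction are arbitrary starting points to tune for your own data.

```python
import random

def build_review_queue(predictions, confident_at=0.9, audit_fraction=0.1, seed=0):
    """Split predictions into a low-confidence-first queue and an audit sample.

    predictions: list of dicts with "task_id" and "score" keys (assumed shape).
    """
    uncertain = [p for p in predictions if p["score"] < confident_at]
    confident = [p for p in predictions if p["score"] >= confident_at]
    # Least confident first, so attention goes where the model is unsure.
    queue = sorted(uncertain, key=lambda p: p["score"])
    # Random audit slice of the confident ones; at least one if any exist.
    sample_size = 0
    if confident:
        sample_size = max(1, round(len(confident) * audit_fraction))
    audit = random.Random(seed).sample(confident, sample_size)
    return queue, audit

predictions = [
    {"task_id": 1, "score": 0.97},
    {"task_id": 2, "score": 0.41},
    {"task_id": 3, "score": 0.88},
    {"task_id": 4, "score": 0.99},
    {"task_id": 5, "score": 0.12},
]
queue, audit = build_review_queue(predictions)
```

Here the queue comes out as tasks 5, 2, 3, and one of the two high-confidence tasks is pulled for audit. Fixing the seed keeps the audit sample reproducible, which matters if anyone later asks how the audited items were chosen.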

For large media projects — thousands of images or audio clips — do not import the files directly. Point Label Studio at cloud or S3-compatible object storage and let it stream references, so the database holds pointers and metadata rather than blobs. On a homelab that means standing up something like MinIO next to it; on the same box you’re probably already running for other self-hosted services, that’s a small addition rather than a new burden.
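
In task terms, "pointers rather than blobs" means the data field holds a storage URI instead of an uploaded file. A minimal sketch, assuming the project has an S3-compatible source storage connected so Label Studio can resolve s3:// references; the bucket name ls-media and the object keys are hypothetical.

```python
import json

def make_media_tasks(bucket, keys, data_key="image"):
    """Build tasks whose data field is an s3:// reference, not an uploaded file."""
    return [{"data": {data_key: f"s3://{bucket}/{key}"}} for key in keys]

# Hypothetical bucket and keys; replace with your own storage layout.
tasks = make_media_tasks("ls-media", ["cats/0001.jpg", "dogs/0002.jpg"])
tasks_json = json.dumps(tasks, indent=2)
```

The data_key has to match the variable in the labelling config, so image here pairs with value="$image" in the bounding-box config.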

The friction, and how to get past it

It’s not all tidy, and knowing the sharp edges up front saves the afternoons I lost to them.

The labelling config XML is poorly documented, and you’ll spend real time discovering which control pairs with which object tag. The fix is to lean on the built-in template gallery: start from the closest pre-made config and modify it, rather than writing one from a blank page. The templates encode the correct control/object pairings you’d otherwise reverse-engineer.

Multi-annotator workflows — where several people label the same item so you can measure agreement — exist, but the consensus and inter-annotator-agreement tooling is thin in the community edition, and a lot of the genuinely nice review and analytics features sit behind the paid tier. If measuring agreement is core to your project, budget for either the paid edition or a bit of your own scripting against the export JSON, which does contain everything you need to compute agreement yourself.
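
As an example of that scripting, here is a sketch of Cohen's kappa for two annotators on a single-choice classification project. It assumes the JSON export gives each annotation a completed_by user id and result items of type choices; some versions export completed_by as an object rather than an integer, so adjust the lookup to match your file. The sample export is hand-written for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' labels over the same items, same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same class independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

def labels_by_annotator(exported_tasks):
    """Map annotator id -> {task id -> first choice}, from an assumed export layout."""
    per_annotator = {}
    for task in exported_tasks:
        for annotation in task.get("annotations", []):
            for item in annotation.get("result", []):
                if item.get("type") == "choices":
                    user = annotation["completed_by"]
                    choice = item["value"]["choices"][0]
                    per_annotator.setdefault(user, {})[task["id"]] = choice
    return per_annotator

def _ann(user, choice):
    """Build one hand-written sample annotation."""
    return {"completed_by": user,
            "result": [{"type": "choices", "value": {"choices": [choice]}}]}

export = [
    {"id": 1, "annotations": [_ann(1, "pos"), _ann(2, "pos")]},
    {"id": 2, "annotations": [_ann(1, "pos"), _ann(2, "neg")]},
    {"id": 3, "annotations": [_ann(1, "neg"), _ann(2, "neg")]},
    {"id": 4, "annotations": [_ann(1, "neg"), _ann(2, "neg")]},
]
by_user = labels_by_annotator(export)
shared = sorted(set(by_user[1]) & set(by_user[2]))  # tasks both people labelled
kappa = cohens_kappa([by_user[1][t] for t in shared],
                     [by_user[2][t] for t in shared])
```

On this sample the two annotators agree on three of four items (0.75 observed) against 0.5 expected by chance, giving a kappa of 0.5. Span and bounding-box tasks need a different agreement measure, since "same label" is no longer a simple equality check.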

Performance on very large image projects can crawl if you don’t put the media behind object storage, as covered above — this is the single biggest self-inflicted wound. And the permissions model is basic: projects and organisations exist, but fine-grained access control does not, so if you’re running it for an external team, put it behind your own authentication layer and mind what each account can see. Treat the built-in login as “who can annotate”, not “a security boundary”.

One more operational trap: back up the Postgres database, not just the ls-data volume. The volume holds uploads and config; the database holds the annotations themselves — the entire point of the exercise. A backup that captures the files but not the database is a backup of everything except the thing you can’t recreate.
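
A database backup can be as small as a scheduled pg_dump. The sketch below wraps it in Python and assumes the compose file shown earlier (a service called db, with user and database both named labelstudio) and that it runs from the directory containing that compose file. The function names and the filename pattern are illustrative choices, not a Label Studio convention.

```python
import subprocess
from datetime import datetime, timezone

def pg_dump_command(service="db", user="labelstudio", database="labelstudio"):
    """Build a docker compose command that dumps the database to stdout."""
    # -T disables TTY allocation so the output can be redirected to a file.
    return ["docker", "compose", "exec", "-T", service,
            "pg_dump", "-U", user, database]

def backup_filename(now=None):
    """Timestamped name such as labelstudio-20251225T120000Z.sql."""
    now = now or datetime.now(timezone.utc)
    return f"labelstudio-{now:%Y%m%dT%H%M%SZ}.sql"

def run_backup(directory="."):
    """Run the dump into a timestamped file; raises if pg_dump exits non-zero."""
    path = f"{directory}/{backup_filename()}"
    with open(path, "wb") as out:
        subprocess.run(pg_dump_command(), stdout=out, check=True)
    return path
```

Call run_backup() from cron or a systemd timer and copy the resulting file off the machine. A failed dump raises an exception but can leave a partial file behind, so clean those up, and restore a dump into a scratch database now and then to confirm the backups are usable.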

The verdict

If you are doing any serious supervised learning on your own data, you need an annotation tool, and a spreadsheet is not it. Label Studio is the best self-hostable option I’ve found — flexible across data types, free for the core workflow, and built around the pre-label-then-correct loop that actually scales. It’s overkill if you’re labelling fifty examples for a toy project; just use a script. But the moment you’re labelling thousands of items, or doing it with more than one person, the setup pays for itself within a day.

Who’s it for? Anyone who has crossed from “playing with a pretrained model” into “training or fine-tuning on my own data” — the point where label quality, not model architecture, becomes the thing that decides success. It slots naturally into the same self-hosted stack you’d already be running: it’s the same instinct that makes you keep your photos on your own server with Immich rather than a cloud you don’t control, and the same “own the tool, own the data” logic behind uptime monitoring you host yourself. Your training data is more sensitive than either; keeping the annotation workbench in-house is the obviously right call. It’s the unglamorous infrastructure that quietly decides whether your model is good, and it deserves more love than it gets.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #machine-learning #self-hosting
