Paperless-ngx: A Paperless Office That Actually Works

Scan once, find anything, never lose a document again

12-02-2024

I have owned three filing cabinets in my life. Each one followed the same arc: pristine and hopeful for a fortnight, then a graveyard of bank statements I will never read, slowly fossilising under a pile of takeaway menus. The promise of the “paperless office” was sold to me decades ago and never delivered, because the missing piece was never the scanner. It was knowing where anything went afterwards. A scanned PDF named scan_0047.pdf sitting in a folder is not filed; it’s just lost in a tidier way.

Paperless-ngx is the piece that was missing. It is the first system I have used that turns a heap of scanned PDFs into something I can actually search, and it has quietly replaced every filing cabinet, shoebox, and “important_FINAL_v2.pdf” folder I once relied on. Three years in, I have found a warranty receipt mid-argument with a shop assistant, from my phone, in under a minute. That is the whole pitch.

What it actually is

Paperless-ngx is a self-hosted document management system. You feed it documents — scans, PDFs, the odd email export — and it does three useful things in a row. It runs OCR over every page so the text inside is searchable. It indexes that text for full-text search. And it lets you organise everything with tags, correspondents (who sent it), and document types (invoice, payslip, warranty, that sort of thing).

The clever bit is that you rarely file anything by hand. Paperless watches a consume folder: drop a file in, walk away, and a minute later it appears in your library, OCR’d and tagged. Your scanner does the scanning, Paperless does the filing.

It is the spiritual successor to the original Paperless and Paperless-ng projects, both of which went quiet. The “-ngx” fork is the one that is actively maintained, and the one you want — if a guide points you at either of the older repos, close the tab.

Why self-host it at all

The obvious question is why not just use a cloud document service. The answer is that your documents are the single most sensitive pile of data you own: bank statements, tax returns, medical letters, passport scans. Handing all of that to a third party whose business model you don’t control is a bargain worth questioning. Self-hosting keeps it on hardware you own, encrypted at rest if you set it up that way, reachable only inside your own network unless you deliberately expose it. It’s the same reasoning I applied to running my own photo library and to self-hosted personal finance: the app is worth having, but the ownership is the point. The flip side, which I’ll come to, is that ownership means you now hold the backups too.

Standing it up

It runs in Docker and needs a few moving parts: the app itself, a Redis broker for its task queue, and a database. SQLite works for a small single-user setup, but Postgres is the sensible default once you’re past a few hundred documents. Here is a trimmed docker-compose.yml that gets you the lot:

services:
  broker:
    image: redis:7
    restart: unless-stopped
    volumes:
      - redisdata:/data

  db:
    image: postgres:16
    restart: unless-stopped
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: change-me

  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    restart: unless-stopped
    depends_on:
      - db
      - broker
    ports:
      - "8000:8000"
    volumes:
      - data:/usr/src/paperless/data
      - media:/usr/src/paperless/media
      - ./consume:/usr/src/paperless/consume
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBPASS: change-me
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_URL: https://paperless.mylab.local
      PAPERLESS_CONSUMPTION_DIR: /usr/src/paperless/consume

volumes:
  redisdata:
  pgdata:
  data:
  media:

Bring it up with docker compose up -d, then create your first user from the host:

docker compose run --rm webserver createsuperuser

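Log in on port 8000 and you have an empty, slightly intimidating library staring back. One caveat: PAPERLESS_URL should match the address you actually type into the browser, because Paperless derives its allowed hosts and trusted CSRF origins from it. If you are browsing to a plain http://host:8000 address rather than going through a reverse proxy, adjust or drop that line. The PAPERLESS_OCR_LANGUAGE setting matters too: set it to whatever your documents are actually in (eng, deu, fra, or several at once like eng+deu), because OCR quality is the foundation everything else is built on. Get it wrong and your German utility bills index as gibberish.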

The actual workflow

Here is the loop I run every Sunday with the week’s post.

1. Scan to the consume folder. My document scanner dumps PDFs straight onto a network share that maps to ./consume. No app, no manual upload. If you haven’t got a fancy scanner, your phone’s scan feature and a synced folder do the same job.
2. Paperless consumes. Within a minute the file is OCR’d and pulled into the library. The original is preserved untouched; Paperless adds a searchable text layer alongside it.
3. Auto-tagging. This is where it earns its keep. You teach Paperless matching rules on tags, correspondents and types. A correspondent like “British Gas” can be set to match automatically whenever those words appear. There’s also an “auto” matching mode that learns from how you’ve filed before, so the more you correct it, the less you have to.
4. Search. Full-text, instant, fuzzy enough to forgive a wonky OCR character. “council tax 2023”, “boiler warranty”, “that invoice from the plumber”: all findable in seconds, by content, not by remembering which folder you buried it in.
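If you script the first step yourself rather than letting a scanner write to the share, one thing worth guarding against is Paperless picking up a file that is only half copied. A minimal sketch, assuming a staging directory (a hypothetical name, not a Paperless setting) that sits on the same filesystem as the consume folder so the final move is atomic:

```python
import os
import shutil
from pathlib import Path


def drop_into_consume(source: Path, consume_dir: Path, staging_dir: Path) -> Path:
    """Copy `source` so that it only ever appears in `consume_dir` complete."""
    staging_dir.mkdir(parents=True, exist_ok=True)

    # Slow part: copy the bytes somewhere the consumer is not watching.
    partial = staging_dir / (source.name + ".part")
    shutil.copyfile(source, partial)

    # Fast part: os.replace is a single rename when both paths are on the
    # same filesystem, so the consumer never sees a truncated file.
    # Note that it silently overwrites an existing file of the same name.
    target = consume_dir / source.name
    os.replace(partial, target)
    return target
```

Calling drop_into_consume(Path("scan_0047.pdf"), Path("./consume"), Path("./staging")) would leave the finished file in ./consume and nothing behind in ./staging. If the two directories are on different filesystems the rename stops being atomic, so keep them side by side.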

The thing that genuinely changed my habits is that I stopped caring about folder structure entirely. Tags are not a hierarchy you have to plan; they are labels you can stack, and search covers everything else.

The features that earn their keep once you’re in

Two things turn Paperless from “searchable scan pile” into something you actually rely on. The first is custom fields and storage paths: you can attach a real date, an amount, or an invoice number to a document and then filter on it, so “every invoice over £100 from last quarter” is a saved view rather than a memory game, while storage paths decide how the files are laid out on disk. The second is the REST API and mobile ingestion. There are community-built mobile apps with a share target, so a photo of a receipt taken at a till lands in the consume flow before you’ve left the shop. I’ve wired a scanner, a phone, and an email address that forwards attachments straight into consumption: three input paths, one library. Paperless can also keep an audit log of changes, so correcting a mis-tagged document never loses the original filing history. None of this is essential on day one, but it’s the difference between a tool you tolerate and one you trust with the paperwork you can’t afford to lose.
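To give a feel for the REST API, here is a minimal sketch of a full-text search using only the Python standard library. It assumes token authentication and the /api/documents/ endpoint with a query parameter; the base URL and token are placeholders, and the function names are mine, not part of Paperless.

```python
import json
import urllib.parse
import urllib.request


def build_search_request(base_url: str, token: str, query: str) -> urllib.request.Request:
    """Build (but do not send) a full-text search request."""
    params = urllib.parse.urlencode({"query": query})
    url = f"{base_url.rstrip('/')}/api/documents/?{params}"
    headers = {
        "Authorization": f"Token {token}",  # token created in the web UI
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)


def search_titles(base_url: str, token: str, query: str) -> list[str]:
    """Send the search and return the titles on the first page of results."""
    request = build_search_request(base_url, token, query)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    # Responses are paginated; matching documents sit under "results".
    return [document["title"] for document in payload["results"]]
```

Something like search_titles("http://localhost:8000", "your-token", "boiler warranty") should return the matching titles, assuming the instance is reachable at that address. Only the first page is read here; a fuller client would follow the "next" link in the payload.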

Getting OCR right, because everything depends on it

OCR is the load-bearing wall of the whole system: if the extracted text is wrong, search is wrong, matching is wrong, and the library becomes a heap of scans you can’t find. A few things move the needle. Scan at 300 DPI — higher wastes space and processing time for no accuracy gain, lower starts dropping small print. Scan in colour or greyscale rather than harsh black-and-white if your documents have faint stamps or coloured highlighting, because thresholding a bad scan to pure black-and-white destroys detail OCR needs. Paperless leans on Tesseract under the hood, and Tesseract is only as good as the image you feed it; a crumpled, skewed phone photo will produce garbage that no amount of clever tagging can rescue.
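The cost side of that DPI advice is easy to put numbers on. This is a small sketch of the arithmetic only: it shows how many pixels an A4 page turns into at a given resolution, and says nothing about accuracy, which depends on the document.

```python
MM_PER_INCH = 25.4


def scan_pixels(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page scanned at `dpi` dots per inch."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


def raw_megabytes(width_mm: float, height_mm: float, dpi: int,
                  bytes_per_pixel: int = 1) -> float:
    """Uncompressed image size; 1 byte per pixel is 8-bit greyscale."""
    width_px, height_px = scan_pixels(width_mm, height_mm, dpi)
    return width_px * height_px * bytes_per_pixel / 1_000_000


if __name__ == "__main__":
    for dpi in (150, 300, 600):
        w, h = scan_pixels(210, 297, dpi)  # A4 is 210 x 297 mm
        print(f"{dpi} DPI: {w} x {h} px, {raw_megabytes(210, 297, dpi):.1f} MB raw")
```

An A4 page at 300 DPI comes out at 2480 x 3508 pixels. Doubling to 600 DPI roughly quadruples the pixel count, and with it the work Tesseract has to do per page.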

Paperless also has a PAPERLESS_OCR_MODE worth knowing about. The default (skip) leaves existing text layers alone and only OCRs pages that lack one — sensible, and fast. If you’re importing a backlog of PDFs that already claim to have text but the text is rubbish (badly produced digital-native PDFs are a real thing), switch to redo or force for that batch so Paperless regenerates a clean text layer rather than trusting the bad one. Get this foundation right on day one and the rest of Paperless just works; get it wrong and you’ll spend weeks wondering why search keeps missing documents you know are in there.

Troubleshooting the common snags

Four things reliably go wrong in the first month.

Documents land but never get OCR’d. This is almost always down to permissions on the consume folder. The webserver container has to be able to read (and delete) files dropped by your scanner or sync tool; a mismatched UID between the two silently strands files. If permissions check out, the share may simply not emit inotify events, which the default file watcher relies on and network shares frequently lack; in that case set PAPERLESS_CONSUMER_POLLING to a polling interval in seconds.
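A quick way to spot the UID mismatch is to list the files in the consume folder that belong to someone other than the user the container runs as. A rough diagnostic sketch, run on the host; the UID to compare against is an assumption (the Docker image commonly runs as 1000 unless you have set USERMAP_UID), so check your own setup first.

```python
from pathlib import Path


def files_not_owned_by(consume_dir: Path, expected_uid: int) -> list[Path]:
    """Return files under `consume_dir` whose owner is not `expected_uid`."""
    return sorted(
        path
        for path in consume_dir.rglob("*")
        if path.is_file() and path.stat().st_uid != expected_uid
    )
```

Running files_not_owned_by(Path("./consume"), 1000) and getting a non-empty list is a hint, not a verdict: a file owned by another user can still be consumed if group or world permissions allow reading and deleting it. It does tell you where to look.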

OCR is agonisingly slow. OCR is CPU-hungry. A Raspberry Pi works but will crawl through a big batch; import forty documents and it’ll still be chewing at bedtime. Anything with a real processor is dramatically happier. If you’re bulk-importing a backlog, do it overnight and don’t watch.

The matcher tags everything as the wrong correspondent. Early on the matcher is confidently wrong. Switch aggressive “any word” matches to “all” or exact matching, and lean on the auto mode after you’ve hand-corrected a few dozen documents so it has something to learn from.
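The difference between the matching modes is easier to see in code. This is a deliberately simplified illustration of the idea behind “any”, “all” and exact matching, not Paperless’s actual implementation:

```python
import re


def _words(text: str) -> set[str]:
    """Lower-cased set of words in `text`."""
    return set(re.findall(r"\w+", text.lower()))


def matches(rule: str, mode: str, content: str) -> bool:
    """Decide whether `rule` matches `content` under a given mode."""
    if mode == "any":
        # One shared word is enough, which is why this mode over-matches.
        return bool(_words(rule) & _words(content))
    if mode == "all":
        # Every word must appear, in any order and anywhere in the text.
        return _words(rule) <= _words(content)
    if mode == "exact":
        # The whole phrase must appear as written, on word boundaries.
        pattern = r"\b" + re.escape(rule.lower()) + r"\b"
        return re.search(pattern, content.lower()) is not None
    raise ValueError(f"unknown mode: {mode}")
```

With the rule “British Gas”, a bill reading “Your gas bill from Scottish Power” matches under “any” because of the single word “gas”, but not under “all” or exact. That one shared word is the usual reason a correspondent ends up attached to everything.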

A document is searchable by title but not content. That’s an OCR-language mismatch — the text layer is garbage. Fix PAPERLESS_OCR_LANGUAGE and re-run the document through the “redo OCR” action.

The honest trade-offs

It is not free in effort. The first few weeks involve correcting the matcher, and bulk-importing a backlog of old documents is a tedious afternoon you will not enjoy. Because it’s self-hosted, your documents are now your responsibility: back up the media volume and a database dump, or you’ve built a single point of failure for your entire paper life. I run a nightly export to a second machine using the same discipline I described in Borg vs restic, and sleep fine.

There is also a soft lock-in worry that is, on inspection, unfounded — Paperless stores your original files on disk in plain folders. If the project vanished tomorrow, your PDFs are still right there, unencrypted and openable. The database only holds the searchable metadata; the documents themselves never become hostage to the software. Better still, the built-in document exporter dumps everything — files plus a manifest of all your tags, correspondents and types — into a portable archive you can re-import into a fresh instance or simply keep as an off-site copy. Between the plain-file storage and the exporter, the escape hatch is real and I’ve tested it, which is more than I can say for a lot of “your data is yours” cloud promises.
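An export you have never inspected is a hope rather than a backup. As a sanity check, this sketch counts the records in the exporter’s manifest by type. It assumes the export directory contains a single manifest.json holding a list of records that each carry a "model" key such as "documents.document"; verify that against your own export before relying on it.

```python
import json
from collections import Counter
from pathlib import Path


def summarise_manifest(export_dir: Path) -> Counter:
    """Count exported records per model name."""
    manifest_path = export_dir / "manifest.json"
    records = json.loads(manifest_path.read_text(encoding="utf-8"))
    return Counter(record["model"] for record in records)
```

Comparing summarise_manifest(Path("./export"))["documents.document"] with the document count shown in the web UI is a cheap way to notice an export that silently stopped short.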

Is it worth it, and who’s it for?

If you deal with more than a trickle of paper — anyone running a household with bills and warranties, a freelancer drowning in invoices, a small business without a fancy DMS — yes, unreservedly. The payoff is the day you need a receipt for a warranty claim and you find it in ten seconds instead of forty minutes and a strop.

If you get three letters a year and live happily out of a Gmail account, this is overkill and you should not bother; the setup and the ongoing backup discipline will outweigh the benefit. Paperless-ngx rewards people with a genuine volume problem and a tolerance for a weekend of Docker. I am, regrettably, exactly that person, and my filing cabinets have gone to the tip. Good riddance.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #self-hosting #documents #docker #productivity #ocr