OCR Pipelines: Tesseract, PaddleOCR, and When to Use Which

Turning pixels into text without paying a cloud API per page

By Smarc

05-01-2026 1854 words 9 min read

I have a filing cabinet’s worth of scanned documents — receipts, appliance manuals, the occasional important letter from an institution that still believes firmly in paper — and a stubborn refusal to feed them through a cloud OCR service that charges per page and keeps a copy of everything for its own purposes. So I run optical character recognition locally, on my own hardware, and I’ve done it long enough to have opinions. The good news is that self-hosted OCR has quietly become genuinely excellent, good enough that “just use a cloud API” is no longer the obvious answer it was five years ago. The bad news is that there are two serious contenders, they’re good at different things, and the internet is full of people insisting their favourite is universally best. It isn’t, in either direction. Let me save you the afternoon I lost finding that out the hard way.

The two tools are Tesseract, the venerable Google-stewarded engine that traces back to the 1980s at Hewlett-Packard and has been rewritten more than once since, most consequentially into an LSTM neural recogniser in version 4; and PaddleOCR, a much newer deep-learning toolkit from Baidu’s PaddlePaddle ecosystem, now on its 3.x line. The one-sentence difference that explains almost everything else: Tesseract is a single-job tool — give it a clean image of text, get text back — whereas PaddleOCR is a full pipeline that first detects where the text is and only then reads it. That detection stage sounds like a minor implementation detail. It is, in practice, the entire ballgame.

Tesseract: fast, simple, and fussy about input

Tesseract is an apt install away and the command-line interface is trivial. For clean, well-aligned documents — a PDF page exported at 300 DPI, a screenshot, anything with horizontal black text on a white background — it is genuinely brilliant and fast, and reaching for anything heavier is a waste.

$ sudo apt install tesseract-ocr tesseract-ocr-eng
$ tesseract receipt.png stdout --psm 6
SAINSBURY'S
Wholemeal bread        1.10
Semi-skimmed milk      1.45
TOTAL                  2.55

That --psm flag is page segmentation mode, and it is the single biggest difference between a happy Tesseract and a stream of gibberish. --psm 6 says “assume a single uniform block of text”, which is right for a receipt or a single-column page. --psm 3 is fully automatic segmentation, the default, and fine for mixed pages. For a single line (a sign, a label, one field), use --psm 7. Getting this one flag wrong is far and away the most common reason people declare Tesseract useless and storm off. The second most common reason is that it expects clean input and has no real text-detection stage of its own: feed it a skewed phone photo of a curved receipt under kitchen lighting and it falls apart, because it assumes you have already isolated and de-skewed the text for it. That assumption is the crux of the whole comparison.

From Python it’s a one-liner via pytesseract, and in practice you pair it with a clean-up step in OpenCV first (thresholding below, plus deskewing when the scan is crooked), because doing Tesseract’s preparation work for it is how you get good results out of imperfect scans:

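import cv2
import pytesseract

img = cv2.imread("receipt.png")
if img is None:  # cv2.imread returns None rather than raising on a bad path
    raise FileNotFoundError("receipt.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Otsu picks the threshold itself, so the 0 passed here is ignored
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
text = pytesseract.image_to_string(thresh, config="--psm 6")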

Otsu’s method there picks a binarisation threshold automatically from the image’s own histogram, which handles the “this scan is a bit grey” problem without you hand-tuning a magic number. If you need structured output rather than a text blob — bounding boxes, per-word confidence — Tesseract can give it to you via image_to_data, but you’ll do more assembly work than PaddleOCR asks for, which is the natural segue.
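To make the histogram idea concrete, here is a minimal pure-Python sketch of what Otsu's method computes: try every possible cut and keep the one that maximises the between-class variance of the two pixel groups it creates. The name otsu_threshold and the toy histogram are mine, purely for illustration; in a real pipeline cv2.threshold with cv2.THRESH_OTSU does this for you.

```python
def otsu_threshold(hist):
    """Return the grey level t that best splits a 256-bin histogram.

    Pixels with value <= t form one class, the rest form the other.
    The best t maximises the between-class variance.
    """
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0    # number of pixels at or below the candidate cut
    sum_low = 0  # sum of their grey levels
    for t, count in enumerate(hist):
        w_low += count
        sum_low += t * count
        if w_low == 0:
            continue          # nothing below the cut yet
        w_high = total - w_low
        if w_high == 0:
            break             # everything is below the cut, no split left
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        var_between = w_low * w_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Toy "greyish scan": dark ink around level 40, dull paper around level 180.
hist = [0] * 256
hist[40] = 200
hist[180] = 800
cut = otsu_threshold(hist)
print(cut)  # lands somewhere between the ink and the paper
```

The point of the exercise is that nothing here depends on a hand-picked number: shift the whole histogram brighter or darker and the cut moves with it.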

PaddleOCR: heavier, but it copes with the real world

PaddleOCR’s headline feature is that separate detection model. It finds text regions — rotated, curved, scattered across a messy layout, stamped diagonally across a form — and feeds each isolated region to a recognition model that reads it. That two-stage design is precisely why it handles photos, signage, and cluttered scans that reduce Tesseract to tears, and it also handles a huge range of languages, including dense non-Latin scripts, considerably better out of the box.

One currency note, because the API changed and stale tutorials will bite you: PaddleOCR 3.x renamed the primary call to predict() and restructured the result into a dictionary of parallel lists (rec_texts, rec_scores, rec_polys). The older ocr.ocr() call from the 2.x line, which returned a nested list you iterate line by line, still appears all over the web, so pin your version and read the docs for that version. Here’s the modern shape:

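from paddleocr import PaddleOCR

# use_textline_orientation is the 3.x spelling of the older use_angle_cls flag
ocr = PaddleOCR(use_textline_orientation=True, lang="en")
result = ocr.predict("phone_photo_of_sign.jpg")  # one result per input image

res = result[0]
# rec_texts and rec_scores are parallel lists, one entry per detected line
for text, score in zip(res["rec_texts"], res["rec_scores"]):
    print(f"{score:.2f}  {text}")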

0.99  CAR PARK
0.97  ENTRANCE
0.94  PERMIT HOLDERS ONLY
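If you have older scripts lying around, a small adapter saves you from caring which result shape you were handed. This is a sketch under assumptions: lines_from_result is a hypothetical helper of mine, the 3.x page is assumed to be dict-like with rec_texts and rec_scores as described above, and the 2.x page is assumed to be the nested list of [box, (text, score)] entries that ocr.ocr() returned.

```python
def lines_from_result(page):
    """Return [(text, score), ...] for one page of PaddleOCR output.

    Accepts either assumed shape:
      3.x: dict-like with parallel lists "rec_texts" and "rec_scores"
      2.x: list of [box, (text, score)] entries, or None for an empty page
    """
    if page is None:
        # Assumed 2.x behaviour for a page with no detected text.
        return []
    if hasattr(page, "keys"):
        # 3.x shape: zip the parallel lists back into per-line pairs.
        return list(zip(page["rec_texts"], page["rec_scores"]))
    # 2.x shape: drop the box, keep the (text, score) pair.
    return [(text, score) for _box, (text, score) in page]


# Stand-in data in both shapes; no OCR engine is needed to try the adapter.
new_style = {"rec_texts": ["CAR PARK", "ENTRANCE"], "rec_scores": [0.99, 0.97]}
old_style = [
    [[[0, 0], [90, 0], [90, 20], [0, 20]], ("CAR PARK", 0.99)],
    [[[0, 30], [90, 30], [90, 50], [0, 50]], ("ENTRANCE", 0.97)],
]
print(lines_from_result(new_style) == lines_from_result(old_style))
```

Everything downstream then works on plain (text, score) pairs, which makes a version bump a one-function change rather than a rewrite.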

You get bounding polygons and per-line confidence scores essentially for free, and that confidence number is more useful than it first looks: you can route anything below, say, 0.85 into a human review queue rather than trusting it blindly and discovering the error six months later when you actually need the document. The use_angle_cls flag enables a small classifier that corrects upside-down and rotated text lines, exactly the failure mode that quietly ruins a Tesseract run on a photo someone took sideways. In 3.x that flag is spelled use_textline_orientation, and the old name survives only as a deprecated alias, so prefer the new one in fresh code.
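The review-queue idea is only a few lines. A minimal sketch, assuming you already have the recognised texts and their scores as two parallel lists; split_by_confidence and REVIEW_THRESHOLD are illustrative names, and 0.85 is just the example figure from above, not a tuned value.

```python
REVIEW_THRESHOLD = 0.85  # example cut-off; tune it against your own documents


def split_by_confidence(texts, scores, threshold=REVIEW_THRESHOLD):
    """Split OCR lines into (accepted, needs_review) by confidence score."""
    accepted, needs_review = [], []
    for text, score in zip(texts, scores):
        if score >= threshold:
            accepted.append((text, score))
        else:
            needs_review.append((text, score))
    return accepted, needs_review


texts = ["CAR PARK", "ENTRANCE", "PERM1T H0LDERS"]
scores = [0.99, 0.97, 0.61]
accepted, needs_review = split_by_confidence(texts, scores)
print(len(accepted), "accepted,", len(needs_review), "for review")
```

In a batch job I would write the needs_review items to a file alongside the source image path, so the human doing the checking can see what the engine was looking at.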

The cost, and there’s always a cost, is weight. PaddleOCR pulls in PaddlePaddle and a stack of models; the first run downloads a few hundred megabytes of weights, and a cold start is noticeably slow. On CPU it’s perfectly usable for overnight batch jobs but distinctly slower per page than Tesseract. With a GPU it flies — and this is genuinely one of those cases where a modest GPU changes the character of the tool rather than just speeding it up. Which is also a warning: it’s tempting to buy hardware for the heavier engine, and that’s worth pausing on. The homelab upgrade trap — spending money to solve a problem “good enough” already solved — is real, and if 90% of your documents are clean PDFs, a GPU for the other 10% is exactly the kind of purchase you talk yourself into and then underuse.

Preprocessing earns its keep on both

Whichever engine you land on, the single highest-leverage thing you can do is fix the image before OCR ever sees it, and it’s the step people skip because it feels like cheating. It isn’t — every serious pipeline does it. Deskewing a scan that’s a couple of degrees off, converting to greyscale, binarising with Otsu, and cropping to the region of interest routinely turn a 70%-accurate mess into a near-perfect read, and none of it requires a fancier model. Tesseract benefits the most, because it has no detection stage to compensate, but even PaddleOCR does measurably better on a clean, high-contrast input than a murky one.

The one preprocessing mistake to avoid is over-processing. Aggressive thresholding on faint or coloured text can erase strokes entirely — turn a thin grey character white and it’s gone, and no recogniser resurrects it. So preview the binarised image before you trust it in a batch; I’ve watched a pipeline silently drop every faint receipt for a week because the threshold was tuned for dark ink and quietly deleted the light. Adaptive thresholding (cv2.adaptiveThreshold) handles uneven lighting better than a single global cut when your inputs vary, which for a filing cabinet they always do.
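Both halves of that advice fit in a short sketch: binarise adaptively, then check that some ink actually survived before the page goes into a batch. The helper names, the block size of 31, the offset of 15, and the 0.5% ink floor are all assumptions of mine for illustration; cv2.adaptiveThreshold itself is the real OpenCV call and expects an 8-bit greyscale image.

```python
def binarise_adaptive(gray, block_size=31, offset=15):
    """Threshold each pixel against its local neighbourhood, not one global cut.

    block_size must be odd; both defaults are illustrative starting points.
    """
    import cv2  # imported here so the checks below work without OpenCV

    return cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY,
        block_size, offset,
    )


def ink_ratio(binary):
    """Fraction of pixels that are black (0) in a binarised image.

    Works on nested lists or NumPy arrays; on arrays,
    (binary == 0).mean() is the faster equivalent.
    """
    total = dark = 0
    for row in binary:
        for px in row:
            total += 1
            if px == 0:
                dark += 1
    return dark / total if total else 0.0


def looks_erased(binary, min_ink=0.005):
    """Flag pages where thresholding left almost no ink behind."""
    return ink_ratio(binary) < min_ink


# Toy 10x10 "page": all white, then with five dark pixels added.
page = [[255] * 10 for _ in range(10)]
print(looks_erased(page))   # blank after thresholding, so flag it
for x in range(5):
    page[4][x] = 0
print(looks_erased(page))   # some ink survived
```

A flagged page is not necessarily broken (blank sides exist), but a sudden run of them in a batch is the tell that the threshold is eating faint text.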

Troubleshooting the pipeline

The failures cluster predictably, so here’s where to look:

Tesseract returns garbage from a decent-looking image. Wrong --psm first, poor binarisation second. Try --psm 3 and --psm 6, and run the OpenCV threshold step before deciding the engine is broken.
Tesseract mangles a specific language. You didn’t install the language data. apt install tesseract-ocr-deu (or whichever), and pass -l deu. The default English pack does not silently cope with accented or non-Latin text.
PaddleOCR crashes or hangs on first run. It’s fetching model weights. That first download needs network access and disk space; behind a restrictive firewall it’ll appear to freeze. Pre-download the models, or run it once somewhere with open egress.
PaddleOCR misses obvious text. The detection stage has a resolution floor. Very small or very low-contrast text can fall below it — upscale the image or raise the input resolution before blaming the recogniser.
Both engines disagree with each other on the same page. That’s your signal for the review queue, not a bug. Confidence-gate it.
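For that last case, a rough agreement score is enough to gate on. This sketch uses difflib from the standard library; the function names and the 0.9 cut-off are assumptions, and the normalisation (collapsing whitespace, ignoring case) is a choice you may want to loosen or tighten.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Collapse whitespace and ignore case so layout noise doesn't count."""
    return " ".join(text.split()).casefold()


def agreement(text_a, text_b):
    """Similarity of two OCR outputs, from 0.0 (unrelated) to 1.0 (identical)."""
    return SequenceMatcher(None, normalise(text_a), normalise(text_b)).ratio()


def needs_review(tesseract_text, paddle_text, min_agreement=0.9):
    """True when the two engines disagree enough to warrant a human look."""
    return agreement(tesseract_text, paddle_text) < min_agreement


print(needs_review("TOTAL   2.55", "total 2.55"))  # same text, different spacing
print(needs_review("TOTAL 2.55", "T0TAL 2.S5"))    # classic digit/letter mix-ups
```

Running both engines on every page doubles the cost, so this is most useful as a spot check on a sample, or on pages where the primary engine's confidence was already borderline.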

There’s a nice conceptual parallel worth noting for anyone who likes understanding tools rather than incanting them: OCR is a pipeline of stages, each transforming the data for the next, in much the same way that what actually happens under git commit is a chain of discrete, inspectable steps rather than one atomic magic operation. Knowing which stage failed — detection, angle correction, or recognition — is what lets you fix an OCR pipeline instead of shrugging at it, exactly as knowing the object model is what lets you fix a Git repo instead of deleting it and re-cloning.

So which one?

Here’s the honest decision matrix I actually use, and “actually use” is load-bearing — this isn’t theory, it’s the routing logic running against my filing cabinet:

Clean digital PDFs and screenshots, high volume, speed matters → Tesseract. It’s lighter, faster, and the input is already exactly what it wants.
Photos, scans, skewed or rotated text, mixed layouts → PaddleOCR. The detection stage earns its keep the moment the input stops being pristine.
You need bounding boxes or confidence scores natively → PaddleOCR gives them directly; with Tesseract you’d assemble them from image_to_data.
Non-Latin or dense scripts → PaddleOCR, usually by a wide margin.
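For a sense of what "assemble them from image_to_data" means in practice, here is a minimal sketch. It assumes pytesseract's dictionary output (pytesseract.Output.DICT), where each key holds a parallel list with one entry per detected element and a confidence of -1 marks rows that are not words. words_to_lines and tesseract_lines are hypothetical helper names.

```python
from collections import defaultdict


def words_to_lines(data):
    """Group image_to_data word rows into [(line_text, mean_confidence), ...].

    Confidence is rescaled from Tesseract's 0-100 range to 0-1.
    """
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        conf = float(data["conf"][i])
        if conf < 0 or not word.strip():
            continue  # block, paragraph, and line rows carry no word
        key = (data["page_num"][i], data["block_num"][i],
               data["par_num"][i], data["line_num"][i])
        lines[key].append((word, conf))

    result = []
    for key in sorted(lines):
        words = lines[key]
        text = " ".join(w for w, _ in words)
        mean_conf = sum(c for _, c in words) / len(words) / 100
        result.append((text, mean_conf))
    return result


def tesseract_lines(image, psm=6):
    """Run Tesseract and return per-line text with mean confidence."""
    import pytesseract  # imported lazily; needs the tesseract binary installed

    data = pytesseract.image_to_data(
        image, config=f"--psm {psm}", output_type=pytesseract.Output.DICT
    )
    return words_to_lines(data)
```

With that in place, both engines hand back per-line text and a 0-1 score, which is what makes a shared confidence gate possible.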

For my own cabinet I run a hybrid rather than picking a side. A quick heuristic checks whether a page is a clean PDF render or a photo — a handful of tests on resolution, aspect ratio, and pixel-value distribution — and routes accordingly. Clean pages take the fast Tesseract path; phone snaps and dodgy scans go through PaddleOCR. Best of both, and the routing logic itself is about twenty lines. That hybrid is the real recommendation hiding inside this whole post.
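To give a flavour of what such a router can look like, here is a sketch. It is not my actual routing code, and every threshold in it is an illustrative assumption: the minimum size, the page-ratio tolerance, and the idea that a clean render has nearly all its pixels at the black or white extremes. The inputs are the image width and height in pixels and a 256-bin greyscale histogram.

```python
# Long side / short side for A4 and US Letter; assumed target page shapes.
PAGE_RATIOS = (1.414, 1.294)


def looks_clean(width, height, hist, *, min_long_side=1000,
                ratio_tol=0.03, min_extreme=0.90):
    """Guess whether an image is a clean page render rather than a photo."""
    long_side, short_side = max(width, height), min(width, height)

    # Resolution: tiny images are unlikely to be full-page renders.
    if short_side == 0 or long_side < min_long_side:
        return False

    # Aspect ratio: renders match a paper size, phone photos usually don't.
    ratio = long_side / short_side
    if not any(abs(ratio - r) <= ratio_tol for r in PAGE_RATIOS):
        return False

    # Pixel distribution: renders are near-black ink on near-white paper,
    # while photos spread across the mid-greys.
    total = sum(hist)
    if total == 0:
        return False
    extreme = sum(hist[:32]) + sum(hist[224:])
    return extreme / total >= min_extreme


def choose_engine(width, height, hist):
    """Return the name of the engine a page should be routed to."""
    return "tesseract" if looks_clean(width, height, hist) else "paddleocr"


# A4 at 300 DPI, almost entirely white with a little black ink.
render_hist = [0] * 256
render_hist[0], render_hist[255] = 5, 95
print(choose_engine(2480, 3508, render_hist))

# 4:3 phone photo with pixels piled up in the mid-greys.
photo_hist = [0] * 256
photo_hist[128] = 100
print(choose_engine(3024, 4032, photo_hist))
```

When the heuristic is unsure, erring towards PaddleOCR costs time rather than accuracy, which is why every failed test above falls through to it.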

The verdict

Don’t pick a winner — pick per input. If your documents are clean and digital, Tesseract is faster, lighter, and entirely sufficient, and reaching for PaddleOCR there is over-engineering you’ll pay for in complexity and cold-start latency. But the instant you’re dealing with the messy physical world — photos, curved receipts, angled signage, sideways phone shots — PaddleOCR’s detection stage is the difference between a working pipeline and a frustrating one. Both run entirely on your own hardware, cost precisely nothing per page, and keep your documents off someone else’s servers, which for anything remotely private is the point that outweighs all the others. For anyone with a backlog of paper and a distaste for per-page cloud billing, the only real mistake is assuming a single tool fits every page. It doesn’t. Route.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #ocr #self-hosting #python
