OCR Pipelines: Tesseract, PaddleOCR, and When to Use Which
Turning pixels into text without paying a cloud API per page

I have a filing cabinet’s worth of scanned documents — receipts, manuals, the occasional important letter from an institution that still believes in paper — and a stubborn refusal to feed them through a cloud OCR service that charges per page and keeps a copy. So I run optical character recognition locally. The good news is that self-hosted OCR has quietly become excellent. The bad news is there are two serious contenders, they’re good at different things, and the internet is full of people insisting their favourite is universally best. It isn’t. Let me save you the afternoon I lost finding out.
The two tools are Tesseract and PaddleOCR. Tesseract is the venerable engine that began life at Hewlett-Packard in the 1980s, was later open-sourced and developed for years under Google’s sponsorship, and has been rewritten more than once. PaddleOCR is a newer deep-learning toolkit from Baidu’s PaddlePaddle ecosystem. Tesseract is a single-job tool: give it a clean image of text, get text back. PaddleOCR is a full pipeline: it detects where the text is before reading it, which turns out to matter enormously.
1 Tesseract: fast, simple, and fussy about input
Tesseract is an apt install away and the API is trivial. For clean, well-aligned documents (a PDF page exported at 300 DPI, a screenshot, anything with horizontal black text on white) it’s brilliant and fast.
$ sudo apt install tesseract-ocr tesseract-ocr-eng
$ tesseract receipt.png stdout --psm 6
SAINSBURY'S
Wholemeal bread 1.10
Semi-skimmed milk 1.45
TOTAL 2.55
That --psm flag is page segmentation mode, and it’s the difference between a happy Tesseract and gibberish. --psm 6 says “assume a single uniform block of text.” --psm 3 is fully automatic. For a single line — a sign, a label — --psm 7. Getting this wrong is the single most common reason people declare Tesseract useless. The other reason is that it expects clean input. Feed it a skewed phone photo of a curved receipt under bad lighting and it falls apart, because it has no real text-detection stage — it assumes you’ve already isolated the text.
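One way to stop the mode being forgotten is to name the layouts and look the mode up. What follows is only a sketch: the layout labels ("auto", "block", "line") and both helper names are invented for illustration, while the --psm values are the ones described above.

```python
import subprocess

# Page segmentation modes discussed above; the layout names are illustrative.
PSM_BY_LAYOUT = {
    "auto": 3,   # fully automatic page segmentation
    "block": 6,  # a single uniform block of text
    "line": 7,   # a single line, such as a sign or label
}


def tesseract_command(image_path, layout="auto"):
    """Build the tesseract argv for a given (hypothetical) layout name."""
    if layout not in PSM_BY_LAYOUT:
        raise ValueError(f"unknown layout: {layout!r}")
    return ["tesseract", image_path, "stdout", "--psm", str(PSM_BY_LAYOUT[layout])]


def run_tesseract(image_path, layout="auto"):
    """Run the tesseract CLI and return whatever it writes to stdout."""
    completed = subprocess.run(
        tesseract_command(image_path, layout),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return completed.stdout
```

Calling run_tesseract("receipt.png", "block") builds the same command line as the receipt example above.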
From Python it’s a one-liner via pytesseract, and you’ll typically run a deskew and threshold step in OpenCV first. The snippet here shows the threshold step only:
import cv2
import pytesseract

# Load the scan and convert it to grayscale
img = cv2.imread("receipt.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Otsu's method picks the binarisation threshold automatically
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
text = pytesseract.image_to_string(thresh, config="--psm 6")
2 PaddleOCR: heavier, but it copes with the real world
PaddleOCR’s headline feature is a separate detection model that finds text regions — rotated, curved, scattered across a messy layout — and feeds each region to a recognition model. That two-stage design is why it handles photos, signage, and cluttered scans that make Tesseract sob. It also handles a huge range of languages, including dense scripts, considerably better.
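from paddleocr import PaddleOCR

# use_angle_cls enables the rotation classifier described below
ocr = PaddleOCR(use_angle_cls=True, lang="en")
result = ocr.ocr("phone_photo_of_sign.jpg")
# result[0] holds the lines for the first (only) image; this layout
# matches the PaddleOCR 2.x API and may differ in newer releases
for line in result[0]:
    box, (text, confidence) = line
    print(f"{confidence:.2f} {text}")
0.99 CAR PARK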
0.97 ENTRANCE
0.94 PERMIT HOLDERS ONLY
You get bounding boxes and per-line confidence scores for free, which is genuinely useful — you can drop anything below, say, 0.85 confidence into a review queue rather than trusting it blindly. The use_angle_cls flag enables a small classifier that fixes upside-down or rotated text, which is exactly the kind of thing that ruins a Tesseract run.
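The review-queue idea can be sketched in a few lines. This assumes the [box, (text, confidence)] line shape used in the loop above. The function name and the sample data are made up for illustration, and 0.85 is just the example threshold from the text.

```python
def split_by_confidence(lines, threshold=0.85):
    """Separate OCR lines into trusted text and a manual-review queue.

    `lines` is assumed to have the [box, (text, confidence)] shape used in
    the loop above.
    """
    accepted, review = [], []
    for box, (text, confidence) in lines:
        target = accepted if confidence >= threshold else review
        target.append((text, confidence, box))
    return accepted, review


# Hand-written sample data standing in for real OCR output.
sample = [
    ([[0, 0], [90, 0], [90, 20], [0, 20]], ("CAR PARK", 0.99)),
    ([[0, 30], [90, 30], [90, 50], [0, 50]], ("PERM1T H0LDERS", 0.62)),
]
accepted, review = split_by_confidence(sample)
print([text for text, _, _ in accepted])  # ['CAR PARK']
print([text for text, _, _ in review])    # ['PERM1T H0LDERS']
```

Keeping the box alongside the text means a reviewer can be shown the exact crop that produced the doubtful line.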
The cost is weight. PaddleOCR pulls in PaddlePaddle and a stack of models, the first run downloads a few hundred megabytes, and a cold start is slow. On CPU it’s perfectly usable for batch jobs but noticeably slower per page than Tesseract; with a GPU it flies.
3 So which one?
Here’s the honest decision matrix I actually use:
- Clean digital PDFs and screenshots, high volume, speed matters → Tesseract. It’s lighter, faster, and the input is already what it wants.
- Photos, scans, skewed or rotated text, mixed layouts → PaddleOCR. The detection stage earns its keep the moment input gets messy.
- You need bounding boxes or confidence scores → PaddleOCR gives them natively; with Tesseract you’d use image_to_data and do more work.
- Non-Latin or dense scripts → PaddleOCR, generally by a wide margin.
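For a sense of what "more work" means on the Tesseract side, here is a rough sketch built on pytesseract's image_to_data with dictionary output. The helper names are hypothetical. Note that Tesseract reports confidence per word on a 0 to 100 scale rather than 0 to 1, and uses -1 for rows that describe pages, blocks, or lines rather than words.

```python
def words_with_boxes(data, min_conf=0.0):
    """Extract (text, confidence, (left, top, width, height)) per word.

    `data` is the dict returned by
    pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT).
    """
    words = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # may arrive as str, int, or float
        if conf < 0 or not text.strip():
            continue  # skip structural rows and empty strings
        if conf >= min_conf:
            box = (data["left"][i], data["top"][i],
                   data["width"][i], data["height"][i])
            words.append((text, conf, box))
    return words


def ocr_words(image, min_conf=0.0):
    """Run Tesseract on an image and return the filtered word boxes."""
    import pytesseract  # imported here so the parsing helper has no dependencies

    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    return words_with_boxes(data, min_conf)
```

You still have to group words back into lines yourself (the dict carries block_num and line_num for that), which is the extra effort PaddleOCR saves you.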
For my filing cabinet I run a hybrid: a quick heuristic checks whether a page is a clean PDF render or a photo, and routes accordingly. Clean pages take the fast Tesseract path; phone snaps and dodgy scans go through PaddleOCR. Best of both, and the routing logic is twenty lines.
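To give a flavour of what such a router can look like, here is one possible heuristic. It is not the exact check I run: it simply assumes a clean render is almost entirely near-black or near-white pixels, while a photo has plenty of mid-greys. The function names, the tolerance, and the 0.9 cut-off are all illustrative guesses you would need to tune on your own pages.

```python
def clean_fraction(pixels, tolerance=24):
    """Fraction of grayscale pixels (0-255) that are near pure black or white."""
    total = near_extreme = 0
    for value in pixels:
        total += 1
        if value <= tolerance or value >= 255 - tolerance:
            near_extreme += 1
    return near_extreme / total if total else 0.0


def choose_engine(pixels, min_clean=0.9):
    """Return "tesseract" for clean-looking pages, "paddleocr" otherwise."""
    return "tesseract" if clean_fraction(pixels) >= min_clean else "paddleocr"


# Synthetic examples: a mostly white page with some black text pixels,
# and a photo-like page dominated by mid-grey values.
rendered_page = [255] * 950 + [0] * 50
phone_photo = [120, 135, 150, 165, 180] * 200
print(choose_engine(rendered_page))  # tesseract
print(choose_engine(phone_photo))    # paddleocr
```

With OpenCV you could pass a flattened grayscale array (ideally downsampled first, since this loop is plain Python) instead of the synthetic lists.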
4 The verdict
Don’t pick a winner — pick per input. If your documents are clean and digital, Tesseract is faster, lighter, and entirely sufficient; reaching for PaddleOCR there is over-engineering. But the instant you’re dealing with the messy physical world — photos, curved receipts, signage, rotation — PaddleOCR’s detection stage is the difference between a working pipeline and a frustrating one. Both run entirely on your own hardware, cost nothing per page, and keep your documents off someone else’s servers. For anyone with a backlog of paper and a distaste for per-page cloud billing, the only real mistake is assuming one tool fits every page.



