Wild OCR Experience

deepheat · October 29, 2025, 6:27pm

DEVONthink 4.1, Mac mini M1 running macOS Tahoe 26.0.1

Having added around 600 words to the OCR custom dictionary, I reran the DT OCR on 34 PDF documents (selected all 34 in the group, right-clicked for OCR, and let it run).

Once OCR processing had completed for the whole group, I reinspected the PDFs and was very pleased with the updated results from the OCR custom dictionary. However, one PDF in particular had an absolute corker of an OCR experience.

This is a section of a page from the PDF:

It’s not the best for OCR, but I would have thought that it wouldn’t have caused too much of an issue for DT4/ABBYY.

This is what the text layer contained after the OCR process:
RGO a-y gyp!vmo ecfdyvBqh qw [m!’hmcBfT 2nh’yqc JmdBm!c MO nym-vT
mmryf B-y wqooqp!c’ qffyhamB!qcf hy’mhp!c’ B-y f!vrcyff !c B-y pydpBf pnh!c’
B-y vymh ncpyh hydqh

(The remainder of the text layer from this particular PDF is very similar)

When I reran the OCR process on just that one file, making no changes to the file or any settings and not even restarting DT, I eventually received the following text layer:
The Medical Inspector of Emigrants, Surgeon-Captain A. Leahy,
makes the following observations regarding the sickness in the depdts during
the year under report

This PDF is not much different from the others in terms of size, page count or character/word count.

I’ve checked the remaining documents and they are fine.
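For anyone wanting to spot-check a larger batch for this kind of scrambled text layer, here is a rough sketch of a heuristic. It assumes the text layer has already been exported to a plain string by some other means; the function name `looks_garbled` and the 0.6 threshold are my own illustrative choices, not anything provided by DEVONthink or ABBYY. It simply counts how many tokens look like ordinary words (letters only, optionally hyphenated, with at least one vowel) and flags the text when too few do.

```python
import re

VOWELS = set("aeiouAEIOU")
# Punctuation stripped from the ends of each token before checking it.
EDGE_PUNCTUATION = ".,;:()[]\"'\u2019"
# Letters only, with optional internal hyphens (e.g. "Surgeon-Captain").
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def looks_garbled(text, threshold=0.6):
    """Return True when fewer than `threshold` of the tokens look like words.

    Hypothetical helper for illustration; the threshold is a guess and
    would need tuning against real documents.
    """
    tokens = text.split()
    if not tokens:
        return False  # Nothing to judge, so do not flag it.
    plausible = 0
    for token in tokens:
        core = token.strip(EDGE_PUNCTUATION)
        if core and WORD_PATTERN.fullmatch(core) and VOWELS & set(core):
            plausible += 1
    return plausible / len(tokens) < threshold


if __name__ == "__main__":
    first_run = "B-y vymh ncpyh hydqh"
    second_run = "the year under report"
    print(looks_garbled(first_run), looks_garbled(second_run))
```

This is only a coarse filter: text in other languages, tables of numbers, or heavily abbreviated pages could be flagged even when the OCR is fine, so anything it reports still needs a look by eye.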

What happened during the first OCR run on this document that was apparently fixed by the second OCR run?

Cheers!

dp

cgrunenberg · October 29, 2025, 6:28pm

Did you edit, e.g. annotate, the document after the first OCR run?

deepheat · October 29, 2025, 7:17pm

I had run 3-4 searches for known strings against the document (nothing was found in each case). Other than that, nothing at all!

BLUEFROG · October 29, 2025, 9:23pm

Is the original without OCR in your database’s Trash?
- If so, drag it to your desktop and compress it. Then start a support ticket and send the ZIP for us to inspect. Also, if you zipped your custom OCR dictionary file and sent it along, that may be useful too.