OCR-language information?

WhyO_74 · November 14, 2021, 6:42pm

Hello there!
I bought my first document scanner in 2013 (Canon P-215), and have also used a flatbed scanner with various scanning programs on my Mac. Since 2015, Devonthink has been my digital library and archive, and my Private database alone has 15,000 documents. I have additional databases for scanned magazines, work-related material, emails…

Back then, received and downloaded PDFs were often not OCR’ed at all. And because I’m Norwegian, it matters that OCR in Norwegian has only been supported by the ABBYY engine: the P-215 (whose software is not ABBYY-based) has saved my scans with an English text layer.

I have also scanned documents from my studies in Germany, as well as German and Swedish magazines, so you can tell I have a multi-language chaos in my archive.

So, is there a way to find the OCR language in a PDF, and can this be used in a Smart Rule to re-OCR the relevant files with the correct language? Devonthink has usually done this perfectly for a long time when converting documents.

Or have my PDFs perhaps been rescanned and corrected during the update from DTPO2 to DT3 Pro, when the OCR engine was updated? I know Devonthink uses magic all the time.

jerwin · November 14, 2021, 10:37pm

Select the files, then choose Tools > Create Metadata Overview.

There should be a Language field. Then re-OCR the ones that aren’t Norwegian.

BLUEFROG · November 14, 2021, 10:37pm

You can’t OCR to a specific language via smart rules or scripting.
If you change the primary language in Preferences > OCR to Norwegian you should be able to re-OCR the files.

Can you ZIP and post an example Norwegian document?

WhyO_74 · November 15, 2021, 8:38am

My preset primary language has always been Norwegian in Devonthink, with secondary languages chosen to match the languages of my articles (Swedish, Danish, German, English). DT does a perfect job of detecting the right one, as far as I have seen.

My problem is that many files have been imported into DT without OCRing them and deleting the originals, and I know from copying text at random that the files are “bad Norwegian”, especially those scanned with the Canon.

But @jerwin’s suggestion seems to help me further down the road.

WhyO_74 · November 15, 2021, 8:41am

This helps me a lot, thank you!

WhyO_74 · November 15, 2021, 9:24am

Some comments/fun facts:

Sorting by language did not help me much, because the text layer in those Canon PDFs is gibberish Norwegian (the OCR engine’s language library was not Norwegian, I suppose).
I tested this on the marked file (OCR > to searchable PDF), and the new (renamed) PDF created by DT was both much smaller AND perfect.
I knew from opening PDFs in Preview that the Info pane showed both Content Producer (Canon) and PDF Producer (FineReader OCR Pro, from earlier rescans before DT).
I looked and found a Creator metadata field! And now I can easily sort by the older “OCR versions”.
Metadata info is visible only when the group is opened, so I thought this metadata could be used in a Smart Group (“English PDFs”). But I realized that the important field is Creator, so I will create a Smart Group “Bad OCR” (Creator matches Canon) to re-OCR these files. I just have to remember to activate “Move original to the trash” in the OCR settings first (normally deactivated).
I have also noticed a big size difference between OCRs from older DTPO 2.x and the new ABBYY FineReader Engine 12, so I probably just have to re-OCR every PDF that predates the new engine…
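
For anyone who wants to check files outside Devonthink before building the Smart Group, the Creator-based triage above could be sketched roughly as follows. This is only a minimal sketch, not how Devonthink reads metadata: the function names `read_info_fields` and `needs_reocr` and the `bad_markers` parameter are made up for illustration, and the regex only catches Creator/Producer entries stored as plain literal strings in an unencrypted, uncompressed Info dictionary. PDFs that keep their metadata in compressed object streams, hex strings, or XMP only will simply return nothing here.

```python
import re

# Assumption: the Info dictionary is stored uncompressed with literal
# strings, e.g. "/Creator (Canon P-215)". Nested parentheses and
# hex/UTF-16 encoded strings are not handled by this simple pattern.
_INFO_ENTRY = re.compile(rb"/(Creator|Producer)\s*\(((?:\\.|[^\\()])*)\)")


def read_info_fields(pdf_bytes: bytes) -> dict:
    """Return the Creator/Producer strings found in the raw PDF bytes."""
    fields = {}
    for key, value in _INFO_ENTRY.findall(pdf_bytes):
        # If a key occurs more than once, the last occurrence wins.
        fields[key.decode("ascii")] = value.decode("latin-1")
    return fields


def needs_reocr(fields: dict, bad_markers=("canon",)) -> bool:
    """Flag a file when its Creator field contains one of the markers."""
    creator = fields.get("Creator", "").lower()
    return any(marker in creator for marker in bad_markers)
```

To try it on a real file, read the bytes with `open(path, "rb").read()`, pass them to `read_info_fields`, and hand the result to `needs_reocr`. Matching on Creator rather than on the Language field mirrors the Smart Group idea: the language tag says nothing about OCR quality, while the Creator string identifies which software wrote the text layer.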

Hope someone else out there can use this information. Devonthink shows new strengths and solutions again.

jerwin · November 19, 2021, 5:33am

Out of curiosity, is Devonthink’s version of FineReader compiled for Apple Silicon? I recently used the OCR to read some screenshots, and it seemed faster than my FineReader PDF installation.

cgrunenberg · November 19, 2021, 8:41am

Not yet.

jerwin · November 19, 2021, 9:34am

Oh OK. I’ll chalk it up to the comparatively small size of the selection captures I was processing.

WhyO_74 · November 19, 2021, 9:52am

I have FineReader OCR Pro too, and that app is slooow. But it is very good if you need to adjust the color or contrast of a PDF, crop it, and so on.

jerwin · November 19, 2021, 10:10am

My favorite feature of FineReader PDF (version 15) is that it will read Fraktur and pre-revolutionary Cyrillic. The user interface is painfully slow on my M1.