I bought my first document scanner in 2013 (a Canon P-215), and have also used a flatbed scanner with various scanning programs on my Mac. Since 2015, Devonthink has been my digital library and archive, and my Private database alone has 15,000 documents. I also have additional databases for scanned magazines, work-related documents, emails…
In those days, the PDFs I received or downloaded were often not OCR’ed at all. Norwegian OCR has only been supported by the ABBYY engine, and the P-215 (whose software is not ABBYY-based) saved my scans with an English text layer.
I have also scanned documents from my studies in Germany, as well as German and Swedish magazines, so you can tell I have multi-language chaos in my archive.
So, is there a way to find the OCR language of a PDF, and can this be used in a Smart Rule to re-OCR the relevant files with the correct language? Devonthink has usually done this perfectly for a long time when converting documents.
Or have my PDFs perhaps been rescanned and corrected during the update from DTPO2 to DT3 Pro, when the OCR engine was updated? I know Devonthink uses magic all the time.
Select the files, then select Tools | Create Meta Data Overview.
There should be a Language field… Then re-OCR the ones that aren’t Norwegian.
You can’t OCR to a specific language via smart rules or scripting.
If you change the primary language in Preferences > OCR to Norwegian you should be able to re-OCR the files.
Can you ZIP and post an example Norwegian document?
My preset primary language has always been Norwegian in Devonthink, with secondary languages chosen to match the languages of my articles (Swedish, Danish, German, English). DT does a perfect job of detecting the right one, as far as I have seen.
My problem is that many files were imported into DT without OCR’ing or deleting the originals, and I know from copying random text that the files contain “bad Norwegian”, especially those scanned with the Canon.
But @jerwin seems to have helped me further down the road.
This helps me a lot, thank you!
Some comments/fun facts:
Sorting by language did not help me much, because those Canon PDFs contain gibberish Norwegian (the OCR engine’s language library was not Norwegian, I suppose).
I tested this on the marked file (OCR > to searchable PDF), and the new (renamed) PDF produced by DT was both much smaller AND perfect.
I knew from opening PDFs in Preview that the Info pane showed both Content Producer (Canon) and PDF-Producer (FineReader OCR Pro, from earlier rescans before DT).
I looked and found a Creator metadata field! And now I can easily sort by the older “OCR versions”.
Metadata info is visible only when the Group is opened, so I thought this metadata could be used in a Smart Group (“English pdfs”). But I now realize that the important metadata is Creator, so I will create a Smart Group “Bad OCR” (Creator matches Canon) to rescan these files. I just have to remember to activate “Move original to the trash” in the OCR settings first (it is normally deactivated).
I have also noticed a big size difference between OCRs from older DTPO 2.x and the new ABBYY FineReader Engine 12, so I probably just have to re-OCR every pdf older than those…
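For anyone who wants to do the same Creator check outside Devonthink, here is a minimal Python sketch. It only looks for plain `/Creator (...)` and `/Producer (...)` entries in the raw PDF bytes, which is an assumption about how the file stores its metadata: it will miss compressed, hex-encoded, or XMP-only metadata, as well as values that contain unescaped parentheses. The function names, the "Canon" marker, and the sample values are hypothetical and only for illustration.

```python
import re
from pathlib import Path

# Matches literal-string entries such as "/Producer (FineReader OCR Pro)".
# Assumption: the Info dictionary is uncompressed and uses simple literal
# strings; hex strings, UTF-16 strings and nested parentheses are not handled.
_FIELD = re.compile(rb"/(Creator|Producer)\s*\(((?:\\.|[^\\()])*)\)")


def pdf_info_fields(data: bytes) -> dict[str, str]:
    """Return the Creator/Producer values found in raw PDF bytes."""
    fields = {}
    for key, value in _FIELD.findall(data):
        # Undo the escapes for parentheses and backslashes only.
        text = re.sub(rb"\\([()\\])", rb"\1", value)
        # A later match overwrites an earlier one, so the last entry wins.
        fields[key.decode("ascii")] = text.decode("latin-1")
    return fields


def needs_reocr(fields: dict[str, str], marker: str = "Canon") -> bool:
    """True when the Creator field contains the marker (case-insensitive)."""
    return marker.lower() in fields.get("Creator", "").lower()


def find_bad_ocr(folder: str, marker: str = "Canon"):
    """Yield the PDFs below `folder` whose Creator contains the marker."""
    for path in Path(folder).rglob("*.pdf"):
        if needs_reocr(pdf_info_fields(path.read_bytes()), marker):
            yield path
```

Calling `list(find_bad_ocr("/path/to/exported/pdfs"))` would give the candidates for a re-OCR. It only reads the files, so the re-OCR itself still has to be done in Devonthink.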
Hope someone else out there can use this information. Devonthink shows new strengths and solutions again.
Out of curiosity, is Devonthink’s version of FineReader compiled for Apple Silicon? I recently used the OCR to read some screenshots, and it seemed faster than my FineReader PDF installation.
Oh OK. I’ll chalk it up to the comparatively small size of the selection captures I was processing.
I have FineReader OCR Pro too, and that app is slooow. But it is very good if you need to adjust color, contrast, cropping, and so on in a PDF.
My favorite feature of FineReader PDF (version 15) is that it will read Fraktur and pre-revolutionary Cyrillic. The user interface is painfully slow on my M1.