I constantly listen and read along to PDF files in a useful little app called Voice Dream Reader, and about a year ago (August 26, 2023) I started noticing that it would skip all over the place whenever I imported files I had run OCR on in DEVONthink. This was the first time I had seen it, and it came after I updated to DEVONthink 3.9.2, which had been released the month before. The release notes stated that:
the OCR engine has been updated to improve its reliability on Apple Silicon Macs, and when archiving email, DEVONthink manages its resources even better. These modifications, along with several bug fixes, continue to improve performance and reliability.
While DEVONthink seems to do well recognizing its own OCR within DEVONthink, many other apps no longer do. Prior to the 3.9.2 update, I had never noticed an OCR layout issue, but since then, layout detection has been severely broken. I wasn't 100% certain it was DEVONthink until today.
To test if this was unique to DEVONthink, I decided to take the most recent DEVONthink manual, export a page at 600 dpi as a TIFF file, and then convert it to PDF and OCR it using:
- DEVONthink
- ABBYY FineReader
- Adobe Acrobat Pro
I then opened each of these in Acrobat Reader Pro, PDF Reader Pro, Firefox, Chrome, and Edge. In all of these apps, text selection in the file OCR'd by DEVONthink jumps all over the place (as demonstrated in the attached animated gif), whereas selection is perfectly fine in the files I ran OCR on with ABBYY and Acrobat Pro. Given that DEVONthink claims to use ABBYY for OCR, this is surprising to me. After repeating the same process with several files, I found that DEVONthink is consistently broken any time it has to process any kind of layout besides a single column, whereas ABBYY and Acrobat Pro recognize layouts during the OCR process with almost perfect accuracy.
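For anyone who wants to put a number on the jumping rather than eyeball the selection, here is a minimal sketch. It assumes the text layer of each PDF has already been dumped to a plain string by whatever extraction tool you prefer (that step is not shown), and it treats one file, for example the ABBYY one, as the reference. The function name and the sample strings are hypothetical and only illustrate the idea.

```python
from difflib import SequenceMatcher


def reading_order_similarity(reference_text: str, candidate_text: str) -> float:
    """Score from 0 to 1 for how closely the word order of candidate_text
    follows the word order of reference_text (1.0 means identical order)."""
    ref_words = reference_text.split()
    cand_words = candidate_text.split()
    # autojunk=False keeps frequent words from being ignored on long pages.
    matcher = SequenceMatcher(None, ref_words, cand_words, autojunk=False)
    return matcher.ratio()


# Hypothetical two-column page: the left column (first four words)
# should be read before the right column (last four words).
reference = "alpha beta gamma delta epsilon zeta eta theta"
# Same words, but the two columns are interleaved line by line.
interleaved = "alpha epsilon beta zeta gamma eta delta theta"

same_score = reading_order_similarity(reference, reference)
broken_score = reading_order_similarity(reference, interleaved)
print(same_score, broken_score)
```

A score near 1.0 means the two text layers are stored in essentially the same order. A clearly lower score for one file than for the others would be consistent with what the selection behavior shows, though OCR misreads will also pull the score down a little.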
Worth noting is that this issue preceded macOS Sonoma. I know because I still have the PDF I OCR'd, last modified "August 26, 2023", in which I first noticed this issue. Sonoma was not released until the following month, September 2023.
Has anyone else experienced this same layout recognition problem when opening files DEVONthink has OCR'd in other apps?
I'm on an M1 Max MacBook Pro, in case it is an Apple Silicon thing, since that is what DEVONthink 3.9.2 was supposed to improve. I've attached the three files, named according to the app that ran OCR on them. I am, of course, running the latest version of DEVONthink, which at the time of this post is 3.9.7.
Appendix: Probably related is the fact that other PDF readers often render multi-column Tesseract OCR'd files perfectly, whereas DEVONthink skips around in a similar fashion to the recording below. Though I observe it frequently, I've not figured out a way to isolate the issue yet, so that's a topic for another time. I simply mention it in case it is related to the issue addressed in this post, since I'm fairly confident it also emerged at the same time and impairs search accuracy within DEVONthink.
Devonthink Manual AdobeOCR.pdf (112.8 KB)
Devonthink Manual DevonthinkOCR.pdf (1.1 MB)
Devonthink Manual AbbyyOCR.pdf (443.5 KB)