ABBYY OCR Layout Detection is Broken in DEVONthink

Mindstormer · September 18, 2024, 10:34pm

I constantly listen and read along to PDF files in a useful little app called Voice Dream Reader, and about a year ago (August 26, 2023) I started noticing that it would skip all over the place when I imported any files I had run OCR on in DEVONthink. This first happened after updating to DEVONthink 3.9.2, which had been released the month before. The release notes stated that:

the OCR engine has been updated to improve its reliability on Apple Silicon Macs, and when archiving email, DEVONthink manages its resources even better. These modifications, along with several bug fixes, continue to improve performance and reliability.

While DEVONthink seems to do well recognizing its own OCR within DEVONthink, many other apps no longer do. Prior to the 3.9.2 update, I had never noticed an OCR layout issue, but since then, layout detection has been severely broken. I wasn’t 100% certain it was DEVONthink until today.

To test if this was unique to DEVONthink, I decided to take the most recent DEVONthink manual, export a page at 600 dpi as a TIFF file, and then convert it to PDF and OCR it using:

DEVONthink
ABBYY FineReader
Adobe Acrobat Pro

I then opened each of these in Acrobat Pro, PDF Reader Pro, Firefox, Chrome, and Edge. In all of these apps, the selection of text OCR’d in DEVONthink jumps all over the place (as demonstrated in the attached animated gif), whereas selection is perfectly fine for the files I ran OCR on with ABBYY and Acrobat Pro. Given that DEVONthink claims to use ABBYY for OCR, this is surprising to me. After repeating the same process with several files, DEVONthink’s output is consistently broken any time it has to process any kind of layout besides a single column, whereas ABBYY and Acrobat Pro recognize layouts during the OCR process with almost perfect accuracy.

Worth noting is that this issue preceded macOS Sonoma. I know, because I just found and still have the PDF I OCR’d, last modified “August 26, 2023”, where I first noticed this issue. Sonoma was not released until the next month, in September of 2023.

Has anyone else experienced this same layout recognition problem when opening files DEVONthink has OCR’d in other apps?

I’m on an M1 Max MacBook Pro, in case it is an Apple Silicon thing, since that is what DEVONthink 3.9.2 was supposed to improve. I’ve attached the three files, named according to the app that ran OCR on them. I am, of course, running the latest version of DEVONthink, which at the time of this post is 3.9.7.

Appendix: Probably related is the fact that other PDF readers often render multi-column Tesseract OCR’d files perfectly, whereas DEVONthink skips around in a similar fashion to the recording below. Though I observe it frequently, I’ve not figured out a way to isolate the issue yet, so that’s a topic for another time. I simply mention it in case it is related to the issue addressed in this post, since I’m fairly confident it also emerged at the same time and impairs search accuracy within DEVONthink.

DevonthinkOCR Text Selection in Acrobat Pro
Devonthink Manual AdobeOCR.pdf (112.8 KB)
Devonthink Manual DevonthinkOCR.pdf (1.1 MB)
Devonthink Manual AbbyyOCR.pdf (443.5 KB)

celsee · September 19, 2024, 2:36am

I’ve had this issue. Just assumed it was the PDF, not the OCR process.

aedwards · September 19, 2024, 8:58am

Looking at your example, for the majority of the page its layout has been correctly identified. There are however 2 layout errors:

Selecting text in the first column just misses the ‘11’ at the end of “Building your Database”.

The first part of column 2 is identified correctly

After this, the title and the number 11 missing from column 1 have been incorrectly located; this is causing the incorrect highlighting in your example. The rest of the column is correct.

I will raise this as an issue with ABBYY support.

Mindstormer · September 19, 2024, 10:38am

As the 11 was selected in my video, this response confuses me. The layout selection varies inconsistently depending on which other app one tries to select text in.

How did you identify that the 11 is causing the layout selection glitching in all the other apps named?

Though I did not check for this file, this bug frequently causes text-to-speech to skip around too, essentially misinstructing text-to-speech engines on the read sequence. If that isn’t the case here, I can provide other samples.

All text is accounted for across apps, but there seems to be an incompatibility when other rendering engines try to read the OCR performed by Abbyy within DEVONthink in particular.

aedwards · September 19, 2024, 12:17pm

Within the PDF itself, the text layer is a series of text blocks plus the co-ordinates of those words and blocks on the page. When selecting text, it is the PDF viewer that determines how your selection maps the page co-ordinates to the underlying text layer’s blocks, and which blocks are selected. As this is independent of the PDF file, it can vary between PDF viewers. The order of the text blocks is similarly viewer-dependent; for example, using Preview and/or Apple’s PDFKit, the plain text output is different to that in your Adobe example.
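To make the mapping from a drag selection to text blocks concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Block` type, the coordinates, and the selection rule (a hypothetical viewer that selects the contiguous run of stored blocks between the first and last block under the pointer). Real viewers each use their own rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    text: str
    x: float  # left edge
    y: float  # top edge, measured downward from the top of the page
    w: float = 200
    h: float = 15

    def contains(self, px, py):
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h


A1 = Block("col1 line1", 50, 100)
A2 = Block("col1 line2", 50, 120)
B1 = Block("col2 line1", 300, 100)
B2 = Block("col2 line2", 300, 120)

IN_ORDER = [A1, A2, B1, B2]    # stored column by column
MISORDERED = [A1, B1, A2, B2]  # same blocks, one stored out of sequence


def hit(blocks, px, py):
    """Index of the first stored block under the pointer, or None."""
    for i, block in enumerate(blocks):
        if block.contains(px, py):
            return i
    return None


def select(blocks, start, end):
    """Hypothetical viewer rule: select the stored run between the two hit blocks."""
    i, j = hit(blocks, *start), hit(blocks, *end)
    if i is None or j is None:
        return []
    lo, hi = sorted((i, j))
    return [block.text for block in blocks[lo:hi + 1]]
```

Dragging from the first to the second line of column 1, `select(IN_ORDER, (60, 105), (60, 125))` stays inside column 1, while the same drag on `MISORDERED` also picks up `"col2 line1"`. Under this assumed rule, that is the kind of jump seen when a block is stored out of sequence.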

Mindstormer · September 19, 2024, 12:31pm

It can’t be 100% independent of the PDF file, because the same file OCR’d with ABBYY FineReader has no issue reading the layout correctly across apps. This issue is unique to ABBYY within DEVONthink.

Apple PDFKit in Preview seems to ignore OCR done by any app and appears to just re-OCR the file.

cgrunenberg · September 19, 2024, 12:38pm

The Abbyy engine for third-party developers (like us) is not identical to the one used by Abbyy’s own FineReader.

BLUEFROG · September 19, 2024, 12:57pm

Preview doesn’t have OCR capabilities. It’s likely you could be seeing the effects of macOS’ Live Text.

BLUEFROG · September 19, 2024, 2:14pm

This is one of your PDFs in Preview with OCR done via the Prizmo app…

Mindstormer · September 19, 2024, 3:10pm

Valid point. I was opening the pre-OCR’d file in Preview and noting that live text was recognized by default. Though tangential to this issue, I appreciate the clarification.

And that’s why I don’t use Prizmo! This is an area where AI-assisted OCR could shine. The only one I’ve tested is Surya in the command line, and though I’ve only tried it with a couple of files, it was impressive. Not sure if that would hold consistently though.

To the point though, Adobe and ABBYY’s official app both do a nearly flawless job, and their output can be read consistently across all apps and their various layout rendering engines. So I’m not sure how Prizmo relates to this thread? DEVONthink’s ABBYY integration is not performing OCR in a way that is compatible across rendering platforms, as ABBYY’s flagship product does. This is probably a bug that they need to address for you, unless it was introduced by something else that 3.9.2 changed in July of 2023.

Mindstormer · September 19, 2024, 3:28pm

I figured; hence this bug report.

BLUEFROG · September 19, 2024, 4:15pm

It shows other OCR engines don’t necessarily produce what you expect them to. It’s not just an ABBYY artifact.

Also, a web search for ocr column detection will provide many articles about how this isn’t a trivial matter to accomplish or an exact science.

Mindstormer · September 19, 2024, 5:09pm

I understand that column detection is not trivial/easy. Of course, when we speak of column detection, there are two separate aspects, and to avoid conflating them, I’ll clarify, because I perceive you may be addressing the first of these two, while I’m talking about the second:

1. The detection process, in which the OCR has to attempt to do the layout detecting (hopefully accurately).
2. Once everything is detected, the way the OCR saves the detected areas to the file so that it has cross-platform rendering compatibility with the same detected region sequencing.

ABBYY in DEVONthink, ABBYY FineReader PDF, and Adobe Acrobat Pro all do a somewhat similar and decent job of #1. Some files are just plain complicated or poor quality though, so mileage can vary. This is understandable.

What I’m talking about is #2, however, where ABBYY in DEVONthink is clearly buggy and doing something wrong after it performs a comparably good detection of the text and layout. After all, the layout is recognized perfectly in DEVONthink, but not in most other apps. In contrast, ABBYY FineReader PDF and Adobe Acrobat Pro are able to perform #2 flawlessly, with cross-platform compatibility, in a way that DEVONthink does not.
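One way to check the second aspect separately from the first is to compare the sequence in which a file stores its detected regions against the sequence a reader would expect. The Python sketch below is only an illustration: the region labels are made up, and `count_backward_jumps` is a hypothetical helper, not something DEVONthink or ABBYY provides.

```python
def count_backward_jumps(stored, expected):
    """Count places where the stored sequence steps backward in the expected order."""
    rank = {label: i for i, label in enumerate(expected)}
    positions = [rank[label] for label in stored]
    return sum(1 for a, b in zip(positions, positions[1:]) if b < a)


# Hypothetical region labels for one two-column page.
expected = ["title", "col1 text", "page number", "col2 part 1", "col2 part 2"]

# Both files contain the same detected regions; only the sequence differs.
stored_good = ["title", "col1 text", "page number", "col2 part 1", "col2 part 2"]
stored_bad = ["col1 text", "col2 part 1", "title", "page number", "col2 part 2"]

print(count_backward_jumps(stored_good, expected))  # 0
print(count_backward_jumps(stored_bad, expected))   # 1
```

Both lists pass a detection check, since every region is present, yet only the first would read in sequence in an app that follows stored order.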

mr_drlove · September 26, 2024, 9:24am

I am also familiar with this problem. I have also tested several PDF editors. In my experience, Adobe unfortunately delivers the best work. ‘Unfortunately’ because I don’t like the product otherwise. That’s why I’ve switched to using the editor to OCR my PDFs beforehand and only then saving them in DVTP. Unfortunately, it’s inconvenient, but it’s the best way for me.

SebMacV · September 26, 2024, 10:35am

Wrangling PDFs is the bane of my reading life. I routinely re-OCR the rubbish (of the file, I hasten to add: not the intellectual quality!) that gets made available through publishers, where the layout of the text layer can be badly out of joint. This sort of thing drives me nuts:
Screenshot 2024-09-26 at 11.23.01

I tend to only discover the problems once my PDFs are synced to iPad and I sit down to read, highlight, annotate. So I then often re-OCR, either using DT’s inbuilt ABBYY engine for little documents, or ABBYY’s standalone FineReader for bigger jobs (whole books, unusual characters, typographical challenges). I often lose quality (sharpness especially) but gain significant file size (as documented elsewhere on this forum). Sometimes PDF Squeezer comes to the rescue. Then I turn to PDF Expert to put chapter headings (a document map) in, or back in, since OCRing can lose them in case they were supplied by the publisher … und so weiter. I wish there was a one-stop shop to alleviate all my PDF woes (but have no access to Acrobat Pro).

I guess I take the good with the bad: digital reading and annotation has revolutionised my research, but we ain’t there yet in terms of shared PDF standards.

Mindstormer · September 26, 2024, 9:53pm

I made a personal comparison of OCR technologies (Tesseract, ABBYY, and Adobe) in late 2022 or early 2023 across various files, layouts, and image qualities (before this bug), and ABBYY was the clear winner, even within DEVONthink. I have not tried Tesseract again, but I have a hunch that just about any currently updated OCR technology will work with better compatibility than ABBYY in DEVONthink until this bug is reproduced, acknowledged, and fixed. Until then, I will have to advise people to just purchase the base tier of DEVONthink since the distinguishing feature in Pro (OCR) is no longer functioning properly.

I just gave a presentation promoting DEVONthink to a doctoral cohort this week as part of an ideal research workflow, but I now have to avoid demoing OCR functionality within DEVONthink in favor of working solutions. Hopefully, this will get resolved soon!

SebMacV · September 27, 2024, 5:15am

Glad to hear that ABBYY sits at the top.

I think that there are many use cases for the in-built OCR that serve folks just fine (OCRing scanned receipts, bank statements, etc. for search purposes). You (and I) have bespoke PDF requirements, work intensively with text to produce professional outputs, and so need high levels of accuracy. We are no strangers to software workarounds to get to our desired workflows and outputs. Despite my grumbling note above (directed more at the inherent problems of PDF than at DT’s ABBYY implementation), I would still wholeheartedly recommend the Pro Edition to my doctoral students. There are other distinguishing features, too, that may be critical for ‘pro’ workflows: see DEVONThink standard vs DEVONThink Pro - #7 by pete31; and here DEVONtechnologies | DEVONthink Editions

As they say, YMMV

cgrunenberg · September 27, 2024, 5:15am

Pro offers more than just OCR (and this list will grow in the future):

Mindstormer · September 27, 2024, 11:46am

I’m aware, but OCR has been at the top of the added feature list for most of the people I interact with in the field of research.

It’s a shame that when one opens or shares a file OCR’d by DEVONthink with anyone else, it can’t be read and used correctly. Highlighting gets all messed up, text-to-speech breaks, copying and pasting text from the file gets all jumbled, searchability breaks, etc. This is a huge issue. I can’t understand why you don’t seem to see that. It only takes a short sentence to acknowledge an issue, or else to seek clarification, if it is not seen, in order to reproduce it.

The whole point of PDF (Portable Document Format) is to be a highly sharable, compatible file format, that preserves the content of a document in a consistent way for sharing.

Adobe and ABBYY FineReader are pretty closely tied now. Just not in DEVONthink, now that it is broken/regressed.

chrillek · September 27, 2024, 2:17pm

As true as that is, it’s not related to OCR. A PDF containing only an image, like what a scanner usually delivers, is highly sharable, compatible, and preserves its content. Which just is not text, but an image. Only pixels, no text at all.
OCR creates an invisible text layer (which should contain what you probably refer to as “the content”) and adds that to the PDF. But it is still PDF, namely a sequence of instructions telling the PDF interpreter what it should draw where.
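As a rough illustration of “a sequence of instructions”, the Python sketch below builds the kind of content-stream operators that can place text without painting it. The operators themselves (`BT`, `ET`, `Tr`, `Tf`, `Tm`, `Tj`) are standard PDF, and text rendering mode 3 means neither fill nor stroke. The font resource name `/F1`, the coordinates, and the helper names are assumptions, and nothing here claims to be what DEVONthink or ABBYY actually emit.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(words):
    """Build text operators for (text, x, y, font_size) tuples.

    The font resource name /F1 is an assumption for this sketch.
    """
    lines = ["BT", "3 Tr"]  # begin text object; mode 3 = invisible text
    for text, x, y, size in words:
        lines.append(f"/F1 {size} Tf")       # select font and size
        lines.append(f"1 0 0 1 {x} {y} Tm")  # absolute position on the page
        lines.append(f"({escape_pdf_string(text)}) Tj")  # show the string
    lines.append("ET")
    return "\n".join(lines)


print(invisible_text_ops([("Building your Database", 72, 700, 10)]))
```

The order of the `Tj` operators in the stream is what an order-following reader sees, independent of where each string sits on the page.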
Creating that text layer is not easy. It should correspond with the image, which is easy, if the same font is used as in the original. But how would software determine that? Or what should it do if said original font is not locally available?
How should the software interpret columns? In a text, it should go from the first column’s top left (in an LTR environment) to its bottom right and then begin with the top left of the second column.
That’s what Abbyy does in DT, and it drives me crazy when applied to invoices or account statements – those look like columns, but they’re tables and meant to be processed from left to right, not top to bottom.
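The difference between the two readings can be sketched in a few lines of Python. The sample statement rows are invented, and the sketch assumes a rectangular grid in which every row has the same number of cells.

```python
def reading_order(grid, mode):
    """Flatten a grid of cells (a list of rows) into a reading sequence.

    mode="columns": finish each column top to bottom before the next (prose).
    mode="rows": finish each row left to right before the next (tables).
    """
    if mode == "rows":
        return [cell for row in grid for cell in row]
    if mode == "columns":
        return [row[c] for c in range(len(grid[0])) for row in grid]
    raise ValueError(f"unknown mode: {mode!r}")


# Invented account statement: date, description, amount.
statement = [
    ["2024-09-01", "Coffee", "3.50"],
    ["2024-09-02", "Books", "42.00"],
]

print(reading_order(statement, "rows"))
print(reading_order(statement, "columns"))
```

Read as rows, each date stays next to its description and amount. Read as columns, all dates come first, then all descriptions, then all amounts, which is the right behaviour for prose columns and the wrong one for a table.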

And so on. Frustration is understandable, but sometimes expectations for/at/in/of OCR are just too high.