ABBYY OCR Layout Detection is Broken in DEVONthink

Mindstormer · September 27, 2024, 3:12pm

chrillek:

As true as that is, it’s not related to OCR. A PDF containing only an image, like what a scanner usually delivers, is highly sharable, compatible, and preserves its content. Which just is not text, but an image. Only pixels, no text at all.
OCR creates an invisible text layer (which should contain what you probably refer to as “the content”) and adds that to the PDF. But it is still PDF, namely a sequence of instructions telling the PDF interpreter what it should draw where.
Creating that text layer is not easy. It should correspond with the image, which is easy, if the same font is used as in the original. But how would software determine that? Or what should it do if said original font is not locally available?
How should the software interpret columns? In a text, it should go from the first column’s top left (in an LTR environment) to its bottom right and then begin with the top left of the second column.
That’s what Abbyy does in DT, and it drives me crazy when applied to invoices or account statements – those look like columns, but they’re tables and meant to be processed from left to right, not top to bottom.

And so on. Frustration is understandable, but sometimes expectations for/at/in/of OCR are just too high.
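To make the column-versus-table point in the quote concrete, here is a minimal Python sketch. It is not how Abbyy or DEVONthink work internally; the `Word` type, the coordinates, and the `column_split` parameter are all made up for illustration. It only shows that the same word boxes produce different text depending on whether they are read as columns or as table rows.

```python
from typing import NamedTuple


class Word(NamedTuple):
    # Hypothetical recognized word with the top-left corner of its box.
    text: str
    x: float  # distance from the left page edge
    y: float  # distance from the top page edge (grows downward)


def column_order(words, column_split):
    """Read the left column top to bottom, then the right column."""
    left = sorted((w for w in words if w.x < column_split), key=lambda w: w.y)
    right = sorted((w for w in words if w.x >= column_split), key=lambda w: w.y)
    return [w.text for w in left + right]


def row_order(words):
    """Read line by line: top to bottom, left to right within a line."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]


# A tiny invoice-like table that looks like two columns.
invoice = [
    Word("Item", 0, 0), Word("Price", 100, 0),
    Word("Paper", 0, 10), Word("4.50", 100, 10),
    Word("Toner", 0, 20), Word("39.00", 100, 20),
]

print(column_order(invoice, column_split=50))
print(row_order(invoice))
```

Read as columns, the items and prices end up separated; read as rows, each price stays next to its item, which is what an invoice needs.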

This is not the issue I’ve brought up though, and yet it is still directly related to the OCR process, just not the “recognition” part of the process. ABBYY is correctly identifying virtually all individual characters, so in that sense, optical character recognition is “working” at the level of each individual character. It’s even recognizing the column order correctly in many cases, and where it doesn’t, other solutions fail as well.

The issue I’m reporting is that the way the layout is being sequentially written to the invisible layer is lacking compatibility outside of DEVONthink. Something about those instructions is not readable by other readers, and results in a jumbled text sequence. This is the issue that is unique to ABBYY’s implementation within DEVONthink.

If writing what has been recognized by ABBYY to the invisible layer (line/layout sequence instructions, in particular) is not part of the OCR process (I have assumed ABBYY handles it from start to finish), then this would only reinforce my point that there is an issue with DEVONthink. More likely though, the implementation being used has a bug that needs to be worked out with ABBYY. Either way, the fact that it used to work without this issue suggests that something broke, and I’ve identified when that most likely was.
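One way such a symptom could arise, sketched in Python under loud assumptions: the `Run` type, the coordinates, and both reader functions are hypothetical, and nothing here is confirmed to be what ABBYY's SDK or any particular PDF reader actually does. The sketch only illustrates that a reader which re-sorts text runs by position can show clean text, while a reader which trusts the order in which the runs were written shows a jumbled sequence.

```python
from typing import NamedTuple


class Run(NamedTuple):
    # Hypothetical text run from an invisible text layer.
    text: str
    x: float  # horizontal position on the page
    y: float  # vertical position on the page (grows downward)


def stream_order_text(runs):
    """Reader A (assumed): trusts the order in which the runs were written."""
    return " ".join(r.text for r in runs)


def position_order_text(runs):
    """Reader B (assumed): ignores written order and sorts by position."""
    return " ".join(r.text for r in sorted(runs, key=lambda r: (r.y, r.x)))


# Runs written out of reading order, although each has the right position.
layer = [
    Run("fox", 80, 0),
    Run("The", 0, 0),
    Run("jumps", 0, 12),
    Run("quick", 40, 0),
]

print(stream_order_text(layer))    # fox The jumps quick
print(position_order_text(layer))  # The quick fox jumps
```

If something like this were the cause, every character would still be recognized correctly and placed correctly on the page, yet copy-and-paste or text export in some applications would come out scrambled.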

cgrunenberg · September 27, 2024, 4:47pm

Abbyy’s SDK for third-party developers produces the text layer, not DEVONthink. But that SDK is not identical to the internal engine of FineReader.

Mindstormer · September 27, 2024, 5:16pm

Good to know. That’s probably where it’s happening, then. Hopefully they can fix it, since earlier OCR output was more compatible with other readers than later output is. Is DEVONtechnologies reaching out to them, or is this a matter of passively hoping they notice and fix it at some unknown point in the future?

BLUEFROG · September 27, 2024, 5:24pm

We don’t discuss specific internal communications. However, of course we report bugs to whomever, as needed.

Mindstormer · September 27, 2024, 5:36pm

Excellent, thanks. I wasn’t trying to prod. It just wasn’t clear to me that this issue is even being recognized. I love what DEVONthink does and has to offer, and simply hope for the best with it.

Mindstormer · April 8, 2025, 10:57am

Has this been fixed yet? I’m trying to decide on upgrading to DT4.

cgrunenberg · April 8, 2025, 12:07pm

Unfortunately not, as this (and other reported issues) depends entirely on Abbyy. But OCR has become less important in DEVONthink 4, as the Pro/Server editions automatically make every imported PDF document searchable (using macOS’ Vision framework) and also support macOS’ Live Text.

chrillek · April 8, 2025, 12:15pm

Last time I checked (about two months ago), Vision wasn’t perfect, either. It does occasionally mix up the sequence of words on the same logical line. So “searchable”, yes.

Mindstormer · April 8, 2025, 4:22pm

That’s a shame. I hope they’ve been notified of the issue, as it does not occur when using their dedicated software.

macOS Vision isn’t very competitive for historical documents with less-than-perfect scans.
Would you guys consider allowing us to bring an API key from MistralOCR as a working alternative?

BLUEFROG · April 8, 2025, 4:39pm

Their “dedicated software” is not the same as what they offer to third-party developers.

Mindstormer · April 8, 2025, 8:01pm

Yeah, exactly. It’s a pity that they’re sharing an inferior solution with you guys! I hope they discount whatever contractual agreements they have with you, given that they’ve been providing an increasingly nerfed product since the deterioration began.

BLUEFROG · April 8, 2025, 8:58pm

Thanks for the support!