OCR and weak AI

I know that the OCR engine in DT is provided by ABBYY, not DT, but does anyone know why its so-called AI component is so weak?
Why does it OCR a text as “Dr. Barnes is well known as a collector. His home at Overbrook, I’a., contains the most comprehensive collection of modem pictures in America. It includes fifty Renoirs. Mis opinion should ba of exceptional interest.”
Surely the AI should have access to a big enough sample of English prose to be able to prefer “Overbrook, Pa.” to “Overbrook, l’a.” and “His opinion” and “be of exceptional” to “Bis opinion” and “ba of exceptional.” The engine is making guesses, anyway, about what letters are represented by the various squiggles on a page, so why doesn’t it make contextual guesses that are informed by easily accessed samples of English prose?!
This seems such a simple way to improve OCRing, in 2022, and yet it doesn’t seem to be implemented, anywhere.
In frustration,

Blake

AFAIK there’s no “AI” involved in the OCR process. What you seem to want here is a context-aware piece of software that somehow “understands” what the text is about. Now, I don’t know US geography, but I know that “Neustadt” (for example) is a city name that occurs often in Germany. And “Overbrook” seems to exist in Kansas, too. So the “Pa” that you see and that the OCR process converts to a nonsensical (?) “I’a” could also have been converted to an (equally useless) “Ka”, I suppose.

Your “easily accessed samples of English prose” sound innocuous enough. But, alas, there’s not only English spoken in this world. So the OCR engine would have to access German, French, Spanish prose as well – and how exactly is it doing that? Using the internet, which means that you can’t run OCR without network access? Storing all these prose samples locally?

Furthermore, I doubt that samples of prose (in whatever language) are helpful here – the OCR engine would have to gather the meaning of the text and then draw conclusions about the best hit. Following which rules, though? Language depends on region, social status, time …

You might try your hand with Apple’s built-in OCR engine. Perhaps it gives better results? (I doubt that, seeing that it easily mixes up lines in a text, but who knows…)

@chrillek is correct. There is no AI in OCR.

Why does it OCR a text as …

There is no operation where OCR is done on text. You are misunderstanding the process. OCR operates on an image and the quality, contrast, and sometimes letterforms matter.

This is, by analogy, no different from you trying to recognize someone in a photo. You’re not examining the person. You’re examining a two-dimensional item, trying to detect edges, forms of shadow and light, color, etc., trying to decipher those things as someone recognizable. A faded photo from the 1910s would be harder to examine than a well-lit one taken yesterday.


Ahh, I guess I was misled by ABBYY’s claim that it uses AI!
So (English) OCR merely chooses between the 26 letterforms in the alphabet (plus punctuation marks) without paying attention to the context within which they occur?
That’s strange, given that (pace Chrillek) it would in fact be pretty easy for an AI’s neural network to train itself to recognize (for example) that either “Overbrook, PA” or “Overbrook, KA” is statistically more likely to occur than “Overbrook, 'l’a”, and then to recognize that “Overbrook, PA” is statistically common in (public domain) texts that discuss “Dr. Barnes.” An AI OCR engine (like all AIs) wouldn’t have to “understand” any text; it just has to look for statistical groupings. (And it would be trivially easy for it to recognize the language a text is written in, then examine only sample texts in that language.)
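(To make the idea concrete, here is a minimal sketch of that kind of statistical preference. This is my own toy illustration, not anything ABBYY does: the corpus counts are invented stand-ins for a real sample of English prose, and a real system would also weigh character-level confusion probabilities.)

```python
# Toy illustration of contextual preference: given several candidate
# readings for one OCR'd word, prefer the one most frequent in a corpus.
# The counts below are invented; a real system would use a large corpus.
CORPUS_FREQ = {
    "Pa.": 90_000,   # hypothetical count
    "Ka.": 1_200,
    "l'a.": 0,       # essentially never occurs in real prose
    "His": 500_000,
    "Bis": 800,
}

def pick_reading(candidates):
    """Return the candidate reading with the highest corpus frequency."""
    return max(candidates, key=lambda w: CORPUS_FREQ.get(w, 0))

print(pick_reading(["l'a.", "Pa.", "Ka."]))  # prefers "Pa."
```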
Since the advent of GPT-3, there have been dozens of AI language generators that do far more complex tasks than the simple recognition I’m suggesting, so I can’t understand why they aren’t used in OCR. I was hoping someone in the (very smart!) DT community could give me a reason, to help allay my daily frustration!

Are those running in about 10 different languages locally on a Mac? If not, you’d have to send your text somewhere to do the cleanup. Which you, I suppose, could do anyway after OCRing.
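(On the language-detection point at least, no network would be needed – a crude stopword-based identifier runs entirely locally. A toy sketch of mine, with tiny made-up stopword lists; real engines ship proper trained models per language:)

```python
# Toy local language identifier: count stopword overlaps per language.
# The stopword lists here are tiny made-up samples; a real engine would
# ship trained models, still without needing network access.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "in", "to", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein"},
    "fr": {"le", "la", "et", "est", "les", "une", "dans"},
}

def guess_language(text: str) -> str:
    """Return the language whose stopwords overlap the text the most."""
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))
```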

I am not at all convinced that what you want is even possible now, although AI proponents seem to indicate that it is. Firstly, because currently English is probably the only widely AI-fied language (and I do not care very much for that). Secondly, because what you want requires a vast corpus of text to figure out even seemingly simple things like “texts in the public domain (where, BTW, did you tell the OCR engine that?) dealing with Dr. Barnes are probably referring to Overbrook PA, not Overbrook KA.”

I think that your frustration is probably caused by a mismatch between AI’s purported abilities (“marketing blurb”) and what it can really do. Even simple statistical inference is only possible if you have a lot of data from which to calculate the statistics. Which would require the engine to have already “read” a lot of public domain texts about Dr. Barnes and Overbrook.


Google has access to one of the biggest digital libraries in the world (and millions of dollars to build these things), and their OCR often misrenders words, sometimes with hilarious consequences for those who enjoy a bit of juvenile humour. Incidentally, this humorous example suffers from the same problem as your “modem/modern” issue, which is that both words look very similar (even in this font in fact!). Is that white pixel the separation of ‘r’ and ‘n’, or an imperfection in the text you’ve asked the OCR to render? WHO KNOWS.


You’ve introduced another problem :grimacing:

Even assuming an AI OCR existed that had grasped English and could process words in context and offer up the most likely permutation (how would this happen, and how long would it take?), how does it handle spelling errors, unusual texts (an essay on the differences between Overbrook Pa and Overbrook Ka springs to mind), historical language use (good luck using this AI to render Chaucer), text in multiple columns (some OCRs already struggle with columns), etc.? There are so many variables, even before we reasonably ask that it replicate all of that for tens of other languages.


Well, the AIs that DO exist – e.g., for text generation – do seem to manage almost all of the above, as if truly by magic, just based on a neural network and a sufficient data set for training it. (And the data set of English-language texts is certainly sufficiently vast.) I realize that OCR, as things stand, fails dismally … but I still can’t understand why.

(For instance, feeding the following prompt into GPT-3: “Write a text on the house lived in by Dr. Barnes the Pennsylvania collector”

Comes up with:

“Dr. Barnes, the Pennsylvania collector, lived in a house in Overbrooke. The house was a two-story, white frame house with a porch and a small yard. Dr. Barnes was a collector of art and antiques, and his house was full of them. He had paintings by Rembrandt and Van Gogh, and sculptures by Rodin and Michelangelo. He also had a collection of ancient Egyptian artifacts. Dr. Barnes was a very private person and seldom had visitors.”

As far as I understand it, none of those sentences needs to be cribbed directly from other texts on the Web; they can be built just from the minimal information in the prompt and the mass of Barnes-related texts the AI has encountered. Couldn’t a paragraph of characters, having gone through a preliminary OCRing, be fed as a kind of “prompt” into a neural network, which would then correct statistically unlikely words to the most likely ones?)

At any rate … thanks for all the time you guys have given over to my query!

Where did you find the original, and what does it look like?

Here’s a screen grab of the original text scan. It’s lousy, so I’m not at all surprised that an OCR that only does image-recognition on characters should have made the mistakes it did. I AM surprised that, in 2022, the OCR can’t include an added AI step that does TEXT recognition, via contextual cues, to correct the very obvious errors in the first-step OCR.
It would still make errors, of course, but it would at least choose solutions (and sometimes errors) that are statistically more likely than not. I doubt that “ba of exceptional” or “Bis opinion” would survive. (Especially since the AI could easily learn that “ba” is a common OCR error for “be” and “bis” for “his,” and make the most likely corrections, accordingly.)
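(A sketch of that “learned confusions” idea – entirely my own toy example; the confusion pairs and the tiny dictionary are invented, and a real system would work with probabilities over a full lexicon rather than a hand-picked table:)

```python
# Toy post-OCR correction using a table of known letter confusions,
# e.g. "ba" misread for "be", "Bis"/"Mis" for "His". The pairs and the
# dictionary are invented examples, not data from any real OCR engine.
CONFUSIONS = [("B", "H"), ("b", "h"), ("a", "e"), ("M", "H")]
DICTIONARY = {"be", "his", "opinion", "exceptional", "collector"}

def correct(word: str) -> str:
    """Keep dictionary words; otherwise try single confusion swaps."""
    if word.lower() in DICTIONARY:
        return word
    for wrong, right in CONFUSIONS:
        candidate = word.replace(wrong, right)
        if candidate.lower() in DICTIONARY:
            return candidate
    return word  # no plausible fix found; keep the original reading

print(correct("ba"), correct("Bis"))  # "be" "His"
```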

I would have agreed with you before I came across deepl.com; there is clearly some ‘awareness’ of context in the AI used by the translation engine. I could well imagine that it could be put to use in OCR in the way the OP suggested (although I’m not aware of any software actually doing that).

(And actually, if you look at the improvements to dictation in iOS 16 then you have to suspect that context awareness is coming along nicely.)


Not to contradict you, of course, but (there’s always a but): DeepL has the context of one language and derives everything from there. OCR doesn’t have that and would have to go elsewhere – Google, Library of Congress, whatever. That’s a much bigger corpus than what DeepL works with.

That is a relevant ‘but’ for sure.

ABBYY’s SDK is just OCR. It uses AI to recognise letters in the image.
ABBYY’s SDK in Devonthink is useful for text searching, classifying, etc.

ABBYY’s FineReader desktop package has more options and AI features for matching formatting and detecting text and image blocks, so it can deal with a mixture in the same file; it also has features for spell checking and autocorrect.

ABBYY’s FineReader desktop package offers better results but you would have to use it separately then integrate it into your Devonthink workflow.

OCR isn’t perfect, and mistakes can get through; proofreading and correction is the only 100% solution.

People mention GPT-3 natural language processing; theoretically this could improve things. However, AI doesn’t actually understand and comprehend language, therefore human proofreading is the best option.

I suppose AI does have its place and can save some time.


Yes, you are clearly right – certainly about ABBYY and its place in DT.
I just asked a scholar who has some expertise in AI, and she said that it may be a while before OCR can depend on AI for corrections, however much that makes sense in theory. Partly an issue of who pays for the AI’s training and then the processing power to run it! But you’d think business and government use alone would make a MUCH better OCR pay for itself…

Well, the AIs that DO exist – e.g., for text generation – do seem to manage almost all of the above, as if truly by magic, just based on a neural network and a sufficient data set for training it. (And the data set of English-language texts is certainly sufficiently vast.)

However, the AI has no understanding of the text it processes, and understanding is still a critical requirement in language. AI uses statistical patterns rather than comprehending context.

I realize that OCR, as things stand, fails dismally … but I still can’t understand why.

What is the quality of the files and images?
This can make a huge difference: DPI, sharpness, brightness, font clarity/recognizability, dust/dirt/tears/creases, flatness, and orientation.


APC, have you (or others) used ABBYY’s separate application in conjunction with DT3? I’ve been frustrated at times with DT3’s inability to correctly identify columns of text (and sometimes even lines of text in a single-column page), especially when iOS’s new text recognition feature can correctly parse text from a screenshot of the same PDF. I thought about resorting to ABBYY or something like it for more reliable OCR processing of batches of PDFs, but assumed that since DT3 and DTTG already use the ABBYY SDK, I wouldn’t get improved results. From your post, though, it sounds like that isn’t true, so I’m wondering whether you or others have come up with workflows to solve problems like the one I’ve described.

Sorry, I haven’t used the separate ABBYY application … but I’m so disappointed in the general state of OCR that I’ve just learned to live with the lousy results it gives. (I wonder why Apple hasn’t released its iPhone OCR as a separate Web utility? It really is more powerful than most of the competition…)

It would be interesting if you could share your sample. What does tesseract do with it? On the AI part, I think as a general rule, if something says it has “AI” in it, then it normally doesn’t.

I was doing that intensively, as I happened to own an ABBYY command-line interface license. I put the whole thing in a Docker container and then used that to run the OCR. Unfortunately, with the move to M1, while the container still works, it’s now very slow, as the libraries I have are an Intel version (like, from 2013). But what I had already noticed in the past was that the results were practically identical to what ABBYY called from DTP did. So I switched to DTP and just configured a rule where, on demand, I run OCR on a folder on all PDFs where the word count is 0. That works very fast for me, so no more problems.
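(For anyone wanting to approximate that “word count is 0” rule outside DTP, here’s a trivial sketch. The function name and threshold are my own; in practice the text layer would come from a PDF library such as pypdf’s `extract_text()`.)

```python
# Toy version of the "OCR only where needed" rule: a PDF whose existing
# text layer yields (almost) no words is presumably a pure image scan
# and still needs OCR. min_words is an arbitrary threshold of my own.
def needs_ocr(extracted_text: str, min_words: int = 1) -> bool:
    """True if the text layer has fewer than min_words words."""
    return len(extracted_text.split()) < min_words
```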
