What happens when I Re-OCR a document? (recognized text is different)

After reading a few posts regarding DT OCR, I tried re-OCRing the same file several times. My setting is not to compress the resulting document (am I right to think this means that only the text layer will be modified?).

Now, if I copy-paste the text of, say, the first page of the OCR’d document into a plain text file, I can see that the result varies from one round to the next, sometimes improving, sometimes worsening. Can someone explain what is going on? Does the OCR engine “learn” from previous runs?

Thanks!

OCR essentially compares the text being analyzed to a table of “correct” characters. The further away the scanned character gets from the ideal, the less deterministic the result becomes. If the scanned character lies in the overlap between two “correct” characters – between an ‘s’ and a ‘5’, say – then subsequent passes might flip it between the two.
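
A toy sketch of what I mean (the scores here are invented, not from any real engine):

```python
# Toy illustration of why a borderline glyph can flip between passes.
# Real engines compute these scores from trained character models;
# the numbers below are made up.

def classify(scores):
    """Pick the best-scoring candidate character."""
    return max(scores, key=scores.get)

# A crisp 's' scores far above every alternative: deterministic result.
crisp = {"s": 0.97, "5": 0.41, "S": 0.35}

# A smudged glyph sits in the overlap between 's' and '5'. Tiny changes
# in the input image (compression artefacts, re-rendering) can swap
# which candidate wins on the next pass.
smudged = {"s": 0.62, "5": 0.61, "S": 0.30}

print(classify(crisp))    # 's' every time
print(classify(smudged))  # 's' this time, perhaps '5' after the image is regenerated
```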

The OCR itself has no way to know what the “correct” answer is. So if you scan the same document twice, it can either start fresh or assume that the previous scan was correct and give the same result. But since the human wants a new scan, the previous one was probably unsatisfactory, so it is better to start fresh.

For it to “learn,” there would need to be a mechanism to check its result against a known accurate text.

(I do not work for DT and have no knowledge of the specific algorithm they use. The above is just based on my knowledge of how the underlying technology works.)

Rescanning could definitely produce different results. Redoing OCR on an existing file can also generate different text, as I believe the page image is regenerated; it doesn’t just reread the existing image.

If your PDF has a vector graphics layer with bad text, you’ll get a raster one with the new recognized text layer.

This depends on your OCR settings. If the PDF resolution is set to “As source” and the document already contains a text layer, the OCR will replace the existing text layer, leaving the underlying image on each page unchanged. If the PDF resolution is set to a value, say 300 dpi, the image(s) will be copied into the new document. If the original document was of a higher resolution, the image for each page will be downscaled to 300 dpi. However, if the original document was of a lower resolution, ABBYY does not upscale images and the new PDF will be of the same resolution as the original.

@BLUEFROG is correct: when copying the image layer, minor artefacts could be introduced or removed, which could change the OCR’s analysis of a character where there are two close matches.
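
As a rough sketch, the resolution rule described above works like this (the function and names are mine, not DEVONthink’s):

```python
def target_dpi(setting, source_dpi):
    """Resolution rule as described above: a fixed setting caps the
    resolution, but images are never upscaled; "As source" keeps the
    original image layer untouched."""
    if setting == "As source":
        return source_dpi            # existing image layer kept as-is
    return min(setting, source_dpi)  # downscale if needed, never upscale

print(target_dpi(300, 600))          # 300: a 600 dpi scan is downscaled
print(target_dpi(300, 150))          # 150: a low-res original is not upscaled
print(target_dpi("As source", 600))  # 600: image layer unchanged
```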

@kewms’ description of the OCR process is correct and is how ABBYY works. The OCR process does not learn from previous runs or use any previous text-layer data when generating a new text layer.

So, this option is only for Apple Silicon Macs.
If I set it to 0, will it change the image?

Just tried it with ‘Compress PDF’ off and ‘Resolution’ set to 0.
It replaced nice vector letters with raster page images… Not good.

Yes, “As source” is currently only available on Apple Silicon Macs.

All OCR engines work on raster images; any vector content will be converted to a raster image first.
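
For example, with open-source tools the same raster-first step looks like this (PyMuPDF and Tesseract as stand-ins for whatever ABBYY does internally; `scan.pdf` is a placeholder):

```python
# Sketch of the raster-first pipeline every OCR engine follows.
import io
import fitz                # PyMuPDF
import pytesseract
from PIL import Image

doc = fitz.open("scan.pdf")
for page in doc:
    # Even if the page is pure vector text, it is rendered to pixels
    # first; the recognizer only ever sees a bitmap.
    pix = page.get_pixmap(dpi=300)
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    print(pytesseract.image_to_string(img))
```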

Well, I don’t mind them working with raster images internally, but why do they replace vector content with raster in the resulting PDF? Sorry, I understand that this question is really for ABBYY… )

BTW
On an Apple Silicon Mac, if you choose the “As source” option, does it do the same? Or does it leave the vector layer intact?

Thanks; do you mean that the user-selected default language does not matter (OCR being done on a letter-by-letter basis)? I would have assumed that “acce5s” would automatically be corrected to “access”, while “19S7” would be corrected to “1957”. I cannot think of an OCR mistake that would be corrected differently in two languages, but such cases very probably exist.

Again, I’m not familiar with the details of the specific software DT uses. But “spellchecking” and “character recognition” are different tasks.

For the character recognition step, the OCR tool would need to know what alphabet is being used. Using the Roman alphabet when recognizing Arabic text would obviously have disastrous results, and even among languages using the Roman alphabet there can be special characters, accents, etc.

But for spellchecking, the OCR tool has to assume that the text being recognized is composed of “dictionary” words. That’s not at all a safe assumption: receipts, financial statements, and other commonly OCRed texts often include not just numbers but part numbers and product codes, which are very often arbitrary alphanumeric strings. If you’re digitizing historic documents, non-standard spelling might be part of the information you want to capture. And so on.
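
A toy sketch of that kind of dictionary-based post-correction, and why it’s risky on arbitrary strings (the word list and confusion pairs are invented for illustration):

```python
# Toy post-correction pass: swap commonly confused characters and keep
# the variant that appears in a word list.
CONFUSIONS = {"5": "s", "S": "5", "0": "o", "1": "l"}
DICTIONARY = {"access", "invoice", "total"}

def correct(token):
    candidate = "".join(CONFUSIONS.get(ch, ch) for ch in token)
    return candidate if candidate.lower() in DICTIONARY else token

print(correct("acce5s"))   # 'access' — the fix the questioner expected
print(correct("X95S-01"))  # left alone: not a dictionary word, and
                           # "correcting" a part number would destroy data
```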

Again, I don’t have any special knowledge about this particular software. But it seems to me that the only “safe” way to handle documents that can contain arbitrary strings is to ask the user. “Hey, these are the characters I saw. Do you want me to check them against my list of words, too?”

Edit: The thing to remember about all such tools, both OCR and more sophisticated “AI” tools, is that they are not smart, not in the way that humans are. OCR is a pattern-matching task, comparing an image to a statistical distribution. It does not use (or require) any knowledge of “language” or any information about what the images being recognized actually “mean.”
