OCR often needs editing

I’ve been scanning a lot of books and other printed material for a research project. I use a ScanSnap iX500 - sometimes I scan to the desktop, sometimes directly into DT.

Once in DT, I use DT’s “OCR to Searchable PDF”. So far, so good.

But when I copy some of the text from within DT and paste it into an outside document - often OmniOutliner - it sometimes requires editing. Sometimes it pastes just fine. But sometimes a line of text will have no spaces between words - as in: nospacesbetweenwords - and I’ll have to go in and manually insert the spaces. Sometimes there will be no space after the word “a”, so a phrase like “a phrase like” will read “aphrase like”.
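(If you want to automate finding these before pasting, here is a rough Python sketch for flagging suspect lines. The length threshold is an arbitrary guess, so long legitimate words or URLs will produce false positives.)

```python
# A heuristic for flagging lines where OCR dropped the spaces.
# max_len is an arbitrary threshold; tune it for your material.
import re

def flag_missing_spaces(text, max_len=18):
    flagged = []
    for n, line in enumerate(text.splitlines(), start=1):
        tokens = re.findall(r"[A-Za-z]+", line)
        longest = max((len(t) for t in tokens), default=0)
        if longest > max_len:
            flagged.append((n, line.strip()))
    return flagged

if __name__ == "__main__":
    sample = "This line is fine.\nbutthislinehasnospacesbetweenwords here\n"
    for n, line in flag_missing_spaces(sample):
        print(f"line {n}: {line}")
```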

I don’t mind doing a little manual editing, but sometimes it’s a bit much. And it happens even when the original pages I’m scanning from are clear and clean. There doesn’t seem to be anything wrong with the original, but the OCR sometimes gets wonky. So I’m wondering if there’s a way to improve the quality of the OCR being done inside DT.

Is there? Do we have any options?

OCR is not a 100% process, and no OCR engine can produce a faultless conversion. (And the ABBYY engine is one of the best.) The quality of the original, including the resolution and contrast of the image as well as unknown letterforms, can affect the character detection.

Thanks for the fast reply. Are there any settings to improve quality? For instance, in DT/Settings/OCR I noticed that “Resolution” has a zero next to it. Might recognition improve if I enter a value?

That setting has no effect on the OCR, though it shouldn’t be 0.

What kind of material do you scan (age, paper quality, font(s))? And what are your scan settings (resolution etc)?

A wide variety of material, e.g., books old and new, hardcover and paperback, some in perfect condition, some not so much. ScanSnap settings are at Best or Excellent.

As Jim said, the popular consumer-oriented OCR products like ABBYY and Adobe are imperfect.

You can help by fiddling with the settings. Excellent isn’t necessarily the most appropriate setting for OCR purposes. Try running the same page through with different combinations of scan quality, compression rate, and gray/B&W/color, and you should see slight differences.

Personally, I prioritize readability over other considerations, so I usually have everything scanned with excellent + gray (or color, if necessary) + no compression and rely on the installed Adobe OCR. This may produce more “noise” for the OCR than BW in some cases. Your mileage may vary.

At any rate, unless you get access to whatever Google is using for Google Books (Tesseract / KNIME?), you will have to settle for good enough rather than excellent or perfect results.
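If you want to experiment along these lines without rescanning everything, Tesseract is installable locally. A sketch using pytesseract (assumptions: Tesseract, Pillow, and pytesseract are installed; this is not the ABBYY or Adobe engine discussed above, but it makes the gray-vs-B&W effect easy to see):

```python
# OCR the same page as grayscale and as hard black-and-white,
# then compare word counts. "scan.png" is one exported page.
from PIL import Image
import pytesseract

page = Image.open("scan.png")

gray = page.convert("L")                            # 8-bit grayscale
bw = gray.point(lambda p: 255 if p > 128 else 0)    # crude 50% threshold

for label, img in [("gray", gray), ("b&w", bw)]:
    words = pytesseract.image_to_string(img).split()
    longest = max(map(len, words), default=0)
    print(f"{label}: {len(words)} words, longest token {longest} chars")
```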


Great advice. As a test, I just scanned a page that had given me trouble: OCR produced a line of words with no spaces between them. I ran the page through the ScanSnap with various settings (Excellent, Best, B&W instead of Auto, etc.), and in each case I ended up with the same problem (one line of words with no spaces).

I then put those PDFs into DT and ran its OCR → Searchable PDF on them. In every case, the problem disappeared. So, for some reason, DT’s OCR cleaned up the problem that ScanSnap’s OCR couldn’t.
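(To compare the two text layers more systematically than by eye, something like this sketch works; it assumes pypdf is installed, and the file names are placeholders.)

```python
# Diff the text layers of the ScanSnap-OCRed and DT-OCRed PDFs.
import difflib
from pypdf import PdfReader

def page_text(path, page=0):
    return PdfReader(path).pages[page].extract_text() or ""

scansnap = page_text("scansnap_ocr.pdf").splitlines()
devonthink = page_text("devonthink_ocr.pdf").splitlines()

for line in difflib.unified_diff(scansnap, devonthink,
                                 fromfile="ScanSnap", tofile="DEVONthink",
                                 lineterm=""):
    print(line)
```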

I am curious as to why that would be, but will happily live my life even if I don’t find out. Not having to clean up all those errors will make my workflow a lot more efficient.

Thanks for the suggestion!


And this prompts me to ask one follow-up: my solution to the problem is to perform a second OCR on the pages. I know I can set the ScanSnap to scan and then send the PDF into DEVONthink. But is there a way to use DT’s own OCR function with the ScanSnap - in other words, to save a step and have DT be the only OCR - or do I have to scan, import into DT, and then use DT’s OCR function?

Glad I could help, though I think you found the best solution for your case: rely on ABBYY instead of Adobe (bundled with ScanSnap). I think that ABBYY is generally superior to Adobe with the texts I handle, but I don’t seem to be able to prevent it from compressing/optimizing my images, so I end up with less readability. I seem to remember reading somewhere on these forums that the Apple silicon folks have access to this setting, so I will have to wait until I upgrade my computer again in the next year or two.

As for OCR, you should be able to make a “smart rule” that OCRs any PDFs in your inbox or some other designated location, but (for the reasons stated above) I have not fiddled with it.
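Outside DT, the same watch-a-folder-and-OCR idea can be sketched with ocrmypdf (a different engine again: it drives Tesseract, not ABBYY; the folder names below are examples):

```python
# OCR every PDF that lands in an inbox folder, writing searchable
# copies to a second folder. Assumes ocrmypdf is installed.
from pathlib import Path
import ocrmypdf

inbox = Path("~/Scans/inbox").expanduser()
done = Path("~/Scans/ocr").expanduser()
done.mkdir(parents=True, exist_ok=True)

for pdf in inbox.glob("*.pdf"):
    out = done / pdf.name
    if not out.exists():
        # skip_text leaves pages alone if they already have a text layer
        ocrmypdf.ocr(pdf, out, skip_text=True)
        print(f"OCRed {pdf.name}")
```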


For what it’s worth, when I compared a bunch of OCR engines a few years ago, ABBYY was the best for my uses (documents with columns/sections, etc.). Tesseract is available open source, but I think Google keeps a bunch of their in-house tuning to themselves, and I found the open-source version to be markedly worse than ABBYY.


Actually, the ScanSnap uses ABBYY also. But for some reason, the ABBYY in DEVONthink clears up problems the ABBYY from ScanSnap cannot.


Thank you for the correction. I think I had in mind the older versions of ScanSnap that I was using, which, as I recall, relied on Adobe. Indeed, the iX500 I am using now relies on ABBYY. I have also found differences between its performance and DT’s. I suppose they are using different versions, and I would guess the ScanSnap one is older.

Thanks for the extra info about the Google practices. I was wondering about that as well, and now some of the things I have read make a bit more sense.

Did you check what version of the ABBYY package each is running? In my experience the ScanSnap software suite tends to get less frequent updates than DT. Could explain the difference.

I did - but ScanSnap doesn’t show the ABBYY version number.
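If it’s only buried in the app bundle, the Info.plist of the embedded framework sometimes gives it away. A sketch; the framework path below is hypothetical, so you’d first have to locate the actual ABBYY bundle (e.g. via Finder’s Show Package Contents):

```python
# Read the version strings from a bundled framework's Info.plist.
# The framework path is hypothetical; substitute the real bundle.
import plistlib
from pathlib import Path

plist = Path("/Applications/DEVONthink 3.app/Contents/Frameworks"
             "/SomeABBYYFramework.framework/Resources/Info.plist")

with plist.open("rb") as f:
    info = plistlib.load(f)

print(info.get("CFBundleShortVersionString"), info.get("CFBundleVersion"))
```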

Good point about versions. I tried ABBYY FineReader 12 at a time when DT was still using v11 (which might still be true). v12 did perform noticeably better than DT in one particular case (a large collection of poorly scanned newspaper cuttings). In general, though, it wasn’t always clear that v12 was an improvement; both versions made mistakes, but in different places.

Interesting. And yes, DT is still using FineReader v11.