OCR: missing words

michelarnould77 · February 19, 2009, 9:32pm

I’ve been working for 4 days with the new OCR engine in DEVONthink pb3.
I’m quite disappointed, because the OCR engine is anything but perfect.

Many words are not recognized and are simply “forgotten” by the ABBYY software.
I’m using a ScanSnap scanner.
A single-page PDF document was OCRed twice:

First with DT PB3
and then with FineReader 3.0 for ScanSnap

The winner is FineReader, which made no mistakes reading the A4 page, whereas DEVONthink, reading the same document, dropped 15 words!

I’ve tried different scanner settings and different DT settings, with the same result.

Michel Arnould
FRANCE

jwstex · February 20, 2009, 12:04am

I’ve seen similar error rates with DTPO pb3 OCR. I scanned a single-page document using my ScanSnap S510M with “Best” settings. I then opened the PDF in Adobe Acrobat Pro version 8 and ran OCR using the built-in features. The resulting text had a few errors, but all words in the document were present and accounted for.
I imported the same (pre-OCR) document into DTPO pb3 and converted to a searchable PDF. The resulting text was missing 20 entire words!

twicks · February 20, 2009, 1:20am

Words missed by ABBYY are also a major complaint of mine.

I have converted a number of previously scanned pages to RTF using the built-in ABBYY OCR module.

What really annoys the heck out of me is that, in addition to missing complete (and very readable) words, there is absolutely no indication that a word was missed. Other OCR applications (like OmniPage) at least stick in an odd character like the ~ to show that the program missed something. No, not the highly touted ABBYY. You have to be a very careful proofreader to spot these missing words. This just isn’t a viable way of handling “unreadable” words and characters. Why isn’t there some feedback to us users that something was missed?

Incidentally, I ran the same pages through DTPO 1.x using the Iris engine and those words were recognized. I also read the scanned image into OmniPage Pro, and while those words were recognized, others weren’t. OP inserted ~ for missing groups of characters.
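
For anyone who would rather not proofread by eye, here is a minimal sketch in Python of one way to spot dropped words, assuming you have plain-text output of the same page from two OCR engines (for example, one that kept every word and one that did not). The function names `words` and `missing_words` are made up for this illustration; nothing here is part of DEVONthink, ABBYY, or any other OCR product.

```python
import difflib
import re


def words(text):
    # Lowercase and keep only word-like tokens, so that differences in
    # punctuation or capitalization are not reported as missing words.
    return re.findall(r"\w+", text.lower())


def missing_words(reference, candidate):
    """Return (position, word) pairs present in `reference` but dropped from `candidate`.

    `reference` is the text from the engine you trust more; `candidate` is the
    text you suspect of silently dropping words. Positions index the reference
    word list.
    """
    ref, cand = words(reference), words(candidate)
    matcher = difflib.SequenceMatcher(a=ref, b=cand, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" means a run of reference words has no counterpart at all
        # in the candidate, which is the "forgotten word" case.
        if tag == "delete":
            missing.extend((i, ref[i]) for i in range(i1, i2))
    return missing


if __name__ == "__main__":
    good = "The quick brown fox jumps over the lazy dog"
    bad = "The quick fox jumps over lazy dog"
    for position, word in missing_words(good, bad):
        print(position, word)
```

This only reports words that vanished outright. A word that was misread, or a dropped word sitting right next to a misread one, shows up as a "replace" in the diff and is not listed, so treat the output as a hint about where to look, not a complete error count.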

My scanner is an Epson 4990 FireWire, and I used OmniPage to scan in the original files.

twicks · February 20, 2009, 1:27am

Um, Michel, I think FineReader is an ABBYY product, and that is what DTech has licensed for use with DTPO2. IIRC the standalone version of FineReader is PeeCee only (no Mac version). So how were you able to do this? Or am I misunderstanding what you did?

-Tod

Bill_DeVille · February 20, 2009, 4:02am

Yes, the ABBYY OCR engine is used both in the FineReader OCR application and in DTPRO 2.0 pb3 and later.

Annard has to do a lot of coding to control the OCR engine, and as the ABBYY code is a port from Windows, it doesn’t come with a lot of standard Mac APIs (spaghetti, anyone? Windows-style addresses that don’t exist in the normal Mac world?).

Continued builds of the “controlling” DEVONthink plugins are coming along nicely, with much better accuracy.

As Annard commented in another post, “the good news is that ABBYY OCR has a lot of potential; the bad news is that ABBYY OCR has a lot of potential.”

Problems such as black pages when running OCR on some PDFs have been resolved. Annard sent me beta 11 of the plugins today, and accuracy is much better. There’s also a new “Accuracy” slider in OCR preferences, allowing the user to choose a compromise between speed and accuracy.

Annard has some further polishing in mind. I wouldn’t be surprised to see in-house beta 12 plugins tomorrow.

michelarnould77 · February 20, 2009, 8:00am

That’s good news! I’ll wait for DT Pro pb4.