OCR still forgets words in DTPro 2.0pb3r2

michelarnould77 · February 25, 2009, 9:43pm

The DTPro 2.0pb3r2 OCR engine shows better recognition accuracy than the previous version, but I notice the engine still “forgets” to recognize some words… Not usable for professionals yet.

Best regards
Michel ARNOULD
France

twicks · February 25, 2009, 11:26pm

Hi Michel,

Did you try changing the OCR preferences? In the latest build you can trade off better accuracy for slower speed with a slider control.

Having said that, though, I also find that even at the highest accuracy and slowest speed, the ABBYY process still ignores words, including words that are of good quality (that is, not smudged, with no ink bleeds or faint letters) on the original scan.

Along with accuracy, my main complaint is that ABBYY doesn’t provide any indication of a missing word. You have to do a word-for-word proof of the OCRed text to be sure that something hasn’t been left out.
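
For what it's worth, part of that proofing could be automated if you happen to have a second text version of the same page (for example the output of another OCR engine, or a hand-checked transcript). The rough Python sketch below is only an illustration under that assumption; it is not a feature of DEVONthink or ABBYY, and the function names `tokenize` and `find_dropped_words` are made up for this example. It aligns the two word sequences with the standard library's `difflib` and reports words that appear in the reference but have no counterpart in the OCR output.

```python
import difflib
import re


def tokenize(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def find_dropped_words(reference_text, ocr_text):
    """Return (position, words) pairs for words present in the reference
    but missing entirely from the OCR output.

    Only pure deletions are reported; words that were misrecognized
    (replaced by something else) are not counted as dropped.
    """
    ref = tokenize(reference_text)
    ocr = tokenize(ocr_text)
    matcher = difflib.SequenceMatcher(a=ref, b=ocr, autojunk=False)
    dropped = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "delete":
            dropped.append((i1, ref[i1:i2]))
    return dropped


if __name__ == "__main__":
    # Hypothetical sample text, not taken from a real scan.
    reference = "The quick brown fox jumps over the lazy dog."
    ocr_output = "The quick fox jumps over lazy dog."
    for position, words in find_dropped_words(reference, ocr_output):
        print(f"word {position}: missing {' '.join(words)!r}")
```

This only helps when a trustworthy second version exists; without one, the manual proof is still the only check.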

Shenandoah_Don · February 25, 2009, 11:51pm

I have to say this is disconcerting.

While consistently achieving a high degree of accuracy is obviously desirable, NOT knowing you have missing data is downright dangerous.

This is especially true for those of us scanning technical articles, safety information, complex scientific information, long dissertations, etc.

Without the capacity to alert the user to missing words and let them be corrected, the scan seems like a lottery.

This is especially true given the words that I’ve noticed missing so far. Some of them were quite simple, and, as others have observed here, the source material was clean, legible, without any apparent reason for a missed word.

Scary. And detracting from what is otherwise an awesome piece of information management software.

What exactly are the improvements over the old OCR engine?

twicks · February 26, 2009, 12:11am

Improvements:

(quoting from the DT home page)
The new OCR engine is based on ABBYY FineReader which is more accurate and more reliable than the engine used in earlier versions.

(quoting from the release notes for DEVONthink Pro Office 2.0 Public Beta 3r2)
New: OCR now based on the ABBYY FineReader engine producing smaller PDFs with more accurate text recognition, using less memory and offering much enhanced job control (Windows > OCR Activity)

Of course, your mileage may vary!

Nick_Tamm · March 16, 2009, 4:36pm

I am also very bothered by the omission of whole words without any indication. Will this be fixed anytime soon?

According to my tests (about 50 files of different origins and sizes) the ABBYY engine is slightly more accurate than the IRIS OCR, but it is also way, way slower. ABBYY needs 3 to 4 times (!!!) longer to complete the OCR. Or is it just me?

The only, but significant, advantage seems to be that the resulting files are much smaller with the ABBYY engine. A 20-page, 4 MB file weighs in at 11 MB after OCR (why do the files get so bloated anyway???), whereas with IRIS it’s a whopping 80 MB!

Summing it all up, I don't feel very comfortable with the ABBYY engine, and (reluctantly) I will have to stick with the old OCR (which means having to cope with two versions of DTPro and a more complicated workflow).

Any hints or solutions?