SLOW OCR problem

Speaker · May 10, 2015, 7:07pm

hello
I’m using DTPO to OCR several hundred PDFs, but the conversion is much slower than I expected. The OCR process is only consuming 13% of my CPU. Is there any way to speed up the procedure? Can I run multiple instances of the OCR engine somehow? I’m using the default OCR settings of 150 dpi, 75% quality, and Automatic.

Any tips for increasing the throughput would be appreciated.

Bill_DeVille · May 10, 2015, 7:38pm

The license from ABBYY for the current OCR module in DEVONthink Pro Office prohibits tuning for multicore CPUs. That will change in the future, but not yet.

The downside is that OCR is slow. The upside is that while a batch of documents is being OCRed, your computer’s CPU resources are mostly free to allow you to perform other tasks.

Speaker · May 11, 2015, 1:57am

thanks for the reply, Bill.

2 related questions:

a. Every so often during OCR, I get an error message saying “couldn’t recognize page 0 of the PDF - continue or quit?” Is there a way to disable the pop-up (I always want to continue) so that my OCR process is not paused in the middle of the night?

b. Does this mean that if I purchased ABBYY FineReader for Mac (or used a free 30-day trial), I could get faster speeds? I see they have several products. Which one should I get for maximum OCR efficiency?

JBB · May 11, 2015, 4:14am

FineReader Pro for Mac has the best accuracy and performance. I would suggest that DEVONthink offer an option to integrate with this solution (OCR directly from DEVONthink but using that app’s engine and Automator actions) to overcome the multicore problem and the licensing for those of us who have purchased it. Then rather than DEVON paying ABBYY, maybe they should provide some discount/incentive to you for providing customers for their premier OCR product.

Bill_DeVille · May 11, 2015, 2:50pm

@JBB: Interesting twist, but it doesn’t seem to fit ABBYY’s business plan.

OCR is a demanding procedure, as it involves computer recognition of images of text and their conversion to searchable text. IMHO, the most important requirement of OCR is accuracy of text recognition. I’ve bought and tested all the important Mac OCR apps, comparing text conversion results for several scanner output files. (With one exception concerning “bought”: although I purchased registration for Acrobat Pro in the past, I don’t see the value to me of Adobe’s current price model.)

ABBYY’s OCR software is as good as it gets in the world of consumer-priced OCR software; my tests put it in first place. There are some big-machine, much more costly OCR applications available to businesses and government that can sometimes produce better results than ABBYY, based on larger dictionaries, but the improvements are pretty slight as a practical matter. Given the vagaries of image resolution, image quality, fonts and font sizes, and blemishes in the original paper copy, no OCR software has achieved perfect text recognition.

I’ve got four ABBYY FineReader applications: the OCR module in DEVONthink Pro Office, FineReader Mac that came with my ScanSnap iX500 scanner, FineReader Pro 12.x that I purchased and FineReader 10.x Windows that came with my Xcanex scanner. Accuracy is very good for all four. But the version in DEVONthink Pro Office is the slowest on my MacBook Pro with 4-core i7 CPU, because it isn’t optimized for multicore processing.

Nevertheless, I almost always stick with the OCR module in DEVONthink Pro Office, for two reasons. 1) It can OCR images from any source (as can my registered copy of FineReader Pro 12.x), unlike the free versions that came with my scanners, which are limited to OCR of images produced by that scanner. 2) It’s much simpler to use for OCR of images that have already been captured into a DEVONthink Pro Office database. Yes, it takes longer, but it’s a background process and I can continue working while OCR processing is proceeding. But I certainly won’t object when the OCR module in DEVONthink Pro Office becomes optimized for multicore processing.

But is ABBYY FineReader Pro 12.x the fastest OCR app I’ve tested? No, I’ve got a couple of apps that can do OCR that are significantly faster than ABBYY. They really zip through the processing chores on my 4-core i7 CPU. But there’s a penalty. The accuracy of text recognition is inferior. They skip a lot of processing checks that ABBYY makes to achieve superior recognition accuracy. There’s really no free lunch.

thijsb · April 11, 2016, 12:36pm

Dear Bill,

Regarding your answer: “But is ABBYY FineReader Pro 12.x the fastest OCR app I’ve tested? No, I’ve got a couple of apps that can do OCR that are significantly faster than ABBYY. They really zip through the processing chores on my 4-core i7 CPU. But there’s a penalty. The accuracy of text recognition is inferior. They skip a lot of processing checks that ABBYY makes to achieve superior recognition accuracy.”

I am looking for a program to OCR about 15,000 PDF files. I also have a version of ABBYY which came with my ScanSnap. At this speed it will take me about 10 days. I have a MacBook Pro i5 with 2.6 GHz cores and 16 GB of RAM.
Do you have a suggestion to speed up the process?
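
As a rough check of the figures above (about 15,000 files in about 10 days), the implied time per file can be worked out with a few lines of Python. This is only back-of-the-envelope arithmetic: the function names are illustrative, and it assumes the machine runs around the clock unless a shorter working day is given.

```python
def seconds_per_file(total_files: int, total_days: float,
                     hours_per_day: float = 24.0) -> float:
    """Average seconds spent per file for a batch that takes total_days."""
    total_seconds = total_days * hours_per_day * 3600
    return total_seconds / total_files


def days_needed(total_files: int, secs_per_file: float,
                hours_per_day: float = 24.0) -> float:
    """Days required for a batch at a given per-file time and daily run time."""
    return total_files * secs_per_file / (hours_per_day * 3600)


if __name__ == "__main__":
    # Figures quoted in the post: 15,000 files, roughly 10 days (assumed 24 h/day).
    per_file = seconds_per_file(15_000, 10)
    print(f"{per_file:.1f} s per file")
    # Same per-file speed, but only running 8 hours a day.
    print(f"{days_needed(15_000, per_file, hours_per_day=8):.1f} days at 8 h/day")
```

Under those assumptions the estimate comes to a little under a minute per file, and about 30 days if the batch only runs 8 hours a day.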

BLUEFROG · April 11, 2016, 8:52pm

I would assess whether you need to OCR all 15,000 now. If there is no need to have them all now, I would highly suggest doing the critical files now, then doing the balance as an ongoing project, or on an as-needed basis.

Frederiko · April 12, 2016, 10:08am

The fastest consumer OCR I have found is Acrobat Pro with its default settings (i.e. don’t straighten pages and don’t make the pages editable). Much faster than ABBYY or Prizmo, and the accuracy is comparable to ABBYY.

Honestly, with 15,000 pages to OCR I would either dedicate a computer to do nothing else for 24 hours a day, hire and train someone to do it, or outsource the job to someone else with the facilities. This may well be cheaper than buying Acrobat if you don’t already have it.

The problem with doing it yourself is that 15,000 pages is going to throw up a good number of errors along the way, which could well stall the process until you intervene on each occasion.

Frederiko
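
The stalling problem described above (a single failed document halting the whole batch until someone intervenes) can be avoided if the OCR tool can be driven from a script. Below is a minimal sketch of that pattern in Python. It is not a DEVONthink, ABBYY, or Acrobat API: `ocr_one` is a hypothetical placeholder for whatever call or command actually performs OCR on one file, and `run_batch` is an illustrative name.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def run_batch(files: Iterable[Path],
              ocr_one: Callable[[Path], None]) -> Tuple[List[Path], List[Tuple[Path, str]]]:
    """Run ocr_one on every file; record failures instead of stopping."""
    done: List[Path] = []
    failed: List[Tuple[Path, str]] = []
    for path in files:
        try:
            ocr_one(path)  # hypothetical: replace with the real OCR call
        except Exception as exc:
            # Note the failure and keep going so the batch never waits for input.
            failed.append((path, str(exc)))
        else:
            done.append(path)
    return done, failed


if __name__ == "__main__":
    def fake_ocr(path: Path) -> None:
        # Stand-in for a real OCR step; fails on one file to show the behaviour.
        if path.name == "bad.pdf":
            raise ValueError("couldn't recognize page 0")

    ok, errors = run_batch([Path("a.pdf"), Path("bad.pdf"), Path("c.pdf")], fake_ocr)
    print(f"{len(ok)} converted, {len(errors)} failed")
    for path, reason in errors:
        print(f"  {path}: {reason}")
```

The failed files end up in a list that can be reviewed or retried the next morning, rather than blocking the rest of the run overnight.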