Newbie OCR Questions: DTPO vs. ScanSnap Manager

jwarthman · June 16, 2014, 9:12pm

Hi, newbie here!
I’m getting ready to scan & OCR a large number of documents. I have DevonThink Pro Office version 2.7.6 on an iMac running OS X 10.9.3.

I also purchased a Fuji ScanSnap iX500, and I’m running the ScanSnap Manager version 6.2 L11.

I see that both DTPO and ScanSnap Manager can perform the OCR task, so I did an experiment: I scanned 12 sheets and had ScanSnap Manager perform the OCR. I then disabled OCR in ScanSnap Manager, and enabled OCR in DTPO. I tried the Fast, Automatic, and Accurate settings.

I was surprised to see such a large difference in file size. The files created by DTPO were about 50% larger than the file created by ScanSnap Manager. I was using ScanSnap Manager’s default compression setting (3), and in DTPO I was using “Same as scan” for resolution, and 75% for quality.

But even more surprising was the difference in the time it took to process the 12 sheets. ScanSnap Manager completed the task, including the time to import to DTPO, in 1 min. 11 sec.

On the “Fastest” setting, DTPO took 7 min. 35 sec. to do the same thing! And on the “Accurate” setting, the time increased to 9 min. 9 sec.

So it appears that ScanSnap Manager is 6 - 8 times faster than DTPO, depending upon the DTPO setting chosen.
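For anyone who wants to check that arithmetic, here is a minimal sketch in Python. It uses only the three timings quoted above; the helper name `to_seconds` and the variable names are just illustrative, and it assumes the 12 sheets were the same in every run.

```python
def to_seconds(minutes, seconds):
    """Convert a 'min. sec.' stopwatch reading to total seconds."""
    return minutes * 60 + seconds

SHEETS = 12  # same 12 sheets in every run (assumption from the test described above)

# Timings reported in this post
scansnap = to_seconds(1, 11)       # ScanSnap Manager OCR, including import to DTPO
dtpo_fastest = to_seconds(7, 35)   # DTPO, fastest setting
dtpo_accurate = to_seconds(9, 9)   # DTPO, "Accurate" setting

# How many times longer DTPO took than ScanSnap Manager
ratio_fastest = dtpo_fastest / scansnap
ratio_accurate = dtpo_accurate / scansnap

print(f"Fastest:  {ratio_fastest:.1f}x slower, {dtpo_fastest / SHEETS:.1f} s per sheet")
print(f"Accurate: {ratio_accurate:.1f}x slower, {dtpo_accurate / SHEETS:.1f} s per sheet")
print(f"ScanSnap Manager: {scansnap / SHEETS:.1f} s per sheet")
```

That works out to roughly 6.4 times and 7.7 times, which is where the "6 - 8 times" range comes from.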

I have two questions. First, might I be doing something wrong in my test, causing this disparity?

And second, is something more going on with DTPO which would justify the added time? In other words, would I be better off, for some reason, having DTPO perform the OCR, despite the larger file sizes and much longer times?

Thanks for any insight you can provide!

Frederiko · June 18, 2014, 9:47am

My suggested strategy is to run the same document through both programs and then examine the resulting document for clarity and accuracy.

DT’s OCR engine seems to me to be very generalised (I have no personal knowledge of it beyond that it’s a version of the ABBYY OCR engine, one version behind what ABBYY currently sells). The accuracy, which is the most important thing, does seem to be pretty good.

For most people it will almost certainly make no difference, but if you are doing a lot of documents on a regular basis it may well be worthwhile to have a dedicated OCR programme like the ScanSnap software, the pro version of ABBYY, Prizmo Pro or Acrobat Pro. In almost all cases I would choose a standalone OCR engine over DT’s because it will almost certainly be more flexible, but that’s because I handle thousands of docs. Of all of these, Prizmo and PDFpenPro are the most scriptable, if that’s what you need. Personally I don’t use PDFpenPro for OCR, as it seems to use the same scripting engine as DT but the implementation seems much less stable and robust than DT’s.

I have also seen that DT’s OCRed documents can be very large in comparison to those from my favoured solution, which is Acrobat Pro. I have a suspicion this is down to DT using the Mac’s native Quartz engine to produce PDFs, which similarly produces large PDFs. Again, unless you have limited space and thousands of scanned docs, it makes no difference to the performance of DT.

Frederiko

I think DT is probably more careful about how it allocates background resources to the OCR process so this might explain the speed difference you observe.

Bill_DeVille · June 18, 2014, 3:55pm

Frederiko’s comment about how DEVONthink allocates background resources during OCR is correct. When DEVONthink Pro Office was introduced, many users had only 1 or 2 GB RAM installed on their Macs. So, instead of holding each temporary page of an image in RAM during OCR, the temporary page images are saved to disk. Slower, but helps prevent overall slowdown of the computer.

Of course, the ABBYY OCR in current use isn’t tuned for multicore processing. That also accounts for speed differences. We will be able to support multicore processing in the future.

The version of ABBYY FineReader OCR provided by Fujitsu with ScanSnap scanners will not perform OCR on images that were not produced by a ScanSnap scanner, but the OCR module built into DEVONthink Pro Office is not limited in this respect. ABBYY sells versions of FineReader for Mac that will OCR images from any source.

Revearti · August 25, 2014, 5:46am

I’m glad this thread exists because I was wondering about everything being discussed here.

Bill, I apologize, but I’m confused by your comment that ScanSnap will not perform OCR on images that were not produced by ScanSnap. Could you explain a little further? If you scan anything with a ScanSnap, it can perform OCR, so I don’t understand how it will not perform OCR.

brookter · August 25, 2014, 6:05am

It’s been some time since I’ve tried, but I think that the ABBYY OCR provided with the ScanSnaps checks to see which program created a PDF, and will only work with those created on a ScanSnap, not by, say, Adobe or DTP.

I’ve a vague recollection of trying to get round this, but I can’t remember whether it worked or not, and I do remember it being a bit of a faff to try.

Revearti · August 25, 2014, 2:58pm

Are you referring to someone trying to OCR a PDF that wasn’t scanned? If that’s the case, then I can understand this scenario.

Bill_DeVille · August 25, 2014, 3:38pm

To clarify my remark that the ‘free’ copy of FineReader OCR shipped with ScanSnap scanners won’t perform OCR on image-only PDFs that were not produced by the ScanSnap scanner:

Just so. Present that copy of FineReader with an image-only PDF that was produced by a Canon scanner, or with a camera image, and see what happens. No OCR.

I assume that was a business decision by ABBYY, which produces OCR software for sale. ABBYY probably receives a small licensing fee from Fujitsu for the OCR that’s included with ScanSnap Manager. But ABBYY also sells standalone OCR software and so restricted the software packaged with scanners to work only on images produced by that scanner. Otherwise, ABBYY’s sales of their primary product, OCR software, would be adversely affected.

Our license of the ABBYY OCR module in DEVONthink Pro Office allows OCR of images from any source. That allows users freedom to use different scanners, or process camera images. (But ask Eric how strictly ABBYY monitors the license agreement. That’s why, for example, there’s a small daily limit of pages that can be OCRed in the demo mode of DEVONthink Pro Office.)

Revearti · August 25, 2014, 7:13pm

Bill, thank you for clarifying.