Which OCR should I use, ScanSnap's or DevonThink's?

Right now when I scan in a document using my ScanSnap, the ScanSnap software will run its own OCR on the document first, and then it will save it to DevonThink, which will then run its own OCR software.

Obviously, I don’t need to do both. Which one should I turn off?

Which one gives you the best quality and accuracy? Keep that and turn off the other. If there is no difference, do you have a coin? :wink:

To check accuracy, convert an OCR’d PDF to rich text and see how well the OCR turned out.
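If you would rather put a number on it than eyeball it, here is a minimal sketch. It assumes you have already exported the text from each OCR'd PDF and typed or corrected a short reference passage by hand. The function name `ocr_similarity` and the sample strings are made up for illustration; they are not output from either product.

```python
import difflib


def ocr_similarity(reference: str, ocr_text: str) -> float:
    """Return a 0..1 word-level similarity between a reference text and OCR output."""
    # Split on whitespace so differences in line wrapping do not count as errors.
    ref_words = reference.split()
    ocr_words = ocr_text.split()
    return difflib.SequenceMatcher(None, ref_words, ocr_words).ratio()


# Hypothetical sample data: a hand-checked reference and two invented OCR results.
reference = "The quick brown fox jumps over the lazy dog"
sample_a = "The quick brown fox jumps over the lazy dog"
sample_b = "The quick brovvn fox jumps ovcr the lazy dog"

score_a = ocr_similarity(reference, sample_a)
score_b = ocr_similarity(reference, sample_b)
print(f"sample A: {score_a:.2f}, sample B: {score_b:.2f}")
```

A higher score means the OCR text is closer to the reference. Comparing word by word keeps the score easy to interpret, but it treats a word with one wrong letter the same as a completely wrong word.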

You may also want to check the file size and image quality of the generated files. The OCR process of DEVONthink Pro Office insists on recompressing the scanned image using the JPEG algorithm, which is actually quite a bad choice for black-and-white documents. You can select the target resolution and quality of the JPEG process, which allows you to trade quality against size, but unfortunately you can’t turn this off altogether.

So in case your scanner outputs OCR’d PDF documents with CCITT Group 4 or even JBIG2 compression, you might want to steer clear of DTPO’s bundled OCR.
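To see which compression a given PDF actually uses, a rough check is to look for the standard PDF filter names in the raw file. The sketch below is a heuristic under that assumption, not a real PDF parser, so it may miscount in unusual files. The function name `count_image_filters` and the sample bytes are hypothetical.

```python
import re
from collections import Counter

# Standard PDF stream filter names and the image compression each one denotes.
FILTERS = {
    b"DCTDecode": "JPEG",
    b"CCITTFaxDecode": "CCITT fax (Group 3 or 4)",
    b"JBIG2Decode": "JBIG2",
    b"JPXDecode": "JPEG 2000",
}


def count_image_filters(pdf_bytes: bytes) -> Counter:
    """Count occurrences of each known image filter name in raw PDF bytes."""
    counts = Counter()
    for name, label in FILTERS.items():
        # Filter names appear as PDF name objects, e.g. "/DCTDecode".
        hits = len(re.findall(rb"/" + re.escape(name) + rb"\b", pdf_bytes))
        if hits:
            counts[label] = hits
    return counts


# Hypothetical stand-in for file contents; for a real file you would use
# something like: pdf_bytes = open("scan.pdf", "rb").read()
sample = (
    b"<< /Type /XObject /Subtype /Image /Filter /DCTDecode >> "
    b"<< /Type /XObject /Subtype /Image /Filter [/CCITTFaxDecode] >>"
)
result = count_image_filters(sample)
print(dict(result))
```

Running this on the same scan before and after DTPO's OCR should show whether a CCITT or JBIG2 image has been replaced by a JPEG one. Comparing the two file sizes alongside it gives you the size side of the trade-off.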

My experience is that the ScanSnap OCR is WAY better. DevonThink produces huge files for no good reason. I agree it is because they reprocess the image.

…which is sad, because with OCR turned on, the ScanSnap sometimes pauses while feeding documents. But it is still worth it.

Really unacceptable for Devon to produce a document management/storage solution that uses such a crappy OCR engine. That is on them…

You’re aware that DT uses ABBYY’s engine? And that at least the ScanSnap 1600 also uses ABBYY’s engine?


Right, it’s not the OCR engine itself that is “crappy”, but the way the PDF is reassembled at the moment when the text layer is being attached. Here we’d want the software to give us the option to preserve the images 1:1, at the very very least in all cases of CCITT Group 4 or JBIG2 compression.

Both of those are monochrome formats, i.e. black and white.