I’ve been getting really good OCR accuracy from my ScanSnap and DTPO.
The ScanSnap resolution settings are unlike those of many scanners. They are (in order of increasing resolution) Normal (Fastest), Better (Faster), Best (Slow) and Excellent (Slow). As I recall, the lowest setting gives 200 dpi (dots per inch) in black & white scans, but only 100 dpi in color scans. ScanSnap users need to remember that in the automatic color setting, the effective resolution of color imaging is half that of black & white imaging.
Most of my scans are done with paper copy that contains no color images and with the Better (Faster) setting, which results in 300 dpi resolution for B&W copy. If there’s no small type or markup on the pages, OCR accuracy is excellent.
For critical work, which might also include an occasional color image and perhaps small type, I’ll switch to the Best (Slow) resolution. Clean, unmarked copy usually comes through at 100% accuracy. I’ve copied a number of long documents at this setting without a single error, even for footnotes in small type. This setting corresponds, as I recall, to 600 dpi for B&W or 300 dpi for color.
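For anyone who wants those settings in one place, here is a rough sketch in Python. The dpi figures are only the ones recalled above, not checked against Fujitsu's documentation, and the color values are derived from the "color is half of B&W" rule rather than measured. Excellent (Slow) is left out because no figure is given for it. The names `SCANSNAP_BW_DPI` and `dpi_for` are made up for illustration.

```python
# B&W resolutions as recalled in this post (assumed, not official figures).
SCANSNAP_BW_DPI = {
    "Normal (Fastest)": 200,
    "Better (Faster)": 300,
    "Best (Slow)": 600,
}


def dpi_for(setting: str, color: bool = False) -> int:
    """Return the approximate dpi for a setting.

    Color scans are assumed to run at half the B&W resolution,
    following the rule of thumb described in the post.
    """
    bw = SCANSNAP_BW_DPI[setting]
    return bw // 2 if color else bw


if __name__ == "__main__":
    for name in SCANSNAP_BW_DPI:
        print(name, "B&W:", dpi_for(name), "color:", dpi_for(name, color=True))
```

So a page with a color image scanned at Best (Slow) would, on these assumptions, come through at about the same resolution as a B&W page scanned at Better (Faster).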
Things that are likely to cause OCR errors, even at high resolution, are: text in or close to images; handwritten or accidental marks or blemishes in the copy; dark highlighting; weird fonts; and poor-quality fax or photocopy originals. On some scanners black/white balance or contrast can be adjusted, and I would expect improper settings there to cause OCR accuracy problems.
Over the years, from the Classic operating system to the present, I’ve owned and used every OCR application available for the Mac. For OS 9 the best one was ABBYY FineReader. For OS X I rate ReadIRIS 11 as the best available, and that’s the engine in DTPO’s OCR module.
Believe me, OCR accuracy has come a long way over the past few years. Back in the early 1990s it was pretty bad.
Tip: for critical work one can select an OCR’d PDF, click in the body of the document and choose Data > Convert > Rich Text. This will create a rich text version of the PDF content. Then choose Edit > Spelling > Check Spelling to step through words the OCR may have misread.
Example: I OCR’d with DTPO a 124-page court hearing record for an attorney. (This had to be done in three segments, then merged, because of the 50-page limit.) The original was a fax from a photocopy, and some pages were slightly ‘tilted’. The OCR was probably helped by the fact that the text body was all upper case. There were only two obvious ‘glitches’ in the converted text. On the first page, the court reporter’s stamp resulted in garbage characters. On the last page, a signature resulted in garbage characters. But the body of the hearing report was flawlessly converted by OCR. I had used the next-to-lowest resolution setting, 300 dpi for B&W copy.
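The arithmetic behind the three segments is simple enough to sketch in Python. This is only one way to divide the pages, assuming the limit is a flat 50 pages per batch. The post doesn't say where the actual breaks were made, and `split_pages` is a hypothetical helper, not anything in DTPO.

```python
def split_pages(total_pages: int, limit: int = 50) -> list[tuple[int, int]]:
    """Return 1-based (first, last) page ranges of at most `limit` pages each."""
    if total_pages < 0 or limit < 1:
        raise ValueError("total_pages must be >= 0 and limit must be >= 1")
    return [
        (start, min(start + limit - 1, total_pages))
        for start in range(1, total_pages + 1, limit)
    ]


if __name__ == "__main__":
    # The 124-page hearing record with a 50-page limit.
    print(split_pages(124))  # [(1, 50), (51, 100), (101, 124)]
```

Two full batches of 50 plus a final batch of 24 pages gives the three segments that then have to be merged.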
On the ScanSnap I saw no appreciable difference between the highest and next-to-highest resolution settings. But the highest setting resulted in much slower scans, and Fujitsu recommended that no more than 10 pages at a time be fed to the sheet feeder, because of memory usage at the highest resolution.
So if you are getting errors at 600 dpi, I’m skeptical they would all go away at a higher resolution. The OCR engine is probably having difficulty with some of the issues mentioned above.
It’s possible to make minor text corrections with Acrobat Professional, although it’s very clumsy. PDFPen Pro is somewhat better for that purpose. The developers of Papyrus 12 note that a future version will be able to read standard PDF files (in addition to their own hybrid PDF file type). I hope so.
PDF+Text documents with OCR glitches do retain, of course, the original image of the text, so they can be read and printed without displaying the glitches. But OCR errors are not desirable for searching and analyzing a document in the database, since those operations work on the recognized text rather than the image.