OCR Question

Good evening from Antwerp. Has any DT user tried OmniPage Pro X for Mac as an alternative to the built-in IRIS or the stripped-down version of FineReader that comes with the ScanSnap S510M? It is the most expensive OCR application on the market, and I am wondering whether it is worth the money.

I’ve used an earlier version of OmniPage Pro and wasn’t very impressed by its accuracy. The company claims a significant increase in accuracy in the current version, which I’ve not used.

If you simply wish to convert scanned image-only PDFs into searchable PDFs and you don’t need Hebrew, Asiatic or other non-European languages, it might be hard to justify the price. OmniPage Pro does other things, such as converting a PDF to a Word document while capturing most of the formatting, including tables. So do some other programs, including, as I remember, PDFPen Pro. If you repeatedly scan a set of forms using the same fonts and layout, OmniPage Pro has a training feature; so does the corporate version of ReadIRIS. As I scan a wide variety of documents from different sources, I’ve found training to make little or no difference. OmniPage Pro also offers some rudimentary editing of the recognition results before the PDF is saved; so does the full version of ReadIRIS. I found that too time-consuming to be worth the effort for my purposes, but others may have different needs.

The version of ABBYY FineReader that comes with the Fujitsu ScanSnap is primarily intended, like the IRIS OCR engine currently used by DT Pro Office, to quickly convert scanned PDFs to searchable PDFs. Back in Classic Mac OS I considered FineReader the best OCR application then available for Mac. FineReader’s OCR engine for OS X works only with Intel Macs.

Don’t expect 100% accuracy from any OCR application, although that can be approached at times. I’ve scanned some long typed court records into DT Pro Office with no errors at all. But often copy has blemishes, unusual fonts or very small fonts and some errors are likely. I’m usually satisfied if the OCR accuracy is good enough to result in turning up needed documents in search lists. The beauty of the PDF format is that the image layer is the final arbiter; it is a faithful image of the original, and that’s what I see when I read or print the document.
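For anyone who wants to put a rough number on "good enough", here is a minimal sketch that compares the text an OCR engine produced against a hand-checked transcription of the same page. The function name `ocr_similarity` is hypothetical, and the score is a similarity ratio from Python's standard `difflib` module, not a formal character error rate, so treat it only as a way to compare engines on the same page.

```python
from difflib import SequenceMatcher


def ocr_similarity(reference: str, ocr_text: str) -> float:
    """Return a 0..1 similarity ratio between a hand-checked
    reference text and the text layer produced by OCR."""
    # Collapse runs of whitespace so that line-break differences
    # between the two texts are not counted as recognition errors.
    ref = " ".join(reference.split())
    out = " ".join(ocr_text.split())
    # autojunk=False keeps difflib from ignoring very frequent
    # characters in long texts.
    return SequenceMatcher(None, ref, out, autojunk=False).ratio()


if __name__ == "__main__":
    truth = "The court finds for the plaintiff."
    scanned = "The c0urt finds for the plaintifi."
    print(f"similarity: {ocr_similarity(truth, scanned):.3f}")
```

Running the same reference page through two OCR applications and comparing the two scores gives a crude but repeatable comparison.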

For years one of the top things on my wish list has been an application that would let me correct OCR errors in a PDF without altering the image layer. I’m still waiting. :slight_smile:

OmniPage Pro X is the most accurate OCR solution I’ve tried for the Mac, but having said that, there isn’t any great difference between it and the OCR offered within DT or by FineReader etc. I’d guess only a few percent. Like Bill, I’m mostly interested in making the PDF searchable, and the vast majority of my material is in English, so a margin of error is allowable for my purposes and I’m happy to let it fudge the odd quotation that it finds in an ancient or non-European script. Also, using OPX threw up some irregularities and deficiencies:

Preview did not like the text layer created by OPX. For some reason, saving the PDF from Preview created spaces within the words in the text layer. The same doesn’t happen with PDF files OCR’d within DT (Acrobat dealt with the OPX files okay, though).

I also found OmniPage Pro X to be considerably slower than other software - in the time it took OPX to read and save 200 pages I can import them to DT, OCR, export the pages, combine and compress them (mostly via Automator) and fill in the table of contents. The OCR in DT was also less processor hungry.

In my experience the biggest difference to accuracy is to be gained before the OCR takes place, by scanning at a decent resolution (300dpi+), with the correct settings (some software prefers monochrome, some grayscale), and performing the OCR on these files before any compression takes place - once the text layer is in place you can do what you like with the image quality.
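As a quick sanity check on the 300 dpi advice, here is a small sketch of the arithmetic relating page size, pixel dimensions and resolution. The function names are hypothetical and the page sizes are just examples; it assumes you know the physical width of the original in inches.

```python
def pixels_needed(width_in: float, height_in: float, dpi: int = 300) -> tuple[int, int]:
    """Pixel dimensions a scan needs to reach the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(pixel_width: int, page_width_in: float) -> float:
    """Resolution implied by a scan's pixel width and the page's physical width."""
    return pixel_width / page_width_in


def meets_minimum(pixel_width: int, page_width_in: float, min_dpi: int = 300) -> bool:
    """True if the scan is at or above the minimum resolution."""
    return effective_dpi(pixel_width, page_width_in) >= min_dpi


if __name__ == "__main__":
    # US Letter is 8.5 x 11 inches.
    print(pixels_needed(8.5, 11))       # pixel size required at 300 dpi
    print(effective_dpi(1700, 8.5))     # resolution of a 1700-pixel-wide Letter scan
    print(meets_minimum(1700, 8.5))     # whether that scan reaches 300 dpi
```

A scan that falls short of the minimum is usually better rescanned than upsampled, since upsampling adds pixels but no detail for the OCR engine to work with.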

Hope that helps

  • Bill, I’m also waiting for an application to edit the text layer of a pdf - if you ever find one let me know :slight_smile: