OCR accuracy of DTPO

LE1 · December 1, 2006, 5:57pm

Experimenting with the new import/OCR functionality: I couldn’t get my Epson 3200 to work with DTPO. But after scanning in VueScan I was able to import the individual JPG files and get the OCR’d versions at a good pace. To see the OCR results I converted these imported images into plain text, and the result didn’t seem much more accurate than the mediocre results I’d seen when experimenting with the stand-alone IRIS software several years ago. Having listened to a radio interview with IRIS’s publisher, who claimed major recent strides in accuracy, I was expecting better.

Does it help to increase scan resolutions beyond the recommended 600 dpi?

Bill_DeVille · December 1, 2006, 8:12pm

I’ve been getting really good OCR accuracy from my ScanSnap and DTPO.

The ScanSnap resolution settings are unlike those of many scanners. They are (in order of increasing resolution) Normal (Fastest), Better (Faster), Best (Slow) and Excellent (Slow). As I recall, the lowest setting will result in 200 dpi (dots per inch) resolution in black & white scans, but only 100 dpi resolution in color scans. ScanSnap users need to remember that in the automatic color resolution setting, the effective resolution of color imaging is half that of black & white imaging.

Most of my scans are done with paper copy that contains no color images and with the Better (Faster) setting, which results in 300 dpi resolution for B&W copy. If there’s no small type or markup on the pages, OCR accuracy is excellent.

For critical work, which might also include an occasional color image and perhaps small type, I’ll switch to the Best (Slow) resolution. Clean, unmarked copy usually comes through at 100% accuracy. I’ve copied a number of long documents at this setting without a single error, even for footnotes in small type. This setting corresponds, as I recall, to 600 dpi B&W or 300 dpi for color.
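To keep the settings straight, here is a minimal Python sketch of the mapping described above. The B&W figures are the ones recalled in this post; the value for Excellent (Slow) is not given, so it is left unknown. The color figure for Better (Faster) is only inferred from the "color is half of B&W" rule, and the names `BW_DPI` and `effective_dpi` are made up for illustration.

```python
from typing import Optional

# B&W resolutions as recalled in the post. "Excellent (Slow)" is not
# stated there, so it is deliberately left as None rather than guessed.
BW_DPI = {
    "Normal (Fastest)": 200,
    "Better (Faster)": 300,
    "Best (Slow)": 600,
    "Excellent (Slow)": None,
}


def effective_dpi(setting: str, color: bool) -> Optional[int]:
    """Approximate dpi for a ScanSnap quality setting.

    Assumes the rule of thumb from the post: color scans get half
    the B&W resolution of the same setting.
    """
    bw = BW_DPI[setting]
    if bw is None:
        return None  # resolution not known from the post
    return bw // 2 if color else bw


if __name__ == "__main__":
    for name in BW_DPI:
        print(name, effective_dpi(name, color=False), effective_dpi(name, color=True))
```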

Things that are likely to cause OCR errors, even at high resolution, are: text in or close to images, any handwritten or accidental marks or blemishes in the copy, dark highlighting, weird fonts, or poor-quality fax or photocopy feed. On some scanners black/white balance or contrast can be adjusted. I would expect that improper settings might cause OCR accuracy problems.

Over the years, from the Classic operating system to the present, I’ve owned and used every OCR application available for Mac. For OS 9 the best one was ABBYY FineReader. For OS X I rate ReadIRIS 11 as the best available, and that’s the engine in DTPO’s OCR module.

Believe me, OCR accuracy has come a long way over the past few years. Back in the early 1990s it was pretty bad.

Tip: for critical work one can select an OCR’d PDF, click in the body of the document and choose Data > Convert > Rich Text. This will create a rich text version of the PDF content. Then choose Edit > Spelling > Check Spelling.

Example: I OCR’d with DTPO a 124-page court hearing record for an attorney. (This had to be done in three segments, then merged, because of the 50-page limit.) The original was a fax from a photocopy, and some pages were slightly ‘tilted’. The OCR was probably helped by the fact that the text body was all upper case. There were only two obvious ‘glitches’ in the converted text. On the first page, the court reporter’s stamp resulted in garbage characters. On the last page, a signature resulted in garbage characters. But the body of the hearing report was flawlessly converted by OCR. I had used the next-to-lowest resolution setting, 300 dpi for B&W copy.
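The segment arithmetic for a page limit like this is easy to sketch. The Python below is only an illustration of how a 124-page document falls into segments of at most 50 pages; the function name `page_batches` is hypothetical and has nothing to do with DTPO itself.

```python
def page_batches(total_pages, limit=50):
    """Split pages 1..total_pages into (first, last) ranges of at most `limit` pages."""
    return [
        (start, min(start + limit - 1, total_pages))
        for start in range(1, total_pages + 1, limit)
    ]


if __name__ == "__main__":
    print(page_batches(124))  # [(1, 50), (51, 100), (101, 124)]
```

The same helper with `limit=10` gives the feed batches for a sheet feeder that should only take 10 pages at a time.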

On the ScanSnap I saw no appreciable difference between the highest and next-to-highest resolution settings. But the highest setting resulted in much slower scans, and Fujitsu recommended that no more than 10 pages at a time be fed to the sheet feeder, because of memory usage at the highest resolution.

So if you are getting errors at 600 dpi, I’m skeptical they would all go away at a higher resolution. The OCR engine is probably having difficulty with some of the issues mentioned above.

It’s possible to make minor text corrections with Acrobat Professional, although it’s very clumsy. PDFPen Pro is somewhat better for that purpose. The developers of Papyrus 12 note that a future version will be able to read standard PDF files (in addition to their own hybrid PDF file type). I hope so.

A PDF+Text document with OCR glitches does, of course, retain the original image of the text, so it can be read and printed without displaying the glitches. But OCR errors are undesirable when searching and analyzing a document in the database.

LE1 · December 2, 2006, 2:33am

Thanks for the info, Bill; it is a useful reference point. My test was on an unmarked photocopy from a main-line monograph publisher. Perhaps I’ll do some more experimenting and comparison with VueScan’s built-in OCR and post a result here. One wonders whether the ScanSnap is the key …

[quote=“Bill_DeVille”]
I’ve been getting really good OCR accuracy from my ScanSnap and DTPO.

…

Example: I OCR’d with DTPO a 124-page court hearing record for an attorney. (This had to be done in three segments, then merged, because of the 50-page limit.) The original was a fax from a photocopy, and some pages were slightly ‘tilted’. The OCR was probably helped by the fact that the text body was all upper case. There were only two obvious ‘glitches’ in the converted text. On the first page, the court reporter’s stamp resulted in garbage characters. On the last page, a signature resulted in garbage characters. But the body of the hearing report was flawlessly converted by OCR. I had used the next-to-lowest resolution setting, 300 dpi for B&W copy.
[/quote]

My result was nowhere near this, but it does inspire me to do more testing.

Bill_DeVille · December 2, 2006, 4:24am

I ran a test page at the same resolution settings on my two scanners, a ScanSnap and a CanoScan LiDE 500F.

The test page was a policy document from my bank, with lots of fine print.

OCR accuracy was great from the ScanSnap, not quite as good from the CanoScan. So stated resolutions may not tell the whole story for OCR accuracy. The CanoScan image options (brightness, etc.) were at their defaults. Perhaps adjustments might have improved OCR accuracy.
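For anyone who wants to put a number on a comparison like this, one rough approach is to type in (or otherwise obtain) a known-correct copy of the test page and compare each scanner's OCR text against it. The sketch below uses Python's standard `difflib`; it gives a similarity ratio, not a formal character error rate, and the function name and sample strings are hypothetical, not results from my test.

```python
from difflib import SequenceMatcher


def ocr_similarity(reference: str, ocr_text: str) -> float:
    """Similarity ratio in [0, 1] between a known-correct text and OCR output."""
    # Collapse whitespace so differences in line wrapping are not counted as errors.
    ref = " ".join(reference.split())
    ocr = " ".join(ocr_text.split())
    return SequenceMatcher(None, ref, ocr, autojunk=False).ratio()


if __name__ == "__main__":
    # Made-up sample strings, for illustration only.
    reference = "The bank may change these terms at any time."
    scan_a = "The bank may change these terms at any time."
    scan_b = "The bank rnay change these terrns at any tirne."
    print("scan A:", ocr_similarity(reference, scan_a))
    print("scan B:", ocr_similarity(reference, scan_b))
```

Running the same page through both scanners and comparing the two ratios would show which one is ahead, even when both look "pretty good" by eye.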

But I need both. As the ScanSnap uses a paper feeder, it’s not good for scanning from books or magazines. That’s a job for the flatbed CanoScan, which also does a better job with photos.

lyle_eslinger · December 3, 2006, 2:58pm

Thanks Bill.

The majority of my scanning needs are books or journal articles. Usually I start by photocopying these in landscape, two of the book/journal pages per page.

I’ve had no trouble producing PDFs this way; can DTPO’s OCR engine handle two-column landscape-oriented pages?

Your experienced posts on scanning and OCR are most useful.