Scanning and OCR with V 1.5

garyburke · January 15, 2008, 3:31am

Since upgrading to 1.5 my import from scanner (OCR) has not been working properly. I get a message saying that the document was ‘scanned at 0dpi (should be at least 300dpi)’. This is despite the scanner being set to 300dpi.

To get around it I have to import the scan, export it, open it in Preview, save it, then import it as an OCR image. This works fine but takes about five times longer.

The problem only started after the upgrade. Is anyone else having this issue?

Bill_DeVille · January 15, 2008, 4:22am

A scanner setting of 300 dpi may not actually result in production of a PDF with a resolution of 300 dpi.

For example, the “Better” scan setting of the Fujitsu ScanSnap scans black & white copy at 300 dpi. But if color is detected in the copy, the effective resolution becomes 150 dpi. To ensure that pages containing color are scanned at 300 dpi, one should use the “Best” setting on the ScanSnap; black & white copy would then be scanned at 600 dpi.
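To make "effective resolution" concrete, here is a minimal Python sketch of the arithmetic. It is only an illustration, not how DT Pro Office or the ScanSnap software computes anything: the function name `effective_dpi` and the pixel counts are hypothetical, and it assumes the PDF page size is given in points (1/72 inch), which is the default unit for PDF page dimensions.

```python
POINTS_PER_INCH = 72  # default PDF user space unit is 1/72 inch


def effective_dpi(pixels: int, page_points: float) -> float:
    """Effective resolution along one axis of a scanned page.

    pixels      -- pixel count of the embedded scan image along that axis
    page_points -- page size along the same axis, in PDF points
    """
    inches = page_points / POINTS_PER_INCH
    return pixels / inches


# A US Letter page is 612 x 792 points, i.e. 8.5 x 11 inches.
print(effective_dpi(2550, 612))  # 300.0 (hypothetical 2550-pixel-wide scan)
print(effective_dpi(1275, 612))  # 150.0 (same page, half the pixels)
```

The same sheet of paper can therefore end up at 300 dpi or 150 dpi depending on how many pixels the scanner software actually stores, whatever the nominal setting says.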

I don’t know which scanner you are using, but if DT Pro Office “sees” your PDFs as having a resolution of less than 300 dpi, try setting a higher resolution for scans.

garyburke · January 15, 2008, 4:29am

Thanks Bill,
I’m curious about why it has suddenly become an issue. I’ve been scanning with this scanner and DTPO for almost 2 years, without issue.

Scanning at 600dpi gives the same problem.

And if I export the PDF from DTPO into Preview, save it, and reimport it as Image (OCR) it works fine. But the whole process is extremely slow.

I’d like to know if it is v1.5 or if there is something at this end I can attend to that has brought it on.

If I go back to v1.4.x will I still be able to open the database?
Cheers.

Bill_DeVille · January 15, 2008, 4:50am

Hi, Gary. The change in version 1.5 was the addition of evaluation of the actual resolution of the PDF prior to OCR.

Prior to that addition, DT Pro Office would proceed to run OCR on PDFs regardless of their resolution. But the accuracy of OCR drops, sometimes dramatically, at lower resolutions. The OCR module is designed to give satisfactory OCR accuracy for most standard fonts at an actual resolution of 300 dpi.

What that DT Pro Office message is telling you is that some of the PDFs produced by your scanner actually have a lower resolution than 300 dpi, in which case you have not been getting the intended level of OCR accuracy. Try setting your scanner at a higher dpi setting.

There’s an entirely different issue affecting PPC (but not Intel) computers. Annard has reported to IRIS an intermittent bug that results in extremely low-resolution PDFs that look water-soaked. I’m guessing that happens more frequently in interaction with Apple’s recent versions of Tiger and with Leopard. IRIS is working on that.

Re your multi-step procedure: What happens if, instead of exporting to Preview, saving and importing, you select the PDF in your database and choose Data > Convert > to Searchable PDF?

annard · January 15, 2008, 8:03am

Hello Gary,

The message is actually a warning (I was hoping it read as such) and allows you to continue OCR without a problem. Nothing has changed in the workflow; we just added this check to warn people of potentially bad results. The next maintenance release will not put up this warning message when an image doesn’t return a sane dpi value.
If for the moment you think you can’t live with this behaviour, use the Automator or the AppleScript interface. There, everything will always be accepted.
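
For anyone curious what such a check could look like, here is a rough Python sketch of the behaviour described above for the next maintenance release. It is not the actual DT Pro Office code: the function name `resolution_warning` is hypothetical, and treating only missing or non-positive values as "not sane" is an assumption made for illustration.

```python
from typing import Optional

MIN_OCR_DPI = 300  # threshold quoted in the warning message


def resolution_warning(dpi: Optional[float]) -> Optional[str]:
    """Return a warning string, or None if no warning should be shown."""
    # Assumption: a missing or non-positive value means the image did not
    # report a usable resolution, so there is nothing meaningful to warn about.
    if dpi is None or dpi <= 0:
        return None
    if dpi < MIN_OCR_DPI:
        return f"scanned at {dpi}dpi (should be at least {MIN_OCR_DPI}dpi)"
    return None


for value in (0, 150, 300):
    print(value, "->", resolution_warning(value))
```

With a check like this, a scan that reports 0 dpi would pass through silently, while a genuine 150 dpi scan would still trigger the warning.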

As to the garbled output mentioned by Bill, I have an example here that fails on my Intel machine. We are forwarding such examples to IRIS.