Difficulty with OCR and PDFs

Bob_Sprague · July 21, 2007, 7:13pm

Importing an existing PDF (a downloaded journal article) into DTPO and processing it through OCR produces a PDF+Text file in DTPO. If I export this file using Export/Files and Folders, it produces a file that is sometimes unreadable by my PC friends and is sometimes unusable as an attachment in Bookends 10. Other PDFs created within the OS X environment do not have this problem with my PC friends or with Bookends. One recipient of a DTPO-exported PDF told me there were invisible characters in the title. Sometimes “laundering” the created PDF file through repeated “save as” or “printing to PDF” passes will rid the DTPO-created PDF of whatever is offending my friends’ PCs. Any thoughts?

I could send you a current file that was produced w/ DTPO & OCR and will not attach to a reference in Bookends, whereas the original will.

Bill_DeVille · July 22, 2007, 2:51am

Bob, you say “sometimes”. My first guess is that some of the PDFs were scanned at low resolution, or were scans of poor copy. I’ve downloaded some PDFs from a statistics journal that were very low resolution (less than 150 dpi) and could not be successfully OCRd – there were a great many errors. That’s because the OCR engine cannot reliably ‘read’ the poor quality text characters in the image, and makes guesses that are often wrong. Is that character Greek, Russian or English? The OCR engine really can’t tell sometimes with low-resolution images.

Several years ago I had a project that involved old microfilms. Capturing the film images as PDF or TIFF and running OCR produced varying results depending on the quality of the original images. Sometimes great OCR accuracy, sometimes dismal accuracy. If the image looked blurry or grainy on a microfilm reader, that was a good indicator that OCR would not work well.

Try this experiment with a PDF that doesn’t OCR well: Zoom in on it and see if the font gets blurry within a couple of enlargement ‘steps’. That would indicate a low scan resolution.
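If the pixel dimensions of the scanned page image are known, the resolution can also be estimated by arithmetic rather than by eye. Here is a minimal sketch in Python. The helper names `effective_dpi` and `likely_poor_ocr` are hypothetical, the default page width of 8.5 inches assumes a US Letter page, and the 150 dpi threshold is simply the figure mentioned above for the journal PDFs that would not OCR well, not a hard limit.

```python
def effective_dpi(pixel_width: int, page_width_inches: float = 8.5) -> float:
    """Estimate scan resolution from image width in pixels and physical page width."""
    if page_width_inches <= 0:
        raise ValueError("page width must be positive")
    return pixel_width / page_width_inches


def likely_poor_ocr(pixel_width: int,
                    page_width_inches: float = 8.5,
                    threshold_dpi: float = 150.0) -> bool:
    """Return True when the estimated resolution falls below the assumed threshold."""
    return effective_dpi(pixel_width, page_width_inches) < threshold_dpi
```

For example, a page image 1020 pixels wide on an 8.5 inch page works out to 120 dpi, which this sketch would flag as a likely poor OCR candidate.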

There’s really no magic difference between PC and Mac PDFs. The text layer of a PDF with lots of OCR errors will display errors on Macs as well as on PCs. Depending on the fonts on a particular machine – the platform makes no difference – it may look ‘weirder’ on one machine than on another.

Bob_Sprague · July 22, 2007, 5:07am

The problem is that the PDFs cannot be opened on a PC at all. They can’t even be read. The “IT” guy told me there were “invisible” characters in the title that made them unreadable. Sounded strange.

The issue w/ Bookends (BE) was resolved by the BE people.

Bill_DeVille · July 22, 2007, 1:41pm

Ah hah – I thought you were talking about the text layer in the PDF after OCR.

So the filename is seen as corrupt on some PCs? I’ve never seen that problem, but then I usually try to keep file names reasonably short. Some versions of Windows still have trouble with long filenames.

Can you tell us the file names of some PDFs that some PC users had trouble with? Did you name the files, or did they have the names assigned automatically by the OCR engine? I let DTPO name the PDFs with a date/time title and then change the document name in the database.
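One way to check a file name for the kind of “invisible” characters the IT person described is to list every character that falls in a Unicode control or format category, along with the characters Windows does not allow in file names. The following is a minimal sketch in Python, not anything DTPO itself does. The function names `suspicious_chars` and `clean_name` are hypothetical, and the choice to treat every Unicode “C” category character as suspicious is an assumption.

```python
import unicodedata

# Characters that Windows does not allow in file names.
WINDOWS_RESERVED = set('<>:"/\\|?*')


def suspicious_chars(name: str):
    """Return (index, code point, reason) for each character likely to cause trouble."""
    findings = []
    for index, ch in enumerate(name):
        # Unicode categories starting with "C" cover control, format,
        # unassigned, private-use and surrogate code points.
        if unicodedata.category(ch).startswith("C"):
            findings.append((index, f"U+{ord(ch):04X}", "invisible/control"))
        elif ch in WINDOWS_RESERVED:
            findings.append((index, f"U+{ord(ch):04X}", "reserved on Windows"))
    return findings


def clean_name(name: str) -> str:
    """Drop the characters that suspicious_chars would report."""
    return "".join(
        ch for ch in name
        if not unicodedata.category(ch).startswith("C")
        and ch not in WINDOWS_RESERVED
    )
```

Running `suspicious_chars` on the name of an exported PDF that a PC user could not open would show whether a hidden character is present and where it sits in the name.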

How did you export the PDFs from your database? By drag & drop or by File > Export > Files & Folders? The method makes a difference, because the export command uses the Name you gave the document when it creates the file name.