OCR with DTPO

I am currently using DTP mainly to store the sources for my genealogy hobby. I am trying out the demo of DTPO mainly for its OCR capabilities. I have many PDFs with no text attached and I would like to use OCR to make them searchable and to add the ability to copy and paste text.

A couple of questions:

I don’t see a way to run OCR on a document except when importing. My PDFs are already in my database. Is there a way to run OCR on them from within the database?

OCR has done pretty well on the few test documents I have imported, but it has had trouble with some others (scans of older, discolored documents). Is there a way to edit the text that is attached to the PDF?

Thanks,

Hi, Karen:

[1] Yes, you can OCR a PDF that’s already captured in your database (DT Pro Office only). Select the Name of the PDF and Control-click (right click) on it. Choose the contextual menu option Convert > to Searchable PDF.

This will result in a new copy of the document with searchable text.

[2] The accuracy of OCR depends both on the scanning resolution (at least 300 dpi, better 600 dpi) and the quality of the original paper copy, including the type and size of fonts and the degree to which the copy has blemishes (discoloration, ‘markups’ such as underlining, highlighting, handwritten notes and the like, all of which tend to reduce accuracy).

[3] Correcting OCR errors in the PDF itself is usually not worth the trouble. Even Acrobat has very primitive editing capabilities, and the result is that not only the text but the image is modified. My attitude is that even if the OCR has errors the image layer itself is faithful to the original and so can be read without error. If there are some critical terms that I need for searching, I will usually add those as keywords in the Comment field of the PDF’s Info panel.

In cases where I need to excerpt text from a PDF, perhaps for a quote, and there are OCR errors, I usually use Data > Convert to make a text copy of the PDF and edit that text, using the PDF image layer as the guide to correct errors.

Thanks for the advice. I’ve been playing around and found that some images can be improved for OCR purposes with PS (removing any illustrations on the page, stray marks, lightening up the background, etc.), but the modified image itself becomes more difficult to read. In that case, it probably makes sense to use the original image, but link it to a text file generated with OCR from the modified image.

On a related question, I am thinking of upgrading from DT Pro to Pro Office, and was looking at the Fujitsu ScanSnap. If you buy it new you can get a $50 rebate and ReadIris, but I have found it refurbished (with a shorter warranty of 90 days) for only $200. Two questions:

  1. Is ReadIris such a good program that I would want it, since from Bill’s comment one can OCR in DT ProOffice w/o it?
  2. Should I lean towards the new one, which comes with a year-long warranty, in order to be safe?

The refurbished one can be found at scannersrefurbished.com.

Thanks for any advice.

I bought mine on eBay: brand new, still in the sealed box, and one generation older than the newest one. It came with all the software, and it’s great! (Except that Acrobat 7 seems to be having a private war with MS Word, but that’s a separate problem.)