OCR question

danzac · March 26, 2007, 1:21am

I am trying out DTPOffice. I have a big file with PDFs in it. Some are already native PDFs, some are image-only, and others are images whose first page contains text (like those from JSTOR), so they are registered as PDF+text.

My question is, if I highlight all of these, tell it to OCR, and leave it overnight to do its work, will it skip the native PDFs, or will it ruin/alter them?

Thanks.

milhouse · March 26, 2007, 4:22am

IIRC it creates a new file (PDF+text) and leaves the originals untouched. Also, if any PDFs are over 50 pages, the OCR will not work.

danzac · March 26, 2007, 12:29pm

milhouse,

If the OCR is applied to an already native PDF, will the resulting PDF be exactly the same, or will it attempt to somehow OCR the native PDF and give a PDF which is not as precise?

I can perhaps ask this another way and avoid this question. How did others, when switching to DTP Office, process all the PDFs that needed OCR? Is there a way to create a list of the ones that need OCR in DT, without checking each one individually?

Bill_DeVille · March 26, 2007, 4:27pm

Hi, Dan. I identified image-only PDFs that were already in my database by adding a Kind column (using View > Columns > Kind) to a History (Tools > History), which gives a ‘flat’ view of all the items in the database.

Just sort by Kind and scroll to the items that are PDF (image-only). Each can be opened and displayed using the Reveal (Command-R) command, to check whether running OCR on it would be useful. If so, with the Name field selected in the view, Command-Click and choose the option to OCR it. Of course, I didn’t run OCR on hand-written notes, diagram-only content and so on.

This results in a second copy of the PDF as a PDF+Text file, and the original can then be deleted.
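For anyone who would rather script the triage outside DEVONthink, here is a minimal Python sketch of the same idea. It assumes you have already obtained the number of extractable text characters on each page of a PDF with some PDF library of your choice; that extraction step is not shown. The function names and the `MIN_CHARS_PER_PAGE` threshold are hypothetical, not anything DEVONthink exposes. The "partial" label is meant for files like the JSTOR ones mentioned at the top of the thread, where only a cover page carries text and the file is nonetheless registered as PDF+text.

```python
from typing import List

# Hypothetical threshold: a page with fewer characters than this is treated
# as having no usable text layer. This is an assumption for illustration,
# not a DEVONthink setting.
MIN_CHARS_PER_PAGE = 20


def classify_pdf(chars_per_page: List[int],
                 min_chars: int = MIN_CHARS_PER_PAGE) -> str:
    """Classify one PDF from its per-page extractable character counts.

    Returns 'empty' (no pages), 'image-only' (no page has text),
    'text' (every page has text), or 'partial' (some pages have text).
    """
    if not chars_per_page:
        return "empty"
    pages_with_text = sum(1 for n in chars_per_page if n >= min_chars)
    if pages_with_text == 0:
        return "image-only"
    if pages_with_text == len(chars_per_page):
        return "text"
    return "partial"


def needs_ocr(chars_per_page: List[int],
              min_chars: int = MIN_CHARS_PER_PAGE) -> bool:
    """True when at least one page appears to lack a text layer."""
    return classify_pdf(chars_per_page, min_chars) in ("image-only", "partial")


if __name__ == "__main__":
    # A text cover page followed by scanned pages, as in the JSTOR case.
    print(classify_pdf([850, 0, 0, 3]))
    print(needs_ocr([900, 1200]))
```

Files that come back as "partial" are the ones a sort by Kind alone would probably not surface, so those are worth opening by hand before deciding whether to OCR them.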

danzac · March 26, 2007, 4:56pm

Bill, thanks for this. Just one question: JSTOR attaches a page to the beginning of their files. The page they add is already text, but the rest of the PDF may or may not be. These are registered as PDF+text. I have others from EBSCOhost/ATLA which are image PDFs, except for the header that indicates the article info. These are also registered as PDF+text.

A secondary question to this situation is: can anything negative happen in OCRing an already native PDF?

Bill_DeVille · March 26, 2007, 5:55pm

Dan, nothing negative will happen when a PDF is OCR’d, unless it’s been set up for pre-press with special resolution and color features of images.

The PDF will be rasterized again at 150 ppi, meaning images will be reduced to that resolution. That’s to maintain a good balance for viewing and printing the version sent to your database.
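To put numbers on what 150 ppi means, here is a small Python sketch of the arithmetic. The US Letter page size and the 300 ppi source resolution are assumptions chosen for illustration, and the function names are hypothetical.

```python
from typing import Tuple


def pixels_at_ppi(width_in: float, height_in: float, ppi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page of the given size (inches) at a given ppi."""
    return round(width_in * ppi), round(height_in * ppi)


def pixel_count_ratio(source_ppi: int, target_ppi: int = 150) -> float:
    """Fraction of the original pixel count left after resampling.

    Resolution scales in both directions, so the ratio is squared.
    """
    return (target_ppi / source_ppi) ** 2


if __name__ == "__main__":
    # Assumed example: a US Letter page (8.5 x 11 in) scanned at 300 ppi.
    print(pixels_at_ppi(8.5, 11, 300))   # (2550, 3300)
    print(pixels_at_ppi(8.5, 11, 150))   # (1275, 1650)
    print(pixel_count_ratio(300))        # 0.25
```

Under those assumptions, a 300 ppi scan keeps about a quarter of its pixels after OCR, which is usually fine for reading on screen but is the reason pre-press files are the exception.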

If you run OCR on an existing PDF in your database, a second copy will be created and you will probably wish to delete the original copy from the database.