OCRing PDFs which are already imported?

rickybuchanan · July 6, 2008, 1:53pm

I’m still very much trying to figure this out, but this afternoon I imported a large directory containing more than 300 files. It all went fine, but I realised that a bunch of scanned PDFs were imported without going through the OCR process. This seems like a reasonable default, but I can’t figure out how to run them through OCR once they’re already imported without weird workarounds like finding the original file and re-importing it with the correct menu command. Since there are more than 50 PDFs in this state, I’d really like to NOT have to do them one by one!

Bill_DeVille · July 6, 2008, 2:46pm

It’s not necessary to re-import those PDFs. An image-only PDF in your database can be converted to searchable PDF when you select the PDF and choose Data > Convert > To searchable PDF.

How do you locate the candidates for conversion to searchable PDF? A workable approach is to add the “Kind” column to a view window in your database, using the command View > Columns > Kind. Documents displayed in that view window can then be sorted by Kind simply by clicking the header of that column. The Three Panes view makes it convenient to locate all the image-only PDFs in a group, then select and convert them.

An image-only PDF has the Kind “PDF”.

A searchable PDF has the Kind “PDF+Text”.

So a PDF can be selected, Data > Convert > To searchable PDF invoked, and OCR will be run on the selected PDF.

Note: Although multiple PDFs can be selected for conversion, the resulting searchable PDFs will be stored in the group in which the first one selected is located. To avoid disruption of an existing organizational scheme, one should perform PDF conversions group-by-group.
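For anyone who wants to plan the batches before clicking through the groups, the selection logic above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not a script that talks to the application: the `records` list, its file and group names, and the `batches_by_group` helper are all made up for the example. The only things taken from this thread are the two Kind values, “PDF” and “PDF+Text”.

```python
from collections import defaultdict


def batches_by_group(records):
    """Collect image-only PDFs into one batch per group.

    `records` is a hypothetical list of (name, kind, group) tuples,
    e.g. typed in by hand from a view sorted by the Kind column.
    """
    batches = defaultdict(list)
    for name, kind, group in records:
        # Kind "PDF" means image-only; "PDF+Text" is already searchable.
        if kind == "PDF":
            batches[group].append(name)
    return dict(batches)


# Made-up sample data, purely for illustration.
records = [
    ("invoice.pdf", "PDF", "Finance"),
    ("report.pdf", "PDF+Text", "Finance"),
    ("letter.pdf", "PDF", "Correspondence"),
    ("receipt.pdf", "PDF", "Finance"),
]

for group, names in batches_by_group(records).items():
    print(f"{group}: {names}")
```

Each batch corresponds to one round of selecting the listed PDFs within a single group and converting them, so the converted files stay in the group where they started.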

rickybuchanan · July 21, 2008, 2:12am

That’s really helpful, thanks. I was confused about that command because it seems like running it over an existing PDF doesn’t make it highlightable the way importing does, but that may have been a function of something independent going on.

Warmest Regards,
Ricky