Making existing scanned documents searchable

The manual for DTP Office says, “Of course, you can also make your already existing scanned documents searchable.”

How is this done?


Obviously, they have to end up in a format other than TIFF or JPEG, because a plain image contains no text to search. Producing that text is what OCR does.

Either use your own OCR software or get DT Pro Office, which has OCR functionality built in.

To OCR images that are stored in the Finder, select (in DTPO) File > Import > Images (With OCR).

Okay, I understand. I have many PDF documents in DTP already, but they are not in the Finder. I’ve just converted to DTPO. To make these PDF documents searchable, I guess I have to export them back to the Finder and then import them with OCR. Correct? If there is an easier way, let me know.


Is the PDF a PDF with words, or is it a PDF file which is really an image?

I don’t know what specifically you’re starting with. Most PDFs, when imported into DEVONthink, are immediately searchable – DEVONthink does that. If, however, you have PDF files that were created from scans, and all they contain is page images with no text layer, you’ll need to do the OCR work.
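If you want a rough way to tell the two kinds of PDF apart outside DEVONthink, here is a minimal Python sketch. It is only a heuristic and not anything DEVONthink itself does: it assumes a PDF with real text declares at least one /Font resource, while a scan-only PDF declares none. The function name looks_image_only is made up for this example.

```python
import re
import zlib


def looks_image_only(pdf_bytes: bytes) -> bool:
    """Heuristic: True if no /Font resource is found anywhere in the file."""
    # Uncompressed page dictionaries list fonts in plain sight.
    if b"/Font" in pdf_bytes:
        return False
    # Newer PDFs may keep dictionaries inside Flate-compressed streams,
    # so try to inflate every stream and look again.
    for match in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL):
        try:
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data, e.g. a JPEG page image
        if b"/Font" in data:
            return False
    return True
```

You would call it as looks_image_only(open("book.pdf", "rb").read()), where book.pdf stands in for your own file. Note that this is a whole-file check: a PDF in which only one page carries text would still be reported as having text, so it cannot tell you which pages still need OCR. Encrypted files or unusual stream filters can also fool it.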

Many of my PDFs are not searchable. For example, scanned books downloaded from Google Books or Gallica: some are searchable, but many are not. They are searchable on the net, but not on my computer.

I have found that importing PDFs from Google Books into DTPO does not give me a searchable file. DTPO only does OCR on the standard Google description page, found at the beginning of every Google Books file. It does not continue the OCR on subsequent pages.

Why is this, and is there a workaround?


To be perfectly clear: I have attempted to OCR the PDF. It has worked well on other PDFs. It only works on the first page of Google PDFs, viz., the standard descriptive page from Google. The pages that follow are all ignored.

That is the behavior I am attempting to understand.

Last July I downloaded The Federalist from Google Books, and in January of this year decided to try OCR on it. As it was already in a database, I selected the document and chose Data > Convert > to searchable PDF.

As this is a 521-page book, it took quite a while to OCR; I left it running overnight.

OCR was successful, and with much better accuracy than I expected. This copy was published in 1901, so it didn’t have some of the tricky typographical features that the original publication has. (I’ve scanned some pages of a book published in 1786, and there were numerous conversion errors resulting from type conventions then in common use, especially substitutions for “s”.)

I haven’t OCRed other Google Books. Obviously, if the scan resolution was too low, OCR would fail.