Stubborn OCR

DarumaBlue · October 10, 2009, 8:02pm

So I’m adding PDFs to DevonThink, and usually I don’t need to scan them to highlight or otherwise select the text.

However, I’ve had trouble with a few PDFs lately. When added to DevonThink, they are already classified as “PDF + Text” files, and yet the only part of the article that acts like it has a text layer is the front page of the PDF, with the electronic database’s watermark and information. Everything else behaves like a regular image-only PDF.

Bill_DeVille · October 10, 2009, 10:53pm

Some journals and news archives supply PDFs like that, with only a header section as PDF+Text.

You can try a conversion to searchable PDF (Data > Convert > to searchable PDF). The results may vary depending on the resolution and quality of the PDF image. I’ve had a couple like that and got reasonably good OCR accuracy.

DarumaBlue · October 11, 2009, 1:50am

Sorry - I didn’t clarify. It’s been a long day.

When I try to “Convert to Searchable PDF”, nothing happens. DT doesn’t seem to respond in the usual way, which is to pop up the work queue.

FiMi · October 11, 2009, 8:39am

Sorry for the stupid question: are you sure nothing happens?

From time to time DTPO fails to pop up the OCR Activity window automatically, so I have to show it manually by choosing the corresponding menu item.

Since the OCR process can take quite a while, in such a case it looks as if nothing is happening.

korm · October 11, 2009, 9:06am

The OCR Activity window does not open if it was previously closed while it was displaying the status of an active scan. To restore the normal behavior, in which that window opens automatically when you do a scan, start an OCR scan, select Window > OCR Activity, let the scan finish, then close the activity window. (The OCR Activity window does not behave like the Log window, which always opens when DT has something to report in the log.)

DarumaBlue · October 11, 2009, 6:12pm

I’ve tried again, and both the OCR and general Activity logs show no activity at all. With the PDF displayed in the main display area, clicking on “Convert to Searchable PDF” produces no reaction at all. Is it because DT already recognizes it as a “converted” PDF, even though most of the contents are not selectable, highlightable, etc.?

Bill_DeVille · October 11, 2009, 8:22pm

Please send the PDF as an attachment in a message to Support, mentioning that you are unable to run Data > Convert > to searchable PDF on it.

dialektik · November 20, 2009, 3:47pm

Is there some kind of general OCR problem with the current version of DTPO (2.0pb7)? I am also having trouble OCR-ing PDFs. I click on Convert > to Searchable PDF and nothing happens. When I open the OCR Activity window, it’s empty and there is no activity. Any ideas what might be going on?

annard · November 20, 2009, 4:05pm

Check the Console (in Applications > Utilities) for any messages. Normally an error should be logged in the Log tool.

My guess is that in the past you indicated that you didn’t want to see the warning for files that have been converted already.

dialektik · November 20, 2009, 4:20pm

Thanks for the quick reply. I checked the console, and I can’t find an entry.

BTW, the file is not already converted. It’s a Google Book PDF, which is an image file where only the first page is PDF+Text.

annard · November 20, 2009, 5:12pm

As usual, the best thing is to send this document to support@devon-technologies.com with a reference to this thread, and then we can check it here. Thanks!