Is there a way to force OCR on a PDF+TEXT file?

msteuernagel · March 28, 2011, 2:48pm

I get some PDFs from online sources such as JSTOR in which the text itself is not searchable, except for extra-textual elements (cover letter, copyright notice). When I import them into DevonThink, they appear as PDF+TEXT, and I am unable to OCR them. Is there a way to force OCR on a PDF that is already searchable? Or of somehow stripping the PDF of its text layer, so that DevonThink can recognize it as an image-only PDF? The only workaround that I found was saving the file as TIFF, reimporting, and running OCR, but then I end up with less resolution and a huge file.

Thanks.

Greg_Jones · March 28, 2011, 3:08pm

Have you tried right-clicking on the documents and selecting ‘Convert>To Searchable PDF’?

msteuernagel · March 28, 2011, 3:27pm

Yes, I did, and nothing happens. I have the impression that in some previous version DT would warn me about OCRing again and then go ahead and do it, but with the latest version nothing happens, and the OCR Activity window remains blank.

Greg_Jones · March 28, 2011, 4:33pm

Still works here for me, using DTPO 2.0.9. Does OCR still work on a straight PDF (no text) document or an image file?

msteuernagel · March 28, 2011, 9:02pm

Yes, all images and non-searchable PDFs work fine. That’s interesting. So there’s probably some kind of bug. Nothing shows up in the log either. And I’m also on 2.0.9. Any ideas of things I should try to troubleshoot this?

Trillium · March 28, 2011, 10:27pm

It sounds to me as if you may need to “rasterize” them: basically, render out each page as a graphic, and then re-import the graphic pages into an OCR program. Make sure they are big enough. If the resolution isn’t fairly high, you may get text errors, even though the images are perfect: no skew, perfect blacks, perfect whites.

Over the years, I’ve had to do this many times with PDFs that I needed to get text out of. The people who make PDFs use all sorts of different programs to create them. The act of exporting text from a PDF doesn’t always get the order right, and God help you if they used drop caps, fancy graphic effects… etc.

OCR can be a lifesaver in those situations, even if it’s only 99% accurate… (these days, they are pretty good.)
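To put rough numbers on the resolution versus file size trade-off when rasterizing, here is a minimal Python sketch. It assumes a US Letter page (8.5 x 11 inches) and uncompressed pixel data; the function name `raster_page_size` is made up for illustration, and real TIFF files are usually compressed, so actual sizes will be smaller.

```python
import math


def raster_page_size(width_in, height_in, dpi, bits_per_pixel):
    """Return (width_px, height_px, uncompressed_bytes) for one rasterized page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Each row of pixels is padded up to a whole number of bytes.
    bytes_per_row = math.ceil(width_px * bits_per_pixel / 8)
    return width_px, height_px, bytes_per_row * height_px


if __name__ == "__main__":
    # Assumed page size: US Letter. Adjust for A4 or journal-sized pages.
    for dpi in (150, 300):
        for label, bpp in (("1-bit black and white", 1), ("24-bit colour", 24)):
            w, h, size = raster_page_size(8.5, 11, dpi, bpp)
            print(f"{dpi} dpi, {label}: {w} x {h} px, "
                  f"{size / 1_000_000:.1f} MB uncompressed")
```

Doubling the resolution quadruples the pixel count, and a 24-bit page holds roughly 24 times the data of a 1-bit page, which is why a colour TIFF export balloons so quickly.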

msteuernagel · March 29, 2011, 11:36am

Trillium, I’m not sure you understood what my problem is. When I select “Convert to searchable PDF” with any document that already is PDF+Text, nothing happens. No error message, no OCR activity, no log message, no Console activity. OCR works fine with image-only PDFs and images. This is not a problem with a specific PDF; the issue is that DT is not sending these files to OCR for some reason, whereas it used to do so in previous versions.

Bill_DeVille · March 29, 2011, 3:42pm

That’s flaky behavior. Try:

Click on the DT application name in the menubar and choose ‘Empty Cache’. Perhaps a cache file is stuck or damaged.
Restart the computer. Perhaps there are cobwebs in memory. Very often a restart will clear up strange behavior.

oolfanska · April 21, 2011, 12:18pm

If you haven’t found a better way to do this yet, here’s a possibility: If the problem is just that OCR’d cover page, open the PDF in Preview or some other PDF viewer app, delete just the cover page (or move it elsewhere temporarily), save & close. You should now have a text-less PDF that can be OCR’d by DT. Add the cover page back in afterwards, with Preview, if you like.
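Another way to get a text-less PDF outside of DT, if Ghostscript happens to be installed, is its `-dFILTERTEXT` option (available in Ghostscript 9.23 and later), which tells the `pdfwrite` device to leave text out of the output. The Python wrapper below is only a sketch: the function names are made up, it assumes the executable is called `gs` and is on the PATH, and it has not been tried on JSTOR files. Note that it drops all text objects, so it only makes sense when the visible page is a scanned image with a hidden text layer on top.

```python
import subprocess


def strip_text_command(src, dst, gs="gs"):
    """Build the Ghostscript argument list for writing a copy of src without text."""
    # -o sets the output file (and implies batch mode with no pauses);
    # -dFILTERTEXT asks the pdfwrite device to omit text objects.
    return [gs, "-o", dst, "-sDEVICE=pdfwrite", "-dFILTERTEXT", src]


def strip_text_layer(src, dst):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(strip_text_command(src, dst), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(strip_text_command("article.pdf", "article-notext.pdf")))
```

The resulting file should then import as an image-only PDF that can be OCR’d in the usual way.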

msteuernagel · January 10, 2012, 1:41pm

I finally submitted this issue to support, and this was their response:

I guess DT defaults to doing nothing when you choose to dismiss the warning. (I probably had dismissed it so that I wouldn’t have to confirm every time I wanted to OCR something).

Anyway, it fixed the problem for me.