Scriptable detection method for PDF needing OCR?

cturner · May 24, 2010, 1:28pm

Hi all-

I’ve always appreciated DTPO’s flagging whether an imported PDF had no text in it: that is, flagged it as needing to be OCRed.

I wonder if anyone knows of a command-line app, or some other script-friendly way, to detect whether a PDF is all image; i.e., might need to be OCRed? The idea is to detect this state without first importing the PDF into DTPO.

I’d guess that a small util could be written with a rudimentary knowledge of PDFKit…

Thanks! Charles

cturner · May 24, 2010, 3:23pm

Well, that was easy…

http://vze26m98.net/devon/pdfstr.zip

The above is a little command line utility that will do one of two things:

pdfstr filename [prints the text of the PDF to stdout]

pdfstr -c filename [prints the character count of the PDF to stdout]

Actually, the argument parsing is pretty slim, so basically any flag will produce the count. (Which implies there’s not a lot of defensive programming here!)
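
For anyone who wants to call this from a script, here is a minimal sketch of a Python wrapper around the count mode. It is only an illustration: it assumes pdfstr is on the PATH and that `pdfstr -c` prints a bare integer on stdout. The function names and the `min_chars` threshold are made up for this example and are not part of pdfstr.

```python
import subprocess


def needs_ocr(char_count: int, min_chars: int = 1) -> bool:
    """Treat a PDF with fewer than min_chars extracted characters as image-only.

    min_chars is a hypothetical knob: raise it if scans with a few stray
    characters should still be flagged.
    """
    return char_count < min_chars


def pdf_char_count(pdf_path: str, tool: str = "pdfstr") -> int:
    """Run `pdfstr -c <file>` and parse the count it prints on stdout.

    Assumption: the count is a bare integer, possibly followed by a newline.
    """
    result = subprocess.run(
        [tool, "-c", pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits non-zero
    )
    return int(result.stdout.strip())


def pdf_needs_ocr(pdf_path: str, min_chars: int = 1) -> bool:
    """Combine the two steps: count the characters, then apply the threshold."""
    return needs_ocr(pdf_char_count(pdf_path), min_chars)
```

A batch script could then loop over a folder and pass only the files where `pdf_needs_ocr(path)` is true to an OCR step. Keeping the threshold check separate from the subprocess call makes it easy to try the logic without the tool installed.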

Enjoy, Charles