Finding PDFs that have not been OCR'd

paul99 · January 11, 2016, 4:10am

I’ve found a few non-OCR’d (and therefore non-searchable) PDFs in my database, mostly things that were emailed to me and that I did not check at the time. I’d like to find all of these and OCR them now.

I found this nice (old) Tuesday tip: blog.devontechnologies.com/2007/ … ocr-layer/

But the script it references is not found (404): devon-technologies.com/files … t_Text.zip

Does anyone have a copy of this script, or something similar?

I thought I could just use search or advanced search, but while I can search on Document Kind, the available choices cannot differentiate “PDF” from “PDF + Text”.

Can any of you gurus help me out here? Thanks a lot in advance!

paul

BLUEFROG · January 11, 2016, 4:48am

Make a Smart Group with criteria:
Kind is PDF/PS
Word Count is 0

:^)

cgrunenberg · January 12, 2016, 3:26pm

See menu Data > New with Template > Smart Groups > PDFs (not searchable)

paul99 · January 12, 2016, 10:41pm

Thanks a lot, Jim and Christian! So much depth in DEVONthink, I love it.

(I’d started thinking of checking the output of pdftotext for each file and scripting a walk through the entire database. So much easier this way! Now I have the smart group, so it’s trivial to see if/when new files show up without a text layer.)
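
For anyone who wants the scripted route outside DEVONthink anyway, here is a minimal sketch of that pdftotext idea. It assumes the `pdftotext` command-line tool (from Poppler or Xpdf) is installed and on the PATH, and that the PDFs sit in an ordinary folder you pass on the command line. The function names (`extract_text`, `word_count`, `find_unsearchable`) are made up for this example and are not part of any DEVONthink script. It mirrors the "Word Count is 0" smart group criterion: a PDF with no extractable words is treated as not OCR'd.

```python
import subprocess
import sys
from pathlib import Path


def extract_text(pdf_path):
    """Return the text pdftotext extracts from one PDF ('' if it fails)."""
    # The trailing "-" tells pdftotext to write to stdout instead of a .txt file.
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        errors="replace",
    )
    # Assumption: a PDF that pdftotext cannot read (corrupt, encrypted) is
    # reported as having no text, so it shows up in the results for review.
    return result.stdout if result.returncode == 0 else ""


def word_count(text):
    """Count whitespace-separated words; page-break form feeds count as whitespace."""
    return len(text.split())


def find_unsearchable(root, extract=extract_text):
    """Walk root recursively and return PDFs whose extracted text has zero words."""
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file()
        and path.suffix.lower() == ".pdf"
        and word_count(extract(path)) == 0
    )


if __name__ == "__main__":
    # Usage: python find_unsearchable.py /path/to/folder
    start = sys.argv[1] if len(sys.argv) > 1 else "."
    for pdf in find_unsearchable(start):
        print(pdf)
```

The `extract` parameter is only there so the walk can be tried with a stand-in function when pdftotext is not available. Unlike the smart group, this is a one-off scan and will not update itself as new files arrive.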

Now that I see the correct phrase to Google for, I find this nice post from Evan K. on the complete process: 40tech.com/2015/08/17/the-be … think-ocr/

BLUEFROG · January 12, 2016, 11:21pm

No problem, and yeah, DEVONthink has a lot of things to discover!

(PS: Evan has some really practical info concerning DEVONthink on his blog. A good find!)