Is there a way to filter PDFs that are not OCRed?

joshgibson · December 4, 2019, 11:14pm

I’ve tried creating a smart group that matches files whose “Kind” is “PDF” and not “PDF+Text”, but I wasn’t able to get that working.

Does anyone know of a way to figure out which documents need OCRing? I just recently switched from Evernote and want to make sure everything is searchable.

Thanks in advance!

rkaplan · December 5, 2019, 12:50am

You could try a smart group like this:

mbbntu · December 5, 2019, 11:53am

It may depend to some extent on what kinds of documents you are using, and where they come from. I find that certain academic databases put a cover sheet before an article, and while the cover sheet has an OCR layer, the article itself does not (this tends to happen more with older articles). Those PDFs appear as if the whole article has an OCR layer, but they still need processing to make the whole thing searchable.
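
For anyone who wants to check for this mixed case outside the app, here is a minimal Python sketch. It assumes you have already obtained a word count for each page with whatever PDF tool you prefer (that extraction step is not shown). The names `classify_text_layer` and `min_words` are hypothetical and not part of DEVONthink.

```python
from typing import List


def classify_text_layer(page_word_counts: List[int], min_words: int = 1) -> str:
    """Classify a PDF by how many of its pages carry a text layer.

    page_word_counts: one entry per page, the number of words extracted
    from that page (assumed to come from an external PDF tool).
    min_words: hypothetical threshold for treating a page as having text.

    Returns "none", "partial", or "full".
    """
    if not page_word_counts:
        # No pages at all: nothing searchable.
        return "none"

    pages_with_text = sum(1 for count in page_word_counts if count >= min_words)

    if pages_with_text == 0:
        return "none"
    if pages_with_text == len(page_word_counts):
        return "full"
    # Some pages have text and some do not, e.g. a text cover sheet
    # in front of a scanned article.
    return "partial"
```

A document with a text cover sheet followed by scanned pages, such as `classify_text_layer([120, 0, 0, 0])`, comes back as "partial", which is the case a whole-document word count would miss.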

joshgibson · December 5, 2019, 3:24pm

Thank you, I’ll give this a shot and report back!

joshgibson · December 5, 2019, 3:25pm

Yeah, that’s a great point. Fortunately, I’m mostly using PDFs I’ve either captured, saved from other apps, or scanned myself, so I’m guessing they’ll be all or nothing. I’m going to try rkaplan’s idea and see if it works. Fingers crossed!

BLUEFROG · December 5, 2019, 3:25pm

What @rkaplan suggests is the smart group we’ve been advocating for years.

joshgibson · December 5, 2019, 7:28pm

Perfect!