How to automatically find un-OCR'ed documents

DevonThunk · September 5, 2012, 4:57pm

I have looked in the DT manual, on the Internet, and in two i-books, but can’t find an answer to what seems like a simple question. I just want to find the PDFs in my database that haven’t yet been through the OCR process, and then run it on them. Does anyone know how to gather these together so I can begin the process?

korm · September 5, 2012, 5:17pm

This smart group will find non-OCR PDFs.

(Word count is searched because non-OCR PDFs have no text layer and, therefore, no “words” - the image might have words, but the database doesn’t yet know that.)
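The logic behind that criterion can be sketched in a few lines of Python. This is only an illustration of the filter (kind is PDF and word count is zero), not DEVONthink's own API: the `Record` class, its field names, and the sample file names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for a database item (not a DEVONthink type)."""
    name: str
    kind: str        # e.g. "PDF", "RTF"
    word_count: int  # words found in the text layer; 0 if there is none


def find_unocred_pdfs(records):
    """Return the PDFs whose text layer contains no words."""
    return [r for r in records if r.kind == "PDF" and r.word_count == 0]


# Made-up sample data for illustration only.
records = [
    Record("scan-2012-09.pdf", "PDF", 0),  # image-only scan, no text layer
    Record("manual.pdf", "PDF", 15230),    # already searchable
    Record("notes.rtf", "RTF", 0),         # empty, but not a PDF
]

needs_ocr = find_unocred_pdfs(records)
print([r.name for r in needs_ocr])
```

Only the image-only scan is selected: the searchable PDF has a non-zero word count, and the empty RTF is excluded by the kind test.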

DevonThunk · September 5, 2012, 5:29pm

Thanks Korm! That is what I needed. You da best!

cgrunenberg · September 6, 2012, 12:12pm

There’s also a template, see Data > New from Template > Smart Groups > PDFs (not searchable).

jooz · November 29, 2020, 2:42pm

Sorry, I am new to the OCR features in DEVONthink, so my question might be very basic.

Unfortunately, the screenshot from @korm does not seem to be shown in the forum anymore, and it seems that this template no longer exists in DT 3.6.

What would be the easiest way to identify all non-OCR PDFs? Use a smart rule with “no words”?

I tried this:

But the results confuse me, as I can search for words in some of the files:

chrillek · November 29, 2020, 3:07pm

That’s what is usually suggested here.

jooz · November 29, 2020, 3:21pm

Thank you. Do you have an idea why this rule found apparently searchable PDFs in my case, as indicated above?

jooz · November 29, 2020, 3:25pm

Maybe you also have an idea for what I am trying to do.

I realized that I have a bunch of PDFs which are not OCRed but which have some markings and highlights in them. The PDFs were originally created with “save as pdf” in the browser on a Mac, and I believe that in that case the Mac does not create searchable PDFs. Anyway, this is what they typically look like:

When I OCR them in DT, these highlights get lost.

I assume this is expected behaviour.
Is there any way to preserve them in the OCRed file?