Enable Automatic OCR for PDF added to Inbox?

caprichoso · April 10, 2020, 6:22am

DEVONthink 3.0.4 on macOS 10.14.6.

When I scan a document with ScanSnap, the PDF is automatically OCR’d on import to DEVONthink. If I manually drag a PDF to the Inbox, the PDF is not indexed. I have to OCR it, and then delete it.

Is there a way to automatically OCR things (all text) put into the Inbox?

cgrunenberg · April 10, 2020, 9:31am

This is usually not necessary; most PDF documents are already searchable. But you could use a smart rule to achieve this, with the conditions Kind is PDF/PS and Word Count is 0, the action OCR > Apply, and the On Import event.
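As a rough illustration of what such a rule tests, here is a minimal Python sketch. It is not DEVONthink code: the `Record` class, the `needs_ocr` function, and the kind labels are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for an imported item; not a DEVONthink type."""
    name: str
    kind: str        # assumed label, e.g. "PDF/PS" or "Markdown"
    word_count: int  # words found in the document's text layer


def needs_ocr(record: Record) -> bool:
    """Mirror the two conditions: Kind is PDF/PS and Word Count is 0."""
    return record.kind == "PDF/PS" and record.word_count == 0


scanned = Record("scan.pdf", "PDF/PS", 0)         # image-only scan
printed = Record("report.pdf", "PDF/PS", 1250)    # already searchable
```

Under these assumptions only `scanned` would be sent to OCR; a PDF that already has a text layer is left alone.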

kikujiro · July 2, 2020, 6:56am

I have exactly this rule – see below.

It has worked consistently for a long time but is now applying OCR to imported print-to-PDF files whose word count (as can be seen from adding a Word Count column to the file list) is very much not zero – and thereby rasterising those files.

Is this a new bug?

cgrunenberg · July 2, 2020, 8:32am

I just tried to reproduce this but it works as expected (DEVONthink 3.5.1, macOS 10.15.5). What kind of file did you print in which app?

kikujiro · July 5, 2020, 4:10pm

The rule seems to be triggering with every PDF, including PDFs printed directly from Word and PDFs that have already been OCR’d. Both kinds show their word counts correctly on import.

I have tried restarting the Mac; still happening.

BLUEFROG · July 5, 2020, 6:18pm

Set the smart rule to On Demand and look at the matches.
Choose View > Columns > Word Count to see if there are any items that have a word count. (They shouldn’t.)
Those are the only files that should be processed by the smart rule.

kikujiro · July 5, 2020, 9:16pm

If I change it to On Demand, it only shows PDFs with zero word count BUT if I watch that view when I drop a print-to-PDF into the inbox, it appears very briefly in the view and then disappears again.

If I change it back to On Import, the same thing happens and it OCR’s the file.

kikujiro · July 5, 2020, 9:25pm

[accidentally deleted last post]

I have another smart rule that does some regex-based renaming, higher up the list than the OCR rule. If I delete the rename rule, the OCR rule works as it should.

BLUEFROG · July 5, 2020, 10:03pm

Your smart rule looks incomplete. You’re looking for ANY file in ALL databases whose name is not obviouslynot. Why are you using such a broad search?

kikujiro · July 6, 2020, 6:33am

— Ideally it would search for files with names matching the regex rule but that doesn’t seem to be an option.

— It should in fact be operating on import. I turned it to On Demand to see if that fixed the problem. It doesn’t.

— Whether it’s set to On Demand or On Import, surely its existence should not break the OCR rule?

kikujiro · July 6, 2020, 10:24am

I have removed the file renaming rule entirely for now, quit and restarted DTP, and PDFs printed directly from Office that correctly report large word counts are still getting OCR’d on import. Any troubleshooting tips would be helpful, as this is really not what I want the app to be doing.

kikujiro · July 6, 2020, 10:35am

There seems to be a delay in recognising the word count on import, so it briefly appears as if nil: see video.
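To make the suspected timing problem concrete, here is a toy Python model. It is only a guess at the mechanism, not a description of DEVONthink internals: `ImportedPdf`, `rule_matches`, and the idea that an unindexed count is treated as 0 are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImportedPdf:
    """Hypothetical model of a PDF right after import; not a DEVONthink type."""
    name: str
    true_word_count: int
    indexed_word_count: Optional[int] = None  # unknown until indexing finishes

    def index(self) -> None:
        """Simulate indexing: the real word count becomes visible."""
        self.indexed_word_count = self.true_word_count


def rule_matches(pdf: ImportedPdf) -> bool:
    """Assumed behaviour: a count that is not yet known is treated as 0."""
    return (pdf.indexed_word_count or 0) == 0


pdf = ImportedPdf("printed-from-word.pdf", true_word_count=1800)
matched_before_indexing = rule_matches(pdf)  # rule evaluated too early
pdf.index()
matched_after_indexing = rule_matches(pdf)   # rule evaluated after indexing
```

In this model the same file matches the "Word Count is 0" condition before indexing and stops matching afterwards, which would be consistent with the file flashing into the On Demand view and then disappearing.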

BLUEFROG · July 6, 2020, 1:24pm

Well, the file needs to be indexed and the UI updated.

Regarding the smart rule, I would strongly suggest you don’t use such broad scopes.
Specify file types with particular names if you want to apply renaming instead of matching basically everything in your databases.

kikujiro · July 6, 2020, 4:49pm

I understand the file needs to be indexed and the UI updated – I was pointing this out in an effort to get to the bottom of why the OCR rule is kicking in when it shouldn’t, even after deleting the renaming rule completely.

On the renaming rule, as far as I understand there’s no way to match file names by regex, so I don’t see how I could do that, but very happy to be corrected.

BLUEFROG · July 7, 2020, 1:49pm

No, you can’t use full regex, but you are clearly looking for specific filenames, not just things not matching obviouslynot. You should be as specific as possible in your targeted locations and criteria.

You could use a criterion like: name:[0-9][0-9][0-9][0-9]+[01][0-9]+[0-5] to reduce the number of matches to process.

kikujiro · April 4, 2022, 8:32am

Is there a way of specifically capturing a period (or other specially treated character) in a wildcard name match?

BLUEFROG · April 4, 2022, 12:48pm

Like what, for example?

kikujiro · April 4, 2022, 7:12pm

Like a period? So I can use your approach above to find files that are named e.g. 2021.10.12 letter but not 2021-10-12 letter, which is what the regex scan would change the first one to.
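For clarity, the distinction I am after can be sketched in Python. The pattern below is an assumption made for this example (the actual regex in my renaming rule is not shown in this thread), and `has_dotted_date` and `to_dashed_date` are hypothetical helper names.

```python
import re

# Assumed pattern: a name that starts with YYYY.MM.DD, with literal periods.
DOTTED_DATE = re.compile(r"^(\d{4})\.(\d{2})\.(\d{2})\b")


def has_dotted_date(name: str) -> bool:
    """True only for names whose leading date uses periods as separators."""
    return DOTTED_DATE.match(name) is not None


def to_dashed_date(name: str) -> str:
    """Rewrite a leading YYYY.MM.DD to YYYY-MM-DD; other names are unchanged."""
    return DOTTED_DATE.sub(r"\1-\2-\3", name, count=1)


renamed = to_dashed_date("2021.10.12 letter")
untouched = to_dashed_date("2021-10-12 letter")
```

The escaped `\.` is what matches a literal period here; a name that already uses dashes does not match and is returned as-is.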

BLUEFROG · April 5, 2022, 12:18am

From the Help > Documentation > Appendix > Search Operators section…

And RegEx isn’t supported in the toolbar search field.

kikujiro · April 5, 2022, 5:52am

Thanks. I know regex isn’t supported in the search field. I was trying to implement your suggestion of using a targeted search to narrow the application of regex in a rule. I had seen the manual; I was just wondering if there was a trick I was missing. So the answer is no.