How to batch OCR 15,000 Evernote notes, many of which contain PDFs

rufus123 · October 24, 2020, 7:02am

I converted from EN to DT recently.
I imported 15,000+ Evernote notes into DT. All went well until I started searching and could not find what I was looking for and knew was there.
With Evernote, all imported PDFs are automatically processed through the OCR, which allows me to search within PDFs without any prior PDF processing.

1- Would the rule below allow me to OCR all my imported Evernote notes that contain a PDF?

2- I can’t figure out how to trigger the On Demand command.

pete31 · October 24, 2020, 7:24am

Use the Smart Rule’s contextual menu to trigger it manually.

rufus123 · October 24, 2020, 7:39am

thank you Pete

rufus123 · October 24, 2020, 8:30am

I just triggered the batch conversion of thousands of PDFs imported from Evernote. I am suddenly having cold sweats. Will the OCR process duplicate the PDF?

chrillek · October 24, 2020, 8:53am

That depends on the settings in your preferences.

rufus123 · October 24, 2020, 9:30am

OK, the original is set to move to trash.
What is your recommendation for:

compression
DPI? I left it at 150 DPI and am now wondering if that was a terrible idea. Should I have set it to 300?

chrillek · October 24, 2020, 9:57am

Why don’t you try it out with ten documents? I’m doing fine with 150 DPI and without compression.

rufus123 · October 24, 2020, 10:40am

Thank you for your comment.
Because I have already batch-processed thousands of PDFs.
If my settings are not OK (original to trash, compressed, 150 DPI), I have to start everything over, i.e. re-import 15,000 notes from Evernote, OCR them, etc.
In fact, what I would like is a reference that discusses the pros and cons of the OCR options in DEVONthink, especially 150 DPI or more, and compressed or not.
I don’t like the fact that I am just guessing.

rufus123 · October 24, 2020, 10:56am

URGENT, please. I created a smart rule to OCR thousands of PDFs. How do I stop the process? I tried everything I could think of.

chrillek · October 24, 2020, 10:58am

There is no answer to your question re DPI. It depends on the quality of the original document. If it’s a genuine PDF, i.e. not a scanned document, 150 should be OK. Otherwise, even 300 might not be enough for OCR to work.
Why don’t you explore these potentially crucial questions before you work on 15,000 documents?

rufus123 · October 24, 2020, 11:01am

OK, forget I asked

wmc · October 24, 2020, 12:37pm

@rufus123

EDIT: See @matti’s note #15 and @BLUEFROG’s note #16 below for why Evernote OCR doesn’t import into DEVONthink.
Note: I do not use Evernote so may be (edit: I am) wrong here, but:

the resulting text layer should be imported into DEVONthink. It will take a bit of time for DT to index the items for search. Are those PDFs shown as kind = PDF + text? If so, the OCR results are there and if search doesn’t work it may be that indexing is not complete. Or try to select some text in a PDF which will show if a text layer is present.

Re-OCRing may not be needed.

BLUEFROG · October 24, 2020, 1:33pm

It is not a good idea to do such a process en masse like this. It is better to do such an operation on much smaller batches.
And @chrillek is correct. Testing on a smaller sample should have been done before committing to larger batches.
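
The advice above (test on a small sample, then work in smaller batches) can be sketched in a few lines of Python. This is only an illustration of the planning step, not anything DEVONthink provides: the function name `plan_batches`, the sample size of 10 and the batch size of 100 are assumptions chosen for the example, and the document names are placeholders.

```python
import random


def plan_batches(items, sample_size=10, batch_size=100, seed=0):
    """Split items into one small test sample plus fixed-size batches.

    Hypothetical helper for illustration; items must be unique and hashable
    (for example file names or item identifiers).
    """
    rng = random.Random(seed)  # fixed seed so the plan is repeatable
    pool = list(items)
    # Pick the test sample first; never ask for more than is available.
    sample = rng.sample(pool, min(sample_size, len(pool)))
    chosen = set(sample)
    # Everything not in the sample is processed later, in original order.
    rest = [item for item in pool if item not in chosen]
    batches = [rest[i:i + batch_size] for i in range(0, len(rest), batch_size)]
    return sample, batches


if __name__ == "__main__":
    docs = [f"doc-{i}" for i in range(250)]  # placeholder names
    sample, batches = plan_batches(docs)
    print(len(sample), [len(b) for b in batches])
```

The idea is to OCR the sample first, check the results (file size, text quality, what happens to the originals), and only then work through the remaining batches one at a time.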

korm · October 24, 2020, 3:21pm

Are you sure you have 15,000-ish PDFs that need OCR? Maybe search the groups with the exported documents and look for PDFs to make sure they are not all PDF+Text.

matti · October 24, 2020, 4:11pm

For anybody looking for more info on this, see the Evernote Blog post “How Evernote’s Image Recognition Works”:

If you export a note containing a PDF that has been processed by the OCR system, there will be two nodes in the document: data and alternate-data. The data node contains a base-64 encoded version of the original PDF and the alternate-data node represents the searchable version of the same PDF.

I don’t think DT handles this in this way though: At least in a small test, PDF recognition data was not included in the DT import (there was only one non-OCR-ed PDF).
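
Whether a given export actually carries the searchable copy can be checked outside DT. Below is a minimal Python sketch, assuming the export is an .enex XML file in which each PDF sits in a resource element with mime, data and (optionally) alternate-data children, as the blog quote describes. The sample XML is hand-made for illustration, not real Evernote output, and the function name `pdf_resources` is made up.

```python
import base64
import xml.etree.ElementTree as ET

# Hand-made stand-in for an export file (assumed layout, not real Evernote output).
# "JVBERi0xLjQ=" is just the base-64 encoding of the 8 bytes "%PDF-1.4".
SAMPLE_ENEX = """<en-export>
  <note>
    <title>Scanned invoice</title>
    <resource>
      <data encoding="base64">JVBERi0xLjQ=</data>
      <mime>application/pdf</mime>
      <alternate-data encoding="base64">JVBERi0xLjQ=</alternate-data>
    </resource>
  </note>
  <note>
    <title>Manual</title>
    <resource>
      <data encoding="base64">JVBERi0xLjQ=</data>
      <mime>application/pdf</mime>
    </resource>
  </note>
</en-export>"""


def pdf_resources(enex_text):
    """Return (note title, original PDF size in bytes, has alternate-data) per PDF."""
    root = ET.fromstring(enex_text)
    rows = []
    for note in root.iter("note"):
        title = note.findtext("title", default="")
        for res in note.iter("resource"):
            if res.findtext("mime", default="").strip() != "application/pdf":
                continue  # skip images and other attachments
            original = base64.b64decode(res.findtext("data", default=""))
            has_alt = res.find("alternate-data") is not None
            rows.append((title, len(original), has_alt))
    return rows


if __name__ == "__main__":
    for title, size, has_alt in pdf_resources(SAMPLE_ENEX):
        print(f"{title}: {size} bytes, searchable copy present: {has_alt}")
```

PDFs reported without an alternate-data node would have no searchable copy in the export, so they would be the ones needing OCR after import.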

BLUEFROG · October 24, 2020, 5:02pm

That is correct as Evernote keeps the OCR data on their servers.

rufus123 · October 24, 2020, 5:08pm

well … just did an urgent search for a term that yielded nothing. In Evernote, found the term immediately. It is impossible to guess which PDFs I will need to search.
thank you for your comment

BLUEFROG · October 24, 2020, 7:20pm

rufus123 wrote: “It is impossible to guess which PDFs I will need to search.”

Actually, you can create a smart group to show PDFs with no text in them…
SmartGroup - PDFs to OCR

rufus123 · October 24, 2020, 7:22pm

I’m sorry, I don’t understand. What kind of PDF has no text? Do you mean non-readable?
thank you

BLUEFROG · October 24, 2020, 7:24pm

A PDF with no text layer, which includes PDFs that have no OCR done on them.
If there is no text layer, the file wouldn’t be searchable by content.