How to batch OCR 15,000 Evernote notes, many of which contain PDFs

I recently switched from Evernote (EN) to DEVONthink (DT).
I imported 15,000+ Evernote notes into DT. All went well until I started searching and could not find items that I knew were there.
Evernote automatically runs OCR on every imported PDF, so I could search inside PDFs without processing them first.

1- Would the rule below allow me to OCR all my imported Evernote notes that contain a PDF?

2- I can’t figure out how to trigger the On Demand command.

Use the Smart Rule’s contextual menu to trigger it manually.

Thank you, Pete.

I just triggered the batch conversion of thousands of PDFs imported from Evernote, and I am suddenly having cold sweats. Will the OCR process duplicate the PDFs?

That depends on the settings in your preferences.

OK, the original is set to move to the trash.
What is your recommendation for:

  • compression
  • DPI? I left it at 150 DPI and am now wondering if that was a terrible idea. Should I have set it to 300?

Why don’t you try it out with ten documents? I’m doing fine with 150 DPI and without compression.

Thank you for your comment.
I can’t test first because I have already batch processed thousands of PDFs.
If my settings are not OK (original to trash, compressed, 150 DPI), I have to start everything over, i.e. re-import 15,000 notes from Evernote, OCR them, etc.
What I would really like is a reference that discusses the pros and cons of the OCR options in DEVONthink, especially 150 DPI versus higher, and compressed versus uncompressed.
I don’t like the fact that I am just guessing.

URGENT, please. I created a smart rule to OCR thousands of PDFs. How do I stop the process? I have tried everything I could think of.

There is no single answer to your question about DPI. It depends on the quality of the original document. If it’s a genuine PDF, i.e. not a scanned document, 150 should be OK. Otherwise, even 300 might not be enough for OCR to work.
Why don’t you explore these potentially crucial questions before you work on 15,000 documents?
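
To put numbers on the 150 vs. 300 DPI question, here is a minimal sketch of the arithmetic for a rasterized page. It assumes a US Letter page (8.5 x 11 inches) and only counts pixels; it says nothing about how DEVONthink's OCR engine or its compression setting behaves, and `page_pixels` is a made-up helper name.

```python
def page_pixels(width_in, height_in, dpi):
    """Return (width_px, height_px, total_px) for a page rasterized at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px


# Assumed page size: US Letter, 8.5 x 11 inches.
for dpi in (150, 300):
    w, h, total = page_pixels(8.5, 11, dpi)
    print(f"{dpi} DPI: {w} x {h} px = {total:,} pixels")
```

Doubling the DPI quadruples the pixel count, so before compression a 300 DPI page image carries about four times the data of a 150 DPI one. Whether the extra detail improves recognition depends on the scan, which is why a test on a handful of documents is the practical way to decide.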

OK, forget I asked

@rufus123

EDIT: See @matti’s note #15 and @BLUEFROG #16 below for why Evernote OCR doesn’t import into DEVONthink.
Note: I do not use Evernote so may be (edit: I am) wrong here, but:

the resulting text layer should be imported into DEVONthink. It will take a bit of time for DT to index the items for search. Are those PDFs shown as kind = PDF + text? If so, the OCR results are there and if search doesn’t work it may be that indexing is not complete. Or try to select some text in a PDF which will show if a text layer is present.

Re-OCRing may not be needed.

It is not a good idea to do such a process en masse like this. It is better to do such an operation on much smaller batches.
And @chrillek is correct. Testing on a smaller sample should have been done before committing to larger batches.
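
As an illustration of working in smaller batches, here is a minimal sketch that splits any list of items (for example, file names gathered outside DEVONthink) into fixed-size chunks, so each chunk can be processed and checked before the next one starts. The `batches` helper and the batch size of 50 are assumptions for the example, not DEVONthink features.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical file names standing in for PDFs that still need OCR.
pending = [f"note-{n:05d}.pdf" for n in range(1, 121)]
for number, chunk in enumerate(batches(pending, 50), start=1):
    print(f"batch {number}: {len(chunk)} files, first is {chunk[0]}")
```

Inside DEVONthink the same idea applies without any code: select a few dozen documents, run the OCR on them, check the results and the settings, and only then move on to the next group.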

Are you sure you have 15,000-ish PDFs that need OCR? Maybe search the groups with the exported documents and look at the PDFs to make sure they are not all PDF+Text already.

For anybody looking for more info on this, the Evernote blog post “How Evernote’s Image Recognition Works” says:

If you export a note containing a PDF that has been processed by the OCR system, there will be two nodes in the document: data and alternate-data. The data node contains a base64-encoded version of the original PDF and the alternate-data node represents the searchable version of the same PDF.
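
A minimal sketch of what reading those two nodes could look like, using only the Python standard library. It assumes the export is an ENEX (XML) file in which each note has `resource` elements carrying `mime`, `data`, and, when present, `alternate-data` children; only `data` and `alternate-data` come from the quote above, the other element names and the `extract_pdfs` helper are assumptions. An export may well contain no `alternate-data` at all, in which case the function returns `None` for the searchable version.

```python
import base64
import xml.etree.ElementTree as ET


def extract_pdfs(enex_xml):
    """Yield (note_title, original_pdf, searchable_pdf_or_None) per PDF resource."""
    root = ET.fromstring(enex_xml)
    for note in root.iter("note"):
        title = note.findtext("title", default="")
        for resource in note.iter("resource"):
            # Assumed layout: <resource> holds <mime>, <data>, <alternate-data>.
            if (resource.findtext("mime") or "").strip() != "application/pdf":
                continue
            data = resource.findtext("data")
            alternate = resource.findtext("alternate-data")
            original = base64.b64decode(data) if data else b""
            searchable = base64.b64decode(alternate) if alternate else None
            yield title, original, searchable
```

For a real export file, the text would come from something like `open(path, encoding="utf-8").read()`. Counting how many results have `searchable` set to `None` gives a rough idea of how many PDFs would still need OCR after import.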

I don’t think DT handles the export this way, though. At least in a small test, the PDF recognition data was not included in the DT import (there was only one PDF, without OCR).

That is correct as Evernote keeps the OCR data on their servers.

Well … I just did an urgent search in DT for a term, and it yielded nothing. In Evernote, I found the term immediately. It is impossible to guess which PDFs I will need to search.
Thank you for your comment.

It is impossible to guess which PDFs I will need to search.

Actually, you can create a smart group to show PDFs with no text in them…
SmartGroup - PDFs to OCR

I’m sorry, I don’t understand. What kind of PDF has no text? Do you mean one that is not readable?
Thank you.

A PDF with no text layer, which includes PDFs that have no OCR done on them.
If there is no text layer, the file wouldn’t be searchable by content.
