I recently switched from Evernote (EN) to DEVONthink (DT).
I imported 15,000+ Evernote notes into DT. All went well until I started searching and could not find items that I knew were there.
In Evernote, all imported PDFs are automatically run through OCR, which lets me search within PDFs without any prior processing.
1- Would the rule below allow me to OCR all my imported Evernote notes that contain a PDF?
2- I can’t figure out how to trigger the On Demand command.
I just triggered the batch conversion of thousands of PDFs imported from Evernote. I am suddenly having cold sweats. Will the OCR process duplicate the PDFs?
Thank you for your comment.
Because I have already batch processed thousands of PDFs.
If my settings are not OK (original to trash, compressed, 150 DPI), I have to start everything over, i.e. re-import 15,000 notes from Evernote, OCR them, etc.
In fact, what I would like is a reference that discusses the pros and cons of the OCR options in DEVONthink, especially 150 DPI versus higher, and compressed versus uncompressed.
I don’t like the fact that I am just guessing.
There is no single answer to your question regarding DPI. It depends on the quality of the original document. If it’s a genuine PDF, i.e. not a scanned document, 150 should be OK. Otherwise, even 300 might not be enough for OCR to work.
Why don’t you explore these potentially crucial questions before you work on 15000 documents?
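To put the 150 and 300 DPI figures mentioned above in perspective, here is a small arithmetic sketch. It only computes how many pixels a page and a line of text get at a given resolution; the US Letter page size and the 10 pt font are assumptions chosen for illustration, and nothing here reflects how DEVONthink's OCR engine works internally.

```python
def pixels(inches: float, dpi: int) -> int:
    """Number of pixels covering a length in inches at a given DPI."""
    return round(inches * dpi)


def glyph_height_px(point_size: float, dpi: int) -> float:
    """Approximate pixel height of text at a given point size (1 pt = 1/72 inch)."""
    return point_size / 72 * dpi


if __name__ == "__main__":
    # Assumed example: US Letter page (8.5 x 11 in) with 10 pt body text.
    for dpi in (150, 300):
        width, height = pixels(8.5, dpi), pixels(11, dpi)
        text_px = glyph_height_px(10, dpi)
        print(f"{dpi} DPI: {width} x {height} px, 10 pt text is about {text_px:.1f} px tall")
```

Doubling the DPI quadruples the pixel count of the page, which is why higher settings give the OCR more detail on small or poor-quality scans but produce larger files.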
EDIT: See @matti’s note #15 and @BLUEFROG’s note #16 below for why Evernote OCR doesn’t import into DEVONthink.
Note: I do not use Evernote so may be (edit: I am) wrong here, but:
the resulting text layer should be imported into DEVONthink. It will take a bit of time for DT to index the items for search. Are those PDFs shown as kind = PDF + text? If so, the OCR results are there, and if search doesn’t work, indexing may not be complete yet. Alternatively, try selecting some text in a PDF; if you can, a text layer is present.
It is not a good idea to run a process like this en masse. It is better to work in much smaller batches.
And @chrillek is correct. Testing on a smaller sample should have been done before committing to larger batches.
Are you sure you have 15,000-ish PDFs that need OCR? Maybe search the groups with the exported documents and look at the PDFs to check whether they are already PDF+Text.
If you export a note containing a PDF that has been processed by the OCR system, there will be two nodes in the document: data and alternate-data. The data node contains a base64-encoded version of the original PDF, and the alternate-data node holds the searchable version of the same PDF.
I don’t think DT handles it this way, though: at least in a small test, the PDF recognition data was not included in the DT import (there was only one PDF, without OCR).
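For anyone who wants to check their own export for the two nodes described above, here is a minimal Python sketch. The data and alternate-data names come from the description above; the surrounding element names (en-export, note, title, resource, mime) are my assumption about the export layout, and the sample XML is made up for illustration, so verify against a real export before relying on it.

```python
import base64
import xml.etree.ElementTree as ET


def pdf_resources(enex_xml: str):
    """Yield (note_title, has_searchable_copy, original_pdf_bytes) per PDF resource.

    Assumes each <note> has a <title> and zero or more <resource> children,
    each with <mime>, <data> and optionally <alternate-data>.
    """
    root = ET.fromstring(enex_xml)
    for note in root.iter("note"):
        title = note.findtext("title", default="(untitled)")
        for res in note.iter("resource"):
            if (res.findtext("mime") or "").strip() != "application/pdf":
                continue  # skip images and other attachments
            original = base64.b64decode(res.findtext("data") or "")
            alt = res.findtext("alternate-data")
            yield title, bool(alt and alt.strip()), original


# Made-up sample: one PDF with a searchable copy, one without.
SAMPLE = """<en-export>
  <note><title>Invoice</title>
    <resource>
      <data encoding="base64">JVBERi0xLjQ=</data>
      <mime>application/pdf</mime>
      <alternate-data encoding="base64">JVBERi0xLjQ=</alternate-data>
    </resource>
  </note>
  <note><title>Scan</title>
    <resource>
      <data encoding="base64">JVBERi0xLjQ=</data>
      <mime>application/pdf</mime>
    </resource>
  </note>
</en-export>"""

if __name__ == "__main__":
    for title, searchable, pdf in pdf_resources(SAMPLE):
        status = "has alternate-data" if searchable else "original only"
        print(f"{title}: {status} ({len(pdf)} bytes)")
```

A count of notes without alternate-data gives a rough idea of how many PDFs never had a searchable copy in Evernote to begin with, as opposed to those whose searchable copy was simply not carried over on import.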
Well… I just did an urgent search for a term, and it yielded nothing. In Evernote, I found the term immediately. It is impossible to guess which PDFs I will need to search.
Thank you for your comment.