How to batch OCR 15,000 Evernote notes, many of which contain PDFs

korm · October 24, 2020, 8:29pm

The Smart Group rule that @BLUEFROG suggested is a standard way to locate PDFs that are not “PDF+Text”. A sample result from that Smart Group rule in my own databases, below, shows that the “Kind” of PDFs that have not been OCRed is just “PDF Document” rather than “PDF+Text”.

Anyone who does a lot of PDF importing and needs to locate documents needing OCR would benefit from the rule.

rufus123 · October 25, 2020, 5:14am

GREAT suggestion! Thank you.

rufus123 · October 25, 2020, 2:44pm

I apologize for my brusque reply and appreciate the fact that you were taking time to think about my problem.

Macster · October 27, 2020, 10:51am

Warning: Many OCR tools that turn PDFs containing a scanned image of a document into a “searchable PDF” with a text layer cannot simply add the text layer to the otherwise unchanged image data; instead they recompress the image data. This can lead to bloated files, vastly inferior image quality, or both.

Usually it’s the CCITT Group 4 fax codec and the JBIG2 codec that many tools do not support, including the bundled OCR software that came with DEVONthink 2, because Apple’s PDFKit framework only implemented their decompression. So all software built on top of this framework was unable to write PDFs using these codecs unless the developer added code of their own to handle these cases. Many didn’t, so you’d get either huge files (lossless but inefficient deflate compression instead of efficient CCITT or JBIG2) or blurred text (JPEG).
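One rough way to see which codecs a scan uses, before and after OCR, is to count the stream filter names in the raw PDF bytes. This is only a sketch: the filter names (`CCITTFaxDecode`, `JBIG2Decode`, `DCTDecode`, `JPXDecode`, `FlateDecode`) come from the PDF specification, but the function name `count_filters` and the file names in the usage note are my own assumptions, and a byte scan is not a real PDF parser. Note also that `FlateDecode` is used for page content and fonts as well as images, so its count alone says little.

```python
import re
from collections import Counter

# Stream filter names as defined by the PDF specification.
FILTERS = {
    b"CCITTFaxDecode": "CCITT fax (bilevel, lossless)",
    b"JBIG2Decode": "JBIG2 (bilevel)",
    b"DCTDecode": "JPEG (lossy)",
    b"JPXDecode": "JPEG 2000",
    b"FlateDecode": "deflate (lossless, general purpose)",
}

# Matches e.g. "/CCITTFaxDecode" wherever it appears in a stream dictionary.
_PATTERN = re.compile(rb"/(" + b"|".join(FILTERS) + rb")\b")


def count_filters(pdf_bytes: bytes) -> Counter:
    """Count occurrences of known filter names in raw PDF bytes.

    Crude heuristic: it does not parse the PDF, so treat the counts
    as a hint about which codecs are present, not as exact numbers.
    """
    return Counter(
        match.group(1).decode("ascii") for match in _PATTERN.finditer(pdf_bytes)
    )
```

To use it, read both versions of the same document, for example `count_filters(Path("scan-original.pdf").read_bytes())` and the same call on the OCRed copy (both file names are hypothetical). If `CCITTFaxDecode` or `JBIG2Decode` show up in the original but only `DCTDecode` or `FlateDecode` in the result, the image data was probably recompressed.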

So if you are about to convert such a big archive, be very wary and test beforehand whether the results you get match your requirements. If they don’t, you might want to use some other software instead of DEVONthink to do the OCR and only then import the resulting files.
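A simple first check on a test batch is to compare file sizes before and after OCR. The sketch below assumes you keep the untouched originals in one folder and the OCRed copies, under the same file names, in another; the function names and the thresholds (more than 2x growth, less than half the size) are arbitrary assumptions to adjust. Size alone does not prove quality loss, so flagged files still need a visual check.

```python
from pathlib import Path


def size_report(original_dir, ocr_dir):
    """Return (name, bytes_before, bytes_after, ratio) for each PDF in both folders."""
    rows = []
    for original in sorted(Path(original_dir).glob("*.pdf")):
        ocred = Path(ocr_dir) / original.name
        if not ocred.exists():
            continue  # no OCRed counterpart yet, nothing to compare
        before = original.stat().st_size
        after = ocred.stat().st_size
        ratio = after / before if before else float("inf")
        rows.append((original.name, before, after, ratio))
    return rows


def flag_suspicious(rows, grow=2.0, shrink=0.5):
    """Keep rows whose size ratio suggests bloat (> grow) or heavy recompression (< shrink)."""
    return [row for row in rows if row[3] > grow or row[3] < shrink]
```

A file that grows several times over hints at deflate replacing CCITT or JBIG2; one that shrinks sharply hints at lossy JPEG recompression.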

I for one would never tolerate having the size or quality of my scanned documents compromised just because some given piece of software lacks important features. But maybe DT3 does it properly now; I haven’t checked, as I haven’t had the need again yet.

rufus123 · October 27, 2020, 11:17am

Thank you. I will do some testing.

gg378 · October 30, 2020, 5:28pm

I generally found that OCRing with Adobe Acrobat Professional (the last version I had was 11) gave by far the best results. It can also easily batch OCR. I would put all those PDFs into a folder before importing to DT, and run Acrobat OCR over them. For massively reducing the size of certain files, its ClearScan feature is unmatched and truly impressive (in the current versions it no longer goes by that name); however, you can’t just batch OCR with ClearScan, because not all files are suitable for it.

rufus123 · October 30, 2020, 9:14pm

Thank you, but it is too expensive. I am surprised you run version 11. I thought that 10 and 11 don’t run on Catalina.

FROBGOBLIN · October 31, 2020, 8:00am

Adobe, alas, doesn’t seem to run on Catalina. I’ve relied heavily on Adobe X for about a decade now. It’s difficult to believe that ten years later I can actually do much less with my computer. A couple of other wonderful programs also don’t work in Catalina. And, apparently, ten years has not been long enough to make me into someone wealthy enough to use Adobe’s current cloud services. To add insult to injury, as far as I can tell from informal tests, the OCR hasn’t even improved in Adobe. Paying more for less… How did this sorry state of affairs come about?

I am also unwilling to sacrifice the quality of my scans for OCR. Make sure to carefully check the results of tests before embarking on massive OCR projects. As for me, one computer has been left in the past, so to speak, where it continues to run an old OS and old software. It’s already well past obsolete, though, so I will have to find a way to enjoy less as more in the current Appleverse.

Claude · October 31, 2020, 8:06am

How did you convert your notes from EN to DT?

rufus123 · October 31, 2020, 8:57am

I totally agree with your comment.

Coming from Adobe it is not surprising. My alternative for PDF editing + OCR is PDF Pen Pro. I prefer PDF Expert in general but it does not have OCR.

The upgrade to Catalina did cause some nasty surprises. For example, I used to buy Kindle books and had an app to print them out in a very nice booklet form, which allowed me to go on walks and read on benches. It is one of many apps that I lost when I upgraded to Catalina.

tcga · October 31, 2020, 9:03am

@rufus123 Could you tell those of us still on Mojave what the app is that can print Kindle books in booklet format?

Thanks a lot.

rufus123 · October 31, 2020, 9:04am

A 2 step process:

1- DevonThink
File → Import → Notes from Evernote
I imported 15,000 notes with no problems

2- the second step is due to the fact that Evernote automatically OCRs all PDFs that you import (in Evernote) BUT the text portion of the OCR is stored on Evernote servers. After importing into DevonThink, your PDFs are therefore not searchable which in my case was a big pain.

After the Evernote import, I selected all EN notes in DevonThink → Data → OCR → to searchable PDF, and about 6,000 PDFs were OCRed by DevonThink without any problems. I ran some random tests and they have all indeed become searchable.

I must add that many forum members disagreed with step 2 (batch OCR), so ask for other opinions before going ahead.

FROBGOBLIN · October 31, 2020, 10:25am

@tcga (regarding Kindle issues)
An answer would probably go off track into territory of dubious legality, so I recommend contacting the poster directly about it.

@rufus123
Batch processing of PDFs is a great idea, and it is something I do on a regular basis with Adobe Pro X, but I think the comments were more about jumping into a new workflow / massive project without testing (and backups!) first.
In fact, the first thing I thought when I saw this thread was, “what about backups!?” If it were me, for just about anything, I’d probably test with a handful of files and check them over carefully (Do images get downscaled too much? Is the OCR performance acceptable? etc.).
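Testing with a handful of files can be done on copies, so the originals are never touched. Here is a minimal sketch that picks a random sample of PDFs and copies them into a separate test folder; the function name `copy_test_sample`, the default sample size, and the folder layout are all assumptions for illustration.

```python
import random
import shutil
from pathlib import Path


def copy_test_sample(source_dir, test_dir, count=10, seed=None):
    """Copy a random sample of PDFs from source_dir (recursively) into test_dir.

    The originals are only read, never modified. Copies get a numeric
    prefix so that files with the same name in different subfolders
    do not overwrite each other.
    """
    pdfs = sorted(Path(source_dir).rglob("*.pdf"))
    chosen = random.Random(seed).sample(pdfs, min(count, len(pdfs)))
    target = Path(test_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for index, src in enumerate(chosen):
        dst = target / f"{index:03d}_{src.name}"
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        copied.append(dst)
    return copied
```

Run the OCR on the copies only, inspect them, and start the full batch once the results hold up. This is not a substitute for a real backup of the database.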
By the way, I just imported about 5000 Evernote files, and I discovered that my images (apparently) were not brought over properly, and the HTML files with them were not displaying the images either. In other words, I have a lot of groups that contain an HTML file with nothing in it. Exporting as HTML from Evernote and then importing into DT solved the problem. A minor hiccup, but a good example of something that might look OK at first, but isn’t OK upon further inspection.
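For exported HTML, one thing that can be checked automatically is whether the images a page refers to actually exist on disk. The sketch below assumes the notes are plain HTML files that reference local images through relative `src` paths; whether that matches a given Evernote export or DT import is an assumption to verify, and the name `missing_images` is made up for this example. It will not catch pages that contain no image tags at all.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlparse


class _ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def missing_images(html_path):
    """Return the local image references in html_path that do not exist on disk."""
    html_path = Path(html_path)
    parser = _ImgCollector()
    parser.feed(html_path.read_text(encoding="utf-8", errors="replace"))
    missing = []
    for src in parser.sources:
        parsed = urlparse(src)
        if parsed.scheme or parsed.netloc:
            continue  # remote or data: URLs cannot be checked on disk
        if not (html_path.parent / unquote(parsed.path)).exists():
            missing.append(src)
    return missing
```

Looping this over every `.html` file in an export folder gives a quick list of notes worth opening by hand.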

rufus123 · October 31, 2020, 10:40am

Yes, they were right. Testing first: you are right.

Very interesting. I will have a look. There is a problem with your solution: if you export an Evernote note to either ENEX or HTML, the exported note does not contain the notebook information. I have about 400 notebooks, so Evernote → export → import into DevonThink is not an option.

FROBGOBLIN · October 31, 2020, 11:07am

Hi. I only have one notebook, so no problem for me.

The truth is, for all of Evernote’s positive aspects, its treatment of notebooks as an afterthought, lack of export options (the result, I suppose, of an Applish “we know best” attitude toward features), and anemic support for tags always made me wary of using additional organizational features beyond file names.

Claude · October 31, 2020, 4:18pm

Thanks @rufus123 for your explanation!

gg378 · October 31, 2020, 8:21pm

I have an old iMac at work, which runs my copy of Acrobat 11 on El Capitan, I believe. The Adobe cloud subscription is too expensive for what I need. I weaned myself entirely from the Creative Suite. Illustrator ➝ Affinity Designer, Photoshop ➝ Pixelmator and Graphic Converter, Acrobat ➝ Qoppa PdfStudio. The only thing remaining is ClearScan OCR on that old iMac.

rufus123 · November 1, 2020, 6:30am

Very interesting. I just scrapped an old Mac and should have kept it.