Best way to do external OCR on already-existing documents?

mason_mark · February 23, 2015, 7:39am

I am evaluating DevonThink “Pro Office” to replace Evernote.

However, I have thousands of PDF scans of Japanese documents, and Japanese is a language that DT’s OCR feature does not support.

For new documents, my scanner (a recent-model ScanSnap) generates Japanese-aware searchable PDFs, so I can use those. But for the existing documents, the OCR data is locked away in the Evernote cloud. My plan is to use DT’s “Import → Notes from Evernote” feature to get the PDF files (and other notes) into DT, but Evernote doesn’t export the OCR data.

What is the best way to use an external tool (a Japanese-capable OCR app) to OCR the documents once they have already been added to DT?

cgrunenberg · February 23, 2015, 2:05pm

That’s hard to tell without knowing the software. Does it support AppleScript?

mason_mark · February 24, 2015, 7:05am

Well, I am OK with buying whatever tool I need to do the actual OCR. I was thinking of ABBYY FineReader (the full version, which does support Japanese, unlike the version bundled with DevonThink). It purports to support AppleScript.

But what does the process look like?

Basically, what I want to do is replace the PDFs in the DT database with new PDFs of the same content, but with English and Japanese text search data added.

Whether I do that somehow from within DT, or quit DT and OCR all the PDFs in its database, then somehow get DT to re-index them, doesn’t really matter to me.
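
To make the second option concrete, here is a minimal sketch of the batch step only, not a tested DT workflow. It assumes the PDFs sit in an ordinary folder (for example, exported from DT, or indexed rather than imported), and it swaps in the open-source OCRmyPDF command-line tool in place of FineReader, with Tesseract's Japanese and English language data installed. The function names `build_ocr_command` and `ocr_folder` are hypothetical, and whether DT picks up the replaced files on its own is also an assumption that would need checking.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def build_ocr_command(src, dst, languages=("jpn", "eng")):
    """Build the command line for one file.

    Assumption: the OCRmyPDF CLI is installed. `-l` selects the OCR
    languages and `--skip-text` leaves pages that already have text alone.
    """
    return ["ocrmypdf", "-l", "+".join(languages), "--skip-text", str(src), str(dst)]


def ocr_folder(folder, run=subprocess.run):
    """OCR every PDF under `folder`, replacing each file under its old name.

    Returns the list of paths that were replaced.
    """
    replaced = []
    # sorted() materializes the file list before any temporary files exist.
    for pdf in sorted(Path(folder).rglob("*.pdf")):
        # Work in a scratch directory next to the file so the final rename
        # stays on one filesystem and the original is never half-written.
        with tempfile.TemporaryDirectory(dir=pdf.parent) as scratch:
            tmp = Path(scratch) / pdf.name
            try:
                run(build_ocr_command(pdf, tmp), check=True)
            except subprocess.CalledProcessError:
                # Leave the original untouched if OCR fails on this file.
                continue
            os.replace(tmp, pdf)  # same name, new searchable content
            replaced.append(pdf)
    return replaced
```

Calling `ocr_folder("/path/to/exported/pdfs")` would process the whole tree. Keeping each file's name and location unchanged is deliberate, so that whatever refers to the file by path still finds it afterwards. Running it on a copy of the folder first seems prudent.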

naljudaibi · February 29, 2016, 2:09pm

Sorry to post on this old thread.

Mason, did you find a way to OCR your documents with external OCR from within DT?