Migrating from Evernote - Re-OCR?

Scovillain · June 6, 2019, 7:46am

Hi,
I finally took (actually, am still taking) the step to migrate my things from Evernote to DT3. I have an iX500 scanner and did all my OCR on the Evernote side (premium user).

Is it advisable to rescan all my PDF files in DT3? And is there an easy way to do this?

I already tested it on two files and, besides creating an extra document, I noticed the file size actually dropped to nearly 50% of the original.

Thx in advance

Blanc · June 6, 2019, 8:18am

I actually did rescan mine, but I think you may sort of be answering your own question here:

DT will often produce a smaller file, which will be of reduced quality (which might or might not be visible to you). The DT guys feel they have optimally balanced size/quality. YMMV.
I found the text recognition quality in the files I re-OCR’d to be better than in those which I had simply imported (but I can’t remember whether I used ScanSnap Manager or Evernote to perform the OCR originally; in any case the OCR was performed years ago, so a newer engine might perform better).

You can set DT to move the original to trash following OCR (Preferences/OCR -> Original Document Move to Trash), so DT wouldn’t (visibly) create an extra document for each OCR. I recently used DT3b2 to OCR approx. 1800 PDFs, which it quietly did in the background with no apparent problems. An idea might be: if you need the extra space or feel the current OCR quality is not ideal, then use DT to re-OCR. Otherwise, save the carbon dioxide.

A word of warning: there is a known bug in the ABBYY engine and/or DT3b1/2 which can lead to misinterpretation of rotation data; the page is then re-oriented, but some text is lost in the process (see the thread “Information after an OCR scan is lost - half of the document is missing”). You might want to wait for that particular bug to be dealt with prior to re-OCR-ing, depending on the documents involved.

BLUEFROG · June 6, 2019, 12:37pm

Scovillain wrote: “did all my OCR on Evernote side (premium user).”

Unless Evernote has changed something recently, the OCR’d data is only on their servers, so you’d need to do OCR again in DEVONthink. Again, this may have changed, but that’s the last state I knew of things.

Scovillain wrote: “Is it advisable to rescan all my PDF files in DT3?”

I wouldn’t say you’d have to rescan. If the originals were scanned at a good quality and resolution, e.g., 200-300 dpi, you should be able to just run OCR on them.

Scovillain · June 6, 2019, 12:43pm

Thanks for your answers. I decided to run OCR again and DT3 is working on it right now. From what I see at first glance, it does a noticeably better job than the recognition I did over the last few years (a mix of Evernote, ABBYY and whatnot).

BLUEFROG wrote: “I wouldn’t say you’d have to rescan”

Sorry, I meant re-OCR, not rescan.

BLUEFROG · June 6, 2019, 12:52pm

No worries!