OCR fails from scan import in Catalina

There appears to be a problem with OCR on import via scanning in DT3 version 3.5.1 running on macOS 10.15.6.

I scanned in a few pages of text documents, to make a PDF, as I’ve done hundreds of times before; selected OCR of course, and the PDF was made (and I noticed “recognising”, etc on the activity thing at bottom left), and the file got listed as “PDF+Text”.

I thought all was well.

But when I tried to select any text in the PDF document, I couldn’t. Often it would just select/highlight the entire page - as if the OCR had actually failed, despite the file being listed as “PDF+Text”.

In the end I rescued the trashed jpg files, opened one in Preview, and added more of them one by one to Preview’s ‘thumbnails’ sidebar, “printed” that as a PDF to the desktop, then imported that printed file to DT3, and performed the “Convert > to searchable PDF” on it, which did result in a properly OCRd file.

But it won’t work from the import/scan process.

The ABBYY FineReader OCR “extra” is installed, btw.

I just restarted DT3 and tried again, with the same results.

What might be wrong?

As a rider to my remarks, I notice that, curiously, searching for some text (that you know is there) in one of these “text not selectable” apparently OCRd PDFs does find the text on the page. It’s just not selectable (and the found text isn’t highlighted, as well as being not selectable).

Is the PDF marked as PDF Document or PDF+Text?

PS: While we are continuing to investigate some OCR issues with ABBYY, some fixes to OCR are coming in the next maintenance release.

As I mentioned, the PDF is indeed listed as “PDF+Text”.

The weird bit is that I can find text on the page, but just no highlighting when found, and more importantly, no selecting!

This is quite important for me as I regularly scan in lots of printed documents for OCRd PDFs.

Ahh… Sorry, I missed that.

I’m curious: If you run OCR on the file again, does it behave the same or differently?

Aha! Well I got the warning dialog, "Are you sure you want to convert this searchable PDF again … ", but I clicked the ‘Convert’ button anyway, and on the newly created file, the text is selectable.

Hooray. But boo also, as I don’t really want to do it twice on every scan. But thanks for the suggestion as a workaround. It seems to work.

Trying that “re-converting” workaround again using 300dpi (I usually use 150), I thought that the app had hanged (hung?) on the process. It took a long time - 10 minutes or more - and the DTOCRHelper process swallowed 15GB of RAM. (I only tried 300dpi thinking it might help. It didn’t.)

This should be resolved in the next maintenance release.

Great. I’ll look forward to that fix. Thanks.

No problem.

I’ve encountered the same issue with a ScanSnap ix500, macOS 10.15.6, and DEVONthink 3.5.1. Until the maintenance release is available, I’ve been scanning to disk with ScanSnap Manager doing the OCR, then importing into DT.

Yes, I’ve had to rely on other software as well, as the “converting twice” method can be hit and miss. Hardly “no problem” really. I’ll be glad when I can scan documents straight into DT3 again, using DT3.

The problem is still not fixed in the latest 3.5.2 update!

The ABBYY download happened, but scanned documents, OCRd, produce no selectable text!

Furthermore, as has happened before, if I opt for the scan to go to a “new binder”, the scan process completely forgets about that as soon as I click scan.

What kind of scanner are you using?

Epson Perfection 4990.

FineReader has no problems OCR-ing it, but with DT3 I always have to scan twice!

(that is, not “scan twice”, but after the unsuccessful OCR, go for the OCR > to searchable PDF)

Hmm… I’m doing a scan with an HP OfficeJet 9010 in DEVONthink’s Import sidebar, with OCR enabled.

The text is fully selectable in the finished file.

Outside of manually reinstalling the OCR components, @aedwards would have to comment on this further.

Well, it isn’t for me. Yes, import sidebar, OCR enabled - of course. Considering that FineReader successfully OCRs documents, there’s no reason to think that it’s anything other than DT3 at fault (and especially since we saw this identically behaving bug in the previous release version).

Looking around similar threads I notice that re-installing the ABBYY DTOCRHelper application seems to help. I find I have an older version 1.1.2 (as opposed to the version 1.1.13 installed today with the DT3 3.5.2 update). Should I try replacing the newer version with the old? It seems a bit of a kludge.

As I said previously,

Outside of manually reinstalling the OCR components, @aedwards would have to comment on this further.

Uh … OK …
