OCR fails from scan import in Catalina

avatar · July 28, 2020, 4:53pm

There appears to be a problem with OCR when importing via scanning in DT3 version 3.5.1 running on macOS 10.15.6.

I scanned in a few pages of text documents to make a PDF, as I’ve done hundreds of times before, with OCR selected of course. The PDF was made (I noticed “recognising”, etc. in the activity indicator at bottom left), and the file was listed as “PDF+Text”.

I thought all was well.

But when I tried to select any text in the PDF document, I couldn’t. Often it would just select/highlight the entire page - as if the OCR had actually failed, despite the file being listed as “PDF+Text”.

In the end I rescued the trashed jpg files, opened one in Preview, and added more of them one by one to Preview’s ‘thumbnails’ sidebar, “printed” that as a PDF to the desktop, then imported that printed file to DT3, and performed the “Convert > to searchable PDF” on it, which did result in a properly OCRd file.

But it won’t work from the import/scan process.

The ABBYY FineReader OCR “extra” is installed, btw.

I just restarted DT3 and tried again, with the same results.

What might be wrong?

avatar · July 28, 2020, 4:55pm

As a rider to my remarks, I notice that, curiously, searching for some text (that you know is there) in one of these “text not selectable” apparently OCRd PDFs does find the text on the page. It’s just not selectable (and the found text isn’t highlighted, as well as being not selectable).
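For readers who want to tell “OCR produced no text at all” apart from “text is there but not selectable” outside DEVONthink, here is a rough sketch. It is not anything DEVONthink or ABBYY provides: the function name `pdf_mentions_fonts` is made up for illustration, and it rests on the assumption that a scan-only PDF contains just images while an OCR text layer needs at least one font resource. It is a heuristic, not a real PDF parser.

```python
import re
import zlib


def pdf_mentions_fonts(data: bytes) -> bool:
    """Heuristic check for a text layer: does the PDF reference any /Font?

    Assumption: a scan-only PDF holds just images, while an OCR text layer
    needs at least one font resource. This is not a full PDF parse.
    """
    # Uncompressed objects: the name is visible in the raw bytes.
    if b"/Font" in data:
        return True
    # Newer PDFs may pack objects into Flate-compressed streams; try each one.
    for match in re.finditer(rb"stream\r?\n(.*?)endstream", data, re.DOTALL):
        try:
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image stream)
        if b"/Font" in inflated:
            return True
    return False
```

Usage would be something like `pdf_mentions_fonts(open("scan.pdf", "rb").read())`, where `scan.pdf` is a placeholder file name. A `True` result for a file whose text cannot be selected would suggest that a text layer was written and the problem lies elsewhere, which matches the observation that search still finds the text.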

BLUEFROG · July 28, 2020, 5:15pm

Is the PDF marked as PDF Document or PDF+Text?

PS: While we are continuing to investigate some OCR issues with ABBYY, some fixes to OCR are coming in the next maintenance release.

avatar · July 28, 2020, 5:18pm

As I mentioned, the PDF is indeed listed as “PDF+Text”.

The weird bit is that I can find text on the page, but just no highlighting when found, and more importantly, no selecting!

This is quite important for me as I regularly scan in lots of printed documents for OCRd PDFs.

BLUEFROG · July 28, 2020, 5:24pm

Ahh… Sorry, I missed that.

I’m curious: If you run OCR on the file again, does it behave the same or differently?

avatar · July 28, 2020, 5:30pm

Aha! Well I got the warning dialog, "Are you sure you want to convert this searchable PDF again … ", but I clicked the ‘Convert’ button anyway, and on the newly created file, the text is selectable.

Hooray. But boo also, as I don’t really want to do it twice on every scan. But thanks for the suggestion as a workaround. It seems to work.

avatar · July 28, 2020, 5:43pm

Trying that “re-converting” workaround again using 300dpi (I usually use 150), I thought that the app had hanged (hung?) on the process. It took a long time - 10 minutes or more - and the DTOCRHelper process swallowed 15GB of RAM. (I only tried 300dpi thinking it might help. It didn’t.)

BLUEFROG · July 28, 2020, 6:06pm

This should be resolved in the next maintenance release.

avatar · July 28, 2020, 6:14pm

Great. I’ll look forward to that fix. Thanks.

BLUEFROG · July 28, 2020, 6:20pm

No problem.

amalis · August 5, 2020, 3:44pm

I’ve encountered the same issue with a ScanSnap iX500, macOS 10.15.6, and DEVONthink 3.5.1. Until the maintenance release is available, I’ve been scanning to disk with ScanSnap Manager doing the OCR, then importing into DT.

avatar · August 10, 2020, 7:22am

Yes, I’ve had to rely on other software as well, as the “converting twice” method can be hit and miss. Hardly “no problem” really. I’ll be glad when I can scan documents straight into DT3 again, using DT3.

avatar · August 13, 2020, 3:20pm

The problem is still not fixed in the latest 3.5.2 update!

The ABBYY download happened, but scanned documents, OCRd, produce no selectable text!

Furthermore, as has happened before, if I opt for the scan to go to a “new binder”, the scan process completely forgets about that as soon as I click scan.

BLUEFROG · August 13, 2020, 3:48pm

What kind of scanner are you using?

avatar · August 13, 2020, 3:51pm

Epson Perfection 4990.

FineReader has no problems OCR-ing it, but with DT3 I always have to scan twice!

(That is, not “scan twice”: after the unsuccessful OCR, I have to run OCR > to searchable PDF on the file.)

BLUEFROG · August 13, 2020, 5:31pm

Hmm… I’m doing a scan with an HP OfficeJet 9010 in DEVONthink’s Import sidebar, with OCR enabled.

The text is fully selectable in the finished file.

Outside of manually reinstalling the OCR components, @aedwards would have to comment on this further.

avatar · August 13, 2020, 5:52pm

Well, it isn’t for me. Yes, Import sidebar, OCR enabled, of course. Considering that FineReader successfully OCRs documents, there’s no reason to think that anything other than DT3 is at fault (especially since we saw this identically behaving bug in the previous release).

avatar · August 13, 2020, 8:20pm

Looking around similar threads I notice that re-installing the ABBYY DTOCRHelper application seems to help. I find I have an older version 1.1.2 (as opposed to the version 1.1.13 installed today with the DT3 3.5.2 update). Should I try replacing the newer version with the old? It seems a bit of a kludge.

BLUEFROG · August 13, 2020, 8:23pm

As I said previously: “Outside of manually reinstalling the OCR components, @aedwards would have to comment on this further.”

avatar · August 13, 2020, 8:28pm

Uh … OK …
