OCR bug for certain pdfs on DEVONthink Pro 3.8.6

porcupine945 · September 29, 2022, 10:21am

I have been using DEVONthink 3 Pro for years and I’ve been relying upon the OCR function heavily, with no real errors or issues - everything just works.

I recently updated to version 3.8.6 and I now appear to be experiencing a bug using the OCR to searchable PDF function for certain pdfs. In short, it appears to start OCRing the pdfs, but then it seems to stop after a certain number of pages, with no error message or notification (or at least none that I can find). For example, I have one pdf (57 pages) that I tried to OCR to searchable PDF this morning. I can see using the Activity window that it begins to OCR the document. But when it gets to page 20, it simply disappears from the Activity window and I am pretty sure that the OCR function failed completely. A similar thing happens on a much longer pdf when it gets to around page 27.

I should note that I have occasionally had difficulty highlighting text in certain pdfs: the text appears to be selectable, but until the document has been run through the OCR feature the highlight menu is greyed out for some reason. I recall looking into this on the forum and learning that a solution was to OCR these documents to searchable PDF. This always worked in the past: after I’d OCR’d to searchable pdf I was able to highlight the text. The 57-page pdf I’ve mentioned above is one of these. So the reason I am fairly sure the OCR function hasn’t worked is that I still can’t highlight anything.

I’d be very grateful if anyone could help - it appears to be a bug in the new version and I have no idea what the solution is.

cgrunenberg · September 29, 2022, 10:23am

Do you use a Mac with an Intel or Apple (M1/M2) chip?

porcupine945 · September 29, 2022, 10:35am

I’m using an M1 MacBook Air.

aedwards · September 29, 2022, 1:24pm

Can you open a bug report? Open the Help menu and, while holding the Option key, select “Report bug”. This will open an email; at the top of it, add “for attention of Alan”.

porcupine945 · September 29, 2022, 1:37pm

Done. Thank you.

tjur · September 29, 2022, 1:50pm

Same problem here! I will send a bug report, too!

I think the problem has something to do with wrong page orientation. As soon as you “print” the PDF file as a new PDF, the orientation is correct and the OCR works.
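For anyone who wants to check the orientation theory on their own files, here is a minimal Python sketch. It assumes that "page orientation" shows up as the PDF `/Rotate` page entry, which is a guess and not something confirmed in this thread. It also only sees entries stored uncompressed, so PDFs that keep their page dictionaries in compressed object streams will report nothing. The name `find_rotate_values` is made up for illustration.

```python
import re
from collections import Counter

# Matches a /Rotate key followed by a direct integer, e.g. "/Rotate 90".
# Assumption: the value is written inline, not as an indirect reference.
ROTATE_RE = re.compile(rb"/Rotate\s+(-?\d+)")


def find_rotate_values(pdf_bytes: bytes) -> Counter:
    """Count the /Rotate values that appear uncompressed in raw PDF bytes."""
    return Counter(int(m.group(1)) for m in ROTATE_RE.finditer(pdf_bytes))
```

To try it on a file, pass in the raw bytes, for example `find_rotate_values(open(path, "rb").read())`. A mix of values such as 0 and 90 across pages would at least be consistent with the theory; an empty result proves nothing either way.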

Grogol · October 2, 2022, 11:23am

I’m seeing the same problem:

OCR stops after first page
no error message or log
problem is solved by deskewing (in my case using PDFScanner)
file can now be OCRed by DEVONthink

The documents are not particularly askew; I don’t believe that is why DEVONthink is not OCRing them.

BLUEFROG · October 2, 2022, 3:25pm

I have no files to corroborate your theory but we are testing a fix for the OCR engine, if you’re on an Apple Silicon Mac.

Grogol · October 2, 2022, 3:37pm

That’s great! Yes, on Apple Silicon.

kikujiro · October 3, 2022, 7:48am

I have a related issue with OCR in the new version. DTP was failing to OCR a PDF, as above. I printed it to a new PDF and it OCR’d successfully. But despite my having “PDF Resolution” set to “As source” in OCR settings, the resolution of the searchable PDF has degraded quite badly (see comparison below).

aedwards · October 3, 2022, 8:06am

We will be releasing an update to the OCR engine today that will include a fix for this.

CWM · October 4, 2022, 1:58am

Yeah, I’m having the exact same problem with OCR. I’m using an M1 MacBook Pro.

BLUEFROG · October 4, 2022, 2:09am

Quit and relaunch DEVONthink and update the OCR engine via DEVONthink 3 > Install Add-Ons.

porcupine945 · October 4, 2022, 6:39am

Many thanks. This is now working for me.

stephenjw · October 4, 2022, 7:13am

Hi Jim (& Alan)

Thanks very much for fixing the image degradation on OCR - it looks so much better.

I’m getting some big increases in file sizes when I apply OCR and thought it might help if I posted a summary in case anyone else is having the same experience.

Attached is an example - original pdf is 287 KB, after OCR it’s 11 MB.

Before - 287 KB.pdf (280.0 KB)

After - 11 MB.pdf (10.4 MB)

I’m using an M1 MacBook Air, macOS Monterey v 12.6.

OCR preferences in DT below:

I tried setting PDF Resolution to 200 dpi, instead of “As source” but it still came out at 11 MB.

The original PDF already had OCR but since the PDFKit font issues, I re-do the OCR in DT (and haven’t had a corrupt PDF since).

I recall file size increases on OCR being raised in this forum a couple of years ago as an issue.

I thought the increase in file size might be ABBYY doing something strange but as a comparison I exported the original (287 KB) pdf to TIFF using Acrobat DC, recombined the TIFF pages (again using Acrobat DC), which produced an 11.7 MB file. Applying OCR in Acrobat took it up to 11.8 MB - roughly the same (a bit bigger, in fact) as ABBYY.

So I’m guessing it might be a function of ABBYY converting to an image? That could be a good thing in that it takes off the existing PDF layers which can cause the corrupt OCR font issue in PDFKit.
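To put rough numbers on that guess, here is a back-of-the-envelope sketch in Python. The page size (US Letter), the bit depths, the 1 MB = 1024 KB conversion and the function names are all assumptions for illustration. The page count and image compression of the attached files are not known, so this only shows the order of magnitude of one uncompressed page image, alongside the growth factor for the two attachment sizes listed above.

```python
def raster_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes of one page rendered as a bitmap."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


def growth_factor(before_kb: float, after_kb: float) -> float:
    """How many times larger the file is after OCR."""
    return after_kb / before_kb


# One US Letter page (8.5 x 11 in) at 200 dpi, before any compression.
letter_gray_200 = raster_bytes(8.5, 11, 200, 8)    # 8-bit greyscale
letter_color_200 = raster_bytes(8.5, 11, 200, 24)  # 24-bit colour

# Attachment sizes from the post: 280.0 KB before, 10.4 MB after.
factor = growth_factor(280.0, 10.4 * 1024)

print(f"greyscale page: {letter_gray_200 / 1e6:.1f} MB uncompressed")
print(f"colour page:    {letter_color_200 / 1e6:.1f} MB uncompressed")
print(f"growth factor:  about {factor:.0f}x")
```

Even with compression, a page stored as an image usually takes far more space than the same page stored as text and vector data, so a jump of this size is plausible if the OCR step rasterizes each page.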

If the extra MB is the price of not having unreadable OCR fonts because of the PDFKit issue, for me the extra memory use is a price worth paying. But if you know of anything obvious that I’m doing wrong, please let me know.

It’s not a big or urgent issue, though.

Thanks very much, as always.

CWM · October 4, 2022, 11:48am

Thanks, Jim. Problem solved!

BLUEFROG · October 4, 2022, 1:55pm

You’re welcome - though the credit technically goes to @aedwards for the fix!
