OCR bug for certain pdfs on DEVONthink Pro 3.8.6

I have been using DEVONthink 3 Pro for years and I’ve been relying upon the OCR function heavily, with no real errors or issues - everything just works.

I recently updated to version 3.8.6 and now appear to be hitting a bug with the OCR to searchable PDF function on certain PDFs. In short, it starts OCRing the PDF but then stops after a certain number of pages, with no error message or notification (or at least none that I can find). For example, this morning I tried to OCR one 57-page PDF to searchable PDF. I can see in the Activity window that it begins to OCR the document, but when it gets to page 20 it simply disappears from the Activity window, and I am pretty sure that the OCR failed completely. A similar thing happens with a much longer PDF when it gets to around page 27.

I should note that I have occasionally had difficulty highlighting text in certain PDFs: the text appears to be selectable, but until I run the document through OCR the highlight menu is greyed out for some reason. I recall looking into this on the forum and learning that a solution was to OCR these documents to searchable PDF. This always worked in the past – after I’d OCR’d to searchable PDF I was able to highlight the text. The 57-page PDF I’ve mentioned above is one of these. So the reason I am fairly sure the OCR hasn’t worked is that I still can’t highlight anything.

I’d be very grateful if anyone could help - it appears to be a bug in the new version and I have no idea what the solution is.

Do you use a Mac with an Intel or Apple (M1/M2) chip?

I’m using an M1 MacBook Air.

Can you open a bug report? Open the Help menu and, while holding the Option key, select “Report bug”. This will open an email; add “for attention of Alan” at the top.

Done. Thank you.

Same problem here! I will send a bug report, too!

I think the problem has something to do with wrong page orientation. As soon as you “print” the PDF file as a new PDF, the orientation is correct and the OCR works.
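For anyone who wants to test the orientation theory on their own files, here is a rough Python sketch that looks for `/Rotate` entries in a PDF's raw bytes. It is only a heuristic under stated assumptions: the helper names are made up for illustration, it cannot see page objects stored in compressed object streams, and a `/Rotate` value can be inherited from a parent node, so a clean result does not prove that no page is rotated. Nothing here has been confirmed as the cause of the OCR failure.

```python
import re
from typing import List

# Matches "/Rotate 90", "/Rotate -90", "/Rotate 0", ... in raw PDF bytes.
ROTATE_RE = re.compile(rb"/Rotate\s+(-?\d+)")


def find_rotations(pdf_bytes: bytes) -> List[int]:
    """Return every /Rotate value found in the raw bytes, normalised to 0-359."""
    return [int(m.group(1)) % 360 for m in ROTATE_RE.finditer(pdf_bytes)]


def has_rotated_pages(pdf_bytes: bytes) -> bool:
    """True if any visible /Rotate entry is a non-zero angle."""
    return any(angle != 0 for angle in find_rotations(pdf_bytes))
```

To try it, read the file in binary mode, e.g. `has_rotated_pages(open(path, "rb").read())`, where `path` is a placeholder for a PDF that fails to OCR.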

I’m seeing the same problem:

  • OCR stops after first page
  • no error message or log
  • problem is solved by deskewing (in my case using PDFScanner)
  • file can now be OCRed by DEVONthink

The documents are not particularly askew; I don’t believe that is why DEVONthink is not OCRing them.

I have no files to corroborate your theory but we are testing a fix for the OCR engine, if you’re on an Apple Silicon Mac.

That’s great! Yes, on Apple Silicon.

I have a related issue with OCR in the new version. DTP was failing to OCR a PDF, as above. I printed it to a new PDF and it OCR’d successfully. But despite my having “PDF Resolution” set to “As source” in OCR settings, the resolution of the searchable PDF has degraded quite badly – see comparison below.

We will be releasing an update to the OCR that will include a fix for this today.

Yeah, I’m having the exact same problem with OCR. I’m using an M1 MacBook Pro.

Quit and relaunch DEVONthink and update the OCR engine via DEVONthink 3 > Install Add-Ons.

Many thanks. This is now working for me.

Hi Jim (& Alan)

Thanks very much for fixing the image degradation on OCR - it looks so much better. :slightly_smiling_face:

I’m getting some big increases in file sizes when I apply OCR and thought it might help if I posted a summary in case anyone else is having the same experience.

Attached is an example - original pdf is 287 KB, after OCR it’s 11 MB.

Before - 287 KB.pdf (280.0 KB)

After - 11 MB.pdf (10.4 MB)

I’m using an M1 MacBook Air, macOS Monterey v 12.6.

OCR preferences in DT below:

I tried setting PDF Resolution to 200 dpi, instead of “As source” but it still came out at 11 MB.

The original PDF already had OCR but since the PDFKit font issues, I re-do the OCR in DT (and haven’t had a corrupt PDF since).

I recall file size increases on OCR being raised in this forum a couple of years ago as an issue.

I thought the increase in file size might be ABBYY doing something strange but as a comparison I exported the original (287 KB) pdf to TIFF using Acrobat DC, recombined the TIFF pages (again using Acrobat DC), which produced an 11.7 MB file. Applying OCR in Acrobat took it up to 11.8 MB - roughly the same (a bit bigger, in fact) as ABBYY.

So I’m guessing it might be a function of ABBYY converting to an image? That could be a good thing in that it takes off the existing PDF layers which can cause the corrupt OCR font issue in PDFKit.
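As a rough sanity check on the "converting to an image" guess, here is a small Python sketch of how large a page becomes once it is stored as a bitmap. The page size (A4) and 24-bit colour depth are my assumptions, not something taken from the file, and the function name is just illustrative. Real PDFs compress their images, so these figures are an upper bound per page rather than a prediction of the final file size.

```python
def raster_bytes(width_in, height_in, dpi, bits_per_pixel=24):
    """Uncompressed size in bytes of one page rendered as a bitmap."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page size: A4 (8.27 x 11.69 inches). Not taken from the actual file.
for dpi in (150, 200, 300):
    mb = raster_bytes(8.27, 11.69, dpi) / 1_000_000
    print(f"{dpi} dpi: {mb:.1f} MB uncompressed per page")
```

The point is only that pixel data grows with the square of the resolution, so a page that was a few KB of text and vector data can plausibly become far larger once rasterised, even after compression.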

If the extra MB is the price of not having unreadable OCR fonts because of the PDFKit issue, then for me the larger file size is a price worth paying. But if you know of anything obvious that I’m doing wrong, please let me know.

It’s not a big or urgent issue, though.

Thanks very much, as always.

Thanks, Jim. Problem solved!

You’re welcome - though the credit technically goes to @aedwards for the fix! :smiley:
