Further degradation of OCR PDF image layer in 3.0.2

When I updated from 3.0.1 to 3.0.2, DT appeared to automatically install a new version of ABBYY. The image layer in OCR’d PDFs now seems significantly degraded.

I had hoped that the new OCR engine would produce PDFs with a less degraded image layer than I’ve been seeing in 3.0.1. When the imported scan is not of great quality, I usually have to save the document in both OCR and image versions, because the OCR typically produces a PDF that, while searchable, is harder or less pleasant to read. Instead of an improvement, there is a decline in quality. I hope there is a straightforward fix.

PDF before OCR:

[screenshot]

PDF after OCR in DT 3.0.1:

[screenshot]

PDF after OCR from original image-only PDF in DT 3.0.2:

[screenshot]

Here are the settings used for both OCR imports: Compress PDF is off. Deskew and Page Orientation are on.

I would prefer a PDF that is close to, or identical to, the imported version in legibility, even if the resulting file is larger, as currently I’m saving two copies anyway. The 3.0.1 version is fuzzy but easier to read than the 3.0.2 version. And of course the unprocessed image is much better than both. Thanks for any advice!

Same here. No matter how I set the option under the OCR tab, to compress or not to compress the PDF, it seems that converting a document to a searchable PDF compresses it anyway. I just want to keep the original PDF with added searchable text.

Ouch! I hadn’t realised the effect the built-in OCR process was having; this is surely a critical issue.

Same issue here.

Yup, same here. Compressing seems to make it worse, but it’s bad even with that unchecked.

I just tried uninstalling DEVONthink 3.0.2 and then reinstalling 3.0.1. I also downloaded and installed the ABBYY FineReader plug-in afterwards, but it’s still the same. I believe the issue is the ABBYY FineReader plug-in that’s currently in the cloud, because 3.0.1 now behaves the same as 3.0.2 did after the upgrade.

I’m using the current version of ScanSnap, with ScanSnap doing the OCR, and I set up DEVONthink NOT to perform OCR. It seems to be working well, but I don’t know whether this setup changes anything else. Scans are clear and have OCR.

The OCR in PDFPen Pro works well. I found this script for Hazel that will run the OCR through PDFPen Pro. I tried making a smart rule in DT3 to do this, as well, but it failed to run.

Does anyone have an idea how to get this to work? I’ve done virtually nothing in AppleScript, so I don’t know the language.

The post contains the AppleScript, which you can copy and paste. You don’t need to modify anything.

I tried putting that exact script into a Smart Rule, and I also created a script that I added to the Scripts menu. Neither works. When I run either, there’s no activity in Activity Monitor.

The script has to be used with Hazel, which is not a DEVONtechnologies product.

Thus my original question on whether anyone knew how to make this work through DT3.

I found the post “Abbyy finereader within DT 3”, which shows the location of DTOCRHelper, the helper that DEVONthink uses to OCR PDFs. Both the newly downloaded file and the one already on my computer are dated November 8th, 2019. Does anyone have a DTOCRHelper file from before this date?

Resolution: Use Time Machine to restore ~/Library/Application Support/DEVONthink 3/Abbyy/DTOCRHelper.app to a version from before the upgrade.
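For anyone without a Time Machine backup to fall back on next time, here is a minimal Python sketch that reports when the helper bundle was last modified and keeps a date-stamped copy of it before an update replaces it. The path is the one quoted above; the function names and the backup folder are hypothetical, and this is only an illustration of the idea, not anything provided by DEVONthink or ABBYY.

```python
import shutil
from datetime import datetime
from pathlib import Path

# Location quoted in this thread; adjust if your installation differs.
HELPER = Path.home() / "Library/Application Support/DEVONthink 3/Abbyy/DTOCRHelper.app"


def newest_mtime(bundle: Path) -> datetime:
    """Return the most recent modification time of the bundle or anything inside it."""
    paths = [bundle, *bundle.rglob("*")]
    # lstat() avoids errors on dangling symlinks inside the bundle.
    return datetime.fromtimestamp(max(p.lstat().st_mtime for p in paths))


def backup_helper(bundle: Path, dest_dir: Path) -> Path:
    """Copy the bundle into dest_dir under a date-stamped name and return the copy."""
    stamp = newest_mtime(bundle).strftime("%Y%m%d")
    target = dest_dir / f"{bundle.stem}-{stamp}{bundle.suffix}"
    shutil.copytree(bundle, target, symlinks=True)
    return target


if __name__ == "__main__":
    if HELPER.exists():
        print("Helper last modified:", newest_mtime(HELPER))
    else:
        print("Helper not found at", HELPER)
```

Calling `backup_helper(HELPER, some_folder)` before installing an update would leave a copy such as `DTOCRHelper-20191108.app` in that folder, which could later be put back in place of the newer helper.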

We need to be able to count on the IMAGE being of high quality, because OCR is, at best, not fully accurate. I hope ERIC and the design team are reading this. When changes are made that impact our data, it would be nice to KNOW or BE ADVISED. I, for one, do not appreciate DT making decisions for me, especially when they affect MY data. Awaiting your comments, DT staff.

We are currently working with ABBYY to determine the cause of the degraded image quality. This is an issue with the ABBYY FineReader product and they are currently investigating this problem.

Thank you for the hint and file location:

I copied from Time Machine the previous version 1.0.23, which solved it in my case.

I have also reverted to the previous version, but the image is still degraded compared with the original scan. OCR with PDFPenPro does not affect the image layer, at least as far as I can detect; however, that of course entails an additional step, and the ABBYY OCR is marginally better.
Please ask ABBYY to include an unaltered image as an option, in addition to the compression choices.

A lot of OCR software on the Mac sucks big time by not being able to preserve the CCITT T.6 (Group 4) lossless image compression when rewriting a PDF file. This compression format has been the de facto industry standard for scanning and archiving for decades, but Apple’s PDFKit framework seems only to be able to decode (= display) such PDFs and apparently cannot write them. Software companies producing OCR or document management tools for the Mac are either unaware of this problem (which has existed for many, many years; e.g. see this forum post from 2012) or just don’t bother.

I had hoped that with DTP v3 things would have changed for the better, but apparently not.

So on a Mac one should either not alter PDF files coming from a scanner (all the original bundled scanner software that I know of does CCITT Group 4) or use Adobe Acrobat. Both alternatives suck.
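To check whether an OCR pass actually re-encoded the image layer, one rough approach is to count the stream filter names in the PDF before and after. The sketch below is a heuristic in Python, not a real PDF parser: it scans the raw bytes for the standard PDF filter names, so it cannot tell which stream each filter belongs to, and FlateDecode in particular is also used for non-image streams. The function names are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

# Standard PDF stream filter names and the compression each one indicates.
IMAGE_FILTERS = {
    b"CCITTFaxDecode": "CCITT Group 3/4 fax (lossless, bilevel)",
    b"JBIG2Decode": "JBIG2 (bilevel)",
    b"DCTDecode": "JPEG (lossy)",
    b"JPXDecode": "JPEG 2000",
    b"FlateDecode": "Flate/zlib (lossless, also used for non-image streams)",
}


def count_filters(pdf_bytes: bytes) -> Counter:
    """Count occurrences of each known filter name in the raw PDF bytes."""
    counts = Counter()
    for name in IMAGE_FILTERS:
        counts[name.decode()] = len(re.findall(rb"/" + name + rb"\b", pdf_bytes))
    return counts


def report(path: Path) -> None:
    """Print the filter counts for one PDF file."""
    counts = count_filters(path.read_bytes())
    print(path.name)
    for name, description in IMAGE_FILTERS.items():
        print(f"  {counts[name.decode()]:4d}  /{name.decode()}  {description}")
```

Running `report()` on the original scan and on the OCR’d copy gives a quick comparison: if the original shows /CCITTFaxDecode entries and the OCR’d file shows only /DCTDecode or /FlateDecode, the images were most likely re-encoded rather than kept as Group 4.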

That doesn’t look like MRC compression.
When a black-and-white PDF is compressed with MRC, it gets more jagged.

FineReader Pro output shown: left, uncompressed; right, low quality.

Do your PDFs look better in Adobe Acrobat Reader? (Preview, and by extension all renderers using PDFKit, prerender in low-resolution greyscale. But usually this clears up within a second, if not instantaneously.)