Further degradation of OCR PDF image layer in 3.0.2

chreliot · November 19, 2019, 4:34pm

When I updated from 3.0.1 to 3.0.2, DT appeared to automatically install a new version of ABBYY. The image layer in OCR’d PDFs seems significantly degraded.

I hoped that the new OCR engine would produce PDFs with an image layer that is no longer so degraded as I’ve been seeing in 3.0.1. When the imported scan is not of great quality, I usually have to save the document in both OCR and image versions, because the OCR typically produces a PDF that, while searchable, is harder or less pleasant to read. Instead of improvement, there is a decline in quality. I hope there is a straightforward fix.

PDF before OCR:

[screenshot]

PDF after OCR in DT 3.0.1:

[screenshot]

PDF after OCR from original image-only PDF in DT 3.0.2:

[screenshot]

Here are the settings used for both OCR imports: Compress PDF is off. Deskew and Page Orientation are on.

I would prefer a PDF that is close to, or identical to, the imported version in legibility, even if the resulting file is larger, as I’m currently saving two copies anyway. The 3.0.1 version is fuzzy but easier to read than the 3.0.2 version. And of course the unprocessed image is much better than both. Thanks for any advice!

jlee0928 · November 19, 2019, 9:31pm

Same here. No matter how I set the option under the OCR tab, to compress or not to compress the PDF, it seems that converting a document to a searchable PDF compresses it anyway. I just want to keep the original PDF with searchable text added.

danio1972 · November 20, 2019, 8:59am

Ouch! I hadn’t realised the effect the built-in OCR process was having; this is surely a critical issue.

wolcy · November 20, 2019, 4:45pm

Same issue here.

RobH · November 20, 2019, 4:55pm

Yup, same here. Compressing seems to make it worse, but it’s bad even with that unchecked.

jlee0928 · November 20, 2019, 4:56pm

I just tried uninstalling DEVONthink 3.0.2 and reinstalling 3.0.1. I also downloaded and installed the ABBYY FineReader plug-in afterwards, but it’s still the same. I believe the issue is the ABBYY FineReader plug-in that’s currently in the cloud, because 3.0.1 now behaves the same as 3.0.2 did after the upgrade.

wolcy · November 20, 2019, 5:39pm

Using the current version of ScanSnap, I have ScanSnap do the OCR and have set up Devon NOT to perform OCR. This seems to be working well, but I have no idea whether this setup changes anything else. Scans are clear and have OCR.

RobH · November 20, 2019, 6:09pm

The OCR in PDFPen Pro works well. I found this script for Hazel that will run the OCR through PDFPen Pro. I tried making a smart rule in DT3 to do this, as well, but it failed to run.

Does anyone have an idea how to get this to work? I’ve done virtually nothing in AppleScript, so I don’t know the language.

jlee0928 · November 20, 2019, 6:28pm

The post includes the AppleScript, which you can copy and paste. You don’t need to modify anything.

RobH · November 20, 2019, 7:45pm

I tried putting that exact script into a Smart Rule, and I also created a script that I added to the scripts menu. Neither works. When I run either, there’s no activity in Activity Monitor.

jlee0928 · November 20, 2019, 7:46pm

The script has to be used with Hazel, which is not a Devon product.

RobH · November 20, 2019, 7:56pm

Thus my original question on whether anyone knew how to make this work through DT3.

jlee0928 · November 20, 2019, 9:11pm

So I found the post “Abbyy finereader within DT 3”, which shows the location of DTOCRHelper, the helper Devon uses to OCR PDFs. Both the freshly downloaded file and the one already on my computer are dated November 8th, 2019. Does anyone have a DTOCRHelper file from before that date?

jlee0928 · November 20, 2019, 9:28pm

Resolution: Use Time Machine to restore the ~/Library/Application Support/DEVONthink 3/Abbyy/DTOCRHelper.app file to a version from before the upgrade.
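
Before and after restoring, it can help to check which build of the helper is actually in place. Here is a rough Python sketch that compares the bundle's modification time against November 8th, 2019, the date reported above. The path comes from this thread. The function name `modified_before` and the idea that the date separates the old build from the new one are my assumptions, and a bundle's modification time is only a rough indicator.

```python
import os
from datetime import datetime
from pathlib import Path

# Path taken from the post above; adjust if your installation differs.
HELPER = Path.home() / "Library/Application Support/DEVONthink 3/Abbyy/DTOCRHelper.app"

# Date of the helper reported in this thread (assumed to mark the newer build).
CUTOFF = datetime(2019, 11, 8)


def modified_before(path, cutoff):
    """Return True if the file or bundle at `path` was last modified before `cutoff`."""
    mtime = datetime.fromtimestamp(os.stat(path).st_mtime)
    return mtime < cutoff


if __name__ == "__main__":
    if not HELPER.exists():
        print("Helper not found at", HELPER)
    elif modified_before(HELPER, CUTOFF):
        print("Helper predates the cutoff; likely the older build.")
    else:
        print("Helper is dated on or after the cutoff; likely the newer build.")
```

It is worth copying the current DTOCRHelper.app somewhere safe before overwriting it, so the change can be undone.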

CaretechDT · November 22, 2019, 6:00pm

We need to be able to count on the IMAGE being of higher quality, as OCR is what it is: at best, not accurate. I hope ERIC and the design team are reading this. When changes that impact our data are made, it would be nice to KNOW or BE ADVISED. I, for one, do not appreciate DT making decisions for me, especially when they affect MY data. Awaiting your comments, DT staff.

aedwards · November 25, 2019, 11:36am

We are currently working with ABBYY to determine the cause of the degraded image quality. This is an issue with the ABBYY FineReader product and they are currently investigating this problem.

cblaha · November 30, 2019, 1:53pm

Thank you for the hint and file location:

I restored the previous version, 1.0.23, from Time Machine, which solved it in my case.

wmc · November 30, 2019, 2:20pm

I have also reverted to the previous version, but the image is still degraded from the original scan. OCR with PDFPenPro does not affect the image layer, at least as far as I can detect; however, that of course entails an additional step, and the ABBYY OCR is marginally better.
Please ask ABBYY to include an unaltered image as an option, in addition to compression choices.

Macster · December 1, 2019, 5:37pm

A lot of OCR software on the Mac sucks big time by not being able to preserve the CCITT T.6 (Group 4) lossless image compression when rewriting a PDF file. This compression format has been the de facto industry standard for scanning and archiving for decades, but Apple’s PDFKit framework seems to only be able to decode (= display) such PDFs and apparently cannot write them, and software companies producing OCR or document management for the Mac are either oblivious to this problem (which has existed for many, many years; e.g., see this forum post from 2012) or just don’t bother.

I had hoped that with DTP v3 things would have changed for the better, but apparently not.

So on a Mac one should either not alter PDF files coming from a scanner (all the bundled scanner software I know of does CCITT Group 4) or use Adobe Acrobat. Both alternatives suck.
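
For anyone who wants to check whether a PDF still carries CCITT data after an OCR pass, here is a rough Python sketch. It scans the raw bytes for the standard PDF stream filter names (`/CCITTFaxDecode`, `/DCTDecode` and so on) and counts them. The function name `count_filters` is my own, and this is a crude byte search, not a real PDF parser: `/FlateDecode` is used for many non-image streams too, and the count does not distinguish Group 3 from Group 4.

```python
import re
from collections import Counter

# Standard PDF stream filter names and a readable label for each.
FILTERS = {
    b"CCITTFaxDecode": "CCITT fax (Group 3/4)",
    b"JBIG2Decode": "JBIG2",
    b"DCTDecode": "JPEG",
    b"JPXDecode": "JPEG 2000",
    b"FlateDecode": "Flate (zlib)",
}


def count_filters(pdf_bytes):
    """Count occurrences of each known filter name in the raw PDF bytes."""
    counts = Counter()
    for name, label in FILTERS.items():
        counts[label] = len(re.findall(rb"/" + name + rb"\b", pdf_bytes))
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python check_filters.py before.pdf after.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            print(path, dict(count_filters(fh.read())))
```

Running it on the scanner's original and on the OCR'd copy should show whether the CCITT streams were replaced by JPEG or something else.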

jerwin · December 1, 2019, 7:22pm

That doesn’t look like MRC compression.
When a black-and-white PDF is compressed with MRC, it gets more jagged.

FineReader Pro output shown: left, uncompressed; right, low quality.

Do your PDFs look better in Adobe Acrobat Reader? (Preview and, by extension, all renderers using PDFKit first prerender in low-resolution greyscale, but this usually clears up within a second, if not instantaneously.)