Further degradation of OCR PDF image layer in 3.0.2

Resolution: Use Time Machine to restore the ~/Library/Application Support/DEVONthink 3/Abbyy/DTOCRHelper.app file to the version prior to the upgrade.

1 Like

We need to be able to count on the IMAGE being of higher quality, since OCR is what it is: at best not fully accurate. I hope ERIC and the design team are reading this. When changes are made that impact our data, it would be nice to KNOW or BE ADVISED. I for one do not appreciate DT making decisions for me, especially when it affects MY data. Awaiting your comments, DT staff.

1 Like

We are currently working with ABBYY to determine the cause of the degraded image quality. This is an issue with the ABBYY FineReader product, and they are investigating it.

5 Likes

Thank you for the hint and file location:

I copied the previous version (1.0.23) from Time Machine, which solved it in my case.

I have also reverted to the previous version, but the image is still degraded from the original scan. OCR with PDFPenPro does not affect the image layer, at least as far as I can detect; however, that of course entails an additional step, and the ABBYY OCR is marginally better.
Please ask ABBYY to include an unaltered image as an option, in addition to compression choices.

A lot of OCR software on the Mac sucks big time by not being able to preserve the CCITT T.6 (Group 4) lossless image compression when rewriting a PDF file. This compression format has been the de facto industry standard for scanning and archiving for decades, but Apple’s PDFKit framework seems to only be able to decode (i.e. display) such PDFs and apparently cannot write them, and software companies producing OCR or document management tools for the Mac are either oblivious to this problem (which has existed for many years, e.g. see this forum post from 2012) or just don’t bother.

I had hoped that with DTP v3 things would have changed for the better, but apparently not.

So on a Mac one should either not alter PDF files coming from a scanner (all bundled scanner software that I know of does produce CCITT Group 4) or use Adobe Acrobat. Both alternatives suck.

That doesn’t look like MRC compression.
When a black-and-white PDF is compressed with MRC, it gets more jagged.

FineReader Pro output shown: left uncompressed, right low quality.

Do your PDFs look better in Adobe Acrobat Reader? (Preview, and by extension all renderers using PDFKit, prerenders pages in low-resolution greyscale, but usually this clears up within a second, if not instantaneously.)

@Macster I think OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF) retains image quality at or near lossless on macOS (via the command line). But I don’t think its OCR fidelity (it uses the Tesseract engine) is as good as ABBYY’s in DT3.
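For reference, a minimal sketch of driving OCRmyPDF from Python rather than the command line; the file names are placeholders and the `optimize` argument is an assumption based on its documented `--optimize` option, so check it against the version you install:

```python
# Minimal sketch: add a Tesseract text layer to a scanned PDF with OCRmyPDF.
# Assumes `pip install ocrmypdf` plus a local Tesseract installation.
import ocrmypdf

ocrmypdf.ocr(
    "scan.pdf",       # input PDF from the scanner (placeholder name)
    "scan-ocr.pdf",   # output PDF with the text layer added (placeholder name)
    optimize=0,       # assumption: mirrors --optimize 0, i.e. skip the image-optimization pass
)
```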

@jerwin Here’s a screenshot of the same part of the file above, OCR’d in DTP 3.0.2 and then opened in Adobe Acrobat Reader DC 2019.21.20056:


To my eye it exhibits the blur effect more apparent in the DTP 3.0.1 image above, rather than the blockiness of the DTP 3.0.2 image, but resolution is still objectionably degraded.

Yes, there are a couple of OCR apps/tools which support the original compression and just add their text layer to the file, which I consider the right thing to do. But it seems that on the Mac a lot of companies just take the simplistic approach and use Apple’s PDFKit without much consideration. I don’t care if PDFs created for printing purposes (e.g. by the macOS Preview app) recompress images, but for archiving purposes it’s a no-go.

One can easily check what is going on by opening the PDF file with a text editor. In PDFs produced by a bundled scanner app one typically finds lines like this for the page images:

`<</Type/XObject/Subtype/Image/Width 2477/Height 3507/ColorSpace/DeviceGray/BitsPerComponent 1/Filter/CCITTFaxDecode/DecodeParms<</K -1/Columns 2477/Rows 3507>>`

Watch out for the /Filter/CCITTFaxDecode — a simplistic app turns this into /Filter/FlateDecode (then it might be lossless conversion, but with a much bigger file size as this compression algorithm is not nearly as well suited for scanned images) or even /Filter/DCTDecode, which is the lossy JPEG compression.
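If you’d rather not eyeball the raw PDF, here is a rough sketch in Python that just counts occurrences of the common filter names in the file’s raw bytes. It only sees dictionaries stored as plain text (the normal case for scanner and OCR output); entries hidden inside compressed object streams won’t be counted.

```python
# Rough check of which stream filters a PDF uses, without third-party libraries.
# It scans the raw bytes for filter names, so it only finds dictionaries stored
# as plain text; filters inside compressed object streams are not visible.
import re
import sys
from collections import Counter

FILTERS = [b"CCITTFaxDecode", b"FlateDecode", b"DCTDecode", b"JBIG2Decode"]

def count_filters(path):
    data = open(path, "rb").read()
    return Counter({name.decode(): len(re.findall(rb"/" + name, data)) for name in FILTERS})

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path, dict(count_filters(path)))
```

Running it on the scanner original and on the OCR’ed copy (e.g. `python3 check_filters.py scan.pdf scan-ocr.pdf`, file names hypothetical) shows whether the OCR step swapped CCITTFaxDecode for something else.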

@jerwin The jagginess is of course a matter of the image resolution, i.e. 300 dpi is more jagged than 600 dpi; with enough pixels, everything looks smooth.

1 Like

For my purposes (archiving business documents) I actually consider Tesseract good enough. No automated OCR will yield perfect results anyway, except perhaps under very controlled conditions, which don’t apply in my case, where each document differs in size, layout, fonts, colors and contents.

So for me OCR just 1. increases the odds that a document will show up in a Spotlight search, and 2. makes it possible to copy individual pieces of data from it, e.g. serial numbers (so I can save some typing, even though I need to check closely whether what’s copied matches the original text). So I’d choose a less-than-perfect OCR result over mangled images inside the PDFs any day.

EDIT: Especially since degrading the image makes it impossible to redo the OCR at some later point with better software. So one would need to keep the original PDF around, too. What nonsense!

To my eye, PDFPen Pro fits that description. I can’t see a difference between the before and after OCR versions.

That sounds great, but also have a look at the size of the resulting PDFs. If the file size increases considerably, that’s not because of the few words of OCR’ed text, which only take a few kilobytes; it’s because of some kind of recompression with a suboptimal codec.

Original 25.5 MB (from ExactScan, no OCR); PDFPen Pro output file 14.5 MB. Both use the `/Filter/FlateDecode` you mention above.
Edit: There’s a cost: compressed by the DT 3.0.1 OCR engine it’s 1.2 MB. Also FlateDecode, FWIW. Not as clear as the PDFPen output, but usable, unlike 3.0.2’s.

Your case is probably different from that of the original poster @chreliot, who had a bi-level black-and-white scan turned into greyscale with what looks like JPEG compression.

The /Filter/FlateDecode is lossless, it’s basically the same as LZW or ZIP, which can be performed with different “thoroughness” during the processing stage (= higher compression takes more time to compute) in order to possibly get smaller files but without sacrificing image quality.
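A small sketch to illustrate the point, using Python’s zlib (the same deflate algorithm behind /Filter/FlateDecode); the input file name is just a placeholder:

```python
# Deflate (the algorithm behind /Filter/FlateDecode) is lossless: different
# compression levels change the output size, never the decompressed content.
import zlib

data = open("page.pbm", "rb").read()   # placeholder: raw 1-bit page image

fast = zlib.compress(data, level=1)    # quick, usually larger output
best = zlib.compress(data, level=9)    # slower, usually smaller output

assert zlib.decompress(fast) == data   # identical after decompression
assert zlib.decompress(best) == data
print(len(data), len(fast), len(best))
```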

In order to get way smaller files with /Filter/FlateDecode, resolution needs to be decreased, i.e. pixels need to be dropped. Normally one does not want this either, because if one doesn’t need the resolution, one would have scanned the document with a lower one to begin with.

(But then again I am looking at the topic from an archiving background — if you don’t want or need to “preserve” a file but just want to keep it around in a size where you can still “read” it, all that might not be so important.)

1 Like

Appreciate your comments; very informative.

Except in the case where more resolution is needed for accurate OCR (small point sizes) but a lower one is OK for viewing/printing. So more lossy compression might be acceptable.
But yeah, everything is a tradeoff.

Original was black print on yellow paper, scanned in color, 300dpi.

OK, I agree this is a case where one surely desires smaller files when one is done processing them. A 300dpi color scan is huge.

With /Filter/CCITTFaxDecode one usually gets single-page PDFs which are only around 50 kilobytes in size at A4 @ 300dpi, and then it really hurts if just the OCR stage either turns them into 2 megabytes or degrades image quality, or even both. This is what I was mainly talking about, sorry.
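For a rough sense of where those numbers come from, here is a back-of-the-envelope calculation; the ~20:1 G4 ratio is a typical figure for scanned text pages, not a measurement of the files discussed in this thread:

```python
# Back-of-the-envelope page sizes; the ~20:1 G4 ratio is an assumed typical
# value for text pages, not a measurement of the files in this thread.
width_px  = round(8.27 * 300)    # A4 width  (8.27 in) at 300 dpi -> 2481 px
height_px = round(11.69 * 300)   # A4 height (11.69 in) at 300 dpi -> 3507 px

raw_bilevel = width_px * height_px / 8        # 1 bit per pixel, in bytes
print(round(raw_bilevel / 1024), "KB uncompressed bi-level")   # ~1062 KB

g4_estimate = raw_bilevel / 20                # assumed ~20:1 CCITT G4 ratio
print(round(g4_estimate / 1024), "KB with CCITT G4")           # ~53 KB

raw_color = width_px * height_px * 3          # 24-bit color at 300 dpi
print(round(raw_color / 1e6), "MB uncompressed color")         # ~26 MB
```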

1 Like

Didn’t mean to drag you off-topic, but learned a lot from your comments. Thanks!

1 Like

OK, after examining some files with a text editor:

FineReader Pro tends to replace CCITTFaxDecode with JBIG2Decode.

(A short PDF file containing CCITTFax data may be obtained through this link; click on ‘image-56.pdf’.)