@Macster I think OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF) retains image quality at or near lossless on macOS (via the command line). But I don’t think its OCR fidelity—via the Tesseract engine—is as good as ABBYY’s in DT3.
@jerwin Here’s a screenshot of the same part of the file above, OCR’d in DTP 3.0.2 and then opened in Adobe Acrobat Reader DC 2019.21.20056:
To my eye it exhibits the blur effect that was more apparent in the DTP 3.0.1 image above, rather than the blockiness of the DTP 3.0.2 image, but the resolution is still objectionably degraded.
Yes, there are a couple of OCR apps/tools which support the original compression and just add their text layer to the file, which I consider the right thing to do. But on the Mac a lot of companies seem to just take the simplistic approach and use Apple’s PDFKit without much consideration. I don’t care if the PDFs created for printing purposes (e.g. by the macOS Preview app) recompress images, but for archiving purposes it’s a no-go.
One can easily check what is going on by opening the PDF file with a text editor. In PDFs produced by a bundled scanner app one typically finds lines like this for the page images:
<</Type/XObject/Subtype/Image/Width 2477/Height 3507/ColorSpace/DeviceGray/BitsPerComponent 1/Filter/CCITTFaxDecode/DecodeParms<</K -1/Columns 2477/Rows 3507>>>>
Watch out for the /Filter/CCITTFaxDecode entry. A simplistic app turns this into /Filter/FlateDecode (then the conversion might be lossless, but with a much bigger file size, as this compression algorithm is not nearly as well suited to scanned images) or even /Filter/DCTDecode, which is lossy JPEG compression.
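For anyone who would rather not wade through the raw PDF in a text editor, here is a minimal sketch of the same check in Python. It just counts occurrences of the filter names in the raw bytes; filters referenced inside compressed object streams won’t show up, so treat a “not found” as inconclusive:

```python
import sys

# Filters discussed above: CCITT G4 and JBIG2 (bi-level scans),
# Flate (lossless, ZIP-like) and DCT (lossy JPEG).
FILTERS = (b"/CCITTFaxDecode", b"/JBIG2Decode", b"/FlateDecode", b"/DCTDecode")

def count_filters(path):
    """Count how often each filter name appears in the raw PDF bytes."""
    data = open(path, "rb").read()
    return {f.decode(): data.count(f) for f in FILTERS}

if __name__ == "__main__":
    # e.g.  python check_filters.py original.pdf after-ocr.pdf
    for path in sys.argv[1:]:
        print(path, count_filters(path))
```

Note that /FlateDecode also appears for ordinary content (text) streams, so its presence alone doesn’t prove the images were recompressed; what matters is whether the /CCITTFaxDecode image entries survive the OCR step.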
@jerwin The jagginess is of course a matter of the image resolution, i.e. 300dpi is more jaggy than 600dpi; with enough pixels, everything looks smooth.
For my purposes (archiving business documents) I actually consider Tesseract good enough. No automated OCR will yield perfect results anyway, except perhaps under very controlled conditions which don’t happen in my case where each document is different in size, layout, fonts, colors and contents.
So for me OCR just 1. increases the odds that a document will show up in a Spotlight search, and 2. makes it possible to copy individual pieces of data from it, e.g. serial numbers (so I can save some typing, even though I need to check closely whether what’s copied matches the original text). So I’d choose a less-than-perfect OCR result over mangled images inside of the PDFs any day.
EDIT: Especially since degrading the image makes it impossible to redo the OCR at some later point in time with better software. So one would need to keep the original PDF around, too. What nonsense!
To my eye, PDFPen Pro fits that description. I can’t see a difference between the before and after OCR versions.
That sounds great, but also have a look at the size of the resulting PDFs. If the file size increases considerably, that’s not because of the few words of OCR’ed text which only takes a few kilobytes. It’s because of some kind of recompression with a suboptimal codec.
Original 25.5MB (from ExactScan, no OCR); PDFPen Pro output file 14.5MB. Both use the /Filter/FlateDecode you mention above.
Edit: There’s a cost. Compressed by the DT 3.0.1 OCR engine it’s 1.2MB, also FlateDecode, FWIW. Not as clear as the PDFPen output, but usable, unlike the 3.0.2 output.
Your case is probably different from that of the original poster @chreliot, who had a bi-level black and white scan turned into greyscale with what looks like JPEG compression.
The /Filter/FlateDecode is lossless; it’s basically the same as LZW or ZIP and can be performed with different “thoroughness” during the processing stage (higher compression takes more time to compute) in order to possibly get smaller files, but without sacrificing image quality.
In order to get way smaller files with /Filter/FlateDecode, the resolution needs to be decreased, i.e. pixels need to be dropped. Normally one does not want this either, because if one didn’t need the resolution, one would have scanned the document at a lower resolution to begin with.
(But then again I am looking at the topic from an archiving background — if you don’t want or need to “preserve” a file but just want to keep it around in a size where you can still “read” it, all that might not be so important.)
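A minimal sketch of the point that Flate is lossless at every compression level, using Python’s zlib (the same deflate algorithm that sits behind /Filter/FlateDecode); only the output size and the time spent differ between levels:

```python
import zlib

# ~3.6 MB of repetitive, scan-like dummy data.
raw = (b"\x00" * 800 + b"\xff" * 16 + b"\x00" * 208) * 3507

fast = zlib.compress(raw, level=1)   # quick, larger output
best = zlib.compress(raw, level=9)   # slower, smaller output

# Both round-trip to the identical bytes -- lossless either way.
assert zlib.decompress(fast) == raw
assert zlib.decompress(best) == raw

print(len(raw), len(fast), len(best))
```

The level only affects file size and processing time, never pixel values; only dropping resolution or switching to a lossy filter changes the image itself.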
Appreciate your comments; very informative.
Except in the case where more resolution is needed for accurate OCR (small point sizes) but a lower resolution is fine for viewing/printing. In that case more lossy compression might be acceptable.
But yeah, everything is a tradeoff.
Original was black print on yellow paper, scanned in color, 300dpi.
OK, I agree this is a case where one surely desires smaller files when one is done processing them. A 300dpi color scan is huge.
With /Filter/CCITTFaxDecode one usually gets single-page PDFs which are only around 50 kilobytes in size for A4 at 300dpi, and then it really hurts if just the OCR stage either turns them into 2 megabytes or degrades image quality, or even both. This is what I was mainly talking about, sorry.
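For a sense of scale, here’s a rough back-of-the-envelope calculation of the uncompressed sizes involved (the 50 KB and 2 MB figures are the ones quoted above; everything else is simple arithmetic):

```python
# A4 at 300 dpi is roughly 8.27 x 11.69 inches.
width_px  = round(8.27 * 300)    # ~2481 pixels
height_px = round(11.69 * 300)   # ~3507 pixels
pixels = width_px * height_px

bilevel_mb = pixels / 8 / 1e6    # 1 bit per pixel   -> ~1.1 MB raw
gray_mb    = pixels / 1e6        # 8 bits per pixel  -> ~8.7 MB raw
rgb_mb     = pixels * 3 / 1e6    # 24 bits per pixel -> ~26 MB raw

print(f"bi-level raw:  {bilevel_mb:.1f} MB")
print(f"grayscale raw: {gray_mb:.1f} MB")
print(f"RGB raw:       {rgb_mb:.1f} MB")
```

A ~50 KB CCITT G4 page is therefore roughly a 20:1 ratio on the bi-level data; a tool that first converts the page to greyscale or colour and then recompresses is starting from the much larger figures, which is how a 50 KB page can end up as 2 MB.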
Didn’t mean to drag you off-topic, but learned a lot from your comments. Thanks!
OK, after examining some files with a text editor: FineReader Pro tends to replace CCITTFaxDecode with JBIG2Decode.
(A short PDF file containing CCITTFax data may be obtained through this link; click on ‘image-56.pdf’.)
JBIG2 is much more powerful, compresses even better than the CCITT Group 4 fax encoding, and can operate losslessly, too. So as long as the user is given the choice to not reduce the resolution and not go lossy for even further compression, JBIG2 might actually be the best choice.
(It’s just that CCITT Group 4 has been present in the PDF standard since the very beginning, while JBIG2 was introduced by some later revision, apparently 1.4, so the CCITT compression may still be more universally supported in PDF software.)
EDIT: Actually JBIG2 can be very dangerous to use in lossy mode; this became apparent in the curious case of Xerox copier machines which produced copies with altered text: since the digits 6 and 8 look “similar”, the algorithm might choose to insert the image of a “6” instead of an “8” or vice versa.
So there are a lot of reasons why OCR software should generally leave the original PDF image content alone and unmodified unless explicitly told to “save space” (and then it needs to do the right thing, too).
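As a quick way to verify that an OCR tool really did leave the images alone, here is a hedged sketch using the third-party pikepdf library (my choice, not something mentioned in this thread); it lists the /Filter, dimensions and bit depth of every image XObject so you can compare a file before and after OCR:

```python
import pikepdf  # third-party: pip install pikepdf

def image_summary(path):
    """Return (page, name, filter, width, height, bits) for every image XObject."""
    rows = []
    with pikepdf.open(path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            for name in page.images.keys():
                img = page.images[name]
                rows.append((
                    page_no,
                    str(name),
                    str(img.get("/Filter")),
                    int(img.get("/Width", 0)),
                    int(img.get("/Height", 0)),
                    int(img.get("/BitsPerComponent", 0)),
                ))
    return rows

# Hypothetical file names -- substitute your own scans.
for row in image_summary("scan-original.pdf"):
    print(row)
for row in image_summary("scan-after-ocr.pdf"):
    print(row)
```

If the filter, dimensions and bit depth match line for line, the OCR step only added a text layer; if CCITTFaxDecode at 1 bit has turned into DCTDecode at 8 bits, the images were recompressed.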
This issue seems well resolved in release 3.0.3. I thank the developers (@aedwards) for their prompt efforts to sort it out with the OCR provider.
Is the correct version of DTOCRHelper.app now 1.0.25? That’s what I have installed following upgrading DT3 to 3.0.3. Thanks!
That’s the latest helper for version 3.0.3.
Second this. Appreciate the DT team’s work in resolving this issue.
3.0.3 resolves the problem by keeping the originals (plus text layer) in the folder.
The OCR results now sometimes look better (file size much smaller, too) compared to the original scan, and sometimes worse… it may have to do with the pixel resolution of the original scans.
Is this normal or expected? I have a PDF that is 425 KB with no OCR layer. After I run the OCR on the PDF, the file size jumps to 2.2 MB. That’s a 5x increase in size.
The original PDF was pretty clear, not fuzzy, and produced nearly perfect word recognition (only a few words mangled).
Mojave, DT3.0.3