OCR PDF compression size and image comparison

sandgiant · May 29, 2020, 12:11pm

I wanted to test the effect of PDF compression when doing OCR, and I found some rather interesting results for one paper that I want to share.

This is the paper I’m working with: https://www.nature.com/articles/320250a0

NOTE: Compression is obviously a tricky thing, so these results may not be representative of PDF compression in DT in general. As a matter of fact, even with these results, I would still expect PDF compression to produce smaller file sizes on average. Even so, it may be worthwhile studying the cases where it doesn’t, and this is clearly one of them. Perhaps there’s something funny with the input PDF I’m using; I don’t know.

I simply created two OCR’d PDFs, one with compression ON and one with compression OFF. Here are the results, including file size:

Original (320 KB):

Uncompressed OCR PDF (3.6 MB):

Compressed OCR PDF (4.5 MB):

The quality degradation from uncompressed to compressed is not that surprising. What I find surprising is that the uncompressed PDF is smaller than the compressed one.
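To put numbers on that, here is a minimal Python sketch that computes the size ratios from the three figures reported above. The dictionary keys are names made up for illustration, and it assumes the sizes are decimal units (1 MB = 1000 KB), which the post does not state.

```python
# Sizes as reported in this post, converted to KB.
# Assumption: decimal units, so 3.6 MB = 3600 KB and 4.5 MB = 4500 KB.
SIZES_KB = {
    "original": 320,
    "ocr_uncompressed": 3600,
    "ocr_compressed": 4500,
}


def size_ratio(numerator: str, denominator: str) -> float:
    """Return how many times larger one file is than another."""
    return SIZES_KB[numerator] / SIZES_KB[denominator]


def summarize() -> dict:
    """Collect the three ratios that matter for this comparison."""
    return {
        "compressed_vs_uncompressed": size_ratio("ocr_compressed", "ocr_uncompressed"),
        "uncompressed_vs_original": size_ratio("ocr_uncompressed", "original"),
        "compressed_vs_original": size_ratio("ocr_compressed", "original"),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.2f}x")
```

With these figures the "compressed" file is 1.25 times the size of the "uncompressed" one, and both OCR outputs are more than ten times the size of the original.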

Perhaps I’m late to the game here, and this is some kind of known phenomenon. In any case I was surprised. My naive guess is that there are several compression algorithms at work here and that together they somehow produce worse results than no compression at all. I wonder if both ABBYY and DT do compression, with DT ending up trying to compress the compression artifacts from ABBYY, which produces larger file sizes? This post from October might be related: OCR Settings: Compress PDF, Deskew, Page Orientation

Does anyone know what could be causing this?

aedwards · May 29, 2020, 2:35pm

The most likely reason is that metadata has been added or transferred from the original file. This is done after the OCR. When resaving the file, macOS only provides Quartz filter compression, which is not as efficient as the ABBYY compression.
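One rough way to check which compression ended up in each file is to count the stream filter names in the raw PDF bytes. The sketch below uses only the Python standard library. It is a heuristic, not a PDF parser: the counts cover all streams (fonts and page content as well as images), and the file names in the usage comment are hypothetical.

```python
import re
from collections import Counter

# Standard PDF stream filter names. DCTDecode is JPEG, JPXDecode is JPEG 2000,
# JBIG2Decode and CCITTFaxDecode are bitonal image codecs, and FlateDecode is
# zlib/deflate (used for images, fonts, and page content alike).
FILTER_NAME = re.compile(
    rb"/(DCTDecode|JPXDecode|JBIG2Decode|CCITTFaxDecode"
    rb"|FlateDecode|LZWDecode|RunLengthDecode)\b"
)


def count_filters(pdf_bytes: bytes) -> Counter:
    """Count occurrences of each stream filter name in raw PDF bytes."""
    return Counter(
        match.group(1).decode("ascii")
        for match in FILTER_NAME.finditer(pdf_bytes)
    )


def report(path: str) -> tuple:
    """Return (file size in bytes, filter counts) for the PDF at `path`."""
    with open(path, "rb") as handle:
        data = handle.read()
    return len(data), count_filters(data)


# Usage (hypothetical file names):
#   for path in ("ocr_uncompressed.pdf", "ocr_compressed.pdf"):
#       size, counts = report(path)
#       print(path, size, dict(counts))
```

If the two OCR outputs show different filters on their page images (for example DCTDecode in one and FlateDecode in the other), that would be consistent with the file having been re-encoded when it was resaved.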