After OCR, the PDF file size is reduced a lot

MichaelJAQ · January 14, 2021, 12:57pm

I’m trying to use OCR to convert an image-based PDF into a searchable PDF. In the settings, I unchecked “Compress PDF” and set DPI to 300. The original file is 14 MB; after conversion it is only 4 MB, and the images inside are a little blurry.

I don’t want any quality loss, I just want to make the PDF searchable. How can I do that?

aedwards · January 14, 2021, 1:05pm

Can you provide a copy of the original file so that I can test it? Also, which version of macOS are you running?

MichaelJAQ · January 14, 2021, 1:47pm

Building a Second Brain- The Illustrated Notes.pdf (11.5 MB)

Please check it out. Sorry, the original file size is actually around 12 MB. My macOS version is 11.1.

aedwards · January 14, 2021, 3:46pm

Thanks for sending the file. It looks like there are some additional artefacts around the text that could have been introduced during the page extraction. I have added a change that should reduce this.

MichaelJAQ · January 14, 2021, 4:04pm

But why does the image quality seem to drop a little after conversion (file size from 12 MB → 4 MB)? Is it possible to keep the image quality and only extract the text?

aedwards · January 14, 2021, 4:32pm

The difference in size between the original file, which was generated with macOS PDFKit, and the OCR’d file comes from ABBYY OCR applying significantly better compression than PDFKit.

tja · January 20, 2021, 12:39pm

That’s not the point here.

As he wrote, he disabled compression.

aedwards · January 20, 2021, 1:03pm

ABBYY will always apply some compression. With the “Compress PDF” option off, the final PDF size is affected in two ways:

If metadata is added or transferred from the original file, the saved file will not be compressed.
When ABBYY OCR generates the PDF, priority is given to quality over size; however, it is ABBYY that determines the actual amount of compression applied to the final file.

tja · January 20, 2021, 1:21pm

As I understood the OP, the visual quality of the PDF was affected.

This seems to point to lossy compression being used.

Disabling compression in the settings should AT LEAST disable any lossy compression!
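
One rough way to check this from outside the app is to look at which stream filters the embedded images use before and after OCR. The sketch below is only a heuristic and makes several assumptions: it scans the raw PDF bytes instead of using a real PDF parser, the function name `count_image_filters` is made up for illustration, and it assumes each image's `/Filter` is written inline in its stream dictionary. A `DCTDecode` (JPEG) filter points to lossy encoding, while `JPXDecode` and `JBIG2Decode` can be either lossy or lossless, so the result is a hint, not proof. It also cannot detect blur caused by resampling the page image.

```python
import re
from collections import Counter

# Filters that may indicate lossy image data. DCTDecode is JPEG;
# JPXDecode and JBIG2Decode can be lossy or lossless.
POSSIBLY_LOSSY = {"DCTDecode", "JPXDecode", "JBIG2Decode"}


def count_image_filters(pdf_bytes: bytes) -> Counter:
    """Count the /Filter names used by image streams in raw PDF bytes.

    Heuristic only: splits on "endobj" and inspects the text in front of
    the first "stream" keyword of each chunk.
    """
    counts = Counter()
    for chunk in pdf_bytes.split(b"endobj"):
        head, sep, _ = chunk.partition(b"stream")
        if not sep:
            continue  # not a stream object
        if not re.search(rb"/Subtype\s*/Image", head):
            continue  # a stream, but not an image
        match = re.search(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)", head)
        if match is None:
            counts["(no filter)"] += 1
            continue
        # A filter entry is either a single name or an array of names.
        for name in re.findall(rb"/([A-Za-z0-9]+)", match.group(1)):
            counts[name.decode("ascii")] += 1
    return counts


def uses_possibly_lossy_filter(counts: Counter) -> bool:
    """Return True if any counted filter is in POSSIBLY_LOSSY."""
    return any(name in POSSIBLY_LOSSY for name in counts)
```

To compare two files, read each one with `open(path, "rb").read()` and pass the bytes to `count_image_filters`. If the original only shows `FlateDecode` and the OCR’d copy shows `DCTDecode`, that would suggest the images were re-encoded as JPEG.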

aedwards · January 20, 2021, 1:51pm

The issue is not due to compression but to the extraction of the page image from the original PDF, which, as I said earlier, has been fixed.

tja · January 20, 2021, 1:54pm

Ah, then I’m sorry - I did not get this point and only noticed that compression is always used.

aedwards · January 20, 2021, 2:07pm

Not a problem, happy to explain the cause of the issue.