Dramatically increased size of PDF after OCR

P12 · March 6, 2022, 12:18pm

As far as I know it’s a one-off purchase, not a subscription (though it’s also on Setapp). It has been money well spent for me.

rmschne · March 6, 2022, 12:32pm

Yes. My bad. I guess the $9.99/month fee in your link is for Setapp or something. I’m unfamiliar with this app.

I use Ghostscript, which is free of license fee, to compress PDFs.
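For anyone who wants to try the Ghostscript route, here is a minimal sketch of how it could be scripted. It assumes the `gs` binary is installed and on the PATH; the function names `build_gs_command` and `compress_pdf` are made up for this example, and the `ebook` preset is only a starting point, not a recommendation for every document.

```python
import shutil
import subprocess

# Presets accepted by Ghostscript's -dPDFSETTINGS switch.
PRESETS = {"screen", "ebook", "printer", "prepress", "default"}


def build_gs_command(src, dst, preset="ebook"):
    """Return the Ghostscript argument list for rewriting src into dst.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    return [
        "gs",
        "-sDEVICE=pdfwrite",         # write a new PDF
        "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS=/{preset}",  # controls image downsampling
        "-dNOPAUSE",
        "-dBATCH",
        "-dQUIET",
        f"-sOutputFile={dst}",
        src,
    ]


def compress_pdf(src, dst, preset="ebook"):
    """Run Ghostscript; raises if gs is missing or exits with an error."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) not found on PATH")
    subprocess.run(build_gs_command(src, dst, preset), check=True)
```

Usage would be something like `compress_pdf("scan-ocr.pdf", "scan-small.pdf")`. Always write to a new file and compare the result with the original (size, legibility, searchable text) before deleting anything.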

marshalleq · March 6, 2022, 11:52pm

Can @BLUEFROG confirm whether a ticket has ever been logged with ABBYY about this? I have just been hit by this too. Normally the paying customer of ABBYY (in this case DEVONthink) would log a ticket and say: we have thousands of users complaining, please can you fix this.

Sorry, I haven’t searched all the other threads for this. In my case an 18 MB document is turning into 143 MB, which admittedly is not as bad as the others, but the main problem is that it’s now slow to read on my computer because of the file size. A new computer is inbound, but really it shouldn’t be like this.

Thanks.

jandamm · June 12, 2025, 11:54am

Sorry for opening this up again.

I checked that the extremely large files are indeed created by Quartz instead of ABBYY.
I was wondering whether there is a way to strip or modify the metadata so that the file is created by ABBYY instead?
In most cases I don’t care about the metadata, only the (then searchable) content.
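For anyone wanting to repeat the check of which software wrote a file, here is a rough sketch that scans the raw bytes of a PDF for `/Producer` and `/Creator` entries. It is a heuristic under stated assumptions, not a PDF parser: it only sees literal strings stored uncompressed, so it will miss metadata in compressed object streams, XMP packets, hex strings, or encrypted files. The function name `find_info_fields` and the sample producer string are made up for illustration.

```python
import re

# Matches "/Producer (...)" or "/Creator (...)" with literal strings,
# allowing escaped characters and one level of nested parentheses.
_FIELD = re.compile(
    rb"/(Producer|Creator)\s*\(((?:\\.|[^\\()]|\((?:\\.|[^\\()])*\))*)\)"
)


def find_info_fields(data):
    """Return {"Producer": [...], "Creator": [...]} found in raw PDF bytes.

    Hypothetical helper: a byte-level scan, so results are best-effort.
    """
    found = {}
    for key, value in _FIELD.findall(data):
        found.setdefault(key.decode("ascii"), []).append(value.decode("latin-1"))
    return found


def written_by_quartz(data):
    """True if any Producer string found mentions Quartz."""
    return any("Quartz" in p for p in find_info_fields(data).get("Producer", []))


# Illustrative sample only, not taken from a real file.
sample = b"<< /Producer (macOS Version 13.0 (Build 22A380) Quartz PDFContext) /Creator (Scanner App) >>"
print(find_info_fields(sample))
print(written_by_quartz(sample))
```

On a real file the input would be `pathlib.Path("some.pdf").read_bytes()`. A file that has been saved incrementally can contain several Producer entries, which is why the values are collected in lists.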

aedwards · June 12, 2025, 12:27pm

It depends on the PDF. There are a number of reasons why modifications are made after the OCR stage, such as transferring annotations or the table of contents.

What size are the original and OCR’d files? Could you provide a copy of the original file and a screenshot of your OCR settings? If you do not wish to share the PDF on the forum, please create a support ticket and add “For the attention of Alan” at the top.

jandamm · June 12, 2025, 4:26pm

Thanks, I’ve created a ticket.

I also tried to reduce the file size with PDF Squeezer, but the result was still a very large PDF and hard to read.

rfog · June 13, 2025, 6:21am

If it is only one PDF, you could send it to me privately and I will process it with ABBYY 15 PDF using MRC compression on Windows. The result will be like the files downloaded from archive.org; normally the size reduction is ten to one without noticeable loss of visual quality. On macOS, if the “image” backgrounds differ too much because of paper irregularities, the size grows absurdly.

The problem with this MRC compression is that if you annotate in DT, Preview, or most macOS applications that use the Apple PDF framework, the file grows back to its original large size or even bigger, no matter whether you highlight a single letter or the entire document. The only macOS application I’ve found that does not do this is PDF Expert.

(It’s quite Kafkaesque that Windows now has much better native and third-party support for PDF than macOS, which is almost where the PDF format first came into its own).

jandamm · June 13, 2025, 6:45am

Thanks for the offer.
Unfortunately I get a lot of PDFs from that insurer, and only some of them grow in size when OCRed.
The final size is 3-4 MB, which is okay, but it’s quite large considering the files were 100 KB before.

chrillek · June 13, 2025, 7:08am

Stupid question: are you sure the insurer sends out PDFs without a text layer? I don’t see that with any invoice, statement, or whatever from banks, insurers, internet providers, etc.

jandamm · June 13, 2025, 11:38am

I’ve encountered four different types of PDFs from companies:

1. Searchable (most PDFs).
2. Selectable text (in DT) but not searchable.
3. A kind of encrypted text layer: text can be selected, but the selected text doesn’t match the visible text. Maybe it’s not text but vectors instead?
4. Image-only PDFs.

In this case it was the third type.