Dramatically increased size of PDF after OCR

As far as I know it’s a one-off purchase, not a subscription (though it’s also on Setapp). It has been money well spent for me.

Yes. My bad. I guess from your link that the $9.99/month fee is for Setapp or something. I’m unfamiliar with this app.

I use Ghostscript, which is free of license fees, to compress PDFs.
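
For anyone who wants to try it, here is a minimal sketch of how Ghostscript could be driven from Python. The function names, the file names, and the choice of the `ebook` preset are only assumptions for illustration; the `gs` options themselves are the standard `pdfwrite` ones. It assumes the `gs` binary is installed and on your PATH.

```python
import shutil
import subprocess

# Presets accepted by Ghostscript's -dPDFSETTINGS option.
GS_PRESETS = {"screen", "ebook", "printer", "prepress", "default"}


def build_gs_command(src, dst, preset="ebook"):
    """Build the Ghostscript argument list that rewrites `src` into `dst`.

    The names here are hypothetical helpers, not part of any library.
    """
    if preset not in GS_PRESETS:
        raise ValueError(f"unknown preset: {preset}")
    return [
        "gs",
        "-sDEVICE=pdfwrite",         # write a new PDF
        f"-dPDFSETTINGS=/{preset}",  # downsampling/compression preset
        "-dNOPAUSE",                 # do not wait for a keypress per page
        "-dBATCH",                   # exit when the input is processed
        "-dQUIET",                   # suppress the startup banner
        f"-sOutputFile={dst}",
        src,
    ]


def compress_pdf(src, dst, preset="ebook"):
    """Run Ghostscript; raises if gs is missing or exits with an error."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) not found on PATH")
    subprocess.run(build_gs_command(src, dst, preset), check=True)
```

Usage would be something like `compress_pdf("scan_ocr.pdf", "scan_small.pdf")`. How much you save depends heavily on the PDF, and it is worth checking afterwards that the result is still searchable.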

Can @BLUEFROG confirm whether a ticket has ever been logged with ABBYY about this? I’ve just been hit by this too. Normally the paying customer of ABBYY (in this case DEVONthink) would log a ticket saying: we have thousands of users complaining, please fix this.

Sorry, I haven’t searched all the other threads for this. In my case an 18 MB document turns into 143 MB, which admittedly is not as bad as the others, but the main problem is that it’s now slow to read on my computer because of the file size. A new computer is inbound, but really it shouldn’t be like this.

Thanks.

Sorry for opening this up again.

I checked, and the extremely large files are indeed created by Quartz rather than ABBYY.
I was wondering whether there is a way to strip or modify the metadata so that the file is created by ABBYY instead?
In most cases I don’t care about the metadata, only the (then searchable) content.
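
For reference, a rough sketch of how the producer can be checked without opening each file in a viewer. It simply scans the raw bytes for a `/Producer (...)` entry, so it assumes the document information dictionary is stored uncompressed and as a plain literal string; if it sits in a compressed object stream or is hex/UTF-16 encoded, this finds nothing. The function names and the sample producer string are made up for illustration.

```python
import re

# Matches /Producer (...) with escaped characters and one level of
# nested, balanced parentheses inside the literal string.
PRODUCER_RE = re.compile(
    rb"/Producer\s*\(((?:\\.|[^\\()]|\([^()]*\))*)\)"
)


def find_producers(pdf_bytes):
    """Return every /Producer string found in the raw PDF bytes."""
    return [m.group(1).decode("latin-1") for m in PRODUCER_RE.finditer(pdf_bytes)]


def made_by_quartz(pdf_bytes):
    """True if any producer entry mentions Quartz."""
    return any("Quartz" in p for p in find_producers(pdf_bytes))


if __name__ == "__main__":
    import sys

    # Usage: python check_producer.py file1.pdf file2.pdf ...
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            data = fh.read()
        print(path, find_producers(data))
```

A file that was saved incrementally can contain more than one producer entry, which is why the helper returns a list.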

It depends on the PDF. There are a number of reasons why modifications are made after the OCR stage, such as transferring annotations or the table of contents.

What size are the original and OCR’d files? Could you provide a copy of the original file and a screenshot of your OCR settings? If you do not wish to share the PDF on the forum, please create a support ticket and add “For the attention of Alan” at the top.

Thanks, I’ve created a ticket.

I also tried to reduce the file size with PDF Squeezer, but the result was still a very large PDF and hard to read.

If it is only one PDF, you could send it to me privately and I will process it on Windows with ABBYY FineReader PDF 15 using MRC compression. The result will be like the files downloaded from archive.org; normally that gives about a ten-to-one size reduction with no noticeable loss of visual quality. On macOS, if the “image” backgrounds differ too much because of paper irregularities, the size grows absurdly.

The problem with this MRC compression is that if you annotate in DT, Preview, or most macOS applications that use the Apple PDF framework, the file grows back to its original large size, or even larger, no matter whether you highlight a single letter or the entire document. The only macOS application I’ve found that does not do this is PDF Expert.

(It’s quite Kafkaesque that Windows now has much better native and third-party support for PDF than macOS, which is almost where the PDF format first came into its own).

Thanks for the offer.
Unfortunately I got a lot of PDFs from that insurance company, and only some of them grow in size when OCRed.
The final size is 3-4 MB, which is okay, but quite large considering the files were 100 KB before.

Stupid question: are you sure the insurance company sends out PDFs without a text layer? I don’t see that with any invoice, statement, or whatever from banks, insurers, internet providers, etc.

I’ve encountered four different kinds of PDFs from companies:

  • searchable (most PDFs)
  • selectable text (in DT) but not searchable
  • a kind of encrypted text layer: the text can be selected, but it doesn’t match the visible text. Maybe it’s not text but vectors instead?
  • image-only PDFs

In this case it was the third type.