Merged file rather large

MikeP · February 17, 2015, 9:55pm

Hi,
After merging a 980k and 940k PDF, the resulting merged document is… 17.1MB!!!

A bit of common sense and basic mathematics tells me it should be under 2MB.
How can this be? Anyone else seen this? All I did was select two files, right-click, merge.

Thanks,
Mike

Edit: forgot to add I’m using DTPO 2.8.3 on Yosemite, all fully up to date.
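
For anyone wanting to check the arithmetic behind that expectation, here is a minimal sketch using only the sizes quoted in this post. It assumes "k" means kilobytes and that 1 MB = 1000 KB; the variable names are purely illustrative.

```python
# Sizes as quoted in the opening post; "k" is read as kilobytes (an assumption).
input_sizes_kb = [980, 940]
merged_size_kb = 17_100  # 17.1 MB, taking 1 MB = 1000 KB (an assumption)

# What a plain concatenation of the two inputs would cost.
expected_kb = sum(input_sizes_kb)

# How many times larger the merged file is than the combined inputs.
bloat_ratio = merged_size_kb / expected_kb

print(f"expected about {expected_kb / 1000:.2f} MB")
print(f"merged file is {bloat_ratio:.1f}x the combined inputs")
```

Under those assumptions the two inputs add up to a little under 2MB, so the merged file is close to nine times larger than the sum of its parts.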

Bill_DeVille · February 18, 2015, 12:06am

Apple’s PDFKit uses Quartz default settings to save PDFs resulting from edit, split or merge operations, often with considerable bloat. That’s why I avoid splitting or merging PDFs unless absolutely necessary – and I’ve developed a very high threshold about what’s “necessary”.

It’s possible to configure Quartz settings to minimize file size bloat, but I gave up on that and instead use a utility called PDF Shrink to reduce PDF sizes when needed.

cgrunenberg · February 18, 2015, 8:09am

DEVONthink uses Mac OS X’s PDFKit framework to edit/split/merge PDF documents and saves them losslessly. Otherwise each operation could reduce the quality. But a future release might use Quartz filters to reduce the file size. Finally, could you please send these documents to cgrunenberg - at - devon-technologies.com? Thanks!

Frederiko · February 19, 2015, 1:04pm

@Mike

If this is something you need to do often, you should easily be able to modify the script I posted here for splitting PDFs so that it joins them instead.

It uses an external utility called sejda which is better at keeping small sizes than the Quartz filters that OS X uses.

Personally I haven’t found the various pieces of software that claim to be able to shrink PDFs to be particularly effective.

Frederiko

Bill_DeVille · February 19, 2015, 6:25pm

I agree, in that I’ve tried a lot of them, often with poor results.

I’ve been doing book scanning, using the little Xcanex portable book and document scanner. I installed Windows 7 under Boot Camp on my MacBook Pro, as Mac software isn’t yet available. It produces great results using the FineReader OCR app supplied with the scanner, and I usually save the OCRed PDFs as high quality (in case I want to redo OCR with a different filetype save, such as Word). As a result, the PDF file is enormous.

PDF Shrink has a number of file size reduction settings, some of which don’t reduce file size very much or may reduce view quality too much. I’ve settled on the Web setting, configured for PDFs to be posted on the Web. It reduces the file size of my book PDFs by about 90%, yet produces very readable PDFs.

The little 7 ounce Xcanex scanner is remarkable. It has had three software updates since I got one last year, and is capable of professional results. There’s a bit of a learning curve in setup and copy placement, but each software update has made that easier. I’m amused that in the latest release they display and compare images with those produced by “Scanner F” (obviously the Fujitsu ScanSnap S600), with advantage to the Xcanex. However, the Xcanex is limited to a book page size of about U.S. letter size without elevating the scanner and lowering image resolution, while the ScanSnap S600 can handle somewhat larger page sizes. And the S600 is Mac compatible. (I still hate Windows.)

Shoolie · February 20, 2015, 5:18am

Another alternative for shrinking PDFs is Jerome Colas’ free Quartz Filter Collection. See: fairerplatform.com/2013/01/how-t … x-preview/

See the link toward the bottom of the page.

The article says to place the filters in ~/Library/Filters. I placed them in /System/Library/Filters. I suppose the difference is that placing them in your user Library makes the filters available only to your user account, whereas placing them in the /System library makes them available to all accounts.

You will have to load the PDF into Preview and re-save it while applying a filter. Sometimes a little trial and error is needed but I always find a filter that works well on a given PDF.

If/when Devon implements support for Quartz filters, I hope there is a way for DT(PO) to leverage user-supplied filters.

NB: I have not tried these filters on Yosemite.
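
To see which Quartz filters are actually installed in the two locations mentioned above, something like the following could be used. It is only a sketch: it assumes filters are stored as `.qfilter` files, checks just the two directories named in this post, and `list_quartz_filters` is a made-up helper name.

```python
from pathlib import Path

# The two locations mentioned in the post above.
FILTER_DIRS = [
    Path.home() / "Library" / "Filters",
    Path("/System/Library/Filters"),
]


def list_quartz_filters(directories=FILTER_DIRS):
    """Return a dict mapping each existing directory to its .qfilter file names."""
    found = {}
    for directory in directories:
        directory = Path(directory)
        if directory.is_dir():  # silently skip locations that do not exist
            found[str(directory)] = sorted(
                p.name for p in directory.glob("*.qfilter")
            )
    return found


if __name__ == "__main__":
    for directory, names in list_quartz_filters().items():
        print(directory)
        for name in names:
            print("  ", name)
```

Directories that do not exist are skipped, so on a machine without any user-installed filters only the system location (if present) is listed.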

MikeP · February 21, 2015, 3:39pm

Hi,

Thanks all for your responses. The issue is not really with trying to shrink them, as the two files are an acceptable size (<1MB each) before merging. It’s only when I join them together that they blow up to over 8x the original combined size. This shouldn’t be necessary. Where does all the additional data come from? If that’s the way it is, I can live with keeping the original files; it’s just a little bizarre though.
And thanks Christian, I’ve emailed the files over.

If it helps, they are generated by ABBYY Pro. I don’t use DTPO’s OCR because I also use Hazel to apply content-based rules (such as putting the document date into the filename), which has to happen after OCR and before loading into DTPO.

Thanks & regards,
Mike

alanshutko · February 21, 2015, 3:59pm

Basically, there are lots of ways that are compliant with the PDF spec to store data. PDFKit only uses some of them. When PDFKit loads a document, changes it and saves it, it changes the on-disk representation to the subset of PDF options that it likes to write. There’s been speculation that PDFKit writes data that it finds faster to read, but no official responses from Apple that I know of.

Quartz Filters don’t really solve this problem. They can reduce the size of the objects when writing it, but they do so by degrading the objects not by changing the on-disk representation.

The only program I’ve found that can losslessly reduce the size of PDFs is Adobe Acrobat. It writes much smaller PDFs because it uses PDF features that PDFKit does not use when writing. If you split or merge PDFs with Acrobat, it doesn’t change the structure of the objects at all; it just moves them.
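
The point about on-disk representation can be illustrated with a small sketch. It does not reproduce what PDFKit or Acrobat actually write (Apple has not documented that, as noted above); it only shows that the same pixel data can be stored with or without Flate (zlib) compression, one of the stream encodings the PDF format allows, with very different sizes and no loss of quality. The synthetic "page" and all variable names are assumptions made for the example.

```python
import zlib

# Synthetic stand-in for a scanned page: mostly white with a few dark rows.
# Real PDF image streams differ; this is only an illustration.
width, height = 1000, 1000
white_row = bytes([255]) * width
dark_row = bytes([0]) * width
page = b"".join(dark_row if y % 50 == 0 else white_row for y in range(height))

stored_raw = page                       # same pixels, no compression
stored_flate = zlib.compress(page, 9)   # same pixels, Flate (zlib) compressed

# Both representations decode to identical pixel data: nothing is degraded.
assert zlib.decompress(stored_flate) == page

print(len(stored_raw), "bytes uncompressed")
print(len(stored_flate), "bytes Flate-compressed")
```

A Quartz filter, by contrast, typically shrinks a file by downsampling or recompressing the image itself, which is the degradation described above.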

Bill_DeVille · February 21, 2015, 6:29pm

I agree. While Adobe allows other developers, including Apple, to write applications for creating, displaying and editing PDFs, Adobe’s own applications, especially Acrobat Pro, contain a lot of proprietary code. Acrobat allows lossless (image quality) saves of edited PDFs with relatively small file sizes. That demonstrates that it is possible in OS X.

But Adobe’s business plan seems to ascribe a lot of value to their proprietary code in Acrobat Pro. Check out the price!

If the price of Acrobat Pro were substantially cheaper (as it once was), it would be one of my tools.

I’ve been scanning a series of syllabi for graduate seminars on Science, Technology and Public Policy that I coauthored years ago and for which I’ve received requests for copies. As they are not available in digitized form (although available in a number of libraries) I’ve undertaken to scan them. To keep down the file sizes of PDFs with hundreds of pages, I’m saving the scans as MS Word files (I’m using ABBYY FineReader Pro 12.x), then exporting them as PDF. That’s more work, darn it, and not without occasional glitches.

MikeP · March 1, 2015, 1:08am

Just a followup for the benefit of anyone bumping into this:

Christian has already established that this is an issue with PDFKit, because merging the files in the Preview app produces a file of the same size.

I tried something else completely outside of DEVONthink: I had 10 receipts scanned into a single 800KB PDF. I split them into individual files using PDFpen, and the individual files totalled nearly 7MB. Yessir, that’s an average 700k file size for a simple paper receipt!

Adobe trying to make money out of selling Acrobat Pro? Surely not!

Thanks all who responded,
Mike