Can anyone recommend a pdf compressor and DT workflow?

I am not at my computer and won't be for a few weeks to look. Best to read the "man" file and try different settings. I usually go for 75 dpi.

Thanks. I made a PDF from a big TIFF and Ghostscript crunched it very well.

But I can’t get the -r flag to make any difference. -r100 and -r50x50 yield the same file size (and visually the same result) as omitting -r altogether. Perhaps it only works for some output devices? The documentation is a bit terse. For the record, I used:

gs -sDEVICE=pdfwrite -r100 -dCompatibilityLevel=1.7 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=/Users/charles/Desktop/compressed.pdf ~/Desktop/test.pdf


:rofl: That's probably the understatement of the month. Quoting from it:

Output resolution
Some printers can print at several different resolutions, letting you balance resolution
against printing speed. To select the resolution on such a printer, use the -r switch:
gs -sDEVICE=printer -rXRESxYRES
where XRES and YRES are the requested number of dots (or pixels) per inch. Where
the two resolutions are same, as is the common case, you can simply use -rres.
The -r option is also useful for controlling the density of pixels when rasterizing to an
image file. It is used this way in the examples at the beginning of this document.

If you know what they mean, it's perfectly clear. Which is why I love Open Source documentation so much… Anyway, a PDF device has no pixel resolution. Or rather, it has no pixel resolution that you can select. It's always 72 dpi, regardless of whether the physical printer has 300 dpi or 1200.
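What you can control with pdfwrite is the resolution of the images embedded in the PDF, via the downsampling parameters. A sketch (file names illustrative; the flags are standard Ghostscript parameters):

gs -sDEVICE=pdfwrite -dDownsampleColorImages=true -dColorImageDownsampleType=/Bicubic -dColorImageResolution=100 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=compressed.pdf input.pdf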

I guess that when you pass a TIFF to pdfwrite, it simply gets converted into a PDF containing one large image (given that GS doesn't have much choice there…). As for why that would be a lot smaller than the original: it could be, for example, that the original uses different (or no) compression and GS applies its own.

Interesting. I’ve never seen that issue. But when I was making test PDFs just now I swear that Preview.app’s Info pane said one of them had a password, but that it was stored by the system or some such. I can’t reproduce it.

With the variable names from my PDF Shrink script, what about:

set this_file to do shell script "pdfs " & quoted form of this_image & " --replace"

I would love to help but I’ve always found this the most baffling part of Automator (and possibly now Shortcuts too). But I think Jim’s workflow should help?


Thanks, that’s useful – bearing in mind that all PDFs are different, and that different compression settings are available. PDF Shrink, at least, gives quite a lot of control, though the settings screen is a bit clunky IIRC.

Thanks, this works well! Always good to be able to use built-in tools. Is there a convenient way to run an Automator workflow within DT?

The purpose-built PDF tools probably give more options – for instance, PDF Shrink can change the PDF version, handle colour and mono images differently, strip metadata, and so on. But for most PDFs I’m guessing that image compression is the biggest contributor to smaller files.

You’re welcome and glad to hear it. And yes, we try to remove or limit dependencies in what we do.

You can add it to the ~/Library/Application Scripts/com.devon-technologies.think3/Menu directory and run it just like any script.
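For example, from the command line (the script name is hypothetical):

cp ~/Desktop/Compress\ PDF.scpt ~/Library/Application\ Scripts/com.devon-technologies.think3/Menu/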


Thank you @Chazzo

I will give this a try and read the PDF Squeezer automation docs in more detail and report back - it may take a few days

Aha. Thank you.

I don’t know what GS is using, but it’s quite impressive.

My original uncompressed TIFF (a bunch of flowers at 12 MP) is 36 MB.

Saving it as a PDF from Preview.app gives 20 MB, with some smoothing of edges. Squashing the PDF with GS gives 718 kB, with really rather good image quality. There is some blockiness, but not much.

I get a similar file size (740 kB) by saving the TIFF as a JPG from Preview with about 30% quality. Compared to the GS PDF there’s more blockiness, though less noise.

Perhaps this has something to do with the way Preview displays PDFs compared to TIFFs and JPGs, but GS still performs well in this case. So I tried it on a well-made PDF of a magazine, and it snarled up many of the non-ASCII characters.

Which I think goes to show that all PDFs are different and you never know what compression will do until you try.


There are many, many forms of PDFs in the wild, not all of them well-behaved.

  • Did you try my workflow on the file?
    • If so, with what compression settings?

I have now :slight_smile: JPEG compression and “least” quality on the slider.

It’s definitely working. I captured a PDF of a random web page (404 kB) and shrunk it to 283 kB.

However, when I tried it with my original PDF (the flower photo, 19.7 MB) the file size changed by just 2 bytes and I can't tell the difference visually. It's as if the image isn't being touched. I find this odd because it's a full-size photo from my iPhone, so I'd expect it to be an ideal candidate for shrinking.

And the magazine PDF I mentioned actually grew in size. I’m less surprised here, because this is a well-made PDF: 24 colour pages with loads of graphics in just 3.8 MB. PDF Shrink couldn’t make it any smaller; the Quartz filter took it up to 6.3 MB, though unlike Ghostscript it did leave the non-ASCII characters in place.

Discourse is politely pointing out that I’m hogging this thread, so I will apologise and shut up now.

Ignore Discourse. It ain’t the boss of you :stuck_out_tongue:


Here is a little Python3 program (a function definition followed by the main program which calls the function) that shows the settings I last used with Ghostscript. The function builds a command string that is then executed by the system. See the various parameters that I chose to set.

Best that you experiment. I won’t vouch these are “best” or even “good” settings. Your mileage may differ.

My goal (when I get around to it) is to turn this into some sort of automation in DEVONthink that, say, compresses all the selected “big” files, or compresses big files when they are created in DEVONthink, or … “world is my oyster” as they say.

#!/usr/bin/env python3
# coding: utf-8

import os
import shlex
import sys

def shrinkcommand(file_input, file_output, dpi="75"):
    # Build a Ghostscript command that downsamples colour, grey and mono
    # images to the requested DPI. shlex.quote protects paths with spaces.
    print("Shrink Command", file_input, file_output, dpi)
    cmd = "gs" + \
        " -q -dNOPAUSE -dBATCH -dSAFER" + \
        " -sDEVICE=pdfwrite" + \
        " -dCompatibilityLevel=1.4" + \
        " -dPDFSETTINGS=/screen" + \
        " -dColorImageDownsampleType=/Bicubic" + \
        " -dColorImageResolution=" + dpi + \
        " -dGrayImageDownsampleType=/Bicubic" + \
        " -dGrayImageResolution=" + dpi + \
        " -dMonoImageDownsampleType=/Subsample" + \
        " -dMonoImageResolution=" + dpi + \
        " -sOutputFile=" + shlex.quote(file_output) + " " + shlex.quote(file_input)
    return cmd

file_extension = ".pdf"
dpi = "75"
DEBUG = True

# Derive the output name from the input; an optional second argument
# selects the output directory.
base = os.path.basename(sys.argv[1])
basename = base.split('.')
try:
    outpath = sys.argv[2] + "/"
except IndexError:
    outpath = ""
file_input = sys.argv[1]
file_output = outpath + basename[0] + "_dpi-" + dpi + file_extension

# Build the command, show it when debugging, then run it.
cmd = shrinkcommand(file_input, file_output, dpi=dpi)

if DEBUG:
    print("arguments:", file_input, file_output, dpi)
    print("gs command:", cmd)

print(os.system(cmd))
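A typical invocation, assuming the script is saved as shrinkpdf.py (name hypothetical); the second argument, the output directory, is optional:

python3 shrinkpdf.py ~/Desktop/test.pdf ~/Desktop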

Thank you, and good luck! For my large test image, running those settings on the command line* produces a result that’s very similar to the basic gs -sDEVICE=pdfwrite, i.e. with no attempt to control the output resolution. The file size is slightly larger (831 kB) and the JPG blockiness is just a bit more pronounced.

Is that related to the lack of control over resolution that you mentioned above, and the fact that Preview reports 72 dpi for the uncompressed PDF?

For the moment I think I'll stick to the various commercial solutions, even though I'm sure they are often just wrappers around Ghostscript and similar.

* I wasn't sure about the slashes in, for example, -dColorImageDownsampleType=/Bicubic, but GS complains if I leave them out.

I found something very similar to what @rkaplan just posted. The interesting parameters are the …ImageResolution ones, I guess. They probably select a DPI value for the embedded images separately from the document's resolution. They seem to originate from Acrobat Distiller, but I couldn't find a proper description yet.


I too have pondered whether and how I should attempt to compress some PDF files already in DT (those with annotations and so on). I've shied away from trying it, but I've been curious whether it's possible, and I find many of the tips and scripts shared here rather intriguing.

Below is the crazy workflow I'm using right now for building PDF files from downloaded images of a multi-volume journal. There's a lot of moving into and out of DT in this process; the bit relevant to this thread is step 5 below. (I'd love to eliminate some of these steps, but don't know if I can.)

  1. Download every page of a desired historical journal volume I wish to have in DT. Catch: the subscription service that digitized the title only allows the pages of a volume to download one at a time, and these download only as .JPG images. I use Keyboard Maestro to automate this process and usually run the macro overnight, leaving several hundred .JPG images in my downloads folder to deal with when the project resumes.

  2. I then sort the downloads folder by Date Created, and sort the project's destination group in DT the same way, then drag and drop the whole glob into DT. There, the images are converted to PDF and merged into a single PDF for each four-issue volume (the automation downloads four volumes per night). A script might handle this convert-and-merge outside DT; see the sketch after this list.

  3. The single-volume PDF then gets split into its four issues. I do this in PDF Expert using its tile view, and extract the PDFs for each issue to a folder on my Desktop.

  4. The split files are once again dragged back into DT for OCR (another process I run overnight, though not on the nights I run the download automation :slight_smile:).

  5. Final step: compression. I’m averaging 140 MB per issue right now, before compression. I too find PDF Squeezer a spectacular product. I like its GUI, because I feel in control with it. I’m using the default “strong” compression to crunch these files to about 50% of their original file size (72 DPI).

This seems to work fine, but it sure is a lot of dragging, deleting and emptying trash to accomplish just a single four-issue volume. I don’t know if there’s a better way, but have enjoyed reading the ideas to consider here.
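Here's the sketch mentioned in step 2: a small script could merge each batch of downloaded images into a PDF before anything touches DT. Untested on my part; it assumes the third-party img2pdf package, and the paths are illustrative.

import glob
import os
import img2pdf

# Gather the night's downloads in the order they arrived (paths illustrative).
pages = sorted(glob.glob("/Users/me/Downloads/volume-12/*.jpg"),
               key=os.path.getmtime)

# img2pdf wraps the JPEGs without re-encoding, one page per image.
with open("/Users/me/Downloads/volume-12.pdf", "wb") as f:
    f.write(img2pdf.convert(pages))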

BTW, a friendly advisory concerning PDF Squeezer: it apparently uses all cores of a CPU, and this can’t be changed in its preferences settings. My old dog of a computer (a 2013 MBP i7, 16 GB RAM) runs its fans full throttle while the app compresses four files simultaneously (this also can’t be adjusted, as far as I can tell).


Would this script be modifiable to use PDF Squeezer? TIA.

I've been trying out this script. Unfortunately it removes all metadata from PDFs captured by DT from webpages (such as the URL and comment), as well as losing the date information (e.g. date added).

There’s also a small error in the script that was adding a trailing space to the filenames. I’ve attached a corrected version.

As a side note, using DT's "Clutter-Free" PDF capture can result in files that are quite large. For example, this NYTimes article results in a 22 MB PDF. Compressing it with the script brings it down to a much more acceptable ~2 MB. Multiply this by hundreds of PDFs and it can save many GB. However, it's essential to retain the captured metadata.

Is there any way to reduce the size of PDFs without losing all of the metadata?

NYTimes pages behind their paywall are notorious for being difficult to copy. I no longer pay much attention to their website and dropped my subscription, so I can't even look at your link.

Most of the time I use the DEVONthink clipper to try Markdown first. Then, if there are images or figures I wish to retain, I convert to PDF; metadata is retained. If I consider the resulting PDF too large, I go back to the original Markdown, edit out all the images I don't want to keep, and convert to PDF again with the retained figures still in place. If further compression is needed, I open it in PDFpen and "optimise" it by downsampling all graphics to 75 dpi. For that last step the metadata is lost, but a script could probably be found or written to copy metadata from one document to another; I never tried.
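Untested, but here is a sketch of a variation on that idea: instead of copying metadata afterwards, compress to a temporary file and write the compressed bytes back over the original. Because the file is rewritten in place rather than replaced, the inode stays the same, so extended attributes such as Finder comments, and the creation date, survive. It assumes Ghostscript on the PATH; I don't know how DEVONthink's own record metadata reacts.

import os
import subprocess
import tempfile

def compress_in_place(pdf_path, dpi="75"):
    # Write the compressed version to a temporary file first.
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    try:
        subprocess.run(
            ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
             "-sDEVICE=pdfwrite", "-dPDFSETTINGS=/screen",
             "-dColorImageResolution=" + dpi,
             "-sOutputFile=" + tmp_path, pdf_path],
            check=True)
        # Truncate and rewrite the original file: same inode, so extended
        # attributes (e.g. Finder comments) and the creation date are kept.
        with open(tmp_path, "rb") as src, open(pdf_path, "wb") as dst:
            dst.write(src.read())
    finally:
        os.unlink(tmp_path)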

One issue I’ve found with what you’re suggesting is that DT’s “Convert to PDF” for Markdown doesn’t retain Finder comments. It retains them for bookmarks… so not sure what’s up there.

But really it’s the last step that I’m looking for a solution for: if I’ve got a PDF with metadata that’s larger than it needs to be (doesn’t matter the source), I’d like to reduce the file size and retain the metadata.

It'd be nice if this were built into DT.

Even better would be some formatting options for the Capture PDF mechanism (in the sorter and in Convert To). In that case I could convert a PDF to a PDF but reformat it using the specified options.