Script to show all documents < 300 dpi

Hey there, is there a script or function that shows all PDFs in my database that are below 300 dpi?

The AppleScript property dpi is limited to images, as PDF documents might contain no, one, or multiple images with different DPIs, even on the same page. Therefore, this would require external tools.

A PDF itself “is” not any number of DPIs. PDF is originally a vector format, internally using a 72dpi coordinate system. As @cgrunenberg pointed out, you can have image(s) in a PDF, which might be digitized using a certain number of dots per inch. But that is not a property of the PDF.
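
To make the arithmetic concrete, here is a minimal sketch, assuming an image is placed at a known width on the page. The function name and the sample numbers are made up for illustration; only the 72 units per inch come from the PDF coordinate system mentioned above.

```python
POINTS_PER_INCH = 72  # PDF default user space: 72 units per inch


def effective_dpi(pixel_width, placed_width_pt):
    """Resolution of an embedded image once it is placed on the page.

    pixel_width: width of the image in pixels
    placed_width_pt: width the image occupies on the page, in points
    """
    placed_width_in = placed_width_pt / POINTS_PER_INCH
    return pixel_width / placed_width_in


# A 2550 px wide scan stretched across a US Letter page (612 pt = 8.5 in)
print(effective_dpi(2550, 612))
# The same page scanned at a lower resolution
print(effective_dpi(1700, 612))
```

The same image placed at a different size would yield a different value, which is why the number belongs to the placed image rather than to the PDF.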

Christian, I understand that. Let me ask you (and @cgrunenberg): what does this setting in the OCR settings mean?

I guess (!) that it’s the same setting as when you scan an image. To perform OCR, the PDF must be rendered as a raster image. And that image has a resolution.

Just an educated guess, though.

Document controls:

  • …
  • PDF Resolution: Set the desired resolution for the image layer in the PDF from 150 to 600 dpi. On M-series Macs, you can also choose As source to retain the originally scanned resolution.

(From p. 243 in the 4.1 manual, Settings > OCR)

Thanks Troej. So my question is: if I do this to a file with images at different DPIs, will everything have that DPI afterwards? (Maybe I’m getting it all wrong.)

An external tool, for example? Can such tools be integrated into DEVONthink?

I don’t know which tool or app does what you want, but I asked X’s AI, Grok, “can i use ghostscript to find PDFs that have images less than 300 dpi”. The answer was “no”, but it suggested an alternative using pdfimages (from the Poppler utilities) and ImageMagick. The suggestion seems reasonable and perhaps worth a try, as it would be free of cost. You could probably script something in AppleScript and/or JavaScript (or another language) to call these two tools in sequence for a file in DEVONthink.
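
A rough, untested-against-DEVONthink sketch of the pdfimages half of that idea, in Python. It assumes Poppler's `pdfimages` is installed and that `pdfimages -list` prints a header row with `page`, `x-ppi` and `y-ppi` columns, as recent Poppler versions do; the function names and the 300 threshold are just placeholders. ImageMagick is not needed for this part.

```python
import subprocess


def parse_pdfimages_list(text):
    """Return (page, x_ppi, y_ppi) tuples from `pdfimages -list` output."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Assumption: the header names the columns "page", "x-ppi" and "y-ppi".
    if not {"page", "x-ppi", "y-ppi"} <= set(header):
        return []
    page_col = header.index("page")
    x_col = header.index("x-ppi")
    y_col = header.index("y-ppi")
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            continue  # skips the dashed separator and unexpected rows
        try:
            rows.append((int(fields[page_col]),
                         float(fields[x_col]),
                         float(fields[y_col])))
        except ValueError:
            continue
    return rows


def lowest_ppi(text):
    """Smallest ppi value of any listed image, or None if there are none."""
    rows = parse_pdfimages_list(text)
    if not rows:
        return None
    return min(min(x, y) for _, x, y in rows)


def pdf_is_below(path, threshold=300):
    """True if any embedded image in the PDF is below `threshold` ppi."""
    result = subprocess.run(["pdfimages", "-list", path],
                            capture_output=True, text=True, check=True)
    low = lowest_ppi(result.stdout)
    return low is not None and low < threshold
```

Something like `pdf_is_below("/path/to/scan.pdf")` could then be called for each file path that a DEVONthink script hands over. Note that PDFs without any images return `False` here, and masks or tiny logos are counted like any other image, so the result probably needs a sanity check on real files.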

Scripting PDFKit is possible, but getting at the PDF content is probably quite messy. And as usual, Apple’s documentation is terse on the topic.

As to @troejgaard’s find, perhaps @bluefrog or @cgrunenberg can comment on that. For example, what would “resolution” mean for a PDF that contains only drawing commands (including text), no images at all? A brief search didn’t reveal any definition of “PDF image layer”.

That’s right. There’s also a setting As Source to retain the original resolution.

I don’t know. PDF is a complex format and I’m by no means an expert.

I assume the most common scenario is a standard scan where each page is a single image and all images are scanned with the same DPI. The explanation in the manual is probably written with that scenario in mind.

Do you have an example file to inspect/test with?

Yes, going directly to PDFKit is scriptable but, as you say, messy. No doubt true. I was thinking of calling the two tools instead, which may take the “messiness” away. Frankly, I don’t know; I haven’t tried. I’m just repeating what I heard, which of course could be a hallucination. But if it were something I wanted to do, I’d try it anyway.

As @cgrunenberg has confirmed, the setting is not about particular images.

The context here is OCR. We have a PDF, which is a set of drawing instructions (very simplified). The OCR software doesn’t give a damn about these instructions; all it cares for are pixels. So, the PDF has to be converted to a pixel image that the OCR software can understand – preferably lossless, so TIFF, PNG or GIF.

What happens behind the scenes is this, I think:

  • the PDF is converted to a bitmap image, using the specified resolution
  • the OCR software “reads” this image and builds text from it.
  • then it outputs a PDF again, containing all the text “written” white on a white background, with the bitmap image on top.
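
To illustrate the first step only: the chosen resolution decides how many pixels the rendered page has. This is a small sketch of that calculation, not a description of what the OCR engine actually does internally; the function name and the US Letter page size are just examples.

```python
POINTS_PER_INCH = 72  # PDF page sizes are given in points


def raster_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a page rendered at the given resolution."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# US Letter (612 x 792 pt) at two of the resolutions the setting offers
print(raster_size(612, 792, 300))
print(raster_size(612, 792, 150))
```

Doubling the resolution quadruples the pixel count, which is presumably why the setting affects both OCR quality and file size.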

If you feed the OCR software a bitmap image (ie TIFF, PNG, JPG, GIF etc), the first step is, of course, not necessary.

If your PDF originated from a scanner, it will only contain a bitmap image. If it was produced by software, it might already contain a text layer. But if it does not, performing OCR on it will probably remove all the original drawing commands from the PDF and leave you with just an invisible text layer under a bitmap image with a certain resolution.

I have no idea what the OP really wants. The title says “documents”, which would include images. But in the text, they are mentioning PDFs explicitly. As a PDF does not have a resolution, the question doesn’t make sense. And all answers here are just speculation.

Sorry, I was talking about documents that I import through the scanner, which are PDF+Text files. Currently I import at 300 dpi, but I used to do it with compression and at 200 dpi, and I want to track down the old imports.

Track them down for what purpose?