Script: Remove empty pages from PDF

chrillek · July 9, 2025, 5:31pm

Perhaps someone will find this useful. The script loops over the currently selected records, skips everything that is not a PDF or has no text layer, and then removes every page that contains no text from the remaining PDFs. So it only works on PDFs that have been OCR'd or have otherwise received a text layer!

I hacked that together because I usually scan two-sided documents, and sometimes the second page is blank. No need to keep that.

(() => {
  ObjC.import('PDFKit');
  const app = Application("DEVONthink");
  const records = app.selectedRecords();
  records.filter(r => r.recordType() === "PDF document" && r.wordCount() > 0).forEach(r => {
    const path = r.path();
    
    /* Build PDFDocument from record's path and get its page count */
    const PDFDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(path)));
    const pageCount = PDFDoc.pageCount;
    const emptyPages = [];
    
    /* Loop over all pages and store the index of those without text
      in array emptyPages */
    for (let i = 0; i < pageCount; i++) {
      const p = PDFDoc.pageAtIndex(i);
      const pageTxt = p.string.js;
      if (!pageTxt) { /* undefined or empty string: no text on this page */
        emptyPages.push(i);
      }
    }
    
    /* emptyPages contains the indices of all pages without text. 
       Remove them from the PDF one by one, adjusting the index */
    emptyPages.forEach((pageNo, i) => {
      PDFDoc.removePageAtIndex(pageNo-i);
    });
    PDFDoc.writeToFile($(path));
  });
})();

As usual, this is JavaScript (not AppleScript). It can be run from Script Editor or on the command line with osascript -l JavaScript filename.js after saving the code to “filename.js”. To use it in Script Editor, copy and paste it into that program and set the language to JavaScript.

The step that does remove the pages is a bit tricky: if one removes page number 2, the numbers of the following pages will be reduced by 1 (3 will become the new 2, etc.). The expression pageNo-i in the call to removePageAtIndex takes care of that.
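To see why pageNo-i is the right index, here is a minimal sketch in plain JavaScript that runs outside DEVONthink (for instance with Node). It assumes an ordinary array stands in for the PDF's pages and splice() stands in for removePageAtIndex; the name removePagesForward is made up for this illustration.

```javascript
// Sketch: an array stands in for the PDF, splice() for removePageAtIndex().
// removePagesForward is a hypothetical helper, not part of PDFKit.
function removePagesForward(pages, emptyPages) {
  const remaining = [...pages]; // work on a copy
  // emptyPages must be sorted ascending, as the script's loop produces it.
  emptyPages.forEach((pageNo, i) => {
    // i pages have already been removed in front of this one, so the page
    // originally at index pageNo now sits at index pageNo - i.
    remaining.splice(pageNo - i, 1);
  });
  return remaining;
}

const pages = ["text A", "", "text B", "", "", "text C"];
const emptyPages = [1, 3, 4]; // original indices of the pages without text
console.log(removePagesForward(pages, emptyPages));
```

Only the three pages with text remain. Without the correction, the second and third removals would hit the wrong pages.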

uimike · July 9, 2025, 10:19pm

Very nice, Christian, and incredibly useful!
A small note: in some of my very (really) old scans, blank pages contain scanning artifacts or splotches, because they were scanned from faxes (believe it or not). In some cases the OCR engine interprets those marks as text of some kind, so your great script does not remove these pages. Well, there’s always an exception, eh?

chrillek · July 10, 2025, 5:39am

If the OCR recognizes such marks as one or more letters, then yes, those pages will not be removed by the code. I can’t think of a way to handle this safely.

cgrunenberg · July 10, 2025, 5:48am

Some heuristics might help, at least for Latin-based languages: very few words or characters, a very small average word length, or more non-ASCII than ASCII characters.
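A minimal sketch of these heuristics in plain JavaScript, runnable outside DEVONthink. The function name looksLikeNoise and both default thresholds are assumptions made for this illustration and would need tuning against real scans; the sample strings are invented.

```javascript
// Sketch of the three heuristics; looksLikeNoise and its default
// thresholds are hypothetical and need tuning against real scans.
function looksLikeNoise(text, { minChars = 5, minAvgWordLength = 2 } = {}) {
  const words = (text || "").trim().split(/\s+/).filter(w => w.length > 0);
  if (words.length === 0) return true; // no text at all

  // All characters of the page, whitespace excluded.
  const chars = [...words.join("")];

  // 1. Very few characters on the whole page.
  if (chars.length < minChars) return true;

  // 2. Very small average word length.
  if (chars.length / words.length < minAvgWordLength) return true;

  // 3. More non-ASCII than ASCII characters.
  const ascii = chars.filter(c => c.codePointAt(0) < 128).length;
  return chars.length - ascii > ascii;
}

console.log(looksLikeNoise("- ■ ."));                                 // true
console.log(looksLikeNoise("Invoice 2025: total amount due 120 EUR")); // false
```

A short but legitimate label can sit close to these thresholds, so a positive result is better treated as a reason to look at the page than as permission to delete it.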

chrillek · July 10, 2025, 7:11am

Rather than removing these pages unconditionally, one could flag them in a separate document. Imagine a page with a drawing labeled „Fig. 1“ and nothing else …
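A rough sketch of that idea in plain JavaScript: instead of deleting anything, it builds a plain-text list of pages that deserve a look. The names isSuspect and buildReviewReport, the maxChars threshold, and the sample data are all assumptions for this illustration. Writing the result into a document is left out.

```javascript
// Sketch: collect pages for manual review instead of deleting them.
// isSuspect, buildReviewReport and maxChars are hypothetical names.
function isSuspect(text, maxChars = 10) {
  // Count non-whitespace characters; short pages are only candidates.
  const count = (text || "").replace(/\s+/g, "").length;
  return count > 0 && count <= maxChars;
}

function buildReviewReport(name, pageTexts) {
  const lines = [];
  pageTexts.forEach((text, index) => {
    if (isSuspect(text)) {
      // Report 1-based page numbers, as a PDF viewer shows them.
      lines.push(`${name}, page ${index + 1}: "${text.trim()}"`);
    }
  });
  return lines.join("\n");
}

const pageTexts = ["Dear Sir or Madam, thank you for your letter.", "Fig. 1", "", "Ã"];
console.log(buildReviewReport("scan.pdf", pageTexts));
```

With this sample data, pages 2 and 4 end up in the list. The page that holds only a figure label is reported but not touched, and the truly empty page 3 is left to the removal script.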

uimike · July 10, 2025, 7:50am

Both are interesting ideas. On the positive side, documents with splotches are uncommon, and if their pages are not removed I am erring on the safe side. The vast majority of docs will have their zero-text pages removed, and the script is awesome!

uimike · July 10, 2025, 7:56am

Yep - for example, this is what one artifact looks like when copied as text: “— - ■”
Another one, however, was read as: “Ã”
And a third one as: “—10”

uimike · July 10, 2025, 8:01am

Christian, now that you are doing this, another script (or a variation) occurred to me:

Select several PDFs, and the script tells you which of them contain blank pages and adds a tag, for example “blank_pages”, to these.

chrillek · July 10, 2025, 9:05am

As they used to say to us in university: left as an exercise for the reader

The changes to the current script are really minimal.

uimike · July 10, 2025, 9:16am

ok, I love the challenge! - but I might come back to you, Master!

uimike · July 10, 2025, 11:42am

OK, here’s my progress so far (pre-pseudocode) - see comment “The loop above…” inside

(() => {
  ObjC.import('PDFKit');
  const app = Application("DEVONthink");
  const records = app.selectedRecords();
  records.filter(r => r.recordType() === "PDF document" && r.wordCount() > 0).forEach(r => {
    const path = r.path();
    
    /* Build PDFDocument from record's path and get its page count */
    const PDFDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(path)));
    const pageCount = PDFDoc.pageCount;
    const emptyPages = [];
    
    /* Loop over all pages and store the index of those without text
      in array emptyPages */
    for (let i = 0; i < pageCount; i++) {
      const p = PDFDoc.pageAtIndex(i);
      const pageTxt = p.string.js;
      if (!pageTxt) { /* undefined or empty string: no text on this page */
        emptyPages.push(i);
      }
    }
	
    /* The loop above serves, of course, as a way of knowing whether there are
       "0 length" pages: if there are none, emptyPages stays empty, right?
       I can disregard the rest of the code below, as I'm not removing anything
       or saving the PDF. But I need to tag the PDF, and I'm still investigating
       how to do that in JS, probably by looking at your other scripts :-) */
    
    /* emptyPages contains the indices of all pages without text. 
       Remove them from the PDF one by one, adjusting the index */
    emptyPages.forEach((pageNo, i) => {
      PDFDoc.removePageAtIndex(pageNo-i);
    });
    PDFDoc.writeToFile($(path));
  });
})();

chrillek · July 10, 2025, 11:55am

Yes, the comment is correct.

uimike · July 10, 2025, 4:19pm

Hi Christian, I think I got it!

(() => {
  ObjC.import('PDFKit');
  const app = Application("DEVONthink");
  const records = app.selectedRecords();
  records.filter(r => r.recordType() === "PDF document" && r.wordCount() > 0).forEach(r => {
    const path = r.path();
    
    /* Build PDFDocument from record's path and get its page count */
    const PDFDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(path)));
    const pageCount = PDFDoc.pageCount;
    const emptyPages = [];
    
    /* Loop over all pages and store the index of those without text
      in array emptyPages */
    for (let i = 0; i < pageCount; i++) {
      const p = PDFDoc.pageAtIndex(i);
      const pageTxt = p.string.js;
      /* modified to just push tag "blank" if there are blank pages */
      if (!pageTxt) {
        emptyPages.push(i);
        r.tags = [...r.tags(), "blank"];
      }
    }
  });
})();

chrillek · July 10, 2025, 4:22pm

Kind of.
There’s no need to have emptyPages at all in this context. And the tag updating would be simpler written as
r.tags = r.tags().concat("blank");

Your version unravels the tags array and builds a new one from it, adding a new element in the process. Not wrong, just a bit convoluted.
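For comparison, a small plain-JavaScript sketch in which an ordinary array stands in for what r.tags() returns. The helper withTag is hypothetical and not part of DEVONthink's scripting interface. It is included because the assignment in the loop runs once per blank page, and this sketch does not assume that duplicate tags get merged automatically.

```javascript
// Sketch in plain JavaScript; an array stands in for what r.tags() returns.
const tags = ["scan", "2025"];

const viaSpread = [...tags, "blank"];   // unpack into a new array literal
const viaConcat = tags.concat("blank"); // ask the array for an extended copy

console.log(viaSpread); // same content as viaConcat
console.log(viaConcat);
console.log(tags);      // the original array is untouched either way

// Hypothetical helper: add the tag only when it is missing, so that several
// blank pages in one PDF do not append "blank" several times.
function withTag(existing, tag) {
  return existing.includes(tag) ? existing : existing.concat(tag);
}

console.log(withTag(viaConcat, "blank")); // still one "blank" entry
```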

uimike · July 10, 2025, 4:46pm

Yes, Sir! Will do my new homework.

Now:

      if (!pageTxt) {
        r.tags = r.tags().concat("blank");
      }

chrillek · July 10, 2025, 5:08pm

Arie · July 14, 2025, 5:12am

A little shorter. No need for emptyPages.

(() => {
  ObjC.import('PDFKit');
  const app = Application("DEVONthink");
  const records = app.selectedRecords();
  records.filter(r => r.recordType() === "PDF document" && r.wordCount() > 0).forEach(r => {
    const path = r.path();
    
    /* Build PDFDocument from record's path and get its page count */
    const PDFDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(path)));
    const pageCount = PDFDoc.pageCount;
    
    /* Loop over the pages from last to first and remove the empty ones.
       Counting down means a removal never shifts a page that is still
       waiting to be checked. */
    for (let i = pageCount - 1; i >= 0; i--) {
      const p = PDFDoc.pageAtIndex(i);
      const pageTxt = p.string.js;
      if (!pageTxt) {
        PDFDoc.removePageAtIndex(i);
      }
    }
    PDFDoc.writeToFile($(path));
  })
})()
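Why counting down needs no index correction can be seen in a small plain-JavaScript sketch, again with an array standing in for the PDF and splice() for removePageAtIndex. Both function names are made up. The counting-up variant is not the original script, which collects the indices first; it only shows the pitfall that counting down avoids.

```javascript
// Sketch: an array stands in for the PDF, splice() for removePageAtIndex().
// Both function names are made up for this comparison.
function removeCountingUp(pages) {
  const remaining = [...pages];
  for (let i = 0; i < remaining.length; i++) {
    // Naive: after a removal the next page slides into index i,
    // but i++ moves past it without checking it.
    if (!remaining[i]) remaining.splice(i, 1);
  }
  return remaining;
}

function removeCountingDown(pages) {
  const remaining = [...pages];
  for (let i = remaining.length - 1; i >= 0; i--) {
    // Removing index i only shifts pages behind it, and those
    // have been checked already.
    if (!remaining[i]) remaining.splice(i, 1);
  }
  return remaining;
}

const pages = ["text A", "", "", "text B"];
console.log(removeCountingUp(pages));   // one empty page survives
console.log(removeCountingDown(pages)); // only the pages with text remain
```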

chrillek · July 14, 2025, 6:24am

Cool, thanks. I keep forgetting about down-counting loops, unfortunately.