Script: Remove empty pages from PDF

Perhaps someone will find this useful. The script loops over the currently selected records, skips everything that is not a PDF or has no text layer, and then removes all pages that contain no text from the remaining PDFs. So it works only on PDFs that have been OCR'd or have otherwise received a text layer!

I hacked that together because I usually scan two-sided documents, and sometimes the second page is blank. No need to keep that.

(() => {
  ObjC.import('PDFKit');
  const app = Application("DEVONthink");
  const records = app.selectedRecords();
  records.filter(r => r.recordType() === "PDF document" && r.wordCount() > 0).forEach(r => {
    const path = r.path();
    
    /* Build PDFDocument from record's path and get its page count */
    const PDFDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(path)));
    const pageCount = PDFDoc.pageCount;
    const emptyPages = [];
    
    /* Loop over all pages and store the index of those without text
      in array emptyPages */
    for (let i = 0; i < pageCount; i++) {
      const p = PDFDoc.pageAtIndex(i);
      const pageTxt = p.string.js;
      if (!pageTxt || pageTxt.length === 0 || pageTxt === '') {
        emptyPages.push(i);
      }
    }
    
    /* emptyPages contains the indices of all pages without text. 
       Remove them from the PDF one by one, adjusting the index */
    emptyPages.forEach((pageNo, i) => {
      PDFDoc.removePageAtIndex(pageNo-i);
    });
    PDFDoc.writeToFile($(path));
  })
})()

As usual, this is JavaScript for Automation (JXA), not AppleScript. You can run it from Script Editor or on the command line with osascript -l JavaScript filename.js after saving the code to “filename.js”. To use it in Script Editor, copy and paste it into that program and set the language selector to JavaScript.

The step that does remove the pages is a bit tricky: if one removes page number 2, the numbers of the following pages will be reduced by 1 (3 will become the new 2, etc.). The expression pageNo-i in the call to removePageAtIndex takes care of that.
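To see why the offset is needed, here is a small sketch that simulates the removal on a plain array instead of a PDFDocument. The helper names removeNaive and removeAdjusted and the sample page labels are made up for illustration; only the pageNo - i expression comes from the script above.

```javascript
// Simulate page removal on a plain array; splice(index, 1) stands in
// for PDFDoc.removePageAtIndex(index). All names here are illustrative.
const pages = ["text A", "", "text B", "", "text C"];
const emptyPages = [1, 3]; // indices collected in ascending order

// Naive: reuse the original indices although earlier removals shift them.
function removeNaive(doc, indices) {
  const copy = [...doc];
  indices.forEach(pageNo => copy.splice(pageNo, 1));
  return copy;
}

// Adjusted: after i removals, every later page has moved i places forward.
function removeAdjusted(doc, indices) {
  const copy = [...doc];
  indices.forEach((pageNo, i) => copy.splice(pageNo - i, 1));
  return copy;
}

console.log(removeNaive(pages, emptyPages));    // loses "text C", keeps an empty page
console.log(removeAdjusted(pages, emptyPages)); // keeps exactly the three text pages
```

The naive version removes the right page first, but its second call hits the page that has moved into index 3 in the meantime. Note that the adjustment only works if the indices are stored in ascending order, which the collecting loop guarantees.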

Very nice, Christian, and incredibly useful!
A small note: in some of my very (really) old scans, blank pages can contain scanning artifacts or splotches left behind because they were scanned from faxes (believe it or not). In some cases the OCR engine will interpret some of those marks as ?something?, so your great script will not remove those pages. Well, there’s always the exception, eh? :slight_smile:

If the OCR recognizes that as one or more letters – yes, that will not be removed by the code. I can’t think of a way to handle this safely.

Some heuristics might help, at least for languages written in Latin script: very few words or characters on the page, a very small average word length, or more non-ASCII than ASCII characters.

Rather than removing these pages unconditionally, one could flag them in a separate document. Imagine a page with a drawing labeled „Fig. 1“ and nothing else …
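A minimal sketch of how those three heuristics could look, written as a plain function that takes the text of one page. The name looksLikeNoise, the option names, and the thresholds are assumptions chosen for illustration, not tested values; in the script, the function would be fed with p.string.js.

```javascript
// Heuristic sketch: decide whether the text of a page is probably just
// OCR noise. Thresholds are arbitrary assumptions, not tested values.
function looksLikeNoise(pageText, { maxChars = 5, minAvgWordLength = 2 } = {}) {
  const text = (pageText || "").trim();
  if (text.length === 0) return true; // truly empty page

  const words = text.split(/\s+/);
  const chars = words.join(""); // page text without whitespace
  const avgWordLength = chars.length / words.length;
  // Letters and digits from the ASCII range versus anything outside ASCII.
  const asciiAlnum = (chars.match(/[A-Za-z0-9]/g) || []).length;
  const nonAscii = (chars.match(/[^\x00-\x7F]/g) || []).length;

  return chars.length <= maxChars ||
         avgWordLength < minAvgWordLength ||
         nonAscii > asciiAlnum;
}

// Strings of the kind OCR might produce from splotches, plus two real texts.
["— - ■", "Ô", "Fig. 1", "Invoice 2021 total due"].forEach(sample => {
  console.log(JSON.stringify(sample), looksLikeNoise(sample));
});
```

With these thresholds, “Fig. 1” is classified as noise as well, which is exactly the case described above. That is the argument for using the result to flag or tag a document for manual review rather than to delete pages.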

Interesting ideas, both. On the positive side, documents with such splotches are less common, and if those pages are not removed, I am erring on the safe side. For the vast majority of docs, the pages with zero text will be removed, and the script is awesome!

Yep, here is an example of what one artifact looks like when copied as text: “— - ■”
Another one, however, was read as “Ô”
And a third one as “—10”

Christian, now that you are at it, a variation of the script occurred to me:

Select several PDFs, and the script finds the ones that contain blank pages and adds a tag, for example “blank_pages”, to them.

As they used to say to us in university: left as an exercise for the reader :winking_face_with_tongue:

The changes to the current script are really minimal.

ok, I love the challenge! - but I might come back to you, Master!

OK, here’s my progress so far (pre-pseudocode) - see comment “The loop above…” inside

(() => {
  ObjC.import('PDFKit');
  const app = Application("DEVONthink");
  const records = app.selectedRecords();
  records.filter(r => r.recordType() === "PDF document" && r.wordCount() > 0).forEach(r => {
    const path = r.path();
    
    /* Build PDFDocument from record's path and get its page count */
    const PDFDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(path)));
    const pageCount = PDFDoc.pageCount;
    const emptyPages = [];
    
    /* Loop over all pages and store the index of those without text
      in array emptyPages */
    for (let i = 0; i < pageCount; i++) {
      const p = PDFDoc.pageAtIndex(i);
      const pageTxt = p.string.js;
      if (!pageTxt || pageTxt.length === 0 || pageTxt === '') {
        emptyPages.push(i);
      }
    }
	
    /* The loop above serves, of course, as a way of knowing whether there are
       "0 length" pages; if there are none, emptyPages would be empty, right?
       I can disregard the rest of the code below, as I'm not removing anything
       or saving the PDF. But I need to tag the PDF, and I'm still investigating
       how to do that in JS, probably by looking at your other scripts :-) */
    
    /* emptyPages contains the indices of all pages without text. 
       Remove them from the PDF one by one, adjusting the index */
    emptyPages.forEach((pageNo, i) => {
      PDFDoc.removePageAtIndex(pageNo-i);
    });
    PDFDoc.writeToFile($(path));
  })
})()

Yes, the comment is correct.

Hi Christian, I think I got it!

(() => {
  ObjC.import('PDFKit');
  const app = Application("DEVONthink");
  const records = app.selectedRecords();
  records.filter(r => r.recordType() === "PDF document" && r.wordCount() > 0).forEach(r => {
    const path = r.path();
    
    /* Build PDFDocument from record's path and get its page count */
    const PDFDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(path)));
    const pageCount = PDFDoc.pageCount;
    const emptyPages = [];
    
    /* Loop over all pages and store the index of those without text
      in array emptyPages */
    for (let i = 0; i < pageCount; i++) {
      const p = PDFDoc.pageAtIndex(i);
      const pageTxt = p.string.js;
      /* modified to just push tag "blank" if there are blank pages */
      if (!pageTxt || pageTxt.length === 0 || pageTxt === '') {
        emptyPages.push(i);
        r.tags = [...r.tags(), "blank"];
      }
    }
  })
})()

Kind of.
There’s no need to have emptyPages at all in this context. And the tag updating would be simpler written as
r.tags = r.tags().concat("blank");

Your version unravels the tags array and builds a new one from it, adding a new element in the process. Not wrong, just a bit convoluted.
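For illustration, here is a sketch with a plain array standing in for the value returned by r.tags(). It shows that both forms build the same new array, and adds a hypothetical helper, addTagOnce, that skips the update when the tag is already present. Whether DEVONthink itself discards duplicate tags is not something this sketch relies on.

```javascript
// A plain array stands in for the result of r.tags(); the values are made up.
const tags = ["scan", "2021"];

const viaSpread = [...tags, "blank"];   // unpack, then build a new array
const viaConcat = tags.concat("blank"); // build a new array in one call

console.log(viaSpread, viaConcat); // same contents in both cases
console.log(tags);                 // the original array is left untouched

// Hypothetical helper: return the tags unchanged if the tag already exists,
// so that several blank pages in one document do not add it repeatedly.
function addTagOnce(existingTags, tag) {
  return existingTags.includes(tag) ? existingTags : existingTags.concat(tag);
}

console.log(addTagOnce(viaConcat, "blank")); // still only one "blank"
```

In the script, the result of such a helper would be assigned back with r.tags = ..., just like the concat version.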

Yes Sir! will do my new homework

Now:

      if (!pageTxt || pageTxt.length === 0 || pageTxt === '') {
        r.tags = r.tags().concat("blank");
      }

:+1:

A little shorter. No need for emptyPages.

(() => {
  ObjC.import('PDFKit');
  const app = Application("DEVONthink");
  const records = app.selectedRecords();
  records.filter(r => r.recordType() === "PDF document" && r.wordCount() > 0).forEach(r => {
    const path = r.path();
    
    /* Build PDFDocument from record's path and get its page count */
    const PDFDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(path)));
    const pageCount = PDFDoc.pageCount;
    
    /* Loop over all pages from last to first and remove the empty ones.
       Counting down means a removal never shifts an index still to be visited */
    for (let i = pageCount - 1; i >= 0; i--) {
      const p = PDFDoc.pageAtIndex(i);
      const pageTxt = p.string.js;
      if (!pageTxt || pageTxt.length === 0 || pageTxt === '') {
        PDFDoc.removePageAtIndex(i);
      }
    }
    PDFDoc.writeToFile($(path));
  })
})()

Cool, thanks. I keep forgetting about down-counting loops, unfortunately.
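As a sketch of why the direction matters when removing inside the loop itself, the following compares an up-counting and a down-counting loop on a plain array. The function names and sample data are invented for illustration; splice(i, 1) again stands in for removePageAtIndex(i).

```javascript
// Two consecutive empty pages expose the problem with counting up.
const pages = ["text A", "", "", "text B", ""];

// Counting up while removing: the element after a removed one slides
// into the current index and is never inspected.
function removeCountingUp(doc) {
  const copy = [...doc];
  for (let i = 0; i < copy.length; i++) {
    if (!copy[i]) copy.splice(i, 1);
  }
  return copy;
}

// Counting down: a removal only shifts elements that were already visited.
function removeCountingDown(doc) {
  const copy = [...doc];
  for (let i = copy.length - 1; i >= 0; i--) {
    if (!copy[i]) copy.splice(i, 1);
  }
  return copy;
}

console.log(removeCountingUp(pages));   // one empty page survives
console.log(removeCountingDown(pages)); // only the text pages remain
```

The down-counting loop needs neither the emptyPages array nor an index adjustment, which is why the script gets shorter.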