Perhaps someone will find this useful. The script loops over the currently selected records, skips everything that is not a PDF or has no text layer, and then removes all pages without any text from the remaining PDFs. So it works only on PDFs that have been OCR’d or have otherwise received a text layer!
I hacked that together because I usually scan two-sided documents, and sometimes the second page is blank. No need to keep that.
(() => {
ObjC.import('PDFKit');
const app = Application("DEVONthink");
const records = app.selectedRecords();
records.filter(r => r.recordType() === "PDF document" && r.wordCount() > 0).forEach(r => {
const path = r.path();
/* Build PDFDocument from record's path and get its page count */
const PDFDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(path)));
const pageCount = PDFDoc.pageCount;
const emptyPages = [];
/* Loop over all pages and store the index of those without text
in array emptyPages */
for (let i = 0; i < pageCount; i++) {
const p = PDFDoc.pageAtIndex(i);
const pageTxt = p.string.js;
/* !pageTxt is true for a missing string as well as for an empty one */
if (!pageTxt) {
emptyPages.push(i);
}
}
/* emptyPages contains the indices of all pages without text.
Remove them from the PDF one by one, adjusting the index */
emptyPages.forEach((pageNo, i) => {
PDFDoc.removePageAtIndex(pageNo-i);
});
PDFDoc.writeToFile($(path));
})
})()
As usual, that is JavaScript (not AppleScript) that can be run from Script Editor or on the command line with osascript -l JavaScript filename.js after having saved the code to “filename.js”. To use it in Script Editor, you must copy/paste it into that program.
The step that removes the pages is a bit tricky: if one removes page number 2, the numbers of all following pages are reduced by 1 (3 becomes the new 2, etc.). The indices in emptyPages are in ascending order, and i counts how many pages have already been removed, so the expression pageNo-i in the call to removePageAtIndex takes care of that.
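To see the index adjustment in isolation, here is a small stand-alone sketch that uses a plain array in place of the PDF. The function names removeAscending and removeDescending are made up for illustration and are not part of PDFKit or the script. The second function shows an alternative that needs no adjustment at all: removing from the highest index downwards.

```javascript
// Stand-in for a PDF: each array element represents one page.

// Mirrors the script: indices are in ascending order, and every removal
// shifts the later pages down by one, hence "pageNo - i".
function removeAscending(pages, emptyPages) {
  const result = pages.slice();
  emptyPages.forEach((pageNo, i) => {
    result.splice(pageNo - i, 1);
  });
  return result;
}

// Alternative: remove from the back, so earlier indices stay valid.
function removeDescending(pages, emptyPages) {
  const result = pages.slice();
  for (let k = emptyPages.length - 1; k >= 0; k--) {
    result.splice(emptyPages[k], 1);
  }
  return result;
}

const pages = ["A", "", "B", "", "", "C"];
const emptyPages = [1, 3, 4]; // indices of the pages without text

console.log(removeAscending(pages, emptyPages));  // [ 'A', 'B', 'C' ]
console.log(removeDescending(pages, emptyPages)); // [ 'A', 'B', 'C' ]
```

Both variants assume that emptyPages is sorted in ascending order, which is what the loop in the script produces.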
Very nice, Christian, and incredibly useful!
A small note: in some of my very (really) old scans, blank pages might contain scanning artifacts or splotches left behind because they were scanned from faxes (believe it or not). In some cases, OCR-ing those results in some marks being interpreted by the OCR engine as text of some kind, so your great script does not remove those pages. Well, there’s always the exception, eh?
Some heuristics might help, at least for Latin-based languages, e.g. very few words/characters, a very small average word length, or more non-ASCII than ASCII characters.
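A rough sketch of how such heuristics could look as a plain JavaScript function that receives the text of one page. The function name looksBlank and both thresholds are assumptions for illustration; they would need tuning against real scans.

```javascript
// Hypothetical thresholds, not tested against real documents.
const MAX_CHARS = 10;          // "very few characters"
const MIN_AVG_WORD_LENGTH = 2; // "very small average word length"

// Returns true if the text of a page looks like OCR noise rather than content.
function looksBlank(pageText) {
  const text = (pageText || "").trim();
  if (text.length === 0) return true;

  const words = text.split(/\s+/);
  const nonSpace = text.replace(/\s/g, "");
  const avgWordLength = nonSpace.length / words.length;

  // Count printable ASCII characters; everything else counts as non-ASCII.
  const asciiCount = (nonSpace.match(/[\x21-\x7E]/g) || []).length;
  const nonAsciiCount = nonSpace.length - asciiCount;

  return nonSpace.length <= MAX_CHARS
    || avgWordLength < MIN_AVG_WORD_LENGTH
    || nonAsciiCount > asciiCount;
}

console.log(looksBlank(". , ' -"));                           // true
console.log(looksBlank("This page contains ordinary text.")); // false
```

In the script, the result of p.string.js would be passed to such a function in place of the plain emptiness check.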
Rather than removing these pages unconditionally, one could flag them in a separate document. Imagine a page with a drawing labeled “Fig. 1” and nothing else …
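One way to sketch that idea: sort the pages into three groups instead of two, so that only pages with no text at all are removed and doubtful ones are merely listed for review. The function name classifyPages and the cut-off value are assumptions for illustration, and what happens with the review list (a separate PDF, a tag, a note) is left open here.

```javascript
// Hypothetical cut-off: pages with at most this many non-whitespace
// characters are set aside for manual review instead of being deleted.
const REVIEW_MAX_CHARS = 10;

// Takes the text of each page and returns the page indices per group.
function classifyPages(pageTexts) {
  const result = { remove: [], review: [], keep: [] };
  pageTexts.forEach((pageText, index) => {
    const chars = (pageText || "").replace(/\s/g, "").length;
    if (chars === 0) {
      result.remove.push(index);   // no text at all
    } else if (chars <= REVIEW_MAX_CHARS) {
      result.review.push(index);   // e.g. a drawing labeled "Fig. 1"
    } else {
      result.keep.push(index);
    }
  });
  return result;
}

const report = classifyPages(["A full page of ordinary text", "", "Fig. 1", "  \n"]);
console.log(report); // { remove: [ 1, 3 ], review: [ 2 ], keep: [ 0 ] }
```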
Interesting ideas, both. On the positive side, these documents (with splotches) are less common, and if these pages are not removed, then I am erring on the safe side - the vast majority of docs will have their zero-text pages removed, and the script is awesome!
OK, here’s my progress so far (pre-pseudocode) - see comment “The loop above…” inside
(() => {
ObjC.import('PDFKit');
const app = Application("DEVONthink");
const records = app.selectedRecords();
records.filter(r => r.recordType() === "PDF document" && r.wordCount() > 0).forEach(r => {
const path = r.path();
/* Build PDFDocument from record's path and get its page count */
const PDFDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(path)));
const pageCount = PDFDoc.pageCount;
const emptyPages = [];
/* Loop over all pages and store the index of those without text
in array emptyPages */
for (let i = 0; i < pageCount; i++) {
const p = PDFDoc.pageAtIndex(i);
const pageTxt = p.string.js;
if (!pageTxt) {
emptyPages.push(i);
}
}
/* The loop above serves of course as a way of knowing if there are "0 length" pages - emptyPages would be empty, right? I can disregard the rest of the code below here, as I'm not removing anything or saving the pdf, but I need to tag the pdf - this is something I'm investigating how to do in js - probably by looking at your other scripts :-) */
/* emptyPages contains the indices of all pages without text.
Remove them from the PDF one by one, adjusting the index */
emptyPages.forEach((pageNo, i) => {
PDFDoc.removePageAtIndex(pageNo-i);
});
PDFDoc.writeToFile($(path));
})
})()
(() => {
ObjC.import('PDFKit');
const app = Application("DEVONthink");
const records = app.selectedRecords();
records.filter(r => r.recordType() === "PDF document" && r.wordCount() > 0).forEach(r => {
const path = r.path();
/* Build PDFDocument from record's path and get its page count */
const PDFDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(path)));
const pageCount = PDFDoc.pageCount;
const emptyPages = [];
/* Loop over all pages and store the index of those without text
in array emptyPages */
for (let i = 0; i < pageCount; i++) {
const p = PDFDoc.pageAtIndex(i);
const pageTxt = p.string.js;
/* modified to just push tag "blank" if there are blank pages */
if (!pageTxt) {
emptyPages.push(i);
r.tags = [...r.tags(), "blank"];
}
}
})
})()
Kind of.
There’s no need to have emptyPages at all in this context. And the tag updating would be simpler written as r.tags = r.tags().concat("blank");
Your version unravels the tags array and builds a new one from it, adding a new element in the process. Not wrong, just a bit convoluted.
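The two spellings can be compared with plain arrays, outside of DEVONthink. The sketch below also includes a helper, withTag, which is a made-up name for illustration: because the tagging line in the script runs once per blank page, the helper adds the tag only if it is missing. Whether DEVONthink would collapse duplicate tags on its own is not something this sketch relies on.

```javascript
const tags = ["scan", "2019"];

// Both expressions build a new array; the original stays untouched.
const viaSpread = [...tags, "blank"];
const viaConcat = tags.concat("blank");

console.log(viaSpread); // [ 'scan', '2019', 'blank' ]
console.log(viaConcat); // [ 'scan', '2019', 'blank' ]
console.log(tags);      // [ 'scan', '2019' ]

// Hypothetical helper: returns a new array that contains the tag exactly once.
function withTag(currentTags, tag) {
  return currentTags.includes(tag) ? currentTags.slice() : currentTags.concat(tag);
}

console.log(withTag(viaConcat, "blank")); // [ 'scan', '2019', 'blank' ]
```

In the script, this would correspond to something like r.tags = withTag(r.tags(), "blank").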