I’ve cobbled together a script that roughly does what you were asking for.
It’s written in JavaScript and makes heavy use of the ObjC bridge and PDFKit. The workhorse is the function countWordsBetweenOutlines, which expects the path to a PDF file as its parameter and returns an object whose keys are the outline labels (only the first level, though). The values of these keys are again objects of the form
wcObjC: number
wcJS: number
wcObjC is the number as determined by the NSString method enumerateSubstringsInRange; wcJS is determined by splitting the text at whitespace characters, filtering out empty strings and counting the number of resulting strings. As these two numbers often differ (wcObjC mostly being the higher one), I decided to provide them both.
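To illustrate why the two counts can diverge, here is a minimal sketch in plain JavaScript (runnable in Node, outside of JXA). countBySplit mirrors the wcJS logic from the script. countBySegmenter is only a stand-in for a word-boundary-based count: it uses Intl.Segmenter, which is an assumption made for illustration and not what the NSString method does internally. Both function names are made up.

```javascript
// Same logic as wcJS in the script: split at whitespace, drop empty strings.
function countBySplit(text) {
  return text.split(/\s+/).filter(t => t !== '').length;
}

// Stand-in for a word-boundary-based count (assumption: Intl.Segmenter is
// available, e.g. in Node 16+). Not identical to NSStringEnumerationByWords.
function countBySegmenter(text) {
  const segmenter = new Intl.Segmenter('en', { granularity: 'word' });
  let count = 0;
  for (const segment of segmenter.segment(text)) {
    // Skip whitespace and punctuation segments.
    if (segment.isWordLike) count++;
  }
  return count;
}

const sample = 'A state-of-the-art method';
console.log(countBySplit(sample), countBySegmenter(sample));
```

Whitespace splitting treats "state-of-the-art" as a single word, while a boundary-based segmenter breaks it at the hyphens, so the second number comes out higher. That is the same direction as the difference described above.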
The self-executing anonymous function at the end of the code simply takes all currently selected DT records, filters out those that are not PDFs and then runs countWordsBetweenOutlines on the rest. The results are written as Markdown to a new record named Word counts in the global inbox.
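The shape of that Markdown can be sketched as a small pure function in plain JavaScript. The helper name formatWordCounts and the numbers in the example are made up for illustration; the script itself builds the same kind of string inline.

```javascript
// Hypothetical helper: turns one record's result object into a Markdown section.
function formatWordCounts(recordName, result) {
  const lines = Object.keys(result).map(
    label => `${label}: ${result[label].wcObjC} / ${result[label].wcJS} words`
  );
  // Heading, then one paragraph per outline label, then a blank line.
  return `## ${recordName}\n\n${lines.join('\n\n')}\n\n`;
}

// Made-up numbers, only to show the shape of the output.
const example = {
  Abstract: { wcObjC: 152, wcJS: 148 },
  Introduction: { wcObjC: 611, wcJS: 598 },
};
console.log(formatWordCounts('Sample.pdf', example));
```

Each PDF gets a second-level heading with its name, followed by one line per first-level outline label showing both counts.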
In most cases, the numbers are in the same range as what wc -w finds. Only for the abstract of the sample document did I see a huge difference between what my script calculates and what wc -w finds for the copy/pasted abstract. I have no idea why that happens, though.
ObjC.import('PDFKit');

function countWordsBetweenOutlines(path) {
  const pdfDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(path)));
  const pdfOutline = pdfDoc.outlineRoot;
  // No outline in this PDF: nothing to count
  if (pdfOutline.js === undefined)
    return;
  // Initialize positions to first outline
  let currentOutline = pdfOutline.childAtIndex(0);
  const startPage = currentOutline.destination.page;
  const startPos = {point: currentOutline.destination.point, page: startPage};
  const result = {};
  /* Loop over the 2nd to the last outline */
  for (let i = 1; i < pdfOutline.numberOfChildren; i++) {
    const nextOutline = pdfOutline.childAtIndex(i);
    const page = nextOutline.destination.page;
    const point = nextOutline.destination.point;
    // Text between the current outline's destination and the next one's
    let txt = pdfDoc.selectionFromPageAtPointToPageAtPoint(startPos.page, startPos.point, page, point).string;
    // Count words via NSString's by-words enumeration
    let wordCount = 0;
    txt.enumerateSubstringsInRangeOptionsUsingBlock($.NSMakeRange(0, txt.length), $.NSStringEnumerationByWords,
      (string, subRange, enclosingRange, stop) => {wordCount++});
    result[currentOutline.label.js] = {
      wcObjC: wordCount,
      wcJS: txt.js.split(/\s+/).filter(t => t !== '').length};
    // The next outline becomes the start of the following range
    startPos.point = point;
    startPos.page = page;
    currentOutline = nextOutline;
  }
  // Note: the text following the last outline's destination is not counted
  return result;
}

(() => {
  const app = Application("DEVONthink 3");
  let text = '';
  // Only process the selected records that are PDFs
  app.selectedRecords().filter(r => r.type() === 'PDF document').forEach(r => {
    const result = countWordsBetweenOutlines(r.path());
    if (result) {
      // One heading per PDF, one line per outline label
      text += `## ${r.name()}\n\n`
        + Object.keys(result).map(k => `${k}: ${result[k].wcObjC} / ${result[k].wcJS} words`).join('\n\n')
        + '\n\n';
    }
  });
  if (text.length) {
    const newRecord = app.createRecordWith({type: "markdown", content: text, name: "Word counts"}, {in: app.incomingGroup()});
  }
})()

Explanation: countWordsBetweenOutlines loops over all first-level elements of the PDF outline, if there is one. It calls selectionFromPage:atPoint:toPage:atPoint:, using the destinations of the current and the next outline as parameters, to find the text between these two outline destinations.
It then calls enumerateSubstringsInRange… on this text to count the words. In addition, it splits the JavaScript version of the text as described above and saves this count, too.
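The pairing of each outline with its successor can be sketched in plain JavaScript with mock data. Everything here is hypothetical: the function name countBetweenOutlines, the sample text, and the character offsets, which merely stand in for the page/point destinations that PDFKit provides in the real script.

```javascript
// Mock version of the loop: each outline has a label and a character offset
// into the text (a stand-in for the page/point destination in a real PDF).
function countBetweenOutlines(text, outlines) {
  const result = {};
  // Pair each outline with its successor; the last one has no successor.
  for (let i = 1; i < outlines.length; i++) {
    const current = outlines[i - 1];
    const next = outlines[i];
    const section = text.slice(current.offset, next.offset);
    result[current.label] = section.split(/\s+/).filter(t => t !== '').length;
  }
  return result;
}

const text = 'Abstract one two Intro three four five End six';
const outlines = [
  { label: 'Abstract', offset: text.indexOf('Abstract') },
  { label: 'Intro', offset: text.indexOf('Intro') },
  { label: 'End', offset: text.indexOf('End') },
];
console.log(countBetweenOutlines(text, outlines));
```

The mock returns entries for Abstract and Intro but none for End, because the loop only covers ranges that lie between two outlines. The script behaves the same way: the text after the last outline's destination does not get an entry.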