DEVONthink - PDF files, word count for a part of the document?

mjannesh · October 28, 2023, 9:18am

Hi
I have a number of articles where I want to count the number of words in individual parts of the article. For example, how many words are used in the ‘Introduction’? How many words are used in the ‘Conclusion’? etc. I can see the total number of words in a file, and I had hoped that highlighting would provide the same information for parts of the file (as in MS Word), but that does not seem to be the case.

DT identifies the headings in the file correctly, so maybe there is a possibility to count the number of words between headings? Hmmm … other suggestions for how to do this will be welcomed!

chrillek · October 28, 2023, 10:07am

What do you mean?

You can select the headings in the PDF?
You can search for the heading text in the PDF?

You could, for example:

Select and copy the text between two headings in the PDF, run wc -w in the terminal, paste the text, and press Ctrl-D
Use a small script in DT that uses the plainText property of the PDF to do what you want
and there might be more ways to do that.
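For the second option, here is a minimal sketch of the counting part, assuming the plain text of the PDF is already available as a string (in DT that would be the record's plainText) and that you supply the heading titles yourself. The function name countWordsBetweenHeadings and the sample text are made up for illustration. It also assumes each title appears verbatim and in document order, which will not hold for every PDF, for instance when the title is repeated in a table of contents.

```javascript
// Hypothetical helper: counts the words between consecutive heading titles
// in a plain-text string. Assumes every title occurs verbatim, in order.
function countWordsBetweenHeadings(plainText, headings) {
  const countWords = s => s.split(/\s+/).filter(t => t !== '').length;

  // Locate each heading, searching only after the previous match.
  let searchFrom = 0;
  const positions = headings.map(h => {
    const pos = plainText.indexOf(h, searchFrom);
    if (pos >= 0) searchFrom = pos + h.length;
    return pos;
  });

  const result = {};
  headings.forEach((h, i) => {
    if (positions[i] < 0) return; // heading not found: skip it
    const start = positions[i] + h.length;
    // The section ends at the next heading that was found, or at the end of the text.
    let end = plainText.length;
    for (let j = i + 1; j < headings.length; j++) {
      if (positions[j] >= 0) { end = positions[j]; break; }
    }
    result[h] = countWords(plainText.slice(start, end));
  });
  return result;
}

const sampleText = 'Introduction\nPDFs are hard to parse.\nConclusion\nCount words carefully.';
console.log(countWordsBetweenHeadings(sampleText, ['Introduction', 'Conclusion']));
// → { Introduction: 5, Conclusion: 3 }
```

Unlike an outline-based approach, this treats everything after the last heading as belonging to that heading.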

But the main problem, it seems to me, is identifying what “the introduction” or “the conclusion” encompasses. Identifying them in arbitrary PDFs might not be easy.

mjannesh · October 28, 2023, 12:55pm

I mean that I can see the headings as entries in the Table of Contents. My thought was that if a heading contained some kind of identifier, then I could ask for the number of words between two headings …

However, the wc -w (paste) ctrl-D works well. I’ll go with that. Thanks!

chrillek · October 28, 2023, 4:37pm

Those are what Apple’s PDFKit calls outline, I think. It might be possible to whip up a script to count words between them. I’ll have a look.

BLUEFROG · October 28, 2023, 5:59pm

You could use Tools > Split PDF > Into Chapters to generate individual PDFs of each chapter. Each chapter would have its own word count.

Also, this AppleScript one-liner will produce a word count for selected text…

tell application id "DNtp" to count (words of (selected text of think window 1))

cgrunenberg · October 29, 2023, 7:52am

This functionality is only available in the plain/rich text and Markdown editor.

chrillek · October 30, 2023, 11:18am

I’ve scrambled together a script that kind of does what you were asking for.

It’s written in JavaScript and makes heavy use of the ObjC bridge and PDFKit. The workhorse is the function countWordsBetweenOutlines that expects the path to a PDF file as its parameter and returns an object whose keys are the outline labels (only the first level, though). The values of these keys are again objects of the form

wcObjC: number
wcJS: number

wcObjC is the count determined by the NSString method enumerateSubstringsInRange; wcJS is obtained by splitting the text at whitespace characters, filtering out empty strings, and counting the resulting strings. As these two numbers often differ (wcObjC mostly being the higher one), I decided to provide both.
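To illustrate how two counting strategies can disagree, here is a small sketch in plain JavaScript. countWordsByWhitespace follows the wcJS description above. countWordsBySegmenter uses Intl.Segmenter as a stand-in for a boundary-based word enumeration; that is an assumption for illustration only, not what the script calls, and NSString may draw its word boundaries differently. Both function names are made up here, and Intl.Segmenter needs a reasonably recent JavaScript engine.

```javascript
// Whitespace-based count, as described for wcJS.
function countWordsByWhitespace(text) {
  return text.split(/\s+/).filter(t => t !== '').length;
}

// Boundary-based count via Intl.Segmenter (illustrative stand-in only,
// not the NSString method used in the script).
function countWordsBySegmenter(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, {granularity: 'word'});
  let count = 0;
  for (const segment of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (segment.isWordLike) count++;
  }
  return count;
}

const phrase = 'A state-of-the-art method';
console.log(countWordsByWhitespace(phrase)); // → 3
console.log(countWordsBySegmenter(phrase));  // usually higher: hyphenated parts count separately
```

Hyphenated terms, which are common in academic PDFs, are one plausible reason for a boundary-based count to come out higher than a whitespace-based one.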

The self-executing anonymous function at the end of the code takes all currently selected DT records, filters out those that are not PDFs, and runs countWordsBetweenOutlines on the rest. The results are written as Markdown to a new record named Word counts in the global inbox.
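For orientation, the Markdown for a single record has roughly the shape produced by this standalone sketch. formatCounts is a hypothetical helper, not part of the script, and the record name and numbers are invented purely to show the layout.

```javascript
// Hypothetical helper mirroring the script's output format:
// a level-2 heading with the record name, then one line per outline label.
function formatCounts(recordName, result) {
  const lines = Object.keys(result)
    .map(k => `${k}: ${result[k].wcObjC} / ${result[k].wcJS} words`);
  return `## ${recordName}\n\n${lines.join('\n\n')}\n\n`;
}

// Invented values, for illustration only.
console.log(formatCounts('Sample article', {
  Abstract: {wcObjC: 152, wcJS: 148},
  Introduction: {wcObjC: 610, wcJS: 598},
}));
```

The first number on each line is the ObjC count, the second the JavaScript count.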

In most cases, the numbers are in the same range as what wc -w finds. Only for the abstract of the sample document did I see a huge difference between what my script calculates and what wc -w of the copy/pasted abstract finds. I have no idea why that happens, though.

ObjC.import('PDFKit');

function countWordsBetweenOutlines(path) {
  const pdfDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(path)));
  const pdfOutline = pdfDoc.outlineRoot;
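  // Bail out if the PDF has no outline or the outline has no entries
  if (pdfOutline.js === undefined || pdfOutline.numberOfChildren < 1)
    return;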
  // Initialize positions to first outline
  let currentOutline = pdfOutline.childAtIndex(0);
  const startPage = currentOutline.destination.page;
  const startPos = {point: currentOutline.destination.point, page: startPage};
  const result = {};
  /* Loop over the 2nd to the last outline */
  for (let i = 1; i < pdfOutline.numberOfChildren; i++) {
    const nextOutline = pdfOutline.childAtIndex(i);
    const page = nextOutline.destination.page;
    const point = nextOutline.destination.point;
    let txt = pdfDoc.selectionFromPageAtPointToPageAtPoint(startPos.page, startPos.point, page, point).string;
    let wordCount = 0;
    txt.enumerateSubstringsInRangeOptionsUsingBlock($.NSMakeRange(0,txt.length),$.NSStringEnumerationByWords,
      (string, subRange, enclosingRange, stop) => {wordCount++});
    result[currentOutline.label.js] = {
      wcObjC: wordCount, 
      wcJS: txt.js.split(/\s+/).filter(t => t !== '').length};
    startPos.point = point;
    startPos.page = page;
    currentOutline = nextOutline;
  }
  return result;
}

(() => {
  const app = Application("DEVONthink 3");
  let text = '';
  app.selectedRecords().filter(r => r.type() === 'PDF document').forEach(r => {
    const result = countWordsBetweenOutlines(r.path());
    if (result) {
      // Note: no semicolon after join(), otherwise the trailing '\n\n' is never appended
      text += `## ${r.name()}\n\n`
      + Object.keys(result).map(k => `${k}: ${result[k].wcObjC} / ${result[k].wcJS} words`).join('\n\n')
      + '\n\n';
    }
  })
  if (text.length) {
    const newRecord = app.createRecordWith({type: "markdown", content: text, name: "Word counts"}, {in: app.incomingGroup()})
  }
})()

Explanation: countWordsBetweenOutlines loops over all first-level elements of the PDF outline, if there is one. It calls selectionFromPage:atPoint:toPage:atPoint:, using the destinations of the current and the next outline as parameters, to find the text between these two outline destinations. Because every section needs a following outline entry as its end point, the last first-level entry gets no count.

It then calls enumerateSubstringsInRange… on this text to count the words. In addition, it splits the JavaScript version of the text as described above and saves this count, too.