Script: Add table of contents to PDF generated from Markdown

chrillek · February 4, 2023, 4:36pm

Some time ago, a forum participant noted that the PDF generated from a Markdown document in DT doesn’t contain a usable table of contents (TOC). That is, the TOC is visible and seems to contain links to the headlines, but clicking on the links doesn’t do anything.

I’m proposing a script to add an invisible TOC as metadata to a PDF document during its conversion from Markdown. As can be expected, the script is written in JavaScript. I’m sure it could be implemented in AppleScript as well, with considerably more typing. Anyway, here it comes:

ObjC.import('PDFKit');
ObjC.import('CoreGraphics'); //Needed only for NSPoint - CGPoint conversion
const UUID = '82E64F1D-6BFE-410D-96F4-ED8CFED0E2F5'; // adjust or modify script to work with selected records
(() => {
  const app = Application("DEVONthink 3")
  const r = app.getRecordWithUuid(UUID);

  /* Bail out if not a Markdown document */
  if (r.type() !== "markdown") {
    console.log("No MD record");
    return;
  } 
  /* Bail out if MD document doesn't contain TOC directive */
  const txt = r.plainText();
  if (!/\{\{TOC\}\}/.test(txt)) {
    console.log("No TOC command in MD file.");
    return;
  }
  /* Find headings in MD file, skipping over code fences.
    The replaceAll() removes all code fences, the matchAll() finds all headlines, and the map() extracts the captured headline,
    i.e. the "#[#…] Heading".
    The array headings then contains only those strings.
  */
  const headings = [... 
    txt.replaceAll(/^```.*?```$/smg,'').
    matchAll(/^(#+\s+.*?)$/smg)].
      map(h => h[1]);

  /* Convert the MD to PDF, get the PDFDocument from it and create the top-level Outline */
  const pdfRecord = app.convert({record: r, to: "PDF document", in: r.locationGroup()});
  const pdfDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(pdfRecord.path())));
  const outline = $.PDFOutline.alloc.init;

  /* Get the text layer of the PDF as JavaScript string */
  const pdfText = pdfDoc.string.js;
  
  /* Initialize some variables to manage TOC entry hierarchy */
  let lastLevel = -1;
  let lastParent = outline;
  let lastSibling = undefined;

  /* Loop over all headlines from the MD document */
  headings.forEach(h => {
    /* Calculate the headline level from the number of leading '#' characters */
    const currentLevel = h.match(/^#+/m)[0].length;
    
    /* Remove the leading hash signs and space(s) from headline (first replace), and
       escape characters with special meaning in regular expressions (2nd replace) */
    const headingText = h.replace(/^#+\s+/,'').replace(/[.*+?^${}()|[\]\\]/g,"\\$&");
    console.log(headingText);
    /* Build a regular expression from the current headline 
    matching a line with only the headline on it, ignoring leading space(s) 
    but taking into account all other characters preceding the headline, like numbers etc. */
    const headingRE = new RegExp(`^\\s*(.*?${headingText})$`,"m");
    /* Find the headline in the PDF as it's printed there. 
    It might be prefixed with characters _not_ in the original one, like numbering */
    const headingInPDF = pdfText.match(headingRE);

    /* If the heading is found ... (well, it should always be, but who knows) */
    if (headingInPDF) {
      /* Search for the textual version of the headline in the PDFDocument to find page and location on page */
      const pdfSelection = pdfDoc.findStringWithOptions($(headingInPDF[1]),0);
      
      /* If the text is found (as it should be) use the first match */
      if (pdfSelection.js.length > 0) {
        const firstSel = pdfSelection.js[0];
        const page = firstSel.pages.js[0]; // PDFPage object!
        const bounds = firstSel.boundsForPage(page); //NSRect object
        /* Calculate the point for the destination a click on the TOC entry is moving to. 
        Use the upper y coordinate and the left x coordinate */
        const pt = $.NSPointFromCGPoint($.NSRectToCGRect(bounds).origin);
        pt.y = $.NSMaxY(bounds);
        
        /* Create a new PDFDestination, i.e. a target for the TOC entry */
        const destination = $.PDFDestination.alloc.initWithPageAtPoint(page, pt);
        const tocEntry = $.PDFOutline.alloc.init;
        tocEntry.destination = destination;
      
        /* Use the heading in the PDF doc as label for the TOC entry */
        tocEntry.label = headingInPDF[1];

        /* Find the appropriate parent PDFOutline to append this TOC entry to */
        const parentOutline = (() => {
          /* Current heading is bigger than last one: append to last one or outline for first heading */
          if (currentLevel > lastLevel) 
            return lastSibling || outline;
          /* Current heading is on same level as last one: append to parent of last sibling */
          if (currentLevel === lastLevel)
            return lastSibling.parent;
          /* Current heading is smaller than last one: move upwards to find matching parent */
          let targetLevel = lastLevel;
          let targetEntry = lastParent || outline;
          while (targetLevel > currentLevel) {
            targetEntry = targetEntry.parent || outline;
            targetLevel--;
          }
          return targetEntry;
        })();
        parentOutline.insertChildAtIndex(tocEntry, parentOutline.numberOfChildren);
        lastLevel = currentLevel;
        lastParent = tocEntry.parent;
        lastSibling = tocEntry;
      }
    }
  });
  /* Save the outline in the PDF document and save the document to disk */
  pdfDoc.setOutlineRoot(outline);
  pdfDoc.writeToFile(pdfRecord.path());
})()

As it stands, the code works with a single Markdown document whose UUID is given at the top. It’s a trivial exercise to transform the script so that it handles a set of selected records. Note that it will bail out if the record passed in is not a Markdown document or doesn’t contain a TOC directive. That seemed reasonable to me, since an MD without a TOC shouldn’t be converted to a PDF with a TOC.
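A minimal sketch of the selected-records variant, under some assumptions: `recordsNeedingToc` is a hypothetical helper name, and `app.selectedRecords()` is assumed to be how JXA exposes DEVONthink’s current selection. The filter only repeats the two bail-out checks from the script; the actual conversion would be the body of the script above, wrapped in a function.

```javascript
// Hypothetical helper: keep only records that are Markdown documents
// and contain the {{TOC}} directive (the script's two bail-out checks).
// It only needs objects with type() and plainText() methods.
function recordsNeedingToc(records) {
  return records.filter(r =>
    r.type() === "markdown" && /\{\{TOC\}\}/.test(r.plainText()));
}

// This part only runs inside JXA (Script Editor or osascript -l JavaScript),
// where the global Application exists.
if (typeof Application !== "undefined") {
  const app = Application("DEVONthink 3");
  // Assumption: selectedRecords() returns the records currently selected in DT.
  recordsNeedingToc(app.selectedRecords()).forEach(r => {
    // Replace this line with a call to the conversion code from the script above.
    console.log(`Would add a TOC to the PDF for: ${r.name()}`);
  });
}
```

Keeping the filter separate from the DEVONthink calls makes it possible to check it with plain objects outside of DT.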

Note: The script is not thoroughly tested, since I lack suitable Markdown documents. I ran it on the source for the “CSS in Markdown” series, and it worked fine there. Also, the TOC added by the script does not appear on the pages of the PDF document itself. It’s accessible as “Table of contents” in DT, Preview and PDFpen; Acrobat Reader shows it in its “Bookmark” section. The code is heavily commented (by my standards), and there’s an explanation of the approach available elsewhere.
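For anyone who wants to test the heading logic without DT or PDFKit, here is a rough sketch in plain JavaScript. It extracts the headings the same way the script does and nests them by level. Note that it uses a stack to find the parent, which is a different technique from the `lastParent`/`lastSibling` bookkeeping in the script, and that `extractHeadings` and `buildOutline` are names made up for this illustration.

```javascript
// Three backticks, built at runtime so this example contains no literal code fence.
const fence = "`".repeat(3);

// Extract ATX headings ("# ...", "## ...") from Markdown, skipping fenced code blocks.
function extractHeadings(markdown) {
  const fenceRE = new RegExp("^" + fence + ".*?" + fence + "$", "smg");
  const withoutFences = markdown.replaceAll(fenceRE, "");
  return [...withoutFences.matchAll(/^(#+)\s+(.*?)$/mg)]
    .map(m => ({ level: m[1].length, text: m[2] }));
}

// Nest the headings: each one becomes a child of the closest preceding
// heading with a smaller level, or of the root if there is none.
function buildOutline(headings) {
  const root = { level: 0, text: "(root)", children: [] };
  const stack = [root];
  for (const h of headings) {
    // Pop entries on the same or a deeper level; what remains on top is the parent.
    while (stack[stack.length - 1].level >= h.level) stack.pop();
    const node = { level: h.level, text: h.text, children: [] };
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return root;
}
```

The resulting tree has the same shape the PDFOutline hierarchy is meant to have, so it can serve as a reference when trying the script on documents that skip heading levels.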

Room for improvement

It would, of course, be nice if the visible TOC in the PDF (which is there by default when converting from MD) were clickable, with each entry leading to the appropriate place in the document. I’ll see if I can figure that out.

cgrunenberg · February 22, 2023, 7:19am

This is planned for future releases, but things can actually get quite complicated as soon as a document uses the same headline multiple times or when the headline also appears many times in the body text. Just have a look at the readme of the Dropbox SDK for such an example.
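To illustrate the repeated-headline problem, here is a sketch in plain JavaScript that works on the extracted PDF text only. It is one possible mitigation, not what DEVONthink does or plans to do: each headline is searched for starting after the previous match, so a second “Intro” is not resolved to the first one. `escapeRegExp` and `locateHeadings` are hypothetical names, and mapping the returned string offsets back to PDF selections is left out.

```javascript
// Escape all characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Find each heading on a line of its own, allowing a prefix such as numbering,
// and continue every search where the previous match ended.
function locateHeadings(pdfText, headings) {
  let from = 0;
  return headings.map(text => {
    const re = new RegExp(`^[ \\t]*(.*?${escapeRegExp(text)})$`, "mg");
    re.lastIndex = from;            // start searching after the previous match
    const m = re.exec(pdfText);
    if (!m) return { text, label: null, index: -1 };
    from = m.index + m[0].length;   // the next heading must come later in the text
    return { text, label: m[1], index: m.index };
  });
}
```

This still fails when body text contains a full line identical to a later headline, or when the headings in the PDF appear in a different order than in the Markdown source, so it only narrows the problem.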

Amontillado · March 17, 2024, 3:28pm

I needed - er, wanted - a way to navigate a PDF that didn’t have a TOC. It’s not exactly the same thing, obviously, but a bullet list of PDF page links in the attachment file for the PDF worked for what I needed.

Attachment files - meta-metadata. Very cool things.

BLUEFROG · March 17, 2024, 3:31pm

Annotation files?

Amontillado · March 17, 2024, 3:34pm

Yes, quite so. I typed in haste. I do that. Often.