Retaining TOC from Markdown document after PDF conversion

AW2307 · March 18, 2023, 6:54pm

Hi,

I have a workflow where I regularly convert structured Markdown documents to PDF.

However, while the headlines from the Markdown document are displayed in the Table of Contents inspector, the same inspector is empty for the PDF that resulted from the conversion.

It would be useful if the Table of Contents from the Markdown documents could be retained in the PDFs.

Does anyone know a way to accomplish this from within DEVONthink, without using third-party applications?

chrillek · March 18, 2023, 7:07pm

That’s a known problem. Also, you can’t click on the lines in the visible toc in the PDF – they look like links but take you nowhere.

I posted a script here some weeks ago that tries to remedy the situation. It works for TOCs created with the {{TOC}} directive in MD.

AW2307 · March 18, 2023, 7:20pm

Thanks @Chrillek.

I saw those proof-of-concept scripts and they seem promising. However, due to the number of Markdown documents I need to convert, only something that could be run via a smart rule on a group’s contents would be feasible.

For me personally, it would not be trivial to modify the scripts accordingly because I am inexperienced with JavaScript.

If it would be trivial for you, I would greatly appreciate a smart-rule-capable version of the script you posted recently.

And my guess is that others here might find it useful as well.

chrillek · March 18, 2023, 7:37pm

Something like this… The check that the record is an MD document is not needed if the smart rule only works on MD files anyway.

ObjC.import('PDFKit');
ObjC.import('CoreGraphics'); //Needed only for NSPoint - CGPoint conversion

function performsmartrule(records) {
  const app = Application("DEVONthink 3")
  records.forEach(r => {
  /* Bail out if not a Markdown document */
  if (r.type() !== "markdown") {
    console.log("No MD record");
    return;
  } 
  /* Bail out if MD document doesn't contain TOC directive */
  const txt = r.plainText();
  if (!/\{\{TOC}}/.test(txt)) {
    console.log("No TOC command in MD file.");
    return;
  }
  /* Find headings in MD file, skipping over code fences.
    The replaceAll() removes all code fences, the matchAll() greps all headlines and the map() extracts the captured headline,
    i.e. the "#[#…] Heading".
    The array headings then contains only those strings.
  */
  const headings = [... 
    txt.replaceAll(/^```.*?```$/smg,'').
    matchAll(/^(#+\s+.*?)$/smg)].
      map(h => h[1]);

  /* Convert the MD to PDF, get the PDFDocument from it and create the top-level Outline */
  const pdfRecord = app.convert({record: r, to: "PDF document", in: r.locationGroup()});
  const pdfDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(pdfRecord.path())));
  const outline = $.PDFOutline.alloc.init;

  /* Get the text layer of the PDF as JavaScript string */
  const pdfText = pdfDoc.string.js;
  
  /* Initialize some variables to manage TOC entry hierarchy */
  let lastLevel = -1;
  let lastParent = outline;
  let lastSibling = undefined;

  /* Loop over all headlines from the MD document */
  headings.forEach(h => {
    /* Calculate the headline level from the number of leading '#' characters */
    const currentLevel = h.match(/^#+/m)[0].length;
    
    /* Remove the leading hash signs and space(s) from the headline (first replace), and
       escape all characters with special meaning in regular expressions (second replace) */
    const headingText = h.replace(/^#+\s+/,'').replace(/[.*+?^${}()|[\]\\-]/g,"\\$&");
   
    /* Build a regular expression from the current headline 
    matching a line with only the headline on it, ignoring leading space(s) 
    but taking into account all other characters preceding the headline, like numbers etc. */
    const headingRE = new RegExp(`^\\s*(.*?${headingText})$`,"m");
    /* Find the headline in the PDF as it's printed there. 
    It might be prefixed with characters _not_ in the original one, like numbering */
    const headingInPDF = pdfText.match(headingRE);

    /* If the heading is found ... (well, it should always be, but who knows) */
    if (headingInPDF) {
      /* Search for the textual version of the headline in the PDFDocument to find page and location on page */
      const pdfSelection = pdfDoc.findStringWithOptions($(headingInPDF[1]),0);
      
      /* If the text is found (as it should be) use the first match */
      if (pdfSelection.js.length > 0) {
        const firstSel = pdfSelection.js[0];
        const page = firstSel.pages.js[0]; // PDFPage object
        const bounds = firstSel.boundsForPage(page); //NSRect object
        /* Calculate the point for the destination a click on the TOC entry is moving to. 
        Use the upper y coordinate and the left x coordinate */
        const pt = $.NSPointFromCGPoint($.NSRectToCGRect(bounds).origin);
        pt.y = $.NSMaxY(bounds);
        
        /* Create a new PDFDestination, i.e. a target for the TOC entry */
        const destination = $.PDFDestination.alloc.initWithPageAtPoint(page, pt);
        const tocEntry = $.PDFOutline.alloc.init;
        tocEntry.destination = destination;
      
        /* Use the heading in the PDF doc as label for the TOC entry */
        tocEntry.label = headingInPDF[1];

        /* Find the appropriate parent PDFOutline to append this TOC entry to */
        const parentOutline = (() => {
          /* Current heading is deeper than the last one: append to the last one, or to the outline for the first heading */
          if (currentLevel > lastLevel) 
            return lastSibling || outline;
          /* Current heading is on same level as last one: append to parent of last sibling */
          if (currentLevel === lastLevel)
            return lastSibling.parent;
          /* Current heading is shallower than the last one: move upwards to find the matching parent */
          let targetLevel = lastLevel;
          let targetEntry = lastParent || outline;
          while (targetLevel > currentLevel) {
            targetEntry = targetEntry.parent || outline;
            targetLevel--;
          }
          return targetEntry;
         })()
        parentOutline.insertChildAtIndex(tocEntry, parentOutline.numberOfChildren);
        lastLevel = currentLevel;
        lastParent = tocEntry.parent;
        lastSibling = tocEntry;
      }
    }
  })
  /* Save the outline in the PDF document and save the document to disk */
  pdfDoc.setOutlineRoot(outline);
  pdfDoc.writeToFile(pdfRecord.path());
}) // forEach
} // function performsmartrule
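
The parts of the script that do not touch PDFKit (finding the headings and deciding which entry becomes whose parent) can be tried in plain JavaScript outside DEVONthink. The following is only a rough sketch under a few assumptions: the sample document and the names `extractHeadings` and `buildOutline` are made up for illustration, and the nesting uses a stack of open ancestors instead of the `lastLevel`/`lastParent`/`lastSibling` bookkeeping in the smart rule script. It builds plain objects, not `PDFOutline` instances.

```javascript
// Hypothetical sample document, made up for this sketch.
const sampleMarkdown = [
  "{{TOC}}",
  "# Intro",
  "```",
  "# a comment inside a code fence, not a heading",
  "```",
  "## Details",
  "# Outro",
].join("\n");

/* Return [{level, text}] for every ATX heading outside of code fences. */
function extractHeadings(markdown) {
  // Drop fenced code blocks so that '#' comments inside them are ignored.
  const withoutFences = markdown.replace(/^```.*?^```$/gms, "");
  return [...withoutFences.matchAll(/^(#+)[ \t]+(.*?)[ \t]*$/gm)].map(m => ({
    level: m[1].length, // number of leading '#' characters
    text: m[2],
  }));
}

/* Nest the flat heading list using a stack of currently open ancestors. */
function buildOutline(headings) {
  const root = { level: 0, text: "(root)", children: [] };
  const stack = [root];
  for (const h of headings) {
    // Close every open entry at the same or a deeper level than this heading.
    while (stack.length > 1 && stack[stack.length - 1].level >= h.level) {
      stack.pop();
    }
    const node = { level: h.level, text: h.text, children: [] };
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return root;
}

console.log(JSON.stringify(buildOutline(extractHeadings(sampleMarkdown)), null, 2));
```

For the sample input this yields "Intro" with "Details" nested beneath it, followed by "Outro" at the top level. A stack also copes with skipped levels (for example a `###` directly under a `#`), which may be worth checking against the real script.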

AW2307 · March 18, 2023, 8:02pm

Much appreciated!

This is a good solution until there is (hopefully) native support for converting the Markdown document’s outline into one that can be displayed in the PDF’s Table of Contents inspector.

chrillek · March 19, 2023, 10:46am

If you’re using that regularly: could you let me know, please, if something goes awry? The code is not really well tested, so there might be cases where it fails.
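
One case that may be worth checking is a heading containing characters with a special meaning in regular expressions, such as parentheses, dots or square brackets, because the heading text is turned into a pattern before it is searched for in the PDF. A minimal sketch of how that could be tested in plain JavaScript follows; `escapeRegExp`, `headingPattern` and the sample text are hypothetical and only for illustration.

```javascript
/* Escape every character that has a special meaning in a regular expression. */
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

/* Build a line-anchored pattern that allows a prefix (e.g. numbering) before the heading. */
function headingPattern(headingText) {
  return new RegExp(`^\\s*(.*?${escapeRegExp(headingText)})$`, "m");
}

// Hypothetical PDF text layer in which the renderer prefixed the heading with a number.
const samplePdfText = "1.1 Setup (macOS) [v2.0]\nSome body text";
const found = samplePdfText.match(headingPattern("Setup (macOS) [v2.0]"));
console.log(found ? found[1] : "heading not found");
```

Without escaping, the parentheses would be read as a capture group and the heading would not be found.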

AW2307 · March 19, 2023, 11:14am

It worked on my first test. The smart rule script is clearly an improvement over having no TOC at all; however, I may ultimately need to resort to a workflow involving third-party solutions.

My goal is to be able to view the PDF’s contents and navigate them via the outline, just as is already possible with Markdown documents.

cgrunenberg · March 20, 2023, 7:03am

A future release will support this.

AW2307 · March 20, 2023, 7:15am

You guys are incredible! Thank you.