JavaScript: Preserving the TOC through OCR – a shorter solution

About two years ago, @clang proposed an AppleScript script that preserves a PDF's table of contents through OCR.

I took the liberty of rewriting it in JavaScript. The result is shorter and, unlike the original, neither runs external programs nor creates temporary files outside of DT.

ObjC.import('PDFKit');
(() => {
  const app = Application("DEVONthink 3");
  const records = app.selectedRecords();

  /* Loop over all selected PDF documents */
  records.filter(r => r.type() === 'PDF document').forEach(r => {
    const path = r.path();

    /* get a PDFDocument object for the old PDF */
    const PDFDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(path)));

    /* get the old PDF's TOC ("outlineRoot" in Apple parlance) */
    const TOC = PDFDoc.outlineRoot;

    /* OCR the record. The timeout was enough for a 544-page book */
    const newRecord = app.ocr(path, { file: path, to: r.locationGroup(), waitingForReply: true },
      { timeout: 1200 });
    const newPath = newRecord.path();

    /* get the PDFDocument of the OCRd record */
    const newPDFDoc = $.PDFDocument.alloc.initWithURL($.NSURL.fileURLWithPath($(newPath)));

    /* Set the new PDFDocument's TOC to the old one's */
    newPDFDoc.outlineRoot = TOC;

    /* Persist the new PDFDocument */
    newPDFDoc.writeToFile($(newPath));

    /* Uncomment the next line to delete the old record */
    // app.delete(r);
  })
})()
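
A possible refinement, which is not part of the script above: copy the outline only when the original PDF actually has one and the OCR result lacks one. This is a minimal sketch. The helper names `hasOutline` and `copyOutlineIfMissing` are made up for illustration, and it assumes that a missing `outlineRoot` shows up in JXA either as `undefined`/`null` or as a bridged object whose `isNil()` returns true.

```javascript
// Hypothetical helper, not part of the original script.
// Treats undefined, null, and a bridged nil (an object whose isNil()
// returns true) as "this document has no outline".
function hasOutline(pdfDoc) {
  const root = pdfDoc.outlineRoot;
  if (root === undefined || root === null) return false;
  if (typeof root.isNil === 'function' && root.isNil()) return false;
  return true;
}

// Hypothetical helper: copy the outline from oldDoc to newDoc only if
// oldDoc has one and newDoc does not.
// Returns true when newDoc was modified and should be written to disk.
function copyOutlineIfMissing(oldDoc, newDoc) {
  if (!hasOutline(oldDoc) || hasOutline(newDoc)) return false;
  newDoc.outlineRoot = oldDoc.outlineRoot;
  return true;
}
```

Inside the loop, the assignment and the write could then become `if (copyOutlineIfMissing(PDFDoc, newPDFDoc)) newPDFDoc.writeToFile($(newPath));`, so that the OCRed file is only rewritten when there is something to restore.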

The latest version of the OCR engine actually retains the table of contents. Do you use Intel or Apple Silicon Macs?

I’m on Apple silicon. And this was just a programming exercise for me – the few PDFs with a TOC that I have always come with a text layer.

What’s the difference between the two? My desktop is Intel-based but my laptop is Apple Silicon.

Are you suggesting that OCR in DT gives better results on an Apple Silicon machine?

Not at all. Just the usual questions to help reproduce things.
