Accessing PDFKit from JXA

I am planning to write a script that needs the text of a PDF separately for each page of the document.

Am I correct that I cannot do this solely with the DT3 scripting library?

If I use PDFKit to do this, is it possible to access PDFKit from JXA without needing to load a Node.js package as a dependency?

If I do not go that route, I believe this could be done by scripting a third-party PDF app that can access text on a per-page basis. I believe PDFpenPro and Adobe Acrobat can do that, but not Preview or the free Adobe Reader. Is either of these preferred over the other?

AFAIK: Yes. DT3 can only return the complete text of the PDF, not separately for each page.

Yes, using the ObjC bridge. A very rough sketch:

ObjC.import('PDFKit');
const file = "...";
const fileURL = $.NSURL.fileURLWithPath($(file)); /* use $() to convert from JS string to NSString */
const pdfDoc = $.PDFDocument.alloc.initWithURL(fileURL);
const pageCount = pdfDoc.pageCount;
for (let i = 0; i < pageCount; i++) {
  const page = pdfDoc.pageAtIndex(i); /* page indices start at 0 */
  const txtNS = page.string; /* this is an NSString! */
  const txtJS = txtNS.js; /* this is a JavaScript string! */
}
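
If you want the per-page strings collected in one array, the loop can be wrapped in a small helper. This is only a sketch along the same lines: `pageTexts` and `textsForFile` are names made up for illustration, and it does not handle a file that fails to open or a page without a text layer.

```javascript
// Collect the text of every page as plain JavaScript strings.
// `doc` is assumed to behave like a PDFDocument seen through the JXA
// ObjC bridge: a numeric `pageCount` and a `pageAtIndex(i)` method.
function pageTexts(doc) {
  const texts = [];
  for (let i = 0; i < doc.pageCount; i++) {
    const page = doc.pageAtIndex(i);
    // page.string is an NSString; .js unwraps it to a JavaScript string.
    texts.push(page.string.js);
  }
  return texts;
}

// Hypothetical entry point. It only works under osascript (JXA),
// where the ObjC and $ globals exist.
function textsForFile(path) {
  ObjC.import('PDFKit');
  const fileURL = $.NSURL.fileURLWithPath($(path));
  const pdfDoc = $.PDFDocument.alloc.initWithURL(fileURL);
  return pageTexts(pdfDoc);
}
```

In a script you would then call `textsForFile(file)` with the same path as above and index the result by page, starting at 0.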

Not tested at all, but it should work along these lines. For more details, check out

Excellent, that works perfectly. And that seems like a good reference to understand further.

Huge thanks.