How to set data property of record to PDF document?

Hi,
I seem to remember a thread dealing with the task
record.data = <PDFDocument> (in JavaScript parlance) or
set record's data to <PDFDocument> (in AppleScript?). Unfortunately, I can’t find that thread anymore. If some kind soul could point me to it (@pete31, perhaps?), I’d be grateful.

In the meantime: I tried this

const pdfData = $.PDFDocument.alloc.initWithData(decodedData);
const record = app.createRecordWith({name: filename, type: 'pdf'});
record.data = pdfData;

(the whole script is a bit too long, and I’m only interested in this part here). That code should (I think) create a new PDF record with the PDFDocument created in the first line as content. However, the record doesn’t show anything, and its size is 0.

Now, if I write the decodedData to a file and import that into DT, everything is fine (i.e. I get a PDF record, and it has the right content). Also of interest: the number of pages in pdfData is 1, as it should be. So it seems that the PDFDocument produced by initWithData is valid. Shouldn’t it then be possible to assign it to the data property of the record?

Thanks. But I’m still stuck. Here’s what I’m trying to do:

  • take an EML file stored in DT
  • for all the attachments in it (well, PDF, images and HTML)
    • create a new record in DT containing just the attachment

Everything works ok if I write the attachment data to a file and then import that. But the direct way, i.e. creating a new record and assigning to its data property, gives only empty records. The binary attachment data (i.e. after decoding from Base64) is stored in decodedData. And either

  • record.data = decodedData, or
  • record.data = $.PDFDocument.alloc.initWithData(decodedData)
result in an empty record.

What does work, though, is copying the data property from one record to another one (as described in the quoted thread).

It seems that data() gives a string representation (something like “****($…”)). If that is what the data property expects, too, one can obviously not use a PDFDocument nor decodedData, though.

It doesn’t; these string representations used by JXA are not actually supported. The complete source might be useful, as record.data = decodedData should work.

That’s a bit long… Anyway, you’ll have to replace the UUID with that of an EML record containing at least one PDF attachment.

ObjC.import('Foundation');
ObjC.import('PDFKit');
/* Associate Content-type with a DT record type. This is currently 
only used to weed out unsupported types */
const typeFromMIME = {
  'application/pdf': 'PDF Document',
  'image/jpeg': 'image',
  'image/jpg' : 'image',
  'image/png' : 'image',
  'image/tiff': 'tiff',
  'text/html' : 'html'
};

(() => {
  const app = Application("DEVONthink 3");
  app.includeStandardAdditions = true;
  /* For testing: get the filesystem path of a fixed DT record */
  const path = app.getRecordWithUuid("%3C4B9A77CE-DC90-4917-822D-377BE19325A0@bru6.de%3E").path();
  
  /* Receives an NSError if reading the file fails */
  const error = $();
  /* Read the content of the record into an NSString object, return a JavaScript string */
  const content = $.NSString.stringWithContentsOfFileEncodingError($(path), $.NSUTF8StringEncoding, error).js;
  
  
  /* Collect all boundary strings declared in the message */
  const boundaries = [... content.matchAll(/boundary="?(.*?)"?;?\n/g)];
  if (boundaries.length < 1) {
    /* Without a boundary there is nothing to split, so stop here */
    console.log(`No boundary found in EML`);
    return;
  }
  
  /* Build a regular expression to match all boundaries */
  const allBoundaries = boundaries.map(b => b[1]).join('|');
  const boundaryRE = new RegExp(`^--(${allBoundaries})?\n`,'ms');
  
  /* Split the content at the boundaries. */
  const parts = content.split(boundaryRE);
  
  /* parts now contains all the message, i.e. body & attachments. Loop over them */
  parts.forEach((p,i) => {
    
    /* Split the current part at empty lines, i.e. at two subsequent newlines */
    const subparts = p.split(`\n\n`);
    
    /* Split the first part of the current part into lines, store them in header */
    const header = subparts[0].split(`\n`);
    
    /* Save the main part of the current part in body */
    const body = subparts[1];
    
    /* Handle attachments: the first element of the header must contain a Content-Disposition: */
    if (/Content-Disposition: (inline|attachment);/.test(header[0])) {
      
      /* Get the header lines with the raw filename and MIME types */
      const filenameRaw = header.filter(h => /filename=/.test(h))[0];
      const mimeTypeRaw = header.filter(h => /Content-Type:/.test(h))[0];
      
      /* convert raw filename and MIME type to the correct strings */
      const filename = filenameRaw.match(/filename="?([^"]*)"?/)[1];
      const mimeType = mimeTypeRaw.match(/: (.*)?;/)[1];
      
      /* Get DT's record type corresponding to the current MIME type */
      const DTtype = typeFromMIME[mimeType];
      if (!DTtype || DTtype !== 'PDF Document') {
        /* ignore all attachments with unsupported MIME types */
        console.log(`mimetype ${mimeType} not supported`);
        return; 
      }
      
      /* Decode the body of the attachment into an NSData object. 
      Remove the last boundary first, otherwise the decode will fail */
      const decodedData = $.NSData.alloc.initWithBase64EncodedStringOptions($(body.replace(/^--.*--$/m,"")), $.NSDataBase64DecodingIgnoreUnknownCharacters);
      
      const PDFDoc =  $.PDFDocument.alloc.initWithData(decodedData);
      const record = app.createRecordWith({name: filename, type: DTtype});
      record.data = decodedData; // Gives an empty PDF 
      record.data = PDFDoc; // doesn't work either
      
      return;
    }
  })
})()
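
The script above builds its delimiter regex directly from the boundary strings and splits on `\n\n`. Boundaries may legally contain characters such as `+`, `.`, `?` or parentheses, and some EML files use CRLF line endings, so a more defensive split might look like the following sketch. The function names `escapeRegExp` and `splitMimeParts` are hypothetical and not part of the original script. The sketch does not unfold continued header lines, and it treats nested multiparts as one flat list, just like the original.

```javascript
/* Escape characters that have a special meaning inside a regular expression. */
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

/* Split a raw message into its MIME parts, each as {headers, body}.
   Returns an empty array if the message declares no boundary. */
function splitMimeParts(raw) {
  /* Normalize CRLF so that an empty line is always "\n\n" */
  const content = raw.replace(/\r\n/g, '\n');

  /* Collect every boundary string declared in the message */
  const names = [...content.matchAll(/boundary="?([^";\n]+)"?/gi)].map(m => m[1]);
  if (names.length === 0) return [];

  /* A delimiter line is "--" + boundary, optionally followed by "--" for the last one */
  const alternatives = names.map(escapeRegExp).join('|');
  const delimiter = new RegExp(`^--(?:${alternatives})(?:--)?[ \\t]*$`, 'm');

  return content.split(delimiter)
    .slice(1)                               /* drop the message header and preamble */
    .map(part => part.replace(/^\n/, ''))   /* drop the newline ending the delimiter line */
    .filter(part => part.trim() !== '')
    .map(part => {
      const sep = part.indexOf('\n\n');     /* the first empty line ends the part header */
      const head = sep === -1 ? part : part.slice(0, sep);
      const body = sep === -1 ? '' : part.slice(sep + 2);
      return { headers: head.split('\n'), body };
    });
}
```

Unlike `p.split('\n\n')`, this keeps the whole body of a part together even if the body itself contains empty lines.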

Does the decoding definitely work? I just saw this in the Console:

2022-07-26 16:15:41.501 DEVONthink 3[96489:3759813] setData (DTRecord <2CF1BC43-2FBD-4BDA-8157-AC820050A093> (/Test.pdf/)): Invalid image.

Just had a closer look, the data received by DEVONthink contains only 60 bytes over here whereas it should be 5 MB. It’s somewhat similar to the stuff here…

http://mail.machomeautomation.com/pipermail/xtensionlist/2016-June/008168.html

…containing dle2, reco or usrflist too but useless for DEVONthink.

As I said above: Writing decodedData to a file results in the correct PDF. And PDFDocument.pageCount returns 1, which is also correct. Both seem to indicate (to me, that is) that decodedData is in fact a working binary PDF.

I’m not sure what the error message means and where the file name comes from in it. I’ve tested it here with a single EML containing two PDF attachments. It’s conceivable that the splitting etc. is not working correctly for all EMLs, though. If body is too small (i.e. less than 5MB), that’s an indication that something before didn’t work.
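
To check whether the splitting delivered a complete attachment before anything is handed to DEVONthink, the Base64 text itself can be inspected. The following is only a sketch with hypothetical helper names. It assumes that `body` holds nothing but the Base64 text of one attachment (no trailing boundary line) and that the PDF starts with the usual `%PDF-` header followed by a version digit.

```javascript
/* The Base64 encoding of "%PDF-" followed by a version digit starts with "JVBERi0". */
function looksLikeBase64Pdf(body) {
  return body.replace(/\s+/g, '').startsWith('JVBERi0');
}

/* Number of bytes a Base64 string decodes to, ignoring line breaks. */
function expectedDecodedLength(body) {
  const clean = body.replace(/\s+/g, '');
  /* Each trailing "=" stands for one byte that is not there */
  const padding = (clean.match(/=+$/) || [''])[0].length;
  return Math.floor(clean.length * 3 / 4) - padding;
}
```

In the script, the value of `expectedDecodedLength(body)` could be compared with `decodedData.length`. If both match and are in the expected range (about 5 MB here), the splitting and decoding are probably fine and the problem lies in handing the data over to DEVONthink.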

I’ll just send you my EML file for testing, ok?

Sure. Although I assume that it’s more likely another weird JXA conversion issue and therefore writing the file works, setting the data doesn’t.

Probably.

Same results with your email. A valid NSData object but just containing 60 bytes. But I’m not sure if it’s really JXA related or just because the AppleScript suite is not really intended for Objective-C objects created via AppleScript or JXA.

And I’m just getting a zero-byte PDF from this script.

That’s exactly what I’m seeing here, too. But (see above): When I write the data to a file (with the Foundation framework stuff, that is), I get a working PDF.

That might well be the case. So let’s just ignore it – at least there’s a workaround by writing the stuff to a file and importing that.
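
The thread does not show the workaround itself, so here is a minimal sketch of it. It assumes that `decodedData` is the NSData object from the script, that DEVONthink 3's JXA `import` command accepts a path and a `to` destination, and that it is acceptable to leave the temporary file for the system to clean up. The function name `importViaTempFile` is hypothetical.

```javascript
/* JXA only: the Objective-C bridge ($, ObjC) exists in osascript and Script Editor */
if (typeof ObjC !== 'undefined') ObjC.import('Foundation');

/* Write an NSData object to a temporary file and import that file into DEVONthink.
   app:         the DEVONthink application object
   decodedData: NSData holding the decoded attachment
   filename:    name of the attachment, used for the temporary file
   group:       destination group for the new record */
function importViaTempFile(app, decodedData, filename, group) {
  /* Build <temporary directory>/<filename> */
  const tmpPath = $.NSTemporaryDirectory().stringByAppendingPathComponent($(filename)).js;

  /* writeToFile:atomically: returns false if the file could not be written */
  if (!decodedData.writeToFileAtomically($(tmpPath), true)) {
    throw new Error(`Could not write ${tmpPath}`);
  }

  /* Assumption: import returns the newly created record */
  return app.import(tmpPath, {to: group});
}
```

In the script above, a call like `importViaTempFile(app, decodedData, filename, app.currentGroup())` would take the place of the `createRecordWith` call and the two `record.data` assignments.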