Custom metadata import

mbriordan · December 8, 2023, 7:42pm

I use scripting (Python) to set various properties of a file based on where it appears on a website. I currently save these as custom metadata on the PDF, using the same names as my custom metadata fields in DEVONthink. Could this metadata be imported with the file (I can’t work out how), or would I need to write an AppleScript to do it?

cgrunenberg · December 9, 2023, 10:22am

Writing a script is the only possibility. By the way, which properties do you actually set and how do you retrieve them?

chrillek · December 9, 2023, 11:18am

This description is so general that it’s nearly impossible to see what you’re doing exactly. It seems that you have files “appearing” on a website (which site? what kind of files?). And then you are saving “it” (what is “it” here – the files? their appearance? the website?) as custom metadata on a PDF – how do you do that? With a program (which one?), with a script (the same that you use to “set various properties of a file”? another one?)?

While it is possible to add custom metadata to a PDF, Apple’s PDFKit framework does not seem to be able to do that (nor read the custom metadata). It only offers the usual set of standard metadata (title, producer, creation date etc.). Nor does there seem to be a way to access an XMP part of a PDF with this framework.
In my opinion, your best bet would be to retrieve the metadata with the same environment you use to set them and then add them as custom metadata to the DT record after creating it. You can employ PyDT3 (search the forum for it) if you want to use Python.

PS If you really want to suffer, you could perhaps(!) use AppleScript/JavaScript-Objective-C-Bridging and the Core Graphics/Quartz 2D PDF functions. They are even less documented than PDFKit, but perhaps you can loop over the PDF’s catalog… At least, Core Graphics can write XML metadata (but there doesn’t seem to be a simple way to read them – bugger).

mbriordan · December 9, 2023, 12:05pm

Sorry I was vague: I am using Python and PyPDF to save data to PDF metadata as I download them from various sites to my computer. Embedding this data in the files seems natural to me since then I can access this data whenever I work with the files without having to save separate JSON, CSVs or whatever.


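from datetime import datetime

# Assumption: the pypdf package; older installs use "from PyPDF2 import ..."
from pypdf import PdfReader, PdfWriter


def set_pdf_metadata(filepath, organization, date_of_document):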
    with open(filepath, 'rb') as file:
        reader = PdfReader(file)
        writer = PdfWriter()

        # Copy over the existing document to the writer
        writer.clone_document_from_reader(reader)

        # Ensure metadata values are strings
        organization_str = str(organization) if organization else ''
        date_of_document_str = date_of_document.strftime("%Y-%m-%d") if isinstance(date_of_document, datetime) else str(date_of_document)

        # Set custom metadata
        metadata = {
            '/org': organization_str,
            '/dateofdocument': date_of_document_str
        }
        writer.add_metadata(metadata)

        # Save the PDF with new metadata
        with open(filepath, 'wb') as f:
            writer.write(f)
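
With more than two fields, the string conversion in the function above could be factored into a small helper. This is only a sketch: `build_info_entries` is a hypothetical name, not part of PyPDF, and it assumes the same convention of Info-dictionary keys with a leading slash. Unlike the function above, it maps only `None` to an empty string, not every falsy value.

```python
from datetime import date, datetime


def build_info_entries(fields):
    """Turn a plain dict into PDF Info-dictionary entries (hypothetical helper).

    Keys get the leading '/' used for PDF names; values become strings,
    with dates formatted as YYYY-MM-DD and None mapped to ''.
    """
    entries = {}
    for name, value in fields.items():
        if isinstance(value, (datetime, date)):
            text = value.strftime("%Y-%m-%d")
        elif value is None:
            text = ""
        else:
            text = str(value)
        # Accept names with or without the leading slash
        entries["/" + name.lstrip("/")] = text
    return entries
```

The result could then be passed to `writer.add_metadata(...)` in place of the hand-written dictionary.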

The names of the metadata fields match the metadata fields I’ve set in DEVONthink. My assumption had been that DT would be able to read custom PDF metadata as it would read standard PDF metadata, hence my question about whether I needed to use AppleScript. I have now managed to do this with AppleScript and ExifTool:


tell application "Finder"
	set pdfFolder to choose folder with prompt "Select the folder with PDF files"
end tell

tell application id "DNtp"
	set currentGroup to current group
	
	tell application "Finder"
		set pdfFiles to every file of pdfFolder whose name extension is "pdf"
	end tell
	
	repeat with thisPDF in pdfFiles
		set pdfPath to the POSIX path of (thisPDF as alias)
		
		-- Read both custom tags as one tab-separated line (-T = table output)
		set metadataCommand to "/opt/homebrew/bin/exiftool -s -T -Org -Dateofdocument " & quoted form of pdfPath
		set extractedMetadata to do shell script metadataCommand
		
		-- Manually split the extracted metadata at the tab
		set org to my extractFieldFromMetadata(extractedMetadata, 1)
		set dateofdocument to my extractFieldFromMetadata(extractedMetadata, 2)
		
		-- Import the PDF to the current group in DEVONthink
		set theDoc to import pdfPath to currentGroup
		
		-- Set the custom metadata for theDoc, mapping "Org" to "organization"
		set custom meta data of theDoc to {organization:org, dateofdocument:dateofdocument}
	end repeat
end tell

-- Helper function to manually extract a field from tab-delimited metadata
on extractFieldFromMetadata(metadata, fieldNumber)
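	-- Remember the current delimiters so they can be restored after splitting
	set oldDelimiters to AppleScript's text item delimiters
	set AppleScript's text item delimiters to tab
	set metadataParts to text items of metadata
	set AppleScript's text item delimiters to oldDelimiters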
	if fieldNumber ≤ (count of metadataParts) then
		return item fieldNumber of metadataParts
	else
		return ""
	end if
end extractFieldFromMetadata

Next time I’ll consider PyDT3: much of my document processing is done in Python.
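
For reference, the ExifTool step of the script above could be sketched in Python as well. This is a hedged illustration, not a tested replacement: the function names are hypothetical, the exiftool path and tag names are simply copied from the AppleScript, and the hand-off to DEVONthink is left out. ExifTool's `-T` option prints `-` for a tag that is missing from the file, so the sketch maps that placeholder to an empty string (a genuine value of `-` cannot be told apart from a missing tag).

```python
import subprocess


def parse_exiftool_row(row, tags):
    """Map one tab-separated `exiftool -T` row onto the requested tag names."""
    values = row.rstrip("\n").split("\t")
    # Pad short rows so that every requested tag gets a value
    values += [""] * (len(tags) - len(values))
    # exiftool -T prints '-' for a tag that is absent from the file
    return {tag: ("" if value == "-" else value)
            for tag, value in zip(tags, values)}


def read_custom_metadata(pdf_path, tags=("Org", "Dateofdocument"),
                         exiftool="/opt/homebrew/bin/exiftool"):
    """Run exiftool on one PDF and return its tag values as a dict."""
    command = [exiftool, "-T", *(f"-{tag}" for tag in tags), pdf_path]
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return parse_exiftool_row(result.stdout, tags)
```

Passing the command as a list avoids shell quoting, which is what `quoted form of` handles in the AppleScript.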

chrillek · December 9, 2023, 1:03pm

Probably nicer than AppleScript. I took a look at what exiftool does, and it seems they are accessing the PDF catalog directly (and with Perl!). Amazing.

Aside: Where is that date_of_document coming from? I’m asking because PDF provides internal metadata for creation and modification date, and DT has a document date property as well. So, maybe that date_of_document is not needed?

mbriordan · December 9, 2023, 2:00pm

ExifTool is a great tool written in Perl; it is on GitHub as exiftool/exiftool (“ExifTool meta information reader/writer”).

This was more for testing - in reality the metadata I use is more complex: various dates and other info extracted from websites and PDFs using BeautifulSoup, pdfplumber, regex and so on.

chrillek · December 16, 2023, 5:21pm

Thanks for having provided me with a PDF sample. I came up with a JavaScript version of your script that does not rely on any external tool:

ObjC.import("CoreGraphics");
(() => {
  const customMetadataKeys = ['org', 'addeddate'];
  const app = Application('DEVONthink 3');
  app.selectedRecords().filter(r => r.type() === 'PDF document').forEach(r => {
    /* Create a CGPDFDocument from the PDF file */
    const PDFfile = r.path();
    const NSURL = $.NSURL.fileURLWithPath($(PDFfile));
    const CGPDFDoc = $.CGPDFDocumentCreateWithURL(NSURL);
    
    // Get the information dictionary
    const infoDict = $.CGPDFDocumentGetInfo(CGPDFDoc);
    // Skip this record if no info dict exists
    if (!infoDict || $.CGPDFDictionaryGetCount(infoDict) === 0) {
    //  console.log(`${r.name()}…skipped`);
      return;
    }
    /* Loop over all entries in information dictionary and add them
       to the record's custom metadata if the key exists in 'customMetadataKeys'.
       All values must be strings. 
       If a key contains 'date', the value is converted to a 'Date'.
    */
    $.CGPDFDictionaryApplyBlock(infoDict, (key, val, info) => {
      const stringVal = $();
      const isString = $.CGPDFDictionaryGetString(infoDict, key, stringVal);
      // Only handle wanted keys whose value really is a PDF string
      if (isString && customMetadataKeys.includes(key)) {
        const rawValue = $.CGPDFStringGetBytePtr(stringVal);

        // Add to DT custom metadata, possibly converting a date first
        // Note that the dates have a space between the date and the time.
        // It's replaced with a 'T' here to make the date string ISO-compliant
        const DTvalue = /date/.test(key) ? new Date(rawValue.replace(' ','T')) : rawValue;
        app.addCustomMetadata(DTvalue, {for: key, to: r});
      }
    }, null);
  })
})()

It works on the currently selected records in DT, filtering out all non-PDF ones, and relies on the names of the custom metadata being defined in customMetadataKeys as strings. Custom metadata names in the PDF and in DT need to be identical, though that restriction could easily be lifted by using an object mapping names instead of an array.
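
The name mapping mentioned above might look like the following. It is sketched in Python, the language the original poster uses for processing, and not in the script's JavaScript; `FIELD_MAP` and `map_info_to_custom_metadata` are hypothetical names, and the example mapping of `org` to `organization` is taken from the AppleScript earlier in the thread.

```python
# Hypothetical mapping: PDF Info key -> DEVONthink custom metadata name
FIELD_MAP = {"org": "organization", "dateofdocument": "dateofdocument"}


def map_info_to_custom_metadata(info, field_map=FIELD_MAP):
    """Keep only the mapped Info entries and rename them for DEVONthink."""
    mapped = {}
    for pdf_key, value in info.items():
        name = pdf_key.lstrip("/")  # Info keys may carry a leading slash
        if name in field_map:
            mapped[field_map[name]] = value
    return mapped
```

Entries that are not listed in the mapping, such as the standard Producer or Title, are simply ignored.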

All metadata in the PDF must be stored in the Info dictionary (that seems to be what the Python lib does, anyway) and must be strings (that’s a requirement of the standard). If the name of a custom metadata field contains the string date (as in date_of_document), the corresponding value from the Info dictionary is converted to a Date before it is added to the record.
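
The date rule can be illustrated in isolation. The sketch below is a Python rendering of the same idea, not the script's code: `to_custom_metadata_value` is a hypothetical name, and it assumes date values are written as `YYYY-MM-DD` or `YYYY-MM-DD HH:MM:SS`, i.e. with a space between date and time.

```python
from datetime import datetime


def to_custom_metadata_value(key, raw):
    """Return a datetime for keys containing 'date', else the raw string."""
    if "date" not in key:
        return raw
    # Replace the first space with 'T' so the string is ISO 8601 compliant
    return datetime.fromisoformat(raw.replace(" ", "T", 1))
```

A value that is not in one of the assumed formats raises `ValueError`, so a real import script would probably want to catch that and fall back to the raw string.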

This approach requires neither an external tool nor parsing of its output, making it quite straightforward (and a tad more compact).