I use scripting (Python) to set various properties of a file based on where it appears on a website. I am currently saving them as custom metadata on a PDF, with the same names as my custom metadata fields in DEVONthink. Could this be imported with the file (I can’t work out how), or would I need to write an AppleScript to do it?
That’s the only possibility. By the way, which properties do you actually set and how do you retrieve them?
This description is so general that it’s nearly impossible to see what you’re doing exactly. It seems that you have files “appearing” on a website (which site? what kind of files?). And then you are saving “it” (what is “it” here – the files? their appearance? the website?) as custom metadata on a PDF – how do you do that? With a program (which one?), with a script (the same that you use to “set various properties of a file”? another one?)?
While it is possible to add custom metadata to a PDF, Apple’s PDFKit framework does not seem to be able to do that (nor read the custom metadata). It only offers the usual set of standard metadata (title, producer, creation date, etc.). Nor does there seem to be a way to access an XMP part of a PDF with this framework.
In my opinion, your best bet would be to retrieve the metadata with the same environment you use to set them and then add them as custom metadata to the DT record after creating it. You can employ PyDT3 (search the forum for it) if you want to use Python.
PS If you really want to suffer, you could perhaps(!) use AppleScript/JavaScript-Objective-C-Bridging and the Core Graphics/Quartz 2D PDF functions. They are even less documented than PDFKit, but perhaps you can loop over the PDF’s catalog… At least, Core Graphics can write XML metadata (but there doesn’t seem to be a simple way to read them – bugger).
Sorry I was vague: I am using Python and PyPDF to save data to PDF metadata as I download them from various sites to my computer. Embedding this data in the files seems natural to me since then I can access this data whenever I work with the files without having to save separate JSON, CSVs or whatever.
from datetime import datetime
# Assumed import: the post only says "PyPDF"; PdfReader and PdfWriter are the pypdf class names
from pypdf import PdfReader, PdfWriter

def set_pdf_metadata(filepath, organization, date_of_document):
    with open(filepath, 'rb') as file:
        reader = PdfReader(file)
        writer = PdfWriter()
        # Copy over the existing document to the writer
        writer.clone_document_from_reader(reader)
        # Ensure metadata values are strings
        organization_str = str(organization) if organization else ''
        date_of_document_str = date_of_document.strftime("%Y-%m-%d") if isinstance(date_of_document, datetime) else str(date_of_document)
        # Set custom metadata (keys in the Info dictionary start with '/')
        metadata = {
            '/org': organization_str,
            '/dateofdocument': date_of_document_str
        }
        writer.add_metadata(metadata)
        # Save the PDF with new metadata, overwriting the original file
        with open(filepath, 'wb') as f:
            writer.write(f)
The names of the metadata fields match the metadata fields I’ve set in DEVONthink. My assumption had been that DT would be able to read custom PDF metadata just as it reads standard PDF metadata, hence my question about whether I needed to use AppleScript. I have now managed to do this with AppleScript and ExifTool:
tell application "Finder"
    set pdfFolder to choose folder with prompt "Select the folder with PDF files"
end tell
tell application id "DNtp"
    set currentGroup to current group
    tell application "Finder"
        set pdfFiles to every file of pdfFolder whose name extension is "pdf"
    end tell
    repeat with thisPDF in pdfFiles
        set pdfPath to the POSIX path of (thisPDF as alias)
        -- Ask exiftool for both custom tags as a single tab-delimited line
        set metadataCommand to "/opt/homebrew/bin/exiftool -s -T -Org -Dateofdocument " & quoted form of pdfPath
        set extractedMetadata to do shell script metadataCommand
        -- Manually split the extracted metadata at the tab
        set org to my extractFieldFromMetadata(extractedMetadata, 1)
        set dateofdocument to my extractFieldFromMetadata(extractedMetadata, 2)
        -- Import the PDF to the current group in DEVONthink
        set theDoc to import pdfPath to currentGroup
        -- Set the custom metadata for theDoc, mapping "Org" to "organization"
        set custom meta data of theDoc to {organization:org, dateofdocument:dateofdocument}
    end repeat
end tell
-- Helper function to manually extract a field from tab-delimited metadata
on extractFieldFromMetadata(metadata, fieldNumber)
    -- Remember the current delimiters so they can be restored afterwards
    set savedDelimiters to AppleScript's text item delimiters
    set AppleScript's text item delimiters to tab
    set metadataParts to text items of metadata
    set AppleScript's text item delimiters to savedDelimiters
    if fieldNumber ≤ (count of metadataParts) then
        return item fieldNumber of metadataParts
    else
        return ""
    end if
end extractFieldFromMetadata
Next time I’ll consider PyDT3: much of my document processing is done in Python.
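For anyone who wants to stay in Python, the tab-splitting helper from the AppleScript could be sketched roughly like this. It is only a sketch: the function name parse_exiftool_row and the sample values are made up for illustration, and it assumes that exiftool -T prints one tab-separated row per file, with a lone "-" where a tag is missing.

```python
def parse_exiftool_row(row, field_names):
    """Split one tab-delimited row of `exiftool -T` output into a dict.

    Fields missing at the end of the row become '', like the
    AppleScript helper. Assumption: exiftool prints '-' for a tag
    the file does not have, which is treated as empty here.
    """
    parts = row.rstrip("\r\n").split("\t")
    result = {}
    for index, name in enumerate(field_names):
        value = parts[index] if index < len(parts) else ""
        result[name] = "" if value == "-" else value
    return result


if __name__ == "__main__":
    # Hypothetical row, standing in for the output of the exiftool call
    row = "ACME Corp\t2023-05-01\n"
    print(parse_exiftool_row(row, ["organization", "dateofdocument"]))
```

The resulting dict uses the DEVONthink field names as keys, so it can be handed to whatever sets the custom metadata on the record.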
Probably nicer than AppleScript. I took a look at what exiftool does, and it seems they are accessing the PDF catalog directly (and with Perl!). Amazing.
Aside: Where is that date_of_document coming from? I’m asking because PDF provides internal metadata for creation and modification date, and DT has a document date property as well. So, maybe that date_of_document is not needed?
ExifTool is a great tool written in Perl: GitHub - exiftool/exiftool: ExifTool meta information reader/writer
This was more for testing - in reality the metadata I use is more complex: various dates and other info extracted from websites and PDFs using BeautifulSoup, pdfplumber, regex and so on.
Thanks for having provided me with a PDF sample. I came up with a JavaScript version of your script that does not rely on any external tool:
ObjC.import("CoreGraphics");
(() => {
  const customMetadataKeys = ['org', 'addeddate'];
  const app = Application('DEVONthink 3');
  app.selectedRecords().filter(r => r.type() === 'PDF document').forEach(r => {
    /* Create a CGPDFDocument from the PDF file */
    const PDFfile = r.path();
    const NSURL = $.NSURL.fileURLWithPath($(PDFfile));
    const CGPDFDoc = $.CGPDFDocumentCreateWithURL(NSURL);
    // Get the information dictionary
    const infoDict = $.CGPDFDocumentGetInfo(CGPDFDoc);
    // Skip this record if no info dict exists
    if (!infoDict || $.CGPDFDictionaryGetCount(infoDict) === 0) {
      // console.log(`${r.name()}…skipped`);
      return;
    }
    /* Loop over all entries in information dictionary and add them
       to the record's custom metadata if the key exists in 'customMetadataKeys'.
       All values must be strings.
       If a key contains 'date', the value is converted to a 'Date'
    */
    $.CGPDFDictionaryApplyBlock(infoDict, (key, val, info) => {
      const stringVal = $();
      const isString = $.CGPDFDictionaryGetString(infoDict, key, stringVal);
      if (customMetadataKeys.includes(key)) {
        const rawValue = $.CGPDFStringGetBytePtr(stringVal);
        // Add to DT custom metadata, possibly converting a date first
        // Note that the dates have a space between the date and the time.
        // It's replaced with a 'T' here to make the date string ISO-compliant
        const DTvalue = /date/.test(key) ? new Date(rawValue.replace(' ','T')) : rawValue;
        app.addCustomMetadata(DTvalue, {for: key, to: r});
      }
    }, null);
  })
})()
It works on the currently selected records in DT, filtering out all non-PDF ones, and relies on the names of the custom metadata being defined in customMetadataKeys as strings. Custom metadata names in the PDF and in DT need to be identical, though that restriction could easily be lifted by using an object mapping names instead of an array.
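Since much of the processing here happens in Python anyway, this is roughly what such a name mapping could look like there. Treat it as a sketch under assumptions: map_metadata and the sample keys are hypothetical, and the leading "/" on the PDF keys follows the way the Python code earlier in the thread writes them.

```python
def map_metadata(info, key_map):
    """Return {DT field name: value} for every mapped key found in info.

    info    -- metadata read from the PDF, e.g. {'/org': 'ACME Corp'}
    key_map -- hypothetical mapping from PDF key to DEVONthink field name
    Keys that are not listed in key_map are ignored.
    """
    return {dt_name: info[pdf_key]
            for pdf_key, dt_name in key_map.items()
            if pdf_key in info}


if __name__ == "__main__":
    key_map = {"/org": "organization", "/dateofdocument": "dateofdocument"}
    info = {"/org": "ACME Corp", "/Producer": "some producer"}
    print(map_metadata(info, key_map))
```

With a mapping like this, the PDF can keep a short key such as org while the DT field is called organization, which is what the AppleScript version does by hand.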
All metadata in the PDF must be stored in the Info dictionary (that seems to be what the Python lib does, anyway) and must be strings (that’s a requirement of the standard). If the name of a custom metadata field contains the string date (as in date_of_document), the corresponding value from the Info dictionary is converted to a Date before it is added to the record.
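The same date rule could be sketched in Python as follows. This is an illustration only: convert_value is a made-up name, the check is case-insensitive (unlike the regular expression in the script), and falling back to the raw string when parsing fails is my own assumption, not something the script does.

```python
from datetime import datetime


def convert_value(key, raw):
    """Return a datetime for keys containing 'date', else the raw string.

    Assumption: if the value cannot be parsed as an ISO date, the raw
    string is returned unchanged instead of raising an error.
    """
    if "date" not in key.lower():
        return raw
    try:
        # 'YYYY-MM-DD HH:MM:SS' becomes 'YYYY-MM-DDTHH:MM:SS'
        return datetime.fromisoformat(raw.strip().replace(" ", "T", 1))
    except ValueError:
        return raw


if __name__ == "__main__":
    print(convert_value("addeddate", "2023-05-01 12:30:00"))
    print(convert_value("org", "ACME Corp"))
```

Values stored as a plain date such as 2023-05-01 are parsed as well and end up as midnight of that day.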
This approach requires neither an external tool nor parsing of its output, which makes it quite straightforward (and a tad more compact).