PDF Metadata parse keywords script help

I know there are a couple of other threads on this, and I’ve tried to wade through them as I transition to DT from Paperless. The upgrade to Tahoe has left my main computer unable to run Paperless any longer.
I’ll try to keep this generic so perhaps it will help more than just my unique use case.

I have PDFs that have had metadata written to them. In the OSX inspector the Keywords have been written and include name=value items, separated by commas. I have no control over how these get formatted so I can’t delimit in any custom way for parsing later.

I would like to work through these with AppleScript and transfer some or all of what is contained in these Keywords to DT. Ideally I would do this with a smart rule on import but I’m open to doing it other ways too if it makes sense.

The problem is that one of the names is “tags”, and its value is itself a comma-separated list such as “value 1,value 2”, so simply splitting the keywords on commas breaks the tags apart.

Here is an example:

subcategory=Tax, imported=9/6/25, tags=Tax 2025,Charitable Contribution, amount=$500.00, category=Personal, date=9/6/25, vendor=MBA Music Donation

(emphasis added)

Ideally I would assign category, subcategory, date, vendor (to Name), amount (to a custom data field), and parse the tags to DT tags.

Can someone help me script this?

Welcome @altoid
Check this out…

Thanks, I’ve already waded through this thread. Although I have a fairly decent background in programming, I’ve not done anything with AppleScript, so I get lost in the syntax. I guess I just need to wade in a little deeper and work through it by trial and error. I have done a bit of text parsing with Perl in the past, but not with AppleScript.

What do you need to wade into? Have you tried the script?

  1. Select a PDF with properties in DEVONthink.
  2. Press the Open in Script Editor button in the forum post.
  3. Press the Compile button then the Run button in Script Editor.
  4. Check the custom metadata in the Info > Data inspector.

I think that DT mangles the (badly conceived) keywords when it imports the data: what was tags=Tax 2025,Charitable Contribution becomes tags=Tax 2025 and Charitable Contribution – keywords are comma-separated in DT. I tried it here: in Preview, I could add a keyword like tags=t1,t2, which then appeared in DT as tags=t1 and t2, i.e. two separate keywords.

Your best choice would be to get a PDFDocument object from the file (in AS-ObjC) and retrieve its documentAttributes dictionary. The Keywords entry in it is what one would expect, i.e. tags=t1,t2.
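Once the unmangled Keywords string is available, separating the pairs is plain string handling. Here is a minimal sketch in JavaScript, not anything DEVONthink or PDFKit provides: the function name parseKeywordString is made up for illustration, and it assumes that any comma-separated piece without an = sign continues the value of the preceding name. That assumption fails if a continuation piece itself contains an = sign.

```javascript
// Hypothetical helper: split a raw Keywords string into name/value pairs.
// Assumption: a comma-separated piece without "=" belongs to the previous name.
function parseKeywordString(raw) {
  const collected = {};
  let currentKey = null;

  for (const piece of raw.split(',')) {
    const text = piece.trim();
    if (text === '') continue;

    const eq = text.indexOf('=');
    if (eq > 0) {
      // A new name=value pair starts here; only the first "=" separates them.
      currentKey = text.slice(0, eq).trim();
      if (!collected[currentKey]) collected[currentKey] = [];
      collected[currentKey].push(text.slice(eq + 1).trim());
    } else if (currentKey !== null) {
      // No "=": continuation of the previous value (e.g. a second tag).
      collected[currentKey].push(text);
    } else {
      // Bare keyword before any pair: treat it as a tag.
      if (!collected.tags) collected.tags = [];
      collected.tags.push(text);
    }
  }

  // Keep tags as a list; re-join everything else with the comma that was split away.
  const result = {};
  for (const key of Object.keys(collected)) {
    result[key] = key === 'tags' ? collected[key] : collected[key].join(',');
  }
  return result;
}

const sample = 'subcategory=Tax, imported=9/6/25, tags=Tax 2025,Charitable Contribution, ' +
  'amount=$500.00, category=Personal, date=9/6/25, vendor=MBA Music Donation';
const sampleResult = parseKeywordString(sample);
// sampleResult.tags is a list of two tags; the other entries are plain strings.
```

Re-joining the non-tag values with a comma means an amount such as $1,500.00 survives the split, at the cost of dropping any blank that followed the comma in the original value.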

AppleScript is not compulsory. You can also use JavaScript, which might be more to your taste if you have a programming background.

I actually got a JavaScript script working today. Because I’ve used Paperless fairly consistently, with the same tags and the same conventions for the same types of files, I was able to make the script fit my needs. Basically, I assume that any keyword that is not a name=value pair is a tag. I split on newlines and commas and get everything I need. Thanks for the nudge to go for it. Now to start unlocking more scripting and automation in DT.

Perhaps you could post the code or relevant parts of it? It might help others who stumble upon this thread later.

Ok, I’ll attempt to do so here. A couple of comments:

I haven’t done any programming for 20 years so forgive my stab at this. Probably could be done more efficiently. Also, there are some unique parts to the code based on my use case and how I used Paperless. Hence some of the code looks for things I know to exist in my data.

Also, if keywords are not present in your files when you’re extracting them from the paperless bundle, you might want to check out Migration from Mariner Paperless to Devonthink - #43 by everkleer

And finally, as previously mentioned we are looking for name=value pairs and we assume they are on separate lines as in this post: Moving from Mariner Paperless to DevonThink - #54 by chrillek

Just put this script in DT, select the records whose Paperless metadata you want to transfer, and run it. There is no need to use the SQL database file with this approach, as long as you export the metadata to your PDF files from within Paperless as described in the first link.

(() => {
	const app = Application("DEVONthink");
	app.includeStandardAdditions = true;
	let records = app.selectedRecords();
	
	records.forEach (r => {
		let keywords = []; // an array, so .length and forEach work even when no keywords are found
		//Try to get KeyWords Metadata
		//Note:  If you had not selected the option in paperless to write metadata to PDF files you won't have any thing useful to grab.
		//See: https://discourse.devontechnologies.com/t/migration-from-mariner-paperless-to-devonthink/24496/43
		try {
			// Assume keyword name=value pairs are each on separate lines
			// See: https://discourse.devontechnologies.com/t/moving-from-mariner-paperless-to-devonthink/77817/54
			keywords = r.metaData().kMDItemKeywords.split('\n');
		} catch (e) {
			console.log(e);
		}
		const tags = [];
		let CE = false;
		if(keywords.length > 0) {
			keywords.forEach(k => {
			
			
			//Name value pairs separated by = sign.  
			//See: https://discourse.devontechnologies.com/t/moving-from-mariner-paperless-to-devonthink/77817/54
			const vals = k.split('=');
			if(vals.length > 1) {
				const label = vals[0].replace(/,\s*$/, "");
				const value = vals[1].replace(/,\s*$/, "");				
				
				
				if(label == 'tags') { 
					//Split up by commas if present.  Sometimes they are just a single tag=word but in some cases Paperless does funky things
					//and puts them tag=tag1,tag2...so we can attempt to get them here and push them into the tag list for this record.
					var tagList = value.split(',');
						tagList.forEach (t=> {
						console.log(t);
						tags.push(t); 
					})
				} 
				
				/* Unused
				
				else if(label == 'amount') {
				
					//This is a custom use case for the way I used Paperless.  I lived with amount to keep track of CE hours, but
					//Decided I would add a custom metadata in DT and actually call it hours.

					if(value == 'Continuing Education') {
						app.addCustomMetaData(value, { for: "hours", to: r });
					} else {
						app.addCustomMetaData(value, { for: label, to: r });
					}
				} */
				
				else if(label == 'category' || label == 'amount' || label == 'imported' || label =='date') { 
					//These are all straightforward.					
					app.addCustomMetaData(value, { for: label, to: r });
				} else if (label == 'subcategory') {

					//Custom use case for my situation and related to how I uniquely used Paperless
					//In this case we check to see if the subcategory is Continuing Education, 
					//and flag if present so we can change amount to hours later.

					if(value == 'Continuing Education') { 
						CE = true; 
					}
					app.addCustomMetaData(value, { for: label, to: r });

				} else if (label == 'vendor') {
					//Not needed because Title is already set according to vendor
				}
				
			} else {
				//assume single label without a value is a tag
				tags.push(vals[0].replace(/,\s*$/, ""));
			}
			
		})
		
		}
		//Now check if CE, if so, switch amount to hours.
		//Done after all keywords are processed, so 'amount' has been written regardless of keyword order.
		//Again...custom use case to how I used paperless.
		if(CE) {		
			const hours = app.getCustomMetaData({for: 'amount', from: r});
			app.addCustomMetaData(hours, { for: 'hours', to: r});
			app.addCustomMetaData('', { for: 'amount', to: r});
		}
		//Add the tags...
		r.tags = tags;	
		//Now Work on Title
		try {
			const title = r.metaData().kMDItemTitle;
			
			if(title != '' && title != null) { r.name = title;}
		
		
		//console.log('Title:' + title);
		} catch(e) {
			console.log(e);
		}
	})
})();

I should add that this should be run after you drag your PDFs into DT. Before you drag them in, you will want to do a little bit of housekeeping. Right-click the .paperless file in Finder and choose Show Package Contents; your documents are inside that bundle. I used a bash script to delete the deepest folders. If you do not delete the deepest folders, you’ll end up with duplicates.

Here is the script, but please be careful using it… run it on a copy of your data, as it will delete your files if you use it incorrectly.

#!/bin/bash

# Check if correct number of arguments are provided
if [ "$#" -ne 2 ]; then
    echo "Usage: $0 <start_directory> <depth_level>"
    exit 1
fi

START_DIR="$1"
DEPTH="$2"

# Ensure depth is a positive integer
if ! [[ "$DEPTH" =~ ^[0-9]+$ ]] || [ "$DEPTH" -eq 0 ]; then
    echo "Depth level must be a positive integer."
    exit 1
fi

echo "Deleting folders at depth $DEPTH within $START_DIR..."

# Find directories at the specified depth and delete them
# -maxdepth "$DEPTH" ensures we don't go deeper than the target level
# -mindepth "$DEPTH" ensures we only target directories at the exact depth
# -type d targets only directories
# -exec rm -rf {} + executes rm -rf on the found directories
find "$START_DIR" -maxdepth "$DEPTH" -mindepth "$DEPTH" -type d -exec rm -rf {} +

echo "Deletion complete."

Just some suggestions:

  • try {…} catch(e) { console.log(e)}
    I wouldn’t do that.
    • console output will only be visible if you run the script in Script Editor or via osascript. To make sure that you see the output, add app.logMessage().
    • You continue after the catch anyway. If I really get an error there, I’d just have it thrown and stop execution.
  • if (keywords.length > 0) is not necessary before the forEach: If keywords is empty, the forEach will simply do nothing.
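  • Calling replace to remove the trailing comma and blanks (why not leading ones?): if only whitespace had to go, this could be simplified to trimEnd (cheaper than a regular expression and easier to read, too). Note that trimEnd on its own leaves a trailing comma in place.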
  • if (label == 'tags') is ok here, but I’d use === as that doesn’t coerce its operands. In this case, it’s not a problem, but I prefer using the non-coercing operator in general to prevent unwanted side effects. Like 1 == '1' will be true, while 1 === '1' will not.
  • Instead of (label == 'category' || label == 'amount'… ) (see previous remark for ==), one could use ['category', 'amount', …].includes(label). Less to write and easier to read, perhaps.
  • if(title != '' && title != null) could be simplified to if (title)
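A few of these suggestions, sketched as plain helper functions. The names SIMPLE_LABELS, isSimpleLabel, cleanValue and pickName are made up for illustration and are not part of the script above or of DEVONthink’s scripting interface. The sketch keeps the regular expression for the trailing comma, since trim and trimEnd only remove whitespace.

```javascript
// Hypothetical helpers illustrating the suggestions; not part of the original script.
const SIMPLE_LABELS = ['category', 'amount', 'imported', 'date'];

// Replaces the chain of label == 'category' || label == 'amount' || ...
// includes() compares strictly, so no type coercion takes place.
function isSimpleLabel(label) {
  return SIMPLE_LABELS.includes(label);
}

// Strips one trailing comma (plus any whitespace after it), then trims
// leading and trailing whitespace. trimEnd() alone would keep the comma.
function cleanValue(text) {
  return text.replace(/,\s*$/, '').trim();
}

// if (title) covers '', null and undefined in a single truthiness check.
function pickName(title, fallback) {
  return title ? title : fallback;
}

// == coerces its operands, === does not.
const looseEqual = (1 == '1');   // true
const strictEqual = (1 === '1'); // false
```

In the script, cleanValue would stand in for the three replace calls, and pickName(title, r.name()) for the title check, assuming r.name() is how the record’s current name is read.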