PDF Metadata parse keywords script help

Ok, I’ll attempt to do so here. A couple of comments:

I haven’t done any programming for 20 years, so forgive my stab at this. It could probably be done more efficiently. Also, some parts of the code are specific to my use case and how I used Paperless, so some of it looks for things I know exist in my data.

Also, if keywords are not present in your files when you extract them from the Paperless bundle, you might want to check out Migration from Mariner Paperless to Devonthink - #43 by everkleer.

And finally, as previously mentioned we are looking for name=value pairs and we assume they are on separate lines as in this post: Moving from Mariner Paperless to DevonThink - #54 by chrillek

Just put this script in DT, select the files whose Paperless data you want to massage, and hit go. There is no need to use the SQL database file with this approach, as long as you export the metadata to your PDF files within Paperless as described in the first link.

(() => {
	const app = Application("DEVONthink");
	app.includeStandardAdditions = true;
	let records = app.selectedRecords();
	
	records.forEach (r => {
		let keywords = [];
		//Try to get Keywords metadata
		//Note: If you had not selected the option in Paperless to write metadata to PDF files, you won't have anything useful to grab.
		//See: https://discourse.devontechnologies.com/t/migration-from-mariner-paperless-to-devonthink/24496/43
		try {
			// Assume keyword name=value pairs are each on separate lines
			// See: https://discourse.devontechnologies.com/t/moving-from-mariner-paperless-to-devonthink/77817/54
			keywords = r.metaData().kMDItemKeywords.split('\n');
		} catch (e) {
			console.log(e);
		}
		const tags = [];
		let CE = false;
		if(keywords.length > 0) {
			keywords.forEach(k => {
			
			
			//Name value pairs separated by = sign.  
			//See: https://discourse.devontechnologies.com/t/moving-from-mariner-paperless-to-devonthink/77817/54
			const vals = k.split('=');
			if(vals.length > 1) {
				const label = vals[0].replace(/,\s*$/, "");
				const value = vals[1].replace(/,\s*$/, "");				
				
				
				if(label == 'tags') { 
					//Split on commas if present. Usually it is a single tags=word, but in some cases Paperless does funky things
					//and writes tags=tag1,tag2... so we split them here and push each one into the tag list for this record.
					//Trim whitespace and skip empty entries so "a, b" doesn't produce a tag with a leading space.
					value.split(',').forEach(t => {
						const tag = t.trim();
						if(tag != '') { tags.push(tag); }
					})
				} 
				
				/* Unused
				
				else if(label == 'amount') {
				
					//This is a custom use case for the way I used Paperless.  I lived with amount to keep track of CE hours, but
					//decided I would add custom metadata in DT and actually call it hours.

					if(value == 'Continuing Education') {
						app.addCustomMetaData(value, { for: "hours", to: r });
					} else {
						app.addCustomMetaData(value, { for: label, to: r });
					}
				} */
				
				else if(label == 'category' || label == 'amount' || label == 'imported' || label =='date') { 
					//These are all straightforward.					
					app.addCustomMetaData(value, { for: label, to: r });
				} else if (label == 'subcategory') {

					//Custom use case for my situation and related to how I uniquely used Paperless.
					//In this case we check to see if the subcategory is Continuing Education,
					//and flag it if present so we can change amount to hours later.

					if(value == 'Continuing Education') { 
						CE = true; 
					}
					app.addCustomMetaData(value, { for: label, to: r });

				} else if (label == 'vendor') {
					//Not needed because Title is already set according to vendor
				}
				
			} else {
				//assume single label without a value is a tag
				tags.push(vals[0].replace(/,\s*$/, ""));
			}
			
		})

		//Now check if CE; if so, switch amount to hours.
		//This runs after all pairs are processed, so it works whichever order subcategory and amount appear in.
		//Again... custom use case for how I used Paperless.
		if(CE) {
			const hours = app.getCustomMetaData({for: 'amount', from: r});
			app.addCustomMetaData(hours, { for: 'hours', to: r});
			app.addCustomMetaData('', { for: 'amount', to: r});
		}

		}
		//Add the tags...
		r.tags = tags;	
		//Now Work on Title
		try {
			const title = r.metaData().kMDItemTitle;
			
			if(title != '' && title != null) { r.name = title;}
		
		
		//console.log('Title:' + title);
		} catch(e) {
			console.log(e);
		}
	})
})();
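
If you want to check how your own keywords will be split before running anything against real records, here is a minimal sketch of the name=value handling in plain JavaScript (it runs in Node, no DEVONthink needed). parseKeywords and the sample string are made up for illustration and are not part of the script above; unlike the script, it keeps everything after the first = as the value and trims whitespace around tags.

```javascript
// Hypothetical helper, not part of the DEVONthink script: parses a keywords
// string of name=value pairs (one per line) into fields and tags.
function parseKeywords(raw) {
  const fields = {};
  const tags = [];
  if (typeof raw !== 'string') return { fields, tags };

  for (const line of raw.split('\n')) {
    // Drop a trailing comma and surrounding whitespace, skip blank lines.
    const entry = line.replace(/,\s*$/, '').trim();
    if (entry === '') continue;

    const eq = entry.indexOf('=');
    if (eq === -1) {
      // A bare word without "=" is treated as a tag.
      tags.push(entry);
      continue;
    }

    const label = entry.slice(0, eq).trim();
    // Everything after the first "=" is the value, so values may contain "=".
    const value = entry.slice(eq + 1).trim();

    if (label === 'tags') {
      // tags=tag1,tag2 becomes two separate tags; empty pieces are skipped.
      value.split(',')
        .map(t => t.trim())
        .filter(t => t !== '')
        .forEach(t => tags.push(t));
    } else {
      fields[label] = value;
    }
  }
  return { fields, tags };
}

// Sample input is invented for illustration only.
const sample = 'category=Receipts,\ntags=tax, 2023\nurgent\namount=12.50';
console.log(JSON.stringify(parseKeywords(sample), null, 2));
```

Pasting one of your own kMDItemKeywords strings in place of the sample should show quickly whether your export matches the one-pair-per-line assumption.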

I should add that the DEVONthink script should be run after you drag your PDFs into DT. Before you drag them in, you will want to do a little bit of housekeeping. Use Show Package Contents on the .paperless file in Finder to open the bundle; within it you’ll find your documents. I used a bash script to delete the deepest folders. If you do not delete the deepest folders, you’ll end up with duplicates.

Here is the script, but please be careful with it: run it on a copy of your data, as it will delete your files if you use it incorrectly.

#!/bin/bash

# Check if correct number of arguments are provided
if [ "$#" -ne 2 ]; then
    echo "Usage: $0 <start_directory> <depth_level>"
    exit 1
fi

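START_DIR="$1"
DEPTH="$2"

# Ensure the start directory exists before touching anything
if [ ! -d "$START_DIR" ]; then
    echo "Start directory not found: $START_DIR"
    exit 1
fi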

# Ensure depth is a positive integer
if ! [[ "$DEPTH" =~ ^[0-9]+$ ]] || [ "$DEPTH" -eq 0 ]; then
    echo "Depth level must be a positive integer."
    exit 1
fi

echo "Deleting folders at depth $DEPTH within $START_DIR..."

# Find directories at the specified depth and delete them
# -maxdepth "$DEPTH" ensures we don't go deeper than the target level
# -mindepth "$DEPTH" ensures we only target directories at the exact depth
# -type d targets only directories
# -exec rm -rf {} + executes rm -rf on the found directories
find "$START_DIR" -maxdepth "$DEPTH" -mindepth "$DEPTH" -type d -exec rm -rf {} +

echo "Deletion complete."