Automatically capture and annotate items (to use with Obsidian)

mdbraber · April 7, 2022, 10:19pm

The content of this workflow / scripts is too large to fit in one post - see these posts for the other parts/scripts:

Additional resources - Automatically capture and annotate items: Markdown Annotation Helper
Additional resources - Automatically capture and annotate items: DEVONthink helper, Smart rule scripts, JS/Markdown helper

I’ve read several threads on this forum about creating/extracting highlights, backlinking, working together with Obsidian etc. I’ve been developing a system to clip content, automatically capture PDFs (for long-term reference and search), automatically create/update annotations for captured resources, and link / use those annotations in sync with Obsidian.

What these scripts do

Automatically capture clipped bookmarks / URLs to PDF
Automatically create an annotation for all captured content (bookmarks and PDF)
Automatically link annotation files to captured content
Update information in DT when updating the annotation (.md) file (e.g. changing tags)
Update information in the .md file when updating the item in DT (e.g. changing tags)
Show links to original files and DT items based on Markdown metadata using JS (see last script)
Use Keyboard Maestro to open files in Finder (circumventing Obsidian file:// restrictions)
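
To make the two-way tag sync a bit more concrete: the main script merges the tags of the DT item with the tags found in the annotation file and removes duplicates. A minimal sketch of that idea in JavaScript, assuming a plain union that keeps first-seen order (`mergeTags` and `toFrontMatterTags` are hypothetical names, not part of the helper scripts):

```javascript
// Hypothetical helper: merge two tag lists, trimming whitespace and dropping
// empty entries and duplicates while keeping first-seen order.
function mergeTags(dtTags, mdTags) {
  const seen = new Set();
  const merged = [];
  for (const tag of [...dtTags, ...mdTags]) {
    const clean = String(tag).trim();
    if (clean !== "" && !seen.has(clean)) {
      seen.add(clean);
      merged.push(clean);
    }
  }
  return merged;
}

// Hypothetical helper: render a tag list in the [a,b,c] form used in the
// front matter of the annotation files.
function toFrontMatterTags(tags) {
  return "[" + tags.join(",") + "]";
}
```

Note that a plain union never removes a tag; deleting a tag on one side would need extra bookkeeping that this sketch does not attempt.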

Caveat emptor
This is a highly personal setup. I’m providing these scripts and workflows because it might help others (and I’ve been able to build this based on the very helpful posts and comments on this forum myself!). There might be some generic pieces which could be interesting e.g. on processing Markdown files in the Helper scripts below.

I’d be surprised if anyone gets this set up (or even wants to) in the same way I have, as my requirements are probably highly peculiar anyway. I’m probably not able to offer much support, so this is mostly for those who are quite familiar with scripting in DEVONthink. It’s all AppleScript, so I’m just waiting for @chrillek to write a JXA version of all of this.

Why such an elaborate workflow?
My reasoning is that the annotation file can hold all “outgoing” information (the original URL, title, capture date, DT links, tags), while I stay ‘independent’ of DT in case it goes away or is not available. It also avoids having to save all my captured (PDF) content directly to my Obsidian vault, while still getting all the context (URLs, highlights, notes etc.), which makes everything more lightweight for daily use. Linking to a ‘resource’ in Obsidian means linking to the .md file, which holds all the relevant context to proceed from or add information to.

Workflow
Once you’ve installed everything in this post (a lot!) you can clip something as a bookmark or a PDF and automatically create / update annotations or items. As an added bonus you can also capture content by clipping it via an imported Markdown file (e.g. via MarkDownload). Actually, using MarkDownload to clip content was how I originally started; currently I’m mostly clipping bookmarks or PDFs directly.

Annotation files
An annotation file looks like this (below) and is ‘linked’ via set annotation. I’m not using the standard Annotation group or naming which DT uses, but I’m putting all annotations in a single annotationsGroup e.g. /Notes/Content, which is a folder in my Obsidian vault (this vault is also indexed in my DT)

---
date: 2022-03-13 22:50
url: https://gist.github.com/itst/780dee5c510db6d1327c34c39166eb0f
itemurl: x-devonthink-item://D6C8E1D6-B386-44BC-98DB-6FA7E08F9BDF
annotationurl: x-devonthink-item://95CCC418-3AED-4016-A02E-C4FCC7A67B9B
path: Resources/fiddle/pkm/read-later/Import and regularly replicate your Pinboard bookmarks in DEVONthink.pdf
tags: [fiddle,pkm,read-later,devonthink]
---

Excerpt:: Import and regularly replicate your Pinboard bookmarks in DEVONthink. - Pinboard.scpt
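
For anyone who wants to read such an annotation file outside of DT, here is a minimal JavaScript sketch of a parser for the layout shown above. It only handles flat `key: value` lines, `[a,b,c]` lists and a single `Excerpt::` line; `parseAnnotation` is a hypothetical name and this is not how the Markdown Annotation helper does it (that one is AppleScript with RegexAndStuffLib).

```javascript
// Parse the YAML-like front matter and the "Excerpt::" inline field of an
// annotation file. Only flat "key: value" lines and [a,b] lists are handled.
function parseAnnotation(text) {
  const result = { meta: {}, excerpt: "" };
  const match = text.match(/^---\r?\n([\s\S]*?)\r?\n---\r?\n?/);
  let body = text;
  if (match) {
    body = text.slice(match[0].length);
    for (const line of match[1].split(/\r?\n/)) {
      // Split on the first colon only, so URLs in the value stay intact.
      const sep = line.indexOf(":");
      if (sep === -1) continue;
      const key = line.slice(0, sep).trim();
      const value = line.slice(sep + 1).trim();
      // Treat [a,b,c] as a list of strings; everything else stays a string.
      if (value.startsWith("[") && value.endsWith("]")) {
        result.meta[key] = value
          .slice(1, -1)
          .split(",")
          .map((s) => s.trim())
          .filter(Boolean);
      } else {
        result.meta[key] = value;
      }
    }
  }
  const excerpt = body.match(/^Excerpt::\s*(.*)$/m);
  if (excerpt) result.excerpt = excerpt[1].trim();
  return result;
}
```

Calling `parseAnnotation(text)` on the example above should give an object whose `meta.tags` is a list of four tags and whose `excerpt` holds the text after `Excerpt::`. A real YAML parser would be the safer choice for anything beyond this flat layout.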

To install these scripts:

Check the other posts and scripts: Additional resources - Automatically capture and annotate items: Markdown Annotation Helper and Additional resources - Automatically capture and annotate items: DEVONthink helper, Smart rule scripts, JS/Markdown helper
Create a “/Content” group in a database (e.g. Resources) to put all your content in. I’m using Group Tags instead of folders
All captured content is put in “/Content/00-captured”
Add a custom metadata item ‘originaltags’ (Single line text). This is needed to be able to keep the original tags (added when clipping content) when using Classify
Download and install RegexAndStuffLib v1.0.7 (https://s3.amazonaws.com/latenightsw.com/ShaneLibs/RegexAndStuffLib_stuff.zip) into ~/Library/Script Libraries/ - see RegexAndStuffLib Script Library - AppleScript - Late Night Software Ltd. for more info
Install Readability.js in /Users/mdbraber/Library/Application Scripts/com.devon-technologies.think3/Smart Rules
Install the scripts below in the Smart Rules directory (/Users/mdbraber/Library/Application Scripts/com.devon-technologies.think3/Smart Rules)
Set up the Smart Rules and inline Applescripts - I’m using a Smart Rule on the Inbox and put all captured content in “/Content/00-captured” for further processing (mostly tagging)

Bugs / TODO

This whole thing is probably hard to figure out anyway, so I’d be surprised if anyone gets this set up, but maybe there are bits and pieces which are useful for someone
Comments from clipped content (e.g. a bookmark) are considered an Excerpt in the .md file. I still need to add some regex to be able to also add comments and an excerpt inside the Comments field
I’ve got some code to automatically extract highlights from PDFs and do the reverse: use text from an annotation file as highlight in PDFs. It’s mostly barebones for now, I might share this at a later stage.

Applescript: Process incoming annotation

use DT : script "DEVONthink helper"
use ma : script "Markdown Annotation helper"
use script "RegexAndStuffLib" version "1.0.7"
use scripting additions

on run
	tell application id "DNtp" to my performSmartRule(selection as list)
end run

-- Run as smart rule
on performSmartRule(theRecords)
	tell application id "DNtp"
		repeat with theRecord in theRecords
			repeat 1 times -- fake loop to create a simulated continue
				
				-- initialize variables
				set captureRecord to missing value
				set maRecord to missing value
				set pdfRecord to missing value
				set theRecordType to (type of theRecord as string)
				set maText to ""
				
				set theDatabase to database of theRecord
				
				-- check if group for processed pdf exists
				set processedGroup to get record at "/Content/00-captured" in theDatabase
				if processedGroup is missing value then
					error "No processed group \"/Content/00-captured\" found in current database - create the group first"
				end if
				
				-- check if group for annotations exists
				set theAnnotationsGroup to "/Notes/Content"
				set annotationsGroup to get record at (theAnnotationsGroup) in theDatabase
				if annotationsGroup is missing value then
					error "No annotations group (" & theAnnotationsGroup & ") found in current database - create the group first"
				end if
				
				if theRecordType is in {"markdown", "«constant ****mkdn»"} then
					-- process markdown record
					set maRecord to theRecord
					set maText to plain text of maRecord
					
					set maTitle to name without extension of maRecord
					set maTitle to DT's sanitize(maTitle)
					set maURL to ma's getURL(maText)
					set maDate to ma's getDate(maText)
					set maTags to ma's getTags(maText)
					set maExcerpt to ma's getExcerpt(maText)
					
					-- Fix the URL of ma file which has base64 content because of MarkDownload
					set URL of maRecord to missing value
					
					if maURL is not equal to "" then
						-- Create a temporary record to capture
						set captureRecord to create record with {URL:maURL, type:bookmark} in current group
					else
						log message "No URL found - skipping"
						exit repeat
					end if
					
				else if theRecordType is in {"bookmark", "«constant ****DTnx»"} then
					-- process bookmark record
					
					set maTitle to name without extension of theRecord
					set maTitle to DT's sanitize(maTitle)
					set maURL to URL of theRecord
					set maCreationDate to creation date of theRecord
					set maDate to DT's formatDate(maCreationDate) as string
					set maTags to {}
					
					if comment of theRecord is not equal to "" then
						set maExcerpt to comment of theRecord
					else
						set maExcerpt to ""
					end if
					
					-- Set the bookmark as the record to capture (will be deleted after capture)
					set captureRecord to theRecord
					
				else if theRecordType is in {"pdf", "PDF document", "«constant ****pdf »"} then
					-- Clean up PDF title (we can't be sure DT already sanitized the filename,
					-- e.g. from old imports before sanitizing filenames was added)
					set maTitle to name without extension of theRecord
					set maTitle to DT's sanitize(maTitle)
					-- Title of theRecord is always leading, so will overwrite whatever is in the maFile
					--set name of theRecord to maTitle & ".pdf"
					-- If we don't include ".pdf" it goes wrong when the title ends with another valid extensions
					set name of theRecord to maTitle & ".pdf"
					
					set maCreationDate to creation date of theRecord
					set maDate to DT's formatDate(maCreationDate) as string
					
					if (exists annotation of theRecord) then
						set currentAnnotationType to type of (annotation of theRecord) as string
						if currentAnnotationType is in {"markdown", "«constant ****mkdn»"} then
							set maRecord to annotation of theRecord
							set maText to plain text of maRecord
							
							-- Get URL from theRecord or otherwise from annotation
							if URL of theRecord is not "" then
								set maURL to URL of theRecord
							else
								set maURL to ma's getURL(maText)
							end if
							
							-- Get tags from annotation
							set maTags to ma's getTags(maText)
							set maExcerpt to ma's getExcerpt(maText)
							
							if maExcerpt is equal to missing value and comment of theRecord is not equal to "" then
								set maExcerpt to comment of theRecord
							end if
							
						else
							error "Annotation of selected Item is not of type markdown - cancelling"
						end if
					else
						set maURL to URL of theRecord
						set maTags to {}
						
						if comment of theRecord is not equal to "" then
							set maExcerpt to comment of theRecord
						else
							set maExcerpt to ""
						end if
					end if
					
					set pdfRecord to theRecord
				else
					error "Cannot process this type of record"
				end if
				
				-- capture pdf if necessary
				if captureRecord is not missing value and maURL is not "" then
					set captureWindow to open window for record captureRecord with force
					delay 2
					set bounds of captureWindow to {0, 0, 900, 900}
					
					-- If it's already a PDF, don't need to do more.
					if (maURL ends with ".pdf") is not true then
						-- Wait until it's finished loading.
						repeat while loading of captureWindow
							delay 0.5
						end repeat
						
						-- Some pages load content dynamically, with elements not
						-- displayed until they come into view. This is a hopeless
						-- situation in general but the following heuristic improves
						-- outcomes for some cases. We scroll the window by quarters
						-- to try to trigger loading of more page elements.
						repeat with n from 1 to 4
							set scroll to "window.scrollTo(0," & n & "*document.body.scrollHeight/4)"
							do JavaScript scroll in current tab of captureWindow
							delay 0.75
						end repeat
						
						-- Return to the top. Do it twice because sometimes on some
						-- pages (notably Twitter), the first attempt gets stuck in
						-- some random location.	 (Ugh, what a hack this is.)
						do JavaScript "window.scrollTo(0,0)" in current tab of captureWindow
						delay 0.5
						do JavaScript "window.scrollTo(0,0)" in current tab of captureWindow
						delay 0.25
					end if
					
					-- Get the content of this current viewer window, in PDF form.
					set contentAsPDF to get PDF of captureWindow
					
					-- Create the new PDF record in the processed group
					set pdfRecord to create record with {name:maTitle, URL:maURL, type:PDF document} in processedGroup
					set data of pdfRecord to contentAsPDF
					
					-- Match dates of pdfRecord to theRecord
					set recordCreationDate to creation date of theRecord
					set recordModificationDate to modification date of theRecord
					set creation date of pdfRecord to recordCreationDate
					set modification date of pdfRecord to recordModificationDate
					
					-- tell application "Finder" to set theCurrentDirectory to container of (path to me) as alias
					-- FIXME
					set theCurrentDirectory to "Macintosh HD:Users:mdbraber:Library:Application Scripts:com.devon-technologies.think3:Smart Rules:"
					set readabilityScriptFile to ((theCurrentDirectory & "Readability.js") as text) as alias
					set readabilityScript to read readabilityScriptFile
					
					-- Get an excerpt of the page or use the comment of the current record
					if (exists comment of captureRecord) is not true then
						do JavaScript readabilityScript in captureWindow
						set theExcerpt to do JavaScript "var article = new Readability(document).parse(); article.excerpt;" in captureWindow
					else
						set theExcerpt to comment of captureRecord
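					end if
					
					-- theExcerpt was otherwise unused: fall back to it when no excerpt was found earlier
					if (maExcerpt is "" or maExcerpt is missing value) and theExcerpt is not missing value then set maExcerpt to theExcerpt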
					
					close captureWindow
				end if
				
				set theLocation to location of theRecord
				if theLocation does not start with "/Content" then
					move record pdfRecord to processedGroup
				end if
				
				-- Update comments
				if maExcerpt is not equal to "" then set comment of pdfRecord to maExcerpt
				
				-- Update annotation
				if maRecord is missing value then set maRecord to create record with {name:maTitle, type:markdown} in annotationsGroup
				set maItemURL to (reference URL of pdfRecord as string)
				set maAnnotationURL to (reference URL of maRecord as string)
				set maTags to DT's uniqueList((tags of pdfRecord) & maTags)
				set maPath to path of pdfRecord
				
				set maText to ma's updateText("", maDate, maTitle, maExcerpt, maURL, maItemURL, maAnnotationURL, maPath, maTags, true)
				
				set plain text of maRecord to maText
				set creation date of maRecord to (creation date of pdfRecord)
				--set modification date of maRecord to (modification date of pdfRecord)
				move record maRecord to annotationsGroup
				
				-- Update pdfRecord annotation and tags
				set annotation of pdfRecord to maRecord
				set the tags of pdfRecord to maTags
				
				set originalTags to join strings maTags using delimiter ","
				-- Store the joined string, since "originaltags" is a single line text field
				add custom meta data originalTags for "originaltags" to theRecord
				
				try
					delete record captureRecord
				end try
				
			end repeat
		end repeat
	end tell
end performSmartRule
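
The scroll-by-quarters heuristic in the capture step may be easier to follow in isolation. A minimal JavaScript sketch that only computes the scroll targets and the snippets that would be passed to `do JavaScript` (`scrollTargets` and `scrollCommands` are hypothetical names; the real script builds the strings inline and lets the page evaluate `document.body.scrollHeight` itself):

```javascript
// Compute the scroll targets of a "scroll in steps" heuristic:
// step n of `steps` scrolls to n/steps of the page height.
function scrollTargets(pageHeight, steps = 4) {
  const targets = [];
  for (let n = 1; n <= steps; n++) {
    targets.push(Math.round((n * pageHeight) / steps));
  }
  return targets;
}

// Build the JavaScript snippets that would be handed to the page,
// ending with a return to the top.
function scrollCommands(pageHeight, steps = 4) {
  const commands = scrollTargets(pageHeight, steps).map(
    (y) => `window.scrollTo(0, ${y})`
  );
  commands.push("window.scrollTo(0, 0)");
  return commands;
}
```

Raising `steps` gives lazy-loading pages more chances to fetch content, at the cost of a longer capture, since the script pauses after every scroll.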

chrillek · April 8, 2022, 6:45am

Don’t hold your breath. I don’t have time right now and it’s far too much code. But I might give one script a try, for purely educational purposes.

BTW: in your KM JavaScript code, you might want to use a string template at the end to avoid all the escaping inside the string.
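
A minimal sketch of the difference, using made-up values rather than the actual KM script (the variable names and the HTML string are assumptions for illustration only):

```javascript
// Hypothetical values, only for illustration.
const title = "Import and regularly replicate your Pinboard bookmarks in DEVONthink";
const itemURL = "x-devonthink-item://D6C8E1D6-B386-44BC-98DB-6FA7E08F9BDF";

// With concatenation, every double quote inside the string needs escaping.
const concatenated =
  "<a href=\"" + itemURL + "\" title=\"" + title + "\">Open in DEVONthink</a>";

// With a template literal (backticks), quotes can be written as-is and
// values are interpolated with ${...}.
const templated = `<a href="${itemURL}" title="${title}">Open in DEVONthink</a>`;
```

Both variables end up holding the same string; the template literal is just easier to read and to edit.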

mdbraber · April 8, 2022, 7:26am

I should have included a bigger wink there. A script like this just shows that AppleScript isn’t really suitable for many text-based operations (thanks to RegexAndStuffLib for making it somewhat less hard). JS would be much better; unfortunately it would mean another learning curve and much trial and error on my side.

Can you give a simple example of what you mean?

mlevison · May 9, 2022, 8:51pm

Wow, wow, wow. This is effectively what I was asking for here: Export PDF Highlights and Annotations to Obsidian?

I naively hope that the DT team eventually builds something like this into the app. In the meantime, perhaps we could do something to reduce the maintenance burden? What would happen if we put all of the AppleScript on GitHub and then made it accessible under a license like Creative Commons?

mdbraber · May 10, 2022, 6:49am

Glad to see it could be useful to you.

Don’t expect that they will - and in my view: for the better. DEVONtechnologies has focused on building an extensible app by investing in e.g. broad use of AppleScript, Smart Rules and other tools which allows for people to build highly specific and custom workflows, without having to rely on their (or anyone’s) decision or approach. Your best investment here would be to use (some of) these scripts to build your own personal workflow.

With regard to maintainability, you could use any (part) of these scripts under CC - I’ve shared them here exactly for the reason of being useful to others. If I had more time I would put them in a GitHub repository, but don’t count on that anytime soon. Also, as my workflow is still evolving, it would mean quite some updating. I might look into it after using it for a longer period of time and being happy with it. In the meantime: feel free to try it yourself and suggest edits / improvements, that’s what makes DT (and this forum) fun to use

nomasprime · May 15, 2022, 2:29pm

I’m fairly new to DEVONthink and AppleScript…

Does this refer to the single script under ‘Applescript: Process incoming information’?

Does this mean create a Smart Rule on the Inbox that executes the previously mentioned script? When setting the rule, under ‘Perform the following actions’, what would you recommend?

mdbraber · May 15, 2022, 3:31pm

Yes indeed. Good to know: scripts using Helpers and frameworks sometimes don’t function (well) as inline scripts. Also external scripts are easier to edit externally. In case sometimes something doesn’t seem to work, a restart of DT would be a first thing to check.

Yes indeed. This is my rule:

system · May 14, 2025, 3:31pm

This topic was automatically closed 1095 days after the last reply. New replies are no longer allowed.