This recent post on ‘deep linking’ reminded me of an issue that I’ve been meaning to raise for a while.
I have the following use case for DEVONthink:
- Take any PDF labelled ‘Finished reading’.
- Extract all annotations of kind ‘Underline’.
- Create a markdown file containing that text.
- Link the URL of that new record to the precise location in the PDF, using the `search=` parameter (e.g. `x-devonthink-item://C2FE6B87-1208-4D14-8FF8-6DF6B1A9188A?page=10&search=this%20text%20here%`)
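
The link in the last step can be assembled mechanically. Here is a minimal sketch in Python, assuming the item UUID, page number, and annotation text are already in hand. `build_deep_link` is a hypothetical helper name; only the `page=`/`search=` URL shape is taken from the example above.

```python
from urllib.parse import quote


def build_deep_link(uuid: str, page: int, text: str) -> str:
    """Return an x-devonthink-item:// URL that searches for `text` on `page`.

    The page number is passed through unchanged; this sketch does not
    check which page-numbering convention the receiving app expects.
    """
    # safe="" percent-encodes every reserved character, including "/"
    # and "&", which would otherwise be read as part of the URL syntax.
    encoded = quote(text, safe="")
    return f"x-devonthink-item://{uuid}?page={page}&search={encoded}"
```

Calling `build_deep_link("C2FE6B87-1208-4D14-8FF8-6DF6B1A9188A", 10, "this text here")` gives a link of the same shape as the example, with spaces encoded as `%20`.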
Through this, I am able to easily (and semi-automatically) create a library of facts and ideas that can be tagged, replicated, commented on, and so on. I have been able to do this by using the `summarize highlights of` command. By selecting `sheet` as output, one gets a list of all the annotations for the PDF, the kind of each annotation, the text of each annotation (for highlight/underline, etc.), and any comments added, as well as the page number. So, I have a script that extracts this information, uses it, and then deletes the sheet.
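
To illustrate the extraction step, here is a rough Python sketch rather than my actual script. It assumes the sheet's contents are available as tab-separated text with a header row. The column names `Page`, `Kind`, `Text`, and `Comment` are hypothetical placeholders, so check them against what the sheet really contains.

```python
import csv
import io


def underlines_from_sheet(tsv_text: str) -> list[dict]:
    """Parse tab-separated sheet text and keep only the 'Underline' rows.

    Column names are assumptions for illustration, not the sheet's
    documented layout.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Each row is a dict keyed by the header line of the sheet.
    return [row for row in reader if row.get("Kind") == "Underline"]
```

Each returned row then carries the page number and text needed to build the link.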
This works great except in one circumstance: when an annotation spans a paragraph break.
The text for this annotation thus contains a double line break. However, `summarize highlights of` as `sheet` strips line breaks from the text, like so:
“Changes relative to 1900 are calculated by adding 0.158 m (observed global mean sea level rise from 1900 to 1995–2014) to simulated and observed changes relative to 1995–2014. Panel (e) Global mean sea level change at 2300 in metres relative to 1900.”
This means that the `search=` link won’t work. The search function, activated by the URL, is smart enough to ignore where single line breaks have been removed, but not where double line breaks are removed. It won’t recognise it as the same text. A small flaw (the link will still find the right page, of course), but an annoying one nevertheless.
In any case, this brings me to my point: ‘Deep linking’ is more or less possible for underline/highlight-type annotations. One way of achieving this in an automated fashion is using the above method. However, my solution, creating the summary sheet and then deleting it, is a bit of a hack, and breaks down in the circumstance just described.
What would solve this problem, for me at least, would be if one could generate an annotation summary not as a sheet, nor as a markdown or RTF document (which are the other two options), but instead as an AppleScript record, unprocessed. I.e. it would look something like this: `{document:"IPCC_AR6", page:23, type:"Underline", text:"Changes relative to 1900 are calculated [and so on]"}`.
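
To show why the unprocessed text would help, here is a Python sketch of what a script could do if the double line break survived. It is hypothetical throughout: no such record output exists today, `Annotation` and `search_term` are made-up names, and it is only my assumption that searching for the first paragraph alone would give a working link.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    document: str
    page: int
    kind: str  # "type" in the record above; renamed to avoid the Python builtin
    text: str  # raw annotation text, paragraph breaks intact


def search_term(annotation: Annotation) -> str:
    """Return the first paragraph of the annotation as a one-line search string."""
    # A paragraph break is a blank line: two line breaks, possibly with
    # stray whitespace in between.
    first_paragraph = re.split(r"\n\s*\n", annotation.text.strip(), maxsplit=1)[0]
    # Collapse the remaining single line breaks and repeated spaces.
    return " ".join(first_paragraph.split())
```

With the raw text, the script gets to decide how to treat the break. With the sheet output, that information is already gone.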
As a bonus, it would be particularly useful if this output could also provide other metadata that are programmatically accessible but not included in the existing `summarize highlights of` reports. I’m thinking particularly of the timestamp of the annotation, which may serve as a pseudo-unique identifier for it. This would be useful, for example, for being able to run my script on the same PDF multiple times, ignoring previously extracted annotations.
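
The repeat-run idea could look roughly like this in Python. It is a sketch only: it assumes each annotation arrives as a dict with `document` and `timestamp` keys (hypothetical names, since no such output exists yet) and that two annotations in the same document never share a timestamp.

```python
def new_annotations(annotations: list[dict], seen: set) -> list[dict]:
    """Return annotations not processed before, recording them in `seen`.

    `seen` holds (document, timestamp) pairs and is updated in place.
    """
    fresh = []
    for ann in annotations:
        # The timestamp is only pseudo-unique, so pair it with the document.
        key = (ann["document"], ann["timestamp"])
        if key not in seen:
            seen.add(key)
            fresh.append(ann)
    return fresh
```

For this to work across runs, `seen` would have to be saved somewhere between them, for example in a small file next to the script.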
I’ve looked for some open source tools that could substitute for `summarize highlights of`, but nothing quite fits. pdfannots comes close and could be modified for the purpose, but it also cleans up the text, so it runs into the same problem as above.