Limitations of `summarize highlights of` command, deep linking

P12 · January 11, 2023, 3:00pm

This recent post on ‘deep linking’ reminded me of an issue that I’ve been meaning to raise for a while.

I have the following use case for DEVONthink:

Take any pdf labelled ‘Finished reading.’
Extract all annotations of kind ‘Underline.’
Create a markdown file containing that text.
Set the URL of that new record to a link pointing at the precise location in the PDF, using the search= parameter (e.g. x-devonthink-item://C2FE6B87-1208-4D14-8FF8-6DF6B1A9188A?page=10&search=this%20text%20here)
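As a rough illustration of the last step, the link can be assembled from the record's UUID, the page number, and the annotation text. The helper name `buildDeepLink` is made up for this sketch. The page value is passed through unchanged, on the assumption that it is already in the form DEVONthink expects.

```javascript
// Hypothetical helper (not part of DEVONthink): builds an item link that
// carries the page= and search= parameters described in the list above.
function buildDeepLink(uuid, page, text) {
  // encodeURIComponent turns spaces into %20 and line breaks into %0A
  return `x-devonthink-item://${uuid}?page=${page}&search=${encodeURIComponent(text)}`;
}

const exampleLink = buildDeepLink("C2FE6B87-1208-4D14-8FF8-6DF6B1A9188A", 10, "this text here");
console.log(exampleLink);
```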

Through this, I am able to easily (and semi-automatically) create a library of facts and ideas that can be tagged, replicated, commented, and so on. I have been able to do this by using the summarize highlights of command. By selecting sheet as output, one gets a list of all the annotations for the PDF, the kind of each annotation, the text of each annotation (for highlight/underline, etc.), and any comments added, as well as the page number. So, I have a script that extracts this information, uses it, and then deletes the sheet.

This works great except in one circumstance: when an annotation spans a paragraph break.

The text for this annotation thus contains a double line break. However, summarize highlights of as sheet strips line breaks from the text, like so:

“Changes relative to 1900 are calculated by adding 0.158 m (observed global mean sea level rise from 1900 to 1995–2014) to simulated and observed changes relative to 1995–2014. Panel (e) Global mean sea level change at 2300 in metres relative to 1900.”

This means that the search= link won’t work. The search function, activated by the URL, is smart enough to ignore where single line breaks have been removed, but not where double line breaks are removed. It won’t recognise it as the same text. A small flaw (the link will still find the right page, of course), but an annoying one nevertheless.

In any case, this brings me to my point: ‘Deep linking’ is more or less possible for underline/highlight-type annotations. One way of achieving this in an automated fashion is using the above method. However, my solution, creating the summary sheet and then deleting it, is a bit of a hack, and breaks down in the circumstance just described.

What would solve this problem, for me at least, would be if one could generate an annotation summary not as a sheet, nor as a markdown or RTF document (which are the other two options), but instead as an AppleScript record, unprocessed. I.e. it would look something like this: {document:"IPCC_AR6", page:23, type:"Underline", text:"Changes relative to 1900 are calculated [and so on]"}.

As a bonus, it would be particularly useful if this output could also provide other metadata that are programmatically accessible but not included in the existing summarize highlights of reports. I’m thinking particularly of the timestamp of the annotation, which may serve as a pseudo-unique identifier for it. This would be useful, for example, for being able to run my script on the same PDF multiple times, ignoring previously extracted annotations.
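To make the idea concrete, here is a minimal sketch of how such records could be filtered on repeated runs. Everything in it is hypothetical: the record shape follows the example above, the `timestamp` field is the assumed extra metadata, and the names `annotationKey` and `selectNew` are invented for illustration. None of this is an existing DEVONthink API.

```javascript
// Hypothetical: combine document, page and the assumed `timestamp` field
// into a pseudo-unique key for one annotation.
function annotationKey(a) {
  return `${a.document}|${a.page}|${a.timestamp}`;
}

// Returns only the annotations whose key is not yet in `seenKeys`,
// and adds their keys to the set so that a later run skips them.
function selectNew(annotations, seenKeys) {
  const fresh = annotations.filter((a) => !seenKeys.has(annotationKey(a)));
  fresh.forEach((a) => seenKeys.add(annotationKey(a)));
  return fresh;
}

const seenKeys = new Set();
const sampleAnnotations = [
  { document: "IPCC_AR6", page: 23, type: "Underline", text: "Changes relative to 1900", timestamp: "2023-01-11T15:00:00Z" },
];
console.log(selectNew(sampleAnnotations, seenKeys).length); // first run: the annotation is new
console.log(selectNew(sampleAnnotations, seenKeys).length); // second run: already seen
```

In practice the set of seen keys would have to be stored somewhere between runs, for instance in the generated markdown record.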

I’ve looked for some open source tools that could substitute for summarize highlights of, but nothing quite fits. pdfannots comes close and could be modified for the purpose, but it also cleans up the text, so runs into the same problem as above.

cgrunenberg · January 16, 2023, 8:16am

Any chance that you could post the code of the used script? Then we could check whether it’s an issue of the script or needs some improvements on our side.

P12 · January 17, 2023, 2:46pm

Sure, no problem. This is a simplified version, but it demonstrates the issue:

use AppleScript version "2.8"
use scripting additions
use framework "Foundation"

tell application id "DNtp"
	
	------------------------------------------------
	-- Get selected PDF record
	------------------------------------------------
	set theSelection to selection
	set theRecord to item 1 of theSelection
	if type of theRecord is not PDF document then error "Not a PDF"
	------------------------------------------------
	-- Summarise
	------------------------------------------------
	set theSheet to summarize highlights of records {theRecord} to sheet
	------------------------------------------------
	-- Split all "Underline" to variables and encode URL
	------------------------------------------------
	set theCells to cells of theSheet
	set theLinks to ""
	repeat with eachLine in theCells
		if item 3 of eachLine is "Underline" then
			set theURL to item 6 of eachLine
			set encodedText to my encodeForURL(item 4 of eachLine)
			set theLinks to theLinks & linefeed & linefeed & theURL & "&search=" & encodedText
		end if
	end repeat
	------------------------------------------------
	-- Clean up
	------------------------------------------------
	delete record theSheet
	return texts 3 through -1 of theLinks
	
end tell


on encodeForURL(theText)
	
	set nsInput to current application's NSString's stringWithString:theText
	set characterSet to current application's NSCharacterSet's URLQueryAllowedCharacterSet()
	return (nsInput's stringByAddingPercentEncodingWithAllowedCharacters:characterSet) as text
	
end encodeForURL

I just tried it on the following PDF page, with two annotations:

The script successfully produces two URLs:

x-devonthink-item://AEA5FB1B-25F9-4117-91F8-29A5017859E6?page=9&search=As%20political%20speech%20and%20culture%20over%20the%20past%20half-dozen

x-devonthink-item://AEA5FB1B-25F9-4117-91F8-29A5017859E6?page=9&search=as%20many%20predicted%20and%20some%20even%20hoped%20for.%20When%20Stephen%20Colbert%20spoke

The first URL opens the PDF at the correct page and jumps to/highlights the text. The second URL opens the page but does not find the text. This is because of the paragraph break.

Revisiting this, I realise that I slightly misdescribed the problem in the original post above when I said that it is a double line break that is the issue. That was incorrect. The problem is that summarize highlights of turns all line breaks into spaces—thus, encoded, there is no %0A (a line break), only %20 (a space).

If I take the second annotation/URL and replace the %20 after the first paragraph with a %0A, like this:

x-devonthink-item://AEA5FB1B-25F9-4117-91F8-29A5017859E6?page=9&search=as%20many%20predicted%20and%20some%20even%20hoped%20for.%0AWhen%20Stephen%20Colbert%20spoke

Then the URL works as expected, jumping to and highlighting the appropriate text.
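The difference between the two URLs can be reproduced with a plain percent-encoding call. This small JavaScript sketch uses a fragment of the second annotation. The version with the line break is an assumption about what the PDF's text layer contains at that point.

```javascript
// Same fragment of the annotation, once with the paragraph break flattened
// to a space (as it arrives from the summary sheet) and once with the break kept.
const flattened = "hoped for. When Stephen";
const preserved = "hoped for.\nWhen Stephen";

const flattenedParam = encodeURIComponent(flattened); // space becomes %20
const preservedParam = encodeURIComponent(preserved); // line feed becomes %0A

console.log(flattenedParam);
console.log(preservedParam);
```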

So, the real question is whether it is possible to get the text of an annotation with line breaks preserved. If so, then the above script could provide a pretty useful ‘deep linking’ solution for PDF annotations.

More broadly, it would be nice to have more direct access to PDF annotation metadata via DEVONthink, but that’s a more general feature request.

chrillek · January 17, 2023, 5:02pm

Disclaimer: I have no idea what DT does under the hood to retrieve the annotations (i.e. what summarize highlights does to get the highlights).

But after having played around with Apple’s PDFKit framework a bit, I found out:

You can get all annotations on a page with the annotations method.
In the case of “markup”, i.e. underline, strike through and highlighting, the framework does not deliver the text itself
Instead, you get a bunch of rectangles bounding the words making up the marked text. So, in your example, you’d probably get three rectangles. The software would have to find all words contained in them – but how could it possibly tell that the end of “predicted and” is different from the end of “hoped for.”?

If it were simply looking at the text itself, it could perhaps do that. But I suppose it’s not, instead trying to find the text runs inside the rectangles. I have no idea if that makes sense.

What you could try to do is figuring out the line breaks yourself:

Take the search parameter.
Remove everything between its first and its last word (so that only “As”/“half-dozen” and “as”/“spoke” remain in your two examples).
Search for this string (with a wildcard) in the PDF’s text.
In the result of this search, replace spaces with “%20” and newlines with “%0A”.

In JavaScript:

const searchParam = DTURL.replace(/^.*search=/, ""); // get the search string, i.e. the annotated text
// should save the rest of the URL, too, so it can be rebuilt later

/* Find the first and the last word of the annotation text using a regular expression
  with two capturing groups for the first/last word */
const lastFirstWords = searchParam.match(/^([^%]+).*%20(.*)$/);
if (lastFirstWords) {
  /* Build a regular expression consisting of the first word, anything at all and the last word.
     [\s\S]*? is used instead of .* because . does not match line breaks; it is non-greedy so the
     match stops at the first occurrence of the last word */
  const newSearch = new RegExp(`${lastFirstWords[1]}[\\s\\S]*?${lastFirstWords[2]}`);
  /* Find the text matching this regular expression */
  const match = record.plainText().match(newSearch);
  if (match) {
    /* If "firstword <anything> lastword" is found, encode the match properly */
    const newSearchParam = encodeURIComponent(match[0]);
    // rebuild DT URL and write it back to the table.
  }
}

(I’m sure it can be done in AppleScript with the appropriate amount of text item delimiters juggling, too.) This code is just a sample; I only tested the regular expression for first/last word.

This approach is not failsafe, though: If you happen to have more than one annotation beginning with “word A” and ending with “word B”, it would only fix the first reference.

cgrunenberg · January 18, 2023, 2:37pm

The next release will support deep linking by introducing new item link parameters. Tools > Summarize Highlights will also support this and therefore your workaround probably won’t be necessary anymore.

P12 · January 20, 2023, 2:08pm

@cgrunenberg
Sounds good! It never ceases to amaze me how responsive you can be to requests made on here.

@chrillek
Thanks for taking a look at this. I thought of trying the method you suggest, with speculatively inserting line breaks at sentence ends until something is found, but I hadn’t gotten around to testing it. I suspect it wouldn’t be performant for some cases. However, given the apparently imminent deep linking feature, I can probably just use pdfannots to grab the text, since it outputs to JSON, and bypass the creating/deleting sheet stage. I believe that pdfannots works (as you mention) by inferring annotation text from the location on the page. (Adobe have given us some crazy formats over the years…)

P.S. I’m just starting to teach myself JXA (after picking up AppleScript over the past 18 months or so), and I will be referring back to your site (and various posts) quite regularly!

chrillek · January 21, 2023, 10:44am

You could simply ignore the question of where the linefeeds are.

take the search string and replace all occurrences of %20 (the URL-encoded space character) by \s+: one or more white space symbols, including line feeds, tabs etc.
use the result for a regular expression search in the text of the document.
the match will be the original string with all spaces, line breaks etc at the correct position.
URL-encode that to get a new search parameter for your link

That’s feasible even in AppleScript, but requires a trip to the ObjC Foundation framework. In JavaScript, it’s straightforward.
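The four steps above might look roughly like this in JavaScript. It is a sketch under assumptions: `fixSearchParam` and `escapeRegExp` are names invented here, and `plainText` stands for the document's text however it is obtained. Instead of replacing %20 directly, the sketch decodes the parameter first and splits it on whitespace, so that characters with a special meaning in regular expressions (such as the full stop) can be escaped.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: takes the URL-encoded search parameter and the
// document's plain text, returns a re-encoded parameter with the original
// white space (line breaks included), or null if the text is not found.
function fixSearchParam(searchParam, plainText) {
  const words = decodeURIComponent(searchParam).trim().split(/\s+/);
  // Join the words with \s+ so any run of spaces, tabs or line feeds matches
  const pattern = new RegExp(words.map(escapeRegExp).join("\\s+"));
  const match = plainText.match(pattern);
  return match ? encodeURIComponent(match[0]) : null;
}

const pageText = "as many predicted and some even hoped for.\nWhen Stephen Colbert spoke";
console.log(fixSearchParam("hoped%20for.%20When%20Stephen", pageText));
```

As with the first/last-word approach, only the first match in the text is returned, so a phrase that occurs more than once in the document would need the page number to disambiguate.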

P12 · January 23, 2023, 3:38pm

I hadn’t thought about it like that, you are right! Thanks again.