Use Metatags to tag New York Times using Javascript parsing

Hello all,

Those of you who (like I do) love Devonthink and DevonAgent for their scripting abilities will probably find this of interest. This is not really a finished script (there’s no error checking), but more of a proof of concept. It shows how powerful the (new, improved) Javascript layer of Devonthink can be in conjunction with Applescript.

The following little script defines a small Javascript function that extracts the meta tags from a New York Times article and sets the same tags in Devonthink. Those of you with experience with Applescript’s anemic string handling should rejoice at this: Javascript is pretty decent at parsing strings with its regex. I was excited to discover that this is possible in Devonthink/DevonAgent, and I will create a few posts to show how powerful Javascript can be for extracting stuff from the underlying HTML.
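
To give a feel for the string handling meant here, this is a minimal sketch of splitting a semicolon-delimited tag string with a regex in plain Javascript. The function name splitTags and the sample string are made up for illustration; they are not part of the script itself.

```javascript
// Split a semicolon-delimited descriptor string into clean tag names.
// splitTags and the sample input are hypothetical, for illustration only.
function splitTags(raw) {
  return raw
    .split(/\s*;\s*/)                               // split on ";" and swallow whitespace around it
    .map(function (t) { return t.trim(); })         // trim the outer ends of each piece
    .filter(function (t) { return t.length > 0; }); // drop empty entries (e.g. after a trailing ";")
}

var tags = splitTags("Politics and Government; Elections ;Budgets;");
// tags is ["Politics and Government", "Elections", "Budgets"]
```

Doing the same in pure Applescript means juggling text item delimiters and trimming each piece by hand.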

So here’s the script:


----tag nytimes article with its meta tags by Eric Oberle
----this proof-of-concept assumes you have devonthink currently selecting a "bookmark" or a "WebArchive" document
---it will then tag the article in Devonthink according to how the New York times tagged it.   
tell application "DEVONthink Pro"
	set thewin to window 1
	set {sel} to selection
	
	
	set js to "function GetMeta(meta_name) {var all_metas=document.getElementsByTagName('META'); for (var counter=0; counter<all_metas.length; counter++) { if (all_metas[counter].name.toLowerCase() == meta_name.toLowerCase()) { return all_metas[counter].content;}} return 'FALSE';}"
	
	set init to (do JavaScript js in thewin)
	set headline to do JavaScript "GetMeta('hdl')" in thewin
	set byline to do JavaScript "GetMeta('byl')" in thewin
	
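	---apply headline (only if the page actually has an 'hdl' meta tag)
	if headline is not equal to "FALSE" then set name of sel to headline
	
	---apply tags; skip the byline if GetMeta found none, so "FALSE" never becomes a tag
	set the_tags to (do JavaScript "GetMeta('des')" in thewin)
	set taglist to {}
	if byline is not equal to "FALSE" then set taglist to {byline}
	if the_tags is not equal to "FALSE" then
		set taglist to taglist & my find_between(the_tags, ";", "")
	end if
	
	set current_tags to get tags of sel
	set new_tags to current_tags & taglist
	set tags of sel to new_tags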
end tell



on find_between(this_text, start_string, end_string)
	----this routine returns a LIST (n.b.!) containing all chunks of text found between the start_string and end_string.  If it finds nothing it returns an empty set {}.  v.1.2
	
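	---always return a list, as documented above
	if this_text = "" then return {}
	---in the single-delimiter (CSV) case, text without the delimiter is one valid item, so only bail out for paired delimiters
	if (end_string is not "") and (this_text does not contain start_string) then return {}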
	set good_set to {}
	---my write_error_log("in find_between start:" & start_string & "end: " & end_string, 4)
	if end_string is "" and start_string is not "" then ---not paired delimiters, but CSV for example
		
		set AppleScript's text item delimiters to the start_string
		set the item_list to every text item of this_text
		log item_list
		set good_set to item_list
	else if (this_text does not contain end_string) then
		return {}
	else if (start_string is equal to end_string) then --every other one is good in this case	
		set AppleScript's text item delimiters to the start_string
		set the item_list to every text item of this_text
		set the item_list to rest of item_list --remove first one
		
		repeat ((count of item_list) div 2) times
			set text_between to first item of item_list
			set good_set to good_set & text_between
			set item_list to rest of rest of item_list
			---my write_error_log(("**find-between:start=end: " & text_between), 4)
		end repeat
		
	else if end_string contains start_string then
		set AppleScript's text item delimiters to the end_string
		set the item_list to every text item of this_text
		set the item_list to items 1 through ((length of item_list) - 1) of item_list --remove last one
		set AppleScript's text item delimiters to the start_string
		
		repeat with this_block in item_list
			set text_between to (second text item of this_block)
			set good_set to good_set & text_between
			---my write_error_log(("***find-between:beginning-in-end " & item_list), 5)
		end repeat
		
	else -- if end string and start string are not equal, and end string does not contain start, then go ahead and find the start tags then the end tags
		set AppleScript's text item delimiters to the start_string
		set the item_list to every text item of this_text
		set the item_list to rest of item_list --remove first one
		
		set AppleScript's text item delimiters to the end_string
		
		repeat with this_block in item_list
			set text_between to (first text item of this_block)
			set good_set to good_set & text_between
			---my write_error_log("**find-between:diffend:" & text_between, 5)
		end repeat
	end if
	set AppleScript's text item delimiters to ""
	return good_set
	---my write_error_log("****end find_between", 5)
end find_between
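
For readability, here is a sketch of the same lookup that the script packs into a one-line string, expanded and commented. One deliberate difference, so that it can be tried outside a web page: it takes the list of meta elements as a parameter (called metas here, my naming) instead of reading document directly. In a page you would pass document.getElementsByTagName('meta').

```javascript
// Expanded version of the GetMeta lookup. `metas` is any array-like of
// objects with `name` and `content` properties (an assumption made so the
// function also runs outside a browser).
function getMeta(metas, metaName) {
  var wanted = metaName.toLowerCase();
  for (var i = 0; i < metas.length; i++) {
    var name = metas[i].name || "";   // some meta elements carry no name at all
    if (name.toLowerCase() === wanted) {
      return metas[i].content;
    }
  }
  return "FALSE"; // same sentinel string the Applescript side checks for
}

// Stand-in for a page's meta elements; the values are made up.
var sampleMetas = [
  { name: "hdl", content: "An Example Headline" },
  { name: "BYL", content: "By A. Reporter" },
  { content: "a meta element without a name" }
];
var headline = getMeta(sampleMetas, "hdl");
var missing = getMeta(sampleMetas, "des");
```

Returning the string "FALSE" rather than a real false keeps the value easy to compare once it comes back to Applescript as text.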

The script (blindly) assumes that you have selected a bookmark or a webarchive and that the HTML in question comes from the New York Times. Those are some big assumptions, but obviously the real utility of a script like this would come if it could work against a whole library of the meta tags that various big websites use to tag their content. Having Devonthink do this automatically and handle all the parsing tricks is, I think, a big idea, and I might take it on, but since I don’t think many are even aware that this is possible, I wanted to share this right now. It’s pretty cool.
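
One way such a library could start is a simple table keyed by site. This is only a sketch: the table layout, the names SITE_META, DEFAULT_META and metaNamesFor, and the fallback choices are my assumptions. The only site-specific names taken from the script are the New York Times ones ('hdl', 'byl', 'des'); the fallback uses the standard HTML 'author' and 'keywords' meta names.

```javascript
// Hypothetical table: site domain -> which meta names hold title, author, tags.
var SITE_META = {
  "nytimes.com": { title: "hdl", author: "byl", tags: "des" }
};

// Fallback for unknown sites: standard HTML meta names. title is null,
// meaning "no meta name known, use the page title instead".
var DEFAULT_META = { title: null, author: "author", tags: "keywords" };

function metaNamesFor(hostname) {
  var host = hostname.toLowerCase();
  for (var site in SITE_META) {
    if (!SITE_META.hasOwnProperty(site)) continue;
    // match the domain itself or any subdomain of it (www.nytimes.com, etc.)
    if (host === site || host.slice(-(site.length + 1)) === "." + site) {
      return SITE_META[site];
    }
  }
  return DEFAULT_META;
}

var nyt = metaNamesFor("www.nytimes.com");
var other = metaNamesFor("example.org");
```

The Applescript side would then ask for whichever names the table returns instead of hard-coding 'hdl', 'byl' and 'des'.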

Oh, and by the way: the only bad thing about this whole system of using Javascript is that one has to issue the commands to Javascript “blind” (i.e. one doesn’t know what error occurred after the Javascript fires; the reply is just empty). HOWEVER, one can use the “Web Inspector” feature of WebKit and query the Javascript console in DevonAgent by issuing the following command in Terminal while DevonAgent is closed:


defaults write  ~/Library/Preferences/com.devon-technologies.agent WebKitDeveloperExtras -bool TRUE

A control-click on the webpage in DevonAgent will then pull up the Web Inspector, just like Safari, but with all of DevonAgent’s Applescript power! More on the Web Inspector is here: http://developer.apple.com/library/safari/#documentation/AppleApplications/Conceptual/Safari_Developer_Guide/1Introduction/Introduction.html

Hopefully it won’t be long before devonthink also has the webinspector/javascript console! (hint, hint)

More posts to come, but I think this is a BIG DEAL and it’s inspired me to do scripting again…

I have to say, this makes me really want to see Devonthink support custom metadata, such as say, the ability to embed the AUTHOR and TITLE and DATE in some sort of standard metadata format that would display in the columns of three-pane view…but that’s another post!

Erico

Thanks for the script! There’s also a script “Convert keywords to tags” available via the Support Assistant, which supports e.g. the stored & indexed keywords of web archives, rich text documents or PDF documents. Bookmarks, however, are not supported.