Meta tags from HTML code to comments

I have archived a few thousand webpages in DevonThink from various news sites, grouped them and named them after the HTML title using Automator (if anyone is interested: Get specified URLs, Get Link URLs from Webpages, Filter URLs, Display Webpages in DevonAgent, Export Webpages as web archive, Set Current Group, Add Items to Current Group).

So far so good… what I ideally want now is a script (or set of Automator actions) to do the following:

  • Search for a specified meta tag inside the HTML code of the selected items.
  • Add the content of this meta tag to the comment field in DevonThink.

For instance, I have an archive of the following webpage:
http://news.bbc.co.uk/1/hi/uk_politics/vote_2005/england/4412281.stm

… which contains the following HTML code:


<meta name="OriginalPublicationDate" content="2005/04/22 01:56:10" />

I would like to add the publication date (2005/04/22 01:56:10) to the comment field… or better still, use this to set a new creation / modification date of the selected item (then I can use the comment field for my keywords).

I have tried looking through the Automator actions and the scripts included with DevonThink, as well as searching the forums, but can’t seem to get my head around how to do this. Any help would be much appreciated.

Thanks!

There are some AppleScript routines for HTML parsing discussed here:

apple.com/applescript/guideb … 4.htm#1001

So you might use one of these:


set meta_tag to read_parse(this_file, "<meta ", "", false)
set meta_tag to read_parse(this_file, "<meta name=\"OriginalPublicationDate", "", false)

Note that the input must be a file, you must define the provided read_parse routine in your script, and you still need to parse the content property out of the meta tag, e.g.:


set url_str to "<html><meta name=\"OriginalPublicationDate\" content=\"2005/04/22 01:56:10\" /></html>"

set meta_tag to text from character (offset of "<meta name=\"OriginalPublicationDate\"" in url_str) to (length of url_str) of url_str

set content to text from character ((offset of "content=" in meta_tag) + 9) of meta_tag to (length of meta_tag) of meta_tag

set pub_date to text from character 1 to ((offset of "\"" in content) - 1) of content

Then you can modify the DT record as necessary.

Personally, though, I prefer regular expressions. Using Automator, I’d write a shell or Perl script to extract all the meta tags and send them to STDOUT as name:value pairs, e.g.:


wget -qO /dev/stdout http://news.bbc.co.uk/1/hi/uk_politics/vote_2005/england/4412281.stm | grep \<meta | sed 's/ *<meta name="\([^"]*\)" *content="\([^"]*\)" *\/>/\1:\2/' | grep -v '<meta'

Apple should never have provided me a shell if they wanted me to use AppleScript ;)
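If the page is already saved locally, or arrives on standard input (as in Automator's Run Shell Script action), the same idea can be narrowed to a single tag. A minimal sketch, assuming the `name` attribute comes directly before `content` as in the BBC markup and that the tag sits on one line; `meta_content` is a hypothetical helper name, not an existing command:

```shell
#!/bin/sh
# meta_content NAME
# Reads HTML on stdin and prints the content attribute of every
# <meta name="NAME" content="..."> tag found, one per line.
# Assumptions: name= precedes content=, the tag is not split across
# lines, and NAME contains no regular-expression metacharacters.
meta_content() {
    sed -n 's/.*<meta name="'"$1"'" *content="\([^"]*\)".*/\1/p'
}

# Example with the tag from the BBC page:
printf '%s\n' '<meta name="OriginalPublicationDate" content="2005/04/22 01:56:10" />' |
    meta_content OriginalPublicationDate
# prints: 2005/04/22 01:56:10
```

Because it reads stdin, the function works the same whether the HTML comes from `wget`, from `cat page.html`, or from a previous Automator action.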

Thanks, although I’m really confused still…

The read_parse routine requires a file, but I want it to read from the database in a similar way to the “Find similar contents” script by Christian (i.e. news.bbc.co.uk/1/hi/uk_politics/ … 412281.stm is actually a webarchive in DT and I don’t want to access the URL, but the entry in the db). How do I go about modifying this routine to take account of this?

The second part, where you explain I need to “parse the content property out of the meta tag” is a bit confusing, but how do I then modify the comments field in DT? Would it be anything like this:


repeat with this_item in this_selection
	set current_comment to the comment of this_item
	set new_item_comment to current_comment & meta_tag
	set the comment of this_item to new_item_comment
end repeat

I also tried your Perl script in Automator, but it only allows me to pass input to stdin and it comes up with an error.

Appreciate your help and patience!

Sorry, thought you were stuck on the meta tag problem, not on the actual scripting; the AppleScript and the shell script were just examples of getting the publication date (or any meta tag info) from the HTML page.

This script will look for a record with the URL you gave, and will update the comment field as you requested. Updating the date does not work; I think DT’s “date” declaration overrides the AppleScript “date” routine. I’ll post when I get it going.


tell application "DEVONthink Pro"

	-- Replace this with appropriate applescript to get the record you need to modify, e.g.
	-- set rec_list to selection of application

	set url_str to "http://news.bbc.co.uk/1/hi/uk_politics/vote_2005/england/4412281.stm"
	
	set rec_list to lookup records with URL url_str
	
	set rec to item 1 of rec_list
	-- End 'Replace this...'. Make sure 'rec' contains record to modify.


	-- Set 'html_str' to the HTML source in record
	-- For URL record use something like:
	-- set html_str to download markup from URL of rec

	set html_str to source of rec
	
	-- Get META tag with name 'OriginalPublicationDate' into 'meta_str'

	set meta_str to texts from character (offset of "<meta name=\"OriginalPublicationDate\"" in html_str) to (length of html_str) of html_str

	-- Get 'content' property of META tag
	
	set content_str to texts from character ((offset of "content=" in meta_str) + 9) of meta_str to (length of meta_str) of meta_str

	-- Get publication date from value of 'content' property 
	
	set pub_date to texts from character 1 to ((offset of "\"" in content_str) - 1) of content_str

	-- Set record comment to "Publication Date: $pub_date"
	
	set comment of rec to "Publication Date: " & pub_date

	-- Set record creation date to publication date
	-- Doesn't work:
	-- set creation date of rec to date pub_date
end tell

Run this in Script Editor (should be in Applications) and it should set the comment for the first record containing the specified URL.

I’m not much of an Applescripter; it’s an awkward language to use once you’ve programmed in, well, just about anything else. If you open the Script Editor, you can use File->Open Dictionary to select DevonThink; this will give you the list of DevonThink commands.

The following commands should be useful to you:


   set html_str to download markup from url_string
   if exists record with URL url_string
   set rec to get record at path_str
   set rec_list to lookup records with comment comment_str
   set rec_list to lookup records with url url_str
   set rec_list to selection of application


You will probably want to start out creating a script for each META tag that you want to support, then passing the script a selection and making the changes to each record in the selection.

With Automator actions, it should be possible to extract the meta tags and present the user with a dialog box where they select the appropriate tag. I think this requires an Xcode project for the Automator action; an easier alternative is to use CocoaDialog if you have it installed (and if you know shell scripting).

Looks like AppleScript supports this too via the Standard Additions dialogs ('set meta_tag_name to choose from list meta_tag_name_list with prompt "Select a META tag type:"'). Parsing the meta tags in AppleScript looks painful, but once you have them it’s pretty straightforward.
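The list of names such a dialog would offer could be built in the shell and handed over afterwards. A rough sketch, assuming every tag of interest is written as `<meta name="...">` with `name` as the first attribute; `meta_names` is a made-up helper name, and `grep -o` is a common extension (GNU and BSD grep) rather than strict POSIX:

```shell
#!/bin/sh
# meta_names
# Reads HTML on stdin and prints the name attribute of each
# <meta name="..."> tag, one per line, sorted with duplicates removed.
meta_names() {
    grep -o '<meta name="[^"]*"' |
        sed 's/^<meta name="\(.*\)"$/\1/' |
        sort -u
}

# Example: two tags on one line are both picked up.
printf '%s\n' '<meta name="Headline" content="x" /><meta name="OriginalPublicationDate" content="y" />' |
    meta_names
```

Each output line is one candidate for the chooser; tags that use `http-equiv` or `property` instead of `name` are deliberately ignored here.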

OK, here is some drop-in code for you:


on extract_content_from_meta_tag(html_src, tag_name)
	set meta_tag to "<meta name=\"" & tag_name & "\""
	set meta_str to text from character (offset of meta_tag in html_src) to (length of html_src) of html_src
	set content_str to text from character ((offset of "content=" in meta_str) + 9) of meta_str to (length of meta_str) of meta_str
	
	set result to text from character 1 to ((offset of "\"" in content_str) - 1) of content_str
end extract_content_from_meta_tag

on fix_bbc_date(bbc_date)
	-- Converts YYYY/MM/DD to MM/DD/YYYY
	set text item delimiters to " "
	-- date_time is { date, time }
	set date_time to text items of bbc_date
	set text item delimiters to "/"
	-- ymd (YearMonthDay) is {year, month, day}
	set ymd to text items of item 1 of date_time
	-- construct date string 'MM/DD/YYYY hh:mm:ss'
	set result to item 2 of ymd & "/" & item 3 of ymd & "/" & item 1 of ymd & " " & item 2 of date_time
end fix_bbc_date

tell application "DEVONthink Pro"
	set rec_list to the selection
	if rec_list is {} then error "Please select a captured web page"
end tell

repeat with rec in rec_list
	tell application "DEVONthink Pro"
		set html_str to source of rec
	end tell
	
	set pub_date to extract_content_from_meta_tag(html_str, "OriginalPublicationDate")
	
	set date_str to fix_bbc_date(pub_date)
	set pub_date to date date_str
	
	tell application "DEVONthink Pro"
		set comment of rec to "Publication Date: " & date_str
		set creation date of rec to pub_date
	end tell
	
end repeat

Took me a bit to get past the fact that AppleScript’s ‘date’ datatype is overridden by DT’s ‘date’ property. That was easy enough to figure out, but I tried all kinds of references and explicit declarations (e.g. “AppleScript’s date” and “date of AppleScript”), to no avail.

Ended up breaking the tell block up. I suppose this is better: only include the code that talks to the app inside a tell block, rather than using it as a level of scope in the script. AppleScript is an annoying language to be sure.

Anyways, this script helps me with some back-burnered projects, so the time spent figuring this out isn’t totally lost ;)

Waow! I really, really appreciate this! This has helped me enormously… I will work to modify it now to suit my other webpages (CNN and Al-Jazeera) too.

Only problem with your second code was the formatting of the date, which I had to change to:


set result to item 3 of ymd & "/" & item 2 of ymd & "/" & item 1 of ymd & " " & item 2 of date_time

…as the date format was DD/MM/YYYY.
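Since the order AppleScript's `date` expects follows the system's regional settings, it can help to make that order an explicit choice. A rough shell sketch of the same reordering, assuming the input always looks like `YYYY/MM/DD hh:mm:ss`; `reorder_date` and its `dmy`/`mdy` argument are names invented for this illustration:

```shell
#!/bin/sh
# reorder_date ORDER 'YYYY/MM/DD hh:mm:ss'
# ORDER is "dmy" (DD/MM/YYYY) or "mdy" (MM/DD/YYYY).
# The time part is passed through unchanged.
reorder_date() {
    case "$1" in
        dmy) replacement='\3/\2/\1' ;;
        mdy) replacement='\2/\3/\1' ;;
        *)   echo "usage: reorder_date dmy|mdy 'YYYY/MM/DD hh:mm:ss'" >&2
             return 1 ;;
    esac
    # '#' is used as the sed delimiter because the date contains '/'.
    printf '%s\n' "$2" |
        sed 's#^\([0-9]\{4\}\)/\([0-9]\{2\}\)/\([0-9]\{2\}\)#'"$replacement"'#'
}

reorder_date dmy '2005/04/22 01:56:10'   # prints: 22/04/2005 01:56:10
reorder_date mdy '2005/04/22 01:56:10'   # prints: 04/22/2005 01:56:10
```

Input that does not start with a four-digit year is printed unchanged, so a malformed date is easy to spot.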

I’ve done some programming before (PHP, Java, VB and such like), but AppleScript baffled me completely! Looking at your finished code I should be able to work out how to extend it to suit other needs though.

Again, really appreciate your help with this - and I’m pleased it has helped you out too! :)