Extract information from Webpages and save to DevonThink?

I’m interested in this subject on a general basis - but let’s take the Internet Movie Database as an example:
I look up a movie and want to store the relevant information in a text file in DevonThink: no layout, no ads, only the information I am interested in. Let’s say: title, year, director, cast, plot summary.
Can anybody tell me how to achieve this or direct me to some instructions on the web?
Thank you very much, Louise :smiley:

Hi, Louise:

I took a look at the site. It seems they use a standard format. Think about this from the manual approach first. So you could get everything you noted (and a bit more, which could be edited out after the rich note capture, if you wish) this way:

  • Scroll down to the Trivia line;

  • Click below the Trivia line and drag upwards to include the movie title, and click to select that range;

  • In Safari press Command-) to capture a rich text note, or in DEVONagent or the DT Pro browser, Control-click the selection and select the appropriate option to capture as rich text.

For that site you will have captured a “standard” set of information about movies.

I don’t know of any convenient way to automatically select and organize the information. I’d say that if I went obsessive I could capture bits and pieces of the information into the fields of records and end up with a sheet containing rows and columns of standardized information for as many movies as I wished. Whew! That’s grunt work.

Could this be done by scripting or Automator actions? That will require parsing the information from the initial capture. The list of characters/actors will vary in length from movie to movie. Parsing probably will require defining some sort of markers, which may or may not have to be entered.

You can definitely do what you want using AppleScript (and some other tools built into Mac OS X). I have a small tutorial on my website about running web searches and a little about processing the resulting web page.

Did you want something automated that FINDS the movie and retrieves just the desired info? If so, my little tutorial should help: Form Post and Process with AppleScript.

I don’t have a lot in there about grabbing just the parts of the page you want, but that is text-processing. You could learn Regular Expressions, but from what I see of the IMDB site you won’t need them. The trick is to use a sub-script (an AppleScript “handler”) that allows you to grab just part of the html code of the page that results from your search. My sample script has some useful handlers that allow you to grab just part of a big page without having to write all the text-processing code yourself. You’d just have to find some unique bit of HTML markup that comes before and/or after the stuff you want.
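
To make the marker idea concrete, here is a minimal sketch of that kind of handler. It is not the one from my tutorial: the name text_between, its parameters, and the sample HTML are made up for illustration, and the markers are not real IMDB markup. It returns whatever sits between the first start marker and the next end marker, or an empty string if either marker is missing.

```applescript
-- Hypothetical usage: the sample HTML and both markers are invented.
set sample_html to "<h1>Some Movie (1999)</h1><div class=\"info\">Directed by A. Person</div>"
set movie_title to text_between(sample_html, "<h1>", "</h1>")
-- movie_title is now "Some Movie (1999)"

-- Hypothetical helper: return the text between start_marker and end_marker,
-- or "" if either marker cannot be found.
on text_between(the_text, start_marker, end_marker)
	set old_delims to AppleScript's text item delimiters
	set the_result to ""
	try
		-- Drop everything up to and including the first start_marker.
		set AppleScript's text item delimiters to start_marker
		set the_parts to text items of the_text
		if (count of the_parts) > 1 then
			-- Rejoining with the same delimiter restores the rest of the page.
			set the_rest to (items 2 thru -1 of the_parts) as text
			-- Keep only what comes before the first end_marker.
			set AppleScript's text item delimiters to end_marker
			set the_parts to text items of the_rest
			if (count of the_parts) > 1 then set the_result to item 1 of the_parts
		end if
	end try
	-- Always put the delimiters back the way we found them.
	set AppleScript's text item delimiters to old_delims
	return the_result
end text_between
```

If you call it from inside a tell block, write "my text_between(...)" so AppleScript looks for the handler in your script rather than in the application.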

Let me know if you get stuck and I’ll see if I can help.


Here’s a script that does what you want (more or less). It assumes that you create imdb “bookmarks” in devonthink and that you have those selected before running the script. It then scans through those links, finds the basic movie info, and appends the info for all selected records to the end of the rich text file called “imdb clips”. Obviously, you’ll need to modify it further to do exactly what you want, but the perl search and replace subroutine is a pretty powerful and relatively fast implementation of regular expressions for Devonthink that works on multiline strings.


-----get info from internet movie database 
-----by Eric Oberle
-----assumes that you have bookmarks to imdb files selected in your current window
-----searches these bookmarks and then adds the "info" for these records to the *most recently modified rich text file* in your 
---- dt pro database that contains in its title the phrase specified by "clip_file_name" variable below 
-----modify at will! 

set all_text to ""
set clip_file_name to "imdb clips"

-----search internet movie database info in current selection of "links"
tell application "DEVONthink Pro"
	set cuPos to selection of window 1
	repeat with this_item in cuPos
		set the_Window to open window for record this_item
		repeat while loading of the_Window
			delay 1
		end repeat
		set the_source to the source of the_Window
		set original_url to URL of the_Window
		set search_string to "</table>[\\r\\n]{0,5}<div class=\\\"info\\\">(.*?)<hr/>"
		set the_info to "<html><body> <div class=\"info\"> " & my perl_strip(the_source, search_string, "$1", "", false) & "</html>"
		set the_title to get title of the_source
		set the_text to the_title & return & (get text of the_info & "
		---------------------") & return
		-----display dialog the_text
		try
			set comment of this_item to the_text
		end try
		close the_Window
		set all_text to all_text & the_text
	end repeat
	---now find target file in devonthink
	tell application "System Events"
		set old_date to date "Thursday, January 1, 2004 12:00:00 AM"
	end tell
	set target_records to search clip_file_name within titles
	if target_records is {} then
		display dialog "no records found named " & clip_file_name
	else
		set newest_record to first item in target_records
		repeat with the_record in target_records
			if type of the_record is not group then
				get name of the_record
				set new_date to get modification date of the_record
				---set state of the_record to false
				if (new_date > old_date) then
					log "date change"
					set newest_record to the_record
					set old_date to new_date
				end if
			end if
		end repeat
		get name of newest_record
		log message "Clip " & (name of newest_record) & " added: " & all_text
		----should put this on growl
		----display dialog z
		set x to rich text of newest_record
		set x to x & (all_text as styled text)
		set rich text of newest_record to x
	end if
end tell

on perl_strip(inputstring, targetstring, replacementstring, filterstring, multiline)
	---version 1.1.5
	----Uses perl regexp to extract all phrases that match TARGETSTRING.  
	----One can then use $1 $2 etc constructions in REPLACEMENTSTRING to structure the returned list 
	------ FILTERSTRING allows for all items containing a list of filters, separated by | to be excluded. 
	-----double backslashes needed for TARGETSTRING and REPLACEMENTSTRING, FILTERSTRING should be a complete truth condition
	------e.g., ($2 =~ /<img|<IMG|scale/)
	if multiline is true then
		set perl_end_string to "gis"
	else
		set perl_end_string to "gi"
	end if
	set filter_command to ""
	set foundlist to {}
	if length of inputstring is greater than 245000 then -----we must write data to a file if it is this large
		set the_data_file to "/tmp/perlstrip"
		set file_ref to open for access POSIX file the_data_file with write permission
		set eof of file_ref to 0 ----empty the file first so no data from an earlier run is left behind
		write (inputstring as text) to file_ref
		close access file_ref
		if filterstring is not "" then set filter_command to " unless " & filterstring
		set shellscript to "/usr/bin/perl -e 'open(FILE, \"" & the_data_file & "\") or die \"Unable to open file\"; " & ¬
			"$rpl=q|" & replacementstring & "|;$trgt=q|" & targetstring & "|;" & ¬
			"local $/;my $content = <FILE>;if ( $content =~ /$trgt/gis ) { push(@lines,\"" & replacementstring & "\")  " & filter_command & " }; if (@lines) {foreach $the_line(@lines) {print  $the_line . \"<perllistitem>\"" & "" & "}} '"
		log shellscript
		set theResult to (do shell script shellscript)
	else ----small input: pass the text to perl inline
		set inputstring to my replace_chars(inputstring, (ASCII character 194), "<br>")
		set inputstring to my replace_chars(inputstring, "|", "+vertical-bar+")
		set inputstring to my replace_chars(inputstring, "'", "‘")
		---set inputstring to quoted form of inputstring
		---if filterstring is not "" then set filterstring to " unless $the_line =~ /" & filterstring & "/"
		if filterstring is not "" then set filter_command to " unless " & filterstring
		set shellscript to "/usr/bin/perl -e  '$qt=q|\"|;$rpl=q|" & replacementstring & "|;$trgt=q|" & targetstring & "|;$thisvar=q|" & inputstring & "|;" & ¬
			"while ($thisvar =~ /$trgt/gis ) { push(@lines,\"" & replacementstring & "\")  " & filter_command & " }; if (@lines) {foreach $the_line(@lines) {print  $the_line . \"<perllistitem>\"" & "" & "}} '"
		log shellscript
		set theResult to (do shell script shellscript)
		set theResult to my replace_chars(theResult, "+vertical-bar+", "|")
	end if
	if theResult is not "" then
		----turn item results into list
		set oldDelims to AppleScript's text item delimiters
		set AppleScript's text item delimiters to "<perllistitem>"
		set foundlist to text items of theResult
		set AppleScript's text item delimiters to oldDelims
		log (count of foundlist)
		if (count of foundlist) is greater than 2 then
			set foundlist to items 1 through ((count of foundlist) - 1) of foundlist
		else --if there was only one result, just eliminate the <perllistitem> divider
			set the_text to (characters 1 through ((offset of "<perllistitem>" in (item 1 of foundlist)) - 1) of item 1 of foundlist) as text
			set foundlist to {the_text}
			log foundlist
		end if
	end if
	return foundlist
end perl_strip

on replace_chars(this_text, search_string, replacement_string)
	set AppleScript's text item delimiters to the search_string
	set the item_list to every text item of this_text
	set AppleScript's text item delimiters to the replacement_string
	set this_text to the item_list as string
	set AppleScript's text item delimiters to ""
	return this_text
end replace_chars