I’m interested in this subject in general, but let’s take the Internet Movie Database as an example:
I look up a movie and want to store the relevant information in a text file in DEVONthink, without the layout or the ads, keeping only the information I am interested in. Let’s say: title, year, director, cast, plot summary.
Can anybody tell me how to achieve this or direct me to some instructions on the web?
Thank you very much, Louise
Hi, Louise:
I took a look at the site, and it seems to use a standard layout for every movie. Think about the manual approach first. You could get everything you noted (and a bit more, which could be edited out after the rich note capture, if you wish) this way:
- Scroll down to the Trivia line;
- Click below the Trivia line and drag upwards to include the movie title, and click to select that range;
- In Safari press Command-) to capture a rich text note, or in DEVONagent or the DT Pro browser, Control-click the selection and select the appropriate option to capture as rich text.
For that site you will have captured a “standard” set of information about movies.
I don’t know of any convenient way to automatically select and organize the information. I’d say that if I went obsessive I could capture bits and pieces of the information into the fields of records and end up with a sheet containing rows and columns of standardized information for as many movies as I wished. Whew! That’s grunt work.
Could this be done by scripting or Automator actions? That will require parsing the information from the initial capture. The list of characters/actors will vary in length from movie to movie. Parsing probably will require defining some sort of markers, which may or may not have to be entered.
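To make the marker idea concrete, here is a minimal AppleScript sketch. It assumes, purely for illustration, that the captured cast block holds one actor per line with " ... " between actor and role. The sample text and the names cast_block, pair_marker, actor_name, and role_name are made up, and a real capture may well use a different separator.

```applescript
-- Hypothetical sample: one "actor ... role" pair per line (placeholder names).
set cast_block to "Actor One ... Role One" & return & "Actor Two ... Role Two" & return & "Actor Three ... Role Three"
set pair_marker to " ... " -- assumed separator; adjust to whatever the capture contains

set cast_rows to {}
set old_delims to AppleScript's text item delimiters
set AppleScript's text item delimiters to pair_marker
repeat with one_line in paragraphs of cast_block
	set parts to text items of (contents of one_line)
	-- Keep only lines that split into exactly two pieces around the marker.
	if (count of parts) is 2 then
		set end of cast_rows to {actor_name:item 1 of parts, role_name:item 2 of parts}
	end if
end repeat
set AppleScript's text item delimiters to old_delims

return cast_rows
```

Because the loop walks over however many lines the block contains, a cast list of any length ends up as one record per line.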
Louise,
You can definitely do what you want using AppleScript (and some other tools built into Mac OS X). I have a small tutorial on my website about running web searches and a little about processing the resulting web page.
Did you want something automated that FINDS the movie and retrieves just the desired info? If so, my little tutorial should help: Form Post and Process with AppleScript.
I don’t have a lot in there about grabbing just the parts of the page you want, but that is text processing. You could learn Regular Expressions, but from what I see of the IMDB site you won’t need them. The trick is to use a sub-script (an AppleScript “handler”) that grabs just part of the HTML code of the page that results from your search. My sample script has some useful handlers that let you grab part of a big page without having to write all the text-processing code yourself. You’d just have to find some unique bit of HTML markup that comes before and/or after the stuff you want.
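As a rough illustration of the idea (this is not the handler from the tutorial, just a minimal sketch), a "grab the text between two markers" handler can be built on AppleScript's text item delimiters. The handler name text_between and the sample markup are hypothetical.

```applescript
-- Return the text found between the first start_marker and the next end_marker,
-- or "" when either marker is missing.
on text_between(the_text, start_marker, end_marker)
	set old_delims to AppleScript's text item delimiters
	set found_text to ""
	try
		set AppleScript's text item delimiters to start_marker
		set parts to text items of the_text
		if (count of parts) > 1 then
			-- Rejoin everything after the first start marker.
			set remainder_text to (items 2 thru -1 of parts) as text
			set AppleScript's text item delimiters to end_marker
			set after_parts to text items of remainder_text
			if (count of after_parts) > 1 then set found_text to item 1 of after_parts
		end if
	end try
	set AppleScript's text item delimiters to old_delims
	return found_text
end text_between

-- Hypothetical markup, not copied from any real page.
set sample_html to "<h1>Some Movie</h1><div class=\"info\">Directed by Somebody</div>"
text_between(sample_html, "<div class=\"info\">", "</div>")
```

With the sample string above the call should return "Directed by Somebody". For a real page you would pass the page source and whatever unique markup surrounds the part you want.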
Let me know if you get stuck and I’ll see if I can help.
Louise,
Here’s a script that does what you want (more or less). It assumes that you create IMDb “bookmarks” in DEVONthink and that you have those selected before running the script. It then scans through those links, finds the basic movie info, and appends the info for all selected records to the end of the rich text file called “imdb clips”. Obviously, you’ll need to modify it further to do exactly what you want, but the perl search-and-replace subroutine is a pretty powerful and relatively fast way to bring regular expressions into DEVONthink, and it works on multiline strings.
-erico
-----get info from internet movie database
-----by Eric Oberle
-----assumes that you have bookmarks to imdb files selected in your current window
-----searches these bookmarks and then adds the "info" for these records to the *most recently modified rich text file* in your
---- dt pro database that contains in its title the phrase specified by "clip_file_name" variable below
-----modify at will!
set all_text to ""
set clip_file_name to "imdb clips"
-----search internet movie database info in current selection of "links"
tell application "DEVONthink Pro"
set cuPos to selection of window 1
repeat with this_item in cuPos
set the_Window to open window for record this_item
repeat while loading of the_Window
delay 1
end repeat
set the_source to the source of the_Window
set original_url to URL of the_Window
set search_string to "</table>[\\r\\n]{0,5}<div class=\\\"info\\\">(.*?)<hr/>"
set the_info to "<html><body> <div class=\"info\"> " & my perl_strip(the_source, search_string, "$1", "", false) & "</html>"
set the_title to get title of the_source
set the_text to the_title & return & (get text of the_info & "
---------------------") & return
-----display dialog the_text
try
set comment of this_item to the_text
end try
close the_Window
set all_text to all_text & the_text
end repeat
---now find target file in devonthink
tell application "System Events"
set old_date to date "Thursday, January 1, 2004 12:00:00 AM"
end tell
set target_records to search clip_file_name within titles
if target_records is {} then
display dialog "no records found named " & clip_file_name
else
set newest_record to first item in target_records
repeat with the_record in target_records
if type of the_record is not group then
get name of the_record
set new_date to get modification date of the_record
---set state of the_record to false
if (new_date > old_date) then
log "date change"
set newest_record to the_record
set old_date to new_date
end if
end if
end repeat
get name of newest_record
log message "Clip " & (name of newest_record) & " added: " & all_text
----should put this on growl
----display dialog z
set x to rich text of newest_record
set x to x & (all_text as styled text)
set rich text of newest_record to x
end if
end tell
on perl_strip(inputstring, targetstring, replacementstring, filterstring, multiline)
---version 1.1.5
----Uses perl regexp to extract all phrases that match TARGETSTRING.
----One can then use $1 $2 etc constructions in REPLACEMENTSTRING to structure the returned list
------ FILTERSTRING allows for all items containing a list of filters, separated by | to be excluded.
-----double slashes needed for TARGETSTRING and REPLACEMENTSTRING, FILTERSTRING should be a complete truth condition
------e.g., ($2 =~ /<img|<IMG|scale/)
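-----NOTE: perl_end_string is set here but never used; both perl commands below hard-code the "gis" flags
if multiline is true then
set perl_end_string to "gis"
else
set perl_end_string to "gi"
end if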
set filter_command to ""
set foundlist to {}
if length of inputstring is greater than 245000 then -----we must write data to a file if it is this large
set the_data_file to "/tmp/perlstrip"
set file_ref to open for access POSIX file the_data_file with write permission
set eof file_ref to 0 -----empty the file first so text left over from an earlier, longer run cannot survive
write (inputstring as text) to file_ref
close access file_ref
if filterstring is not "" then set filter_command to " unless " & filterstring
set shellscript to "/usr/bin/perl -e 'open(FILE, \"" & the_data_file & "\") or die \"Unable to open file\"; " & ¬
"$rpl=q|" & replacementstring & "|;$trgt=q|" & targetstring & "|;" & ¬
"local $/;my $content = <FILE>;while ( $content =~ /$trgt/gis ) { push(@lines,\"" & replacementstring & "\") " & filter_command & " }; if (@lines) {foreach $the_line(@lines) {print $the_line . \"<perllistitem>\"" & "" & "}} '"
log shellscript
set theResult to (do shell script shellscript)
else
set inputstring to my replace_chars(inputstring, (ASCII character 194), "<br>")
set inputstring to my replace_chars(inputstring, "|", "+vertical-bar+")
set inputstring to my replace_chars(inputstring, "'", "‘")
---set inputstring to quoted form of inputstring
---if filterstring is not "" then set filterstring to " unless $the_line =~ /" & filterstring & "/"
if filterstring is not "" then set filter_command to " unless " & filterstring
set shellscript to "/usr/bin/perl -e '$qt=q|\"|;$rpl=q|" & replacementstring & "|;$trgt=q|" & targetstring & "|;$thisvar=q|" & inputstring & "|;" & ¬
"while ($thisvar =~ /$trgt/gis ) { push(@lines,\"" & replacementstring & "\") " & filter_command & " }; if (@lines) {foreach $the_line(@lines) {print $the_line . \"<perllistitem>\"" & "" & "}} '"
log shellscript
set theResult to (do shell script shellscript)
set theResult to my replace_chars(theResult, "+vertical-bar+", "|")
end if
if theResult is not "" then
----turn item results into list
set oldDelims to AppleScript's text item delimiters
set AppleScript's text item delimiters to "<perllistitem>"
set foundlist to text items of theResult
set AppleScript's text item delimiters to oldDelims
log (count of foundlist)
-----perl prints a trailing <perllistitem>, so the last text item is always an empty string; drop it
if (count of foundlist) is greater than 1 then
set foundlist to items 1 through ((count of foundlist) - 1) of foundlist
end if
end if
return foundlist
end perl_strip
on replace_chars(this_text, search_string, replacement_string)
-----save the caller's delimiters so they can be restored afterwards
set old_delims to AppleScript's text item delimiters
set AppleScript's text item delimiters to the search_string
set the item_list to every text item of this_text
set AppleScript's text item delimiters to the replacement_string
set this_text to the item_list as string
set AppleScript's text item delimiters to old_delims
return this_text
end replace_chars
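The perl_strip handler makes the page text safe for the shell by swapping out quotes and vertical bars before pasting it into the perl command. One alternative, sketched here under the assumption that only the first match is needed, is to hand the text and the pattern to perl as command-line arguments and let AppleScript's "quoted form of" do the escaping. The handler name perl_first_match is hypothetical and is not part of the script above. Very long inputs would still run into the shell's argument-length limit, which is presumably why the script above falls back to a temporary file.

```applescript
-- Hypothetical helper, not part of the script above.
on perl_first_match(input_text, pattern_text)
	-- The perl program reads the text and the pattern from @ARGV,
	-- so neither one is pasted into the perl source itself.
	set perl_code to "my ($text, $pattern) = @ARGV; print $1 if $text =~ /$pattern/is;"
	set shell_command to "/usr/bin/perl -e " & quoted form of perl_code & " " & quoted form of input_text & " " & quoted form of pattern_text
	return do shell script shell_command
end perl_first_match

-- Quotes and vertical bars in the input need no special handling here.
perl_first_match("<b>Some 'quoted' | title</b>", "<b>(.*?)</b>")
```

The pattern needs one capture group, since the perl code prints $1. When nothing matches, the handler returns an empty string.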