an "archive selected links" script with regexp sub

erico · January 22, 2007, 10:48am

Here’s a program that will give you a pop-up list of links on current page, and will allow you to choose which of them to download and webarchive. The notable thing is that the program uses some facy subroutines to use perl regexp to get its work done. Hopefully they are useful to somebody. Please if you find a bug or have fixed one, send me a note…
erico



tell application "DEVONthink Pro"
	-----mass archive links by eric oberle
	------this program will give user a list of all links on current html page.  User can then select links to download and they will be placed in archive
	-------off of subfolder.  Scripts needs more debugging, errorchecking and feedback.  But I thought perl subroutines are useful enough in themselves to warrant some scrutiny.
	
	
	set the_source to get source of window 1
	set the_title to get name of window 1
	set source_url to get URL of window 1
	if source_url is "" then return
	set stub_url to item 1 of my perl_strip(source_url, "(http[s]*)://(.*?)/", "$1://$2", "", false)
	
	
	if the_source is not "" then
		(*   
      ----fix html with full links
      set y to my perl_replace(the_source, "(.*?)<a\\s+href=\"/(.*?)</a>(.*?)", "<a href=\"" & stub_url & "/$2", true)
      
      ----an example of finding all links
      set y to my perl_strip(the_source, "<a\\shref(.*?)</a>", "<a href$1 </a>", "", true)
      
      *)
		
		-----a more complex example: download selected links for user and make webarchive. 
		set link_list to {}
		set name_list to {}
		
		-----fetch list of all links in present document
		set link_list to my perl_strip(the_source, "<a\\shref=\"(.*?)\".*?>(.*?)</a>", "$1", "($2 =~ /<img|scale|<SPAN/i)", true)
		-----fetch list of the link names in present document
		set name_list to my perl_strip(the_source, "<a\\shref=\"(.*?)\".*?>(.*?)</a>", "$2", "($2 =~ /<img|<SCALE|SPAN/i)", true)
		
		set chosen_names to {choose from list name_list with prompt "Please choose link(s) for which to create an archive " with multiple selections allowed}
		
		log chosen_names
		if chosen_names is {false} then return
		if chosen_names is {} then return
		if chosen_names is not {} then set chosen_names to item 1 of chosen_names
		
		try
			set target_group to current group
			if target_group is {} then beep
		on error
			set target_group to root of current database
		end try
		
		set dest_group to create record with {name:the_title, type:group, URL:source_url} in target_group
		
		
		repeat with this_name in chosen_names
			set counter to 1
			repeat with step_name in name_list
				log step_name & "=?" & this_name
				if step_name does not contain this_name then
					set counter to counter + 1
				else
					set this_link to item counter of link_list
					exit repeat
				end if
			end repeat
			
			
			if this_link does not start with "http" then
				if this_link begins with "/" then
					if this_link does not start with "/" then
						set this_link to stub_url & "/" & this_link
					else
						set this_link to stub_url & this_link
					end if
				else
					set longer_stub to item 1 of my perl_strip(source_url, "(http[s]*)://(.*)/(.*?)", "$1://$2", "", false)
					
					if source_url does not start with "/" then
						set this_link to longer_stub & "/" & this_link
					else
						set source_url to longer_stub & this_link
						
					end if
				end if
			end if
			
			try
				if this_link ends with ".pdf" then
					set the_data to download URL this_link
					set newrecord to create record with {name:this_name, type:picture, URL:this_link, data:the_data} in dest_group
					set data of newrecord to the_data
				else
					
					set theData to download web archive from this_link
					set theRecord to create record with {name:this_name, type:webarchive, URL:this_link} in dest_group
					set data of theRecord to theData
					
				end if
			end try
			
			set skip_this_link to false
		end repeat
		
		
		
	end if
	
end tell



on perl_strip(inputstring, targetstring, replacementstring, filterstring, multiline)
	---version 1.1.5
	----Uses perl regexp to extract all phrases that match TARGETSTRING
	---The subroutine returns a list.  
	----But one can format how the returned items look, using  $1 $2 etc constructions in REPLACEMENTSTRING  
	---- FILTERSTRING allows for all items containing a list of filters, separated by | to be excluded. 
	----double slashes needed for inserting escaped characters into  TARGETSTRING and REPLACEMENTSTRING
	---FILTERSTRING should be a complete perl truth condition, e.g., ($2 =~ /<img|<IMG|scale/) would allow you to specify that you 
	---------want to return found examples that do not contain in the second fond group <img> tags.
	if multiline is true then
		set perl_end_string to "gis"
	else
		set perl_end_string to "gi"
	end if
	set filter_command to ""
	set foundlist to {}
	
	if length of inputstring is greater than 245000 then -----we must write data to a file if it is this large
		set the_data_file to "/tmp/perlstrip"
		open for access POSIX file the_data_file with write permission
		write (inputstring as text) to POSIX file the_data_file
		close access POSIX file the_data_file
		if filterstring is not "" then set filter_command to " unless " & filterstring
		set shellscript to "/usr/bin/perl -e 'open(FILE, \"" & the_data_file & "\") or die \"Unable to open file\"; " & ¬
			"$rpl=q|" & replacementstring & "|;$trgt=q|" & targetstring & "|;" & ¬
			"local $/;my $content = <FILE>;if ( $content =~ /$trgt/gis ) { push(@lines,\"" & replacementstring & "\")  " & filter_command & " }; if (@lines) {foreach $the_line(@lines) {print  $the_line . \"<perllistitem>\"" & "" & "}} '"
		log shellscript
		set theResult to (do shell script shellscript)
		
	else
		
		set inputstring to my replace_chars(inputstring, (ASCII character 194), "<br>")
		set inputstring to my replace_chars(inputstring, "|", "+vertical-bar+")
		set inputstring to my replace_chars(inputstring, "'", "‘")
		---set inputstring to quoted form of inputstring
		---if filterstring is not "" then set filterstring to " unless $the_line =~ /" & filterstring & "/"
		if filterstring is not "" then set filter_command to " unless " & filterstring
		set shellscript to "/usr/bin/perl -e  '$qt=q|\"|;$rpl=q|" & replacementstring & "|;$trgt=q|" & targetstring & "|;$thisvar=q|" & inputstring & "|;" & ¬
			"while ($thisvar =~ /$trgt/gis ) { push(@lines,\"" & replacementstring & "\")  " & filter_command & " }; if (@lines) {foreach $the_line(@lines) {print  $the_line . \"<perllistitem>\"" & "" & "}} '"
		log shellscript
		set theResult to (do shell script shellscript)
		set theResult to my replace_chars(theResult, "+vertical-bar+", "|")
		
	end if
	
	
	if theResult is not "" then
		----turn item results into list
		set oldDelims to AppleScript's text item delimiters
		set AppleScript's text item delimiters to "<perllistitem>"
		set foundlist to text items of theResult
		set AppleScript's text item delimiters to oldDelims
		log (count of foundlist)
		if (count of foundlist) is greater than 2 then
			set foundlist to items 1 through ((count of foundlist) - 1) of foundlist
		else --if there was only one result, just eliminate the <perlistitem> divider
			set the_text to (characters 1 through ((offset of "<perllistitem>" in (item 1 of foundlist)) - 1) of item 1 of foundlist) as text
			set foundlist to {the_text}
			log foundlist
		end if
	end if
	return foundlist
end perl_strip





on perl_replace(inputstring, targetstring, replacementstring)
	
	set inputstring to my replace_chars(inputstring, (ASCII character 194), "<br>")
	set inputstring to my replace_chars(inputstring, "|", "+vertical-bar+")
	set inputstring to my replace_chars(inputstring, "'", "&#8216;")
	set shellscript to "/usr/bin/perl -e '$rpl=q|" & replacementstring & "|;$trgt=q|" & targetstring & "|;$thisvar=q|" & inputstring & "|;$thisvar=~s|$trgt|" & replacementstring & "|gi; print $thisvar;'" ---log shellscript
	--	try
	set theResult to (do shell script shellscript)
	
	set theResult to my replace_chars(theResult, "+vertical-bar+", "|") as Unicode text
	set theResult to my replace_chars(theResult, "&#8216;", "'") as Unicode text
	(*	on error
		set theResult to inputstring
	end try *)
	return theResult
end perl_replace



on replace_chars(this_text, search_string, replacement_string)
	set AppleScript's text item delimiters to the search_string
	set the item_list to every text item of this_text
	set AppleScript's text item delimiters to the replacement_string
	set this_text to the item_list as string
	set AppleScript's text item delimiters to ""
	return this_text
end replace_chars

Bene · January 22, 2007, 4:36pm

Thanks for that great script!
A little “bug”: If you cancel the link selection, a folder will be created anyhow.

Bye.

erico · January 22, 2007, 11:11pm

thanks for the feedback! It’s always good to know if anybody else finds these useful. I fixed that bug and another one or two into the post above…

erico

darwin · January 26, 2007, 4:24pm

Hi,

sounds like an interesting script. I have two questions:

What actually does it download: The links or the websites the links refer to?
Can I use this with Devonagent, too?

Greetings

Marcus

milhouse · January 30, 2007, 3:22am

The script looks neat. I installed it in my script menu. Do I use it when browsing in Safari or DTP?

When I select it in either application, nothing happens.

I pasted all of the code in the window to the script editor and compiled and saved the script.

Sorry, I am an scripting newbie.