a script which captures web archives for all linked pages

erico · February 8, 2006, 10:44am

dear all,

Here’s a hacked-together script with very little error checking that nonetheless seems to save me a lot of time. It captures all links on the current page as web archives. It is a great thing if you need to suck down a bunch of pages but don’t like the archive system. Give it a try and tell me what you think…

erico



-- webarchive all links 1.0
-- Created by Eric Oberle, shamelessly borrowing code snippets from Christian Grunenberg
-- When you run this script, it will seek to create webarchives of all links on the currently selected page.  It will put these archives in a folder in the current group, named according to the title of the current web page.   This script is very handy for 
-- I should probably turn this into two scripts: one that downloads all links to a separate subfolder of the current group, and a second that turns all links in the current folder into webarchives.


tell application "DEVONthink Pro"
   activate
   try
   	-- Check for an open window before reading anything from it
   	if not (exists window 1) then error "No window is open."
   	
   	set current_group to current group
   	set include_string to ""
   	
   	set original_url to the URL of window 1
   	set this_source to the source of window 1
   	
   	-- A button titled "Cancel" raises error -128, which the error handler below passes through
   	set include_query to display dialog "Type a phrase that all desired links must contain, or press Return for all links:" default answer "" buttons {"Cancel", "Continue"} default button "Continue"
   	if (text returned of include_query is not "") then
   		set include_string to (text returned of include_query)
   		set these_links to get links of this_source base URL original_url containing include_string
   		set temp_folder_name to include_string & " links "
   	else
   		set these_links to get links of this_source base URL original_url
   		set temp_folder_name to the name of viewer window 1
   	end if
   	
   	set dest_group to create record with {name:temp_folder_name, type:group} in current_group
   	
   	
   	set this_title to (the name of viewer window 1)
   	
   	repeat with this_link in these_links
   		if not (exists record with URL this_link) then
   			
   			with timeout of 120 seconds
   				
   				set theRecord to create record with {name:this_title, type:link, URL:this_link} in dest_group
   				set theWindow to open window for record theRecord
   				
   				repeat while loading of theWindow
   					delay 1
   				end repeat
   				
   				set theURL to URL of theWindow
   				set theSource to source of theWindow
   				set theName to get title of theSource
   				
   				set theData to web archive of theWindow
   				set theArchive to create record with {name:theName, type:html, URL:theURL} in dest_group
   				set data of theArchive to theData
   				set source of theArchive to theSource
   				delete record theRecord -- Closes window 
   			end timeout
   		end if
   	end repeat
   	
   	
   	
   on error the error_message number the error_number
   	if the error_number is not -128 then
   		display dialog the error_message buttons {"OK"} default button 1
   	else
   		error number -128
   	end if
   end try
end tell
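
A rough sketch of the first of the two scripts proposed in the header comment: it only bookmarks every link on the current page into a new subgroup and leaves the archiving for a second pass. It is untested and assumes nothing beyond the DEVONthink Pro commands already used in the script above; the names link_group, page_links and link_url are made up for illustration.

```applescript
-- Sketch only: collect the links of the current page as bookmarks in a subgroup.
-- Uses the same DEVONthink Pro commands as the full script above.
tell application "DEVONthink Pro"
	if not (exists window 1) then error "No window is open."
	
	set page_url to the URL of window 1
	set page_source to the source of window 1
	set page_links to get links of page_source base URL page_url
	
	-- New subgroup named after the current page (hypothetical naming scheme)
	set group_name to (the name of viewer window 1) & " links"
	set link_group to create record with {name:group_name, type:group} in current group
	
	repeat with this_link in page_links
		set link_url to contents of this_link
		-- Skip links that are already in the database, as the full script does
		if not (exists record with URL link_url) then
			create record with {name:link_url, type:link, URL:link_url} in link_group
		end if
	end repeat
end tell
```

The second script would then walk the records in that subgroup and run the open-window, wait and capture steps from the loop above on each one.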

erico · February 8, 2006, 11:04am

The biggest weakness of the previous script shows up when one of the links points to a PDF. Christian, it would be really nice if DTP had a way to capture, via scripting, a loaded page containing a PDF viewed with Tiger’s WebKit.

I tried something like:


tell application "DEVONthink Pro"
	set this_url to the URL of window 1
	set this_source to the source of window 1
	set this_archive to web archive of think window 1
	set x to create record with {name:"pdf-test", type:pdf, URL:this_url}
end tell

but though it seems to load a huge binary stream into this_archive, it won’t store it as a PDF or as a picture. (Interestingly, if I query a PDF file’s type in DTPro with set x to selection; set y to first item of x; set z to type of y, it tells me that I have a “picture”, but that aside…)

What I’d like to suggest is that there be some way 1) to programmatically “know” that the current page is a PDF (or that a link points to a PDF, I suppose), perhaps through a property of a think window such as “pdf loaded”, and 2) to capture that PDF and store it in the database. Otherwise, it seems like a lot of scripts that pull down links are simply destined to fail if those links point to PDFs. Yes, I know that I could do something like look at each link, see if it has .pdf in its name, add it to the downloads with DT and fish it out of the archive… but that requires pre-scanning every link, and there’s no way from AppleScript to turn downloading on and off. I can’t seem to get “download markup from” to capture a .pdf either.
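
For reference, the pre-scan workaround mentioned above might look roughly like this in plain AppleScript (no DEVONthink calls). It is only a sketch: the handler name is_pdf_link and the sample URLs are made up, and it only catches links whose path ends in .pdf, so a PDF served from any other kind of URL would slip through.

```applescript
-- Returns true if the link's path ends in ".pdf".
-- String comparisons ignore case by default, so ".PDF" matches too.
on is_pdf_link(this_link)
	set link_text to this_link as string
	-- Cut off any query string or fragment before testing the extension
	repeat with delim in {"?", "#"}
		set cut_at to offset of (contents of delim) in link_text
		if cut_at is 1 then return false
		if cut_at > 1 then set link_text to text 1 thru (cut_at - 1) of link_text
	end repeat
	return link_text ends with ".pdf"
end is_pdf_link

-- Hypothetical sample links; in the real script these would come from "get links of"
set these_links to {"http://example.com/a.html", "http://example.com/paper.pdf", "http://example.com/b.PDF?download=1"}
set pdf_links to {}
set page_links to {}

repeat with this_link in these_links
	set link_url to contents of this_link
	if is_pdf_link(link_url) then
		set end of pdf_links to link_url
	else
		set end of page_links to link_url
	end if
end repeat

return {pdf_links, page_links}
```

The webarchive loop could then run over page_links only, leaving pdf_links to be handled some other way.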

I should also mention that this scripting issue is relevant to surfing. It would be so nice to be able to browse, find and view a PDF I want to keep, and have it easily stored in DEVONthink without major effort. Right now I end up opening PDFs in Preview from the contextual menu, fishing the file out of the temp folder, and then dragging it to DT. Or sometimes I use the download manager, but then I have to go get the file out of the archive and put it away. Either way, it’s a real pain…

Just a thought,

Eric

annard · February 8, 2006, 4:24pm

Shameless plug: try DEVONagent 2.0b with the “Get Link URLs” Automator action. It would easily get the PDF links of a webpage for import into DEVONthink Pro.