easy capture of PDF from web

kugs10 · March 2, 2007, 7:48pm

I am having trouble finding a simple way to add PDFs from the web to my DevonThink Pro database. So far the only solution I have is to save the PDF to disk and then import it manually. Even when browsing from within DevonThink, if I come across a PDF I cannot “capture” it directly.

Am I missing something?

Bill_DeVille · March 2, 2007, 8:28pm

Only partially. There is a script that would let you capture the PDF via its URL. I don’t use that one currently, as it saves the PDF into the ‘body’ of the database rather than into the internal Files folder. As a consequence, adding many PDFs with that script increases the memory demand for loading the database. (That problem will go away in the future version 2.0.)

When I come across a PDF in my DT Pro browser and wish to capture it, I click in the URL address to open that page in my default browser, then use Save As to save it to the disk.

Tip: Save the PDF to a Finder folder that has a Folder Action script attached to it. DT Pro provides (see the Extras folder on the download disk image) a Folder Action script that will automatically import the PDF into your database. Periodically I’ll go to that Finder folder and empty it, since the PDFs have already been copied into my database.
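
For anyone curious what such a Folder Action looks like, here is a minimal sketch, not the script that ships in the Extras folder. It assumes that DT Pro's `import` command accepts a POSIX path and files the item in the default location of the open database; check the scripting dictionary of your version before relying on it.

```applescript
-- Minimal Folder Action sketch (an illustration, not the bundled Extras script).
-- Attach it to the Finder folder you save PDFs into.
on adding folder items to this_folder after receiving added_items
	repeat with an_item in added_items
		-- added_items is a list of aliases; convert each one to a POSIX path
		set item_path to POSIX path of (contents of an_item)
		-- "ends with" ignores case by default, so ".PDF" matches as well
		if item_path ends with ".pdf" then
			tell application "DEVONthink Pro"
				-- Assumption: import takes a POSIX path and a database is already open
				import item_path
			end tell
		end if
	end repeat
end adding folder items to
```

The handler only forwards files whose names end in .pdf, so anything else dropped into the folder is left alone.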

ndouglas · March 3, 2007, 6:13pm

This is a royal pain in the ass.

Your options, judging from what I’ve seen:

Buy HTMLDOC or download and compile v1.9 (unstable), which (supposedly) works with CSS.
Download HTML2PS, which doesn’t work with CSS.
Buy DA and set up an Automator workflow that opens all the HTML files and exports them as PDF, then drag the PDFs into your database.
Get used to clicking a whole lot.

kugs10 · March 3, 2007, 7:26pm

Thanks for the replies. The script-folder method will meet my needs, although it’s not ideal. Is there something inherent to PDFs that prohibits a direct capture method? The bookmarklet that allows direct import of a webarchive is very efficient; it would be nice to have similar functionality for PDFs.

ndouglas · March 3, 2007, 7:44pm

Ignore me, the obvious solution has been in front of me for over a year now.

derailer.org/paparazzi/

Paparazzi will convert a webpage to PDF (I thought it was only “traditional” images like PNG/JPG/etc). And it’s AppleScriptable.

I’m making an AppleScript to take webpages you’ve selected in DEVONthink, view them, and save them as PDFs. I’ll post it here when it’s done.

kugs10 · March 3, 2007, 8:11pm

Unless I misunderstand, I don’t think the script you’re working on will help. What I need is a way to save PDFs directly from the web. For example, there are many journal articles available as PDFs. I would like to be able to import them directly to DevonThink without having to save the PDF to disk first.

I am already able to convert web pages to PDF rather easily by printing to PDF, which allows me to print directly to DevonThink.

ndouglas · March 3, 2007, 8:21pm

You’re right. Jesus. I don’t know what the hell I was reading, but it wasn’t this thread.

kugs10 · March 3, 2007, 8:25pm

Thanks anyway!

parlar · March 3, 2007, 11:22pm

There’s an AppleScript that someone posted on these boards a few weeks ago that works great for me (for when I’m using Safari and not DA).


tell application "Safari"
	try
		if not (exists document 1) then error "No browser is open."
		set theURL to URL of document 1
		if theURL is missing value or theURL is "" then error "No page loaded."
		if (theURL does not contain "pdf") and (theURL does not contain "PDF") then error "No PDF loaded."
		
		set this_name to ""
		repeat while this_name is ""
			display dialog "Saving PDF to DEVONthink Pro. Please enter a file name:" default answer this_name
			set this_name to the text returned of the result
		end repeat
		
		tell application "DEVONthink Pro"
			activate
			if not (exists current database) then error "Please open a database before using this script!"
			set theDestination to display group selector "Destination" buttons {"Cancel", "OK"}
			set thePDF to download URL theURL
			set theRecord to create record with {name:this_name, type:picture, URL:theURL} in theDestination
			set data of theRecord to thePDF
		end tell
		activate
	on error error_message number error_number
		activate
		if the error_number is not -128 then
			try
				display alert "Safari" message error_message as warning
			on error number error_number
				if error_number is -1708 then display dialog error_message buttons {"OK"} default button 1
			end try
		end if
	end try
end tell

It goes into Library/Scripts/Applications/Safari.

Is this what you’re looking for?
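
One possible refinement, offered as a sketch rather than as part of the original script: the dialog above starts with an empty name, so you have to type one every time. The hypothetical handler `default_name_for` below derives a suggestion from the last path component of the URL, using only plain AppleScript text item delimiters.

```applescript
-- Hypothetical helper (not part of the script above): suggests a file name
-- from a URL, e.g. "paper.pdf" for "http://example.com/articles/paper.pdf".
on default_name_for(theURL)
	set saved_delimiters to AppleScript's text item delimiters
	try
		-- drop any query string first
		set AppleScript's text item delimiters to "?"
		set base_url to text item 1 of theURL
		-- then keep whatever follows the last slash
		set AppleScript's text item delimiters to "/"
		set last_part to last text item of base_url
	on error
		set last_part to ""
	end try
	set AppleScript's text item delimiters to saved_delimiters
	-- URLs that end in a slash yield an empty name; fall back to a placeholder
	if last_part is "" then return "rename me"
	return last_part
end default_name_for
```

To use it, you would set `this_name` to `my default_name_for(theURL)` before the repeat loop. The `my` is needed because the call sits inside the `tell application "Safari"` block.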

erico · March 4, 2007, 1:21am

Dear all,

Here’s a much fancier version of the “pdf capture script” I wrote a couple of years ago. I keep this in my ~/Library/Scripts/ folder and assign it a hotkey with FastScripts Lite. Based on which browser you are using (and it has code to support just about all of 'em), it will try to download either a webarchive, a PDF, or a jpg/png, as appropriate to the URL in the current browser.

It also has a few blocks of code that let you automatically route your clippings based on the URL (I, for instance, like to put all of my New York Times clippings together). And it has some code to send the clippings either to a special “sort me” folder or to the current group in DT Pro, provided that the current group is tagged with the first label (red on my computer). This lets me easily vary the way I clip things while only having to remember one hotkey.

This is not, mind you, the simplest version of the “clip me” script, but the most complicated: it does nice things like detecting whether you are already in DEVONthink Pro and, if so, doing you the extra favor of not re-downloading the PDF if it is already in the browser. That can save a minute or so in many circumstances. It needs a bit more error checking, I suppose; for now, if it can’t get the PDF to work for some reason, it just stores the URL as a link in DT Pro. I find this preferable to an error message, because I can go try to figure out the problem later from DT Pro.

-Eric Oberle

p.s. If you don’t use all the browsers I do, you might want to comment out the unwanted browsers in the script. Otherwise, every time you open the script in Script Editor, it will launch every browser the script mentions. That can be a lot of launching…



(* Capture Webarchive or PDF from current Program
Captures current browser page as a webarchive or a pdf or a gif/jpg file into current database in DevonThink Pro, by trying to guess which kind of file it is. 
Written by Eric Oberle, Stanford University, borrowing code from various scripts by Christian Grunenberg for the amazing DevonThink Pro package.
If you reuse code from this script, please put an acknowledgment like this one at the top of your script.  

*)

property default_app : "Opera" ----set this variable to the application to use when running manually inside script editor

property target_app : application "DEVONthink Pro"
property the_user_agent : "Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/418 (KHTML, like Gecko) Safari/417.9.2"
set pdf_no_source to false
set this_title to "rename me"

-----check if dtpro has database open   
tell target_app
	using terms from application "DEVONthink Pro"
		if ((count of databases) is 0) then
			display dialog "please make sure devonthink has a database open"
			return 1
		end if
		set this_source to ""
	end using terms from
end tell






----figure out which browser is front most, and act appropriately  
---note: if there are some of these browsers that you would never use, comment out the corresponding code blocks
----- by enclosing them with (*   *).  Otherwise AppleScript editors will launch them whenever you open this script.

tell application "System Events"
	set front_prog to displayed name of first process whose frontmost is true
end tell
set front_prog to front_prog as string


if front_prog contains "Script" then set front_prog to default_app


if (front_prog contains "Safari") or (front_prog contains "WebKit") then
	using terms from application "Safari"
		tell application front_prog
			try
				if not (exists document 1) then error "Safari seems to have no document open."
				set this_url to the URL of document 1
				set this_source to the source of document 1
				set this_title to the name of window 1
			end try
		end tell
	end using terms from
else if (front_prog contains "DEVONthink") then
	tell application front_prog
		set this_url to the URL of window 1
		set this_title to the name of window 1
		set this_source to source of window 1
		set theName to name of window 1
		
	end tell
else if (front_prog contains "DevonAgent") then
	tell application "DEVONagent"
		if not (exists browser 1) then error "DevonAgent seems to have no document open"
		
		set this_url to the URL of browser 1
		set this_title to the name of window 1
		set this_source to the source of window 1
	end tell
else if front_prog contains "Vienna" then
	tell application "Vienna"
		set this_url to link of current article
		set this_title to title of current article
		set this_source to documentHTMLSource
	end tell
else
	
	----these programs cannot supply the source, so we'll have to download it.
	if front_prog contains "Camino" then
		using terms from application "Camino"
			tell application "Camino"
				set this_url to URL of window 1
			end tell
		end using terms from
		
	else if front_prog contains "Firefox" then
		using terms from application "Firefox"
			tell application "Firefox"
				set this_title to «class pTit» of window 1
				set this_url to «class curl» of window 1
			end tell
		end using terms from
	else if front_prog contains "Opera" then
		using terms from application "Opera"
			tell application "Opera"
				set this_url to URL of document 1
				set this_title to name of document 1
			end tell
		end using terms from
		
	else if front_prog contains "OmniWeb" then
		using terms from application "OmniWeb"
			tell application "OmniWeb"
				if not (exists browser 1) then error "No browser is open."
				
				set this_url to address of browser 1
			end tell
		end using terms from
		
		
		
		
		
		
		(* else if front_prog contains "NetNews" then
	using terms from application "NetNewsWire"
		tell application "NetNewsWire"
			set tab_num to index of selected tab
			if (tab_num is greater than 0) then
				set some_urls to URLs of tabs
				set this_url to item (tab_num + 1) of some_urls
				set tab_titles to titles of tabs
				set this_title to item (tab_num + 1) of tab_titles
			else
				set this_url to get URL of selectedHeadline
				if this_url is "" then error "Please make sure you have a web page in view in Netnewswire"
			end if
			end tell
				end using terms from
		  *)
	else
		display dialog "browser unrecognized " & front_prog
		return
		
	end if
	
	------download source if necessary
	set pdf_no_source to true
	(*try
		tell application "DEVONthink Pro"
			set this_source to download markup from this_url agent the_user_agent
		end tell
			set pdf_no_source to true
	end try *)
end if




-------Determine where in database to store incoming
-------weblinks.  Currently, I have it place  New York Times files in a special
-------folder, and I have a rule that if the current devonthink folder is tagged with the "red" label 
-------(Label 1), then I have devonthink store the pdf in that folder.
-------- Feel free to customize this code to your needs. 

tell target_app
	using terms from application "DEVONthink Pro"
		try
			if current group is "current application" then
				set cuPos to {}
			else
				set cuPos to {current group}
			end if
		on error
			try
				set cuPos to selection of think window 1
			on error
				set cuPos to {}
			end try
			
		end try
		
		
		
		(* 
		----uncomment out these lines to have the script always put all pdfs in the same place
		 set cuPos to get record at "/file elsewhere" in current database
				set cuPos to {cuPos}
				
		------then comment out the "special handling" block  below
				*)
		
		-------begin special handling code
		-----special location for urls from the New York Times
		if this_url contains "nytimes.com" or this_url contains "-nyt." then
			---create storage place for imported records
			-- check the same path that is created and fetched below
			if not (exists record at "/file elsewhere/NewYorkTimes") then
				set import_location to create location "/file elsewhere/NewYorkTimes"
			else
				set import_location to (get record at "/file elsewhere/NewYorkTimes" in current database)
			end if
			if cuPos is "current application" then set cuPos to {}
			
			
			--Put items with red folder pointing to a red labelled folder? 
		else if (cuPos is not {}) and (label of first item of cuPos is 1) then
			if kind of item 1 of cuPos is "Group" then
				set import_location to first item of cuPos
			else if parent of item 1 of cuPos is not {} then
				set import_location to last item of parent of item 1 of cuPos
			else
				set import_location to root of current database
			end if
			
		else
			---or alternatively, use the next three lines to just put all clippings into one folder.
			set import_location to (get record at "/file elsewhere" in current database)
			set cuPos to import_location
			set cuPos to {cuPos}
			
		end if
		------end special handling code
		
	end using terms from
end tell




----------------------------------
--capture webarchive & source of original link (if orginal front program was devonthink or another browser that already has
---given us the source (i.e. DevonAgent, Safari) then don't reload it)
----------------------------------
tell application "DEVONthink Pro"
	if (front_prog contains "DevonThink Pro") then
		set pdf_no_source to false
		try ------Force an error condition if no source in current link.  This usually means a pdf or a gif file is loaded in window.
			if front_prog contains "DevonThink" then
				set this_source to source of window 1
				set the_archive to webarchive of window 1
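				-- use the variables set earlier (this_title, this_url); the_title, the_url and the_comment were never defined
				set the_record to create record with {name:this_title, type:html, URL:this_url, source:this_source} in import_location
				set data of the_record to the_archive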
				set the URL of the_record to last downloaded URL
			end if
		on error
			set pdf_no_source to true
		end try
		
		if pdf_no_source then
			using terms from application "DEVONthink Pro"
				tell application front_prog
					try
						with timeout of 400 seconds
							repeat while loading of window 1
								delay 1
							end repeat
							set pdf_record to create record with {name:this_title, type:picture, URL:this_url} in import_location
							get PDF of think window 1
							set data of pdf_record to result
							
							return
						end timeout
					on error
						try
							delete record pdf_record
							set link_record to create record with {name:this_title, type:link, URL:this_url} in import_location
							set this_window to open window for record link_record
							repeat while loading of this_window is true
								delay 1
							end repeat
							set pdf_record to create record with {name:this_title, type:picture, URL:this_url} in import_location
							get PDF of this_window
							set data of pdf_record to result
							close window this_window
							delete record link_record
							-- done; return so the shared PDF block further down does not capture the same PDF twice
							return
						on error
							set pdf_record to create record with {name:"failed: " & this_title, type:link, URL:this_url} in import_location
							return
						end try
					end try
				end tell
			end using terms from
		end if
		
		
	else
		
		try
			tell application "DEVONthink Pro"
				if this_source is "" then
					log "no source" & this_source
					
					set this_source to download markup from this_url agent the_user_agent
					log this_source
				end if
				if this_source does not contain "head" or this_source does not contain "html" then
					set pdf_no_source to true
				else
					log "the source length is " & length of this_source
					set this_title to get title of this_source
					with timeout of 300 seconds
						set the_archive to download web archive from this_url agent the_user_agent
					end timeout
					set the_record to create record with {name:this_title, type:html, URL:this_url, source:this_source} in import_location
					set the data of the_record to the_archive
					return
				end if
			end tell
		on error
			try
				
				if this_source is "" then
					log "still no source"
					set this_title to get title of this_source
					set the_archive to download web archive from this_url
					set the_record to create record with {name:this_title, type:html, URL:this_url, source:this_source} in import_location
					set the data of the_record to the_archive
					return
				else
					set the_record to create record with {name:this_title, type:html, URL:this_url, source:this_source} in import_location
				end if
			on error
				log "fell through"
				set pdf_no_source to true
			end try
		end try
	end if
end tell


-----If this is a pdf or other image file , then capture it. 
if pdf_no_source then
	if front_prog contains "DevonThink" then
		using terms from application "DEVONthink Pro"
			tell application front_prog
				try
					with timeout of 400 seconds
						repeat while loading of window 1
							delay 1
						end repeat
						set pdf_record to create record with {name:this_title, type:picture, URL:this_url} in import_location
						get PDF of think window 1
						set data of pdf_record to result
						
						return
					end timeout
				on error
					try
						delete record pdf_record
						set link_record to create record with {name:this_title, type:link, URL:this_url} in import_location
						set this_window to open window for record link_record
						repeat while loading of this_window is true
							delay 1
						end repeat
						set pdf_record to create record with {name:this_title, type:picture, URL:this_url} in import_location
						get PDF of this_window
						set data of pdf_record to result
						close window this_window
						delete record link_record
						
					on error
						set pdf_record to create record with {name:"failed: " & this_title, type:link, URL:this_url} in import_location
						return
					end try
				end try
			end tell
		end using terms from
		
	else
		tell application "DEVONthink Pro"
			using terms from application "DEVONthink Pro"
				if not ((this_url ends with ".pdf") or (this_url ends with ".PDF") or this_url ends with ".gif" or this_url ends with ".GIF" or this_url ends with ".jpg" or this_url ends with ".png") then
					create record with {name:"file format not clear: " & this_title, type:link, URL:this_url} in import_location
				end if
				
				set z to download URL this_url agent the_user_agent
				if this_title is "" then
					set this_title to "rename-this-pdf"
				end if
				try
					with timeout of 500 seconds
						set pdf_record to create record with {name:this_title, type:picture, source:this_source, URL:this_url} in import_location
						set data of pdf_record to z
					end timeout
				on error
					set pdf_record to create record with {name:"failed: " & this_title, type:link, URL:this_url} in import_location
					return
				end try
				return
			end using terms from
		end tell
	end if
end if







tell application "System Events"
	
	beep
	---we're finished!
end tell

darwin · March 16, 2007, 3:32pm

to Bill:

In DA there is an action menu where you can choose “add to Devonthink …” PDF (paginated) or PDF (one page).

Is this the same as the script (producing larger files in the ‘body’), or does it produce better results, like the “Save As” and import method you described?

Thanks in advance

Bill_DeVille · March 16, 2007, 11:40pm

That command (Data > Add to DEVONthink > (PDF one-page or paginated)) will produce ‘printed’ conversions to PDF of HTML pages, but the PDF version of an HTML page will lack hyperlinks that may be in the original HTML page.

But this is neat: If you display an actual PDF page in DEVONagent and use the command to add a paginated PDF to your database, the PDF will be saved to the internal Files folder inside the database, and it will retain working hyperlinks.

That’s better than using the script, as the PDF will need only the space of its text/image to load into the database, rather than the larger load space if it were saved into the ‘body’ of the database using the script.

And it’s only one step, rather than using ‘Save As’ and then importing the PDF from the Finder to the database.

I performed an experiment to confirm this, by creating a Web site that contained a PDF with hyperlinks. I displayed the PDF in DA and selected the Data… command to save the PDF to my database.

danzac · March 22, 2007, 11:58pm

Bill,

Are there plans to be able to apply the data>save PDF paginated from a link on a page?

This would save the trouble of having to open the PDF in another DA tab first. Especially when you know you need the PDF (like downloading it from JSTOR or something like that).

I would really like to see that as a contextual menu option in a subsequent DA release.

StephenFleming · April 9, 2007, 1:46am

When I’m at the office (with a gigglebit/sec connection), I open the PDF in Safari, then right-click to open it in Acrobat. Then I take the little document icon at the top of the window (what’s that thing called, anyhow?) and drag it to the DTP icon in the dock.

You wind up downloading the PDF twice that way, but at office speeds, I don’t care. I wouldn’t do it this way from a dialup modem in a hotel room.

Bill_DeVille · April 9, 2007, 2:46am

From any browser that has the Save As command, it’s very quick to save a viewed PDF to a Finder folder that has a Folder Action script to Import the PDF to your database.

You will find the Folder Action scripts in the download disk image Extras folder. DT Pro online Help has instructions for using this approach.

Then just one keyboard shortcut does it, Command-S. Choose the folder to which the Folder Action script is attached.

True, the Save As command isn’t available in DT Pro’s built-in browser. But click in the URL field and the page containing the PDF will open in your default browser, where you may invoke Command-S and save the PDF to the folder to which your Folder Action script is attached.

stuherbert · May 6, 2007, 8:31am

Hi Bill,

It would be great to be able to import the PDF direct from the web in a future release of DEVONthink. The problem (for me, anyways) with saving the PDF to disk first is that I don’t get the PDF’s original URL into DEVONthink. It really helps me to know where the PDF has come from, for when I’m publishing a list of sources with an article.

Best regards,
Stu

darwin · May 6, 2007, 11:16am

Sorry, I’m not Bill, but perhaps this helps. I found this nice trick just two days ago (but you need DEVONagent).

Most of the time I’m using Safari, but for saving PDFs directly into my DT database, DA is much more comfortable.

So, when I see a PDF in Safari, I drag its URL onto the DA icon in the Dock. DA then opens the PDF in its browser very quickly.

Then I just have to use the action menu (to the right of the address field) and choose “Add to Devonthink”, and it’s in (URL included).

So, if you have DA, I think this is a nice solution (and it’s quick).

Best

Marcus
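
The drag to the Dock can also be scripted. This is only a sketch: it assumes the application is installed under the name "DEVONagent" and hands the URL over with the standard `open -a` shell command, so it does not depend on DEVONagent's own scripting dictionary.

```applescript
-- Sketch: send the URL of the front Safari document to DEVONagent.
-- Assumption: the application is installed under the name "DEVONagent".
tell application "Safari"
	if not (exists document 1) then error "No Safari document is open."
	set pdf_url to URL of document 1
end tell

-- "open -a" passes the URL to the named application;
-- "quoted form of" protects characters that are special to the shell
do shell script "open -a DEVONagent " & quoted form of pdf_url
```

After DA has loaded the PDF, the “Add to Devonthink” step is still done by hand from the action menu.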

Bill_DeVille · May 6, 2007, 4:21pm

Hi, Stu. The built-in DT Pro browser doesn’t let one save a PDF. But if you click in the URL address, the PDF being viewed will open in your default browser.

In the default browser, you can copy the URL to the clipboard. Then do Save As to save the PDF to a Finder folder to which you have attached the DT Pro Folder Action script to Import the PDF.

When you choose Save As, you have the opportunity to paste the URL into the Spotlight Comment field of the PDF’s Info panel. That will then document the source of the PDF.

Of course, it is easier if you have DEVONagent.
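
The copy-and-paste of the URL into the Spotlight Comment could also be scripted. The sketch below assumes the PDF is still open in Safari and has already been saved somewhere that is not yet watched by the Folder Action, so the comment is in place before the import happens. It uses the Finder's `comment` property, which is the Spotlight Comment shown in the Info panel.

```applescript
-- Sketch: write the front Safari document's URL into the Spotlight Comment
-- of a PDF that has already been saved to disk.
tell application "Safari"
	if not (exists document 1) then error "No Safari document is open."
	set source_url to URL of document 1
end tell

-- Ask which saved file should carry the URL
set saved_pdf to choose file with prompt "Select the PDF you just saved:"

tell application "Finder"
	-- the Finder's comment property is the Spotlight Comment field
	set comment of (saved_pdf as alias) to source_url
end tell
```

Once the comment is set, move the PDF into the folder that has the Folder Action attached.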