DevonAgent/Devonthink provide DOM list via JavaScript

erico · May 23, 2011, 6:15pm

I am going to state this feature request in two ways, once in english and once in computer gibberish.

Ok, first in English: I would like devonagent and devonthink to provide access to ALL the images on the finally rendered page. I need this because about 80% of what I use my computer for is accessing journals via our university system. And the aggravating and idiotic thing about half of these systems is that they, (like googlebooks), don’t allow one to easily grab a single image displayed on the screen. They do this through all sorts of silly idiotic methods involving iframes, which one can readily get around by (for example) looking at the “Activity” window in Webkit and downloading the image directly. I would love it if devonagent’s image list included all images that were displayed in the current browser, regardless of whether iframes have been used. I think it is wholly legitimate to be able to store any image I can see on my browser, and frankly webkit/safari allows me to do this. So why shouldn’t devonagent allow me to quickly drop those images into devonthink? So I would request that all images displayed in the browser window be available in the image list.

Ok, second time, in computer gibberish. What I’d really like is for devonagent to have scriptable access to the entire DOM chain of the browser via Javascript. Newer versions of Webkit nightly are actually getting close to this. If I open a javascript console in Webkit while browsing googlebooks or one of my library’s protected book pages, I can enter the command “document.images” in the console, and get the whole list of rendered image URLs. Unfortunately if I use this code:

tell application "WebKit"
	set the_tab to current tab of window 1
	set x to do JavaScript "document.images" in the_tab
	log x
end tell

Webkit returns a pathetic string: “[object HTMLDivElement]”

It is very frustrating to see that the Webkit javascript console will provide a list of all the tags but not return it via script. It’s frustrating as well that devonagent returns nothing. I would really like to be able to parse the finally rendered page…so many web programs modify the page as it was originally sent, and the crucial data is stored in the DOM (or so called .outerHTML )

So what I’d like to see is really two things: the ability for Devonthink to provide all the images in the finally rendered (post-Javascript) html / DOM chain and the ability to access this DOM with a script, and thus be able to do things like ask for


set x to do JavaScript "document.getElementById('divPage0')" in the_tab

Since I have no idea how difficult this is, I would suggest this. I know that the two requests are linked logically, but I don’t know which one is more difficult. What I can say is that if it is not that hard to provide a Javascript console to Devonagent that would allow scriptable callbacks and would be capable of returning all document DOM calls as strings, that alone would make Devonthink more useful to me than any other browser. It would be great if it worked in Devonthink as well, as I notice that devonthink these days responds to Javascript commands almost as well as devonagent (which isn’t to say it’s great, but what works, works.)

Please consider these requests seriously. The truth is that much “information work” occurs beyond password/cookie/iframe walls that really thwart much of what needs to be done. The ability to access the DOM tree as rendered and modified is a huge issue that will be part of all future browser discussions, and I’d like to see Devonthink/Devonagent arrive at the future first.

cheers,

Eric

btw, Christian, I’m happy to provide you with password/access to test this…

cgrunenberg · May 24, 2011, 11:25am

As far as I can tell, this should already be possible. DEVONagent returns the expected value in this case:


tell application "DEVONagent"
	set theWindow to browser 1
	set theImages to {}
	set cnt to (do JavaScript "document.images.length" in theWindow) as integer
	set i to 0
	repeat while i < cnt
		set theImages to theImages & (do JavaScript "document.images[" & (i as string) & "].src" in theWindow)
		set i to i + 1
	end repeat
	return theImages
end tell

erico · May 25, 2011, 5:53pm

Dear Christian,

Okay, well I’m a dork for not figuring that out! Scripting the DOM totally works in devonthink and devonagent, if you are willing to deal with some silence from the do JavaScript command. One just has to go down to a layer where you get an integer, string, or some other kind of common datatype…I was trying to pull a whole array of complex objects across the javascript-applescript barrier. You nicely show in this script that one can bring across the array, but one string at a time. Well, okay, so that’s one feature request already done.
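A minimal sketch of the "go down to a common datatype" idea, done on the JavaScript side: instead of fetching one URL per call, the page script flattens the image list into a single newline-delimited string. The function name `imageSourcesAsString` is hypothetical, and `doc` stands in for the page's `document`; whether a given browser hands the result back through do JavaScript unchanged is an assumption that would need testing.

```javascript
// Hypothetical helper: flatten document.images into one plain string,
// since only simple values (strings, numbers) seem to cross into AppleScript.
function imageSourcesAsString(doc) {
  var urls = [];
  // doc.images is array-like, so walk it by index.
  for (var i = 0; i < doc.images.length; i++) {
    var src = doc.images[i].src;
    if (src) urls.push(src); // skip images that have no source URL
  }
  return urls.join("\n");
}
```

On the AppleScript side, the returned string could then be split back into a list with text item delimiters set to a linefeed.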

Though there are still a few things here I’d really like to see. It would definitely be nice if the UI in devonagent could place all the images that the little script here yields in its image tray, so that I could save a copy of them with a simple drag-and-drop. But scripting works for now, or at least mostly: the only major barrier to doing what I would prefer to do is that I end up having to redownload the links that this script pulls up, and that sometimes fails, depending on what kind of article I’m browsing.

So this is an example where the scripting is far ahead of the UI…

This leads me to plead for another feature: is there any possibility of adding an applescript command that would check the cache for the images in question and, given a url, return said image? [for reasons explained above, I don’t want to redownload these images]

I’m thinking of an applescript command like "retrieve from cache image with URL xxxx ", returning data of type <> that one could put into a picture record in devonthink. Any possibility of adding this? I assume devonagent has to be able to query its cache internally anyway…Please consider that a request.

(Naturally, I’m also wondering if it might be possible to query the Cache.db to get the images out of there…but I have to admit I don’t really understand how the image blobs work in SQLite, whether they are actually in there if I bust my head against it, and I’d prefer not to leave applescript, etc…Any hints?)

Well, that’s my begging for features for the moment. Let me end this note with a big thank you Christian, and a hope that more exposure of the DOM data structures to applescript might come in the future. I’d love to be able to at least pull a few things out of the cache with applescript and to fish out screen images through the UI.

cheers,

Eric

Note:
For anyone besides christian who is reading this and is interested in querying the DOM: the trick right now is to use a webkit nightly in debug mode, open the web inspector, and go to the javascript console. From there, you can issue queries to the “document.” structure and walk through the whole tree. Once you get down to a datatype that is compatible between Applescript and Javascript, you can then craft a “do JavaScript” command in Applescript that will pull the data out of a webpage loaded in Devonthink/Devonagent. The ability to allow webkit to parse the DOM is HUGE, not only because it beats trying to do it yourself from raw html BUT also because you can get ahold of the actually rendered final HTML, after it has been modified by javascript and all the silly Web 2.0 junk that seems to pollute our browsers these days.
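As a sketch of what "getting down to a compatible datatype" can look like for a single element: rather than returning the element itself (which comes back as something like “[object HTMLDivElement]”), the page script can return a JSON string describing it. The helper name `describeElement` is hypothetical, `doc` stands in for the page's `document`, and the choice of fields is only an illustration.

```javascript
// Hypothetical helper: describe one element as a JSON string, so the value
// handed back to the calling script is a plain string rather than a DOM object.
function describeElement(doc, id) {
  var el = doc.getElementById(id);
  if (!el) return ""; // empty string when no element has this id
  return JSON.stringify({
    tag: el.tagName,
    id: el.id,
    html: el.outerHTML // the element as currently rendered, after any script changes
  });
}
```

For example, `describeElement(document, 'divPage0')` would yield a string containing the tag name, id, and rendered markup of that div, assuming the page has such an element.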

cgrunenberg · May 31, 2011, 9:22am

Where exactly do you want to “save a copy”? On the desktop, in DEVONthink? This could of course be scripted too. Finally, you might have to use a referrer for downloading (usually the page’s address), see “download URL” command of DEVONagent & DEVONthink Pro.

erico · June 2, 2011, 12:17am

I use devonthink for all of my scholarly and financial record keeping, and I do a lot of this with scripts. Devonthink/agent remain the most scriptable of javascript enabled browsers that I know of on any platform. (I use beautifulsoup on python when I don’t need javascript.) But Devonthink is always my preferred organizational/storage medium of choice…I hardly use the file system.

The reason why I’m scripting something so grotesque as walking through the DOM tree is that much of the most important data for my life and work sits behind walls that are protected by cookie authentication + https + javascript “onclick” request handlers, and other silly things (like huge invisible GIFs that prevent dragging). So there end up being images on the screen that I can see but can’t directly drag out of their context, and/or I want to be able to script their extraction.

The problem is that many of these cases now involve https requests where a hash id code is sent, and the server balks if one asks for the image a second time without a new hash code. So I can see the image. But I can’t save it. That’s why I’d like to pull it from the cache.

Since getting the image link from the pre-rendered, unmodified raw source doesn’t work, I’m now using the technique of querying the DOM described above, and (thanks to your help) I can get the original URL. This works for some servers, if I use the referrer and request it from the URL with the authenticated session still active.

But what I really want is to not need to ask the remote server to send the image a second time. I’d like to just pull it from cache, and be able to retrieve it from the cache, given the URL. That would be easy to do script wise, and very, very useful.

It would also be nice to be able to just pull all those (javascript) dynamically loaded images out of the image tray in devonagent. But if I had to choose, being able to just get the image out of the cache would be my choice.

I understand there is a bit of complexity here. I suppose there’s really two types of files that I’m interested in—images and pdfs, which are handled differently by the browsers, obviously. Though I can usually figure out the pdfs; it’s the images that are the biggest problem. Having a scriptable browser that will allow me to save all these things in their native formats—gifs, tiffs, pdfs, etc.—would be a BIG deal.

So that’s why I would like an applescript command of "retrieve from cache image with URL xxxx " to return an image that, yes, could be put in devonthink pro office. So I can OCR it and I can search on it and never lose it!

I can show you some examples if I’m still not making any sense.

I appreciate the response a lot by the way…it’s stopped me from trying to reverse engineer SQLite blobs! And that is priceless.

cheers,

Erico

cgrunenberg · June 2, 2011, 1:13pm

Some examples would be appreciated, that way I could check whether there’s no way yet to retrieve the images and whether there might be common interest in such an extension.

erico · June 2, 2011, 5:08pm

Here’s a quick script that provides some testing ability, and I guarantee you that if it were posted under a less forbidding and cryptic subject line (like “script to capture google book images”) it would garner some interest. It’s a good example of the principle of “avoiding double downloads” on a system that is publicly available and doesn’t require a password.


---Google books capture visible pages script for Devonthink Pro
---This script will capture the last loaded pages of a googlebooks document (a "bookmark" link) that is open and frontmost in Devonthink Pro
---It will download and save any images it finds that are not already in Devonthink Pro, naming all pages according to Googlebook's naming system.
---Google ultimately limits the pages that you may view based on your ip address and cookies.  Please respect copyright and fair use laws. 
---Incidentally, this script can be made to work with Safari as well---the Javascript will work in most mac os x browsers that use Webkit. But since storing the images in devonthink makes so much sense, why not just download them there too?

tell application "DEVONthink Pro"
	set theWindow to window 1
	set main_url to URL of theWindow
	
	
	---if user has a document window open, grab it, else get url from three pane view
	if class of theWindow is document window then
		set current_record to content record
		set the_destination to first parent of current_record
		set dest_name to name of the_destination
	else
		
		set the_destination to current group
	end if
	
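	set theImages to {}
	---default to empty strings so the page naming below still works if these elements are missing
	set book_author to ""
	set book_title to ""
	try
		set book_author to (do JavaScript "document.getElementsByClassName('addmd')[0].childNodes[0].data" in theWindow)
		set book_title to (do JavaScript "document.getElementsByClassName('gb-volume-title')[0].childNodes[0].data" in theWindow)
	end try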
	
	---get the post-rendered image list by querying javascript
	set cnt to (do JavaScript "document.images.length" in theWindow) as integer
	set i to 0
	repeat while i < cnt
		set the_url to (do JavaScript "document.images[" & (i as string) & "].src" in theWindow)
		set theImages to theImages & the_url
		set i to i + 1
	end repeat
	
	---now create the images if they are not already in Devonthink pro
	repeat with the_url in theImages
		if not (exists record with URL the_url) then
			if the_url contains "&pg" then
				--get page number
				set page_num to first item of my find_between(the_url, "&pg=", "&img") & " " & book_title & " --- " & book_author
				log page_num
				
				---download the page
				set the_page_data to download URL the_url referrer main_url
				---I would prefer:  set the_page_data to retrieve from cache URL the_url   (and then a check to download if cache retrieve fails).
				
				if the_page_data is not "" and the_page_data is not missing value then
					set new_rec to create record with {type:picture, name:page_num, URL:the_url} in the_destination
					set data of new_rec to the_page_data
				else
					log ("image size of page " & page_num as string) & " was zero "
				end if
			end if
		end if
	end repeat
	
end tell




on find_between(this_text, start_string, end_string)
	----this routine returns a LIST (n.b.!) containing all chunks of text found between the start_string and end_string.  If it finds nothing it returns an empty set {}.  
	if (this_text = "") or (this_text does not contain start_string) or (this_text does not contain end_string) then return {}
	set good_set to {}
	---my write_error_log("in find_between start:" & start_string & "end: " & end_string, 4)
	
	if (start_string is equal to end_string) then --every other one is good in this case	
		set AppleScript's text item delimiters to the start_string
		set the item_list to every text item of this_text
		set the item_list to rest of item_list --remove first one
		
		repeat ((count of item_list) div 2) times
			set text_between to first item of item_list
			set good_set to good_set & text_between
			set item_list to rest of rest of item_list
			---my write_error_log(("**find-between:start=end: " & text_between), 4)
		end repeat
		
	else if end_string contains start_string then
		set AppleScript's text item delimiters to the end_string
		set the item_list to every text item of this_text
		set the item_list to items 1 through ((length of item_list) - 1) of item_list --remove last one
		set AppleScript's text item delimiters to the start_string
		
		repeat with this_block in item_list
			set text_between to (second text item of this_block)
			set good_set to good_set & text_between
			---my write_error_log(("***find-between:beginning-in-end " & item_list), 5)
		end repeat
		
	else -- if end string and start string are not equal, and end string does not contain start, then go ahead and find the start tags then the end tags
		set AppleScript's text item delimiters to the start_string
		set the item_list to every text item of this_text
		set the item_list to rest of item_list --remove first one
		
		set AppleScript's text item delimiters to the end_string
		
		repeat with this_block in item_list
			set text_between to (first text item of this_block)
			set good_set to good_set & text_between
			---my write_error_log("**find-between:diffend:" & text_between, 5)
		end repeat
	end if
	set AppleScript's text item delimiters to ""
	return good_set
	---my write_error_log("****end find_between", 5)
end find_between
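For comparison, the basic job of the find_between handler (collect every chunk of text that sits between a start marker and an end marker) can be sketched in JavaScript. This is not an exact port of the handler above: it scans left to right and does not reproduce the handler's separate branches one for one. The name `findBetween` and the sample URL in the note below are made up for illustration.

```javascript
// Sketch: return an array of all substrings found between start and end.
// Returns an empty array when either marker is empty or nothing matches.
function findBetween(text, start, end) {
  var found = [];
  if (!start || !end) return found; // guard against empty markers
  var from = 0;
  while (true) {
    var s = text.indexOf(start, from);
    if (s === -1) break; // no further start marker
    var contentStart = s + start.length;
    var e = text.indexOf(end, contentStart);
    if (e === -1) break; // start marker without a closing end marker
    found.push(text.substring(contentStart, e));
    from = e + end.length; // continue scanning after this end marker
  }
  return found;
}
```

With a made-up URL such as "http://books.example/books?id=abc&pg=PA12&img=1", calling `findBetween(url, "&pg=", "&img")` gives a one-item array containing "PA12", which is the page label the script uses in the record name.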

Save the script and assign it a keystroke. Then find a book on google books that has a “preview”…Make a bookmark to it in Devonthink. Put that bookmark in its own group. Open the bookmark in its own window. Start scrolling through the book, and every few pages hit the keystroke. This will capture the book images into the Devonthink group. This is obviously already useful. (See for example the painful process of screen grabs described in this thread: http://www.devon-technologies.com/scripts/userforum/viewtopic.php?f=2&t=6259&p=28241.)

But here’s the deal: google places a cookie on one’s machine, and counts the number of pages that are requested. Using this script will get one shut out of the book in question faster than if one is merely scrolling through the book, because it requests each page twice (once when you view it, once when you run the script.) If the page images could be pulled from the cache, then this becomes immediately more useful.

Ilibrary and other such authenticated systems are even more severe—they actually won’t give you the image if you request it again. But the google books example (which counts downloads and associates them with ipaddress and cookies, as far as I can tell) I think makes the case pretty well, and isn’t discipline or school specific…I can share more examples of course as I develop them, but this is just hacked together quickly for the purposes of continuing this (for me) exciting discussion.

Naturally I’m as interested as you are if other people are reading this thread. I imagine Bill might have something to say

cheers,

Erico

edit: corrected stupid bug with size detection

cgrunenberg · June 10, 2011, 12:25pm

I’ve just added a new command “get cached data for URL … from …” to DEVONagent & DEVONthink Pro. It’s using the specified/current tab and its downloaded & rendered resources, not the disk cache. Here’s a simple example:


-- Capture DOM images

tell application "DEVONthink Pro"
	set theWindow to think window 1
	set cnt to (do JavaScript "document.images.length" in theWindow) as integer
	set i to 0
	repeat while i < cnt
		if (do JavaScript "document.images[" & (i as string) & "].complete" in theWindow) is "true" then
			set theImage to do JavaScript "document.images[" & (i as string) & "].src" in theWindow
			try
				set theData to get cached data for URL theImage
				set data of (create record with {name:"", URL:theImage, type:picture} in incoming group) to theData
			end try
		end if
		set i to i + 1
	end repeat
end tell

Just send me an email in case you’re interested in an early build.

erico · June 11, 2011, 12:06am

Awesome…! I’ll introduce a couple of scripts based on this in the next weeks. Thank you for implementing this…!