Why Can't I Script Creating Webarchives?

I want to do something I thought would be quite simple to do with DTPro scripting, but which instead seems impossible.

My quest:

  • Give DTPro a URL, and have it create a webarchive.

Either I’m missing something quite simple, or it is utterly impossible in DTPro’s scripting implementation.

How to do this by hand:

  1. Create a new link with the URL. (This step alone can be automated with AppleScript; see the sketch after this list.)
  2. Capture Web Archive from the URL link. Someone has cleverly decided that not only can this step not be automated from DTPro’s AppleScript dictionary, but by placing this option only in a contextual menu (!), it can’t even be automated via GUI scripting.
  3. Delete the link.
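
For what it’s worth, step 1 by itself looks something like this (a minimal sketch; the URL is only an example, and type:nexus is assumed to be DTPro’s link/bookmark record type):

-- Minimal sketch of step 1: create a link (bookmark) record for a URL.
tell application "DEVONthink Pro"
	set theLink to create record with {name:"Apple", type:nexus, URL:"http://www.apple.com"}
end tell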

Now, for whatever reason, folks keep repeating that to get webarchive functionality, you need to purchase DEVONAgent. Obviously, this is not true. You can do this in DTPro. But there is no possible way to automate the process.

Oh, well. It was fun to play with the demo for a day. If you ever introduced an AppleScript dictionary that would let me automate my web scrapbooking needs, I’d purchase your fine product.

I don’t have a solution offhand (or time to investigate), but I was curious about something: if you think/know you can automate this in DEVONagent and then import it into DEVONthink (either sending it via DEVONagent or some other supported automated import method), why are you averse to that? There’s a nice little bundled package available containing both DEVONthink Pro and DEVONagent for under $100 US (less for academics).

I’m wondering if there’s some other showstopper about DevonThink? (And we’re probably off-topic here)

Because I don’t want to purchase a second product to correct something missing in DTPro’s AppleScript dictionary.

Because I don’t want to constantly be running an app I have no use for.

Because I don’t want the script to be slow and resource intensive by having to involve a third application.


The functionality is already present in DTPro. They wouldn’t even have to add real AppleScript support if they’d just put the menu selection in a regular menu instead of a contextual menu; then it could be driven by GUI scripting. Apple is pretty clear in its Human Interface Guidelines that menu items should never appear only in a contextual menu.
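
To illustrate the point: if the command lived in a normal menu, one line of GUI scripting would reach it. The menu and item names below are hypothetical, since the real command exists only in the contextual menu, and GUI Scripting (access for assistive devices) has to be enabled:

-- Hypothetical: this would only work if "Capture Web Archive" were in a regular menu.
tell application "System Events"
	tell process "DEVONthink Pro"
		click menu item "Capture Web Archive" of menu "Data" of menu bar 1
	end tell
end tell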

I have no idea if the idea is to sell more copies of DEVONAgent or not, but in my particular case, they’re losing a sale instead.

Can’t you just do it in the following way?
(all these steps can be automated; a rough sketch follows the list)

  1. Open the URL.
  2. Select all and copy to the clipboard.
  3. Make a new rich text.
  4. Paste the clipboard into the new rich text.
  5. Save the new rich text.
  6. Delete the URL.
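
Roughly, as an untested sketch (the nexus record type, the paste clipboard command and its destination, and the Cmd-A/Cmd-C keystrokes are assumptions about DTPro’s dictionary and UI, not verified calls):

tell application "DEVONthink Pro"
	activate
	-- Step 1: open the URL in a browser window and wait for it to load
	set theLink to create record with {name:"Temporary Link", type:nexus, URL:"http://www.example.com"}
	set theWindow to open window for record theLink
	repeat while loading of theWindow
		delay 1
	end repeat
	-- Step 2: select all and copy via GUI scripting (needs assistive access enabled)
	tell application "System Events"
		tell process "DEVONthink Pro"
			keystroke "a" using command down
			keystroke "c" using command down
		end tell
	end tell
	-- Steps 3-5: make a new record from the clipboard (assumed command and destination)
	set theText to paste clipboard to current group
	-- Step 6: remove the temporary link
	delete record theLink
end tell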

I’m not sure this is exactly what you’re looking for, but…

Alb

Try this:


tell application "DEVONthink Pro"
	with timeout of 30 seconds
		-- Create a temporary link and open it so the page gets rendered
		set theRecord to create record with {name:"Temporary Link", type:nexus, URL:"http://www.apple.com"}
		set theWindow to open window for record theRecord
		
		-- Wait until the page has finished loading
		repeat while loading of theWindow
			delay 1
		end repeat
		
		-- Grab the rendered page's URL, title and web archive data
		set theURL to URL of theWindow
		set theSource to source of theWindow
		set theName to get title of theSource
		set theData to web archive of theWindow
		
		-- Store the web archive data in a new record
		set theArchive to create record with {name:theName, type:html, URL:theURL}
		set data of theArchive to theData
		
		delete record theRecord -- Closes window
	end timeout
end tell

Note: To capture a web archive, you need a rendered page.

Then Apple is ignoring its own guidelines in almost every application too; please have a look at Safari, Mail or TextEdit, for example.

The script above seems to solve half my problems. But on several web pages, the Apple home page included, the webarchives are not real webarchives. Disconnecting from the network leaves these pages rendering incorrectly; apple.com, for one. Other webarchives work just fine when disconnected from the network. I have no idea what makes the distinction.

But in any case, thanks for the code above.


And on a less important point of style:

Point taken. But the HIG rule against having menu commands orphaned in contextual menus only has been observed pretty consistently by Mac OS apps dating back to the OS 8 days. You’re not merely ignoring parts of the HIG that everyone else also ignores; you’re ignoring a part of the HIG that everyone else observes.

petey:

[1] Some web pages – including the Apple example – are dynamic in the sense that they keep changing. Try several Web Archives in succession of the Apple home page and you will see what I mean. Sometimes you will see (in the archives) a series of iPods floating across the window, sometimes you will see a feature for the new iMac.

[2] Christian was talking about Apple’s (and many other developers’) violation of the HIG specifically re menu/contextual menu items. Yes, Apple often does ‘violate’ that HIG. There can be good reasons for that, especially in trying to keep things relatively simple for users. :slight_smile:

There’s a balance between having very involved menu options, and (contextual menu) options that are available in specific circumstances. There’s also a balance between making features simple for most users, and making life easy for power scripters.

If you check the scripting dictionary for DT Pro, you will find that it is very large already, and will continue to grow. That’s also the case for DEVONagent. Those two applications fit well together. DT Pro makes it very easy for a user to choose plain text, rich text, page or Web Archive captures of a page at a time. DEVONagent multiplies an individual’s power to capture information from the Web. A couple of weeks ago I created a new DT Pro database containing more than 10,000 documents, using DA for the Web captures and transferring them to my DT Pro database. Then I used DT Pro’s search/sort capabilities to quickly winnow down (and organize) the database to about 5,000 documents most closely reflecting my project interests.

Of course it would be possible to merge DT Pro and DA into a single program possessing all the features of both applications. But therein lies bloatware.

I’ve noticed this too and created/loaded several archives in DEVONagent, DEVONthink and Safari. Although all necessary resources were stored in the archives, none of the programs could load all pages completely while offline. Therefore it’s either a dynamic page loading different resources each time it is viewed or it’s a bug of the WebKit used by all those applications.

That’s really too bad, as it reduces the usefulness of webarchives across the entire platform. If a webarchive isn’t going to be able to replicate what the user was seeing, even in the condition where the viewed website was removed from the internet, then webarchives are kinda useless as an archival file format.

Internet Explorer for Mac had a web archive file format that didn’t suffer from these problems. Too bad the Surfin’ Safari team couldn’t have equalled that effort.

petey:

Internet Explorer archives of the Apple Web site are much worse than the Web Archives created by the script. Try a succession of archive captures of the Apple home page using IE (which I had not used for some time – please don’t make me use it again).

Probably because Internet Explorer takes longer to download an archive, it is much more likely to miss page elements than is WebKit.

As I keep saying, the artifacts of the Web Archive capture script that you and Christian saw are the result of the fact that the Apple home page is dynamically changing over short time intervals. That’s the explanation – the whole and complete explanation. :slight_smile:

Don’t put down the usefulness of Web Archives. The Web Archives I captured of the Apple Home page were uniformly better than the Internet Explorer webarchives. Every Web Archive capture that I did when Apple was pushing out marching iPods captured the iPods in motion. Internet Explorer webarchive captures never captured all of the iPods, and the image was static, not moving, when viewed offline.

Every one of the Web Archive captures that I did using Christian’s script was complete, with no missing elements. More than half of the Internet Explorer webarchives had some entirely missing elements and none completely represented the Apple home page. I did the captures using a 2 GHz G5 iMac, and with a 4.1 mbps broadband Internet connection. I suspect that had I used a slower computer and/or a slower Internet connection, I would have encountered missing elements in the Web Archives captured using the script. After all, a Web Archive is a capture of a slice in time of the page, which Apple is dynamically changing rather quickly. Clearly, Internet Explorer webarchive captures are slower than WebKit, so it’s more likely to be messed up by dynamically changing Web pages. Not a single one of the IE webarchives was a full and complete version of the Apple home page. Score (on my computer): WebKit 1, Internet Explorer 0.

Have you tried disconnecting from the network before attempting to view saved webarchives?

The problem with webarchives is that they are not self-contained. They rely on being able to connect to the original website - hence their unreliability as an archival solution. The IE for Mac web archives may have had their own problems, but they were at least self-contained.

In my testing, I found over half of the webarchives I saved from a variety of sources were trying to fetch elements from the web to display.

You can keep saying it, but you really need to rethink what’s going on here.

The problem seems to not be Christian’s script, nor does it seem to have anything to do with DEVONthink. The problem seems to lie within WebKit.

To understand the problem, either view your webarchives disconnected from the web, or even better, install some tool like Little Snitch to see a detailed view of what’s going on with network connections when you view a saved webarchive.
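
Even without extra tools you can get a rough idea of what a given archive actually stored. A webarchive is a binary property list, so a sketch along these lines (the file name and path are just examples, and it assumes a reasonably recent plutil) lists the resource URLs it contains:

-- Sketch: list the resource URLs stored inside a saved webarchive.
-- Assumes a file named page.webarchive on the Desktop; adjust the path as needed.
set archivePath to POSIX path of (path to desktop) & "page.webarchive"
do shell script "plutil -p " & quoted form of archivePath & " | grep WebResourceURL"

Anything the page needs that doesn’t show up in that list has to come from the network when the archive is viewed.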

With WebKit’s webarchives, if the website of a saved webarchive changes or disappears in the future, your ability to view your data will be severely impacted. IE for Mac’s web archives did not suffer from that problem.

Petey,

I’m not sure if this behaviour is really a WebKit bug. For example, let’s assume that a website uses one or more scripts to load/modify random parts of a web page. After capturing a web archive, the scripts will be executed again and might need different resources, which would then be loaded. That might sound like a bug… BUT: many pages wouldn’t work without scripts. Now, what’s the better solution? Ignoring scripts or trying to download some data? Both solutions have their disadvantages.

Anyway, I still guess there’s a REAL bug in the WebKit related to loading web archives.

petey:

Yes, I will keep saying it, because I did real-world experiments and reported the results.

I most certainly did turn off Airport after making archive captures using the script and Internet Explorer. I reported what the archives looked like offline. My results using the script and WebKit were much better than the results using Internet Explorer. When viewed offline, there were defects in every single one of the archives created using Internet Explorer; none of them had successfully captured all the resources being ‘pushed’ by Apple at the moment of capture.

My point is that, whether or not there are bugs in WebKit (there are), it will always be possible to encounter Web pages that cannot be successfully archived. It will always be logically possible for a Web page developer to ‘push’ changes in a Web page using techniques that cannot be ‘seen’ by the application that’s capturing the archive.

As a practical matter, when a page is undergoing rapid changes, slower archive captures will be more likely to fail to represent a moment in time state of the Web page than will faster archive captures. Variables include script/application speed, CPU speed and Internet connection speed. On my computer, Christian’s script to capture Web Archives performed much better than Internet Explorer.

Experiment update: Today, while offline, I looked again at the series of Web Archive captures made of the Apple home page yesterday. Remember that Apple is pushing changes between views of the Web page. This is done, apparently, using a time-dependent script to switch views. So when I open one of yesterday’s Web Archives, there appears to be a time-dependent switch running. If I’m in the right time interval, I see a complete representation of the page with all elements captured. But if I close and reopen the same archive, I may see a representation of a different page, with some unloaded resources. Very interesting, and supporting my point. I also looked again at the IE web archives, all of which remained imperfect captures of the Apple home page.

So what did you expect? :slight_smile:

Hi,
I am trying to run this script in DT3Pro and I am not getting the results I expected…what should I change to make it work?
Thanks

That is impossible to say since you haven’t clarified anything about “I am not getting the results I expected”.

:slight_smile: I was trying to create webarchives from the clipboard.
Now I can, but I am not able to capture the name of the URL. How can I run the scripts you have provided from an outside script?

Thanks

The name of the web page should be added automatically.
What script are you referring to?