Web Archive Question

Hey all,

Page 25 of the DEVONthink Pro Office manual describes a web archive as saving both the HTML code and all the resources needed to display the page.

I was curious whether there’s a way to tweak how web archives work, such as having them pull more information than just the base page referenced. Here’s my use case.

I’m studying Japanese grammar, and I’m storing the grammar for each item I come across on a website. I always get a little worried that these sites can go away at any time, or that I may not have internet access. An example page is:

renshuu.org/grammar/287/%E3 … 6%E3%81%A1

When pulling a web archive, it seems to capture just the main page, but not the “example sentences”. Their example sentences work a little differently from other sites: when you click the link, the page makes an Ajax call to the server to fetch the sentences, and clicking “next” pulls the next set, and so on.
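
Since it’s just an Ajax call, I assume the request could be replayed by hand; I imagine it would look roughly like the sketch below, though the endpoint and parameters here are pure guesses on my part, not the site’s actual API:

```python
import requests

# Hypothetical endpoint -- the real URL and parameters would have to come
# from watching Chrome's Network tab while clicking "example sentences" / "next".
AJAX_URL = "https://www.renshuu.org/grammar/example_sentences"  # assumption

def fetch_example_page(grammar_id, page):
    """Replay one of the Ajax calls the page makes when you click 'next'."""
    resp = requests.get(AJAX_URL, params={"id": grammar_id, "page": page})
    resp.raise_for_status()
    return resp.text  # assuming the server returns an HTML fragment

# e.g. fetch_example_page(287, 1), fetch_example_page(287, 2), ...
```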

My overall goal is to pull both the main page and all of those links. I can do this programmatically, and will fall back on that if necessary, but I’m hoping for a non-programmatic solution too. So my main question is: is there a way to configure “Web Archive” to pull in information as I click around an archived page? That way I could capture the base page and, as I navigate, have the additional content pulled into the cache.

If you are thinking PDF may be better, it kind of would be, but it isn’t perfect either. The problem with PDF in this case is that the clipper reloads the page before clipping, which resets it back to the bare grammar point. I’m using Chrome for this test.

I haven’t tried DEVONagent yet; that may handle it better, I’m not sure. On the programmatic side, I’m tempted to write an AppleScript that walks through each element here and spawns a Python script using Beautiful Soup, which would pull down the examples, pick them out, and write them to a file. The AppleScript would then pick up that file and import it into DEVONthink. I haven’t written this yet, and it doesn’t sound too hard, but I wanted to ask here first before I go down that road.
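
In case it helps to see it concretely, the Beautiful Soup half would be something along these lines; just a sketch, and the selector is a placeholder since the real class names depend on the site’s markup:

```python
from bs4 import BeautifulSoup

def extract_sentences(html_fragment):
    """Pick the example sentences out of an HTML fragment from the site."""
    soup = BeautifulSoup(html_fragment, "html.parser")
    # "div.example_sentence" is a placeholder selector -- the real one
    # would come from inspecting the page source.
    return [div.get_text(" ", strip=True)
            for div in soup.select("div.example_sentence")]

def write_examples(sentences, path):
    """Write the sentences to a simple HTML file for AppleScript to pick up."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><body>\n")
        for s in sentences:
            f.write("<p>{}</p>\n".format(s))
        f.write("</body></html>\n")

# The AppleScript side would then import that file into DEVONthink.
```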

No, webarchives aren’t configurable. In fact, they have become less useful in recent years because of dynamically loaded content. This means some content may be missing from the webarchive unless you’re connected to the Internet, which runs counter to the original point of the format.

Have you experimented with DEVONthink Pro Office’s website download features? See File > Import > Website … to access the dialogs. Download has a lot of parameters, including following links. Not sure how well download plays with Ajax – but the feature is worth a try.

An alternative approach is the SiteSucker application.

Both of these, the download feature and SiteSucker, create a localized copy of a website with links intact and rewritten to point to the local copies, which is different from a webarchive. Because these downloaders can grab copies of all the assets a site uses, the result can take up a lot of room.

nods Yeah, the other downside is that it wouldn’t work with classification either, given that the sentences aren’t being pulled in. That’s unfortunate.

I didn’t know about the Import > Website option in DEVONthink; that’s kind of useful. Unfortunately, it doesn’t look like it pulled in the Ajax content either.

I’ll take a look at SiteSucker. There are a few websites I’d like to pull local copies of; programming documentation is a common one. I did some of that in the past and pushed it into DEVONthink, but the size issue you mention is something I have to be careful of. DEVONthink has an upper limit on what it can take in effectively, and once that limit is reached it starts to grind to a halt. I ran into that problem trying to import my emails.

So one big thing I’m hoping to do is also strip all the cruft from the pages. It would be nice to remove things like headers, the links that open the dictionary, etc., and keep only the content I really need. That way DT plays well for longer.
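
For that clean-up step, something along these lines is roughly what I have in mind (again just a sketch; the tags and class names to strip are placeholders and would depend on the actual pages):

```python
from bs4 import BeautifulSoup

def strip_cruft(html):
    """Drop headers, navigation, scripts, dictionary links, etc.; keep the content."""
    soup = BeautifulSoup(html, "html.parser")
    # Placeholder selectors -- whatever actually wraps the header, nav bar,
    # and dictionary pop-up links on the real pages.
    for selector in ("header", "nav", "script", "a.dictionary-link"):
        for tag in soup.select(selector):
            tag.decompose()
    return str(soup)
```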

I’ll check out SiteSucker, though; it may help shorten the steps needed to get this working.

Since Ajax / jQuery generates the page on the fly – i.e., there’s no “there” there until you click the button that initiates the query – I think a download utility might not work unless you had the underlying database itself and a replica of the site.