What happens to a Web Archive if a later visit returns a 404

randykahle · February 28, 2013, 1:08pm

I opened an older DEVONthink database and clicked on some WebArchive links. I believe what I saw was that a request was sent to the saved URL and the saved page was updated with a 404 notification.

I say “I believe” because it happened quickly and I was surprised, so I didn’t take careful notice.

Can someone explain exactly how a WebArchive works and why I would use this instead of a PDF print or a rich text capture of a page?

Thank you!

Randy

Bill_DeVille · February 28, 2013, 9:49pm

WebArchive is a proprietary filetype created by Apple. It’s much like an HTML capture of a Web page, except that, unlike HTML, it includes images present on the page, and so displays images even when the computer is offline. (But don’t assume that users of other operating systems can properly view your WebArchive files.)
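To make the "HTML plus its images in one file" idea concrete, here is a minimal Python sketch that lists what is stored inside a WebArchive. It assumes the file is a property list with a `WebMainResource` entry and an optional `WebSubresources` list, which is how Safari-generated archives are commonly laid out. The function name and the in-memory sample data are hypothetical, for illustration only, and are not part of DEVONthink.

```python
import plistlib


def list_webarchive_resources(data: bytes):
    """Return (url, mime_type, size_in_bytes) for each resource in a WebArchive.

    Assumes the usual layout: one WebMainResource (the HTML page) plus an
    optional WebSubresources list (images, stylesheets, scripts).
    """
    archive = plistlib.loads(data)  # auto-detects binary or XML plists
    resources = [archive["WebMainResource"]] + list(archive.get("WebSubresources", []))
    return [
        (
            r.get("WebResourceURL"),
            r.get("WebResourceMIMEType"),
            len(r.get("WebResourceData", b"")),
        )
        for r in resources
    ]


# Hypothetical in-memory archive standing in for a real .webarchive file.
sample = plistlib.dumps(
    {
        "WebMainResource": {
            "WebResourceURL": "http://example.com/",
            "WebResourceMIMEType": "text/html",
            "WebResourceData": b"<html><img src='a.png'></html>",
        },
        "WebSubresources": [
            {
                "WebResourceURL": "http://example.com/a.png",
                "WebResourceMIMEType": "image/png",
                "WebResourceData": b"\x89PNG-placeholder-bytes",
            }
        ],
    },
    fmt=plistlib.FMT_BINARY,
)

for url, mime, size in list_webarchive_resources(sample):
    print(url, mime, size)
```

For a real file, you would pass the bytes read from disk, e.g. `list_webarchive_resources(open("page.webarchive", "rb").read())`, where the filename is a placeholder. Any image that shows up in the list is stored in the archive itself, which is why it can be displayed offline.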

Bugs in Apple’s WebKit code, some of which have existed for a long time, can create problems with some WebArchive files, especially older ones.

Another problem with WebArchive files is that when a page contains dynamic graphics, opening the file while online triggers an attempt to display those graphics, which sometimes results in strange displays. (PDF captures sometimes display such artifacts if a dynamic image is changing during the capture.)

I rarely want to capture an entire Web page, as I’m interested in certain content, such as an article presented on the page, and want to exclude irrelevant content such as ads, etc.

My preferred capture mode is as rich text of a selected portion of the page that contains the desired article, including links, images, tables, etc. DEVONthink’s rich text capture mode uses a Service that is invoked by pressing Command-) to capture the selected area. On occasion, where the placement of images or text boxes may be important, I will instead capture the selected portion (only) as WebArchive using the Service command Command-%. (Note: These Services are not properly supported by the Chrome or Firefox browsers, so I avoid using them for captures to DEVONthink. These Services are supported by Safari, DEVONagent Pro and DEVONthink’s browser.)

In brief, if you capture an HTML Web page as a bookmark, no page content is downloaded to your computer. If the referenced page is no longer available, a 404 error will result. If the page is captured as an HTML file and is no longer available, the text will still be displayed but the images will not. If the page is captured as WebArchive, the text and images should still be available even after the page vanishes from the Web. But as noted, a buggy WebArchive or one displaying dynamic images may produce “strange” results. (I haven’t experienced problems with selected portions of HTML pages captured as WebArchive, though.)

randykahle · March 1, 2013, 3:06am

Awesome, comprehensive answer.

Thank you very much – Randy