Web Archive or Add Page

lshear · June 14, 2007, 1:42pm

What’s the verdict: when is it best to import a web page as a “web archive,” and when is it best to import it as a “web page”? I use scripts for both.

Is one file smaller than another?
Or am I looking at six-of-one vs half-dozen of another?

Thanks in advance,

J.

Bill_DeVille · June 14, 2007, 3:05pm

Capturing the Page results in a typically rather small text file that contains the HTML code for the page.

Capturing as a WebArchive saves into your database not only the HTML source code but also the images, so that they will be displayed even when one goes offline. But this can be a much larger file than the simple HTML source file.

Personally, I capture as HTML or WebArchive only rarely. Most of my captures of data from the Web are RTFD files of selected text and images. Why? Because very often extraneous material such as ads or other information would be captured if I brought down the entire page rather than the specific material that I want to add to my database.

So:

You can’t see images if captured as a Page, unless you are online. Those images will be ‘lost’ if the source page is subsequently removed from the Internet.
You can see images if captured as a WebArchive or RTFD rich note, even if you are offline. Those images will not be lost if the source page is removed from the Internet.
In practice, the smallest file (if images are involved) will be the Page source text. Next larger, the rich text RTFD file. Largest, the WebArchive.
In practice, capture of selected text/images as a rich text note is the most efficient way to capture the data that’s of interest. (I’m assuming that images are important, as they are in most of the scientific articles that I capture.)
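
The first two points come down to the fact that a Page capture holds only the HTML source, which refers to images by URL instead of containing them. As a rough illustration only, here is a minimal Python sketch that lists the image URLs a saved HTML file would still have to fetch from the network. It is not part of DEVONthink or its capture scripts; the sample HTML and the names ImageRefCollector and external_image_refs are made up for this example.

```python
from html.parser import HTMLParser


class ImageRefCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def external_image_refs(html_text):
    """Return the image URLs referenced (not embedded) by the HTML source."""
    collector = ImageRefCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources


# Hypothetical page source, standing in for a captured Page
sample = (
    "<html><body><p>Figure 1</p>"
    '<img src="http://example.com/fig1.png">'
    '<img alt="no source">'
    "</body></html>"
)
refs = external_image_refs(sample)
print(refs)
```

Every URL in that list is something the saved Page depends on the original site to supply, which is why those images disappear when you are offline or the source page is taken down. A WebArchive or RTFD note stores the image data alongside the text instead.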