Web Archive: small change increases file size

hi,

i tried searching for this problem on the forum but didn’t come up with anything.
why is it that a small change in a web archive (e.g. removing a single character) increases the file size of the archive? a bug maybe?

thanks

Is there an example URL you could post? Over here, adding or subtracting a character from a webarchive increases/decreases the file size by a byte – as I would expect. Perhaps when the archive is edited and saved it is forcing a reload of a linked ad or media element?

Any archive actually.
can you give me a URL that works for you?

reloading… maybe, though i am not sure what would get reloaded

http://www.nytimes.com/2011/12/29/world/asia/kim-jong-il-funeral-north-korea.html?hp

Adding or subtracting a character in the headline of this article changes file size by one byte. (I am going by the Info panel in Finder - not by what DTPO reports - although what DTPO reports doesn’t change either.)

does not work for me.

after import:
staff 1789249 28 Dec 16:29 Kim Jong-il Funeral Held in North Korean - NYTimes.com.webarchive

after change:
staff 1920249 28 Dec 16:31 Kim Jong-il Funeral Held in North Korean - NYTimes.com.webarchive

DT reports 1.8MB

Hmm, over here I get 1,613,026 bytes +/- one byte for each character added/subtracted.

My archive is probably smaller because I’m using the Ad Subtract style sheet ad blocker and so I might have fewer elements stuffed into the archive at the Safari level. One of those blocked elements might be the thing that’s getting refreshed – NYTimes sometimes puts up enormous banner ads. Just a theory. If I get a chance I’ll unwind Ad Subtract and see what happens – running a before/after instance of the webarchive file through FileMerge would point out the differences.
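
As an alternative to FileMerge, the before/after comparison could be sketched in a few lines of Python. This assumes the .webarchive files are property lists in Safari's usual layout (a `WebMainResource` entry plus a `WebSubresources` list, each resource carrying `WebResourceURL` and `WebResourceData`). The function names are made up for illustration, and nested subframe archives are ignored, so treat it as a rough check rather than a full comparison.

```python
import plistlib


def load_webarchive(path):
    """Parse a .webarchive file (a property list) into a dict."""
    with open(path, "rb") as fh:
        return plistlib.load(fh)


def resource_sizes(archive):
    """Map each resource URL in a parsed webarchive to its payload size in bytes.

    Assumes the usual Safari keys; subframe archives are not walked.
    """
    resources = [archive.get("WebMainResource", {})]
    resources += archive.get("WebSubresources", [])
    sizes = {}
    for res in resources:
        url = res.get("WebResourceURL", "<unknown>")
        sizes[url] = sizes.get(url, 0) + len(res.get("WebResourceData", b""))
    return sizes


def diff_sizes(before, after):
    """Return {url: (old_size, new_size)} for every resource whose size differs.

    A size of 0 on one side means the resource exists only on the other side.
    """
    changed = {}
    for url in sorted(set(before) | set(after)):
        old, new = before.get(url, 0), after.get(url, 0)
        if old != new:
            changed[url] = (old, new)
    return changed
```

Calling `diff_sizes(resource_sizes(load_webarchive("before.webarchive")), resource_sizes(load_webarchive("after.webarchive")))` (file names hypothetical) would list which embedded resources grew, shrank, appeared, or disappeared between the two saves. If only the main page changed by one byte, the extra size is coming from somewhere other than a reloaded ad or image.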

bosie, if you capture the entire page as WebArchive, you will be attempting to capture dynamic images (I saw at least two) as well as a lot of text and images extraneous to the content of the news article. Dynamic images may differ in size each time they are updated and the document is saved (which happens automatically). I suspect that’s what you are seeing.

I first clicked the option to display the entire article in one page. Then I selected the portion of the column that includes the article of interest and pressed Command-% to capture the selected material to DEVONthink. After capture, I did some minor edits to eliminate a few extraneous elements that remained in the captured document (which further reduced the file size of the capture).

Now I have only the complete article, including pictures. The file size of the WebArchive document in my database is 130.7 KB and the document contains 1440 words.

By comparison, a capture of the complete Web page has a size (displayed in DEVONthink) of 1.7 MB and contains 2525 words. Because of the dynamic images on the page, the document goes online each time it is opened.

I saved a significant percentage of the document size by capturing only the desired information rather than the entire Web page. More importantly, I eliminated more than a thousand words that were not relevant to the information content of the article, and would have diluted the focus of searches and of the Classify and See Also assistants.

bill, i agree with you on everything you said. i am following your advice now and import almost everything via devonagent.

that being said, i think you guys are (partially at least) wrong as to why the file size increases. dtp seems to wrap the content with some proprietary html tags. i imported only the content, which is 14kb. changing a single character changes it to 2kb. according to the source view, dtp adds the markup shown at pastebin.com/sDWuCigD to the top and also some stuff to the bottom.
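
One way to check the wrapper-markup theory without relying on the source view would be to diff the main HTML of the archive before and after the edit. The sketch below assumes the standard Safari webarchive keys (`WebMainResource`, `WebResourceData`, `WebResourceTextEncodingName`) and falls back to UTF-8 when no encoding is recorded. The helper names are invented for this example, and nothing here is specific to DEVONthink.

```python
import difflib
import plistlib


def load_webarchive(path):
    """Parse a .webarchive file (a property list) into a dict."""
    with open(path, "rb") as fh:
        return plistlib.load(fh)


def main_html(archive):
    """Decode the main resource of a parsed webarchive to text."""
    main = archive["WebMainResource"]
    # Assumption: fall back to UTF-8 if the archive records no encoding.
    encoding = main.get("WebResourceTextEncodingName") or "utf-8"
    return main["WebResourceData"].decode(encoding, errors="replace")


def html_diff(before, after):
    """Return unified-diff lines between the main HTML of two archives."""
    return list(
        difflib.unified_diff(
            main_html(before).splitlines(),
            main_html(after).splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
            n=0,  # no context lines, only the changes themselves
        )
    )
```

Printing `"\n".join(html_diff(load_webarchive("before.webarchive"), load_webarchive("after.webarchive")))` (file names hypothetical) would show any tags added at the top or bottom of the page as `+` lines, next to the one-character edit itself.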

anyways, thanks guys.