WebArchive Question

rdr · February 13, 2019, 4:50pm

Could you please clarify something for me? When I choose to save a Web Archive what exactly am I saving? Is it a static view of that web page (that won’t change if the actual web page does) or is it a marker (essentially a bookmark) to the actual web page?

I presume, since I have the option to save a “bookmark” to DEVONthink, that a web archive is, in fact, a static rendering of that page, but then when I look at it in DEVONthink or DEVONthink To Go, the archive appears to be loading the actual web page.

My goal is to have access to a page whether it changes (or is even online) in the future. Appreciate the clarification!

– Robert

BLUEFROG · February 13, 2019, 5:07pm

Once upon a time, that was true, and it still is for many sites. However, many “popular”, paywalled, and clickbait-filled sites have their content delivered from external sources. The webarchive will contain JavaScript and other code that downloads and displays the content when you view the page, i.e., it’s live, not a static capture.

If you want a static capture, a PDF is generally considered the best option. Note that dynamic delivery can also sometimes produce an incorrect PDF.
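
One rough way to check whether a particular saved archive is likely to behave as "live" is to look for script tags in its main HTML. The sketch below is only an illustration, not DEVONthink's own mechanism. It assumes the file follows the usual WebKit layout (a property list with a `WebMainResource` entry whose `WebResourceData` holds the page HTML), and the function name `count_script_tags` is made up for this example.

```python
import plistlib
import re

# Matches opening <script> tags only (closing tags start with "</").
SCRIPT_TAG = re.compile(rb"<script\b", re.IGNORECASE)


def count_script_tags(webarchive_bytes: bytes) -> int:
    """Count opening <script> tags in the main HTML of a .webarchive.

    Assumes the WebKit layout: a property list whose "WebMainResource"
    dictionary stores the page HTML as bytes under "WebResourceData".
    """
    archive = plistlib.loads(webarchive_bytes)  # auto-detects binary/XML plist
    main_html = archive["WebMainResource"]["WebResourceData"]
    return len(SCRIPT_TAG.findall(main_html))
```

A count of zero suggests the page is plain markup and should display the same offline. A non-zero count does not prove the content is fetched at view time, only that the archive carries code that could do so. To try it on a real file, pass in the bytes of the archive, e.g. `Path("page.webarchive").read_bytes()` (the file name is a placeholder).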

rdr · February 13, 2019, 5:11pm

Thanks, Jim. Your support (and speed) is exemplary. Much appreciated.

– Robert

BLUEFROG · February 13, 2019, 6:01pm

You’re welcome!

SaveEverything · January 6, 2020, 6:38am

Thanks for the details @BLUEFROG…
I’m still a bit confused… how does the algo decide which to keep static and which to keep dynamic?

Paywalled sites obviously won’t be static, but isn’t the idea to have a snapshot representation (whatever that looks like) at the time of capture?

I have disabled JS to prevent any code executing… I understand some saved clippings won’t load properly (say, a fancy agency website with animations, etc.)…

I’d prefer not to use PDF, as I’d like to keep it fully searchable and potentially open for conversion later if exporting out to another service/app in the future…

BLUEFROG · January 6, 2020, 1:50pm

how does the algo decide which to keep static and which to keep dynamic?
… isn’t the idea to have a snapshot representation (whatever that looks like) at the time of capture?

What content is (or isn’t) captured is not up to us, but up to what the mechanism imports from the site. How the site is built and how it delivers its content determine what’s static and what’s dynamic.

I’d prefer not to use PDF, as I’d like to keep it fully searchable

I’m not sure why you’d think PDF isn’t fully searchable in a web capture. It certainly would be, as much as the other formats would be.

SaveEverything · January 12, 2020, 10:51pm

Ok thank you for the thoughts.

I guess my question was simply what the OP asked: if my internet connection is off, will I have access to what I’ve saved? I understand your reply now…

About the PDF, yup I get that the text is searchable…
I was thinking more of long-term flexibility - a PDF feels like a flattened file that can’t really be edited outside of Adobe or some other image editing software… while a .webarchive, or another format that retains HTML + CSS, could potentially be processed later programmatically if needed.

Though I’m not sure how portable a .webarchive is…
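
On the "processed later programmatically" point, here is a minimal sketch of what that could look like, assuming the archive uses the common WebKit property-list layout (`WebMainResource`, `WebSubresources`, `WebResourceData`, `WebResourceURL`, `WebResourceTextEncodingName`). The helper names `extract_main_html` and `list_subresource_urls` are invented for this example. It has not been checked against every archive DEVONthink produces.

```python
import plistlib


def extract_main_html(webarchive_bytes: bytes) -> str:
    """Return the main page HTML stored in a .webarchive as text."""
    archive = plistlib.loads(webarchive_bytes)
    main = archive["WebMainResource"]
    # Assumption: the encoding name, when present, is one Python recognizes.
    encoding = main.get("WebResourceTextEncodingName") or "utf-8"
    try:
        return main["WebResourceData"].decode(encoding, errors="replace")
    except LookupError:
        # Unknown encoding label: fall back to UTF-8.
        return main["WebResourceData"].decode("utf-8", errors="replace")


def list_subresource_urls(webarchive_bytes: bytes) -> list:
    """Return the URLs of the images, stylesheets, etc. bundled in the archive."""
    archive = plistlib.loads(webarchive_bytes)
    return [res.get("WebResourceURL", "") for res in archive.get("WebSubresources", [])]
```

Because the HTML and its bundled resources can be pulled out with a standard-library plist reader, the format is at least readable outside Apple's own apps, although other browsers generally will not open a .webarchive directly.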

funkydan2 · January 13, 2020, 9:06am

Since I’m only interested in the content of a website, I always capture as Markdown. When it works (which is almost always), it’s a ‘static snapshot’ of the content of the page; it’s searchable, and the layout can be manipulated for printing (or even edited).

I suppose I go with this method as it’s most like the ‘clutter-free’ clipping I used to do in Evernote.

SaveEverything · January 13, 2020, 6:55pm

Thanks @funkydan2

My needs are a bit different -
I clip the entire page, even with all the ads/junk, because it’s the visual design of the page that I remember. When I’m scanning hundreds of search results from an archive that has many thousands, color and layout help me spot what I need quickly.

Scanning hundreds of text files means you gotta read everything.