Feature request: Support for Web ARChive

filipeamoreira · May 29, 2019, 11:44am

The currently supported web archive format is prone to issues and is not an open format.

Are there any plans to support the Web ARChive format (https://en.wikipedia.org/wiki/Web_ARChive), which is already supported by many other tools?

This would be a great way to future-proof our web archives.

cgrunenberg · May 29, 2019, 11:56am

Thanks for the suggestion! There are no such plans yet. Which apps on the Mac support this format actually?

filipeamoreira · May 29, 2019, 12:24pm

I have personally used webrecorder-player (https://github.com/webrecorder/webrecorder-player) and grab-site (https://github.com/ArchiveTeam/grab-site) on the Mac. They are very useful for generating an offline archived copy of a single web page or website.

Native support for this open format would be an excellent addition to this otherwise great application.

filipeamoreira · May 29, 2019, 12:27pm

Just found this link that lists tools that support the Web ARChive format: https://www.archiveteam.org/index.php?title=The_WARC_Ecosystem

jstarek · March 3, 2020, 3:14pm

Sooo… I was just typing a looong post with some background research about this exact topic, and completely forgot to check whether the suggestion had come up in the forum already. I’ll just add my post below. Sorry for hijacking the thread, but I think it would just add clutter if I started another one.

jstarek · March 3, 2020, 3:16pm

For quite some time, users on these forums have expressed surprise about the behaviour of the current web archive implementation. Peculiarities of the WebKit-based implementation used by DEVONthink mean that users who save a web archive may not see the original content when they later go back to view it, for various reasons:

The URL may lead to a redirect on the original page or become invalid
The original content may have moved behind a paywall (a frequent issue with newspaper articles)
The archived page may require a login on viewing
There may be new content at the same URL (think, for example, about https://www.apple.com/mac-mini/) that is different from what was saved originally

I would like to suggest that DEVONthink might take advantage of the existing, well-tested and open-sourced WARC web archive format as used by the Internet Archive.

WARC is a well-documented, standardized file format for aggregating the components of web pages, generally accompanied by a CDX-formatted index file.
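To give a feel for how simple the container is, here is a minimal sketch of one WARC "response" record being written and read back, using only the Python standard library. The function names are hypothetical and this is not how any particular tool does it; real archives hold many such records (request, response, metadata and so on), usually gzip-compressed.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_record(target_uri: str, http_payload: bytes) -> bytes:
    """Serialize one 'response' record: version line, named headers, blank line, block."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_payload))),
    ]
    head = b"WARC/1.0" + CRLF + b"".join(
        f"{name}: {value}".encode("utf-8") + CRLF for name, value in headers
    )
    # The content block is followed by two CRLFs that terminate the record.
    return head + CRLF + http_payload + CRLF + CRLF


def parse_warc_record(raw: bytes):
    """Split a single record back into (headers dict, content block)."""
    head, _, rest = raw.partition(CRLF + CRLF)
    lines = head.split(CRLF)
    if not lines[0].startswith(b"WARC/"):
        raise ValueError("not a WARC record")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(b":")
        headers[name.decode("utf-8")] = value.strip().decode("utf-8")
    # Content-Length tells us exactly where the block ends.
    length = int(headers["Content-Length"])
    return headers, rest[:length]
```

The point is that the captured HTTP response (status line, headers and body) is stored verbatim inside the record, so a viewer can replay it without contacting the original server.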

While storing web pages as PDFs is a popular suggestion to work around the limitations in the current web archive implementation, this is often not a good fit for the medium, since text flexibility and the possibility of interacting with a web-like, text-based page are lost.

I would see the main advantages of adopting WARC in

data persistence (especially compared to the current situation); a WARC file remains stable and is intended to contain all resources to render a web artifact
a more widely used file format; WebKit archives are platform-specific and are not used in huge archives like WARC files are
less dependence on Apple’s future decisions regarding the archival file format (out of all the formats offered by DT, web archives are the least likely to remain stable for the future since their format is the only one controlled by just one entity)
portability: WARC files can be viewed across platforms (see, e.g., the Webrecorder Player project)

One possible hindrance: The WARC software ecosystem seems to center around Python, Java, Go and similar languages, which may not be a very good match for inclusion in DEVONthink. There are, most critically, no libraries written in C or similar. Can DT use Python components internally?

Another issue will be backwards compatibility with existing, WebKit-based web archives. This, however, would likely be a question of detecting file types based on magic strings (or MIME info, where available) in the backend and making the difference somehow visible to the user.
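A rough sketch of what such magic-string detection could look like, in Python. It assumes that .webarchive files are binary property lists (which begin with "bplist00"), that uncompressed WARC files begin with "WARC/", and that compressed ones are gzip streams. The function name and the returned labels are made up for illustration and have nothing to do with DEVONthink's internals.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def sniff_archive_type(head: bytes) -> str:
    """Guess the archive type from the first bytes of a file."""
    if head.startswith(GZIP_MAGIC):
        # WARC files are commonly stored as .warc.gz, so peek inside the gzip stream.
        try:
            inner = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16).decompress(head)
        except zlib.error:
            return "unknown"
        return "warc" if inner.startswith(b"WARC/") else "unknown"
    if head.startswith(b"WARC/"):
        return "warc"
    if head.startswith(b"bplist00"):
        # Any binary plist matches here; a real check would also look for
        # webarchive-specific keys after parsing.
        return "webarchive-candidate"
    return "unknown"
```

Passing in the first few kilobytes of a file should be enough, since only the leading bytes are inspected.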

How do you all feel about this suggestion? I know that it’s always hard to make such a decision that will affect users for many years to come, but since this topic has remained open for many years, a change of formats might be helpful.

willfoster11 · November 22, 2020, 5:48pm

This sounds really interesting and like a great idea!

I have to confess to still being a bit confused—when I use DT3 to clip a page as a “Web Archive” it seems like it does save a “.webarchive” file. But I have found that stuff behind a paywall will sometimes require a log-in.

Here are my questions:

It seems like DT3 doesn’t actually change the downloaded .webarchive file after it’s saved, since I see an “Update Captured Archive” tool when I open it (which I assume does change the original file). Is that correct?
Is there a way to force DT3 to show the original file? I’m able to show the original captured .webarchive in Finder and then open it in Safari—but not sure how to do that in DT3.

Thanks in advance!

BLUEFROG · November 22, 2020, 6:29pm

It seems like DT3 doesn’t actually change the downloaded .webarchive file after it’s saved, since I see an “Update Captured Archive” tool when I open it (which I assume does change the original file). Is that correct?

No, DEVONthink doesn’t “change the downloaded file”. There is no need to change it as the file is captured as-is.

Is there a way to force DT3 to show the original file? I’m able to show the original captured .webarchive in Finder and then open it in Safari—but not sure how to do that in DT3.

This forum has more than a few discussions about paywalled sites causing issues.
Try this…Add a bookmark to DEVONthink, log into the site, and make the capture directly in DEVONthink.

willfoster11 · November 22, 2020, 7:01pm

Thanks for the prompt and helpful reply, Jim!

I appreciate the suggestion to clip the bookmark, then log-in, and then capture/convert it to a .webarchive file.

And, yes—I did see many conversations about paywalls and the like. For me, it’s not so much an issue of a paywall, but just of capturing the page and then viewing the page as captured.

But I wasn’t able to find a conversation that answered my specific question: is there a way to have DT3 show saved .webarchive files in the same way that Safari does? I imagine that would mean, in my best layman’s terms, not automatically loading the current view of the page.

Appreciatively,

William

BLUEFROG · November 22, 2020, 7:08pm

You’re welcome.

What is the difference in what Safari is showing?

Amparose · January 19, 2021, 10:09pm

I would also like to support a better format than webarchive, which seems rife with problems. There is also a difference between saving a web archive from Safari and DT3 (file size and display).

What is the difference in what Safari is showing?

As an example I recently observed, this webpage saves from Safari at about 5.2 MB, and when sent to DT3 to save as an archive it is about 7.7 MB; the Safari-saved version also displays better than the DT3 version. Both are annoying with the cookie popup, and I too don’t understand why DT3 seems to load the webpage from the server instead of just loading the local webarchive file.

Something more robust would be great.

chrillek · January 20, 2021, 9:07am

Out of curiosity: what is a webarchive aiming for?

the current state of the website? Then one could use PDF or even markdown. Saving the original HTML can lead to a plethora of problems later on with external resources that are no longer available or have changed. Or does a webarchive capture all external resources recursively?
the state of the website when the link to it is opened again some time later? Then a bookmark to it would be sufficient.
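One way to check what a given .webarchive actually captured is to open it as a property list and list the stored resources. This is a minimal sketch, assuming the commonly observed WebKit layout with a "WebMainResource" entry and an optional "WebSubresources" list (those key names are an assumption here, not something confirmed in this thread). The function name is hypothetical.

```python
import plistlib


def list_webarchive_resources(data: bytes):
    """Return (url, mime_type, size_in_bytes) for each resource stored in the archive."""
    archive = plistlib.loads(data)  # handles both binary and XML plists
    resources = [archive["WebMainResource"]] + list(archive.get("WebSubresources", []))
    return [
        (
            res.get("WebResourceURL"),
            res.get("WebResourceMIMEType"),
            len(res.get("WebResourceData", b"")),
        )
        for res in resources
    ]
```

Usage would be along the lines of list_webarchive_resources(open("page.webarchive", "rb").read()). Anything the page needs that does not show up in that list would have to come from the network when the archive is displayed.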

Amparose · January 26, 2021, 8:30pm

I assume you are asking “why would you use web archive” rather than “what is the point of web archives” as the latter is a broader technical question. I much prefer Markdown or PDF where possible with webarchive as the fallback when they don’t capture what I need. For instance, elements with interactions, audio playback, or capturing a webpage that has a nice responsive design I’d like to remember.

Bookmarks are useful but web content changes and disappears all the time.

wally · August 10, 2021, 8:09pm

Sorry to hijack this thread, but I’d like to expand on this conversation with the following question:

If I save a webpage as a web archive using DT and I verify that the saved web archive displays correctly with no issues within DT then do I need to worry about future changes to the web page?

BLUEFROG · August 10, 2021, 10:21pm

do I need to worry about future changes to the web page?

This is entirely dependent on the site you’ve clipped from, especially if it uses dynamic content delivery.

wally · August 11, 2021, 4:43am

Forgive the noob question - so if a site is using dynamic content delivery, could that explain why it would “break”? I guess I’m failing to understand: if the webpage and associated assets are downloaded in the web archive, why does it matter what happens to the website in the future? Or is it that with dynamic content it’s “impossible” to download all the associated assets?

BLUEFROG · August 11, 2021, 5:06am

With dynamic content, this is not guaranteed. And while such delivery won’t necessarily fail, it certainly can cause problems with the clipped pages.

Web archives were made at a time when pages were far more statically built. There are millions of pages that would work but many newer sites don’t build such static pages.

chrillek · August 11, 2021, 8:57am

Simple example: suppose the site uses JavaScript to produce its pages (like Apple’s developer documentation does). The webarchive will contain the JavaScript code, but not the content that is generated by it. That is because the content depends on the code being executed, which only happens when the page is displayed by a browser (or the integrated WebView in DT).

So the JavaScript code remains the same, e.g. it does something like load item 12345 from server xyz. Now, item 12345 might contain “Hi there” today. But tomorrow, someone changes that to “Go away”, and then that’ll be what you see.
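The same scenario as a toy simulation in Python. Everything here is a hypothetical stand-in: the dictionary plays the remote server, the string plays the archived script, and render() plays the browser. It only illustrates the difference between archiving the code and archiving its output.

```python
# Hypothetical stand-in for server xyz: item id -> current content.
server = {"12345": "Hi there"}


def render(page_script: str, fetch) -> str:
    """Pretend browser: 'run' the page script, which just names the item to load."""
    item_id = page_script.split()[-1]
    return fetch(item_id)


# What the archive stores: the script itself, not what the script produced.
archived_script = "load item 12345"

# What a rendered snapshot (e.g. a PDF made on capture day) would have stored.
snapshot_on_capture_day = render(archived_script, server.__getitem__)

# Tomorrow, someone changes the item on the server.
server["12345"] = "Go away"

# Opening the archive re-runs the same script against the changed server.
replayed_later = render(archived_script, server.__getitem__)

print(snapshot_on_capture_day)
print(replayed_later)
```

The archived script is byte-for-byte unchanged between the two runs; only the data it fetches differs.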

I tried to summarize the Pros and Cons of the different HTML download options in a post here

wally · August 11, 2021, 5:48pm

This is an awesome explanation and thank you so much for spending the time to explain it. Now I understand and it makes sense. I’ve started saving in two formats - web archive and PDF - with the PDF version being my “backup” version. Normally there isn’t a difference in appearance between the two but sometimes the web archive is a bit better.