For quite some time, users on these forums have expressed surprise about the behaviour of the current web archive implementation. Because of peculiarities of the WebKit-based implementation used by DEVONthink, users who save a web archive may not see the original content when they later open it, for several reasons:
- The original URL may now redirect elsewhere or may have become invalid
- The original content may have moved behind a paywall (a frequent issue with newspaper articles)
- The archived page may require a login on viewing
- There may be new content at the same URL (think, for example, about https://www.apple.com/mac-mini/) that is different from what was saved originally
I would like to suggest that DEVONthink might take advantage of the existing, well-tested and open WARC web archive format as used by the Internet Archive.
WARC is a well-documented, standardized (ISO 28500) file format for aggregating the components of web pages, generally accompanied by a CDX-formatted index file.
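To give an idea of how simple the container is, here is a minimal sketch in Python (standard library only) that writes and reads back a single WARC `response` record. The function names `build_warc_record` and `parse_warc_record` are my own, not part of any library, and the sketch ignores compression, multi-record files and the other record types, so please treat it as an illustration rather than a reference implementation.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_record(target_uri: str, http_payload: bytes) -> bytes:
    """Serialize one WARC 'response' record: version line, header fields,
    blank line, content block, then two CRLFs as the record terminator."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http;msgtype=response"),
        ("Content-Length", str(len(http_payload))),
    ]
    head = b"WARC/1.1" + CRLF
    for name, value in headers:
        head += f"{name}: {value}".encode("utf-8") + CRLF
    return head + CRLF + http_payload + CRLF + CRLF


def parse_warc_record(data: bytes):
    """Parse a single uncompressed record into (version, fields, block)."""
    # The WARC header ends at the first blank line; the block follows it.
    head, _, rest = data.partition(CRLF + CRLF)
    lines = head.split(CRLF)
    version = lines[0].decode("utf-8")
    fields = {}
    for line in lines[1:]:
        name, _, value = line.decode("utf-8").partition(":")
        fields[name.strip()] = value.strip()
    # Content-Length tells us exactly how many bytes belong to the block.
    length = int(fields["Content-Length"])
    return version, fields, rest[:length]


if __name__ == "__main__":
    # Hypothetical captured HTTP response (status line, headers, body).
    payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
    record = build_warc_record("https://example.com/", payload)
    version, fields, block = parse_warc_record(record)
    print(version, fields["WARC-Type"], len(block))
```

The point is that the captured HTTP response, headers included, is stored verbatim inside the record, which is what lets a viewer replay the page later without contacting the original server.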
While storing web pages as PDFs is a popular suggestion to work around the limitations of the current web archive implementation, this is often not a good fit for the medium: the text no longer reflows, and the ability to interact with the page as a web page is lost.
I would see the main advantages of adopting WARC in:
- data persistence (especially compared to the current situation); a WARC file remains stable and is intended to contain all resources to render a web artifact
- a more widely used file format; WebKit archives are specific to Apple platforms and, unlike WARC files, are not used by large archiving institutions
- less dependence on Apple’s future decisions regarding the archival file format (out of all the formats offered by DT, web archives are the least likely to remain stable for the future since their format is the only one controlled by just one entity)
- portability: WARC files can be viewed across platforms (see, e.g., the Webrecorder Player project)
One possible hindrance: The WARC software ecosystem seems to center around Python, Java, Go and similar languages, which may not be a very good match for inclusion in DEVONthink. Most critically, I have not found any well-established libraries written in C or similar. Can DT use Python components internally?
Another issue will be backwards compatibility with existing, WebKit-based web archives. This, however, would likely be a question of detecting file types based on magic strings (or MIME info, where available) in the backend and making the difference visible to the user in some way.
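As a rough sketch of what such detection could look like (again in Python, standard library only): an uncompressed WARC file starts with the bytes `WARC/`, a `.warc.gz` starts with the gzip magic number, and Safari-style web archives are, as far as I know, binary property lists with a top-level `WebMainResource` key. The function name `detect_archive_format` and the returned labels are my own invention, and I have no idea how DEVONthink identifies file types internally.

```python
import gzip
import io
import plistlib
import zlib


def detect_archive_format(data: bytes) -> str:
    """Return 'warc', 'warc.gz', 'webarchive' or 'unknown' for raw file bytes."""
    if data.startswith(b"WARC/"):  # version line of an uncompressed WARC record
        return "warc"
    if data.startswith(b"\x1f\x8b"):  # gzip magic number
        try:
            # Decompress only the first few bytes to look for the WARC version line.
            with gzip.GzipFile(fileobj=io.BytesIO(data)) as fh:
                if fh.read(5) == b"WARC/":
                    return "warc.gz"
        except (OSError, EOFError, zlib.error):
            pass
        return "unknown"
    if data.startswith(b"bplist00"):  # binary property list
        try:
            plist = plistlib.loads(data)
        except Exception:
            return "unknown"
        # Assumption: web archives carry a top-level 'WebMainResource' entry.
        if isinstance(plist, dict) and "WebMainResource" in plist:
            return "webarchive"
    return "unknown"


if __name__ == "__main__":
    print(detect_archive_format(b"WARC/1.1\r\nWARC-Type: warcinfo\r\n"))
    print(detect_archive_format(gzip.compress(b"WARC/1.0\r\n")))
    print(detect_archive_format(b"%PDF-1.7"))
```

A real implementation would read only the first few kilobytes from disk instead of loading the whole file, but the principle of checking leading bytes first and falling back to the file extension or MIME type stays the same.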
How do you all feel about this suggestion? I know that it’s always hard to make such a decision that will affect users for many years to come, but since this topic has remained open for many years, a change of formats might be helpful.