Apple deprecates WebArchives - what does this mean for DEVONthink?

jstarek · July 1, 2020, 5:39pm

Hi all,

I came across a small “deprecated” icon in Apple’s developer documentation for the WebArchive class (see https://developer.apple.com/documentation/webkit/webarchive). I am worried that this may negatively impact large WebArchive-based collections.

While WebArchives were never perfect (as was mentioned on the old and new forums here by DEVONtechnologies members as well as many users), they do have some nice features that would be hard to replicate by, e.g., converting the existing collections to PDF.

Best regards,

Jürgen

BLUEFROG · July 1, 2020, 5:43pm

Webarchives are clearly still supported in macOS 10.15 Catalina.
They are also web content, so it’s not likely they’ll just “stop working”.

However, perhaps in the future they will not be able to be captured, which may not be a bad thing considering they are far less useful now than in the past.

jstarek · July 1, 2020, 5:47pm

Sounds good, thanks for that assessment.

Just out of curiosity, why would you consider them “far less useful now”? What changed?

BLUEFROG · July 1, 2020, 5:52pm

They were intended as offline archives, but due to dynamic content delivery and the prevalence of JavaScript for all manner of purposes, webarchives are often missing content when viewed offline. Or they make online connections, which is bothersome to many.

Granted there are still sites that work fine (as the Internet is a very big place), but if you’re looking at news or many popular sites, the webarchives are really only functional when online.

apoc527 · July 2, 2020, 1:23am

If I really want an article these days, I find I get some of the best results using Safari’s Reader mode and then printing to DT3. I think that selecting clutter-free paginated PDF in Clip to DEVONthink is basically the same thing, but I’m not 100% sure. Anyone know for sure?

BLUEFROG · July 2, 2020, 5:47am

Not technologically the same, but perhaps similar results.

cgrunenberg · July 2, 2020, 8:11am

As long as Safari can save & load web archives, I wouldn’t worry too much about this now. But I definitely wouldn’t recommend web archives for long-term archiving as the format is limited to Apple’s platforms (and web archives created on newer OS versions quite often don’t work anymore on older OS versions).
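For anyone worried about being locked into Apple's platforms: a .webarchive file is, as far as can be observed, a binary property list, so its contents can be read elsewhere. The following is a minimal sketch in Python, not an official or complete parser. The key names (`WebMainResource`, `WebSubresources`, `WebResourceData`, and so on) are taken from files produced by Safari rather than from a published specification, so treat them as assumptions. The function names `read_webarchive` and `export_main_html` are made up for this illustration.

```python
import plistlib
from pathlib import Path


def read_webarchive(data: bytes) -> dict:
    """Parse the raw bytes of a .webarchive file and summarize its contents.

    Assumes the commonly observed layout: a property list with a
    "WebMainResource" dictionary and an optional "WebSubresources" list.
    """
    archive = plistlib.loads(data)  # auto-detects binary vs. XML plists
    main = archive["WebMainResource"]

    # The declared encoding may be missing or unknown to Python; fall back to UTF-8.
    encoding = main.get("WebResourceTextEncodingName") or "utf-8"
    try:
        html = main["WebResourceData"].decode(encoding, errors="replace")
    except LookupError:
        html = main["WebResourceData"].decode("utf-8", errors="replace")

    # Collect (URL, MIME type) for each embedded image, stylesheet, script, etc.
    subresources = [
        (res.get("WebResourceURL", ""), res.get("WebResourceMIMEType", ""))
        for res in archive.get("WebSubresources", [])
    ]
    return {
        "url": main.get("WebResourceURL", ""),
        "html": html,
        "subresources": subresources,
    }


def export_main_html(archive_path: str, out_path: str) -> None:
    """Write the main HTML document of a webarchive to a plain .html file."""
    info = read_webarchive(Path(archive_path).read_bytes())
    Path(out_path).write_text(info["html"], encoding="utf-8")
```

Calling `export_main_html("page.webarchive", "page.html")` (both file names are placeholders) would write only the main HTML. Embedded images and stylesheets stay in the archive unless they are extracted separately, so the result is a rescue copy rather than a faithful rendering.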

BLUEFROG · July 2, 2020, 1:55pm

Interestingly, I just heard Yojimbo has deprecated using webarchives.

apoc527 · July 2, 2020, 6:43pm

Well, a simple Smart Rule finds and converts all my Webarchives to paginated PDFs. Gotta love DT3!

BLUEFROG · July 2, 2020, 7:22pm

And there’s more where that came from!

KillerWhale · July 2, 2020, 9:02pm

Mayyy I plug SingleFile again here, @bluefrog …?

BLUEFROG · July 2, 2020, 10:59pm

You’re free to plug it. However, that doesn’t mean your wishes will come true, but it’s an open forum.

PointlessOne · July 3, 2020, 3:27pm

WebArchives are important. More so for the more obscure corners of the internet. Small sites tend to be simpler (better preserved in a webarchive) and less available (in need of preservation).

Is there a third-party implementation of WebArchive DT could use? Or maybe switch to an alternative format that can achieve the same (MAFF, WARC, MHTML, even HAR)?

Pedipalpe · July 4, 2020, 6:33am

Same here. 32 Tags appeared without warning. Don’t know the reason why. btw: I never really understood what tags are good for, never used them. All my documents are searchable (OCR applied). Don’t know what may happen when I delete the tags.

rmschne · July 4, 2020, 7:21am

Don’t know what may happen when I delete the tags

Search the DEVONthink manual for “delete tags” or go directly to page 78 for guidance.

thersites · July 7, 2020, 10:06pm

@cgrunenberg You definitely highlight some of the problems. Web archives just aren’t practical as an offline solution. What is the best solution for saving a web page entirely offline, in such a way that it is completely captured, even from behind a paywall, while using installed ad blockers? PDF is too rigid, especially since being able to edit the captured page is key (to remove useless footer stuff, for example). The DT web clipper is mediocre. I think the folks at DT should license SingleFile’s technology for use in DT.

BLUEFROG · July 7, 2020, 10:47pm

Web archives just aren’t practical as an offline solution.

Webarchives are still practical for many sites, just not all of them.

PDF is too rigid, especially since being able to edit the captured page is key (to remove useless footer stuff, for example)

PDFs are still the best option for static representations of a page. If they’re captured as Paginated PDFs, you may be able to excise pages at the end of the document.

thersites · July 8, 2020, 8:58pm

I guess, but then you get all kinds of crap that can’t be removed. And the pagination is static, too. What if you want to convert from Letter to A4 at some later point?

Have you tried the SingleFile extension? You can use it to remove practically any elements you desire, then save the page as html, which is static but also infinitely editable and which can be converted to practically any other file format at a later date.

BLUEFROG · July 8, 2020, 10:16pm

The format has been discussed but nothing decided either way.

RCK · July 9, 2020, 2:41am

Can you share your rule for us non-programmer types? Thanks!