Convert issues webarchive

thother · March 24, 2023, 6:23pm

Running Convert to generate a webarchive from a bookmark often takes more than one attempt; the first try just creates a duplicate bookmark. It also seems to take longer than it used to / should?

cgrunenberg · March 25, 2023, 6:42am

An example URL would be useful. Dynamic web pages in particular might require more time, as the clipping tries to simulate scrolling by the user to load (hopefully) all contents.

thother · March 25, 2023, 6:23pm

Having trouble reproducing now. I’ll post a link when I see the issue again, thanks.

thother · April 2, 2023, 3:31pm

Okay, here’s one. I really suspect the issue may be with my setup rather than with the site, but this:

First try, long pause and it just generates a duplicate bookmark, no webarchive. Second try converting the same bookmark works. It’s weird.

BLUEFROG · April 2, 2023, 4:57pm

Ugh! Could they put any more crappy JavaScript and click-bait on this page?!?

I have a webarchive that wasn’t captured quickly and it doesn’t even display correctly (and yes, I have JavaScript enabled in DT).

thother · April 2, 2023, 6:58pm

Yeah, I’m sure part of the problem is just the general awfulness of the internet in 2023, contra what I said before. But the thing where it fails the first time and then not on the second attempt is weird. And subjectively it seems notably slower but who knows what that’s about or whether it’s real.

chrillek · April 2, 2023, 7:51pm

Well, “the internet” is not generally awful. In fact, it is a lot better than it was 10 years ago, when every browser did whatever it wanted and Microsoft invented new elements and attributes every day. But these Wired guys (and gals, I guess) do a really terrible job. Disgusting – having a nonsensical video play without me asking it to, peppering the page with JavaScript, which is constantly XHR’ing cloudfront for some stuff they should’ve downloaded at the beginning – I wouldn’t touch that thing with a ten-foot pole, much less save it.

What you could do is retrieve the innerText of all p elements with a class of paywall (how inventive a name). Something along these lines:

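(() => {
  const app = Application('Safari');
  // Script to run inside the page; "\\n" keeps the backslash so the page sees '\n', not a raw line break.
  const JScode = `[...document.querySelectorAll("p.paywall")].map(p => p.innerText).join('\\n')`;
  // Run it in the first tab of the frontmost Safari window.
  const pureText = app.doJavaScript(JScode, {in: app.windows[0].tabs[0]});
  /* create a text record, set its plainText property to pureText */
})()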

This code is not really tested – the JavaScript does what it’s supposed to do, namely return a string with all the paywall paragraphs. It requires the website to be loaded in the first tab of the topmost window of Safari (or simply the only tab of the only window).
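One detail worth checking in a snippet like this: the script is passed to Safari as a string, and a `\n` written inside a template literal turns into a real line break before Safari ever sees it, which leaves a broken string literal in the injected code. Doubling the backslash avoids that. A minimal sketch of building the string safely follows; the helper name `buildExtractionScript` is made up for illustration and is not part of Safari's or DEVONthink's scripting interface.

```javascript
// Hypothetical helper (not part of any scripting API): builds the script
// string that would be handed to Safari for evaluation in the page.
function buildExtractionScript(selector) {
  // JSON.stringify quotes and escapes the selector so it is a valid JS string literal.
  const quotedSelector = JSON.stringify(selector);
  // "\\n" puts a backslash followed by "n" into the string, so the page-side
  // code reads .join("\n") rather than containing a raw line break.
  return `[...document.querySelectorAll(${quotedSelector})].map(p => p.innerText).join("\\n")`;
}

const script = buildExtractionScript("p.paywall");
console.log(script);
```

The resulting string can be passed unchanged as the first argument of the Safari call in the snippet above.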