Webarchive of Instapaper now fetching original article

lockwood · March 19, 2022, 11:09am

I used to be able to webarchive articles I had saved to Instapaper, but now when I try it archives the original source article.

Not sure if the problem is a recent change to DEVONthink or Instapaper, but looking for a solution if anyone has ideas.

chrillek · March 19, 2022, 11:39am

I guess that DT just uses Apple’s Webarchive framework. If you’re so inclined, you could export the archive to your local disk, unpack it and open the HTML file in your favourite browser. Its developer console will tell you if/why content is loaded from the web.
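
As a rough sketch of the "unpack it" step: a `.webarchive` file is a property list, so Python's standard `plistlib` can read it. The key names used here (`WebMainResource`, `WebSubresources`, `WebResourceURL`, `WebResourceMIMEType`, `WebResourceTextEncodingName`, `WebResourceData`) are assumed from the layout Safari-style archives commonly have, and the function names are made up for illustration. Archives with nested frames may hold further entries that this sketch ignores.

```python
import plistlib


def list_webarchive_resources(raw: bytes) -> list[tuple[str, str, int]]:
    """Return (url, mime_type, byte_count) for each resource stored in the archive."""
    archive = plistlib.loads(raw)  # auto-detects binary or XML property lists
    # Assumed layout: one main resource plus an optional list of subresources.
    resources = [archive["WebMainResource"]] + list(archive.get("WebSubresources", []))
    return [
        (
            res.get("WebResourceURL", ""),
            res.get("WebResourceMIMEType", ""),
            len(res.get("WebResourceData", b"")),
        )
        for res in resources
    ]


def extract_main_html(raw: bytes) -> str:
    """Decode the main HTML document so it can be saved and opened in a browser."""
    main = plistlib.loads(raw)["WebMainResource"]
    encoding = main.get("WebResourceTextEncodingName") or "utf-8"
    return main["WebResourceData"].decode(encoding, errors="replace")
```

To use it, read the exported file as bytes, for example `Path("article.webarchive").read_bytes()` (the file name is a placeholder), and pass the result to either function. Anything the page requests that is not in the resource list has to come from the network.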

lockwood · March 19, 2022, 12:16pm

Yeah, this used to work (the last time I did it was maybe 6 months ago?), and as I understand it, DT was using the webarchive framework then too.

Thanks for the workaround suggestion. I’ve been able to work around this by clipping to Evernote, exporting as an .enex file, then importing into DT. But it was nicer when I could just archive the page directly from the browser extension.

chrillek · March 19, 2022, 12:24pm

Actually, I suggested a way to figure out what happens (and why), independently of DT. I suspect that the archive itself is built in a way that forces fetching of the original article. For example, it could contain JavaScript code that builds the page or part of it – that’s what Apple does with its documentation. That makes it impossible to get a real archive unless you convert it to a non-dynamic format like PDF or RTF.
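
One way to check that suspicion without a browser is to scan the archived HTML for scripts and absolute URLs. The following is only a minimal sketch using Python's standard `html.parser`; the class and function names are hypothetical, and the attribute list is an assumption about what typically triggers a fetch on page load. A reported URL is a candidate only: if the same URL is stored as a resource inside the archive, it is served from there.

```python
from html.parser import HTMLParser


class DynamicContentScanner(HTMLParser):
    """Count inline scripts and collect absolute URLs the page may load."""

    # Assumption: these attributes usually make a browser fetch something.
    URL_ATTRS = ("src", "href", "data", "poster")

    def __init__(self) -> None:
        super().__init__()
        self.inline_scripts = 0
        self.remote_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "script" and "src" not in attr_map:
            self.inline_scripts += 1
        if tag == "a":
            return  # plain links are followed on click, not on page load
        for name in self.URL_ATTRS:
            value = attr_map.get(name) or ""
            if value.startswith(("http://", "https://", "//")):
                self.remote_urls.append(value)


def scan_html(html: str) -> DynamicContentScanner:
    """Parse the HTML once and return the scanner holding the findings."""
    scanner = DynamicContentScanner()
    scanner.feed(html)
    scanner.close()
    return scanner
```

Inline scripts cannot be judged this way, since they may build any URL at runtime; for those, the browser's developer console remains the more reliable tool.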

cgrunenberg · March 19, 2022, 12:26pm

The latest releases actually preprocess web archive clippings now too (as they already did for the other formats) to make the results more predictable & less dynamic. Ideally, clipping is initiated either by selecting a running browser in the Sorter or by using a browser extension.

chrillek · March 19, 2022, 12:30pm

If I understood the OP correctly, they’re using a 6-month-old archive that now suddenly (?) starts to load something from Instapaper.

lockwood · March 19, 2022, 12:47pm

Ah got it @chrillek, thanks for the clarification, that’s something I can try to root cause the issue.

Sorry for the confusion, I meant I was able to archive Instapaper articles around 6 months ago using the browser extension, but now it archives the original source.

For example, I have webarchives right now of Instapaper articles and my highlights in my database. However, when I recently tried to do the same it just archived the original page.

lockwood · March 19, 2022, 1:01pm

Aha, so I assumed DT used my current browser session login to archive, but it looks like it uses the application on my computer.

I tried browsing to the Instapaper article in the app, but Instapaper redirects to the source article if you’re not logged in (which I guess is why it was archiving the source article). So I tried logging in to Instapaper through the browser in the app and then clipping using the browser extension, and it works.

I’m fairly certain I was not logged into Instapaper through DT previously, so perhaps it was the preprocessing in the latest versions that @cgrunenberg mentioned?

cgrunenberg · March 19, 2022, 1:04pm

The preprocessing on its own doesn’t change the URL or what’s loaded. Are these archives old ones or new ones clipped using version 3.8.3? And how exactly do you clip them?

lockwood · March 19, 2022, 1:11pm

Not sure exactly what you mean by “Are these archives old ones or new ones clipped using version 3.8.3”. I think there’s still some confusion: it sounds like you think these are old archives that I’m opening within the app and that redirect to the source article. If so, that is not the case. I was trying to archive new pages, and the result was an archive of the source article rather than the Instapaper article.

I clip them by clicking the browser extension and selecting the format “Web Archive.”

cgrunenberg · March 19, 2022, 2:00pm

And what’s the URL in the browser right before clipping and what’s the URL of the clipped archive?

lockwood · March 19, 2022, 2:35pm

The URL in the browser was something like https://www.instapaper.com/read/1491748555, and the URL of the clipped archive was the same. However, the webpage that I saw was the original source version, not the Instapaper formatted version.

Hmm, I just tried logging out in the desktop app browser and tried this again, but it works now. I also just updated to 3.8.3, so maybe this was a problem that was fixed in the latest release.

tja · March 21, 2022, 4:50pm

I constantly run into this problem: webarchives reaching out to the internet when they shouldn’t.

I often try to fix this by first entering flight mode …

But webarchives should only refer to downloaded content in general - that’s their purpose, after all.

BLUEFROG · March 21, 2022, 9:42pm

tja: “But webarchives should only refer to downloaded content in general - that’s their purpose, after all.”

Webarchives had their day in the sun. Given how web technology has evolved, they are not as broadly useful as they were. However, they are still very useful for sites that aren’t “news”, paywalled, or click-bait sites.

cgrunenberg · March 22, 2022, 7:36am

Web archive clippings have been preprocessed since version 3.8.3 too (just like the other available formats) to make the result less dynamic & more reliable.

chrillek · March 22, 2022, 8:24am

That’s not possible in all situations. A very simple example is Apple’s developer documentation: it is delivered on the fly using JavaScript and some long-forgotten server technology. The archive will contain the JavaScript, and the browser will run it when opening the archive.

tja · March 22, 2022, 4:48pm

I don’t really understand this sentence.
What does “preprocessed” mean, and how is it different from non-preprocessed?

tja · March 22, 2022, 4:51pm

A webarchive should ideally contain the HTML that is used by the browser - just what the browser is seeing and using.

Even if originally there is only a “javascript.print_webpage()”, the webarchive should contain the output of this function, as this is what the browser is displaying.

At least in my naive interpretation.

cgrunenberg · March 22, 2022, 4:51pm

The HTML code is changed to make the result more predictable, e.g. scripts are stripped if possible.
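
To illustrate what stripping scripts can look like in general, here is a minimal Python sketch built on the standard `html.parser`. It is not DEVONthink's actual preprocessing, which is not described in this thread; the class and function names are hypothetical, and dropping `on*` event-handler attributes is an extra assumption about what "less dynamic" might include. Processing instructions and other rare constructs are not re-emitted.

```python
from html import escape
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML while dropping <script> elements and on* attributes."""

    def __init__(self) -> None:
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out: list[str] = []
        self.in_script = False

    def _emit_tag(self, tag, attrs, closing=""):
        # Drop inline event handlers such as onclick or onload.
        kept = [(k, v) for k, v in attrs if not k.startswith("on")]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )
        self.out.append(f"<{tag}{rendered}{closing}>")

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        if tag != "script" and not self.in_script:
            self._emit_tag(tag, attrs, closing=" /")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:  # script bodies arrive here and are skipped
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_scripts(html: str) -> str:
    """Return the HTML with script elements and on* handlers removed."""
    stripper = ScriptStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.out)
```

The "if possible" matters: a page whose content is generated entirely by JavaScript would come out of such a pass nearly empty, which matches the Apple documentation example mentioned earlier in the thread.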

tja · March 22, 2022, 4:52pm

Ah, OK.
Thanks!