Web archive option is powerful!

(NOTE that my youthful enthusiasm was premature. But I’m OK now.)

Here’s another little wow! from a repeatedly impressed DTP user.

As someone who has a taste, but not the budget, for very expensive books (usually limited press runs from an academic publisher), I sometimes have to make do with what I can piece together from publishers’ previews, Google Books excerpts and, occasionally, free Kindle chapters.

So, today I was very happy to see that the “web archive” option for capturing a Google Books webpage pulled in the entire free excerpt, which was nearly 40 pages (with a few gaps, but still …).

I had been prepared to do 40 or so screenshots, which is really tedious.

Thanks, DTP devs! My license fee has paid for itself many times over.

Glad it seems to have worked.
I would recommend disabling your Internet connection and then selecting the document to see whether it’s intact or being dynamically loaded.

I guess “seems to have” is the operative phrase in your reply!

I had assumed, since the reported size of the “web archive” in DTP was over 3.6 MB, that DTP had captured the entire excerpt from Google Books. After all, the size of the “Web Internet Location”, which I also added to DTP, was just a bit over 300 bytes.

But when I disabled the Internet, sure enough, only blank pages came up where the book’s pages were supposed to be in the web archive.

So I’ve reverted to using a Firefox extension called FireShot that pulls in the entire Google Books excerpt.

Thanks for the warning. Maybe someone else will learn from my naivete.

When web archives were first made, the Internet was a very different place. Nowadays, many popular sites have dynamically served content that’s not present on the actual page. So remote content and tons of JavaScript give the appearance of locality but it’s smoke and mirrors.

Glad to be a little more savvy. And that Firefox extension (which I’m guessing is above-board?) proves once again that, where there’s a will, there’s a work-around.

The extension just appears to be doing screen captures, so that should be fine.

Out of curiosity, for the Web Archive option, what is DTP downloading that makes up those 3+ megabytes as opposed to the 300 bytes of a web location (which is just a url, I’m guessing)?

It’s a bunch of base64-encoded data.

For webarchives, you can view the source code (View > Document Display > Source). But if you’ve never looked at HTML, it might not make a lot of sense.

Or in a text editor like BBEdit :slight_smile:

True, this was more the built-in quick & easy way :wink: I do open in an external editor if I wanna do much of anything (I personally use CotEditor or Sublime).

While DEVONthink allows some WYSIWYG editing of webarchives, my experience is that it’s almost always necessary to edit the source code if I really wanna remove some of the junk that takes a lot of space. Even if I remove something in WYSIWYG, there’s often a bunch of leftover code if I look behind the curtain. Though I don’t use webarchives if I can avoid it. (They’re especially nice for simpler pages, and/or where I might wanna update the archive. And some pages where dynamic/rich media is an important part of the presentation).

Agreed and I’m a CotEditor fan myself. :slight_smile:

I think I was introduced to it on this forum. It might even have been you :wink:

I would like to see an option for webarchives within DT NOT to query the Internet, but to show and use only what has been downloaded as part of the archive!

Just turn off your IP connection before you open the archive.

In any case, your sentence is a contradictio in adjecto. An archive contains what is in the document at this very moment. Imagine you have a simple JS script in the HTML, something like

  document.addEventListener('DOMContentLoaded', async function () {
    // fetch() returns a Promise; the response must be awaited and its
    // body read as text before it can replace the page content.
    const response = await fetch('https://example.org/therestofthe.html');
    document.body.innerHTML = await response.text();
  });

So, you want this code in the archive. But you do not want it to be executed when the archive is opened (at which point the DOM content is loaded, so the listener would be triggered).

Even much simpler: What about images? They are, mostly, shown by querying the internet. Don’t you want the images in your archive?

How would you even expect DT to prevent the download in the case of an event listener? By removing the listener when it creates the web archive (so that the archive wouldn’t be an archive)? By somehow (how?) stopping WebKit from executing the listener when it opens the document, magically deciding that this code should not be executed?

And that is only a very simplistic example, where HTML is directly loaded inside a listener. Imagine you had a script loaded inside a listener, which in turn loads content and replaces the current one…

Just not feasible. The web is highly dynamic today. And we have been over that several times already:

  • If you want a snapshot, use a format like PDF that does not allow for dynamic content loading (and no, MD does not take a full snapshot either, as it stores links to images, not their data).
  • If you want the current state, use a bookmark

Everything else will give you something in between: a web archive immutably captures only part of the current content. As do, of course, HTML and formatted notes.

You can’t have your cake and eat it, too.

I clearly asked for such an option in DT and DTTG as a reply to the suggestion to disable the Internet on the device to reach the same goal.
It would be VERY strange to have to disable the Internet, losing all SSH sessions and DT/DTTG syncing, just to read a webarchive, don’t you think?

And I am totally aware that webarchives may not contain everything - but for what exactly can you use a webarchive that does not contain everything, once the original page is gone?

I only use webarchives to preserve websites for when they are gone!
And I would like to know how good or bad such an archive is - and it is very uncomfortable to have to disable the Internet just to check the quality of an archive.

So again, a toggle in the settings would be perfect - for those people who don’t want or don’t care about any dynamic content that needs the original website to still be active and working!
And no, a PDF is not what I want for this - as I want the website with all its files etc.

And I suppose that is what MOST users who actually download webarchives want.

More clear now?

BTW, you would never be forced to use this toggle :wink:
You could still disable Internet instead …

Formatted notes usually embed all required resources.

I tried to explain why that is not feasible. I’ll try again. Let’s assume that someone wants to dynamically augment their website, e.g. with the latest news. So, they’d use a script like

// fetch() must be awaited, and the response body read as text
const response = await fetch("https://example.org/newsserver?latest");
document.querySelector("#newsfeed").innerHTML = await response.text();

Very simple. The more complex version is what Apple (for example) did with its former developer documentation: That was nearly entirely dynamic. What would your “toggle” do exactly? Would it look for the fetch call and then – well, what? Comment that out, risking invalid script code and a possible failure of the page? How would you implement that?

And that was the simple version. What about a script that loads another script, which then loads content? Or a script that doesn’t use fetch but good old XMLHttpRequest?

A script can do whatever it wants to an HTML document. And there’s no way to prevent that short of not executing any JavaScript at all. Which might then well result in a page missing much of its content. Which is not very useful for an “archive”.

Sure. But a JS script that loads some HTML and puts that somewhere on the page is also embedded. So it is run every time the note is opened, possibly changing the visual appearance each time.

Formatted notes don’t include scripts.