Updating webarchives

You may well be right, @pete31; you may indeed have hit the nail on the head.

Generally, I do want (‘only’?) the latest content for any given URL. Perhaps in 99% of cases.

But there are times (and this is old thinking from EagleFiler, from which I have migrated to DT) when a webarchive serves me better:

  1. when the site is (perhaps temporarily) unavailable… outage or similar
  2. when the data for which I specifically captured it is no longer available on that site
  3. when the site’s URL has changed… at least I still have the data.

For those reasons I assumed - perhaps wrongly - that webarchives were (are?) the best compromise.

In all cases your help is much appreciated. Really helping me to think clearly.

@pete31,

Some of this may turn out to be superseded by the fresh input from yourself and @chrillek on whether webarchives are - after all - the best solution for what I’m trying to do.

When I stayed with webarchives from EagleFiler, they seemed to be the best compromise.

I need to think about that and post here again as soon as I have decided whether webarchives have any advantages over bookmarks after all.

Abbreviated workflow from your post - and many thanks for your persistence!

Because:

  1. I wasn’t fully aware of the relationship between ‘live’ URLs and what exactly webarchives contain
  2. I (mis?) understood - from a couple of threads here - that there is now only one way to confirm (or otherwise) the accuracy of any web pages stored in any form(at) in DT - and that is the Check links script
  3. this script appears to be deprecated, or superseded.

So, instead, I’d simply see the ‘Invalid URL’ in the ‘Invalid URLs’ Group (of replicants) which the Check Links script creates - and reload it? (And then Update Captured Archive if it’s OK?)

It’s looking that way, isn’t it.

It is. That’s what I’m aiming for: as much automation as possible :slight_smile: .

Yes. But because it now appears necessary for me to rethink the question of webarchive vs Bookmark, I might find a script that converts webarchives to Bookmarks more useful - because it looks as though this doesn’t do what I thought it would.

They are, I think.

If you decide to use bookmarks you can, of course, no longer search the site’s content.

When I started to use DEVONthink I captured a lot of whole sites as webarchives. Although I knew that I probably would “never” need a lot of those sites again (stuff like “How do I set up X on macOS version Y”) I captured the whole site because I didn’t know a better format. Years later, when I had more knowledge about DEVONthink and the advantages of different file formats, I used a script to convert them to bookmarks and saved the webarchive’s text inside the new bookmark’s comment (the comment field in the inspector). This way a search in DEVONthink would also bring up all bookmarks whose URL at some point contained the search terms. This was the best solution for this kind of stuff.
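Roughly, the idea looks like this (a simplified sketch, not the script I actually used back then; group handling and property copying are reduced to a minimum, and you should only run something like this in a test database first):

tell application id "DNtp"
    repeat with theRecord in (selected records)
        if type of theRecord is webarchive then
            -- put the bookmark next to the old webarchive
            set theGroup to item 1 of (parents of theRecord)
            set theBookmark to create record with {name:(name of theRecord), type:bookmark, URL:(URL of theRecord)} in theGroup
            -- keep the archive's text searchable via the comment field
            set comment of theBookmark to plain text of theRecord
            set tags of theBookmark to tags of theRecord
            set creation date of theBookmark to creation date of theRecord
            -- delete record theRecord -- only once you're sure the result is what you want
        end if
    end repeat
end tell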

But I would never use bookmarks for stuff that’s important in any way because of the reasons you mentioned: content can change and URLs can become unavailable.


Script

The script Create new webarchive from selected webarchive and inherit properties does what you’re now doing manually:

It

  • checks whether a URL is redirected (that’s something DEVONthink does automatically). If so, it uses the redirected URL to create a new webarchive; if not, it uses the URL of the “old” webarchive to create the new one (see the sketch after this list)

  • sets the properties of the new webarchive to those of the old webarchive
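To illustrate the idea, here is a bare-bones sketch (this is not the posted script itself; it assumes curl is used to resolve redirects and copies only a few properties):

tell application id "DNtp"
    repeat with theRecord in (selected records)
        set theURL to URL of theRecord
        -- follow redirects and get the final ("effective") URL
        set finalURL to do shell script "curl -o /dev/null -sL -w '%{url_effective}' " & quoted form of theURL
        if finalURL is "" then set finalURL to theURL
        -- create a fresh webarchive next to the old one
        set theGroup to item 1 of (parents of theRecord)
        set newArchive to create web document from finalURL in theGroup
        -- inherit some properties of the old webarchive
        set name of newArchive to name of theRecord
        set tags of newArchive to tags of theRecord
        set comment of newArchive to comment of theRecord
    end repeat
end tell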

Assuming you did not already make use of linking records (i.e. you did not use x-devonthink-item:// links to link to your webarchives) this script is the simplest way to update all your 10,000 webarchives.

You of course would have to go manually through the results and check whether the content is what you expect - but that’s something you are doing now anyway.

Please create a test database and try the script:

  • create a test database

  • duplicate a bunch of webarchives into this database

  • select some or all webarchives

  • make sure that the navigation sidebar is visible (you’ll see the script progress there)

  • run script

You can distinguish new webarchives from old ones if you sort by addition date.

Let me know if it works


Web archives were probably a nice idea before Ajax and dynamic HTML content appeared on the scene. Nowadays, they’re nearly pointless for many HTML documents.
The Apple developer documentation @pete31 referred to in another thread is a very good example: the document is virtually empty and comes into being only at load time through a whole bunch of JavaScript files.
These scripts load content from a remote server, which means that

  1. This content can change at any point in time without the web archive changing
  2. It is not searchable by DT because the document itself is empty (bar basic HTML markup); a rough way to see this for yourself is sketched below
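If you want to check this yourself, a crude test (just a sketch; the URL is a placeholder) is to strip the markup from the page’s raw HTML and look at how little text remains:

-- Placeholder URL; substitute a page you've archived. This is only a rough check.
set thePageURL to "https://example.com/some-js-driven-page"
set visibleText to do shell script "curl -sL " & quoted form of thePageURL & " | sed 's/<[^>]*>//g' | tr -s '[:space:]' ' '"
log "Characters of text outside the markup: " & (length of visibleText)

A JavaScript-driven page typically leaves next to nothing, and that is roughly what DEVONthink has available to index.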

There’s also the aspect that web archives are an Apple-only format. But that’s probably a moot point if one is sure never to move away from the platform.

In my opinion, these are strong arguments against using web archives for archiving purposes. PDF or markdown (if layout is not so important) are far better suited to capture the state of a web page at a point in time. Bookmarks are ok as pointers to the current state of the page. In combination with @pete31’s “put the text in the comment” approach, they can even be searched.


Thanks for that, @chrillek. Makes a lot of sense :slight_smile: .

I plan to spend the weekend experimenting empirically on just how (well) DT handles each of the formats I’m considering.

But I still have reservations about moving away from webarchives:

Which, perhaps paradoxically, is actually also a point in favour of webarchives: when I do need a snapshot - not, say, of Amazon prices (they change, of course), but maybe of the details of books which subsequently go out of print - I’d still have that in an archive; webarchives are ideal for that.

Major point against.

Never! Although if it is deprecated (there seems to be debate there) or becomes so…

Thanks. I agree.

I tried that the very first day I bought DT. The results were… mediocre. I shall have to try again and see why what I got just didn’t reflect anything like what was on the page.

Which kind of argues in favour of a mixed approach: sometimes webarchive; sometimes PDF (if I can get it to work); and sometimes Bookmark.

That then leaves me with the question: how do I - when I want to - check for validity and, if necessary update, PDFs and Bookmarks?

which is a real plus.

@pete31,

:slight_smile: .

Maybe I need to start choosing between:

  1. webarchives
  2. PDFs
  3. Bookmarks

according to circumstance?

Exactly where I have been this last month or so. Re-assuring.

Do you still have that script, please, Pete?

Again - re-assuring, and makes me think that I probably need a more nuanced, hybrid approach.

That’s the one you posted earlier, isn’t it.

== snip ==

No.

That’s right.

Will do, thanks!

As a relative newcomer to DT, am I correct in thinking that a completely new test database is 100% separate and independent from the one I have now? I don’t want to compromise preferences, script locations, settings etc. in any way.

Sure :slight_smile: !

Yes, that’s definitely a good idea. I always decide based on importance, appearance of a capture method’s result and whether the content could be found somewhere else in case the URL might become unavailable.

If you don’t need a whole site there’s always the option to capture the selected part of a site as a webarchive. To do so:

  • use menu Safari > Services > DEVONthink: Capture Web Archive
    (You can assign a shortcut in System Preferences > Keyboard > App Shortcuts)

or

  • drag the selection onto DEVONthink’s icon in the dock

Here’s a new one:

Script: Convert record to bookmark, save record’s text in bookmark’s comment and inherit properties


That’s correct.

With a test database there’s only one thing you should never do:

  • Do not duplicate indexed records into a test database

Duplicating indexed records into another database does not create a new file in the file system. It only creates a new record that’s pointing to the same file in the file system as the other duplicate. This means if you alter a duplicated indexed record in the test database you also alter the indexed record in the other database - because they are the same file.
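If you want to guard against that mistake, a small filter along these lines might help (just a sketch; it assumes a test database literally named "Test" and uses its inbox as the target):

tell application id "DNtp"
    set testGroup to incoming group of database "Test" -- assumed name of the test database
    repeat with theRecord in (selected records)
        -- skip indexed records so the originals' files can't be touched by accident
        if not (indexed of theRecord) then
            duplicate record theRecord to testGroup
        end if
    end repeat
end tell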


Thanks. Good to know. It’s what I’ll do. Although, with several thousand URLs as webarchives, it’s a lot of work :slight_smile: . I am a perfectionist, though; so it’ll have to be done.

== … the option to capture the selected part of a site as webarchive… ==

Thanks. I gave it a quick try. I do get errors, though…

An example URL is the webarchive for:

https://www.thecocktaildb.com/

I get this error:

Got it. Thanks. I generally don’t use Indexed records much.


Yes, I didn’t remember that custom meta data is “undefined” if there’s no meta data set for a record.

Updated the script


Of course it works perfectly now. Thanks!

I shall also be getting started with AppleScript as soon as I can.


Nope. All the content dynamically generated by JavaScript (e.g. loaded from a server) will be loaded again when you open the web archive. Whether this is only the price, or also the image, the marketing blurb, whatever, is entirely up to the web site. In extremis (as for the Apple developer documentation quoted before), the web archive behaves exactly like a bookmark. You simply don’t know what the “archive” is “archiving” unless you check it yourself.


I see what you mean, thanks, @chrillek: didn’t explain myself properly.

What I mean is: if, a year ago, the webarchive captured, say, the bookshops where a certain (out-of-print) book was available at that time, and now, when I open the webarchive a year later, that listing is gone (because of the updating you describe), it might still be worth contacting the seller who was offering such a title a year ago to see whether, say, they’ve just withdrawn as an Amazon seller…but still have it.

Or am I still not understanding webarchives? In my case, they actually don’t seem to update?

I was not talking about Web archives in general but about the web sites they’re archiving (or attempting to). All dynamically generated content on these sites is not archived but dynamically re-generated whenever you open the archive. So it reflects whatever the site owner wants to show you at this point in time. Not at some time in the past.
The archive might even appear to be broken if the end points (aka URLs) for the archived scripts are no longer functional.


@chrillek,

Thanks. I may be laboring under a misapprehension here. Sorry :frowning: .

When I used EagleFiler (up to a month or slightly more ago), the attraction of Webarchives was that they could never break, because EF captured a version of the page which was forever after unchanged.

I (wrongly?) assumed that DT’s webarchives work the same way.

I certainly have never seen any of the webarchives which I have in my DT database try and update themselves.

Specifically, that’s why I asked my first question in this thread: how does one actually update them?

Thanks to the help of Pete and you, I have learnt that there is much more to it than that :slight_smile:

And am still working out what’s best for my purposes.

Perhaps the key phrase is ‘when you open’ because whenever I view a very old page in DT, it displays what may have been there for 20+ years.

What exactly should I try, please, @chrillek, to see this re-generation on opening? Reloading them?

Thanks again for your help!

So from a time when web sites were static, not dynamic. I was talking about at most four years ago. Of course DT displays web archives from about 20 years ago correctly.


I’d have to look to confirm. Shall do :slight_smile:

I’d have said that all webarchives in DT display just as they were captured.

What exactly do you mean by ‘…open the archive…’, please?

@pete31 and @chrillek - thanks again for all your help!

I have given all of this a great deal of thought (and not a little experimentation!)

For the moment, I am updating (have nearly finished updating, in fact… fewer than 100 to go :slight_smile: ) all my webarchives as webarchives.

Those which I need as ‘conventional’ bookmarks I (can later) convert. Where the PDF option works, I shall be using that too. But at least now there are next to no Invalid URLs in a database of > 10,000 records :slight_smile: .

They should all be easier to maintain - every six months or so.

I run the ‘Check Links’ script on all webarchives.

Then, for each of the ‘Invalids’ it has found and put in the ‘Invalid URLs’ Group, I clip the updated URL (if an equivalent still exists; about 30% no longer do) into my Inbox and Move it to the Location indicated in the ‘Invalid URLs’ Group.

Not the best solution; the ideal would be something that hunted out an updated URL (perhaps via DEVONagent Pro?), if available, and then Moved it automatically (perhaps using Classify?) in each case.

But I’m happy for the moment with having completed this gigantic task.

Once again - your help much appreciated!

@mksBelper I’m again not sure whether I understood what you’re trying to achieve.

If you want your webarchives to be (more) up to date (than they were before), why do you only update those in the “Invalid URLs” group? Whether a webarchive has an invalid URL has nothing to do with how up to date its content is.

If your goal is to update your webarchives, i.e. to update their content, then you’ll have to update all of them, not only those with an invalid URL. If that’s what you’re after then the script I posted can be used to do that.

I don’t understand what advantage it might have to only process those with an invalid URL. Could you please clarify what you’re doing?


I see what you mean, Yes.

I have been ignoring the distinction between ‘able-to-be-reached’ (the URL is valid, but the content may be old) and ‘up-to-date’ (always valid and showing the current content), haven’t I?

I think I have assumed that the built-in script (‘Check Links’) invariably updates links it finds. The log certainly suggests that; a typical line reads:

11:33:29: composersforum.ning.com Updated URL (https://composersforum.ning.com/)

And yet the URL as reported stays the same - unless it’s altering ‘http’ to ‘https’ because that’s happened on the site’s server since I captured the address as an EagleFiler webarchive.

Do you know what the built-in DT ‘Check links’ script is actually doing? Is it checking for the return of a 404 (etc)? And nothing more?
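My own guess at what such a check might boil down to (just a sketch, assuming it only looks at the HTTP status via curl; I don’t know what the built-in script really does):

tell application id "DNtp"
    repeat with theRecord in (selected records)
        set theURL to URL of theRecord
        -- ask the server for the final HTTP status code, following redirects
        set theStatus to do shell script "curl -o /dev/null -sL -w '%{http_code}' --max-time 15 " & quoted form of theURL
        if theStatus is not "200" then log (name of theRecord) & " -> HTTP " & theStatus
    end repeat
end tell

If it’s only doing something like that, it would explain why it never touches the content itself.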

I shall return to your script (thanks), Yes - as soon as I’ve completed my first pass, which (without my thinking about it carefully enough) has only done half the job! Now the script will always be run against current URLs.

My reservation - as we found the other day - has been that your script doesn’t work in about 10% of cases and results in a corrupt display/output.

Not that I’m not grateful, Pete! I am :-).

And we think that the reason for this is that certain dynamic elements are not fetched.

With 2,000 URLs I suppose I’m happy to have got as far as I have. But when I also take into consideration the need to have some sites saved as conventional bookmarks as well as PDFs, it’s still a huge job :frowning: .

No idea what you have been doing. I explained how it works, you did something else.

I think I told you that you won’t need that script.

That might be one reason.

The other might be: before I actually tested what’s needed to update a webarchive, I did not know that it’s necessary to use Reload and Update Captured Archive (I don’t capture whole sites as webarchives and assumed that you know what you’re doing).

Using the script and the correct DEVONthink menus afterwards seems to work fine. I explained that above. But again: this script is not needed in your case.


@mksBelper please read the thread again. The thread is a complete mess. But reading it again will probably make things clearer. You should be able to find answers to all your questions.
