Updating webarchives

Hoping I can be forgiven for:

  • being new to DT
  • being relatively new to scripting and/in DT
  • just wanting to rely on having some form of ‘captured’ web content, or a pointer thereto whenever I need it,

I have to agree with you that ‘…The thread is a complete mess…’; I take responsibility for that - both by ‘thinking aloud’ at times and by not having the familiarity with the environment that I am now gaining - in no small part thanks to your expertise, Pete!

So I am nevertheless very grateful for your help with explanations, suggestions, and concrete solutions (your script).

You did. But I read that post of yours as meaning that I wouldn’t need the script which you had just written - as I think you’re saying here, aren’t you:

My apologies. It looks as though I won’t need either - provided Reload does detect defunct URLs, rather than flagging them as ‘Invalid’ and creating a Replicant for each one in its own directory so named. Or will I?

At the risk of trying your patience even further, which script: yours or ‘Check Links’?

I do propose to read through this thread - all 60 messages of it (!) again.

I have just done so once :slight_smile: .

In the meantime, I believe I understand you to be saying that the process is - in sequence - to:

  1. run one of the scripts… not sure which one; sorry :slight_smile:
  2. Reload
  3. Update Captured Archive, which, as we know from this post by @cgrunenberg, ‘just uses the currently loaded web page & its resources, creates an updated web archive and updates the URL of the item. Therefore it’s useful to update captured web archives in case of changed contents (e.g. after reloading the page), it’s not intended for updating invalid URLs.’

for each URL.

If I’m right, I have two questions:

  1. What is the inbuilt ‘Check URLs’ script designed to do? Just check Bookmarks? I was - in all fairness - directed to this in order to help me with webarchives. Maybe I need to do that first in order to get at least valid URLs; although - as I said in my post earlier today - it does seem to do some updating.
  2. How do I script that process for - say - each webarchive in any given directory, please?

And again - all I want to do is:

  • update URLs which are no longer valid - that is, record the latest (correct) URL in DT; this will most likely be because the domain has changed and/or because the page (I only save pages) has moved to a different location in the (menu structure of) the site
  • and then to update the contents of the webarchive inside DT in order for it to be as useful to me as possible.

There is no reasonable way to update a bunch of webarchives other than creating new ones. This was also suggested by @cgrunenberg:

This script

DEVONthink automatically follows redirected URLs when it creates a webarchive, so you get the correct URL. There is no need to use any of the other scripts.

All you have to do is:

  • Run the script
  • Compare each resulting webarchive with the old version
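
For anyone who wants to see where a moved page now lives before re-capturing it, the redirect-following mentioned above can be approximated outside DEVONthink. This is only a rough sketch in Python. It is not the script discussed in this thread and not how DEVONthink works internally; `resolve_final_url` and its `opener` parameter are names made up for illustration.

```python
import urllib.request


def resolve_final_url(url, opener=urllib.request.urlopen, timeout=10):
    """Follow redirects and report where a URL ends up.

    Returns (final_url, None) on success, or (None, reason) if the URL
    cannot be fetched. `opener` is a hypothetical hook so the function
    can be tried without network access; by default it is urlopen,
    which follows HTTP redirects on its own.
    """
    try:
        with opener(url, timeout=timeout) as response:
            # geturl() gives the URL actually retrieved, after redirects.
            return response.geturl(), None
    except (OSError, ValueError) as exc:
        # URLError/HTTPError and timeouts are OSError subclasses;
        # ValueError covers strings that are not URLs at all.
        return None, str(exc)
```

Calling `resolve_final_url("http://example.com/old-page")` gives back the address the server finally answers from, or a reason string if nothing answers, which is roughly the information needed to decide whether a webarchive is worth re-capturing.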

I can see that now :slight_smile: . Am learning a lot!

What is the purpose of DT’s ‘Check links’ inbuilt script?

I still find it useful to see which webarchives can be found (even though they may not be up-to-date, may not display the latest content), and so which - as you suggest - it’s still better just to re-Capture.

That’s just an extract, isn’t it… I get an error when trying to compile?

Have you in fact updated it?

I ask because I get this error:

[screenshot of the error: Screen Shot 2021-06-11 at 12.14.06]

And, am I right that that’s the reason why it sometimes works - but only on one at a time?

Run your script. This one - originally posted here?

This is typically used to find dead links, i.e., bookmarks pointing to non-existent resources. Remember, we have plenty of people who import bookmarks from browsers as well as adding them via Clip to DEVONthink.

Thanks, Jim; so - bearing the contents of this whole thread in mind - not to check webarchives?

Yes. You have to click the linked post to see the whole script. But I now also added a link.

Yes. When you first tested this script you got the same error, remember?
Afterwards I updated it. Use the current version and you won’t get the error.

No, it can be used to check the URLs of webarchives…


I do remember. I have now re-downloaded the current version. Thanks. No errors.

This is going to be extremely helpful, Pete. Thanks so much!

But it doesn’t work every time: for some webarchives, like the one from this URL, the result is a URL which I think I understand but which isn’t complete: applewebdata://69FFF7AA-B2FD-44A4-A912-06D2A29D2E66

Not a complaint. I can see how powerful this can be. Thanks!

But, as has been said in this thread, that may not be the last word: it checks for a URL that is invalid (e.g. www.divontachnologies.cim), but - even though the log says that it can get updated - it isn’t actually changing anything?
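
One way to think about the two cases above (an `applewebdata://` address and a mistyped host) is to separate URLs that could be re-fetched at all from those that could not. The following is a small Python sketch under that assumption; `is_fetchable` is a hypothetical helper, not something DEVONthink or the script in this thread provides, and it only looks at the shape of the URL without contacting any server.

```python
from urllib.parse import urlsplit


def is_fetchable(url):
    """Return True only for http(s) URLs that name a host.

    Internal schemes such as applewebdata:// and bare strings without
    a scheme are rejected. A URL that passes may still be dead; this
    check does not go online.
    """
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

Records whose URL fails this check would probably need the original web address restored by hand before any reload or re-capture could help.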