Updating webarchives

I see what you mean, yes.

I have been ignoring the distinction between ‘reachable’ (the URL still works, but the content behind it may be old) and ‘up-to-date’ (both valid and current), haven’t I?

I think I have assumed that the built-in script (‘Check Links’) invariably updates links it finds. The log certainly suggests that; a typical line reads:

11:33:29: composersforum.ning.com Updated URL (https://composersforum.ning.com/)

And yet the URL as reported stays the same - unless it’s simply changing ‘http’ to ‘https’ because the site’s server has switched since I captured the address as an EagleFiler webarchive.

Do you know what the built-in DT ‘Check Links’ script is actually doing? Is it simply checking whether the server returns a 404 (or similar), and nothing more?
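If so, I imagine the check amounts to little more than this (a minimal Python sketch of what I assume such a script might do; the function name, timeout and example output are my own, not anything taken from the actual DT script):

```python
import urllib.request
import urllib.error

def check_link(url, timeout=10):
    """Return (HTTP status, final URL after any redirects)."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.geturl()
    except urllib.error.HTTPError as e:
        # 404 and other error statuses land here
        return e.code, url

status, final_url = check_link("http://composersforum.ning.com")
if status >= 400:
    print(f"{status}: link appears dead")
elif final_url != "http://composersforum.ning.com":
    # e.g. the server now redirects http to https
    print(f"Updated URL ({final_url})")
else:
    print("Link OK, unchanged")
```

If that is really all it does, then ‘Updated URL’ would only mean the request succeeded (possibly after a redirect), not that the captured content itself is current.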

I shall return to your script (thanks), yes - as soon as I’ve completed my first pass, which (because I didn’t think it through carefully enough) has only done half the job! At least now the script will always be run against current URLs.

My reservation - as we found the other day - has been that your script fails in about 10% of cases and produces corrupt display/output.

Not that I’m not grateful, Pete! I am :-).

And we think the reason is that certain dynamic elements are not fetched.

With 2,000 URLs I suppose I’m happy to have got as far as I have. But when I also take into account the need to save some sites as conventional bookmarks as well as PDFs, it’s still a huge job :frowning:.