Updating webarchives

mksBelper · June 11, 2021, 1:45am

Hoping I can be forgiven for:

being new to DT
being relatively new to scripting and/in DT
just wanting to rely on having some form of ‘captured’ web content, or a pointer thereto whenever I need it,

I have to agree with you that ‘…The thread is a complete mess…’; I take responsibility for that - both by ‘thinking aloud’ at times and by not having the familiarity with the environment that I am now gaining - in no small part thanks to your expertise, Pete!

So I am nevertheless very grateful for your help with explanations, suggestions, and concrete solutions (your script).

You did. But I read that post of yours as meaning that I wouldn’t need the script which you had just written - as I think you’re saying here, aren’t you:

My apologies. It looks as though I won’t need either - provided Reload does detect defunct URLs, rather than flagging them as ‘Invalid’ and creating a Replicant for each one in its own directory so named. Or will I?

At the risk of trying your patience even further, which script: yours or ‘Check Links’?

I do propose to read through this thread - all 60 messages of it (!) again.

I have just done so once.

In the meantime, I believe I understand you to be saying that the process is - in sequence - to:

run one of the scripts… not sure which one; sorry
Reload
Update Captured Archive, which, as we know from this post of @cgrunenberg, does this: ‘just uses the currently loaded web page & its resources, creates an updated web archive and updates the URL of the item. Therefore it’s useful to update captured web archives in case of changed contents (e.g. after reloading the page), it’s not intended for updating invalid URLs.’

for each URL.

If I’m right, I have two questions:

What is the inbuilt ‘Check URLs’ script designed to do? Just check Bookmarks? I was - in all fairness - directed to this in order to help me with webarchives. Maybe I need to do that first in order to get at least valid URLs; although - as I said in my post earlier today - it does seem to do some updating.
How do I script that process for - say - each webarchive in any given directory, please?

And again - all I want to do is:

update URLs which are no longer valid - that is, record the latest (correct) URL in DT; this will most likely be because the domain has changed and/or because the page (I only save pages) has moved to a different location in the (menu structure of) the site
and then to update the contents of the webarchive inside DT in order for it to be as useful to me as possible.

pete31 · June 11, 2021, 7:15am

There is no reasonable way to update a bunch of webarchives other than creating new ones. This was also suggested by @cgrunenberg:

This script

pete31:

creates new webarchives from selected webarchives and inherits properties.

Comment out every property that should not be inherited in the “Inherit properties” block.
To do so prefix the line with #

-- Create new webarchive from selected webarchive and inherit properties

-- Note:		New webarchives may not contain the content you expect. It's necessary to manually check every new webarchive before you delete the old one
-- Setup:		Comment out every property that should not be inherited in the "Inherit properties" block. To do so prefix the line with #

use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

tell application id "DNtp"
	try
		set theRecords to selected records
		if theRecords = {} then error "Please select some webarchives"
		show progress indicator "Creating webarchive..." steps (count theRecords) as string with cancel button
		
		repeat with thisRecord in theRecords
			set thisType to (type of thisRecord) as string
			if thisType is in {"webarchive", "«constant ****wbar»"} then
				set thisRecord_Name to name without extension of thisRecord
				step progress indicator thisRecord_Name
				
				set thisNewWebarchive to create web document from (URL of thisRecord as string) in parent 1 of thisRecord -- Create new webarchive in selected webarchive's group
				
				set URL of thisNewWebarchive to my getWebResourceURLKey(path of thisNewWebarchive) -- Set record's URL to webarchive's internal URL (the one that was actually used to create the content). Necessary in case of redirections
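
The extract above calls `getWebResourceURLKey` but does not include it; the complete script is in the linked post. For orientation only, here is a minimal sketch of what such a handler might look like. It assumes a webarchive file is a property list whose `WebMainResource` dictionary carries a `WebResourceURL` key. This is a hypothetical reconstruction, so the handler in the linked post may differ.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical reconstruction: returns the URL stored inside the webarchive file,
-- i.e. the URL that was actually loaded (after any redirection).
on getWebResourceURLKey(thePath)
	-- A webarchive is a property list, so Foundation can read it as a dictionary
	set theDict to current application's NSDictionary's dictionaryWithContentsOfFile:thePath
	if theDict is missing value then error "Could not read webarchive at " & thePath
	
	-- Assumed layout: WebMainResource > WebResourceURL
	set theURL to theDict's valueForKeyPath:"WebMainResource.WebResourceURL"
	if theURL is missing value then error "No WebResourceURL found in " & thePath
	
	return theURL as string
end getWebResourceURLKey
```

The handler needs the `use framework "Foundation"` line that the script already declares, and it raises an error rather than returning an empty URL, so a webarchive that cannot be read is not given a blank URL without warning.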

DEVONthink automatically follows redirected URLs when it creates a webarchive, so you get the correct URL. There is no need to use any of the other scripts.

All you have to do is:

Run the script
Compare each resulting webarchive with the old version

mksBelper · June 11, 2021, 7:04pm

I can see that now. Am learning a lot!

What is the purpose of DT’s ‘Check links’ inbuilt script?

I still find it useful to see which webarchives can be found (even though they may not be up-to-date, may not display the latest content), and so which - as you suggest - it’s still better just to re-Capture.

That’s just an extract, isn’t it… I get an error when trying to compile?

Have you in fact updated it?

I ask because I get this error:

[Screenshot: Screen Shot 2021-06-11 at 12.14.06]

And, am I right that that’s the reason why it sometimes works - but only on one at a time?

Run your script. This one - originally posted here?

BLUEFROG · June 11, 2021, 7:07pm

This is typically used to find dead links, i.e., bookmarks pointing to non-existent resources. Remember, we have plenty of people who import bookmarks from browsers as well as adding them via Clip to DEVONthink.

mksBelper · June 11, 2021, 7:22pm

Thanks, Jim; so - bearing the contents of this whole thread in mind - not to check webarchives?

pete31 · June 11, 2021, 7:28pm

Yes. You have to click the linked post to see the whole script. But I now also added a link.

Yes. When you first tested this script you got the same error, remember?
Afterwards I updated it. Use the current version and you won’t get the error.

BLUEFROG · June 11, 2021, 7:28pm

No, it can be used to check the URLs of webarchives…

mksBelper · June 11, 2021, 7:39pm

I do remember. I have now re-downloaded the current version. Thanks. No errors.

This is going to be extremely helpful, Pete. Thanks so much!

But it doesn’t work every time: some webarchives, like the one from this URL, produce a result that I think I understand but that isn’t complete: applewebdata://69FFF7AA-B2FD-44A4-A912-06D2A29D2E66

Not a complaint. I can see how powerful this can be. Thanks!

mksBelper · June 11, 2021, 7:41pm

But, as has been said in this thread, that may not be the last word: it checks for a URL that is invalid (e.g. www.divontachnologies.cim), but - even though the log says that it can get updated - it isn’t actually changing anything?