Updating webarchives

Got it, thanks.

I didn’t think about the fact that users could use Update Captured Archive on webarchives they previously ran the script on (as I never use that option). Obviously something about changing the internal URL can break webarchives if one uses Update Captured Archive afterwards. I’ll add a warning note. Sorry for the trouble!

Thanks to you!

That may well be the case, Pete; but of the seven webarchives which the script appeared to break, I’m 99% sure that I ran Update Captured Archive on only one or two of them at most.

IOW I do believe that Update Captured Archive really is irrelevant in our case here.

No trouble at all!

Apart from anything else, your script is extremely useful.

In the back of my mind still lies the suspicion that the DT (internal) script I’m running, ‘Check Bookmarks’, may have that name for a reason - and doesn’t, in fact, reliably update webarchives (at least not as its main job), but is designed to check and update (or mark as invalid) ‘regular’ URLs.

I have several thousand webarchives in my 10,000+ DT database (which is basically an import from EagleFiler).

Being the meticulous kind of person I am, I really want to have the contents of those sites as up to date and ready-for-use as I can.

The ‘Check Bookmarks’ script is a real boon: it seems to work, and it helps me keep up to date. But I suspect I wouldn’t be having this trouble if I’d chosen one of the PDF options, particularly because - as @cgrunenberg keeps pointing out - webarchives are officially deprecated by Apple. But my first few attempts with PDFs were not successful: partial pages, missing images, corrupt content etc. As it happens, EagleFiler is excellent at capturing webarchives.

If I may (and at the slight risk of going off topic), reading the posts in this thread from this post by @cgrunenberg onwards, I’m still a little confused.

While, as Jim points out, there should be a ‘Check Bookmarks’ script, my installation doesn’t have it (maybe ‘Check URLs’ has superseded it); and ‘Check Links’ appears not to be documented.

I wonder if there is a better way altogether of checking for outdated links of all kinds.

OK, this is new info. Then it seems it’s not safe to update the internal URL.

I’ve no idea how webarchives work; the only thing I can think of is that after we change the WebResourceURL, the loading of some resources breaks because they may use links that are relative to the internal URL.

But I don’t believe that’s the case: a Google search for “WebResourceURL” webarchive relative yields only 50 results. The whole problem of ending up with webarchives that have a wrong URL is Apple’s fault, so if changing it really broke things I’d expect a lot more results, e.g. developers reporting the problem.

Not sure what to do. Probably better to delete the script then.

Any chance you could definitively verify that changing WebResourceURL breaks webarchives? I’m writing a workaround that would make sure we only capture webarchives with Safari’s current URL - but if changing WebResourceURL potentially breaks them, that would of course be useless.

Yes, that’s info that’s floating around in this forum. But as far as I know it’s wrong.

@pete31, thanks for persevering with me on this!

I have a (DT) database of, as I say, about 10,000 files.

They’re divided into seven top-level Groups, because I want to colour-code them and the Finder allows no more than seven colours. That’s OK.

Once I was settled in to DT, I decided to check/correct/validate (or, where impossible to update, delete) all my webarchives. Webarchives are (so far) the only way I’m keeping URLs. That may change; this thread is helping here :slight_smile: .

I’m doing one top level ‘subject area’ roughly every couple of days.

I’ll be doing one tomorrow, all being well.

All the files in that next top level Group are so far untouched: I haven’t run the ‘Check URLs’ script; I haven’t run your script and I haven’t run the Update Captured Archive routine on any of them.

So, yes, I have a chance to record in much greater detail exactly what happens.

Definitely! Probably tomorrow, Wednesday 2 June.

I’ll make a step-by-step record of exactly what happens with clear examples of anything that breaks.

That makes sense. It also explains why it works for the vast majority but not for a minority. Which is why I uploaded the zip of the files that do break - in the hope that a common factor can be detected.

Yes, that does make sense.

I could do so. But I had only seven ‘failures’, and I was able to correct them manually simply by clipping a fresh webarchive into my DT inbox and then putting it in its rightful place; the errors we’re encountering took only 15 minutes to correct, and the high success rate far outweighs them. I’d like, though, to be able to point your script at every webarchive and have it update them all in one go :slight_smile: .

Here’s a thought. Suppose tomorrow, or Thursday, I run your script against a webarchive which is already up to date and displaying properly and see if it breaks it!

(Which makes me think: have you not had a single case of a webarchive which starts failing to display correctly when you run your script, Pete?)

OK.

There’s obviously work to be done here, isn’t there. I’ll do what I can when I run the ‘Check URLs’ script next :slight_smile: .


Nope. Otherwise I wouldn’t have posted it. Looking forward to reading whether you’ve found some pattern.

@pete31,

I was too. And I think I have :slight_smile: .

As we’ve said all along, I don’t think there’s anything wrong with your script. I think the process is just failing on pages that are heavily dynamically generated, or on other content/elements which it can’t fetch.

The counterargument to that remains the fact that I can reliably Clip a webarchive in DT 100% of the time.

So… this has been the process today:

1 I ran the ‘Check Links’ script on my ‘Computer’ (grey) top level Group, which has 459 webarchives:

2 The script found 52 webarchives (just over 10%; I have kept the DT log if it’ll help) that needed attention, logged them, and put them into its ‘Invalid URLs’ Group.

I worked through them one by one - inspecting, confirming and correcting.

3 The first such invalid URL was for Spamhaus, which displays like this in the View/Edit pane with the outdated URL http://www.spamhaus.org/rosko/index.lasso:

The correct URL, of course, is https://www.spamhaus.org, and it should look like this:

(Re-)running your Update webarchive script does not correct it.

4 Another example is this page: https://www.snoize.com (MIDI monitoring software called Snoize).

Again, the DT ‘Check Links’ script finds that the URL in the webarchive is incorrect. I correct it in the inspector.

But both before and after running the Update webarchive script the page is incorrectly rendered in the View/Edit pane:

5 One last example is a forum where the webarchive somehow insists on hanging on to the arguments/parameters for one particular post, rather than displaying the home page - which is all that’s contained in that URL.

https://www.sibeliusforum.com should look like this:

Again, I correct the URL in the DT inspector and run the script; but in the View/Edit pane - no matter how many times I rerun the script - it always renders wildly inaccurately:

It does seem to be hanging on - and trying to render - parameters for a particular thread by ‘keyrkenat’, doesn’t it.

Now - I did notice that if, in the View/Edit pane, I right-/Ctrl-click and choose Open Page in Safari, I go to a page which renders correctly as well:

Today I have not touched Update Captured Archive once.

Happy to provide further information if it helps, @pete31!

I still feel that I may not be approaching the whole question of updating/correcting/editing webarchives properly or as DT expects. I’d love to know, please.

Particularly given the disconnect which has been pointed out between the notional URL visible in the View/Edit pane and the URL field in the Inspector.

Am I missing the preferred/best practice way generally to correct URLs in webarchives? @BLUEFROG, @cgrunenberg, please?

In the example you gave in the quoted post: Did you just use the script to update the URL and afterwards reopen the record?

Or did you also use Reload in the contextual menu?

There’s actually no preferred method at all; web archives are basically static snapshots. Wouldn’t it be easier to just capture a web archive for the new URL instead?

Pete,

Just to put this into context, we’re talking about Update Captured Archive, aren’t we.

To answer your question, I only used the script - on the first examples that began this thread.

Now I don’t use it at all because it seems to re-display the earlier/older(/outdated) copy of the webarchive.

If I can help with troubleshooting your script in fetching full/working versions of the newly-updated content of webarchives, I’m here to test! Thanks :slight_smile: .

Thanks, @cgrunenberg; it probably would.

I’m trying to monitor/edit/correct hundreds of webarchives in a DT database of 10,000 files.

So the more I can automate it - for instance by using @pete31’s script - the better :slight_smile: .

Which makes me wonder whether there is a script that takes all URLs found to be ‘Invalid’ as a result of running the ‘Check Links’ external script and automatically iterates through them all, capturing new, up-to-date webarchives from the correct URLs.

I’ll write something.


Nope. There’s a contextual menu item Neu laden; I guess it’s Reload in English.


I made a test:

1. Starting point

A https://historicengland.org.uk/advice/hpg/heritage-assets/nhle/ webarchive

2. Change WebResourceURL

  • I changed the WebResourceURL to the URL of this thread,
  • didn’t use any DEVONthink commands,
  • deselected and selected the record to pick up the change.

Result:

3. Use contextual menu items

  • use contextual menu > menu item 1 (probably Reload in English)
  • use contextual menu > Update Captured Archive

Result:


Please use both menu items with one of your broken webarchives and see what happens.

Will do, Pete. Thanks!

Yes, ‘Reload’ - it’s the very first contextual menu item.

Am I right in understanding that what you did with the English Heritage page means that simply/only putting in a new URL (in that case the URL of this thread) by itself breaks the display of the original - as shown here?

Is changing WebResourceURL the same as entering a (new) value in the Inspector’s URL field?

I’m sorry - I don’t think I know exactly what WebResourceURL is, which I should because you’ve been referring to it all along. Sorry :frowning:

I’m due to do the next batch in my long project soon. I’ll try it then. Everything else has been corrected by re-clipping the URLs in question.

Should I first put the correct URL in the appropriate field in the Inspector, as I have been doing?

As far as you know - when Reload corrects the webarchive - does it remain corrected?

Exactly.

No. They are different things. The WebResourceURL is the URL that Safari reports while capturing a webarchive. It’s saved in the webarchive and not visible to users (unless you open the webarchive in BBEdit). DEVONthink uses it to populate the record’s URL property:

Changing the URL property in DEVONthink’s inspector has no effect on

  • the content of a webarchive
  • the WebResourceURL of a webarchive

That’s why you used the script to update the WebResourceURL with the DEVONthink URL.
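A webarchive, by the way, is just a binary property list, so if you want to peek at the hidden WebResourceURL yourself without BBEdit, a minimal sketch like this should work (plain AppleScriptObjC; pick any webarchive when prompted):

use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Minimal sketch: read the internal WebResourceURL of a webarchive file.
-- The main resource's URL is stored under the key path "WebMainResource.WebResourceURL".
set thePath to POSIX path of (choose file of type {"com.apple.webarchive"})
set theData to current application's NSData's dataWithContentsOfFile:thePath
set thePlist to (current application's NSPropertyListSerialization's propertyListWithData:theData options:0 |format|:(missing value) |error|:(missing value))
display dialog ((thePlist's valueForKeyPath:"WebMainResource.WebResourceURL") as string)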

Can’t say for sure as I can’t test it (easily).

What I know is that if I create a webarchive via AppleScript, DEVONthink will follow redirects, but it doesn’t use the redirected URL to populate the DEVONthink URL property. This means that, created via AppleScript, we get a webarchive with the redirected URL’s content, but the URL we see in DEVONthink is the one we provided in the script.
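You can check this yourself with a quick sketch (untested assumption: http://apple.com still redirects to https://www.apple.com/; the capture lands in the default inbox):

tell application id "DNtp"
	-- Capture a webarchive from a URL that redirects.
	set theRecord to create web document from "http://apple.com" in incoming group
	-- The URL property still shows the URL we provided, even though
	-- the captured content came from the redirect target.
	display dialog (URL of theRecord as string)
end tell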

I guess it’s sufficient to use Reload (DEVONthink will probably follow redirects) and you just have to use Update Captured Archive afterwards.

Look at the capture: DEVONthink automatically updated the URL in the view pane:

And it was also updated in the inspector - I just checked this. So you probably won’t need the script anymore.

Yes.

  • Reload loads the WebResourceURL in the current tab
  • Update Captured Archive saves what’s loaded, i.e. writes the currently visible content to disk

That means

  • if you use Update Captured Archive without previously using Reload you only save what’s currently visible. In your case that was broken content, which is why Update Captured Archive didn’t seem to work for you.

  • if you use Reload without using Update Captured Archive afterwards you don’t change the webarchive record, i.e. nothing is written to disk.

So

  • Reload → acts on the tab’s content
  • Update Captured Archive → acts on the file’s content

To update and save a webarchive’s content you need to use both.

Thanks. Now I know.

I see. But it’s still good practice to change it - so that it’s correct, surely. Although see my step 4 below…

That’s beginning to explain the occurrence of the corrupted webarchives, isn’t it.

Maybe the fact that only 10% failed is because there’s a correlation between the corruption and the presence of a redirect, or of missing CSS/scripts/resources - missing because they were unreachable.

Although it still doesn’t explain why a webarchive clipped directly into DT always works and never has ‘corruption’.

Great!

But see my step 3 below…

Got it.
So my workflow ought to be:

  1. run the DT ‘Check Links’ external script as I have been doing
  2. for each webarchive which - for whatever reason - is out-of-date (even though it is likely to be displaying ‘correctly’)
  3. despite what you just said about not needing your script, I think I shall have to because that’s the only way to create a new webarchive for the URL in question
  4. don’t update the URL in the Inspector because your script does that
  5. only if the content displays incorrectly, Reload
  6. (only if the content displays incorrectly), Update Captured Archive

Have I understood, Pete?

No. Let’s do it step by step.

Somewhere in this thread you wrote that you want to update the webarchive’s content anyway because it might be out-of-date. So why do you need the “Check links” script? It only checks the URL, not the content of a webarchive.

No. As I wrote, I guess that DEVONthink will follow redirected URLs if you use Reload. You don’t have to do anything with any URL (anymore). Just use Reload. If what you see afterwards is fine then use Update Captured Archive.

No. My script takes the URL that’s visible in the inspector and updates the webarchive’s internal URL (WebResourceURL). Probably best to completely forget the script. It’s (probably) not needed anymore.

No. You wrote somewhere that you want to update your webarchives because they might be out-of-date. If so then you have to Reload every webarchive.


PS Updating 10,000 webarchives manually is a lot of work.

Would you use a script that simply creates new webarchives?


This script creates new webarchives from selected webarchives and inherits properties.

Comment out every property that should not be inherited in the “Inherit properties” block.
To do so prefix the line with #

-- Create new webarchive from selected webarchive and inherit properties

-- Note:		New webarchives may not contain the content you expect. It's necessary to manually check every new webarchive before you delete the old one
-- Setup:		Comment out every property that should not be inherited in the "Inherit properties" block. To do so prefix the line with #

use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

tell application id "DNtp"
	try
		set theRecords to selected records
		if theRecords = {} then error "Please select some webarchives"
		show progress indicator "Creating webarchive..." steps (count theRecords) with cancel button -- the steps parameter expects an integer
		
		repeat with thisRecord in theRecords
			set thisType to (type of thisRecord) as string
			if thisType is in {"webarchive", "«constant ****wbar»"} then
				set thisRecord_Name to name without extension of thisRecord
				step progress indicator thisRecord_Name
				
				set thisNewWebarchive to create web document from (URL of thisRecord as string) in parent 1 of thisRecord -- Create new webarchive in selected webarchive's group
				
				set URL of thisNewWebarchive to my getWebResourceURLKey(path of thisNewWebarchive) -- Set record's URL to webarchive's internal URL (the one that was actually used to create the content). Necessary in case of redirections 
				
				tell thisNewWebarchive -- Inherit properties
					set name to thisRecord_Name -- The name
					set aliases to aliases of thisRecord -- Wiki aliases
					set comment to comment of thisRecord -- The comment
					set creation date to creation date of thisRecord -- The creation date
					try
						set custom meta data to custom meta data of thisRecord -- User-defined metadata of a record
					end try
					set exclude from search to exclude from search of thisRecord -- Exclude record from searching.
					set exclude from see also to exclude from see also of thisRecord -- Exclude record from see also.					
					set exclude from Wiki linking to exclude from Wiki linking of thisRecord -- Exclude record from automatic Wiki linking.
					set label to label of thisRecord -- Index of label (0-7)
					set locking to locking of thisRecord -- The locking
					set rating to rating of thisRecord -- Rating (0-5)
					set state to state of thisRecord -- The state/flag
					set tags to tags of thisRecord -- The tags
					try
						set thumbnail to thumbnail of thisRecord -- The thumbnail
					end try
					set unread to unread of thisRecord -- The unread flag
				end tell
				
			end if
		end repeat
		
		hide progress indicator
	on error error_message number error_number
		hide progress indicator
		if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
		return
	end try
end tell

on getWebResourceURLKey(thePath)
	try
		set theWebarchiveData to current application's NSData's dataWithContentsOfFile:thePath
		set theWebarchivePlist to (current application's NSPropertyListSerialization's propertyListWithData:theWebarchiveData options:0 |format|:(missing value) |error|:(missing value)) -- "format" is an out parameter, so pass "missing value"
		set theWebResourceURLKey to (theWebarchivePlist's valueForKeyPath:"WebMainResource.WebResourceURL") as string
		return theWebResourceURLKey
	on error error_message number error_number
		activate
		if the error_number is not -128 then display alert "Error: Handler \"getWebResourceURLKey\"" message error_message as warning
		error number -128
	end try
end getWebResourceURLKey


I certainly don’t understand all the subtleties of the different URLs and of reloading what, when, and why - but I also don’t quite understand the purpose behind all these thousands of web archives.

  • if they’re meant to capture a website at a certain point in time (archive), why update them?
  • if they’re meant to be always up to date, why bother with an archive that has to be updated regularly in order to be … up to date? Why not simply bookmark the page(s)?

As to completeness etc: in some cases missing or weird content after changing the URL might be caused by a content security policy that prevents loading CSS and other resources.


Yes, I planned to ask that too. I guess it was a decision made once and never revisited. But of course I don’t know his use case and data.


@chrillek - thanks. That’s just the kind of suggestion I really appreciate.

I’m relatively new to DT (from EagleFiler), and this thread helps me keep an open mind :slight_smile:

That makes good sense. I think I should try it.

I attempted to Convert to HTML:

But nothing happened.

So if I did want to convert all my webarchives to Bookmarks, how would I set about that, please?
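(In case it helps, here’s a minimal, untested sketch of one way that might be scripted - assuming DEVONthink’s create record command accepts type:bookmark; it creates a bookmark next to each selected webarchive and leaves the originals untouched:)

tell application id "DNtp"
	repeat with thisRecord in (selected records)
		-- Only process webarchives (same type check as in pete31's script above)
		if ((type of thisRecord) as string) is in {"webarchive", "«constant ****wbar»"} then
			-- Create a bookmark with the record's name and URL in the same group
			create record with {name:(name without extension of thisRecord), type:bookmark, URL:(URL of thisRecord)} in (parent 1 of thisRecord)
		end if
	end repeat
end tell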

I think that must be what’s happening.

Thanks again - to you, too, @pete31 - for the fresh air!