Updating webarchives

mksBelper · May 31, 2021, 8:26pm

I’m having great and consistent success with the ‘Check Links’ script on webarchives (of which I have several thousand!): I get a batch of Replicants (in their own, newly-created, Group) for each Invalid URL.

I then go through them all and correct the URL - if/when I can find it - in the URL field of the Inspector. I assume that’s the right/best place to update a URL, isn’t it?

But in the View/Edit Pane at the bottom of my window, the old (incorrect/out-of-date) URL remains for each webarchive.

How should I update the (contents of) a webarchive to point to the newly-corrected URL, please?

Ideally with a Smart Rule or script - so as to be automated.

TIA!

pete31 · May 31, 2021, 8:39pm

That’s funny. I have just written a script the other day that does exactly that.

Note:

It is only necessary to use this script if you captured a part of a website.
If you’ve captured via Clip to DEVONthink, i.e. you captured a whole website, then it’s not necessary to use this script.

-- Replace internal webarchive URL with DEVONthink's URL

-- Note: It is only necessary to use this script if you captured a part of a website. If you've captured via Clip to DEVONthink, i.e. you captured a whole website, then it's not necessary to use it.

tell application id "DNtp"
	try
		set theRecords to selected records
		if theRecords = {} then error "Please select a webarchive"
		
		repeat with thisRecord in theRecords
			-- Compare the type as a string; the raw constant form is accepted as well
			set thisType to (type of thisRecord) as string
			if thisType is in {"webarchive", "«constant ****wbar»"} then
				set thisURL to URL of thisRecord
				-- Skip records that have no URL set in the inspector
				if thisURL ≠ "" then
					-- Write the record's URL into the webarchive's internal WebResourceURL key
					do shell script "/usr/libexec/PlistBuddy -c " & quoted form of ("set \":WebMainResource:WebResourceURL\" " & thisURL) & space & quoted form of (path of thisRecord as string)
				end if
			end if
		end repeat
		
	on error error_message number error_number
		-- Error -128 means the user cancelled; alert on anything else
		if the error_number is not -128 then display alert "DEVONthink" message error_message as warning
		return
	end try
end tell
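
For anyone who wants to see what the PlistBuddy call amounts to outside DEVONthink, here is a minimal Python sketch. It assumes the webarchive is a property list with a `WebMainResource` dictionary holding `WebResourceURL` (the same key path the script sets). The function names `rewrite_main_url` and `rewrite_file` are hypothetical, and this is not how DEVONthink itself does it.

```python
import plistlib
from pathlib import Path


def rewrite_main_url(archive_bytes: bytes, new_url: str) -> bytes:
    """Return webarchive data with WebMainResource:WebResourceURL replaced."""
    # plistlib.loads detects binary vs XML property lists on its own.
    archive = plistlib.loads(archive_bytes)
    archive["WebMainResource"]["WebResourceURL"] = new_url
    # Assumption: writing the result back as a binary plist is acceptable.
    return plistlib.dumps(archive, fmt=plistlib.FMT_BINARY)


def rewrite_file(path: str, new_url: str) -> None:
    """Hypothetical helper: apply the change to a .webarchive file in place."""
    target = Path(path)
    target.write_bytes(rewrite_main_url(target.read_bytes(), new_url))
```

Only the one key is touched; the stored page data and every other entry in the file are written back unchanged. Try it on a copy of a webarchive first.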

mksBelper · May 31, 2021, 8:45pm

Hi @pete31!

Thanks. Just to be clear: if I run ‘Update Captured Archive’ from the View/Edit Pane at the bottom of the window, that does the opposite of what I want: it replaces the corrected/edited URL with the old (incorrect) URL for the webarchive!

Yours does what I want, doesn’t it?

I ask because some sites I ran it on didn’t fully complete or display properly. Like this one:

And this one:

Much appreciated!

pete31 · May 31, 2021, 9:25pm

Webarchives take the URL that Safari reports and save them in an internal key WebResourceURL. Sometimes the reported URL isn’t the right one. DEVONthink uses this internal key to set the record’s URL property.

So if the browser reported a wrong URL we have this wrong URL in both places, in the inspector and at the top of the view pane.

1 - Wrong DEVONthink and wrong view pane / internal WebResourceURL URL

2 - Set correct DEVONthink URL in inspector

3 - Run script to set internal WebResourceURL URL to DEVONthink URL

Note: It takes a second until the updated internal URL shows up at the top of the view pane
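
The three steps above boil down to "the inspector URL and the internal URL disagree until the script has run". A rough Python sketch of that check, assuming the same `WebMainResource` / `WebResourceURL` layout; `internal_url` and `needs_sync` are made-up names for illustration:

```python
import plistlib


def internal_url(archive_bytes: bytes) -> str:
    """Return the URL stored under WebMainResource:WebResourceURL ('' if absent)."""
    archive = plistlib.loads(archive_bytes)
    return archive["WebMainResource"].get("WebResourceURL", "")


def needs_sync(archive_bytes: bytes, inspector_url: str) -> bool:
    """True when a non-empty inspector URL differs from the internal one,
    i.e. step 2 has been done but step 3 has not."""
    return bool(inspector_url) and internal_url(archive_bytes) != inspector_url
```

Like the script, this treats an empty inspector URL as "nothing to do".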

I’ve never really used this. @cgrunenberg is this done to update URLs that have changed since the webarchive was captured?

Try the script, works fine over here

mksBelper · May 31, 2021, 9:29pm

Thanks, @pete31!

Yes, I follow your explanations.

I’ve been correcting the URL in the Inspector - as in your first screengrab.

The script, though, doesn’t always (maybe 10% - see two examples above, please) seem to fetch a properly-rendered page.

For Parker Library On the Web - Spotlight at Stanford I see:

Launch URL works perfectly.

Is this just DT’s internal rendering in that pane?

Am I doing something wrong?

But this will prove to be so useful: I’m very grateful.

pete31 · May 31, 2021, 9:39pm

I don’t update the webarchive, just the URL.

Don’t know how using Update Captured Archive behaves if key WebResourceURL was updated. @cgrunenberg ?

mksBelper · May 31, 2021, 9:57pm

@pete31

Please forgive me for not understanding here.

I’ve been updating the URL - as it appears in the Inspector:

But that doesn’t change (the appearance of) the web archive - in the View/Edit pane.

Your script, though, seems always to place the correct URL for the webarchive in the View/Edit pane. Thanks!

And it remains a webarchive, which - 90% of the time - looks fine. The URL also launches (in Safari) and everything looks OK.

But, as you can see, the appearance of the webarchive is broken in the View/Edit pane for some URLs - even though it looks as though your script has worked!

It seems to make no difference: when the page renders incorrectly, it continues to do so after Update Captured Archive.

pete31 · May 31, 2021, 10:14pm

No idea.

My starting point is a webarchive whose displayed content is exactly what I want. Only the internal URL (and therefore DEVONthink’s URL) doesn’t match the URL in Safari’s address bar. To change the internal URL to the correct one I take the steps as explained in my previous post.

I don’t update the displayed content of any webarchive. If the displayed content isn’t what I want I’ll capture a new webarchive.

And the displayed content was fine before you used Update Captured Archive?

Does the displayed content also break if you don’t update the internal URL and just use Update Captured Archive?

mksBelper · May 31, 2021, 10:24pm

@pete31

No.

But to be totally clear - and this is strange - for 90% of the webarchives I have updated, the page renders perfectly in DT3.7.2.

I change the URL in the Inspector to the correct URL. I run your script. In the View/Edit pane the page changes (to that of the newly-corrected URL), the webarchive updates and it renders perfectly.

But for about 10% of the pages I’ve tried, I change the URL in the Inspector. (I Launch URL and it appears as it should.)

But the webarchive representation is either corrupt, incomplete or otherwise different.

As in my screenshots above. Odd

But only for certain sites/URLs!

No. Running Update Captured Archive puts the old/incorrect URL into the URL field in the Inspector. The webarchive then displays ‘correctly’ - by which I mean that it displays the old/incorrect content of the webarchive.

IOW, what I’m saying is that your script seems to be working as it should, and as you designed and wrote it. Wonderful; thanks; so useful.

But that the version of the webarchive displayed is - sometimes - corrupted!

mksBelper · May 31, 2021, 10:45pm

Just tried two other things for one of the webarchives which does not display properly as a web archive:

converted it to PDF: same display faults
Exported as website (attached): same display faults

About the Parker Library on the Web Project.html.zip (3.1 KB)

What I think may be happening is that it’s unable to fetch all the necessary resources to display properly when converted like this into a webarchive… CSS, images, scripts etc?
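
One way to check that hunch, as a sketch only: count what is actually stored inside the webarchive, grouped by MIME type. This assumes saved resources sit in a `WebSubresources` array whose entries carry `WebResourceMIMEType`; the function name `subresource_types` is hypothetical.

```python
import plistlib
from collections import Counter


def subresource_types(archive_bytes: bytes) -> Counter:
    """Count the resources stored in the webarchive, grouped by MIME type."""
    archive = plistlib.loads(archive_bytes)
    return Counter(
        res.get("WebResourceMIMEType", "unknown")
        for res in archive.get("WebSubresources", [])
    )
```

If a page that renders badly shows few or no `text/css` or image entries, that would fit the idea that the missing pieces have to come from the network rather than from the file.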

Two other pieces of potentially useful information:

clipping that URL as a webarchive saves it perfectly
single-clicking on the URL of the faulty-displayed page also launches properly to the correct address.

It seems (and I may be wrong) as though something is going wrong in the conversion process.

Hope that helps, @pete31

pete31 · May 31, 2021, 11:05pm

Converting to PDF takes the displayed content, i.e. the content that’s saved inside the webarchive, and creates a PDF from that. It does not capture a PDF from the URL that’s stored in the webarchive.
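
To illustrate that the content lives in the file rather than behind the URL, here is a small Python sketch that pulls the saved HTML out of a webarchive. It assumes the main page is stored as bytes under `WebMainResource` / `WebResourceData`, with an optional `WebResourceTextEncodingName`; `stored_html` is a made-up name.

```python
import plistlib


def stored_html(archive_bytes: bytes) -> str:
    """Return the HTML saved inside the webarchive's main resource."""
    main = plistlib.loads(archive_bytes)["WebMainResource"]
    # Assumption: fall back to UTF-8 when no encoding name is recorded.
    encoding = main.get("WebResourceTextEncodingName") or "utf-8"
    return main["WebResourceData"].decode(encoding, errors="replace")
```

Changing `WebResourceURL` alone leaves this stored HTML exactly as it was.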

I make use of this by

capturing only selected part of a site as webarchive
converting the webarchive to PDF

This way I get the best of both worlds: a PDF whose “clutter freeness” I can control beforehand.

Only downside (as you know) is that sometimes the browser doesn’t report the correct URL. Apart from that it’s the best capture method I’ve found so far.

Edit: Script: Create webarchive from selection with correct URL

Exporting doesn’t change the content. DEVONthink never changes files, neither on import nor on export.

Yes, as explained in DEVONthink’s help Documentation > Documents > HTML-Based Formats:

Note: Web archives can be very useful with web pages using statically linked content. However, some popular and monetized sites get their contents dynamically from other sources, so the actual data is not in the underlying HTML. These pages may have missing content due to this, require an internet connection to display content, and run JavaScript. If you encounter this, a PDF may be a better archiving option.

PS I didn’t look at your attached files as there’s nothing I can do about it

mksBelper · May 31, 2021, 11:21pm

Thanks for confirming that I shouldn’t expect Converting (or exporting) to PDF to make any difference; and your guidance on HTML and DT etc.

I just thought that maybe you’d see something significant there in those files.

I’ve just been through all the webarchives I found to be out-of-date when I ran the Check URLs script earlier today.

There are 79 of them.

Of these, no more than 7 (Yes, that’s right, just seven) display incorrectly when your script is run.

That would lead me to suspect that there must be something ‘special’ (use of a CDN, incorrectly formatted HTML etc) which is causing them to break.

If it weren’t for the fact that when I clip a webarchive from their URLs/sites, it imports and displays perfectly in DT!

So - unless I switch to PDFs (see below) - all I have to do is reClip those URLs as webarchives manually.

IOW your script has saved - and will save - me hours. Thanks again!

New to DT, and to AppleScript, can I create a new folder of my own in

~/Library/Application Scripts/com.devon-technologies.think3/Menu

to store new/external/third party scripts such as yours in, please?

== snip ==

I tried capturing as PDF when I first started to use DT a few weeks ago - without much success.

But now I’m beginning to think that some format of PDF is the better way to do that, to capture website content…

cgrunenberg · June 1, 2021, 6:52am

Web archives contain their own URLs for various resources. The command just uses the currently loaded web page & its resources, creates an updated web archive and updates the URL of the item. Therefore it’s useful to update captured web archives in case of changed contents (e.g. after reloading the page), it’s not intended for updating invalid URLs.

pete31 · June 1, 2021, 7:11am

Thanks! A somewhat related question: What NS class/method does the DEVONthink service Capture Web Archive use to get the Safari selection? I know how to create a webarchive from the clipboard but couldn’t find the method that should be used to set the clipboard to the selection. There must be a whole class for this kind of stuff, right?

cgrunenberg · June 1, 2021, 7:16am

Services actually receive the complete & required information from the source application but do not even know which application. In this case Safari provides the web archive data of the selection.

mksBelper · June 1, 2021, 8:23pm

Hi @cgrunenberg!

May I ask you if what I am experiencing - and @pete31 has kindly helped me with - is behaviour that you’d expect, please?

And if not, what I can do to put it right?

Thanks!

BLUEFROG · June 1, 2021, 9:18pm

If you clip a webarchive, the content is based on the internals of the webarchive, not controlled by the URL field in the Info inspector.

Just like almost every file in DEVONthink, you can add a URL as a reference.
Only bookmarks vary based on the URL but that is because they are dynamically loading the content the URL points to. Webarchives do not do this.

mksBelper · June 1, 2021, 9:41pm

Jim,

Thanks. Got it. That explains why - when I update the URL in the Inspector (as you did by pickling yourselves in your example ), the webarchive remains the same and displays the same content - as before.

I don’t suppose it’s possible to update a webarchive, is it… other than with another script or (inbuilt) command?

You may remember helping me a few days ago by pointing me in the direction of the ‘Check Bookmarks’ script. That works beautifully, thanks.

What I’ve been then doing is going through the Invalid URLs Group which it creates and using Pete’s script to update the webarchives.

The oddness is that 90% of webarchives get updated properly by Pete’s script (which is fantastic).

But there are exceptions, where the webarchives display as either corrupted and/or with missing components/elements.

At this point that - the incorrect display of the updated webarchive (which now I appreciate is unrelated to anything in the Inspector) - is what I want to find a way to put right.

TIA!

pete31 · June 1, 2021, 10:45pm

@mksBelper what was your starting point before you ran the script and before you used Update Captured Archive?

A: correctly displaying webarchive
B: not correctly displaying webarchive

mksBelper · June 1, 2021, 11:04pm

Hi, Pete, I hope I’m clear that I’m very grateful for your script - and NOT complaining!

I only used Update Captured Archive a couple of times; then I discovered that it doesn’t do what I want… as Jim and you pointed out, changing the URL in the Inspector doesn’t update the webarchive so Update Captured Archive really serves no purpose (for me, in this operation).

A: before I ran the script the webarchive displayed completely correctly - as if I’d (manually) Clipped a webarchive from each URL in question.

But what was displaying was, of course, an out-of-date version of the site.

In about 90% of cases the script updates the webarchive’s contents correctly (that is, it fetches the updated content) and the display is perfect.

In about 10% of cases the script also updates the webarchive’s contents correctly but the display is not correct.

Common sense tells me - I think! - that this behavior is site dependent… certain resources are not fetched.

OTOH, clipping (DT’s own routine) the self-same URLs always both fetches the current content and displays correctly.