Updating webarchives

I don’t update the webarchive, just the URL.

I don’t know how Update Captured Archive behaves if the key WebResourceURL was updated. @cgrunenberg :slight_smile: ?

@pete31

Please forgive me for not understanding :frowning: here.

I’ve been updating the URL - as it appears in the Inspector:

But that doesn’t change (the appearance of) the web archive - in the View/Edit pane.

Your script, though, seems always to place the correct URL for the webarchive in the View/Edit pane. Thanks!

And it remains a webarchive, which - 90% of the time - looks fine. The URL also launches (in Safari) and everything looks OK.

But, as you can see, the appearance of the webarchive is broken in the View/Edit pane for some URLs - even though it looks as though your script has worked!

It seems to make no difference: when the page renders incorrectly, it continues to do so after Update Captured Archive.

No idea.

My starting point is a webarchive whose displayed content is exactly what I want. Only the internal URL (and therefore DEVONthink’s URL) doesn’t match the URL in Safari’s address bar. To change the internal URL to the correct one I take the steps explained in my previous post.

I don’t update the displayed content of any webarchive. If the displayed content isn’t what I want I’ll capture a new webarchive.
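For anyone wondering what the “internal URL” actually is: a webarchive file is a binary property list whose WebMainResource dictionary carries a WebResourceURL key. Here’s a minimal sketch, assuming that standard layout, of how one could read and rewrite that key with Python’s plistlib. The path and URL are placeholders, this is not the script under discussion, and you’d want to work on a copy:

```python
#!/usr/bin/env python3
# Minimal sketch (not the script under discussion): inspect and rewrite the
# internal URL of a .webarchive. A webarchive is a binary property list; the
# captured page lives in WebMainResource, whose WebResourceURL key is the
# "internal URL" referred to above. Path and URL are placeholders.
import plistlib
from pathlib import Path

archive_path = Path("~/Desktop/example.webarchive").expanduser()  # placeholder
new_url = "https://example.com/correct-page"                      # placeholder

with archive_path.open("rb") as f:
    archive = plistlib.load(f)              # plistlib auto-detects binary plists

main = archive["WebMainResource"]
print("Current internal URL:", main.get("WebResourceURL"))

main["WebResourceURL"] = new_url            # rewrite the internal URL

with archive_path.open("wb") as f:
    plistlib.dump(archive, f, fmt=plistlib.FMT_BINARY)  # write it back as binary
```

Whether rewriting this key is always safe is exactly what gets debated later in this thread.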

And the displayed content was fine before you used Update Captured Archive?

Does the displayed content also break if you don’t update the internal URL and just use Update Captured Archive?

@pete31

No.

But to be totally clear - and this is strange - for 90% of the webarchives I have updated, the page renders perfectly in DT3.7.2.

I change the URL in the Inspector to the correct URL. I run your script. In the View/Edit pane the page changes (to that of the newly-corrected URL), the webarchive updates and it renders perfectly.

But for about 10% of the pages I’ve tried, I change the URL in the Inspector. (If I use Launch URL, the page appears as it should.)

But the webarchive representation is either corrupt, incomplete or otherwise different.

As in my screenshots above. Odd.

But only for certain sites/URLs!

No. Running Update Captured Archive puts the old/incorrect URL into the URL field in the Inspector. It then displays ‘correctly’ - by which I mean that it displays the old/incorrect content of the webarchive.

IOW, what I’m saying is that your script seems to be working as it should, and as you designed and wrote it. Wonderful; thanks; so useful.

But that the version of the webarchive displayed is - sometimes - corrupted!


Just tried two other things for one of the webarchives which does not display properly as a web archive:

  1. Converted it to PDF: same display faults
  2. Exported as website (attached): same display faults

About the Parker Library on the Web Project.html.zip (3.1 KB)

What I think may be happening is that it’s unable to fetch all the necessary resources (CSS, images, scripts etc.) to display properly when converted like this into a webarchive?

Two other pieces of potentially useful information:

  1. clipping that URL as a webarchive saves it perfectly
  2. single-clicking the URL of the incorrectly displayed page also launches the correct address properly.

It seems (and I may be wrong) as though something is going wrong in the conversion process.

Hope that helps, @pete31

Converting to PDF takes the displayed content, i.e. the content that’s saved inside the webarchive, and creates a PDF from that. It does not capture a PDF from the URL that’s stored in the webarchive.

I make use of this by

  • capturing only a selected part of a site as a webarchive
  • converting the webarchive to PDF

This way I get the best of both worlds: a PDF whose “clutter freeness” I can control beforehand.

The only downside (as you know) is that sometimes the browser doesn’t report the correct URL. Apart from that it’s the best capture method I’ve found so far.


Edit: Script: Create webarchive from selection with correct URL


Exporting doesn’t change the content. DEVONthink never changes files, neither on import nor on export.

Yes, as explained in DEVONthink’s help Documentation > Documents > HTML-Based Formats:

Note: Web archives can be very useful with web pages using statically linked content. However, some popular and monetized sites get their contents dynamically from other sources, so the actual data is not in the underlying HTML. These pages may have missing content due to this, require an internet connection to display content, and run JavaScript. If you encounter this, a PDF may be a better archiving option.

PS I didn’t look at your attached files as there’s nothing I can do about it :slight_smile:

Thanks for confirming that I shouldn’t expect converting to PDF (or exporting) to make any difference; and for your guidance on HTML and DT etc. :slight_smile:

I just thought that maybe you’d see something significant there in those files.

I’ve just been through all the webarchives I found to be out-of-date when I ran the Check URLs script earlier today.

There are 79 of them.

Of these, no more than 7 (Yes, that’s right, just seven) display incorrectly when your script is run.

That would lead me to suspect that there must be something ‘special’ (use of a CDN, incorrectly formatted HTML etc) which is causing them to break.

If it weren’t for the fact that when I clip a webarchive from their URLs/sites, it imports and displays perfectly in DT!

So - unless I switch to PDFs (see below) - all I have to do is reClip those URLs as webarchives manually.

IOW your script has saved - and will save - me hours. Thanks again!

New to DT, and to AppleScript: can I create a new folder of my own in

~/Library/Application Scripts/com.devon-technologies.think3/Menu

to store new/external/third party scripts such as yours in, please?

== snip ==

I tried capturing as PDF when I first started to use DT a few weeks ago - without much success.

But now I’m beginning to think that some format of PDF is the better way to capture website content…

Web archives contain their own URLs for various resources. The command just uses the currently loaded web page & its resources, creates an updated web archive and updates the URL of the item. Therefore it’s useful for updating captured web archives in case of changed contents (e.g. after reloading the page); it’s not intended for updating invalid URLs.
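To see those per-resource URLs for yourself, you can list them straight from the archive’s plist structure. A rough sketch, assuming the usual WebMainResource / WebSubresources layout (the path is a placeholder):

```python
#!/usr/bin/env python3
# Sketch: list the URLs a webarchive stores for its own resources.
# Assumes the usual layout: WebMainResource (the page itself) plus an
# optional WebSubresources array (images, CSS, scripts, ...).
# The path is a placeholder.
import plistlib
from pathlib import Path

archive_path = Path("~/Desktop/example.webarchive").expanduser()  # placeholder

with archive_path.open("rb") as f:
    archive = plistlib.load(f)

print("Main resource:", archive["WebMainResource"].get("WebResourceURL"))
for res in archive.get("WebSubresources", []):
    print("Subresource: ", res.get("WebResourceURL"))
```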


Thanks! A somewhat related question: What NS class/method does the DEVONthink service Capture Web Archive use to get the Safari selection? I know how to create a webarchive from the clipboard but couldn’t find the method that should be used to set the clipboard to the selection. There must be a whole class for this kind of stuff, no?

Services actually receive the complete & required information from the source application but do not even know which application it is. In this case Safari provides the web archive data of the selection.


Hi @cgrunenberg!

May I ask you if what I am experiencing - and @pete31 has kindly helped me with - is behaviour that you’d expect, please?

And if not, what I can do to put it right?

Thanks!

If you clip a webarchive, the content is based on the internals of the webarchive, not controlled by the URL field in the Info inspector.

Just as with almost every file in DEVONthink, you can add a URL as a reference.
Only bookmarks vary based on the URL, but that is because they dynamically load the content the URL points to. Webarchives do not do this.
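One way to convince yourself of this: the page the View/Edit pane renders is stored inside the archive itself, in the main resource’s WebResourceData, so it doesn’t change when the URL field does. A quick sketch, assuming the standard webarchive plist layout (placeholder path):

```python
#!/usr/bin/env python3
# Sketch: the page shown in the View/Edit pane is stored in the webarchive
# itself (WebResourceData of the main resource); it is not fetched from the
# URL field in the Info inspector. Placeholder path.
import plistlib
from pathlib import Path

archive_path = Path("~/Desktop/example.webarchive").expanduser()  # placeholder

with archive_path.open("rb") as f:
    main = plistlib.load(f)["WebMainResource"]

encoding = main.get("WebResourceTextEncodingName") or "utf-8"
html = main["WebResourceData"].decode(encoding, errors="replace")
print(html[:500])  # the stored HTML is right here, baked into the file
```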

Jim,

Thanks. Got it. That explains why, when I update the URL in the Inspector (as you did by pickling yourselves in your example :slight_smile: ), the webarchive remains the same and displays the same content as before.

I don’t suppose it’s possible to update a webarchive, is it… other than with another script or (inbuilt) command?

You may remember helping me a few days ago by pointing me in the direction of the ‘Check Bookmarks’ script. That works beautifully, thanks.

What I’ve then been doing is going through the Invalid URLs Group which it creates and using Pete’s script to update the webarchives.

The oddness is that 90% of webarchives get updated properly by Pete’s script (which is fantastic).

But there are exceptions, where the webarchives display as either corrupted and/or with missing components/elements.

At this point, that - the incorrect display of the updated webarchive (which I now appreciate is unrelated to anything in the Inspector) - is what I want to find a way to put right.

TIA!

@mksBelper what was your starting point before you ran the script and before you used Update Captured Archive?

  • A: correctly displaying webarchive
  • B: not correctly displaying webarchive

Hi, Pete, I hope I’m clear that I’m very grateful for your script - and NOT complaining :slight_smile: !

I only used Update Captured Archive a couple of times; then I discovered that it doesn’t do what I want… as Jim and you pointed out, changing the URL in the Inspector doesn’t update the webarchive so Update Captured Archive really serves no purpose (for me, in this operation).

A: before I ran the script the webarchive displayed completely correctly - as if I’d (manually) Clipped a webarchive from each URL in question.

But what was displaying was, of course, an out-of-date version of the site.

In about 90% of cases the script updates the webarchive’s contents correctly (that is, it fetches the updated content) and the display is perfect.

In about 10% of cases the script also updates the webarchive’s contents correctly but the display is not correct.

Common sense tells me - I think! - that this behavior is site-dependent… certain resources are not fetched.

OTOH, clipping (DT’s own routine) the self-same URLs always both fetches the current content and displays correctly.

Got it, thanks.

I didn’t think about the fact that users could use Update Captured Archive on webarchives they previously ran the script on (as I never use that option). Obviously something in changing the internal URL can break webarchives if one uses Update Captured Archive afterwards. I’ll add a warning note. Sorry for the trouble!

Thanks to you!

That may well be the case, Pete; but of the seven webarchives which the script appeared to break, I’m 99% sure that I ran Update Captured Archive on only one or two of them, at most.

IOW I do believe that Update Captured Archive really is irrelevant in our case here.

No trouble at all!

Apart from anything else, your script is extremely useful.

In the back of my mind still lies the fact that the DT (internal) script I’m running, ‘Check Bookmarks’, may have that name for a reason - and doesn’t, in fact, reliably update webarchives (at least as its main job), but is designed to check and update (or mark as invalid) ‘regular’ URLs.

I have several thousand webarchives in my 10,000+ DT database (which is basically an import from EagleFiler).

Being the meticulous kind of person I am, I really want to have the contents of those sites as up to date and ready-for-use as I can.

The ‘Check Bookmarks’ script is a real boon - it seems to work, and it helps me keep up to date. But I suspect I wouldn’t be having this trouble if I’d chosen one of the PDF options, particularly because - as @cgrunenberg keeps pointing out - webarchives are officially deprecated by Apple. But my first few attempts with PDFs were not successful: partial pages, missing images, corrupt content etc. As it happens, EagleFiler is excellent at capturing webarchives.

If I may (and at the slight risk of going off topic), reading posts in this thread from this post from @cgrunenberg onwards, I’m still a little confused.

While, as Jim points out, there should be a ‘Check Bookmarks’ script, my installation doesn’t have it (maybe ‘Check URLs’ has superseded it); and ‘Check links’ appears not to be documented.

I wonder if there is a better way altogether of checking for outdated links of all kinds.

Ok, this is new info. Then it seems it’s not safe to update the internal URL.

I’ve no idea how webarchives work; the only thing I can think of is that, after we change WebResourceURL, the loading of some resources breaks because they perhaps use links that are relative to the internal URL.
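To illustrate that hypothesis with a generic example (not taken from any of the affected pages): a relative link resolves against the page’s base URL, so a different base sends it somewhere else.

```python
#!/usr/bin/env python3
# Generic illustration of the hypothesis above: a relative link resolves
# against the page's base URL, so changing that base changes where every
# relative reference points. Example URLs are hypothetical.
from urllib.parse import urljoin

relative_link = "assets/style.css"  # hypothetical relative resource

print(urljoin("https://old.example.com/page/", relative_link))
# -> https://old.example.com/page/assets/style.css
print(urljoin("https://new.example.com/other/", relative_link))
# -> https://new.example.com/other/assets/style.css
```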

But I don’t believe that’s the case: a Google search for "WebResourceURL" webarchive relative only yields 50 results, and the whole problem of ending up with webarchives that have a wrong URL is Apple’s fault, so I’d expect a lot more results, e.g. developers reporting this problem.

Not sure what to do. Probably better to delete the script then.

Any chance you could definitively verify that changing WebResourceURL breaks webarchives? I’m writing a workaround that would make sure we only capture webarchives with Safari’s current URL - but if changing WebResourceURL potentially breaks them, that would of course be useless.
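Just to sketch the idea behind such a check (a hypothetical outline only, not the actual workaround; the path is a placeholder): compare Safari’s current URL with the archive’s internal WebResourceURL before accepting the capture.

```python
#!/usr/bin/env python3
# Outline of the idea only (not the actual workaround): accept a capture only
# if the archive's internal URL matches what Safari's front window is showing.
# The archive path is a placeholder.
import plistlib
import subprocess
from pathlib import Path

archive_path = Path("~/Desktop/example.webarchive").expanduser()  # placeholder

# Ask Safari for the URL of its front document via a one-line AppleScript.
safari_url = subprocess.run(
    ["osascript", "-e", 'tell application "Safari" to get URL of front document'],
    capture_output=True, text=True, check=True,
).stdout.strip()

with archive_path.open("rb") as f:
    internal_url = plistlib.load(f)["WebMainResource"].get("WebResourceURL")

if internal_url == safari_url:
    print("Internal URL matches Safari's current URL.")
else:
    print(f"Mismatch: archive says {internal_url!r}, Safari says {safari_url!r}")
```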

Yes, that’s info that’s floating around in this forum. But as far as I know it’s wrong.

@pete31, thanks for persevering with me on this!

I have a (DT) database of, as I say, about 10,000 files.

They’re divided into seven top-level heads… because I want to colour-code them and the Finder allows no more than seven colours. That’s OK.

Once I was settled into DT, I decided to check/correct/validate (or, where impossible to update, delete) all my webarchives. Webarchives are (so far) the only way I’m keeping URLs. That may change; this thread is helping here :slight_smile: .

I’m doing one top level ‘subject area’ roughly every couple of days.

I’ll be doing one tomorrow, all being well.

All the files in that next top level Group are so far untouched: I haven’t run the ‘Check URLs’ script; I haven’t run your script and I haven’t run the Update Captured Archive routine on any of them.

So, Yes, I have a chance to record in much greater detail exactly what happens.

Definitely! Probably tomorrow, Wednesday 2 June.

I’ll make a step-by-step record of exactly what happens with clear examples of anything that breaks.

That makes sense. It also explains why it works for the vast majority but not for a minority. Which is why I uploaded the zip of the files that do break - in the hope that a common factor can be detected.

Yes, that does make sense.

I could do so. But in view of the fact that I only had seven ‘failures’ and was able to correct these manually simply by clipping a webarchive into my DT inbox and then putting the webarchive in its rightful place, the errors we’re encountering took only 15 minutes to correct; the high success rate far outweighs that. I’d like, though, to be able to point your script at every webarchive and have it update them all in one go :slight_smile: .

Here’s a thought. Suppose tomorrow, or Thursday, I run your script against a webarchive which is already up to date and displaying properly and see if it breaks it!

(Which makes me think: have you not had a single case of a webarchive which starts failing to display correctly when you run your script, Pete?)

OK.

There’s obviously work to be done here, isn’t there? I’ll do what I can when I run the ‘Check URLs’ script next :slight_smile: .


Nope. Otherwise I wouldn’t have posted it. Looking forward to reading whether you’ve found a pattern.