But that doesn’t change (the appearance of) the web archive - in the View/Edit pane.
Your script, though, seems always to place the correct URL for the webarchive in the View/Edit pane. Thanks!
And it remains a webarchive, which - 90% of the time - looks fine. The URL also launches (in Safari) and everything looks OK.
But, as you can see, the appearance of the webarchive is broken in the View/Edit pane for some URLs - even though it looks as though your script has worked!
It seems to make no difference: when the page renders incorrectly, it continues to do so after Update Captured Archive.
My starting point is a webarchive whose displayed content is exactly what I want. Only the internal URL (and therefore DEVONthink’s URL) doesn’t match the URL in Safari’s address bar. To change the internal URL to the correct one, I take the steps explained in my previous post.
I don’t update the displayed content of any webarchive. If the displayed content isn’t what I want I’ll capture a new webarchive.
And the displayed content was fine before you used Update Captured Archive?
Does the displayed content also break if you don’t update the internal URL and just use Update Captured Archive?
But to be totally clear - and this is strange - for 90% of the webarchives I have updated, the page renders perfectly in DT 3.7.2.
I change the URL in the Inspector to the correct URL. I run your script. In the View/Edit pane the page changes (to that of the newly-corrected URL), the webarchive updates and it renders perfectly.
But for about 10% of the pages I’ve tried, I change the URL in the Inspector. (I Launch URL and it appears as it should.)
But the webarchive representation is either corrupt, incomplete or otherwise different.
As in my screenshots above. Odd!
But only for certain sites/URLs!
No. Running Update Captured Archive puts the old/incorrect URL into the URL field in the Inspector. The webarchive then displays ‘correctly’ - by which I mean that it displays the old/incorrect content of the webarchive.
IOW, what I’m saying is that your script seems to be working as it should, and as you designed and wrote it. Wonderful; thanks; so useful.
But that the version of the webarchive displayed is - sometimes - corrupted!
What I think may be happening is that it’s unable to fetch all the necessary resources to display properly when converted like this into a webarchive… CSS, images, scripts etc?
Two other pieces of potentially useful information:
- clipping that URL as a webarchive saves it perfectly
- single-clicking on the URL of the faultily-displayed page also launches properly to the correct address.
It seems (and I may be wrong) as though something is going wrong in the conversion process.
Converting to PDF takes the displayed content, i.e. the content that’s saved inside the webarchive, and creates a PDF from that. It does not capture a PDF from the URL that’s stored in the webarchive.
I make use of this by:
- capturing only a selected part of a site as a webarchive
- converting the webarchive to PDF

This way I get the best of both worlds: a PDF whose “clutter-freeness” I can control beforehand.
Only downside (as you know) is that sometimes the browser doesn’t report the correct URL. Apart from that it’s the best capture method I’ve found so far.
Exporting doesn’t change the content. DEVONthink never changes files, neither on import nor on export.
Yes, as explained in DEVONthink’s help Documentation > Documents > HTML-Based Formats:
Note: Web archives can be very useful with web pages using statically linked content. However, some popular and monetized sites get their contents dynamically from other sources, so the actual data is not in the underlying HTML. These pages may have missing content due to this, require an internet connection to display content, and run JavaScript. If you encounter this, a PDF may be a better archiving option.
PS: I didn’t look at your attached files as there’s nothing I can do about them.
Web archives contain their own URLs for various resources. The command just uses the currently loaded web page & its resources, creates an updated web archive and updates the URL of the item. Therefore it’s useful to update captured web archives in case of changed contents (e.g. after reloading the page), it’s not intended for updating invalid URLs.
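To make “contain their own URLs” concrete: a .webarchive file is a binary property list in which the main page and every subresource (image, CSS, script, etc.) each carry their own WebResourceURL. A minimal sketch in Python - the keys below are the real webarchive plist keys, but the URLs and content are made up for illustration:

```python
import plistlib

# A .webarchive is a binary plist: the main page lives under
# "WebMainResource", and each image/CSS/script under "WebSubresources".
# Build a tiny one in memory to show the layout (content is made up).
archive = {
    "WebMainResource": {
        "WebResourceURL": "https://example.com/page.html",
        "WebResourceMIMEType": "text/html",
        "WebResourceTextEncodingName": "UTF-8",
        "WebResourceData": b"<html><img src='logo.png'></html>",
    },
    "WebSubresources": [
        {
            "WebResourceURL": "https://example.com/logo.png",
            "WebResourceMIMEType": "image/png",
            "WebResourceData": b"\x89PNG...",
        }
    ],
}
blob = plistlib.dumps(archive, fmt=plistlib.FMT_BINARY)

# Reading it back shows every resource stores its own absolute URL.
loaded = plistlib.loads(blob)
print(loaded["WebMainResource"]["WebResourceURL"])
for sub in loaded["WebSubresources"]:
    print(sub["WebResourceURL"])
```

So updating the item’s URL in DEVONthink touches none of these embedded URLs, which is why the command rebuilds the archive from the currently loaded page instead.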
Thanks! A somewhat related question: which NS class/method does the DEVONthink service Capture Web Archive use to get the Safari selection? I know how to create a webarchive from the clipboard but couldn’t find the method that should be used to set the clipboard to the selection. Presumably there’s a whole class for this kind of thing?
Services actually receive the complete & required information from the source application but do not even know which application. In this case Safari provides the web archive data of the selection.
As with almost every file in DEVONthink, you can add a URL as a reference.
Only bookmarks vary based on the URL but that is because they are dynamically loading the content the URL points to. Webarchives do not do this.
Thanks. Got it. That explains why - when I update the URL in the Inspector, as you did in your example - the webarchive remains the same and displays the same content as before.
I don’t suppose it’s possible to update a webarchive, is it… other than with another script or (inbuilt) command?
You may remember helping me a few days ago by pointing me in the direction of the ‘Check Bookmarks’ script. That works beautifully, thanks.
What I’ve then been doing is going through the Invalid URLs group which it creates and using Pete’s script to update the webarchives.
The oddness is that 90% of webarchives get updated properly by Pete’s script (which is fantastic).
But there are exceptions, where the webarchives display as corrupted or with missing components/elements.
At this point, the incorrect display of the updated webarchive (which I now appreciate is unrelated to anything in the Inspector) is what I want to find a way to put right.
Hi, Pete, I hope it’s clear that I’m very grateful for your script - and NOT complaining!
I only used Update Captured Archive a couple of times; then I discovered that it doesn’t do what I want… as Jim and you pointed out, changing the URL in the Inspector doesn’t update the webarchive so Update Captured Archive really serves no purpose (for me, in this operation).
A: before I ran the script the webarchive displayed completely correctly - as if I’d (manually) Clipped a webarchive from each URL in question.
But what was displaying was, of course, an out-of-date version of the site.
In about 90% of cases the script updates the webarchive’s contents correctly (that is, it fetches the updated content) and the display is perfect.
In about 10% of cases the script also updates the webarchive’s contents correctly but the display is not correct.
Common sense tells me - I think! - that this behavior is site-dependent… certain resources are not fetched.
OTOH, clipping (DT’s own routine) the self-same URLs always both fetches the current content and displays correctly.
I didn’t think about the fact that users could use Update Captured Archive on webarchives they previously ran the script on (as I never use that option). Obviously something in changing the internal URL can break webarchives if one uses Update Captured Archive afterwards. I’ll add a warning note. Sorry for the trouble!
That may well be the case, Pete; but of the seven webarchives which the script appeared to break, I’m 99% sure I ran Update Captured Archive on only one or two of them at most.
IOW I do believe that Update Captured Archive really is irrelevant in our case here.
No trouble at all!
Apart from anything else, your script is extremely useful.
In the back of my mind still lies the fact that the built-in DT script I’m running, ‘Check Bookmarks’, may have that name for a reason - it isn’t, in fact, designed to reliably update webarchives (at least not as its main job), but to check and update (or mark as invalid) ‘regular’ URLs.
I have several thousand webarchives in my 10,000+ DT database (which is basically an import from EagleFiler).
Being the meticulous kind of person I am, I really want to have the contents of those sites as up to date and ready-for-use as I can.
The ‘Check Bookmarks’ script is a real boon - it seems to work and helps me keep up to date. But I suspect I wouldn’t be having this trouble if I’d chosen one of the PDF options, particularly because - as @cgrunenberg keeps pointing out - webarchives are officially deprecated by Apple. But my first few attempts with PDFs were not successful: partial pages, missing images, corrupt content etc. As it happens, EagleFiler is excellent at capturing webarchives.
If I may (and at the slight risk of going off topic), reading posts in this thread from this post from @cgrunenberg onwards, I’m still a little confused.
While, as Jim points out, there should be a ‘Check Bookmarks’ script, my installation doesn’t have it (maybe ‘Check URLs’ has superseded it); and ‘Check Links’ appears not to be documented.
I wonder if there is a better way altogether of checking for outdated links of all kinds.
OK, this is new info. Then it seems it’s not safe to update the internal URL.
I’ve no idea how webarchives work; the only thing I can think of is that after we change WebResourceURL, the loading of some resources breaks because they perhaps use links relative to the internal URL.
But I don’t believe that’s the case: a Google search for “WebResourceURL” webarchive relative yields only about 50 results - and since the whole problem of ending up with webarchives that carry a wrong URL is Apple’s fault, I’d expect a lot more results, e.g. developers reporting this problem.
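The relative-link hypothesis can at least be made concrete. If only the main resource’s WebResourceURL is rewritten while the stored HTML still contains relative references, those references would resolve against the new URL - and could then point at resources the archive never stored. A sketch under that assumption (the keys are real webarchive plist keys; all URLs are made up):

```python
import plistlib
from urllib.parse import urljoin

# Sketch of the URL update in question: only the main resource's
# WebResourceURL is rewritten; any stored subresources would keep
# their original absolute URLs. (URLs below are made up.)
archive = {
    "WebMainResource": {
        "WebResourceURL": "https://old.example.com/a/page.html",
        "WebResourceMIMEType": "text/html",
        "WebResourceData": b"<img src='img/logo.png'>",
    },
}
archive["WebMainResource"]["WebResourceURL"] = "https://new.example.com/b/page.html"
blob = plistlib.dumps(archive, fmt=plistlib.FMT_BINARY)  # write-back step

# The hypothesis: a relative reference in the stored HTML resolves
# against the (new) main-resource URL, so it may now point at a
# location no stored subresource matches.
base = archive["WebMainResource"]["WebResourceURL"]
print(urljoin(base, "img/logo.png"))  # -> https://new.example.com/b/img/logo.png
```

This only illustrates how relative resolution would shift; whether WebKit’s archive loader actually resolves subresources this way is exactly the open question here.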
Not sure what to do. Probably better to delete the script then.
Any chance you could definitively verify that changing WebResourceURL breaks webarchives? I’m writing a workaround that would make sure we only capture webarchives with Safari’s current URL - but if changing WebResourceURL potentially breaks them, that would of course be useless.
Yes, that’s info that’s floating around in this forum. But as far as I know it’s wrong.
I have a (DT) database of, as I say, about 10,000 files.
They’re divided into seven top-level headings… because I want to colour-code them, and the Finder allows no more than seven colours. That’s OK.
Once I was settled in to DT, I decided to check/correct/validate (or, where impossible to update, delete) all my webarchives. Webarchives are (so far) the only way I’m keeping URLs. That may change; this thread is helping here.
I’m doing one top level ‘subject area’ roughly every couple of days.
I’ll be doing one tomorrow, all being well.
All the files in that next top level Group are so far untouched: I haven’t run the ‘Check URLs’ script; I haven’t run your script and I haven’t run the Update Captured Archive routine on any of them.
So, Yes, I have a chance to record in much greater detail exactly what happens.
Definitely! Probably tomorrow, Wednesday 2 June.
I’ll make a step-by-step record of exactly what happens with clear examples of anything that breaks.
That makes sense. It also explains why it works for the vast majority but not for a minority. Which is why I uploaded the zip of the files that do break - in the hope that a common factor can be detected.
Yes, that does make sense.
I could do so. But given that I had only seven ‘failures’ - and was able to correct these manually in about 15 minutes, simply by Clipping a new webarchive into my DT inbox and then moving it to its rightful place - the high success rate far outweighs that. I’d like, though, to be able to point your script at every webarchive and have it update them all in one go.
Here’s a thought. Suppose tomorrow, or Thursday, I run your script against a webarchive which is already up to date and displaying properly and see if it breaks it!
(Which makes me think: have you not had a single case of a webarchive which starts failing to display correctly when you run your script, Pete?)
OK.
There’s obviously work to be done here, isn’t there? I’ll do what I can when I next run the ‘Check URLs’ script.