Capturing html as web archives

Hello all,

I’m quite delighted with the opportunity of capturing both the images and the html of web pages with DT 1.9.3 (and OS 10.3.9) but I’m not sure I want to bloat my database with web images all the time…

Is there a possibility of turning this function on and off? It would be great to have the option to either ‘capture html’ or to ‘capture web archive’.

Or is that option already there and I am just not seeing it?

Kind regards,

– Paul

Paul:

I share your sentiments. It is great to be able to capture WebArchives that can be viewed in whole offline.

But even the scientific journals that I routinely visit to collect information are – increasingly – placing advertising and other material on their pages that’s extraneous (distractingly so) to the information that I find useful. Capturing all those ads in WebArchives simply keeps in the distracting stuff, and wastes disk space to boot.

That’s why I’ve been in the habit for a long time of capturing such Web pages as Notes rather than Pages, and then quickly editing out the extraneous stuff from my captured notes.

I wouldn’t mind capturing WebArchives of pages containing lots of ads, if I could quickly and easily edit out unwanted material, retaining just the images that I want to keep. I’ve seen some discussions on the Net that editing WebArchives is (or may become) a feature, but my experiments under 10.3.9 don’t work properly. Does anyone have any information or suggestions on this?

A truly wonderful feature of WebArchives is that internal hyperlinks are reset to work properly in a captured document, rather than sending one out to the Internet. A scientific article that contains many internal links to reference citations and figures becomes much more usable on my PowerBook now that I can click on a link to a reference, view it, then return to the text I was reading even when I’m offline!

Working internal links go a long way towards balancing the higher file sizes and distracting elements of WebArchives in my DEVONthink database. :slight_smile:

Thanks for your reply Bill.

I for one would like to ask the developers to please add the option of capturing either html or a web archive to the next release.

Anyone else for this feature?

Kind regards,

– Paul

I’m not seeing that option in DT 1.9.3 - perhaps I’m missing it also? I’m presuming when Bill says:

that he means he uses WebArchives created by DEVONagent or some other utility? Bill?

The option is already in DEVONagent, is it not: to send the “capture” to DEVONthink as either html or web archive? As you say though, Paul, it would be convenient to do the same in DT.

:bulb: Funny you mention this Bill - I was following up with the developer of PithHelmet - ad blocker extraordinaire for Safari (I’m not affiliated BTW) concerning the possibility of its support for DEVONagent just yesterday! I know the program works very well with Safari at blocking out all kinds of unwanted ads etc. (and one is able to customize it quite easily) - I was hoping that it might do the same job in DEVONagent or perhaps DEVONthink. This is what he had to say about that possibility:

So if there is interest from the DEVON users community and this interest is made known to Mike Solomon (PithHelmet developer)… who knows, we’d be able to capture mostly ad free content - definitely from within DA and perhaps from within DT? As I mentioned in another DA post recently, the surfing experience with and without PithHelmet is like the difference between night and day! No ads = much pleasure = very useful 8)

Of course you can use Webstractor to capture the site and then edit out the ads :slight_smile:

Ltldream writes:

Ltldream: It seems that DT 1.9.3 now captures all html pages as webarchives. This is a very nice feature but I for one don’t want to unnecessarily bloat my database by storing the images of EVERY web page I want to archive.

I think it would be wonderful to have the option to choose either html (not possible any more!) or web archive the moment it is captured – especially since it seems we can’t choose to store web archives (like images and .pdf’s) in the database folder rather than the database proper.

Kind regards,

– Paul

Thanks for clarifying this point for me Paul. Looks like it’s another case of me not viewing the “Read Me” once again with this update of DT! :blush:

I see what you mean now Paul - the same is true in DA - the capture is a WebArchive as it is in DT and is therefore not a work-around. I thought for sure that DA had allowed a choice - a capture and subsequent transfer to DT as a WebArchive OR html file. Once again my previous statement (in this thread) about DA still containing that choice is incorrect - indeed if it ever HAD that capability?! :confused: As a new user of DA and DT, it looks like it’s going to take me a fair bit of time to remember these finer points.

I’m still hoping to hear a word from the developers on this point.

Christian? Eric?

I find that webarchives of pages are considerably larger than pure html (even when archiving text only pages) and as a user who archives a LOT of web pages this is somewhat worrying…

Don’t get me wrong… it’s a wonderful feature to be able to archive important pages which contain a lot of visual components via web archive (previously I had to print to .pdf and import the page to do this) but in most cases full-featured offline browsing doesn’t seem to be worth the increase in database size (especially since stored html loads the requisite images when online).

Please bring back the ability to add html only to the database!

Kind regards,

– Paul

I got a similar response from Mike when I mentioned integrating PithHelmet with DEVONthink last year. Christian said he attempted to do it but it didn’t work; I don’t remember any details.

Boy oh boy, PithHelmet plus DA and/or DT would be a truly great combination! Sure would cut down on all the trimming required when one saves an article rife with ads to DEVONthink. I would probably also use DA much, much more than I already do if it were able to use PithHelmet. :bulb: I also think an ad-free environment would be a tremendous selling point for DA and DT. It is far too tedious to have to filter that gunk manually - especially with a customizable solution in the wings. (my rant for the day) :wink: