How to save webarchives to Devonthink reliably?

Drakes · February 9, 2015, 10:20am

Hello,

First post: I’ve been trying out DevonThink Pro Office since Christmas, and thanks to the generous trial period, I am starting to like it more and more.

However, I’ve got issues with using it for storing information bits from the web (which is my main reason for using DevonThink).

When I save information from Safari - using the Clip to DevonThink icon - I prefer to save the page as a webarchive. I was hoping that saving a webarchive would be equivalent to using the “Save as… webarchive” function from Safari. However, whenever I use Clip To, the downloading takes some 10 to 20 seconds (even if the page is already open in the browser) - so it seems that the page is not simply saved?

Further, when I open a page that I stored as webarchive from inside DevonThink, opening it takes a while, and a progress indicator is displayed behind the URL text. This looks to me as though internet access is required to open the webarchive, which reduces my trust in having a local copy of the page.

As I’ve been able to view recently stored webarchives with the internet connection switched off, some web content must have been saved in DT’s database or cached by the web-page viewer. But today, I have found a couple of older webarchives in my DT database that are not being displayed correctly anymore (text is missing, images are missing, new advertisements are being displayed, etc.). This makes me suspect that maybe they never were downloaded fully and were displayed properly only due to caching.

So my questions are:

How can I make certain that DT downloads all information to the local database when I clip to DT (as Safari would if I saved as a Webarchive)?
How can I validate that information has in fact been downloaded and will be available some months from now? (e.g. by disallowing all further internet access when displaying elements of the DT database and by disallowing active content).

Thanks and Greetings

scottlougheed · February 9, 2015, 1:45pm

I can’t specifically account for the length of time it takes to actually clip a web archive, but it isn’t too different from other “clipping” tools like the one for Evernote. That one also takes a bit of time to clip.
The one explanation I can offer is that the Save As… in Safari has the benefit of being able to directly access Safari’s own cache. This means that performing the Save As function could be expedited. I don’t think that Safari gives extensions access to its cache. This means that the clipper has to accumulate the content on its own all over again. This is just my suspicion/speculation.
As for offline access, webarchives are entirely available offline. I just verified this with a dozen or so clips of my own by disabling my internet connection and trying to view them. The progress bar you are seeing likely has to do with the moment it takes to render the webpage.

So, the page is indeed “simply saved”, it just takes a moment for: 1) the content to be re-acquired by DT; and 2) DT to render the content when you go to view it.

Are you finding that there are web archives that you are unable to access without a network connection?
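
One rough way to check an archive for completeness outside of DEVONthink is to look inside the file itself. What follows is only a minimal sketch, not DEVONthink’s own mechanism. It assumes the property list layout that Safari writes (a WebMainResource entry plus a WebSubresources list, each resource carrying WebResourceURL and WebResourceData), and the name missing_resources is made up for illustration. It only inspects img, script and stylesheet link tags in the main HTML, so images referenced from CSS, frames and script-loaded content are not covered.

```python
import plistlib
from html.parser import HTMLParser
from urllib.parse import urljoin


class _RefCollector(HTMLParser):
    """Collects URLs of images, scripts and stylesheets referenced by a page."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.refs.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            if "stylesheet" in (attrs.get("rel") or "").lower():
                self.refs.append(attrs["href"])


def missing_resources(data: bytes) -> list:
    """Return referenced URLs that are NOT embedded in the archive (hypothetical helper)."""
    archive = plistlib.loads(data)  # reads XML and binary property lists
    main = archive["WebMainResource"]
    base = main["WebResourceURL"]
    encoding = main.get("WebResourceTextEncodingName") or "utf-8"
    html = main["WebResourceData"].decode(encoding, errors="replace")

    collector = _RefCollector()
    collector.feed(html)

    # Resolve relative references against the page URL, then compare
    # with the URLs of the resources stored inside the archive.
    wanted = {urljoin(base, ref) for ref in collector.refs}
    embedded = {r["WebResourceURL"] for r in archive.get("WebSubresources", [])}
    return sorted(wanted - embedded)
```

To try it, read a .webarchive file in binary mode and pass the bytes to missing_resources. An empty list suggests that the directly referenced assets are stored in the file; anything listed would presumably have to come from the network or a cache when the archive is displayed.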

korm · February 9, 2015, 3:07pm

When a webarchive is selected in DEVONthink and the contextual menu opened, there are some commands that are useful for managing the webarchives.

“Capture Page” can be used to convert the archive to a PDF – this doesn’t replace the archive, it makes a separate PDF
“Update Captured Archive” refreshes the archive from its source address – the result can be changed images, links, etc. Use with care.

These are easy to experiment with, so you’re familiar with what DEVONthink can do for you.

scottlougheed · February 9, 2015, 8:12pm

I want to also add that I use a combination of Web Archive and PDF, depending on what I need the content for.

The advantage to the PDF is that in some cases with some websites, the WebArchive can render advertisements badly. I have seen the same banner ad repeated 10 times before the content was actually displayed because DT rendered it poorly/incorrectly (or, it rendered it properly but the capture was poor). This is also likely the result of poor web design.

Sometimes PDFs can be a bit smaller too, depending on the content of the website.

So as Korm suggests, play around a bit to see what works best for you.

argoulyle · February 9, 2015, 8:33pm

What I do is: Capture Page --> PDF (One Page).

Best option. 100% local storage. Issue with that is that you can’t really print it (haven’t discovered how to hive off a one page PDF into a multi-page readable A4 PDF), but I hardly print nowadays. Good for tablet as well (one page).

Bill_DeVille · February 9, 2015, 8:48pm

I almost never capture an entire Web page. I’m not interested in the Web designer’s layout of that page. I’m not interested in any content that is irrelevant to the information that I want to capture from that page.

I don’t even use Clip to DEVONthink. Instead, my captures are usually rich text or WebArchive captures of a selected area of the page, using the Services shortcuts to capture that selected area as rich text using Command-) or as WebArchive using Command-%. Those captures are immediate, no waiting. They work on sites that require login, and on secure sites.

I usually capture as rich text, which is the smallest filetype available by a Services keyboard shortcut that can capture formatted text, links, images, tables and lists from an HTML page. Rich text may break the layout of items captured. Once in a while, I may find it useful to preserve layout of images or rows of text and numbers, and so capture the selected area as WebArchive. Or if I want to capture a script posted on our user forum for possible future use, I’ll capture the post as WebArchive.

File size savings: Often, by capturing only selected content of a page such as an article, I save 1, 2 or even 3 orders of magnitude of file size, compared to capture of the full page as PDF or WebArchive.

The most important advantage of excluding irrelevant text from the capture is improved efficiency of searches and of the AI assistants such as Classify and See Also. All too many pages out on the Web hold a lot of text that isn’t relevant to the article I want to capture.

Forget about using Chrome or Firefox to make Services capture of a selected area of an HTML Web page as rich text or WebArchive. They can’t do this. Such captures do work in Safari, DEVONagent Pro and DEVONthink’s browser.

Tip: To select desired content on a page, I usually select upwards from the bottom of the article to its top. Some sites such as Science Magazine make it quick and easy to do that, and source information is included in the selection.

Safari’s Reader button automatically presents a rich text view of the primary content of an HTML Web page. Then press Command-A to select that view and Command-) to capture it as rich text. In most cases where an article has been divided among multiple pages, Reader will automatically load them all.

Even so, sometimes I’ll go through recent captures and clean them up further. Often, articles include “pretty” photos that aren’t informative or necessary and I’ll delete those.

gilby · February 21, 2015, 12:35am

As an Evernote refugee, I feel rather strongly that the clip to Web Archive in DTP is seriously deficient.

In Evernote, a web archive clip:

Is captured exactly as displayed in my browser.
Is displayed untouched and unchecked every time I view it.

In contrast, DTP:

Makes its own capture - including ads and visiting tracking sites that are blocked by browser plugins.
Whenever I view a ‘web archive’, DTP again visits the web page including ads and tracking.

To me:

A) DTP’s behaviour is not what I expect of an Archive; and

B) It breaks privacy protections built into browsers.

I am afraid, Bill, that your suggestion to use formatted text (followed by manual fixup) is far too cumbersome and could be interpreted as an admission that DTP’s Web Archive is broken (or badly designed).

As others have suggested, the PDF format seems to be the best/easiest way to save web pages. And, mostly, I find using the Print and Save as PDF to DTP more reliable than DTP’s clip to PDF.

markwlyon · March 25, 2016, 5:18pm

gilby:

Agreed.

Gucky17 · March 30, 2016, 2:58pm

Gilby:
I have to agree as well.
I would like to know if the process is a technological issue.
Why does DTP open the webpage instead of the archive?
One can use the readability mode, but it sometimes does not capture the right content.

PDF might be a solution, which I don’t like since I sometimes start to follow links on pages I saved. Maybe we could use ad blockers and similar plug-ins in DTP?

Best Regards
Miron

Bill_DeVille · March 30, 2016, 5:01pm

I disagree. I found out early on that Evernote doesn’t meet my criteria for efficient capture of most Web pages. That’s also true of Clip to DEVONthink, for that matter.

If you read my post above about my approach to capture of information from Web pages, it isn’t for the purpose of capturing entire Web pages, but only the portion I want to capture, with no irrelevant images and/or text. Unlike Clip to DEVONthink, the Services I use for capture of a selected portion of a page do not involve a second access to the viewed page content.

That approach has paid dividends in efficiency of searches and performance of AI assistants in my databases, as well as major reductions in database file size. Over the years I’ve captured tens of thousands of papers and articles in my Main research database. Had I always captured full pages as WebArchive or PDF, my database would be at least an order of magnitude larger in file size, and I would get a lot of false positives in searches and lessened performance by the AI assistants such as Classify and See Also. I think that’s a huge payoff in efficiency of use of my database collections.

My Main database does contain many (some of them large) PDFs, such as government agency reports on topics of interest. It contains some plain text books, such as Darwin’s Voyage of the Beagle. Those tend not to contain irrelevant text or images, of course.

I subscribe to several journals and frequently capture papers into my databases. Almost always these captures are as rich text of the selected papers. For example, Science Magazine online gives the reader a choice of viewing/capturing a paper as PDF or as rich text. I choose rich text. In addition to compactness, the full text view often provides information not available in the PDF. I use the OS X Service provided in DEVONthink to capture a selected area of the page as rich text, with a convenient keyboard shortcut. That Service is available in Safari, DEVONagent Pro and DEVONthink’s browser – but is not available in other browsers such as Chrome or Firefox.

I often track news coverage of topics of interest, such as the recent lead pollution issues at Flint, Michigan. News pages, if captured as a whole, can be thousands of times larger than capture of the desired article only, and can contain a lot of irrelevant text. I use the rich text Service to capture only the selected article.

I rarely capture as WebArchive, although there are exceptions. For one thing, the WebArchive file format has been the least stable filetype created by Apple. It has at times suffered from long-standing bugs in WebKit, so that some older WebArchives are broken after an OS X upgrade. It is flummoxed by some sites that use dynamic graphics, and such graphics should be avoided, usually by selection of an area that excludes them.
However, I do sometimes use a Service provided in DEVONthink that restricts WebArchive capture to a selected area of a Web page, using the Command-% keyboard shortcut. For example, one of my databases captures potentially useful scripts posted on our user forum. These scripts are usually enclosed in a “box” with the Select All heading that allows copy of the script to the clipboard. Capture of such posts is done as WebArchive, which allows retrieval of the enclosed script. I would also use this Service in cases such that preservation of the article’s layout of elements such as graphics and tables is critical.

ipanini · April 1, 2016, 9:37pm

May I ask how exactly you get to this context menu?
I have tried DTPO as 3 panes, then I select a web archive (name field) and right click.
I also tried to open DTPO browser and looked around there (I’m not familiar with DTPO browser) but could not find the same context menu…

Thx

Bill_DeVille · April 1, 2016, 10:54pm

To see the contextual menu options korm noted, select/open the WebArchive and Control-click in the pane in which it is displayed.

ipanini · April 2, 2016, 5:51am

Thx for explaining.

I myself also ran into the situation that:

you are logged in to a webpage
you clip to archive with clipper
-> since you are logged in, the clipping works

But when you revisit the web archive, you don’t get to see it correctly as the page refreshes and you need to be logged in.

Hence, our question: Is this an archive then? If so, why does it need refreshing?

As an aside: I still use a remainder from my windoze period from time to time:
Firefox in combination with the excellent Scrapbook extension, which does exactly what we mere mortals seem to expect from DTPO’s web archive…
(I’ve always wanted to repurpose these clippings to act as wiki-like items, but have never got around to doing it…)

As a final note:

DTPO’s web archive is not trustworthy, which is a shame and I hope it will evolve to at least the level of FF & Scrapbook extension.
Thanks a lot to Bill & Korm for sharing their insight, I believe there is a lot to be said for the other ways of clipping.

As a side note:

I quickly tested clipping RTF to DTPO (from the forum)
-> the text becomes light grey and I can’t seem to change it to black or anything else… I can only modify the font…
Is there some hidden way? Preferably a setting to always convert a clipping to xxx when using keyboard shortcut YY?

Thank you all for helping!

korm · April 2, 2016, 11:05am

Clipping whole web pages to RTF is going to be inaccurate. Web archives can process the CSS that controls the styling of a page. RTF cannot process that, so I wouldn’t expect faithful reproduction of whole pages clipped to RTF. Clipping sections as RTF with the Take Rich Note service, etc., might give a better result, but YMMV.

This forum is one of the most challenging sites to clip from because of the messy combination of phpBB’s styling and the skins in place here. I never clip archives when I want a full page from this forum. I click the print preview icon (tiny little icon in the upper right corner of the screen in either of the skin options for the forum) and then print the page to PDF from the browser.

gg378 · April 3, 2016, 7:41pm

Is there any evidence that it is untrustworthy, in the real sense of returning faulty materials? We are not talking about simply not getting what the user expects. That’s something different.

As far as I can tell, the whole issue can be boiled down to one fact, namely that the clipping extension tasks the DT-internal browser with the clipping. Then, it seems to me, everything follows naturally, concerning ad rendering/blocking, logins etc. That particular mechanism is not necessarily the best solution (now and then I run into issues with password protected sites etc), but that’s different from “faulty” or “wrong”. It’s a certain, legitimate, take on this functionality.

In terms of the discussion about how to clip, we’re going in circles, I fear. The best clipping method depends dramatically on the kind of information you try to round up. Clearly, Bill is served very well by extracting text. For situations where this is possible, this is clearly best, as the archived info will be very compact and searches on it will be to the point and quick.

My work requires mostly retrieving pdf-based articles, so clipping is more auxiliary for me. I do it mostly with interesting articles (web-based, such as blogs, not pdfs), or product pages. In that case, I cannot generally clip to text, as the information is not linear, i.e. sidebars with “more information” links etc. What I want in these cases is an as-complete-as-possible representation of the webpage in a highly archive-proof format. To me, that’s pdf. And to read those materials in DTTG, the “one page pdf” format is superb. Often, after clipping, I open the pdf in DT with Preview and crop away the ad-infested sidebars and other useless stuff, mostly at the bottom. In-body ads I simply don’t mind.

While pdf does not seem the best match for archiving webpages, it must be noted that there seems to be no universal mechanism to save the full content of a webpage. Webarchives are Apple-only. And sadly, Apple cannot be trusted with archive-proof formats. That’s not their take on life (see AppleWorks, Pages, Keynote, Aperture …). I would never trust any information that I need to keep long-term to the Webarchive format or any other one-vendor, let alone “no open source implementation” format.
From Wikipedia:

So Apple is not even extending webarchive to iOS! That tells you a lot. I’ve never tried, but presumably DTTG will not render them for us? For me, that’s a definite no-go.

While there can be situations where you need to stick with a proprietary format, I think it is fair to say that the rule “archive only to archive-safe formats” is part of “DT 101”.

Another random thought on clipping through a mainstream browser: I’ve had so many problems, especially with Firefox, with extensions that constantly break on upgrade. Several of my favorite FF extensions were discontinued by their devs, stating the impossibility of keeping up with the new, fast and furious upgrade cycle of FF. DT would have to keep track of at least FF, Safari, and Chrome, and probably wisely decided to let their own web services do the clipping, something they understand and control.

gg378 · April 3, 2016, 7:47pm

The pdfs created by DT web clipping extensions preserve the links, you can click them!

gg378 · April 3, 2016, 7:49pm

Print through something like Mindcad Tiler:
macupdate.com/app/mac/20637/tiler

gg378 · April 3, 2016, 7:59pm

Showing my ignorance: How is Evernote saving its representation of a webpage? According to Wikipedia, the webarchive format is Apple-only, with a Windows version only through Safari. Does EN simply store the website as a folder with all the original files?

korm · April 3, 2016, 11:19pm

Yes, Evernote stores the pieces of the page in a folder containing the html (or xhtml) plus image files and other assets. This will vary by site depending on the assets used to create that page. Webarchives are single-file property list documents (XML or binary) with embedded binary data for certain assets, and links to others.
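
To see that single-file structure for yourself, a short Python sketch can list what a webarchive embeds. This is an illustration under assumptions, not a description of how DEVONthink reads these files: it relies on the key names Safari uses as far as I know (WebMainResource, WebSubresources, WebResourceURL, WebResourceMIMEType, WebResourceData), and summarize_webarchive is a hypothetical helper name. Safari typically writes the file as a binary property list; plistlib reads that and the XML form alike.

```python
import plistlib


def summarize_webarchive(data: bytes) -> dict:
    """Describe the main page and the embedded assets of a webarchive (hypothetical helper)."""
    archive = plistlib.loads(data)  # auto-detects XML vs. binary plist
    main = archive.get("WebMainResource", {})
    subresources = archive.get("WebSubresources", [])
    return {
        "main_url": main.get("WebResourceURL"),
        "main_bytes": len(main.get("WebResourceData", b"")),
        # One (url, mime type, size in bytes) tuple per embedded asset.
        "subresources": [
            (
                res.get("WebResourceURL"),
                res.get("WebResourceMIMEType"),
                len(res.get("WebResourceData", b"")),
            )
            for res in subresources
        ],
    }
```

Reading a .webarchive file in binary mode and passing the bytes to summarize_webarchive gives a quick picture of which images, stylesheets and scripts were captured and how much space each one takes.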

gvbarnes · June 16, 2021, 10:59am

I’ve been saving webpages from the iPad to Markdown (no clutter) and it works well. This may be a different topic, but is there something I’m not getting? I tried web archive and PDF but I like the look and the usability of Markdown better. (I also see this is an older thread.)