Add Web Document - inconsistent results and feedback

wolfjo · March 28, 2009, 4:59pm

Hi all,

I sometimes like to capture web pages as they are, with pictures inline, etc.

I have been using the “Add Web Document” script with DTPro Office and Safari, with very inconsistent results. Sometimes the icon bounces to indicate that it has worked, sometimes not. Sometimes it captures the page as is, and sometimes it reverts to some sort of login page if the site has a login component. Sometimes nothing at all is captured, and there is no notice from DTPro that the function has failed.

Do other people have better strategies for capturing web pages as they are, or at least capturing the contents in a richer format? Someone on the forums mentioned using Services to do this, but this seems like a roundabout solution (I would still like to hear about it, of course!).

My current solution to guarantee a capture of the page is to print it via the print function to a PDF, but this requires three steps and takes too much time. Perhaps a script that prints the PDF directly to the open database would be helpful here.

Any other ideas, or experience with this problem? I find myself struggling with the capturing of web content with DTPro still, and I feel like this shouldn’t be the case - it should be easy, 100% reliable, and not have to be thought about much since so much research is done this way today.

TIA for any help/suggestions.

Edit - here is an example URL that would not capture a web doc, and no error message:

narsad.org/news/press/rg_200 … 02-09.html

Bill_DeVille · March 28, 2009, 5:08pm

Why not try the available Service to capture a WebArchive of the page viewed in Safari?

First, select the content – all or a desired portion – of the viewed page. Now press Command-% (Shift-Command-5) to capture the selected area as a WebArchive.

To capture the entire page, the sequence Command-A followed by Command-% does it.

wolfjo · March 28, 2009, 5:48pm

Hi Bill,

Thanks for the reply. This seems to work more consistently - thanks for the suggestion.

Can I ask why this is different than the script that is designed to do this?

Also, why do you need to select everything - isn’t there a service that captures a web archive of the page without selection?

Thanks again for your help - just trying to understand the best way to use this for capturing web content for offline viewing…

Bill_DeVille · March 28, 2009, 6:26pm

The selection step “adds value” if you don’t want to capture ads and other extraneous content on a page. Just select the portion of the page that you wish to put into your database.

I do almost all captures as rich text capture of selected text and images.

One of the reasons scripts are provided is that some browsers, such as Firefox, can’t use OS X Services. Using the script to capture a Web Document, it’s possible to capture a WebArchive from Firefox. In that case, however, there’s a significant time penalty. The Services capture of a WebArchive is virtually instantaneous, whereas the script capture of a WebArchive from Firefox requires a redownload of the page and may take seconds.

wolfjo · March 28, 2009, 6:34pm

Thank you, Bill - just so I understand it, there is no way to capture the page using Services without selecting something first?

I presume that is a function of the OS, and not DTPro…

Bill_DeVille · March 28, 2009, 7:40pm

That’s correct. If you wish to capture the whole page, just rapidly invoke Command-A and Command-% in succession. If you wish to capture a selected portion of a page, “draw” the selection area, then invoke Command-%.

I prefer capturing a selected portion of most pages, e.g., an article in Science Magazine, as that pays dividends in improving the focus of searches and See Also operations in my database by eliminating extraneous material that’s not related to the item of interest. Many sites make it easy to select just the content of an article. Start at the end of the article and swoop upwards to make the selection. Other sites, e.g., The New York Times, provide a ‘printer-friendly’ version of multipage articles, which can be selected as a whole for capture.

Over the years I’ve captured tens of thousands of scientific papers, news articles and so forth in this way, and they are a very valuable reference collection for my interests. Most of my captures are as rich text including text, images and tables. Of course, I’ve also got many longer reports in PDF format as well.

sjk · March 28, 2009, 10:05pm

When I captured that page using “Add web document to DEVONthink” from the script menu it worked fine for me. And I’ve never noticed any trouble with it capturing other pages.

wolfjo · March 28, 2009, 11:42pm

Wow - I’m surprised, since this is happening regularly on both of my machines. The workaround that Bill has suggested is very useful, and allows for the Rich Text capture as well, which is great (I had only been using the plain text script, which is not as useful for scientific documents, as you know, Bill).

Do either of you know why that script might get hung up?

Thanks for the help, Bill, and the feedback, sjk - that’s really a great help in troubleshooting… I will try a few things to see what the problem is.

Bill_DeVille · March 29, 2009, 12:42am

For the record, the narsad.org page cited downloaded as a WebArchive, using the script to capture as a Web document, on my computer.

But this illustrates my point about capturing just the portion of interest.

The WebArchive of the whole page had a size of 235.5 KB.

I did a rich text capture of the article on that page. The RTFD document had a size of 7.1 KB.
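
To put those two sizes side by side, here is a minimal Python sketch of the arithmetic. Only the figures 235.5 KB and 7.1 KB come from the post above; the function names are hypothetical and are not part of DEVONthink or any of its scripts.

```python
def size_ratio(full_kb: float, selection_kb: float) -> float:
    """How many times larger the full-page capture is than the selection."""
    return full_kb / selection_kb


def percent_saved(full_kb: float, selection_kb: float) -> float:
    """Percentage of storage avoided by capturing only the selection."""
    return (1 - selection_kb / full_kb) * 100


if __name__ == "__main__":
    webarchive_kb = 235.5  # whole-page WebArchive, as reported above
    rtfd_kb = 7.1          # rich text capture of the article only

    print(f"ratio: {size_ratio(webarchive_kb, rtfd_kb):.1f}x")
    print(f"saved: {percent_saved(webarchive_kb, rtfd_kb):.1f}%")
```

On these figures the full-page WebArchive is roughly 33 times the size of the article-only capture, so the selection avoids about 97% of the storage for this one page.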

sjk · March 29, 2009, 4:19am

Selecting the main article in the center column and invoking the Command-% “Capture Web Archive” service shortcut created a 6.1KB Web Archive document. The Command-) “Take Rich Note” capture of the same selection is 5.1KB, and would be my choice in this case because the saved document looks almost identical to the Web Archive version. Even the formatting/readability of the full-page RTFD capture is quite acceptable to me so there’s no major advantage with the larger WA version. But for other captures I definitely prefer WA over RTF(D).

Oh, the printable PDF version of that page is a bit larger than the WA captured version.