Clip a Web Page

bleeckerj · October 11, 2010, 7:42pm

So — the first thing I tried after not using DEVONThink for a few years was in the Pro version. I was on a web page on the New York Times that had a story I wanted to add to my new database. In Safari when I click the seashell, I assume I want the webarchive but all I get in my database is a log on page for the New York Times. I’d rather just have what I saw on the page. (Evernote, which I’m not comparing DT to, simply puts the web page into my notes repository, which is what I’d expect.)

Am I missing something? I basically just want an offline version of that article in my database.

I’ll own up that I haven’t used DT since maybe 2006.

Julian Bleecker

padillac · October 11, 2010, 9:30pm

Sometimes DEVONthink has trouble capturing web archives of pages that are behind a login screen. That’s why you get a copy of the login page rather than the article you want.

Instead you could try printing to DEVONthink by choosing “Save PDF to DEVONthink Pro” in the PDF dropdown in the lower left corner of the print dialog box.

Another option is to select the part of the page you want and clip it as RTF to DEVONthink. There are a few ways you can do this:

a) with your mouse over the highlighted portion, press and hold the left mouse button for a second. The text now becomes a draggable block, and you can drag it into your DEVONthink window, the sorter, or the dock icon.

b) right-click on the selected text, and choose Services->Take Rich Note (need to have installed the services from Help->Install Add-ons in DEVONthink)

c) press the global shortcut command-) (command-shift-0) - this also requires that the services be installed

My preference is to set up command-) and just use that. Quick and easy and it goes right into DEVONthink’s global inbox for later sorting. Plus it only gets the part of the page I’m interested in, not all the noise like nav menus or banners. When taking a rich text note DEVONthink will capture the URL too so you can easily get back to the original web page if necessary.

korm · October 11, 2010, 10:49pm

The Reader feature in Safari 5 works quite well for most NYT articles, and often manages to get inline images as well.

Arc90 Lab’s Readability bookmarklet does well, too, but be sure to click the Single Page option on NYT articles first.

If you create a bookmark in DT to open NYT in a browser window in DT, and log in there, then you’ll avoid the problem and can make your webarchives, bookmarks, PDFs, etc., directly in DT.

bleeckerj · October 12, 2010, 4:33pm

Hmmm…I’m still having some issues. I’ve tried having a page within DT “logged into” NYT…but whenever I grab a page — it appears in DT as the login page…and I can’t log in from that either.

I’ve uploaded a video of what I’m doing:

youtube.com/watch?v=wJ4JJBxUQQs

If anyone has any other ideas on what to try that would be great to hear. It feels like I’ve hit a road block for myself for using DT straight away. I would imagine myself often grabbing pages from the New York Times for adding to my database.

It seems to work fine with Firefox, btw — I just tested it. But, I don’t know if I’m going to change my browser preferences under these circumstances. I just would like a quick way to take notes without the domino effect of changing too many other ways that I do my research and work. That’d be a bit like changing to AT&T just because they have the iPhone…you suffer alternative consequences.

Julian

korm · October 12, 2010, 5:16pm

I clip dozens of articles daily from NYT to DTPO, using the methods I mentioned above.

It’s also the case that NYT seems to change its security infrastructure frequently. For example, today it looks like all pages except the home page make it appear that one is not logged in. And so the DT Safari extension is having no trouble at all, at least not right at this moment (12 October 2010).

bleeckerj · October 12, 2010, 7:03pm

Okay — so you mean the DT plugin is “working” insofar as it’s doing all it can technically speaking, and from the perspective of me the human, it’s not working in that it is not capturing the page I am looking at on the New York Times in Safari?

padillac · October 12, 2010, 7:39pm

Have you tried taking an RTF note? That shouldn’t be affected by NYT’s security.

bleeckerj · October 12, 2010, 11:18pm

Yeah, I tried as a text clip, which works. But – I mean, I’m evaluating at this point, so just seeing what does and doesn’t work? And, basically the webarchive doesn’t work on the New York Times. I’m not sure what others are doing to make it work, but I’ve tried on a few browsers, etc. I wish this was a bug with me, but I fear it may be that the software is challenged somehow to capture what is displayed in the browser and store it as a webarchive in DT.

sjk · October 13, 2010, 3:25am

And The Printliminator bookmarklet can help clean up pages for archiving/printing.

Bill_DeVille · October 13, 2010, 2:26pm

Strictly speaking, the issue you experienced in trying to capture a WebArchive of a page displayed on the New York Times website is not a bug. The NY Times may refer an attempt to reload the page to their login page, and this will happen for almost all Bookmarklet capture attempts. Such a refusal to allow “second access” by a user is still more pronounced on secure sites such as an online banking page. So the issue is related to a Web site’s login and security measures.

For capture as WebArchive, you might try a couple of alternative capture options: 1) use “Save As” to the “Inbox” place in the Finder, or 2) select all or a portion of the page and try the keyboard shortcut for the Capture as WebArchive Service, “Command-%”. One or both of these may work on some sites that have login/security procedures.

There are some sites at which you will find it impossible to make a capture as WebArchive. What will usually work in such a case is capture of a selected area of the page as a text note, or “printing” the page as PDF.

korm · October 13, 2010, 2:54pm

Thanks Bill. Maybe that writes “paid” to this topic.

sjk · October 13, 2010, 4:32pm

Another option is the Add web document to DEVONthink script, which can be invoked with a shortcut (e.g. using FastScripts).

bleeckerj · October 14, 2010, 4:48am

Awesome. That’s great — thanks for the patient help and suggestions Bill & SJK.

Julian Bleecker

pinax · December 18, 2010, 6:36pm

I just had this very problem with an NYT article and am glad I found this thread. I guess what confuses me is why a browser extension or a “PDF to DT” bookmarklet has to do things differently than Print > PDF > Save PDF to DT. The former triggers a login screen; the latter just… sends the content in a PDF to DT… Why the runaround with the super-simple push-1-button option? Now I need to remember a list of sites like NYT where the DT extensions/bookmarklets won’t work so I don’t wind up with an Inbox full of saved login screens. Selecting, clipping, having to remember Cmd-Shift-5 for a Service that probably isn’t installed in my Services menu anyway, dealing with layers of other bookmarklets, having to pick a different file format—all that seems like it should be unnecessary when all I want is to get the content in front of me in one window to get sent to another app in a form that is searchable/indexable/etc. If Print > Save PDF > Save PDF to DT works with its various clicks and steps, why can’t the PDF to DT bookmarklet?

On a related note, one of the great things about saving stuff to DT is that it usually also captures the URL, and having that link to the source is really important to me. While the URL usually gets saved, sometimes it doesn’t. Can anyone tell me which methods of capturing Web data do not keep the URL—so I can avoid them?

sjk · December 18, 2010, 6:56pm

Bookmarklets and extensions query the target URL/site again, which may require login authentication. Printing/saving a browser page directly prints/saves what’s currently displayed, without any querying. I think something like that is the difference.

Bill_DeVille · December 20, 2010, 4:43pm

That’s it, precisely!
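
For anyone who finds this thread later, the difference described above can be sketched as a toy model in Python. This is only an illustration, not how DEVONthink or the NYT site is actually implemented: `fetch`, `Browser`, `capture_by_url`, and `capture_displayed` are hypothetical names, and the "site" is a stand-in function that serves the article only when a valid session cookie is sent.

```python
from dataclasses import dataclass, field

LOGIN_PAGE = "<html>Please log in</html>"
ARTICLE = "<html>Full article text</html>"


def fetch(url: str, cookies: dict) -> str:
    """Hypothetical stand-in for a site that requires login.

    Serves the article only when the request carries a valid session cookie;
    otherwise it answers with the login page.
    """
    if cookies.get("session") == "valid":
        return ARTICLE
    return LOGIN_PAGE


@dataclass
class Browser:
    """Toy browser: remembers its cookies and the content it last rendered."""
    cookies: dict = field(default_factory=dict)
    url: str = ""
    displayed: str = ""

    def open(self, url: str) -> None:
        self.url = url
        self.displayed = fetch(url, self.cookies)


def capture_by_url(url: str) -> str:
    # Assumption: the capture tool requests the URL again on its own,
    # without the browser's session cookies.
    return fetch(url, {})


def capture_displayed(browser: Browser) -> str:
    # Print/Save PDF style capture: reuse what is already on screen,
    # with no second request to the site.
    return browser.displayed


browser = Browser(cookies={"session": "valid"})
browser.open("https://example.com/article")

print(capture_by_url(browser.url) == LOGIN_PAGE)   # second request, no session
print(capture_displayed(browser) == ARTICLE)       # content already rendered
```

In this model the capture that asks the site for the page a second time ends up with the login page, while the capture that reuses the rendered content gets the article, which matches what people saw with the bookmarklet versus Print > PDF.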