Webpage capture full content

Hi all,

What’s the best way to capture content from a website that requires a login to see the content? For instance, say I want to capture a page on LinkedIn that I’m currently logged into in Safari. No matter which format I use, I never get the content I’m viewing in Safari; what I do capture just tells me I need to log in to the page. Is there a way to log in to a webpage via DTP or DA, and then capture the full content that I’m actually seeing on the screen?

Thanks

You could indeed add a bookmark to DEVONthink, log in afterwards if necessary, and capture the page later. But there are more options:

  1. Print a PDF to DEVONthink
  2. Save the webpage as a webarchive to the global inbox
  3. Use Take Plain/Rich Note services or drag & drop after selecting the interesting part of the page
  4. Use DEVONagent Pro

All of these options retain both the URL & title.

I tried clipping to a PDF, which didn’t work, but I hadn’t thought to try printing to a PDF, which did work well.

I don’t like the idea of a Webarchive any longer, as they’re dynamic and I wanted something static.

The Take Plain/Rich Note option also worked, but it’s really ugly.

So the best option was to save as a non-paginated PDF directly from DEVONagent Pro, which worked like a charm.

Thank you!

Having just wrangled with this issue yesterday, I would like to share my impressions.

First off, try searching the forums on the keyword “paywall”, and you will see multiple threads addressing this topic in depth. To sum up, paywalled websites employ various strategies to block non-human user agents from scraping content from their sites. This is understandable and unavoidable, but it does make our legitimate research task slightly more challenging. There are several ways to get around this, each with its own usability tradeoffs. You will have to explore the options and use the one that works best for you. Here are some I tried:

Clip to DEVONthink browser extension (Chrome) - While seemingly easy to use, paywalled content is not available to the user agent, as it attempts to open the page in its own unauthenticated instance of a browser engine for collection. Some discuss methods of authenticating your paywalled accounts within the DEVONthink browser to work around this, but I never reached a point where this was functioning for me. In the end, I looked elsewhere for a solution.

Save PDF to DEVONthink Pro - A serviceable option, but a bit fussy to initiate, considering the command is buried in the system print menu. Also, many point out that ALL page content is saved, including ads and links to unrelated articles, which can cause problems with DEVONthink’s AI when searching and looking for connections.

Bookmarklets - Found here: http://www.devontechnologies.com/download/extras-and-manuals.html. These ended up being the most useful to me. The “Selection” bookmarklet grabs ONLY the selected text along with the source URL; it’s the most minimal approach. I also liked the “HTML” bookmarklet, but thought it suffered from the same problem as the “Save PDF to DEVONthink” option, since the entire page was collected.

In the end, I modified the “HTML” bookmarklet into something I call “Selection to HTML”, which saves as HTML the lowest-level DOM element that contains the selection on the page. It usually grabs a little more than your selection (due to the complicated relationship between what you can select on a page and its programmatic structure, the DOM), but tends to avoid the sidebars and other unnecessary junk. The nice thing is that links are preserved and usable in the archived chunk, which is not the case with the text-only “Selection” bookmarklet, and yes, you still get a URL to refer to if you want to revisit the source later on. I’ll paste the code below if you want to use it yourself:


javascript:(function(){var n=window.getSelection().getRangeAt(0).commonAncestorContainer;if(n.nodeType!==1){n=n.parentElement;}window.location='x-devonthink://createHTML?title='+encodeURIComponent(document.title)+'&location='+encodeURIComponent(window.location.href)+'&source='+encodeURIComponent(n.innerHTML);})();

  • Just right-click an existing bookmarklet and select “Edit…”. Paste the above code into the URL field and name it something like “Selection to HTML”. Then try it out by selecting some text on a paywalled article and clicking the bookmarklet.
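For anyone who wants to see what the bookmarklet is actually assembling, the URL-building step can be sketched as a small standalone helper. `buildCaptureURL` is a hypothetical name for illustration only; in the real bookmarklet the title, location, and HTML come from `document.title`, `window.location`, and the selection’s common ancestor element on the live page.

```javascript
// Hypothetical helper mirroring the x-devonthink URL the bookmarklet
// constructs. Each piece is percent-encoded so the HTML chunk survives
// the trip through the URL scheme intact.
function buildCaptureURL(title, location, html) {
  return 'x-devonthink://createHTML' +
    '?title=' + encodeURIComponent(title) +
    '&location=' + encodeURIComponent(location) +
    '&source=' + encodeURIComponent(html);
}

// Navigating to the resulting URL is what hands the chunk to DEVONthink.
var url = buildCaptureURL('My Article', 'https://example.com/post',
                          '<p>Selected paragraph</p>');
```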

Hope this helps!
z


Interesting tip. Thanks.

Hello everyone,
I have tried to look for answers elsewhere but I’m still scratching my head…
I’d like to download the content of a single webpage (images of a book available for online reading) in almost any format (PDF, RTF, images…). Here’s one of the webpages I’m referring to:
digitalna.nb.rs/wb/NBS/Knjige/zbirka_knjiga_Bore_Stankovica/II_5066_008#page/4/mode/1up

This webpage is very dynamic (e.g. it loads content while scrolling) and doesn’t support printing either, so it’s impossible to capture automatically (and that’s probably intended by the makers of the website).
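One general trick for pages that load content while scrolling (not specific to this site, and no guarantee it defeats any particular viewer) is to force the lazy loading to finish before you try to capture: paste a small scroll loop into the browser’s JavaScript console, wait for it to reach the bottom, then print to PDF. A minimal sketch, with the stop condition split into a pure helper:

```javascript
// Pure helper: have we scrolled far enough to reach the page bottom?
function reachedBottom(scrollY, viewportHeight, pageHeight) {
  return scrollY + viewportHeight >= pageHeight;
}

// Browser-only part: scroll one viewport at a time every 250 ms so the
// page has a chance to load each new chunk, stopping at the bottom.
if (typeof window !== 'undefined') {
  var timer = setInterval(function () {
    window.scrollBy(0, window.innerHeight);
    if (reachedBottom(window.scrollY, window.innerHeight,
                      document.body.scrollHeight)) {
      clearInterval(timer);
    }
  }, 250);
}
```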

@zwiggy, that’s an awesome tip. Thank you so much for sharing that!