Having wrestled with this issue just yesterday, I would like to share my impressions.
First off, try searching the forums for the keyword “paywall”; you will find multiple threads addressing this topic in depth. To sum up, paywalled websites use various strategies to keep non-human user agents from scraping their content. This is understandable and unavoidable, but it does make our legitimate research task slightly more challenging. There are several ways to get around this, each with its own usability tradeoffs. You will have to explore the options and use the one that works best for you. Here are some I tried:
Clip to DEVONthink browser extension (Chrome) - This seems easy to use, but it does not capture paywalled content: the clipper attempts to open the page in its own unauthenticated instance of a browser engine for collection, so the content behind the paywall is not available to it. Some threads discuss authenticating your paywalled accounts within the DEVONthink browser to work around this, but I never got that working. In the end, I looked elsewhere for a solution.
Save PDF to DEVONthink Pro - A serviceable option, but a bit fussy to initiate considering the command is buried in the system print menu. Also, many point to the fact that ALL page content is saved, including ads and links to unrelated articles, which can cause problems with DEVONthink’s AI when searching and looking for connections.
Bookmarklets - Found here: http://www.devontechnologies.com/download/extras-and-manuals.html These ended up being the most useful to me. The “Selection” bookmarklet grabs ONLY the selected text along with the source URL. It’s the most minimal approach. I also liked the “HTML” bookmarklet, but it suffers from the same problem as the “Save PDF to DEVONthink” option, since the entire page is collected.
In the end, I modified the “HTML” bookmarklet into something I call “Selection to HTML”, which saves as HTML the lowest-level DOM element that contains the selection on the page. Most of the time it captures a little more than your selection (because what you can select on a page does not map neatly onto the programmatic structure of the page, the DOM), but it tends to avoid the sidebars and other unnecessary junk. The nice thing is that links are preserved and usable in the archived chunk, which is not the case with the text-only “Selection” bookmarklet, and yes, you still get a URL to refer to if you want to revisit the source later on. I’ll paste the code below if you want to use it yourself:
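A minimal sketch of how a “Selection to HTML” bookmarklet along these lines could be written follows. It is an illustration of the idea, not necessarily the exact code: the function names are made up for this example, and the `x-devonthink://createHTML` URL with `title`, `location` and `source` parameters is assumed to match what the stock “HTML” bookmarklet uses, so check it against your own copy before relying on it.

```javascript
// Return the lowest element that fully contains the given Range.
// (Hypothetical helper name, chosen for this sketch.)
function lowestContainingElement(range) {
  var node = range.commonAncestorContainer;
  // A selection inside a single text node yields a text node (nodeType 3),
  // which has no outerHTML, so climb until we reach an element (nodeType 1).
  while (node && node.nodeType !== 1) {
    node = node.parentNode;
  }
  return node;
}

// Build the hand-off URL. ASSUMPTION: scheme and parameter names are taken
// to match the stock "HTML" bookmarklet; verify against your installed copy.
function buildDevonthinkUrl(title, location, html) {
  return 'x-devonthink://createHTML' +
    '?title=' + encodeURIComponent(title) +
    '&location=' + encodeURIComponent(location) +
    '&source=' + encodeURIComponent(html);
}

// Turn the current selection into a DEVONthink URL, or null if nothing is selected.
function selectionToHtmlUrl(win) {
  var sel = win.getSelection();
  if (!sel || sel.rangeCount === 0 || sel.isCollapsed) {
    return null;
  }
  var el = lowestContainingElement(sel.getRangeAt(0));
  if (!el) {
    return null;
  }
  return buildDevonthinkUrl(win.document.title, win.location.href, el.outerHTML);
}

// Only runs in a browser: navigate to the URL so DEVONthink picks it up.
if (typeof window !== 'undefined') {
  var url = selectionToHtmlUrl(window);
  if (url) {
    window.location.href = url;
  }
}
```

To use something like this as a bookmarklet, strip the comments, wrap the whole thing in `javascript:(function(){ ... })();` so it does not leak names into the page, and collapse it onto one line. Very large elements can produce URLs too long for the browser to hand off, so smaller selections are more likely to work.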
To install it, just right-click an existing bookmarklet and select “Edit…”. Paste the above code into the URL field and name it something like “Selection to HTML”. Then try it out by selecting some text in a paywalled article and clicking the bookmarklet.
Hope this helps!