Capture content from subscribed content (Süddeutsche Zeitung)

Maria · July 18, 2023, 9:29am

I remember there were earlier discussions about paid content from the New York Times etc. I have the same problem with FAZ, Neue Zürcher, and Süddeutsche Zeitung.

It is impossible to capture readable web content from pages that I have subscribed to. The newspaper administrations do not understand the problem; they just say “Of course you can copy content to your computer” and miss the issue of formatting. Evernote was great with that.

I would like Devonthink to be able to capture any content from pages I have paid for. I have set the access in Devonthink as well, but only get cut off pages or garbled PDFs.

cgrunenberg · July 18, 2023, 11:14am

See e.g. Capturing Medium articles to WebArchive - #8 by cgrunenberg

thother · July 18, 2023, 12:00pm

FWIW, I often find that non-WebKit browsers do a better job of printing to pdf. Open in Firefox/Chrome —> print —> drag into Dt is sort of an annoying workflow, but I recommend trying it if you’re having trouble saving content as pdf.

Fundamentally, though, there will always be problems and edge cases here: even when they aren’t actively hostile, the people designing websites don’t exactly have design for print at the front of their minds.

thother · July 18, 2023, 12:01pm

Another thing to try is activating the readability function of your browser before printing.

mhucka · July 18, 2023, 3:52pm

Saving an exact representation of a web page turns out to be technically challenging for a variety of reasons (it’s a recurring topic in this forum). That said, some general principles can help with many cases. I myself save PDFs of web pages on an almost daily basis and these days have encountered essentially zero sites that I can’t save faithfully using one or another approach, as follows.

First, we can open a web page in DEVONthink itself using its built-in web browser, and log in to a website in DEVONthink’s browser, and that login session usually will persist for some time thanks to information stored by the site (known as “cookies”) in DEVONthink’s browser. How long this login session lasts depends mostly on the site and mostly not on DEVONthink. This is useful to keep in mind when sending URLs to DEVONthink for websites that have logins and paywalls. When we send a URL from an external browser like Safari to DEVONthink to be saved as (e.g.) PDF, the process of generating the PDF happens in DEVONthink, which is a different program (and thus environment) than the browser where we had the page open at the beginning; if we don’t log in to the website within DEVONthink itself at some point before sending the page to DEVONthink, then DEVONthink won’t be able to “see” the same content that we see in our (external-to-DEVONthink) web browser because it will face the restrictions imposed by the site on users who are not logged in.
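To make the point about separate login sessions concrete, here is a toy model. It assumes nothing about how Safari or DEVONthink actually store cookies: the class, the function, and the site name are all hypothetical, and no real browser API is used. It only illustrates that a login in one application's cookie store does not help another application fetching the same URL.

```javascript
// Toy model only: names are hypothetical, no real browser or DEVONthink API is used.
class CookieJar {
  constructor() {
    this.cookies = new Map(); // site -> session token
  }
  logIn(site) {
    this.cookies.set(site, 'session-token');
  }
  hasSession(site) {
    return this.cookies.has(site);
  }
}

// A paywalled site serves the full article only to a client holding a session.
function fetchArticle(site, jar) {
  return jar.hasSession(site) ? 'full article' : 'teaser + paywall notice';
}

const site = 'newspaper.example';
const externalBrowser = new CookieJar();   // e.g. Safari
const devonthinkBrowser = new CookieJar(); // DEVONthink's built-in browser

// Logging in only in the external browser...
externalBrowser.logIn(site);
const seenInBrowser = fetchArticle(site, externalBrowser);
// ...does not help the other application when it fetches the same URL.
const capturedBeforeLogin = fetchArticle(site, devonthinkBrowser);

// After logging in inside the second application too, it gets the full page.
devonthinkBrowser.logIn(site);
const capturedAfterLogin = fetchArticle(site, devonthinkBrowser);
```

In this sketch the capture made before the in-app login gets the teaser, while the one made after it gets the full article, which mirrors the behaviour described above.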

For some newspaper sites that I visit, I find I have to log in maybe once in a couple of months and then subsequent saves in DEVONthink work well. For academic journal/publication sites, I have to use a VPN session to make my at-home computer appear to be on my academic institution’s network, and then the academic journal websites automatically allow access to the pages.

Second, DEVONthink has a browser extension (Clip to DEVONthink) that lets you easily send a URL from your current web browser page to DEVONthink, to be saved in a format of your choice. An alternative to that is the bookmarklet for Clip to DEVONthink, which accepts arguments, and can be used to assign a tag to the item when it’s created in DEVONthink. I use this latter method, and also, by putting the bookmarklet in Safari’s “favorites” bar, I can invoke this bookmarklet via a keyboard shortcut rather than have to click on the Clip to DEVONthink extension icon. FWIW, here’s my bookmarklet definition:

javascript:window.location='x-devonthink://createBookmark?title='+encodeURIComponent(document.title)+'&location='+encodeURIComponent(window.location)+'&referrer='+encodeURIComponent(document.referrer)+'&tags=%CF%80-convert-to-pdf';
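For anyone who wants to adapt the tag, the URL that this bookmarklet assembles can be sketched as a small function. The function name and its options object are my own invention for illustration. Only the `x-devonthink://createBookmark` scheme and its `title`, `location`, `referrer`, and `tags` parameters are taken from the bookmarklet above.

```javascript
// Sketch: build the same kind of URL as the bookmarklet above.
// buildCreateBookmarkUrl and its option names are hypothetical; the URL scheme
// and the four query parameters are copied from the bookmarklet.
function buildCreateBookmarkUrl({ title, location, referrer = '', tag = '' }) {
  const params = [
    ['title', title],
    ['location', location],
    ['referrer', referrer],
    ['tags', tag],
  ];
  // encodeURIComponent percent-encodes each value as UTF-8, which is why the
  // tag "π-convert-to-pdf" appears as "%CF%80-convert-to-pdf" in the bookmarklet.
  const query = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join('&');
  return `x-devonthink://createBookmark?${query}`;
}
```

Inside a bookmarklet, the values would come from `document.title`, `window.location.href`, and `document.referrer`, and the result would be assigned to `window.location` just as in the one-liner above.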

Finally, I seem to have better luck with saving some pages if I use Safari’s File ▹ Export as PDF… facility to create the PDF. Setting up a workflow to get the resulting PDF into DEVONthink automatically is actually pretty easy.

I used Keyboard Maestro to set up a keyboard shortcut for the menu item. (This could be done using the macOS keyboard shortcut facility, or other tools.)
When invoked, Safari shows a dialog to save the file. In that dialog, I select the DEVONthink Inbox in the left-hand sidebar, and press return to save it there.

After a short delay, the PDF file appears in DEVONthink’s Inbox (within DEVONthink itself).

I hope this helps!

Maria · July 19, 2023, 12:50am

Dear all,

Thanks for your kind advice.

I tested most of the options, but the results vary. One result is satisfactory, but since I cannot be sure it works consistently every time, it is not really satisfactory either.

This morning I made some clippings of the same page content from Safari to Devonthink and exported them for Christian to check.

It seems to me that Devonthink is able to collect and present content well in all cases, but it works inconsistently. Best (and most trustworthy) is “share to Devonthink as formatted note”, then convert to RTF, then clean up. But I am still not sure.

Great community, thanks so much!

Maria

cgrunenberg · July 19, 2023, 6:18am

Using the browser extension or activating the Sorter’s Clip to DEVONthink tab and choosing the browser should usually provide the best results, as the HTML source from the browser is received (via the browser extension) or retrieved (via AppleScript, if the browser is scriptable and automation is not denied).

Maria · July 19, 2023, 6:44am

Dear all,

after a private conversation with Christian, it seems that there is no trustworthy solution due to modern web site programming as some of you already mentioned.

My solution is as follows:

I do not use the share button or clipping any more, but “Print PDF to Devonthink” consistently.
All metadata are saved, I can choose the group where to file within the dialog, and even the endless pages at the end of articles can be cut off within the print dialog by setting the print range (e.g. pages 1-6 instead of all 27). It is almost a single step that gives me everything I need.
Whenever I need a cleaner representation or want to work with the text, I can simply convert to Markdown or RTFD.

I wonder why it took so long to find this simple and quick solution that has no compromise in its result…

Have a nice day and thanks to all.

cgrunenberg · July 19, 2023, 7:04am

There are also two other options:

DEVONagent is able to directly add the currently viewed web page to DEVONthink in the desired format
When viewing web pages in DEVONthink (e.g. after adding a bookmark first), just capture the page in the preferred format

Both approaches should usually deliver the best results but are more cumbersome and can’t be automated.

Maria · July 19, 2023, 7:06am

Helpful advice, Thank you!

Edit:

Just tested it; entered my password etc. in DevonAgent Pro and Devonthink. But the result is not nice for formatted notes, slightly better for RTFD.

It seems that Print PDF to Devonthink will stay my favorite for a while…

mhucka · July 19, 2023, 5:28pm

Maybe worth noting: the print-to-pdf approach causes page breaks to be inserted into the result, whereas the export-to-pdf approach (at least using Safari’s Export to PDF) produces a single-page PDF without page breaks. Whether one or the other is preferable is, of course, an individual choice.

(Since I rarely print things these days, my purpose in saving a page as PDF is to save a faithful, fully self-contained rendering of the page as it appeared to me when I saw it. The print-to-pdf approach often inserts page breaks in bad places and changes the layout in other ways, but it does depend on the specific pages, personal preferences, and how one uses the results.)

raluke · July 19, 2023, 8:01pm

THANK YOU!!! I have wanted this for so long and didn’t realize it could be done within Safari without installing extensions.

Maria · July 19, 2023, 11:43pm

The “Export to PDF” often creates unreadable results. The only reliable procedure with the websites I like storing (from behind the paywall) is to use the usual print dialogue and choose “Save PDF to Devonthink 3” there.

Here is a short description of the process I will settle with from now on:

You can see in the dialogue, that I can choose to delete the large number of pages with unrelated links and just save the important 4 pages – in this case. The pagination works, it does not cut lines.

When saving, I get the dialogue where I can choose where to file the PDF and under which name:

The result is clean, but not too pretty:

And I get all the necessary metadata.

Later, without connection to the internet or the site I used, and only if I feel the need to make it pretty, I can always create an RTFD or Markdown file via right-click and “Convert”. Looks like this:

But for the time being, the PDF works well in Devonthink and feeds the AI.

Since this workflow is consistent, reliable, simple, and complete, I am content and will not use clip or share or the sorter any more while in web browsers.

I have not used RTF for decades, because the paragraph formatting is horrible in multilingual text, but here it works great, and RTFD will play an important role in my database from now on. I would like another font and line spacing, though…

Cheers,
Maria

Disclaimer: I am NOT collecting articles about that man in real life.

papierlos · July 20, 2023, 7:37am

I spent a lot of time coming up with a way to save the full text of paid FAZ content without unrelated pictures and advertising.

I get the best result with the Safari extension MarkDownload (see AppStore, available for Firefox as well). It will save you a clean version of an article.

Alternatives are:
Instapaper. The Safari extension saves full-text, but how to export efficiently to Devonthink?
Gather CLI: Gather is more difficult to implement as it is a command line tool and you need a bookmarklet to save content behind a paywall. Gather CLI - BrettTerpstra.com
Readability CLI: Firefox Reader, helpful if you need certain elements like extracts etc. gardenappl / readability-cli · GitLab
DevonSave for iOS makes really good-looking clips of news articles in HTML: DEVONsave v3

For converting to other formats:
Marked2 is highly recommended
Hazel and Pandoc are my choice for automated conversion
Or inside Devonthink
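
As a rough illustration of the Pandoc step in such an automated conversion, the sketch below builds the argument list for one pandoc call. It assumes the saved clip is an HTML file and the target is GitHub-flavored Markdown; neither is stated above. The helper names are hypothetical, and the Hazel rule that would trigger the conversion is not shown.

```javascript
// Hypothetical helpers; only pandoc's --from, --to and --output options are real.

// Derive the Markdown file name from an .html/.htm file name.
function markdownPathFor(htmlPath) {
  return htmlPath.replace(/\.html?$/i, '') + '.md';
}

// Build the argument list for: pandoc <input> --from html --to gfm --output <output>
function pandocArgs(inputPath, { from = 'html', to = 'gfm' } = {}) {
  return [inputPath, '--from', from, '--to', to, '--output', markdownPathFor(inputPath)];
}
```

The resulting array can be handed to Node's `child_process.execFileSync('pandoc', args)` or joined into a shell command for whatever tool watches the folder.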

Fokus02 · April 18, 2025, 7:43pm

Interestingly, when an article is shared with Instapaper via the Safari share sheet on both macOS and iOS, Instapaper also seems to receive the entire webpage as it is displayed in Safari. It therefore saves the full article, even if it is paywalled, without having to sign in to the website within Instapaper.

If I’m not mistaken, this is not currently the case with DEVONthink and DTTG, but maybe better support for capturing via the share sheet could be added in the future?

cgrunenberg · April 19, 2025, 7:18am

DEVONthink has to download, render & convert the web page on its own due to the various supported formats & options (and the captured URL might not even be a web page but, e.g., a PDF). Especially in the case of dynamic content or content behind paywalls, this might fail. In these cases printing to DEVONthink, saving a web archive to the inbox or taking a note via services are the recommended workarounds. But we are always trying to improve this.