Batch capture multiple URLs

ssheth · May 19, 2020, 2:36pm

I’m accessing a paywalled site (think Club MacStories, though in this instance it is Peter Attia’s podcast archive).

How do I capture multiple web pages as Markdown (not web archive, since the formatting isn’t great) directly from the page, if presented with a list of links to articles?

Also, what would be the best format to capture these in? What is the difference between a single page PDF and a paginated PDF (apart from the fact that there are multiple pages)?

BLUEFROG · May 19, 2020, 3:30pm

There isn’t a “best format”.

A PDF generally retains the styling and is locked down. The single-page PDF is a closer representation of the page. The paginated PDF depends on whether a particular page has a print style, so the results may be similar to what Print to PDF produces.

Webarchives include the styling but may change at a later visit.

While it is technically possible to download the pages by processing the HTML, a page may contain many links unbeknownst to you. This short page on his site has 57 HTML links…
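To make the "many links" point concrete, here is a minimal sketch in Python of what processing the HTML would involve. It is not a DEVONthink feature; the names `LinkCollector` and `extract_links`, the sample markup, and the `example.com` URLs are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/about/" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in `html`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page: navigation, article links, and a footer link.
SAMPLE_HTML = """
<nav><a href="/">Home</a> <a href="/about/">About</a></nav>
<article>
  <a href="/podcast/episode-1/">Episode 1</a>
  <a href="/podcast/episode-2/">Episode 2</a>
</article>
<footer><a href="https://twitter.com/example">Twitter</a></footer>
"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML, "https://example.com"):
        print(link)
```

Even this tiny sample yields five links, only two of which are articles. A real page mixes in far more navigation, social, and footer links, which is why blindly downloading everything is rarely what you want.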

An easier option would be to drag and drop the links from this page into the database (they’d come in as bookmarks), then try the Scripts > Download > As PDF Documents option.

ssheth · May 19, 2020, 4:03pm

I just tried your suggestion. Though the formatting looks OK, the issue now is that the paywalled content is cut off from the PDF. The converted PDF shows only the non-paywalled content.

BLUEFROG · May 19, 2020, 4:53pm

That may be insurmountable, depending on the way the site has things set up.
Did you log into MacStories in DEVONthink? If not, do so and try again.

ssheth · May 20, 2020, 2:49pm

OK, I tried MacStories with the steps you suggested. It worked like a charm, especially the single-page PDF, which retains the styling of the newsletter.

What options exist on converting the Peter Attia website into a PDF? Could I use Keyboard Maestro or something else? Do paywalled sites usually have this issue? How do I log in to an instance of the website in DEVONthink?

BLUEFROG · May 20, 2020, 3:28pm

What options exist on converting the Peter Attia website into a PDF?
Could I use Keyboard Maestro or something else?

That would depend on your creativity, I suppose. I don’t use KM.
I don’t have any specific recommendations for automatic capture, as it’s not a simple matter with so many links and no easy automatic way to determine which links are desirable. In fact, I am a proponent of curation, so converting chosen links is what I would suggest.
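If you did want to narrow a list of links mechanically before curating by hand, one rough approach is to keep only links on a given host under a given path. The sketch below is Python, not anything built into DEVONthink; the function name `keep_links`, the `/podcast/` prefix, and the `example.com` URLs are assumptions for illustration only, and a real site may not organize its URLs this neatly.

```python
from urllib.parse import urlparse


def keep_links(links, host, path_prefix):
    """Return unique links on `host` whose path starts with `path_prefix`.

    The original order is preserved.
    """
    seen = set()
    kept = []
    for link in links:
        parts = urlparse(link)
        if parts.netloc != host or not parts.path.startswith(path_prefix):
            continue
        # Drop the fragment so "#comments" variants count as the same page.
        clean = parts._replace(fragment="").geturl()
        if clean not in seen:
            seen.add(clean)
            kept.append(clean)
    return kept


if __name__ == "__main__":
    # Hypothetical links, as they might be gathered from an index page.
    candidates = [
        "https://example.com/",
        "https://example.com/podcast/ep-1/",
        "https://example.com/podcast/ep-1/#comments",
        "https://twitter.com/example",
        "https://example.com/podcast/ep-2/",
    ]
    for link in keep_links(candidates, "example.com", "/podcast/"):
        print(link)
```

A filter like this only reduces the pile. It cannot tell which articles are worth keeping, so it complements curation rather than replacing it.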

Do paywalled sites usually have this issue?

It’s not uncommon, especially since they want to protect their content.

How do I log in to an instance of the website in DEVONthink?

DEVONthink has a WebKit browser built in, so you can add a bookmark to a database and access the page just as you do in your normal web browser.

Here is an example smart rule…

This is a controlled example using a targeted group where bookmarks are added or dropped. They are processed into PDFs in a destination group, and the bookmarks are then removed.

ssheth · May 20, 2020, 4:45pm

This is great! Thank you @BLUEFROG

BLUEFROG · May 20, 2020, 5:02pm

You’re very welcome

PS: In case you didn’t know, you can drag links from web pages and they’ll create bookmarks automatically. Check this out…

[Image: bookmarkdrag]

jooz · May 20, 2020, 6:45pm

I actually had the same issue with exactly the same website and its paywall. Thanks for sharing the tips, @BLUEFROG. I tried to replicate it, but this does not seem to work for me.
This is what I configured:

And I am logged into the webpage in the DT web browser; I checked it in a separate tab.
But the generated PDFs only include content up to the paywall… @ssheth, were you able to get this working for Peter’s webpage?

BLUEFROG · May 20, 2020, 8:02pm

It would depend on what the bookmark is linked to.
If it’s linked to paywalled data, we can’t control if the content can be clipped properly. It will vary site-to-site.

ssheth · May 21, 2020, 2:16am

Which website is this? I could get Club MacStories to process correctly but not Peter Attia’s paywalled section.

BLUEFROG · May 21, 2020, 6:03am

I just added the About page from his site. I can’t speak to any subscribed content on his site.

jooz · May 21, 2020, 2:25pm

It does not work for me with Peter Attia’s website.

ssheth · May 26, 2020, 5:10pm

I can’t sign in to an instance of the WebKit browser from within DEVONthink, and I’m getting very inconsistent results from saving the web locations directly into DEVONthink from DEVONagent. Sometimes I manage to download the full text as .md, and in other instances it hits the paywall.

Anyone have a bright idea of how to automate this?

BLUEFROG · May 26, 2020, 5:25pm

I can’t sign in to an instance of the WebKit browser from within DEVONthink

Please clarify what you mean by this.

ssheth · May 26, 2020, 5:27pm

Even when I download the web location/webarchive, if I try logging into the paywall, it opens up in Safari and not within DEVONthink.

BLUEFROG · May 26, 2020, 5:39pm

That is a regression in 3.5 that will be addressed in the next maintenance release. You can hold the Command key to open the link in a new tab.

Aurelius · May 26, 2020, 8:45pm

I download a lot of journal articles through my university’s library proxy, so this is something I have been wondering about for a while. Currently, I download the PDFs manually and then drag and drop them into DEVONthink, as the web clipper is not logged into the proxy. But if there is a way for DEVONthink to use the proxy and download the PDFs directly, that would be quite the time saver.

jooz · May 27, 2020, 6:14pm

My manual workaround is to use “Print” on a web page (Cmd+P) > Save as PDF > save into the DT Inbox, which is in Finder.

BLUEFROG · May 28, 2020, 5:13am

Have you tried logging in and browsing the site directly in DEVONthink?