Downloading a PDF from a link in an RSS feed

Geiri · August 7, 2024, 9:49pm

To begin with, I would like to thank everyone for all the help I have received here on the forum; this community is truly unique.

Now I am having DT monitor RSS feeds, but these feeds contain links to PDF documents. Is there any way to get DT to download the respective PDF documents and save them in a separate database?

I have been experimenting with smart rules, and I can manage to save the feed item itself as a PDF, but not the PDF that the item links to.

Here is a link to a sample RSS feed:
http://www.heradsdomstolar.is/heradsdomstolar/reykjavik/domar/rss/

Screenshot of the PDF link I wish to download:
Screenshot 2024-08-07 at 21.47.52

chrillek · August 8, 2024, 7:03am

I guess you want to automatically download all these PDFs, not only some of them?

chrillek · August 8, 2024, 7:16am

When I add that one to DT, I get an empty feed. Same if I use the correct protocol (https: instead of http:). If I open the URL in a browser, the content looks like an RSS feed, though…
I just tried with another RSS feed in RDF format – that one is not empty in DT.

cgrunenberg · August 8, 2024, 7:16am

As the RSS feed doesn’t contain PDF enclosures, the only possibility is to use a smart rule and a script that downloads the web page, retrieves the PDF links from the HTML source and finally downloads the PDF document too.
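
For readers unfamiliar with the term: an RSS item can carry an attachment in an `<enclosure>` element, which a feed reader could download directly. The following is a minimal sketch in plain JavaScript of checking a feed's XML for PDF enclosures. The function name `pdfEnclosures` and the sample XML are made up for illustration, and the regex-based parsing is a simplification, not a description of how DT reads feeds.

```javascript
// Hypothetical helper: list the URLs of PDF enclosures in an RSS feed's XML.
// Regex parsing is a simplification for illustration, not a full XML parser.
function pdfEnclosures(feedXml) {
  const urls = [];
  const tagPattern = /<enclosure\b[^>]*>/gi;
  for (const tag of feedXml.match(tagPattern) || []) {
    const type = /\btype\s*=\s*["']([^"']*)["']/i.exec(tag);
    const url = /\burl\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (type && url && type[1].toLowerCase() === 'application/pdf') {
      urls.push(url[1]);
    }
  }
  return urls;
}

// Made-up sample data: the first item has a PDF enclosure, the second only a link.
const sampleFeed = `
<rss version="2.0"><channel>
  <item><title>A</title>
    <enclosure url="https://example.com/a.pdf" length="1234" type="application/pdf"/>
  </item>
  <item><title>B</title><link>https://example.com/b.html</link></item>
</channel></rss>`;

const found = pdfEnclosures(sampleFeed);
```

An empty result for a feed like the one in this thread would mean the PDFs can only be reached through the linked web pages.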

cgrunenberg · August 8, 2024, 7:17am

The items in the feed are probably too old depending on your RSS settings.

chrillek · August 8, 2024, 7:20am

Any ideas what to change here?

cgrunenberg · August 8, 2024, 7:28am

These settings add 50 items over here to a new feed.

chrillek · August 8, 2024, 7:30am

As they do here, now. I had to re-add the feed, then it worked.

chrillek · August 8, 2024, 1:07pm

I can propose a script, but I can’t make it work:

(() => {
  const app = Application("DEVONthink 3");
  const window = app.thinkWindows[0];
  const feed = app.getRecordWithUuid('x-devonthink-item://F051ADB9-CB81-43B2-A439-F51D4F5FF1DC');
  const testRecord = feed.children().filter(item => item.name() === 'S-03677/2023')[0];
  /* Now open the testRecord in a tab to have it load the HTML page. 
     That seems to work ok, DT opens a new tab with the page. */
  const testTab = app.openTabFor({url: testRecord.url(), in: window});
  /* Notice: The testTab's `source` property is undefined – why? */
  console.log(testTab.source())
  /* Now, of course, `getLinksOf` with the `source` of the testTab fails */
  const links = app.getLinksOf(testTab.source());
})()

I’m certainly doing something stupid here, and @cgrunenberg will immediately see what that is. Or will source only be set if there’s a DT record loaded in the tab?

The basic idea is:

1. open the link from the RSS feed in a new tab
2. get all PDF links from the document in that tab using getLinksOf
3. close the tab
4. download the PDFs
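
The "get all PDF links" step is, at its core, link extraction plus a filter. Here is a rough sketch in plain JavaScript that runs outside DT. `pdfLinksIn` is a made-up name, the sample HTML is invented, and the regex is only a simplified stand-in for what getLinksOf does, assuming quoted `href` attributes.

```javascript
// Hypothetical helper: collect href values from HTML that point to a PDF.
// Simplified stand-in for DT's link extraction; assumes quoted href attributes.
function pdfLinksIn(html) {
  const links = [];
  const hrefPattern = /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    const href = match[2];
    // Ignore any query string or fragment when checking the extension.
    const path = href.split(/[?#]/)[0];
    if (/\.pdf$/i.test(path)) links.push(href);
  }
  return links;
}

// Made-up sample page with one PDF link and one ordinary link.
const page = '<p><a href="/Cache/Verdicts/abc.pdf">Verdict</a> <a class="x" href=\'/about.html\'>About</a></p>';
const pdfs = pdfLinksIn(page);
```

The links come back exactly as written in the page, so relative ones still have to be combined with the page's address before they can be downloaded.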

cgrunenberg · August 8, 2024, 1:16pm

I would suggest using the download markup from command instead.

chrillek · August 8, 2024, 2:01pm

Thanks, that looks better:

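(() => {
  const app = Application("DEVONthink 3");

  const URL = 'http://www.heradsdomstolar.is/default.aspx?pageid=d2ca19a6-a3fa-11e5-9402-005056bc0bdb&id=929e8c39-37d9-44dc-a0d5-24396b5c1073';
  // Protocol and hostname without a trailing slash, so that prepending it
  // to a path starting with "/" does not produce a double slash
  const baseURL = URL.replace(/^(https?:\/\/[^\/]+).*$/, "$1");

  const URLsource = app.downloadMarkupFrom(URL);
  const allLinks = app.getLinksOf(URLsource);
  // Keep only links ending in ".pdf" (the dot must be escaped in the regex)
  const pdfLinks = allLinks.filter(l => /\.pdf$/i.test(l)).map(l => `${baseURL}${l}`);
  // Queue each PDF in the download manager, then start the downloads
  pdfLinks.forEach(l => app.addDownload(l, {automatic: true}));
  app.startDownloads();
})()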

This downloads the PDF(s) found in the HTML page after prepending the base URL (the PDFs are referenced by root-relative URLs, that is, paths without protocol and host, like /Cache/Cache/Verdicts/929e8c39-37d9-44dc-a0d5-24396b5c1073.pdf).
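
Because the links start with a slash, joining them to the page's address takes a little care: no double slash should appear, and links that are already absolute should pass through unchanged. Below is a small sketch using plain string handling, on the assumption that the `URL` class may not be available in JXA. `originOf` and `resolveLink` are made-up names, and only root-relative, protocol-relative and absolute http(s) links are handled.

```javascript
// Hypothetical helpers, plain string handling only.
// originOf('http://host/path?x=1') returns 'http://host'.
function originOf(pageUrl) {
  const match = /^(https?:\/\/[^\/?#]+)/i.exec(pageUrl);
  if (!match) throw new Error(`Not an http(s) URL: ${pageUrl}`);
  return match[1];
}

function resolveLink(pageUrl, link) {
  // Already absolute: return unchanged.
  if (/^https?:\/\//i.test(link)) return link;
  // Protocol-relative ("//host/path"): reuse the page's scheme.
  if (link.startsWith('//')) {
    return pageUrl.slice(0, pageUrl.indexOf(':') + 1) + link;
  }
  // Root-relative ("/path"): prepend scheme and host of the page.
  if (link.startsWith('/')) return originOf(pageUrl) + link;
  throw new Error(`Unsupported link form: ${link}`);
}

const pageUrl = 'http://www.heradsdomstolar.is/default.aspx?pageid=d2ca19a6-a3fa-11e5-9402-005056bc0bdb&id=929e8c39-37d9-44dc-a0d5-24396b5c1073';
const pdfUrl = resolveLink(pageUrl, '/Cache/Cache/Verdicts/929e8c39-37d9-44dc-a0d5-24396b5c1073.pdf');
```

Links relative to the current directory (such as `file.pdf` without a leading slash) are deliberately rejected here rather than guessed at.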

With this approach, the PDF arrives in the database selected in the Download manager’s action menu, within a group determined by the URL. A bit awkward, but at least it’s there. Or perhaps there’s a setting to have only the file itself without creating a group hierarchy?

A better solution would be to use DT’s app.downloadURL(). But with JXA, that just gives me a useless data string.

cgrunenberg · August 8, 2024, 2:05pm

Specifying the base URL parameter should avoid this.

Or use the create PDF document from command.

BLUEFROG · August 8, 2024, 2:08pm

I didn’t see any articles with linked PDFs.

cgrunenberg · August 8, 2024, 2:10pm

The links are only on the original web page, not in the news of the feed.

chrillek · August 8, 2024, 2:20pm

For which I’d have to know the baseURL in the first place. So, I can either prepend it to each URL or add it to the command.

And again something new learned. If I use
app.createPDFDocumentFrom(url, {in: app.databases['Test'].root()})
I get a bookmark displaying the PDF, not the PDF itself.

cgrunenberg · August 8, 2024, 2:28pm

It’s the URL that was used for the download markup from command.

And the URL was which one? The one of the PDF?

chrillek · August 8, 2024, 2:37pm

Not really:

const URL = 'http://www.heradsdomstolar.is/default.aspx?pageid=d2ca19a6-a3fa-11e5-9402-005056bc0bdb&id=929e8c39-37d9-44dc-a0d5-24396b5c1073';
const URLsource = app.downloadMarkupFrom(URL)

The baseURL would be http://www.heradsdomstolar.is/, and I’d have to figure that out from the URL myself.

Very much so, as shown in the protocol window:

Interestingly, if I do a convert to paginated PDF from the context menu of this record, I get a (the?) PDF.

cgrunenberg · August 8, 2024, 2:43pm

No need to complicate this; the following works flawlessly over here:

tell application id "DNtp"
	set theURL to "http://www.heradsdomstolar.is/default.aspx?pageid=d2ca19a6-a3fa-11e5-9402-005056bc0bdb&id=929e8c39-37d9-44dc-a0d5-24396b5c1073"
	set theHTML to download markup from theURL
	set theLinks to get links of theHTML base URL theURL file type "PDF"
	repeat with theLink in theLinks
		create PDF document from theLink in (root of inbox)
	end repeat
end tell

chrillek · August 8, 2024, 2:57pm

Where does the file in file type come from? I don’t see that in the scripting dictionary. If I use type: 'PDF' in the JavaScript code, I do get a PDF+Text record now. So, here it is:

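(() => {
  const app = Application("DEVONthink 3");
  // Destination: the root group of the database named "Test"
  const target = app.databases['Test'].root();

  const URL = 'http://www.heradsdomstolar.is/default.aspx?pageid=d2ca19a6-a3fa-11e5-9402-005056bc0bdb&id=929e8c39-37d9-44dc-a0d5-24396b5c1073';
  // Fetch the HTML source of the page the feed item points to
  const URLsource = app.downloadMarkupFrom(URL);
  // Passing the page URL as baseURL yields absolute links; type limits them to PDFs
  const pdfLinks = app.getLinksOf(URLsource, {baseURL: URL, type: 'PDF'});
  // Create one PDF record per link in the target group
  pdfLinks.forEach(l => app.createPDFDocumentFrom(l, {in: target}));
})()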

cgrunenberg · August 8, 2024, 3:04pm

Oops… I used an internal revision of the script suite, sorry.