To begin with, I would like to thank everyone for all the help I have received here on the forum; this community is truly unique.
I now have DT monitoring RSS feeds whose items contain links to PDF documents. Is there any way to get DT to download the respective PDF documents and save them in a separate database?
I have been experimenting with smart rules, and I can save the feed itself as a PDF, but not the PDF behind the link in the RSS feed.
When I add that one to DT, I get an empty feed. The same happens if I use the correct protocol (https: instead of http:). If I open the URL in a browser, the content looks like an RSS feed, though…
I just tried with another RSS feed in RDF format – that one is not empty in DT.
As the RSS feed doesn’t contain PDF enclosures, the only possibility is to use a smart rule and a script that downloads the web page, retrieves the PDF links from the HTML source and finally downloads the PDF document too.
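To illustrate the "retrieve the PDF links from the HTML source" step, here is a minimal sketch in plain JavaScript. The function name `extractPdfLinks` is made up for this example, and the regular expression is a simplification that only looks at quoted `href` attributes. Inside DEVONthink, the built-in `get links of` command is probably the better choice.

```javascript
// Hypothetical helper (not part of DEVONthink): collect all href values
// that point to a .pdf file from an HTML source string.
function extractPdfLinks(html) {
  // Matches href="..." or href='...' (case-insensitive).
  const hrefPattern = /href\s*=\s*(["'])(.*?)\1/gi;
  const links = [];
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    const href = match[2];
    // Ignore query string and fragment when checking the extension.
    const path = href.split(/[?#]/)[0];
    if (/\.pdf$/i.test(path) && !links.includes(href)) {
      links.push(href);
    }
  }
  return links;
}
```

The result is a list of the raw `href` values in document order, without duplicates. Relative links are returned as they appear in the page and still need to be combined with the page's base URL before downloading.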
(() => {
  const app = Application("DEVONthink 3");
  const window = app.thinkWindows[0];
  const feed = app.getRecordWithUuid('x-devonthink-item://F051ADB9-CB81-43B2-A439-F51D4F5FF1DC');
  const testRecord = feed.children().filter(item => item.name() === 'S-03677/2023')[0];
  /* Now open the testRecord in a tab to have it load the HTML page.
     That seems to work ok, DT opens a new tab with the page. */
  const testTab = app.openTabFor({url: testRecord.url(), in: window});
  /* Notice: The testTab's `source` property is undefined – why? */
  console.log(testTab.source());
  /* Now, of course, `getLinksOf` with the `source` of the testTab fails */
  const links = app.getLinksOf(testTab.source());
})()
I’m certainly doing something stupid here, and I’m sure @cgrunenberg will immediately see what it is. Or will source only be set if there’s a DT record loaded in the tab?
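One possible explanation, not confirmed here, is that the script reads `source` before the tab has finished loading the page. A minimal sketch of a polling helper for that case follows. The name `waitUntil` and its parameters are made up for this example. Using it with the tab's `loading` property and JXA's global `delay` is an assumption that should be checked against the scripting dictionary.

```javascript
// Hypothetical helper: poll `isReady` until it returns true or the
// attempts run out. `pause` is passed in so that JXA's global
// delay(seconds) can be used inside DEVONthink scripts.
function waitUntil(isReady, pause, maxAttempts = 20, seconds = 0.5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (isReady()) return true;
    pause(seconds);
  }
  // One last check after the final pause.
  return isReady();
}

// Assumed usage in JXA (requires DEVONthink, not verified here):
// const loaded = waitUntil(() => !testTab.loading(), delay);
// const links = loaded ? app.getLinksOf(testTab.source()) : [];
```

With the defaults, the helper gives up after roughly ten seconds, so a page that never finishes loading does not block the script forever.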
The basic idea is:
open the link from the RSS feed in a new tab
get all PDF links from the document in that tab using getLinksOf
This downloads the PDF(s) found in the HTML page after prepending the base URL (the PDFs are referenced by root-relative URLs, i.e. absolute paths without protocol and host, like /Cache/Cache/Verdicts/929e8c39-37d9-44dc-a0d5-24396b5c1073.pdf).
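The "prepending the base URL" step could look roughly like the sketch below. The helper name `toAbsoluteUrl` is made up for this example. It avoids the `URL` class on the assumption that it may not be available in the JXA runtime, and it only handles links that are already absolute, protocol-relative, or root-relative.

```javascript
// Hypothetical helper: combine a link found in a page with the page's own
// URL so that a path without protocol and host becomes a full URL.
function toAbsoluteUrl(link, pageUrl) {
  // Link already has a scheme (http:, https:, ...): leave it untouched.
  if (/^[a-z][a-z0-9+.-]*:/i.test(link)) return link;

  // Split the page URL into scheme ("http:") and host.
  const parts = pageUrl.match(/^([a-z][a-z0-9+.-]*:)\/\/([^\/?#]+)/i);
  if (!parts) throw new Error('Cannot determine protocol and host of ' + pageUrl);
  const scheme = parts[1];
  const host = parts[2];

  // Protocol-relative link (//host/path): only the scheme is missing.
  if (link.startsWith('//')) return scheme + link;
  // Root-relative link (/path): prepend scheme and host.
  if (link.startsWith('/')) return scheme + '//' + host + link;

  // Other relative forms are deliberately not handled in this sketch.
  throw new Error('Unsupported relative link: ' + link);
}
```

For the example path above and a page on http://www.heradsdomstolar.is, this yields http://www.heradsdomstolar.is/Cache/Cache/Verdicts/929e8c39-37d9-44dc-a0d5-24396b5c1073.pdf.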
With this approach, the PDF arrives in the database selected in the Download manager’s action menu, within a group determined by the URL. A bit awkward, but at least it’s there. Or perhaps there’s a setting to have only the file itself without creating a group hierarchy?
A better solution would be to use DT’s app.downloadURL(). But with JXA, that just gives me a useless data string.
For which I’d have to know the baseURL in the first place. So, I can either prepend it to each URL or add it to the command.
And again, something new learned: if I use app.createPDFDocumentFrom(url, {in: app.databases['Test'].root()}), I get a bookmark displaying the PDF, not the PDF itself.
No need to complicate this; the following works flawlessly over here:
tell application id "DNtp"
	set theURL to "http://www.heradsdomstolar.is/default.aspx?pageid=d2ca19a6-a3fa-11e5-9402-005056bc0bdb&id=929e8c39-37d9-44dc-a0d5-24396b5c1073"
	set theHTML to download markup from theURL
	set theLinks to get links of theHTML base URL theURL file type "PDF"
	repeat with theLink in theLinks
		create PDF document from theLink in (root of inbox)
	end repeat
end tell
Where does the file come from? I don’t see that in the scripting dictionary. If I use type: 'PDF' in the JavaScript code, I do get a PDF+Text record now. So, here it is: