Batch-download a list of webpages as Web Archive in Devonthink 3 Pro

I’m quite new to DEVONthink and still learning it. I was wondering if anybody could help me with the following problem. I’ve already searched the forum for similar questions, but I wasn’t able to find an exact solution for my case.
Let’s say I have a list of URLs in a txt or csv file loaded in DT, and I want DT to download all the listed pages in Web Archive format.
Is there a script that can do that? What do you think?

Welcome @LetoTheII
Do you have a sheet or text file already?

Yep! So basically what I do to get the txt file is the following:

  1. The main page that contains all the links is this one: www dot reddit dot com/r/UrsulaKLeGuin/comments/eoacbw/earthsea_reread_intro_invitation/ (sorry but it did not allow me to post a link);
  2. I import it into DT, and then in the right-hand column of the program, under the “Link” section, there is a list of all the URLs on the page. From it I select only the ones I want, right-click on the selection, choose “Copy”, and create the txt file with all the links. As an example, the list begins with:
  • www dot reddit dot com/r/UrsulaKLeGuin/comments/eozu88/earthsea_reread_a_wizard_of_earthsea_chapter_1/
  • www dot reddit dot com/r/UrsulaKLeGuin/comments/epxku7/earthsea_reread_a_wizard_of_earthsea_chapter_2/
  • and so on
  3. Then at this point I need a script that batch-downloads all these URLs in Web Archive format. I don’t really know where to start though :sweat_smile:

Also, maybe this is not the best way to collect the URLs into a txt file, but that’s what I came up with…

Did you just copy and paste the URLs from the Reddit page into a text file? A rich text file?

Here’s a snippet that would process a rich text or html file with active hyperlinks in it. I just copied the links from the Reddit post and pasted into a rich text file to run this script…

tell application id "DNtp"
	if (selected records) = {} then return
	set sel to (selected record 1)
	if (type of sel) is in {rtf, rtfd, html} then
		set src to source of sel
		repeat with theURL in (get links of src)
			with timeout of 3600 seconds
				create web document from theURL in incoming group
			end timeout
		end repeat
	end if
end tell

Bear in mind, the data has to be downloaded to create the file. There’s a 60-minute timeout so it has time to work, if it needs to.

I’d guess this is a one-off script, but if you did want to save it…

  1. Open /Applications/Utilities/Script Editor.app.
  2. Paste the desired code.
  3. Select Script > Compile to ensure it’s compiling properly.
  4. Select File > Save.
  5. In the Save dialog, press Command-Shift-G and paste ~/Library/Application Scripts/com.devon-technologies.think3/Menu. You can save into that directory or a subfolder of your choice.
  6. Give the script your desired name and save it. The script should now be available in the Scripts menu in DEVONthink.

Thanks for the quick reply! I have now saved and imported the script into DT3, and I have created an .rtf with a list of URLs. But if I just select that file and then run the imported script from the top bar of DT3, nothing happens. How am I supposed to run it? (I’ve already tried reducing the timeout.)

Run it from Script Editor after enabling View > Show Log. Is it showing an error?

It does not show any error (in either the “Result” or the “Messages” tab).
However, I was able to add a couple of log statements here and there:

tell application id "DNtp"
	log "1"
	if (selected records) = {} then return
	set sel to (selected record 1)
	if (type of sel) is in {rtf, rtfd, html} then
		log "2"
		set src to source of sel
		repeat with theURL in (get links of src)
			log "3"


and (strangely to me) “2” is also printed, when it shouldn’t be: run this way, I’m not feeding the script any input file, so the if condition should be false (I’m not familiar with AppleScript at all, so I may be wrong…). However, it does not print “3”.


You may not be “feeding” it any input files, but there may well be record(s) selected in DT.

1 Like

What’s in the rich text file? Did you do just what I said to do – copy and paste into rich text from the Reddit post?

The rich text file contains a series of URLs of the pages I want to save as web archives.

It’s handier (and faster) for me to collect a series of URLs in a file, and then batch-download them using a DT script, instead of manually saving them one by one.


Ah OK, it was not clear to me that I had to save a rich text version of the main Reddit page from inside DT; instead I just created an .rtf with a plain list of URLs, which did not work (I misinterpreted your phrase “I just copied the links from the Reddit post and pasted into a rich text file”; sorry about that).

Many thanks! Now it works like a charm!

Summarizing for all who might be interested, the procedure to batch-download a series of pages in Web Archive is the following:

  1. Starting from the main webpage that contains the links to all the pages you want to save, import it into DT and save it in rich text format;
  2. Edit the newly created rich text file to filter out all the links you are not interested in;
  3. Finally, select this file, and from the top bar run the imported AppleScript (see above) to start the download procedure; depending on how big the site is, reduce/increase the timeout (for instance, 10 seconds is more than enough in my case)!
  4. Wait for the download to finish and enjoy your local copies of your favorite webpages :slight_smile:

Final question, I swear :sweat_smile:
I’m still deciding whether I want to save webpages in Web Archive or PDF format; how should I change the script you kindly provided to download the URLs as PDFs?

You’re welcome.

The command would be create pdf document from if you wanted to make PDFs.
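For reference, here is the same loop with that one call swapped, as a sketch under the same assumptions as the original script (a rich text/HTML file with live links is selected). The optional `with pagination` parameter should split the result into pages rather than producing one long page; leave it off if you prefer single-page PDFs.

```applescript
tell application id "DNtp"
	if (selected records) = {} then return
	set sel to (selected record 1)
	if (type of sel) is in {rtf, rtfd, html} then
		set src to source of sel
		repeat with theURL in (get links of src)
			with timeout of 3600 seconds
				-- Same loop as the Web Archive version; only this command changes.
				create PDF document from theURL in incoming group with pagination
			end timeout
		end repeat
	end if
end tell
```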

1 Like

I use SiteSucker to download whole websites.

Sorry to bring this back up; everything is working great with this script. But I’m bad at AppleScript and want to change where the files are saved, from ‘in incoming group’ to a group inside the Inbox.

I know the script below is wrong - but am I even on the right track?

with timeout of 3600 seconds
	set dest to get record at ("/Inbox/Auto-Pulled Links" in database "Inboxes")
	create web document from theURL in dest

Yes, that’s feasible. You could also Control-click the destination group and copy the item link, then use…

set dest to get record with uuid "x-devonthink-item://………"

That has the benefit of allowing you to rename or move the group in the future without upsetting the script.
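Putting that together with the original loop, a sketch of the full script targeting a fixed group by UUID might look like this. The UUID string is a placeholder; replace it with the item link copied from your own destination group.

```applescript
tell application id "DNtp"
	-- Placeholder: Control-click your destination group in DEVONthink,
	-- choose "Copy Item Link", and paste the real link here.
	set dest to get record with uuid "x-devonthink-item://REPLACE-WITH-YOUR-UUID"
	if (selected records) = {} then return
	set sel to (selected record 1)
	if (type of sel) is in {rtf, rtfd, html} then
		set src to source of sel
		repeat with theURL in (get links of src)
			with timeout of 3600 seconds
				create web document from theURL in dest
			end timeout
		end repeat
	end if
end tell
```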

PS: Do you actually have a database named “Inboxes” ??

1 Like

No, I don’t. I was just attempting to point it at the global inbox or something; I was messing around/trying things out to make it work.

You’d need to use a different syntax then as there is no database called “Inboxes”.

set dest to inbox
-- OR --
set dest to incoming group
--> Targets the Global Inbox

set dest to incoming group of database "Recipes"
--> Targets the Inbox of a specific database