Batch-download a list of webpages as Web Archive in Devonthink 3 Pro

I’m quite new to DEVONthink and still learning it. I was wondering if anybody could help me with the following problem. I’ve already searched the forum for similar questions, but I wasn’t able to find an exact solution for my case.
Let’s say I have a list of URLs in a txt or csv file loaded in DT, and I want DT to download all the listed pages in Web Archive format.
Is there a script that can do that? What do you think?

Welcome @LetoTheII
Do you have a sheet or text file already?

Yep! So basically what I do to get the txt file is the following:

  1. The main page that contains all the links is this one: www dot reddit dot com/r/UrsulaKLeGuin/comments/eoacbw/earthsea_reread_intro_invitation/ (sorry but it did not allow me to post a link);
  2. I import it into DT, and then in the right-hand column of the program, under the “Link” section, there is a list of all the URLs on the page. From it I select only the ones I want, right-click on the selection, choose “Copy”, and create the txt file with all the links. As an example, the list begins with:
  • www dot reddit dot com/r/UrsulaKLeGuin/comments/eozu88/earthsea_reread_a_wizard_of_earthsea_chapter_1/
  • www dot reddit dot com/r/UrsulaKLeGuin/comments/epxku7/earthsea_reread_a_wizard_of_earthsea_chapter_2/
  • and so on
  3. Then at this point I need a script that batch-downloads all these URLs in Web Archive format. I don’t really know where to start though :sweat_smile:

Also, maybe this is not the best way to collect the URLs into a txt file, but that’s what I came up with…

Did you just copy and paste the URLs from the Reddit page into a text file? A rich text file?

Here’s a snippet that would process a rich text or html file with active hyperlinks in it. I just copied the links from the Reddit post and pasted into a rich text file to run this script…

tell application id "DNtp"
	if (selected records) = {} then return
	set sel to (selected record 1)
	if (type of sel) is in {rtf, rtfd, html} then
		set src to source of sel
		repeat with theURL in (get links of src)
			with timeout of 3600 seconds
				create web document from theURL in incoming group
			end timeout
		end repeat
	end if
end tell

Bear in mind, the data has to be downloaded to create the file. There’s a 60-minute timeout so it has time to work, if it needs to.

I’d guess this is a one-off script, but if you did want to save it…

  1. Open /Applications/Utilities/Script Editor.app.
  2. Paste the desired code.
  3. Select Script > Compile to ensure it’s compiling properly.
  4. Select File > Save.
  5. In the Save dialog, press Command-Shift-G and paste ~/Library/Application Scripts/com.devon-technologies.think3/Menu. You can save into that directory or a subfolder of your choice.
  6. Give the script your desired name and save it. The script should now be available in the Scripts menu in DEVONthink.

Thanks for the quick reply! I have now saved and imported the script into DT3, and I have created an .rtf with a list of URLs. But if I just select that file and then run the imported script from the top bar of DT3, nothing happens. How am I supposed to run it? (I’ve already tried reducing the timeout.)

Run it from Script Editor after enabling View > Show Log. Is it showing an error?

It does not show any error (in either the “Result” or the “Messages” tab).
However, I was able to add a couple of log statements here and there:

tell application id "DNtp"
	log "1"
	if (selected records) = {} then return
	set sel to (selected record 1)
	if (type of sel) is in {rtf, rtfd, html} then
		log "2"
		set src to source of sel
		repeat with theURL in (get links of src)
			log "3"


and (strangely to me) “2” is also printed, when it shouldn’t be: run this way, I’m not feeding the script any input file, so the if condition should be false (I’m not familiar with AppleScript at all, so I may be wrong…). However, it does not print “3”.


You may not be “feeding” it any input files, but there may well be record(s) selected in DT.

1 Like

What’s in the rich text file? Did you do just what I said to do – copy and paste into rich text from the Reddit post?

The rich text file contains a series of URLs of the pages I want to save as web archives.

It’s handier (and faster) for me to collect a series of URLs in a file, and then batch-download them using a DT script, instead of manually saving them one by one.


Ah OK, it was not clear to me that I had to save a rich text version of the main Reddit page from inside DT; instead I just created an .rtf with a plain list of URLs, which did not work (I misinterpreted your phrase “I just copied the links from the Reddit post and pasted into a rich text file”; sorry about that).

Many thanks! Now it works like a charm!

Summarizing for all who might be interested, the procedure to batch-download a series of pages in Web Archive is the following:

  1. Starting from the main webpage that contains the links to all the pages you want to save, import it into DT and save it in rich text format;
  2. Edit the newly created rich text file to filter out all the links you are not interested in;
  3. Finally, select this file, and from the top bar run the imported AppleScript (see above) to start the download procedure; depending on how big the site is, reduce/increase the timeout (for instance, 10 seconds is more than enough in my case)!
  4. Wait for the download to finish and enjoy your local copies of your favorite webpages :slight_smile:

Final question, I swear :sweat_smile:
I’m still deciding whether I want to save webpages in Web Archive or PDF format; how should I change the script you kindly provided to download the URLs as PDFs?

You’re welcome.

The command would be create pdf document from if you wanted to make PDFs.
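For reference, here is the same loop with that one call swapped, as a sketch under the same assumptions as the original script (a rich text/HTML file with live links is selected). The optional `with pagination` parameter should split the result into pages rather than producing one long page; leave it off if you prefer single-page PDFs.

```applescript
tell application id "DNtp"
	if (selected records) = {} then return
	set sel to (selected record 1)
	if (type of sel) is in {rtf, rtfd, html} then
		set src to source of sel
		repeat with theURL in (get links of src)
			with timeout of 3600 seconds
				-- Same loop as the Web Archive version; only this command changes.
				create PDF document from theURL in incoming group with pagination
			end timeout
		end repeat
	end if
end tell
```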

1 Like

I use SiteSucker to download whole websites.

Sorry to bring this back up; everything is working great with this script. But I’m bad at AppleScript and want to change where the files are saved, from ‘in incoming group’ to a group inside the Inbox.

I know the script below is wrong - but am I even on the right track?

with timeout of 3600 seconds
	set dest to get record at ("/Inbox/Auto-Pulled Links" in database "Inboxes")
	create web document from theURL in dest

Yes, that’s feasible. You could also Control-click the destination group and copy the item link, then use…

set dest to get record with uuid "x-devonthink-item://………"

That has the benefit of allowing you to rename or move the group in the future without upsetting the script.
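Putting that together with the original loop, a sketch of the full script targeting a fixed group by UUID might look like this. The UUID string is a placeholder; replace it with the item link copied from your own destination group.

```applescript
tell application id "DNtp"
	-- Placeholder: Control-click your destination group in DEVONthink,
	-- choose "Copy Item Link", and paste the real link here.
	set dest to get record with uuid "x-devonthink-item://REPLACE-WITH-YOUR-UUID"
	if (selected records) = {} then return
	set sel to (selected record 1)
	if (type of sel) is in {rtf, rtfd, html} then
		set src to source of sel
		repeat with theURL in (get links of src)
			with timeout of 3600 seconds
				create web document from theURL in dest
			end timeout
		end repeat
	end if
end tell
```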

PS: Do you actually have a database named “Inboxes” ??

1 Like

No, I don’t. I was just attempting to point it at the global inbox or something; I was messing around/trying things out to make it work.

You’d need to use a different syntax then as there is no database called “Inboxes”.

set dest to inbox
-- OR --
set dest to incoming group
--> Targets the Global Inbox

set dest to incoming group of database "Recipes"
--> Targets the Inbox of a specific database