Download a list of web pages as HTML files into a DEVONthink group

I have a list of URLs in a text file and I would like to tell DEVONthink to download all these pages (as HTML) into a group.

How can I do this?

Could I use Import Website for this? (I'm thinking of generating an HTML page of links from the URLs, but how would I then tell DEVONthink to crawl only one level deep?)

Or is there another simpler option?


PS: I saw "Any way to batch download a list of URLs to DevonThink?" on the DEVONtechnologies Community and tried to run Scripts > Download > As HTML Pages, but nothing seems to happen…

If I were doing this, I'd use wget as a terminal command. Use your text file as input to wget to "get" the web pages. wget is built for exactly what you want to do.
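
At its simplest it's just something like this (URL purely as an example, of course):

wget https://www.example.com/somepage.html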

Then once all downloaded to your computer, import them into DEVONthink.

I have it on all my Macs and, frankly, I can't remember whether it comes as standard with macOS or not. But it's easily available for download from trustworthy sources on the internet.
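
If you have Homebrew, for example, installing it is just:

brew install wget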

Thanks @rmschne. Good idea.

Can I pipe the list of URLs to wget as stdin??? That would be cool!

I have it on my Macs too (installed via Homebrew).

I just looked at the "man" page and it appears you can pass it a file that contains a list of URLs. I have not tested that idea. I recommend you read the "man" page for all the options that might be useful to you.

You can also use Apple Shortcuts to download HTML files. I have created a crude one, with each file named after its source URL.

The downloaded files are stored in the Shortcuts folder (change it as you wish). Import them into DT either manually or using your preferred automation tools.

I might have tomatoes on my eyes, but I couldn't find an option to give wget a list of URLs. I guess I'll call wget in some kind of for loop instead; that should work.
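
Something along these lines, presumably (untested; urls.txt being my list, one URL per line):

while read -r url; do
    wget "$url"
done < urls.txt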

From the Linux man page (I’m not convinced that the Tomaten-Augen proverb works in English, though):

-i file
--input-file=file
Read URLs from a local or external file. If - is specified as file, URLs are read from the standard input. (Use ./- to read from a file literally named -.)

If this function is used, no URLs need be present on the command line. If there are URLs both on the command line and in an input file, those on the command lines will be the first ones to be retrieved. If --force-html is not specified, then file should consist of a series of URLs, one per line.

However, if you specify --force-html, the document will be regarded as html. In that case you may have problems with relative links, which you can solve either by adding "<base href="url">" to the documents or by specifying --base=url on the command line.

If the file is an external one, the document will be automatically treated as html if the Content-Type matches text/html. Furthermore, the file's location will be implicitly used as base href if none was specified.
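
So in your case, something like this should do it (assuming your list is in urls.txt):

wget -i urls.txt

or, since you asked about stdin, you can pipe the list in:

cat urls.txt | wget -i -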

Thank you, @chrillek! I did indeed have "Tomaten auf den Augen". (= I was blind.)

I use wget often as well, but File > Import > Website, then Options in the Download Manager > Follow Links > One Level Deep gives you a similar result.
