Download a list of web pages as HTML files into a DEVONthink group

I have a list of URLs in a text file and I would like to tell DEVONthink to download all these pages (as HTML) into a group.

How can I do this?

Could I use Import Website for this? (I'm thinking of generating an HTML page of links from the URLs, but how would I then tell DEVONthink to crawl only one level deep?)

Or is there another simpler option?


PS: I saw "Any way to batch download a list of URLs to DevonThink?" on the DEVONtechnologies Community and tried to run Scripts > Download > As HTML Pages, but nothing seems to happen…

If I were doing this, I'd use wget as a terminal command. Use your text file as input to wget to "get" the web pages. wget is built for exactly what you want to do.
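
At its simplest it's just something like this (URL purely as an example, of course):

wget https://www.example.com/somepage.html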

Then once all downloaded to your computer, import them into DEVONthink.

I have it on all my Macs and, frankly, I can't remember whether it comes as standard with macOS or not. But it's easily available for download from trustworthy sources on the internet.
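
If you have Homebrew, for example, installing it is just:

brew install wget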

Thanks @rmschne. Good idea.

Can I pipe the list of URLs to wget as stdin??? That would be cool!

I have it on my Macs too (installed via Homebrew).

I just looked at the "man" page and it appears you can pass it a file that contains a list of URLs. I have not tested that idea. I recommend you read the "man" page for all the options that might be useful to you.

You can also use Apple Shortcuts to download HTML files. I have created a crude one, with each file named after its source URL.

The downloaded files are stored in the Shortcuts folder (change it as you wish). Import them into DT either manually or using your preferred automation tools.

I might have tomatoes on my eyes, but I couldn't find an option to give wget a list of URLs. I guess I'll call wget in some kind of for loop instead; that should work.
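
Something along these lines, presumably (untested; urls.txt being my list, one URL per line):

while read -r url; do
    wget "$url"
done < urls.txt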

From the Linux man page (I’m not convinced that the Tomaten-Augen proverb works in English, though):

-i file
--input-file=file
Read URLs from a local or external file. If - is specified as file, URLs are read from the standard input. (Use ./- to read from a file literally named -.)

If this function is used, no URLs need be present on the command line. If there are URLs both on the command line and in an input file, those on the command lines will be the first ones to be retrieved. If --force-html is not specified, then file should consist of a series of URLs, one per line.

However, if you specify --force-html, the document will be regarded as html. In that case you may have problems with relative links, which you can solve either by adding "<base href="url">" to the documents or by specifying --base=url on the command line.

If the file is an external one, the document will be automatically treated as html if the Content-Type matches text/html. Furthermore, the file's location will be implicitly used as base href if none was specified.
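
So in your case, something like this should do it (assuming your list is in urls.txt):

wget -i urls.txt

or, since you asked about stdin, you can pipe the list in:

cat urls.txt | wget -i -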

Thank you, @chrillek! I did indeed have "Tomaten auf den Augen". (= I was blind.)

I use wget often as well, but File > Import > Website, then Options in the Download Manager > Follow Links > One Level Deep gives you a similar result.
