Import website and make it searchable (didn't find a solution)

Anton_Gyaltsen · March 19, 2023, 11:59am

Hello!
I tried Import Website with the option “follow link on same host” and waited several hours. It downloaded mostly images from the pages, and I ended up with only the initial HTML page. I searched the forum and didn’t find a clear answer to the following question: how can I download the whole website (I mostly need only the text) and make all of its content searchable in DEVONthink? Please help, because it would be very useful for my work to be able to search that site along with the other material in my database. I think I am not alone in that.

chrillek · March 19, 2023, 12:42pm

I’d use curl or wget for that. Then index the resulting folder from DT. Or perhaps simply use your correctly parametrized favorite search engine to search the site?

Downloading a complete website just to be able to search it (which Google can already do because it has … not downloaded but indexed it)? I think many people would choose a different approach. Then there might be the question of copyright, too – downloading is in fact making a copy, which might not exactly be what the author(s) of the site would be happy about.
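For anyone who prefers a script over curl or wget, here is a minimal sketch of the same idea in Python: crawl only pages on the starting host, keep just the visible text, and write it to a folder that DT can then index. The names `PageParser`, `fetch_page`, `crawl`, `save_pages` and the `max_pages` limit are made up for illustration and are not part of any tool mentioned in this thread. It uses only the standard library and does not run JavaScript, so pages that build their content in the browser will come out nearly empty. It also neither checks robots.txt nor throttles requests, so add a delay before pointing it at a real site.

```python
from collections import deque
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collect visible text and link targets from one HTML page."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.links = []
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore text inside script/style/noscript and pure whitespace.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def fetch_page(url):
    """Download one page and decode it as text (needs network access)."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_page, max_pages=200):
    """Breadth-first crawl restricted to the start URL's host.

    Returns a dict mapping each visited URL to its extracted text.
    """
    host = urlparse(start_url).netloc
    queue, seen, pages = deque([start_url]), {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except (OSError, ValueError):
            continue  # unreachable or malformed URL: skip it
        parser = PageParser()
        parser.feed(html)
        pages[url] = "\n".join(parser.chunks)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            parts = urlparse(link)
            same_host = parts.scheme in ("http", "https") and parts.netloc == host
            if same_host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def save_pages(pages, folder):
    """Write each page's URL and text to a numbered .txt file in folder."""
    out = Path(folder)
    out.mkdir(parents=True, exist_ok=True)
    for i, (url, text) in enumerate(pages.items()):
        (out / f"page_{i:04d}.txt").write_text(url + "\n\n" + text, encoding="utf-8")
```

Something like `save_pages(crawl("https://example.org/"), "site_text")` would then leave a folder of plain-text files to index. The `fetch` parameter is there so the crawl logic can be tried against canned HTML before touching the network.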

Anton_Gyaltsen · March 19, 2023, 1:31pm

Thank you for the reply! But I wonder why I am not getting all the HTML pages when I use Import Website.
I want to have a heatmap of relevancy that includes not only PDFs but also the site. The information is given away for free because it is a non-profit project.

chrillek · March 19, 2023, 1:35pm

Slow network connection, DDoS prevention, website down – there could be many reasons. You might want to check whether anything is logged in DT’s Log window.

And, as I said, try curl or wget. They tell you what they’re downloading and from where and how fast the connection is.
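If neither tool is at hand, the same kind of feedback can be approximated with a few lines of Python. This is only a rough stand-in under my own assumptions, not how curl or wget work internally: the function name `probe`, the `opener` parameter and the `site-probe/0.1` User-Agent string are invented for this sketch. It requests one URL and reports the status, the content type, the size and the elapsed time, or the error if the request fails.

```python
import time
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def probe(url, opener=urlopen, timeout=30):
    """Request one URL and return a dict describing how it went."""
    req = Request(url, headers={"User-Agent": "site-probe/0.1"})  # hypothetical UA
    started = time.monotonic()
    try:
        with opener(req, timeout=timeout) as resp:
            body = resp.read()
            status = resp.status
            content_type = resp.headers.get("Content-Type", "")
    except HTTPError as err:
        # The server answered, but with an error status such as 403 or 404.
        return {"url": url, "ok": False, "status": err.code, "error": str(err.reason)}
    except (URLError, OSError) as err:
        # No usable answer at all: DNS failure, refused connection, timeout.
        return {"url": url, "ok": False, "status": None, "error": str(err)}
    return {
        "url": url,
        "ok": True,
        "status": status,
        "content_type": content_type,
        "bytes": len(body),
        "seconds": round(time.monotonic() - started, 3),
    }
```

Running `probe` on the start page and on a few of the pages that went missing should show whether the server refuses them, answers slowly, or returns something other than HTML.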

BLUEFROG · March 20, 2023, 1:32am

There are many settings in the Download Manager’s options, and the way the site delivers its content matters as well.

What URL and what options are you using?

Anton_Gyaltsen · March 20, 2023, 4:50am

The site is https://studybuddhism.com
And options

cgrunenberg · March 20, 2023, 7:19am

Another common possibility is a dynamic website that requires JavaScript or even user interaction.
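A rough way to test for that case is to look at the raw HTML the server sends, before any script runs: if it contains script tags but almost no visible text, the content is probably built in the browser. The sketch below is only a heuristic under my own assumptions. The name `looks_script_rendered` and the 200-character threshold are arbitrary choices for illustration, and a page can fool it in either direction.

```python
from html.parser import HTMLParser


class _TextAndScriptCounter(HTMLParser):
    """Count <script> tags and visible text characters in raw HTML."""

    HIDDEN = ("script", "style", "noscript")

    def __init__(self):
        super().__init__()
        self.scripts = 0
        self.text_chars = 0
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        if tag in self.HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Only text outside script/style/noscript counts as visible.
        if not self._hidden_depth:
            self.text_chars += len(data.strip())


def looks_script_rendered(html, min_text_chars=200):
    """Guess whether a page depends on JavaScript to show its content.

    True when the raw HTML has at least one <script> tag but fewer than
    min_text_chars characters of visible text.
    """
    counter = _TextAndScriptCounter()
    counter.feed(html)
    return counter.scripts > 0 and counter.text_chars < min_text_chars
```

If this returns True for the pages that failed to import, a plain downloader will not get the text either, and a tool that renders the page in a real browser engine would be needed.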