Capturing Webpages, including subpages

OK, SiteSucker does that, and I purchased the Pro version. The thing is that my first attempt with DT got me much further than an hour spent with SiteSucker. The only thing missing in DT was that the links on the main page need to point to the downloaded location. Otherwise, DT does the job much more easily, in my view.

My scenario is something like downloading a Wikipedia page, to a predefined level of depth, for offline reading. I wish DT could do that as well, since it is so close (and, in my view, easier than SiteSucker, where far more parameters are available and need to be understood).

Could you please send us an example URL plus either an exact description of what should actually be imported and how, or just send us the SiteSucker result for comparison? Thank you!

Did you ever check out the “free” curl? Doesn’t that work?

I can try, but the point of my comment here was that DT already has this functionality. I use the web archive every day; I only want to go one level further, following the links from that page, so that the material is available offline. I was therefore hoping that this can be done with DT. All files are downloaded; “all” that is missing is that the links in the webpage are redirected to the downloaded files 🙂 I was hoping that this can be done and that I am merely too stupid to choose the right option 😉

Curl does not rewrite URLs, wget does. But the OP doesn’t want to use a CLI, if I understood them correctly.
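
For anyone who does want to try the CLI route, a minimal wget sketch (the URL is just a placeholder) would download a page, the pages it links to on the same host, and their resources, and then rewrite the links to point at the local copies:

    wget --recursive --level=1 --convert-links --page-requisites --adjust-extension https://example.org/article.html

--convert-links is the part that curl has no counterpart for: once the download has finished, wget rewrites the links in the saved HTML so that they refer to the downloaded files instead of the live site.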

Either curl or wget will probably work for what is described as the need. Searching for alternatives without trying them is not how I would work the issue. Just me, I guess.

Here is how it works in DT. The arrow points at the file that the webpage I want to download refers to. The file is downloaded, but the links in it do not point to the subpages, which were downloaded nicely by DT:

I actually could not get this running that well with SiteSucker. I am trying to create a simple example from a Wikipedia page; the problem is that most pages there link to so many things that the download becomes massive. I have to think of a topic about which little has been written 🙂

Here is an example to test this with Wikipedia:

https://en.wikipedia.org/wiki/Symplectic_basis

SiteSucker downloads 44 files for a depth of two levels, but the nice thing is that in the HTML file Symplectic_basis.html all links point to the downloaded files.
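
(For reference, the wget suggestion from above, applied to this example with a depth of two levels, would be something along the lines of

    wget --recursive --level=2 --convert-links --page-requisites --adjust-extension https://en.wikipedia.org/wiki/Symplectic_basis

where --convert-links again points the links in Symplectic_basis.html at the downloaded files. The number of files it fetches depends on the host and directory restrictions used, so it need not match SiteSucker’s 44.)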

So, strangely, for my actual example DT does well, just short of redirecting the links, while SiteSucker does not work. For the Wikipedia example, SiteSucker produces the desired result.

I guess it’s not an easy task, but it seems like something that is “just” one level deeper than the web archive and would thus suit DT well.

Thanks everyone for commenting!

Olaf

Several people have already suggested tools that are known to work. Apparently, you decided not to try them. I, for one, don’t feel inclined to comment on another tool that I do not even know.

As I wrote, I purchased SiteSucker when it was recommended above. I think that qualifies as trying. I am not sure what your comment is trying to suggest; I even documented the experiments. So, what did I comment on without knowing or trying?

See below

I’m not sure which setting you used, but a depth of two levels should actually download many more files. Were the links limited to the same host or subdirectory, maybe? Did SiteSucker import the resources (e.g. images, stylesheets, and scripts) of the pages too?

Yes, SiteSucker downloads everything, including the images and the maths. The pages are recreated as if they were online, but the links from the main document point to, and open, the files in the downloaded folder.

My use case is just text: basically a webpage with an article split across several sub-pages of text, so the DT web archive functionality works well (for a single page).