Capturing Webpages, including subpages

What I’d like to do: capture a webpage, including its subpages, for offline reading.

Hi Everyone

I checked the forum for a good description but could not find one. The Help has a section on Web Archive, which, however, does not download subpages.

The Download Manager seems to be the way to go … but somehow I cannot get it to work either … hence my call for your help :slight_smile:

So, going to Window > Download Manager, I have two windows with options to consider:


Going for “Offline Archive” gives the same result as a Web Archive: subpages are not downloaded.

I then (regardless of the option chosen, and repeatedly) get an error message for the Global Inbox: “Failed database verification, please repair the database”.

This finds one inconsistency and is easily repaired.

I then have, inside a Downloads folder in the Global Inbox, a folder and an HTML document of the main webpage that I’d like to read offline.

The folders show that all subpages are there, but the main page, from which I want to navigate to the subpages, still requires the internet. What I was expecting was a page whose links to the subpages now lead to the offline downloads. I imagined an offline copy of the webpage, so that I could still open the main page and click through to the subpages.

Did I miss something?

Thank you in advance!

Olaf

What is the subpage of a web page? If you mean “I want to download all documents referred to by links in the original document”, I’d suggest you go for wget, curl, or similar tools.

As an aside: the topic of capturing web documents has been discussed many times here. There’s no silver bullet. While webarchive might work in some cases, it has been (kind of) deprecated by Apple and is not supported on any other platform anyway. To reiterate my point: what you seemingly want (though I might have misunderstood your intentions) can be achieved with command-line download tools in a reliable and very flexible way. And you can of course script those with do shell script (AppleScript) or doShellScript (JavaScript).
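As a rough sketch of what I mean (the URL is just a placeholder, and wget is not part of macOS, so you’d install it first, e.g. via Homebrew), a one-level mirror with links rewritten for offline reading could look like this:

```bash
# Rough sketch, not a recipe: mirror a start page plus the pages it links to,
# one level deep, for offline reading. The URL is a placeholder.
wget --recursive --level=1 \
     --convert-links \
     --page-requisites \
     --adjust-extension \
     --no-parent \
     https://example.com/start.html
# --recursive --level=1 : follow the links on the start page, but no deeper
# --convert-links       : rewrite links so they point to the downloaded copies
# --page-requisites     : also fetch the images, CSS, etc. needed to render the pages
# --adjust-extension    : save pages with an .html extension
# --no-parent           : never ascend above the start page’s directory
```

The resulting folder could then be imported or indexed in DEVONthink, and a call like this can be wrapped in do shell script if you want to trigger it from AppleScript.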

Also, capturing a page that is generated at least partly by JavaScript stored on the server (as is, for example, Apple’s developer documentation) will always require an internet connection to display: the content is provided by the server when the page is loaded by the browser.

What I mean is a simple webpage with links to other pages (no further levels). The Download Manager pulls all the pages down, so this works well.

I was only hoping that the main page, whose link I used to start the download, would then have its links point to the downloaded subpages/folders, so that I can read the webpage, and its subpages, offline.

I read everything I could find on the download manager here before writing.

I am an ordinary user with no experience with scripts and was therefore looking for a solution within DT. No problem if it isn’t possible; I am not complaining. The result is already very close: I have all the files there, only the entry/main page linking to the downloaded folders is missing. I understand that there are complex scenarios with JavaScript etc., but my case was a simple two-layer scenario.

Thanks

Olaf

As suggested by @chrillek, check out and use wget and/or curl to do this.

Was additional information logged to the Log panel after verifying & repairing the database? Are you able to reproduce the issue?

Well, that’s what the tools I mentioned handle correctly (or at least wget does; I haven’t used curl for that yet).

That shouldn’t stop you from trying (which then leads to experience :wink:). There are tons of explanations on how to use curl and wget available online. It’s not rocket science, nor will it break your computer (although you might fill up your hard disk if you do not limit the search depth).
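For example (again only a sketch with a placeholder URL), these are the kinds of flags that keep a recursive download from getting out of hand:

```bash
# Sketch: a recursive download that cannot run away (placeholder URL).
wget --recursive --level=2 \
     --no-parent \
     --quota=200m \
     --wait=1 \
     https://example.com/start.html
# --level=2    : stop following links after two levels
# --no-parent  : stay below the starting directory
# --quota=200m : give up once roughly 200 MB have been downloaded
# --wait=1     : pause one second between requests to go easy on the server
```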

Some years ago I used a utility called SiteSucker that (it seems) will still work on macOS. It might be worth investigating:

https://ricks-apps.com/osx/sitesucker/index.html

I don’t remember much about it, except that it worked for what I wanted to do at the time!

Sheesh! I haven’t thought about SiteSucker in years :flushed::blush:

I’m getting really old :slight_smile:

Yes, I get the message to repair every time. I already added a screenshot of the log window above. Let me know if I can try anything else to resolve this.

You said it was easily repaired. What did you repair?

I understand that there are other tools, and thanks for the tips. I use DT primarily, every day, all the time, to gather material from the web, and I also use web archives frequently. DT covers virtually all of my needs for gathering material. So, when it came to archiving just the second layer of a simple webpage, I naturally hoped that it would do that as well. In a way, it does: the files are there; offline reading is just not as convenient because the main page does not link to those downloaded subpages.

The message says to repair the database (Global Inbox). I do that and then all is fine. It always finds one inconsistency after a download.

We can have a Zoom session to look at it. I can send a Zoom link to quickly meet up and reproduce the scenario.

Please choose Help > Report Bug while pressing the Alt modifier key and send the result to cgrunenberg - at - devon-technologies.com. This should be sufficient, thanks!

Done

Thank you for the logs! Are there still any copies of the Downloads group in the trash? That’s most likely causing the issue.

Yes, during the various attempts, trying different options, I deleted things, moving them to the trash. I take it that emptying the trash is a good idea then :slight_smile:

Just like the waste bin in your kitchen, routinely emptying the trash in your databases is advisable.

SiteSucker is still going strong. The regular version is available in the Mac App Store — and there’s a Pro version available from the developer. I’ve always had good luck with it.
