Capturing Webpages, including subpages

What I’d like to do: capture a webpage, including its subpages, for offline reading.

Hi Everyone

I checked the forum for a good description but could not find one. The Help has a section on Web Archive, which, however, does not download subpages.

The Download Manager seems to be the way to go … but somehow I cannot get it to work either … hence my call for your help :slight_smile:

So, going to Window > Download Manager, I have two windows with options to consider:


Going for “Offline Archive” gives the same result as a Web Archive: subpages are not downloaded.

I then (regardless of the option chosen, and repeatedly) get an error message for the Global Inbox: “Failed database verification, please repair the database”.

This finds one inconsistency and is easily repaired.

I then have, inside a Downloads folder in the Global Inbox, a folder and an HTML document of the main webpage that I’d like to read offline.

The folders show that all subpages are there, but the main page, from which I want to navigate to the subpages, still requires the internet. What I was expecting was a page whose links to the subpages now lead to the offline downloads. I imagined an offline copy of the webpage, so that I could still open the main page and click through to the subpages.

Did I miss something?

Thank you in advance!

Olaf

What is the subpage of a web page? If you mean “I want to download all documents referred to by links in the original document”, I’d suggest you go for wget, curl, or similar tools.

As an aside: the topic of capturing web documents has been discussed many times here. There’s no silver bullet. While webarchive might work in some cases, it has been (kind of) deprecated by Apple and is not supported on any other platform anyway. To reiterate my point: what you seemingly want (though I might have misunderstood your intentions) can be achieved with command-line download tools in a reliable and very flexible way. And you can of course script those with do shell script (AppleScript) or doShellScript (JavaScript).
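As a rough sketch of what I mean (the URL is just a placeholder, and wget is not part of macOS, so you’d install it first, e.g. via Homebrew), a one-level mirror with links rewritten for offline reading could look like this:

```bash
# Rough sketch, not a recipe: mirror a start page plus the pages it links to,
# one level deep, for offline reading. The URL is a placeholder.
wget --recursive --level=1 \
     --convert-links \
     --page-requisites \
     --adjust-extension \
     --no-parent \
     https://example.com/start.html
# --recursive --level=1 : follow the links on the start page, but no deeper
# --convert-links       : rewrite links so they point to the downloaded copies
# --page-requisites     : also fetch the images, CSS, etc. needed to render the pages
# --adjust-extension    : save pages with an .html extension
# --no-parent           : never ascend above the start page’s directory
```

The resulting folder could then be imported or indexed in DEVONthink, and a call like this can be wrapped in do shell script if you want to trigger it from AppleScript.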

Also, capturing a page that is generated at least partly by JavaScript stored on the server (as is, for example, Apple’s developer documentation) will always require an internet connection to display: the content is provided by the server when the page is loaded by the browser.

What I mean is a simple webpage with links to other pages (no further levels). The Download Manager pulls all the pages down, so this works well.

I was only hoping that the main page, whose link I used to start the download, would then have its links point to the downloaded subpages/folders, so that I can read the webpage, and its subpages, offline.

I read everything I could find on the download manager here before writing.

I am an ordinary user with no experience with scripts and was therefore looking for a solution within DT. No problem if it isn’t possible; I am not complaining. The result is already very close: I have all the files there, only the entry/main page linking to the downloaded folders is missing. I understand that there are complex scenarios with JavaScript etc., but my case was a simple two-layer scenario.

Thanks

Olaf

As suggested by @chrillek, check out and use wget and/or curl to do this.

Was additional information logged to the Log panel after verifying & repairing the database? Are you able to reproduce the issue?

Well, that’s what the tools I mentioned handle correctly (or at least wget does; I haven’t used curl for that yet).

That shouldn’t stop you from trying (which then leads to experience :wink:). There are tons of explanations on how to use curl and wget available online. It’s not rocket science, nor will it break your computer (although you might fill up your hard disk if you do not limit the search depth).
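For example (again only a sketch with a placeholder URL), these are the kinds of flags that keep a recursive download from getting out of hand:

```bash
# Sketch: a recursive download that cannot run away (placeholder URL).
wget --recursive --level=2 \
     --no-parent \
     --quota=200m \
     --wait=1 \
     https://example.com/start.html
# --level=2    : stop following links after two levels
# --no-parent  : stay below the starting directory
# --quota=200m : give up once roughly 200 MB have been downloaded
# --wait=1     : pause one second between requests to go easy on the server
```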

Some years ago I used a utility called SiteSucker that (it seems) will still work on macOS. It might be worth investigating:

https://ricks-apps.com/osx/sitesucker/index.html

I don’t remember much about it, except that it worked for what I wanted to do at the time!

Sheesh! I haven’t thought about SiteSucker in years :flushed::blush:

I’m getting really old :slight_smile:

Yes, I get the message to repair every time. I already added a screenshot of the log window above. Let me know if I can try anything else to resolve this.

You said it was easily repaired. What did you repair?

I understand that there are other tools, and thanks for the tips. I use DT primarily, every day, all the time, to gather material from the web, and I also use web archives frequently. DT covers virtually all of my needs for gathering material. So, when it came to archiving just the second layer of a simple webpage, I naturally hoped that it would do that as well. In a way, it does: the files are there; offline reading is just not as convenient because the main page does not link to those downloaded subpages.

The message says to repair the database (Global Inbox). I do that and then all is fine. It always finds one inconsistency after a download.

We can have a Zoom session to look at it. I can send a Zoom link to quickly meet up and reproduce the scenario.

Please choose Help > Report Bug while pressing the Alt modifier key and send the result to cgrunenberg - at - devon-technologies.com. This should be sufficient, thanks!

Done

Thank you for the logs! Are there still any copies of the Downloads group in the trash? That’s most likely causing the issue.

Yes, during the various attempts, trying different options, I deleted things, moving them to the trash. I take it that emptying the trash is a good idea then :slight_smile:

Just like the waste bin in your kitchen, routinely emptying the trash in your databases is advisable.

SiteSucker is still going strong. The regular version is available in the Mac App Store — and there’s a Pro version available from the developer. I’ve always had good luck with it.
