Importing a website for offline use

Hi there, I am trying to:

  1. Import a website
  2. Navigate it in DT3 offline (either with no network connection, or with the website down)
  3. Use it in DTTG
  4. Periodically re-import the website to get new content

So far I am stuck on step 2. 🙂

My understanding is that by importing a website, links to the pages on the website will actually load the downloaded files in DT, rather than connecting to the server. So far, I’m not seeing that - clicking links on the web page always tries to connect to the server.

What is the magic set of steps I need to do to truly have an offline archive of a website in DT3?

I am trying to do the same thing as this guy:

You mean like a whole site? Depending on the size of the site, that might be a challenge. And copyright questions might arise, too, since you are copying.
I’d probably not try to do that with DT but with (e.g.) wget. There’s also curl and other tools. They presumably convert links in the HTML pages so that they can be used locally.


A whole site might be only a couple dozen pages. No copyright issues here - it’s my content.

It’s not about downloading the pages locally, it’s about importing them into DEVONthink and DTTG.

Have you tried Subdirectory (Complete)?

If you want to “import” a website into DT, you have to “download” it locally. Otherwise, you’d just store bookmarks (aka URLs) in DT. So my suggestion is still valid: use wget, curl or something similar to mirror your site (or, even easier, since it is your content: Make a copy of the relevant pages in a local directory and adjust the internal links accordingly with a script). At least wget is able to change the links in the HTML documents so that they work relative to the local directory.
You can then index the files in DT or maybe import them. If it’s only a couple dozen pages, it’s done quickly.
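The wget approach above can be sketched end to end against a throwaway local server, which also demonstrates the link conversion being discussed. This is a sketch only: it assumes `wget` and `python3` are installed, and port 8123 is an arbitrary free-port choice.

```shell
# Sketch: mirror a tiny two-page site with wget and check that
# --convert-links rewrites the absolute href to a local relative path.
# Assumptions: wget and python3 available, port 8123 free.
site=$(mktemp -d)
cat > "$site/index.html" <<'EOF'
<html><body><a href="http://localhost:8123/all.html">all pages</a></body></html>
EOF
echo '<html><body>all</body></html>' > "$site/all.html"

# Serve the directory locally, much like the repro steps in this thread.
python3 -m http.server 8123 --directory "$site" >/dev/null 2>&1 &
server=$!
sleep 1

out=$(mktemp -d)
wget --quiet --recursive --no-parent --convert-links \
     --directory-prefix="$out" http://localhost:8123/

kill "$server"

# The mirrored index now links to the local copy, not to the server.
grep -o 'href="[^"]*"' "$out/localhost:8123/index.html"
```

After `--convert-links` runs, the href no longer mentions `http://localhost:8123`, so clicking it in a local viewer opens the downloaded file. This is exactly the behaviour the OP expects from DT’s importer.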

No, DT has a specific Import Website function:

[screenshot of the Import Website command]

DTTG doesn’t work with relative links (or at least not ../relative links) - it’s been this way for years. So it’s not sufficient to download a website with wget and index it. At the very least, you’d need to flatten the path structure so it’s compatible with DTTG.
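The flattening step could be sketched like this. It’s a hypothetical helper, not anything DT or DTTG provides: it assumes every page has a unique basename, and it ignores assets and name collisions entirely.

```shell
# Sketch: flatten a nested mirror into one directory so no "../"-style
# links remain. A stand-in nested mirror is built first for illustration.
src=$(mktemp -d)
mkdir -p "$src/sub"
echo '<a href="../index.html">up</a>' > "$src/sub/page.html"
echo '<a href="sub/page.html">down</a>' > "$src/index.html"

flat=$(mktemp -d)
# 1. Copy every HTML file into a single flat directory.
find "$src" -name '*.html' -exec cp {} "$flat" \;

# 2. Strip directory components from hrefs so all links are siblings.
#    (-i.bak works on both GNU and BSD/macOS sed.)
for f in "$flat"/*.html; do
  sed -i.bak -E 's,href="[^"]*/([^/"]+)",href="\1",g' "$f" && rm -f "$f.bak"
done

grep -h 'href' "$flat"/*.html
```

After this, `../index.html` becomes `index.html` and `sub/page.html` becomes `page.html`, so no link ever leaves the flat directory.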


Yes.

I have put together a GitHub repo with example HTML files to use. You’d need to run a web server and add that folder to the server document root.

So far, DEVONthink creates records for all of the pages. But when I click a link, it makes a new request to the web server rather than using the imported page.

Here are the posts where @cgrunenberg says it should be possible for links to render the downloaded files:

@cgrunenberg I have set up the most basic example project possible to demonstrate this. Here’s a video demonstrating the behavior:

If you’re really offline it should work as expected.

I don’t know what to tell you. The reason I’m asking for help is because it doesn’t work.

Here’s a follow-up video. I’ve disabled Wi-Fi and turned off the web server. DT still tries to connect to the web server instead of using the imported document:

I’m not seeing an issue here.

  • I set up some basic HTML pages in DEVONthink.
  • Exported to a ~/Sites directory.
  • Ran python -m SimpleHTTPServer 8000 and imported from localhost:8000 to a database.
  • Disabled WiFi and navigated without incident.

Well, I’m not sure what else to say / do here. If you or @cgrunenberg ask me for more information, I’ll be happy to provide what I can. As it stands, I’ve described the issue, given steps to replicate it, and posted two videos illustrating it. So while it may work for you, it doesn’t work for me, which is why I’m here asking for help.

I’m seeing the same behaviour.

Downloaded @padillac’s folder, started MAMP, and downloaded it via DEVONthink’s Download Manager with the settings from his video. Then I set macOS to offline and wondered why it still seemed to work over here. However, after I turned off MAMP I got this:

Might be that I did something wrong, though, never really used a local server.

The URL protocol is http. That will tell the browser to try to connect to an HTTP server, in this case running on the local machine and listening on port 8888.
If you turn this server off deliberately, what do you expect the browser to do except to report an error?
In my opinion, DT should download a website as a local tree of files. If one wants to see the website locally, the protocol has to be file, the server must be omitted and the rest of the URL must point to a local file: file:///Users/me/Downloads/tree.html. Alternatively, no protocol at all, only a local filename.
All URLs in the local files must also be relative to the local filesystem. An HTTP URL can’t be resolved unless a webserver is running on the host specified in this URL.
I have no idea how DT is downloading a website locally and what it does with embedded URLs. Therefore I still suggest using a tool like wget.

As a follow-up: I just tried it myself. Entered a URL in the Download Manager and selected “Subdirectory (Complete)” from the action menu. Which effectively downloaded the whole site (my bad) to my Downloads folder. Internal URLs adjusted and all, no problem viewing it locally (by clicking on the top-level HTML file in the Finder).
This seems to be different from what the OP did (though he was not very specific in his description).

There are two videos.

Right, I overlooked them because I usually ignore YouTube. However, one can clearly see in the first one that http URLs pointing to a local web server are downloaded. As I just explained, this cannot work for offline viewing, as shown in the second video.

I don’t know whether it could work; however, it seems clear that it should work:

From help:

All links within the site are modified so that they point to the downloaded images or other embedded objects. This ensures that the page/site can be displayed at any time.

The whole paragraph:

Website: Opens the Download Manager and downloads a complete web page/site for archiving and offline viewing. Make sure the download options are set correctly, especially the options that define which links DEVONthink should follow (if any). All links within the site are modified so that they point to the downloaded images or other embedded objects. This ensures that the page/site can be displayed at any time. By default, groups created by the Download Manager are excluded from tagging.

Of course the browser will report an error. However, my understanding based on the documentation and some forum posts from @cgrunenberg is that DEVONthink should use the imported documents when following a link from an imported file. So when I open index.html in DEVONthink and click the link to all.html, it should load the all.html that it has already imported, rather than making a request to the server.

I understand where you’re coming from; I just believe that you aren’t fully aware of the import site / Download Manager feature.

If I’ve misunderstood how the feature is supposed to work, then fine - but I’d like to hear that directly from @cgrunenberg / @BLUEFROG. They both claim that it works as I expect it to, it just appears to not be working for me or others.

Ahh… I have reproduced the issue here.

I watched the Navigation bar as I hovered over links. It was pointing to my localhost. After shutting off the SimpleHTTPServer, navigation is no longer possible.

The links aren’t relative; they still point to a server.
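One quick way to confirm that diagnosis is to grep a saved page for absolute hrefs; any match means clicking that link will contact a server instead of opening a local file. The sample page below is made up for illustration:

```shell
# Sketch: flag absolute http(s) links in a saved page. Any hit means
# navigation from this page depends on a live server.
page=$(mktemp)
cat > "$page" <<'EOF'
<a href="http://localhost:8000/all.html">all</a>
<a href="notes.html">notes</a>
EOF
grep -oE 'href="https?://[^"]*"' "$page"
# prints: href="http://localhost:8000/all.html"
```

A fully offline-viewable archive should produce no output here.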

Here’s the local structure…

[screenshot of the local folder structure]

Here’s the link…