1. Navigate it in DT3 offline (either with no network connection, or with the website down)
2. Use it in DTTG
3. Periodically re-import the website to get new content
So far I am stuck on 2
My understanding is that by importing a website, links to the pages on the website will actually load the downloaded files in DT, rather than connecting to the server. So far, I'm not seeing that - clicking links on the web page always tries to connect to the server.
What is the magic set of steps I need to do to truly have an offline archive of a website in DT3?
You mean like a whole site? Depending on the size of the site, that might be a challenge. And copyright questions might arise, too, since you are copying.
I'd probably not try to do that with DT but with (e.g.) wget. There's also curl and other tools. They presumably convert links in the HTML pages so that they can be used locally.
If you want to "import" a website into DT, you have to "download" it locally. Otherwise, you'd just store bookmarks (aka URLs) in DT. So my suggestion is still valid: use wget, curl or something similar to mirror your site (or, even easier, since it is your content: make a copy of the relevant pages in a local directory and adjust the internal links accordingly with a script). At least wget is able to change the links in the HTML documents so that they work relative to the local directory.
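Conceptually, the link rewriting that wget's `--convert-links` performs looks something like this sketch (the site root URL, function name, and regex-based approach are illustrative only - neither wget nor DT exposes anything like this):

```python
# Sketch: rewrite absolute links under a site root into local relative paths,
# roughly what wget --convert-links does. SITE_ROOT is an assumption.
import re

SITE_ROOT = "http://localhost:8888/"  # assumed base URL of the mirrored site

def localize_links(html: str, site_root: str = SITE_ROOT) -> str:
    """Rewrite href/src URLs under site_root to local relative paths."""
    root = site_root.rstrip("/") + "/"

    def repl(match: re.Match) -> str:
        attr, url = match.group(1), match.group(2)
        if url.startswith(root):
            return f'{attr}="{url[len(root):]}"'  # strip the server prefix
        return match.group(0)  # leave external links untouched

    return re.sub(r'(href|src)="([^"]+)"', repl, html)

page = '<a href="http://localhost:8888/all.html">All</a>'
print(localize_links(page))
```

After this kind of rewrite, following a link resolves to a file next to the current document instead of going back to the server.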
You can then index the files in DT or maybe import them. If it's only a couple dozen pages, it's done quickly.
DTTG doesn't work with relative links (or at least not ../relative links) - it's been this way for years. So it's not sufficient to download a website with wget and index it. At the very least, you'd need to flatten the path structure so it's compatible with DTTG.
I have put together a GitHub repo with example HTML files to use. You'd need to run a web server and add that folder to the server document root.
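For anyone reproducing this without MAMP, a throwaway local server along these lines should do (the directory and the auto-picked port are my choices; the video's setup used MAMP on port 8888):

```python
# Sketch: a minimal stand-in for MAMP, serving the example folder over HTTP.
# Serves the current directory on a free port; adjust directory/port to taste.
import functools
import threading
import urllib.request
from http.server import ThreadingHTTPServer, SimpleHTTPRequestHandler

handler = functools.partial(SimpleHTTPRequestHandler, directory=".")
server = ThreadingHTTPServer(("localhost", 0), handler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://localhost:{server.server_port}/"
status = urllib.request.urlopen(url).status
print(url, status)
server.shutdown()
```

Point DT's Download Manager at the printed URL while the server is up, then shut the server down to test offline behaviour.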
The results I get so far are that DEVONthink creates records for all of the pages. But when clicking links, it makes a new request to the web server rather than using the imported page.
Here are the posts where @cgrunenberg says it should be possible for links to render the downloaded files:
I don't know what to tell you. The reason I'm asking for help is that it doesn't work.
Here's a follow-up video. I've disabled Wi-Fi and turned off the web server. DT still tries to connect to the web server instead of using the imported document:
Well, I'm not sure what else to say / do here. If you or @cgrunenberg ask me for more information, I'll be happy to provide what I can. As it stands, I've described the issue, given steps to replicate it, and posted two videos illustrating it. So while it may work for you, it doesn't work for me, which is why I'm here asking for help.
Downloaded @padillac's folder, started MAMP, downloaded via DEVONthink's Download Manager with settings as in his video. Then set macOS to offline and wondered why it seemed to work over here. However, after I turned off MAMP I got this:
The URL protocol is http. That will tell the browser to try to connect to an HTTP server, in this case running on the local machine and listening on port 8888.
If you turn this server off deliberately, what do you expect the browser to do except to report an error?
In my opinion, DT should download a website as a local tree of files. If one wants to see the website locally, the protocol has to be file, the server must be omitted and the rest of the URL must point to a local file: file:///Users/me/Downloads/tree.html. Alternatively, no protocol at all, only a local filename.
All URLs in the local files must also be relative to the local filesystem. An HTTP URL can't be resolved unless a web server is running on the host specified in this URL.
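To illustrate the difference, Python's `pathlib` shows what a truly local URL looks like (the path is the one from the example above):

```python
# A file:// URL is resolved by the filesystem alone - no server involved.
from pathlib import Path

local = Path("/Users/me/Downloads/tree.html")
print(local.as_uri())  # file:///Users/me/Downloads/tree.html
```

A browser given this URL reads the file directly from disk, which is why offline viewing only works when links use the file scheme or plain relative filenames.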
I have no idea how DT is downloading a website locally and what it does with embedded URLs. Therefore I still suggest using a tool like wget.
As a follow-up: I just tried it myself. Entered a URL in the Download Manager and selected "Subdirectory (complete)" from the action menu, which effectively downloaded the whole site (my bad) to my Downloads folder. Internal URLs adjusted and all, no problem viewing it locally (by clicking on the top-level HTML file in the Finder).
This seems to be different from what the OP did (though he was not very specific in his description).
Right, I overlooked them because I usually ignore YouTube. However, one can clearly see in the first one that http URLs pointing to a local web server are downloaded. As I just explained, this cannot work for offline viewing, as the second video shows.
I don't know whether it could work; however, it seems clear that it should work:
From help:
All links within the site are modified so that they point to the downloaded images or other embedded objects. This ensures that the page/site can be displayed at any time.
The whole paragraph:
Website: Opens the Download Manager and downloads a complete web page/site for archiving and offline viewing. Make sure the download options are set correctly, especially the options that define which links DEVONthink should follow (if any). All links within the site are modified so that they point to the downloaded images or other embedded objects. This ensures that the page/site can be displayed at any time. By default, groups created by the Download Manager are excluded from tagging.
Of course the browser will report an error. However, my understanding based on the documentation and some forum posts from @cgrunenberg is that DEVONthink should use the imported documents when following a link from an imported file. So when I open index.html in DEVONthink and click the link to all.html, it should load the all.html that it has already imported, rather than making a request to the server.
I understand where you're coming from, I just believe that you aren't fully aware of the import site / download manager feature.
If I've misunderstood how the feature is supposed to work, then fine - but I'd like to hear that directly from @cgrunenberg / @BLUEFROG. They both claim that it works as I expect it to, it just appears to not be working for me or others.
I watched the Navigation bar as I hovered over links. It was pointing to my localhost. After shutting off the SimpleHTTPServer, navigation is no longer possible.
The links aren't relative; they still point to a server.