Importing a website for offline use

pete31 · January 9, 2021, 1:06pm

I don’t know whether it could work, however it seems clear that it should work:

From help:

All links within the site are modified so that they point to the downloaded images or other embedded objects. This ensures that the page/site can be displayed at any time.

The whole paragraph:

Website: Opens the Download Manager and downloads a complete web page/site for archiving and offline viewing. Make sure the download options are set correctly, especially the options that define which links DEVONthink should follow (if any). All links within the site are modified so that they point to the downloaded images or other embedded objects. This ensures that the page/site can be displayed at any time. By default, groups created by the Download Manager are excluded from tagging.

padillac · January 9, 2021, 5:54pm

Of course the browser will report an error. However, my understanding based on the documentation and some forum posts from @cgrunenberg is that DEVONthink should use the imported documents when following a link from an imported file. So when I open index.html in DEVONthink and click the link to all.html, it should load the all.html that it has already imported, rather than making a request to the server.

I understand where you’re coming from, I just believe that you aren’t fully aware of the import site / download manager feature.

If I’ve misunderstood how the feature is supposed to work, then fine - but I’d like to hear that directly from @cgrunenberg / @BLUEFROG. They both claim that it works as I expect it to, it just appears to not be working for me or others.

BLUEFROG · January 9, 2021, 5:56pm

Ahh… I have reproduced the issue here.

I watched the Navigation bar as I hovered over links. It was pointing to my localhost. After shutting off the SimpleHTTPServer, navigation is no longer possible.

The links aren’t pointing to relative links, but still to a server.

Here’s the local structure…

Here’s the link…

chrillek · January 10, 2021, 7:58am

Do you see the same problem when you import to the filesystem instead of a database? Maybe in the latter case, only bookmarks are imported, not files?

chrillek · January 10, 2021, 8:01am

I wasn’t until I tried it out. I have to admit that I find the GUI irritating and the results not consistent: importing to the filesystem seemed to work ok whereas importing into the database did not give the results promised by the documentation.

cgrunenberg · January 11, 2021, 8:52am

DEVONthink doesn’t change links while downloading (and never did), it only looks for items in your database(s) having the full absolute URL (e.g. after resolving relative links) while browsing and should use them if found. Please send me the database and I’ll have a look at it.

chrillek · January 11, 2021, 10:13am

In that case, I repeat my suggestion for the OP to use a command line tool that can change the URLs, eg wget.
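
To illustrate what such a URL rewrite amounts to (wget's `--convert-links` option is the usual way to get it), here is a minimal Python sketch that turns an absolute link into one relative to the page containing it. The `to_relative` helper is hypothetical and only meant to show the idea; it ignores query strings and fragments.

```python
import posixpath
from urllib.parse import urlsplit


def to_relative(page_url, target_url):
    """Return target_url expressed relative to the page that links to it.

    Links to a different scheme or host are returned unchanged.
    Query strings and fragments are ignored in this sketch.
    """
    page = urlsplit(page_url)
    target = urlsplit(target_url)
    if (page.scheme, page.netloc) != (target.scheme, target.netloc):
        return target_url
    # Directory of the linking page; fall back to the site root.
    page_dir = posixpath.dirname(page.path) or "/"
    return posixpath.relpath(target.path or "/", start=page_dir)


print(to_relative(
    "http://pat.local/dt-html-import-site/all.html",
    "http://pat.local/dt-html-import-site/subdir/page.html",
))
```

With links rewritten this way, a downloaded copy no longer depends on the original host name.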

padillac · January 11, 2021, 5:11pm

The URLs are fine. The question at hand is this part of behavior:

So you open the all.html page which has a link to subdir/page.html. When you click that link, DT should “resolve the relative link” to be “the full absolute URL” http://pat.local/dt-html-import-site/subdir/page.html, and since there’s a document that was downloaded that has that exact URL, use that item.
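
The expected lookup can be sketched in a few lines of Python. This only illustrates the behaviour described above and is not DEVONthink's actual code: the `imported` dictionary and the `find_imported` helper are hypothetical stand-ins for the database, and the item names are made up.

```python
from urllib.parse import urljoin

# Hypothetical stand-in for the database: maps each item's URL to the item.
imported = {
    "http://pat.local/dt-html-import-site/all.html": "all.html",
    "http://pat.local/dt-html-import-site/subdir/page.html": "page.html",
}


def find_imported(current_url, href):
    """Resolve href against the current page, then look up the result."""
    absolute = urljoin(current_url, href)
    # None means nothing was imported under that exact URL.
    return absolute, imported.get(absolute)


absolute, item = find_imported(
    "http://pat.local/dt-html-import-site/all.html", "subdir/page.html"
)
print(absolute)
print(item)
```

If the lookup finds an item, no request to the server should be needed.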

BLUEFROG · January 11, 2021, 5:40pm

I emailed you the test database I used.

chrillek · January 11, 2021, 5:48pm

I seem to be missing something here. Until now, I thought that you wanted to view downloaded HTML pages while you’re offline (cf. your first post here). Now you say that you want DT to “resolve the relative link” (i.e. subdir/page.html) to the absolute URL (i.e. http://host/something/subdir/page.html).
But this is, as far as I can tell, the exact behaviour DT shows now. With the obvious caveat that it can’t “resolve” this relative link when the relevant server is offline (either because it is turned off or because you have no connection).
I just tried it with a website here (mind you: one hosted outside of my local network – maybe that’s relevant?) and it works exactly as expected: If WLAN is off, DT displays the HTML pages as they were downloaded. So that seems to work as described in the documentation, at least in this case.
Note: Since @pete31 saw the same behaviour as @padillac, also with a local web server, maybe it is related to the fact that the URLs’ domain is .local? At least in @pete31’s example, I’m certain that the loopback interface will be used. Depending on the network layer, someone/something might take a shortcut there and handle server/connection detection differently than for a “real” interface?

chrillek · January 11, 2021, 6:09pm

The problem is the local webserver

I apologize for shouting
@BLUEFROG and I see no problem but @pete31 and @padillac do because @BLUEFROG and I apparently tried external websites (i.e. not hosted by a server running on the current machine).

I just tried to replicate @padillac’s setup with a local Apache and saw the exact same behaviour: DT displays the website ok as long as the server is running. If the server is turned off, DT complains about it not being available. I think that this is inconsistent: It is not relevant on which server the site is hosted.

So maybe @cgrunenberg could have a look into the code and figure out how the local interface (lo) and the other(s) (en0 etc.) are handled differently?

padillac · January 11, 2021, 7:29pm

Indeed. As described multiple times previously, the process is:

Use the Download Manager to import a site
Open one of the imported items
Click a link, and have DT open the imported document that corresponds to that link, rather than requesting it from the web server.

Does it work when you click a link on one of the pages? That’s the question of this thread. The individual pages all work fine. It’s when clicking on a link from one page to another that DEVONthink does not load the imported document.

Interesting observation, and one that I found plausible. However, I removed the pat.local entry from /etc/hosts and restarted the machine, and DT still tries to connect to pat.local (which now just times out because the domain doesn’t exist).

I will try it with an external site at some point though, to see if it behaves differently. fwiw, @BLUEFROG confirmed that he saw the same issue that I reported.

chrillek · January 11, 2021, 7:45pm

Yes for an external site.
No for a local site.

local is a special domain, so it might be handled differently than other TLDs
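
For context, .local is reserved for multicast DNS (RFC 6762). If that special status is what matters here, a small Python check can flag affected URLs before importing. This is only a sketch; the `uses_local_domain` helper is hypothetical and says nothing about how DEVONthink or WebKit treat such hosts.

```python
from urllib.parse import urlsplit


def uses_local_domain(url):
    """Return True if the URL's host is in the special .local domain."""
    # hostname is already lower-cased; drop a trailing dot if present.
    host = (urlsplit(url).hostname or "").rstrip(".")
    return host == "local" or host.endswith(".local")


print(uses_local_domain("http://pat.local/dt-html-import-site/all.html"))
print(uses_local_domain("http://example.com/index.html"))
```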

padillac · January 11, 2021, 8:17pm

Okay! I think we cracked it. I changed it to pat.chicken and it works as expected now. Thank you for digging into this.

padillac · January 11, 2021, 8:24pm

Well, the links appear not to work in DTTG, which was the whole point. Oh well. Hopefully DTTG3 adds this functionality.

(If DTTG2 should support this same functionality, please let me know)

cgrunenberg · January 12, 2021, 10:22am

The WebKit doesn’t seem to handle local and remote requests the same way but the next release will fix this.

chrillek · January 12, 2021, 11:23am

The joys of opaque frameworks

CanadaSteven · June 17, 2021, 4:34am

Ok, I read, and reread this entire thread. The solution seems to be pat.chicken

Seriously, I want to do the same as padillac wanted to do,

Import a website
Navigate it in DT3 offline (either with no network connection, or with the website down)

I imported a website, subdirectory (complete), Files [all options selected], follow links in subdirectories

When I selected the main page [html], and turned off wifi, I had the same issue, with the page showing “The internet connection appears to be offline”

I am using DEVONthink 3, on my iMac. I do not know what he meant by changing from pat.local to pat.chicken - and I do not think that applies for me, though it could.

Am I missing the obvious solution?

cgrunenberg · June 17, 2021, 6:36am

Which version of macOS do you use and what’s the URL of the page?

chrillek · June 17, 2021, 7:36am

If (and that is actually a very small if) the website uses JavaScript to load parts of the page, this behaviour is expected and completely normal. One of the consequences of Web 2.0, I’d say.

@pete31 shared an example of the Apple developer documentation recently. It consists of just a bare HTML scaffolding, and every single part is filled in at “run time” (aka when the page is opened) by JavaScript: the documentation is retrieved from a server, which obviously will not work when your machine is offline.

So “importing a website for offline use” will probably not do what one would naively expect in many cases nowadays.
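
A rough way to spot such pages is to check whether the downloaded HTML contains scripts but hardly any visible text. The Python sketch below is a heuristic only: the `looks_script_driven` helper and its `min_text` threshold of 200 characters are arbitrary assumptions, and a page can load content via JavaScript without tripping it.

```python
from html.parser import HTMLParser


class _TextAndScripts(HTMLParser):
    """Counts <script> tags and visible text characters."""

    def __init__(self):
        super().__init__()
        self.scripts = 0
        self.text_chars = 0
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Text inside <script>/<style> is not visible content.
        if not self._skip:
            self.text_chars += len(data.strip())


def looks_script_driven(html, min_text=200):
    """Heuristic: scripts present, but fewer than min_text visible chars."""
    parser = _TextAndScripts()
    parser.feed(html)
    parser.close()
    return parser.scripts > 0 and parser.text_chars < min_text


scaffold = ('<html><head><title>Docs</title></head><body>'
            '<div id="app"></div><script src="app.js"></script></body></html>')
print(looks_script_driven(scaffold))
```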