HTML documents behave differently depending on whether their URL property is set

SOLVED

TL;DR WebKit, an Apple framework which DT relies on to display HTML-based files, has a safety feature that prevents web pages from accessing local resources. The URL property in DT is used to determine whether the document should be treated as a “web” page.


I was testing my own web clipper when I found this weird behavior in DT.


Description

We have an HTML document in DT that links to one or more external CSS stylesheets on the local file system. The issue concerns the document’s URL property, shown in the right-side inspector panel.

An example HTML document is given – use your own CSS file path in the <link> tag. The behavior can be recreated by adding a CSS link tag to any HTML document I have.

example.zip (10.8 KB)

a) If the URL property is set to anything containing ://, e.g. https://www.google.com, the external CSS stylesheet(s) will not be loaded when viewing the document. Upon opening the document, a loading progress bar will be shown briefly at the top of the window, despite the document not containing anything that requests content from a web location.

b) If the URL property is blank, or set to a string that is not a URL (e.g. “document”), the external CSS stylesheet(s) will be recognized, and the document displayed correctly. No loading progress bar is shown; the HTML is rendered in an instant.

c) If I use “open with” to open the document in Safari, it is always displayed correctly, no matter the value of the URL property. This proves the validity of the HTML source code.

It’s not the end of the story, though. If I try converting the HTML document into Webarchive, the result depends on whether the URL property of the HTML document is set at the time of conversion.

d) Convert the HTML document in state a), i.e. with a legitimate URL property. Judging from the file size, the linked CSS stylesheet(s) are not embedded during conversion. No matter what I do to the resulting Webarchive document (removing the URL property, or messing with the HTML source), it will never load any external CSS.

e) Convert the HTML document in state b), i.e. with a blank or illegitimate URL property. The CSS stylesheet(s) are embedded and loaded correctly upon viewing, and everything is just fine in the resulting Webarchive document. I can change the URL property of the Webarchive to anything, but it will revert to “file:///” on random occasions, as I reported to Jim in an email support ticket some time ago.


I don’t know whether this is intentional behavior or not, but it certainly looks like a bug to my non-programmer eyes. It does affect my workflow: many of my older Webarchives cannot be re-styled, seemingly for this reason, and I have to avoid using the built-in URL property in my web clipping workflow for the time being.

Would love to know if there is a solution and/or a possible bug fix.

Thanks in advance.


DEVONthink Pro 3.9.4, M1 Mac running macOS 14.0

The URL of HTML pages is used as the base URL to load resources, e.g. for clipped/downloaded HTML pages, and to resolve links. But you can avoid this issue by using fully qualified file URLs, e.g. instead of href="/Users/Username/Documents/test.css" use href="file:///Users/Username/Documents/test.css"
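
For anyone generating the HTML from a script, a minimal sketch of how such an absolute path could be turned into a fully qualified file URL. The helper name `to_file_url` is hypothetical and the path is the example path from above; this only shows the string conversion, not anything DEVONthink does.

```python
from urllib.parse import quote


def to_file_url(absolute_path: str) -> str:
    """Turn an absolute POSIX path into a file:// URL (hypothetical helper)."""
    if not absolute_path.startswith("/"):
        raise ValueError("expected an absolute path starting with '/'")
    # quote() percent-encodes spaces and other unsafe characters but keeps "/".
    return "file://" + quote(absolute_path)


# Example path taken from the post above.
print(to_file_url("/Users/Username/Documents/test.css"))
```

The result can then be used as the `href` value of the `<link>` tag.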

Thanks for the reply. I tested this before posting. Unfortunately it does not solve the issue. Populating the URL property field still results in the external CSS stylesheet not loading.

It would be useful to see the HTML & CSS files you used.

test-html-and-css.zip (11.5 KB)

The CSS contains only one line body {background-color: black !important;}. The URL I use to populate the URL property field is https://www.google.com .

test2.zip (12.8 KB)

This is what I used in this screen recording.

You have
href="/Users/meow/Documents/fullblack.css"
in your HTML. That’s, as @cgrunenberg already said, not a fully qualified URL since the protocol is missing.

Wrong code, correct result.

I gratefully agree with your and @cgrunenberg 's point that a fully qualified link is preferable. The thing is, changing it to a fully qualified link does not change how things work in my a) to e) scenarios.

DT is not rendering the HTML by itself but relies on Apple’s framework.
In principle, fully qualified URLs must work in WebKit, or a lot of users would already be making a lot of noise about it.

You might want to look at the source of the HTML in DT to check when the CSS is included and when the default CSS is added to the code. Also, WebKit caches quite aggressively, so that restarting DT might change things.

Thank you for the suggestions. I tried emptying the cache multiple times, restarting DT, rebooting, and even creating the whole document via AppleScript. Scenario a) still does not go away.

There is also the issue I found months ago that some of my Webarchive documents in DT would see their URL property changed to “file:///” without any particular prompt from the user. Today I found out that the Webarchive issue – scenario d) and e) – is related to the HTML issue described in this thread. This makes me feel that the culprit lies with DT, or the way DT hands information to the frameworks, although I may very well be wrong :grinning:

Actually DEVONthink doesn’t modify files or their properties on its own as long as you do not e.g. edit them. Did you upgrade macOS in the meantime? This might affect the WebKit handling too.

I have been living with the Webarchive issue in both macOS 13.x and 14.0 . For some time I (and perhaps Jim @BLUEFROG ) have had no idea how that would be possible, in part because I was unable to reproduce consistently.

Now I’m able to reproduce this Webarchive issue consistently from scenario e).

  • Convert an HTML document (importantly, with its URL property unset/blank) into Webarchive.
  • The converted Webarchive has its URL property blank as well, which is nothing unexpected.
  • In the normal view (not source mode or side-by-side), edit the Webarchive content in any way.
  • Press Command+S to save the changes.
  • The URL property of the Webarchive automatically changes into file:/// !

I doubt this is caused by WebKit – is WebKit actually capable of changing the URL property of a document inside DT?

The web archive is indeed changed by editing as expected but not by DEVONthink on its own. When updating a web archive, the URL property is then automatically set to the internal URL of the web archive (due to former requests to ensure consistency of these two URLs).

But in this case, with no URL and an invalid file URL, this doesn’t make sense; the next release will fix this.

BTW:
For long-term storage, web archives are actually not recommended as they’re limited to Apple’s platforms. In addition, web archives created on current macOS versions are not always compatible with older macOS versions.

Thanks for the clarification! Glad to hear that there is a fix coming.

I understand very much that .webarchive is not an open format. However I’ve decided to stick with it due to my peculiar needs:

  • The content should be editable, so PDF is not suitable;
  • Images should be stored locally (due to web censorship concerns where I live), so plain HTML is not very much suitable;
  • The content should be easily re-styled, and with support for moderately complex light/dark view stylesheets, so RTF and Formatted Note are eliminated.

And now the only suitable option is Webarchive. It’s not perfect, but it gets the job done.

I’ll perhaps try in the future to make a workflow embedding Base64 images inside .HTML, or keeping them in a separate DT folder. Grateful for the advice anyway :smile:

One more thing – will the fix also address my initial concern regarding HTML documents, that is, scenario a) ? :wink:

No. The URL property is used as the base URL as intended.

Base64 will increase the size of your HTML tremendously. Why not use something like wget or curl to download the html with the images locally?
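
To put a rough number on the Base64 overhead, here is a minimal sketch. Base64 encodes every 3 bytes as 4 ASCII characters, so an embedded image is about a third larger than the original file. The 300 KB of random bytes are only a stand-in for a real image, and the `image/png` MIME type is an assumption.

```python
import base64
import os

# Stand-in for a real image file: 300,000 random bytes (not an actual PNG).
raw = os.urandom(300_000)

# Base64 turns each group of 3 bytes into 4 ASCII characters.
encoded = base64.b64encode(raw)

# This is the kind of string that would go into <img src="..."> in the HTML.
data_uri = "data:image/png;base64," + encoded.decode("ascii")

print(f"raw: {len(raw)} bytes, encoded: {len(encoded)} bytes")
print(f"growth factor: {len(encoded) / len(raw):.2f}")
```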

So my understanding is that the HTML <head> element in the local document will be overridden by something in the “base URL”, and therefore the CSS link inside the local document gets ignored during rendering. Is that true?

No. The base URL is used internally by WebKit to resolve URLs that are not fully qualified (e.g. relative URLs or URLs without a scheme).
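
As a rough illustration of that resolution step, here is a minimal sketch using `urljoin` from Python’s standard library. It only approximates what WebKit does internally; the base URL and the CSS paths are the ones quoted earlier in this thread.

```python
from urllib.parse import urljoin

# Value of the document's URL property, used as the base URL.
base = "https://www.google.com"

# The same stylesheet, once without a scheme and once fully qualified.
css_unqualified = "/Users/meow/Documents/fullblack.css"
css_qualified = "file:///Users/meow/Documents/fullblack.css"

# A path without a scheme is resolved against the base URL.
resolved_unqualified = urljoin(base, css_unqualified)

# A fully qualified URL is left unchanged by resolution.
resolved_qualified = urljoin(base, css_qualified)

print(resolved_unqualified)
print(resolved_qualified)
```

In this model the unqualified path ends up pointing at the web server, while the `file://` link is left as it is.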

Sorry but it seems a misunderstanding has occurred.

To rephrase my initial concern – scenario a) :

The ❖ URL property (see below) of an HTML document …

Screenshot 2023-11-07 at 20.25.43

… affects whether a ▲ CSS link (see below) within the document …

… is used to render the HTML document.

  • If the ❖ URL property value is blank or invalid, the ▲ CSS link will be used to render the HTML document.
  • If the ❖ URL property value is a valid URL – e.g. https://www.google.com or https://wwwwwwww.goooooooogle.com – the ▲ CSS link will not be used.

Whether the ▲ CSS link is fully qualified or not does not affect the correlation between ❖ URL property value and ▲ CSS styling behavior.

So I’m wondering if this correlation, which does not seem intuitive to me, will also be addressed? TIA for your patience and expertise. :blush:


Update 10 minutes later

I just did more tests.

Setting the ❖ URL property value to anything starting with file:/// (different from the path to the CSS file) does not prevent the ▲ CSS link from being used. Setting ❖ to anything starting with https:// does, however.

Does this mean that merely adding the protocol file:// is not enough to make a file path “fully qualified” in the eyes of WebKit?
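
One way to read these results, assuming the explanation in the TL;DR at the top of the thread: the `file://` link is already fully qualified, and what changes is whether the document itself counts as a web page. The toy model below sketches that rule. `may_load` is a hypothetical name, and this is not WebKit’s actual implementation.

```python
from urllib.parse import urlsplit


def may_load(document_url: str, resource_url: str) -> bool:
    """Toy model (assumption): only local documents may load local files."""
    doc_scheme = urlsplit(document_url).scheme.lower()
    res_scheme = urlsplit(resource_url).scheme.lower()

    # This sketch only models access to local files; anything else is allowed.
    if res_scheme != "file":
        return True

    # A blank or non-URL property has no scheme, so the document stays "local".
    return doc_scheme in ("", "file")


css = "file:///Users/meow/Documents/fullblack.css"
for url_property in ("", "document", "file:///anything", "https://www.google.com"):
    print(repr(url_property), may_load(url_property, css))
```

Under this assumption, only the `https://` value blocks the stylesheet, which matches scenarios a) and b) and the `file:///` test above.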

Thanks. wget and curl are new territories for me :grinning: I’ll try when I have time.