What to (not) expect when saving HTML documents in DEVONthink

chrillek · July 10, 2021, 12:47pm

This is a TL;DR attempt at collecting different aspects concerning saving of HTML in DT/DTTG in one place.

DT and DTTG offer some options to save HTML documents:

Bookmark
Markdown (MD)
Webarchive
PDF (single and multi-page)
Text only
RTF
HTML
Formatted Note

Before talking about the formats in detail, it is important to determine what shall be achieved by saving the HTML document, which is a bit more complicated than it may sound at first. Some years ago, an HTML document was just that: a static file consisting of HTML elements and possibly referencing content on the same or other servers (like images). A document in your browser looked pretty much the same today as it did three weeks ago.

HTML documents are not static anymore

Today’s HTML documents are very different from that. Parts of them are generated by JavaScript on the fly (or even most of the document is). To do that, the script code fetches data from a server, and this data might change at any time. So a document in your browser might look very different three weeks from today, without any changes in the HTML proper.

So first you have to ask yourself: Do I want to save the state of this web page as it is today, regardless of what it might become in the future? If the answer is „yes“, you can exclude all dynamic formats for saving, i.e. Bookmark, Webarchive, HTML, and Formatted Note (which is more or less HTML, too).

What if I want to keep the layout?

Another question that many people find important is the layout. Do you want the saved HTML document to appear as it does in the browser? If your answer is „yes“, you might want to reconsider what „in the browser“ means. Many web sites adapt their layout to the size of the browser’s window. Which means that a document you see on your iPhone can look very different from the same document displayed in your desktop browser. So, if you save it in DTTG, do you want to save the iOS or the desktop version? Or the one for iPad?

Then there are physical differences between different formats: HTML can drop down a menu if you hover over it with your mouse – that’s not possible in any of the text formats available for saving (RTF, Markdown, Text only, Formatted Note, PDF). Also, screens allow different graphical effects like (semi-)transparent overlays. PDF simply can’t do that, nor can Markdown. Animations will only work in an HTML-based format like Webarchive, HTML and (maybe) Formatted Note.

What about images?

Related to this and to the first question are the images. They are hardly ever really a part of the HTML document, but just drawn into it by the browser when it loads the document. In that respect, they’re very much like links: the HTML element img contains a src attribute whose URL tells the browser where to find the image. It does not tell the browser what this image actually is. So if the document contains something like <img src="https://example.com/img.png">, you might very well see a cat today and a dog tomorrow – if the file img.png has been exchanged for another one. The same, of course, goes for links: They can point to this document now and to another one tomorrow.

As a side note: It is technically possible to include images into the HTML document with a data URL. This is very rarely done because data URLs tend to be huge, so the amount of data to be downloaded grows, which nobody wants. If you’re adventurous, you could think about doing that for the local copy of an HTML document though.
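To make the data URL idea a bit more concrete, here is a minimal Python sketch. The function name to_data_url and the eight placeholder bytes are my own assumptions for illustration; a real script would read the bytes from an actual image file.

```python
import base64
import mimetypes


def to_data_url(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a data URL usable in an img src attribute."""
    # Guess the MIME type from the file name; fall back to a generic type.
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None:
        mime_type = "application/octet-stream"
    # Base64 turns every 3 bytes into 4 characters, hence the size increase.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes standing in for the content of a real image file.
placeholder_png = b"\x89PNG\r\n\x1a\n"
print(f'<img src="{to_data_url(placeholder_png, "img.png")}">')
```

Base64 encoding makes the embedded image roughly a third larger than the original file, which is one reason why data URLs are rarely used on the web.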

Which means that even if you capture the state of a web document today, what you actually see when you look at the same HTML file tomorrow might be different. This problem can be solved for images by downloading them together with the document and changing all src attributes to point to these local copies. This is, of course, not possible with normal links (unless you’d want to download the whole internet to your local machine).
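The rewriting part of that could look roughly like the following Python sketch. It is only an illustration under assumptions: the function name localize_images and the folder name images are made up, the regular expression handles only quoted src attributes, two images with the same file name would collide, and the actual download is left out.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Matches a quoted src attribute inside an img element (a simplification).
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)


def localize_images(html: str, image_dir: str = "images"):
    """Point every img src at a local file and report what needs downloading."""
    downloads = {}  # maps original URL -> local path

    def replace(match: re.Match) -> str:
        prefix, quote, url = match.groups()
        if url.startswith("data:"):
            return match.group(0)  # already embedded, nothing to download
        # Use the last path segment of the URL as the local file name.
        name = posixpath.basename(urlsplit(url).path) or "image"
        local_path = f"{image_dir}/{name}"
        downloads[url] = local_path
        return f"{prefix}{quote}{local_path}{quote}"

    return IMG_SRC.sub(replace, html), downloads


page = '<p><img src="https://example.com/img.png" alt="cat"></p>'
new_page, to_download = localize_images(page)
print(new_page)
print(to_download)
```

The returned dictionary lists which URL would have to be saved under which local path; fetching those files is a separate step that is not shown here.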

So what is the best format to save HTML to?

This all amounts to one thing: Technically, you can’t reliably create a copy of an HTML document that remains unchanged over time and faithfully reflects the layout. If you want immutability, you have to go with PDF, but then you might lose certain layout features. If you want all layout features, you have to go with HTML, but then you lose immutability. There is no single format that works for all requirements.

There’s also the famous “clutter free” option for PDF and HTML. I’m not too fond of this for two reasons:

first, it sends the document to a third-party server. That goes against my idea of privacy (YMMV, of course, and DT/DTTG are most probably not the evil guys)
second, and more importantly: It is neither clear what this server does nor can one influence it.

Bookmarks

The simplest method to „save“ a document, which actually saves nothing except the URL. Whenever you open the bookmark, you’ll see the current state of the document. Saves some local space, takes more time when viewing because the whole document has to be downloaded.

Webarchive

A proprietary invention of Apple, which means that it is useless on any other platform. Apart from that, many people seem to like it because it encapsulates the HTML document with its images in one compact format. It is not quite clear though if DT actually uses these local copies of the images or tries to download them again when you open the Webarchive.

Markdown

A very portable and compact format. Images pose a problem though, because Markdown only stores links to them. Which puts the burden of keeping images together with the Markdown document on you. You’ll also lose most of the original layout.
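To see which images a saved Markdown document still pulls from the web, something like this Python sketch might help. The function name remote_images is an assumption of mine, and the pattern only covers inline images of the form ![alt](target), not reference-style images or raw HTML img elements.

```python
import re

# Matches inline images of the form ![alt text](target "optional title").
MD_IMAGE = re.compile(r'!\[([^\]]*)\]\(\s*([^)\s]+)(?:\s+"[^"]*")?\s*\)')


def remote_images(markdown: str) -> list:
    """Return the targets of inline images that point to a web server."""
    targets = [match.group(2) for match in MD_IMAGE.finditer(markdown)]
    return [t for t in targets if t.startswith(("http://", "https://"))]


note = (
    "Some text\n\n"
    "![A cat](https://example.com/img.png)\n\n"
    '![Local copy](assets/dog.png "Dog")\n'
)
for url in remote_images(note):
    print("still on the web:", url)
```

Every URL it reports is an image that can change or vanish independently of the Markdown file.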

Text only

Useful only if you don’t care for either images or layout.

PDF

Saves the current state of the document, not mutable. Cannot reflect all the graphical subtleties like transparency, animations, drop-downs. Also, a print style sheet on the web site can thoroughly change the layout of the PDF vs. the original one. E.g., navigational elements might be removed, multi-column layout can be changed to single-column layout etc.

Outside the box

Another possibility would be to mirror the web page or site you want to archive. This can be done with a tool like wget, which creates a local copy of the web site’s data (i.e. images, CSS, HTML documents, scripts etc.) and can adjust all the links in the documents to point to your local copy. curl can fetch the individual files, but it does not follow links recursively on its own. Mirroring is similar to what the web archive does, so it amounts to reinventing a wheel that is already spinning.
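As a rough sketch of such a mirroring call, the following Python snippet only assembles a wget command line and prints it. The function name wget_mirror_command, the target folder mirror and the example URL are placeholders I made up; the options themselves are regular wget options, and wget has to be installed for the command to be of any use.

```python
import shlex


def wget_mirror_command(url: str, target_dir: str = "mirror") -> list:
    """Build a wget call that mirrors a page together with its assets."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--page-requisites",   # also fetch images, CSS and scripts
        "--convert-links",     # rewrite links to point to the local copies
        "--adjust-extension",  # save HTML/CSS files with matching extensions
        "--no-parent",         # never ascend above the starting directory
        f"--directory-prefix={target_dir}",
        url,
    ]


command = wget_mirror_command("https://example.com/article/")
print(shlex.join(command))
# To actually run it: import subprocess; subprocess.run(command, check=True)
```

Note that content generated by JavaScript on the fly will not be captured this way, since wget does not execute scripts.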

DTLow · July 10, 2021, 12:53pm

A point of confusion on the title
Your discussion is about saving web pages

My preferred format for notes/documents is formatted note (HTML)
I use Web Archive for saving web pages

chrillek · July 10, 2021, 1:10pm

Thanks, I amended that, but I stick with HTML documents, because that’s what they are.

mksBelper · July 10, 2021, 7:53pm

Thanks so much for this, @chrillek … extremely useful!

Mkcmobile · July 17, 2021, 2:56pm

Thanks for the thorough discussion of the options. I find myself continually saving to DTTG through all the various options on my iPad because the results seem to vary considerably depending on the web source. A very slow way indeed to capture, but I don’t trust that I’ll capture in a way I prefer unless I try each one.

I find I prefer markdown, but had also become worried about the need to reference images in the file on the internet rather than local, which as you mention over time could change or even no longer work. An option to save markdown images locally would be a potential solution. I use the web archive with or without clutter but as you mention, DTTG appears to refresh from the web rather than local (although that may be incorrect).

It’s also frustrating when the few images in the article you’d like to retain are sometimes lost in the capture if you use “clutter free”; if you don’t, you end up with the “clutter”.

I default to PDF (with clutter) as the failsafe option.

chrillek · July 17, 2021, 3:04pm

Just a short update: There’s a Markdown extension for Visual Studio Code (which is free on all platforms) that offers to save images as inline data when “printing” (aka converting) to HTML. As I said before, this will blow up the file size considerably, but since the images are in the same file as the HTML itself, there’s no additional administrative burden here.
I suppose that this could be scripted in DT (not DTTG, though), too.