Yes, I have already read this, where you talk about the problems with capturing modern dynamic web pages.
What I meant was that I wasn’t sure what the proper mental model is for the .webarchive format itself. Is it a file? I’m not sure; my impression at this point was “No”. Well, is it really a collection of other files, a bundle/package like .dtBase2 or .app? I can’t right-click and choose “Show Contents”.
I recalled having opened them in a text editor before, but the last time I tried, both CotEditor and Sublime Text had trouble interpreting the contents, and I couldn’t figure out the correct settings.
I just now tried with BBEdit, and that must be what I used previously. It has no problem showing the contents, without me doing anything. Here I see a .plist XML document, and that clears things up for me. A single file:
Web Archive .plist structure
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>WebMainResource</key>
    <dict>
        <key>WebResourceData</key>
        <data>
        ...
        </data>
        <key>WebResourceFrameName</key>
        <string></string>
        <key>WebResourceMIMEType</key>
        <string>text/html</string>
        <key>WebResourceTextEncodingName</key>
        <string>UTF-8</string>
        <key>WebResourceURL</key>
        <string>https://...</string>
    </dict>
    <key>WebSubresources</key>
    <array>
        <dict>
            <key>WebResourceData</key>
            <data>
            ...
            </data>
            <key>WebResourceMIMEType</key>
            <string>image/gif</string>
            <key>WebResourceResponse</key>
            <data>
            </data>
            <key>WebResourceURL</key>
            <string>https://...</string>
        </dict>
        <dict>
        ...
        </dict>
    </array>
</dict>
</plist>
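If the mental model is "one property list holding a main resource plus an array of subresources", then a few lines of Python should be enough to peek inside one. This is only a rough sketch under that assumption, not a reference implementation: the function names and the example.com sample data are made up for illustration, and only the keys shown in the listing above are read. As far as I know, Safari usually writes the file as a binary plist rather than XML, which may be why some editors struggle with it; `plistlib` detects either form on its own.

```python
import plistlib


def load_webarchive(path):
    """Read a .webarchive file (XML or binary plist) into a plain dict."""
    with open(path, "rb") as fh:
        return plistlib.load(fh)  # format is auto-detected


def summarize_webarchive(archive):
    """Return the main resource's metadata and a list of subresources.

    Only keys that appear in the listing above are used; missing keys
    fall back to None or empty data instead of raising.
    """
    main = archive.get("WebMainResource", {})
    subresources = archive.get("WebSubresources", [])
    return {
        "url": main.get("WebResourceURL"),
        "mime_type": main.get("WebResourceMIMEType"),
        "encoding": main.get("WebResourceTextEncodingName"),
        "main_bytes": len(main.get("WebResourceData", b"")),
        "subresources": [
            (
                sub.get("WebResourceURL"),
                sub.get("WebResourceMIMEType"),
                len(sub.get("WebResourceData", b"")),
            )
            for sub in subresources
        ],
    }


# Hypothetical in-memory sample mirroring the structure above, so the
# sketch can be tried without a real .webarchive file on disk.
SAMPLE = {
    "WebMainResource": {
        "WebResourceData": b"<html></html>",
        "WebResourceFrameName": "",
        "WebResourceMIMEType": "text/html",
        "WebResourceTextEncodingName": "UTF-8",
        "WebResourceURL": "https://example.com/",
    },
    "WebSubresources": [
        {
            "WebResourceData": b"GIF89a",
            "WebResourceMIMEType": "image/gif",
            "WebResourceURL": "https://example.com/pixel.gif",
        }
    ],
}

if __name__ == "__main__":
    # Round-trip through the binary format to show that loading does not
    # depend on the file being XML.
    blob = plistlib.dumps(SAMPLE, fmt=plistlib.FMT_BINARY)
    print(summarize_webarchive(plistlib.loads(blob)))
```

With a real file, something like `summarize_webarchive(load_webarchive("page.webarchive"))` would be the equivalent call, where the file name is a placeholder.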
Re: Omnivore. Omnivore is a Read-It-Later service. Think an advanced version of Instapaper. I mainly used it for the Obsidian plugin, as an alternative to DEVONthink’s web clipper, but I also found the web app nice to read in. I think for many it is an alternative to Readwise. A central hub for your reading, with an API to integrate it with other applications and services – for getting content in, and exporting/syncing the highlights and notes.
(The iOS app also had a pretty amazing Text-to-Speech engine. They are shutting down to join the company that developed it, ElevenLabs.)
For the web content you feed it, the whole point is to remove all the cruft and just get the raw text + images and some metadata (author, date, etc.). That’s why I assumed there was no problem with external content, as I couldn’t think of any besides images.