Omnivore to DEVONthink Importer

Greetings,

On October 30th, the Omnivore team announced they will shut down [1] their free-hosted service on November 30th. Until then, you can download your entire Omnivore library as a zip archive.

As a trial user of DEVONthink who is looking for an alternative to Omnivore, I wrote a script [2] that imports articles (in Web Archive format) and highlights (in Markdown format) to DEVONthink’s Global Inbox.

Choose “Export Data” at Omnivore, download the zip archive, unzip it, run this script, and specify the extracted directory; all data will be imported into DEVONthink.

It might be a very niche script, but I’m happy if it helps someone affected by Omnivore’s shutdown.

[1] Details on Omnivore shutting down - Omnivore Blog
blog.omnivore.app/p/details-on-omnivore-shutting-down

[2] github.com/rinodrops/omnivore-to-devonthink

Rino


Oh bummer! I hadn’t seen, thanks for making me aware. They already limit the size of a free library, but it makes sense that it can’t stay free forever. Their costs must increase a lot as they become more popular.

Thanks for sharing the script. If you don’t know, there is another option (skipping the webarchive format): Omnivore has an official plugin for Obsidian. This plugin syncs/downloads your Omnivore library (all items, or filtered) to Obsidian as markdown files. I have been happy to use this, as I found Omnivore’s parser to work better on some sites than DEVONthink’s web clipper.

The plugin lets you customise the markdown output quite a lot, including MMD metadata headers and highlights/annotations & notes. They have some documentation here:

Also worth highlighting: Omnivore is open source. So if you have the knowledge and a fitting setup, you can install it on your own server.

Edit: it helps to read the blog post :sweat_smile: At first I thought they were just removing the free tier, but I see they’re shutting down the service altogether. And they themselves note that it is open source and possible to self-host…

I assume the Obsidian plugin works until they shut down their servers, and can probably be set up to work with a self-hosted instance. The Omnivore team is unlikely to develop it further, but maybe someone in the Obsidian community will continue/fork it if it’s popular enough.


Thanks for sharing this. I’m sure there are Omnivore users that may need such a thing.


Thank you for letting me know that Obsidian is another read-it-later option. I also wrote a plugin for Joplin that interacts directly with the Omnivore API using GraphQL. Note-taking apps like Obsidian might be able to support self-hosted Omnivore instances, but it might take time, since Omnivore looks heavily dependent on a specific configuration, omnivore.app.

Yes, they will shut down entirely. If they were only shutting down the free tier, I would pay…

Let me explain why I chose Web Archive for this task.
I found that Omnivore’s articles, whether obtained via the API or via the recently developed export function, contain external links (e.g., to images) that point to Omnivore’s service. Since the service is shutting down, these links will break immediately after termination. So every resource linked from the HTML record should be downloaded and packaged into the note in DEVONthink. Because Omnivore’s articles are web-based, I chose Web Archive as the format.
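To sketch the first step of that idea (the names `ImageCollector` and `image_urls` are mine for illustration, not taken from the actual script), collecting the image URLs an article’s HTML refers to could look like this, so each one can then be downloaded and packaged before the links go dead:

```python
from html.parser import HTMLParser

# Hypothetical sketch: walk an article's HTML and collect the URLs of all
# images it links to (only <img src="..."> here; a real importer would also
# need <picture>, CSS, etc.).
class ImageCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.srcs: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)

def image_urls(html: str) -> list[str]:
    parser = ImageCollector()
    parser.feed(html)
    return parser.srcs
```

Each collected URL would then be fetched and stored alongside the note, which is essentially what the Web Archive format bundles up for you.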

I assumed you chose Web Archive because of the images, but I wasn’t sure if there were other resources or reasons I’m overlooking. (Cross-reference links from the highlights?) I had noticed that all image links were running through some proxy on their server… Or at least that’s how I understand it, with my limited knowledge. They look like this:

https://proxy-prod.omnivore-image-cache.app/
{parameters,identifier}/https://{original URL}

Example:
https://proxy-prod.omnivore-image-cache.app/
1030x733,smqMuC6Wb-Py0Mz8u0ErLqRpAZ1qteUQmETAJ2O0RTvU/
https://tlacuilollicom.files.wordpress.com/2021/03/return-to-aztlan.jpg?w=1030

Notice that they actually include the original URL. Unless the image is gone, you can still find it.
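Since the proxy URL embeds the original, recovering it is mostly string slicing. A minimal sketch, assuming the URL shape shown above (the helper name `original_url` is mine):

```python
# Hypothetical helper: recover the original image URL from an Omnivore proxy
# URL of the assumed form
#   https://proxy-prod.omnivore-image-cache.app/{parameters,identifier}/https://{original URL}
def original_url(proxy_url: str) -> str:
    prefix = "https://proxy-prod.omnivore-image-cache.app/"
    if not proxy_url.startswith(prefix):
        return proxy_url  # not proxied; return unchanged
    rest = proxy_url[len(prefix):]
    # The original URL begins at the first "/http" after the parameter segment
    idx = rest.find("/http")
    return rest[idx + 1:] if idx != -1 else proxy_url
```

So even after the proxy goes offline, the original can be extracted, provided the source site still serves the image.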

Of course, a Web Archive is a single point collecting all resources – I can see why that is attractive.

But if anyone reading likes the markdown output from the Obsidian plugin, let me point this out: DEVONthink has an easy way to download online images referenced in markdown files. Select one or multiple markdown documents in the item list, and go to Tools > Import Online Markdown Images. This works with the proxy images now, and should work until they shut down the servers.

I recall several threads in the forum about keeping markdown documents and their locally referenced images together for portability. If you want this, you can do what I did:

  • Index or Import the folder in your Obsidian vault where you store the articles from Omnivore.
    • I already have the folder Indexed. As a failsafe I wanted to work on copies. So I created a group named “Omnivore Import” in the Global Inbox.
    • I then went to my indexed folder, selected everything (folders + documents), right-clicked, and chose Duplicate To > Global Inbox > Omnivore Import
  • Run a search for kind:markdown isolated to the “Omnivore Import” group.
  • Select all results, and run a Tools > Batch Process… with the File action using the Name placeholder. This will put every individual document in its own group with the same name as the document. (Add other stuff if you want, maybe a tag)
  • In DEVONthink’s settings, under Files > Markdown I set “Image Reference” to Automatic, because I want relative links, not x-devonthink-item links. Useful if you want it to render nicely outside of DEVONthink.
    • If you want to customise the name of the group that will contain the images, you can do so in the field above. The default is “Assets”.
  • If you closed the previous search, repeat it. Select all markdown documents, and choose Tools > Import Online Markdown Images. Depending on the number of markdown documents, this might take a while. (And, depending on the images, considerable space.)

Now every downloaded Omnivore article will live in its own group, as a sibling to a group containing its images. Nice and portable.
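An alternative to downloading the proxied copies: since each proxy link embeds the original URL, you could also rewrite the markdown to point back at the originals before (or instead of) importing the images. A regex-based sketch, assuming the proxy URL shape quoted earlier in the thread (`unproxy_markdown` is a hypothetical name):

```python
import re

# Hypothetical sketch: rewrite Omnivore proxy image links in a markdown
# string back to their embedded original URLs, so the images keep working
# after the proxy goes offline -- provided the originals are still online.
# Assumes the parameter segment is a single path component.
PROXY = re.compile(
    r"https://proxy-prod\.omnivore-image-cache\.app/[^/\s)]+/(https?://[^\s)]+)"
)

def unproxy_markdown(md: str) -> str:
    return PROXY.sub(r"\1", md)
```

You could run this over the exported files before importing them, though of course the originals are only as durable as the sites that host them.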

Kind of. If the collection includes JavaScript that in turn loads other resources (CSS, HTML, images, fonts), those resources are not part of the archive. Similarly, if CSS references URLs (think background images, fonts), the resources behind these URLs are not downloaded into the archive. And I have no idea about @imports in CSS.

It’s half-hearted at best.
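To illustrate one of those gaps: URLs referenced inside a stylesheet (background images, fonts, url()-style @import targets) won’t be found by walking the DOM alone. A naive scan for them might look like this sketch (regex-based, so it misses edge cases such as escaped quotes):

```python
import re

# Sketch: find url(...) references inside CSS text -- background images,
# fonts, and url()-form @import targets that a DOM-only snapshot would miss.
CSS_URL = re.compile(r"""url\(\s*['"]?([^'")\s]+)['"]?\s*\)""")

def css_resource_urls(css: str) -> list[str]:
    return CSS_URL.findall(css)
```

And even this only covers static stylesheets; anything a script fetches at runtime is invisible to any static scan.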


Thanks for correcting me/adding details. I wondered if you would jump in when I wrote that. I assumed – perhaps wrongly – that in this particular case, there are no external resources to load, since @Rino chose Web Archive as the solution.

To be honest, the Web Archive format still mystifies me a bit. I’m aware of the complications you describe. I even called it “a single point collecting…” because I’m not sure if “a single file” is technically correct. Perhaps looking at your Exploding Web Archive script & post will clear up some things for me.

I mainly wanted to note that if images were the problem, DEVONthink can easily download them :wink:

I don’t know anything about Omnivore, so it may well be that the issues I mentioned don’t arise with it and that webarchive is the ideal format to get contents from there to DT.

Today, the idea of capturing the content of an HTML document completely so that it can be “replayed” later from the local machine seems far-fetched.

  1. Every current browser will refuse to load most resources from the local file system (Content Security Policy). One could overcome that by running a web server locally. Nice.
  2. One would need to not only dump the DOM in its current state (possible), but also all resources:
    • images referred to directly with img or picture elements, indirectly in CSS or in inline styles in any HTML element or loaded by JavaScript;
    • fonts (loaded in the CSS, in inline style, by JavaScript);
    • scripts (embedded, external, or loaded by other embedded or external scripts);
    • I probably forgot something

It’s (IMO) not possible to save every possible HTML document nowadays so that it can be “replayed” truthfully locally (and only using the local resources). It might be possible to get close to it, and for some requirements close enough. For a real “archive” in the sense of “snapshot in time that doesn’t and can’t change”, PDF is better.

Yes, I have already read this, where you talk about the problems with capturing modern dynamic web pages :slight_smile:

What I meant was that I wasn’t sure what the proper mental model is for the .webarchive format itself. Is it a single file? At this point my impression was “No”. Well, is it really a collection of other files, a bundle/package like .dtBase2 or .app? But I can’t right-click and choose “Show Package Contents”.

I had a recollection that I have previously opened them in a text editor, but last time I tried, both CotEditor and Sublime Text had some trouble interpreting the contents, and I couldn’t figure out the correct settings.

I just now tried with BBEdit, and that must be what I used previously. It has no problem showing the contents, without me doing anything. Here I see a .plist XML document, which clears things up for me. A single file:

Web Archive .plist structure
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>WebMainResource</key>
	<dict>
		<key>WebResourceData</key>
		<data>
		...
		</data>
		<key>WebResourceFrameName</key>
		<string></string>
		<key>WebResourceMIMEType</key>
		<string>text/html</string>
		<key>WebResourceTextEncodingName</key>
		<string>UTF-8</string>
		<key>WebResourceURL</key>
		<string>https://...</string>
	</dict>
	<key>WebSubresources</key>
	<array>
		<dict>
			<key>WebResourceData</key>
			<data>
			...
			</data>
			<key>WebResourceMIMEType</key>
			<string>image/gif</string>
			<key>WebResourceResponse</key>
			<data>
			</data>
			<key>WebResourceURL</key>
			<string>https://...</string>
		</dict>
		<dict>
			...
		</dict>
	</array>
</dict>
</plist>
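Since a .webarchive is just a property list, Python’s standard plistlib can open it. A small sketch that lists the URLs of the main resource and every packaged subresource (the function name is mine), assuming the structure shown above:

```python
import plistlib

# Sketch: a .webarchive is a property list (XML or binary), so plistlib can
# read it directly. List the URL of the main resource plus all packaged
# subresources.
def list_webarchive(path: str) -> list[str]:
    with open(path, "rb") as f:
        archive = plistlib.load(f)
    urls = [archive["WebMainResource"]["WebResourceURL"]]
    for sub in archive.get("WebSubresources", []):
        urls.append(sub["WebResourceURL"])
    return urls
```

This should work on any webarchive that follows the layout above, regardless of whether it was saved as XML or in Apple’s binary plist encoding, since plistlib detects the format itself.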

Re: Omnivore. Omnivore is a Read-It-Later service. Think an advanced version of Instapaper. I mainly used it for the Obsidian plugin, as an alternative to DEVONthink’s web clipper, but I also found the web app nice to read in. I think for many it is an alternative to Readwise. A central hub for your reading, with an API to integrate it with other applications and services – for getting content in, and exporting/syncing the highlights and notes.

(The iOS app also had a pretty amazing text-to-speech engine. They are shutting down to join the company that developed it, ElevenLabs.)

For the web content you feed it, the whole point is to remove all the cruft and just get the raw text + images and some metadata (author, date etc.). That’s why I assumed there is no problem with external content, as I couldn’t think of any besides images.