DT4 - privacy when using AI

I tried to find some info about this in the documentation but failed, I assume it’s there but I’m missing it.

One thing I really like with DT is the privacy and that I keep full control of where my data is stored. I use DT for some work related documents that should stay local, so I’m a bit concerned with using external tools like various LLMs etc. What happens with the documents if I start using an external tool like ChatGPT/Claude/whatever, is my data sent to these tools? (I assume so)

1 Like

I’d assume that is necessary if you want the tools to analyze the data.

1 Like

You could install and run a model locally via Ollama or LM Studio or whatever, and limit it to your database(s).

1 Like

And accepting the limitations of that setup as well.

1 Like

I’ve played around with a few different options over the last few days. Getting LM Studio set up really wasn’t too difficult. LM Studio is one of the options DEVONthink supports as a local “model garden”: you install the environment and from there download open-source models. You then configure DT4 to call LM Studio, and it will return the results of your query. Everything stays on your machine; nothing is sent elsewhere.
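
To make the setup above concrete: LM Studio serves an OpenAI-compatible API on localhost, which is what DT4 talks to. Here’s a minimal sketch of the kind of request that flows over that local connection — the endpoint is LM Studio’s default, but the model name is just a placeholder for whatever you’ve loaded:

```python
import json

# LM Studio exposes an OpenAI-compatible API on localhost (default port 1234),
# so requests never leave the machine. The model name below is a placeholder;
# use whatever model you have actually loaded in LM Studio.
ENDPOINT = "http://localhost:1234/v1/chat/completions"

def build_request(prompt: str, model: str = "qwen2.5-7b-instruct") -> dict:
    # Standard OpenAI-style chat-completions payload.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # keep answers focused for summarisation tasks
    }

payload = build_request("Summarise the selected notes in three bullet points.")
print(json.dumps(payload, indent=2))
# To actually send it (LM Studio must be running with a model loaded):
#   curl -s http://localhost:1234/v1/chat/completions \
#        -H "Content-Type: application/json" -d '<the JSON above>'
```

Ollama works the same way, just on a different port, which is why DT4 can treat either as a drop-in local backend.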

The challenge with running a local model like this is speed and complexity. Depending on your machine’s specs, you’ll only be able to run small models, and they’ll probably run quite slowly. The nice thing about LM Studio is that it will advise you, based on your machine’s specs, which models you can use. I’m using a MacBook Air M3 with 24GB of memory and the experience isn’t great.

You could get a free API key from Google and use the Gemini models. I can’t remember how many free tokens they give you, but if you just want to try things out, that is an option. As it’s free, I wouldn’t send anything you didn’t want posted on the internet (or used to train their models). I’ve also set up an Anthropic API key. They require you to buy $5 in credits first, and then your usage consumes from your balance. I believe all the major players state that they will not use data or interactions from these paid APIs for training, but it’s worth checking.

If you’re 100% sure your materials can never leave your machine, I believe the local LLMs are your only choice.

By default, generative AI (via the chat assistant, batch processing, smart rules, or scripts) uses only the selected documents. Optionally, the chat assistant can use a database search (see Settings > AI > Chat), but that search is also limited to the current selection in the item list or, if there is none, in the sidebar.

But DEVONthink never sends your original documents:

  1. Image files are scaled, recompressed, and sent without the original metadata. PDF documents without a text layer are handled likewise; thumbnails of the first n pages (depending on the model) are used.
  2. For text-based documents, only the raw text or excerpts of it are used, again without metadata.
  3. For audio/video files, the transcription (if available) and n video still images (if supported by the chat model) might be used.
  4. Transcribing audio/video files extracts and recompresses the audio track and sends only this to Whisper (if selected in Settings > AI > Transcription).
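
The text-only case (point 2) boils down to a simple principle: the model gets plain text trimmed to a budget, never the file or its metadata. A toy sketch of that idea — the character budget is made up for illustration; the real limits depend on the chosen model’s context window:

```python
def text_excerpt(raw_text: str, budget_chars: int = 8000) -> str:
    """Illustrative only: send plain text trimmed to a budget, never the
    original file or its metadata. The 8000-character default is invented
    for this sketch, not DEVONthink's actual limit."""
    text = raw_text.strip()
    if len(text) <= budget_chars:
        return text
    # Trim to the budget and mark the truncation point.
    return text[:budget_chars].rstrip() + " […]"

print(text_excerpt("  A short note.  "))  # → A short note.
```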

Furthermore, DEVONthink anonymizes links (including email addresses), both to improve privacy and to reduce the likelihood that the response will include invalid links, as LLMs don’t handle things like UUIDs or session identifiers well. This also saves tokens.
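
As a rough sketch of that anonymization idea (not DEVONthink’s actual implementation), links and addresses can be swapped for numbered placeholders before the text reaches the model, and swapped back in the response:

```python
import re

def anonymize_links(text: str):
    """Toy illustration: replace URLs and email addresses with numbered
    placeholders so session IDs, UUIDs, etc. never reach the model; the
    returned map lets you restore the real links in the response."""
    mapping = {}

    def repl(match):
        placeholder = f"[link-{len(mapping) + 1}]"
        mapping[placeholder] = match.group(0)
        return placeholder

    # Deliberately crude patterns, for illustration only.
    pattern = r"https?://\S+|[\w.+-]+@[\w-]+\.[\w.]+"
    return re.sub(pattern, repl, text), mapping

clean, links = anonymize_links(
    "See https://example.com/doc?session=abc123 or mail bob@example.com"
)
print(clean)  # → See [link-1] or mail [link-2]
```

The side benefit mentioned above falls out naturally: a short placeholder costs far fewer tokens than a long URL full of identifiers.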

And finally, in the case of commercial models supporting tool calls (currently all except Perplexity and Gemini 2.0 Flash Thinking), data is only sent on demand, not in advance.

2 Likes

Thanks. As I’m using a 5-6 year old Intel machine with 16GB of memory, I think it’s safe to say my experience would be horrible. I think I will continue to use my own limited intelligence plus whatever DT has built in.

3 Likes

Thanks for the details, that makes my decision much easier to make.

2 Likes

You know, human intelligence still has a lot going for it … 🙂 Good luck!

4 Likes

With DT4 and its new AI-assisted features, I like to think of DT’s traditional search as “semantic search”, in that it can derive meaning from metadata in combination with content added by the user (or, in many cases, automatically).

LLM-assisted features are a different layer again, where thoughtful use can serve both as a database-management utility and as a secondary layer of insight and inquiry.

If you’re using a premium service like Kagi, it provides an API key that can be used to access all the main LLM tools, including the latest reasoning models such as Gemini 2.5 Pro, Claude 3.7 (with extended thinking), and GPT-4o — and of course the faster, lighter chat-style models from all the main players.

The secondary advantage of using a private search engine/service like Kagi is that it adds a layer of privacy on top of the individual LLM providers: Kagi’s privacy practices combined with DT’s give extra confidence when using LLMs. But unless you’re paying for premium API-based LLM services, you need to be aware of each particular provider’s privacy promise. If you use a single premium API provider, e.g. Anthropic, OpenAI, or Google, they offer pretty decent privacy commitments; if you only use free services, you’ll find the privacy promise less satisfactory.

Interesting about kagi, I hadn’t heard of them before. So is it possible to use the API key generated by them directly in DT4? I was under the impression that you could only use keys that were generated by OpenAI, Anthropic, etc. Thanks.

That’s not possible.

I suspected not, thanks for confirming!

To be clear: the API key Kagi provides uses a mix of AI models to power summarisation.

Where Kagi provides real AI value is in what they call their AI Assistant. This is a powerful desk-research tool in the sense that it lets you query multiple AI services within a single research session, with those queries informed by real-time search results from Kagi’s excellent search algorithm. These are the current models you can use interchangeably (apologies for the forum-enforced image resize; view it here instead - Screenshot 2025-04-17 at 07.30.29.png - Droplr):

1 Like

Good to know. Thanks for the additional information!