DT4 - privacy when using AI

I tried to find some info about this in the documentation but failed, I assume it’s there but I’m missing it.

One thing I really like with DT is the privacy and that I keep full control of where my data is stored. I use DT for some work related documents that should stay local, so I’m a bit concerned with using external tools like various LLMs etc. What happens with the documents if I start using an external tool like ChatGPT/Claude/whatever, is my data sent to these tools? (I assume so)

1 Like

I’d assume that is necessary if you want the tools to analyze the data.

1 Like

You could install and run a model locally via Ollama or LM Studio or whatever, and limit it to your database(s).

1 Like

And accepting the limitations of that setup as well.

1 Like

I’ve played around with a few different options over the last few days. Getting LM Studio set up really wasn’t too difficult. LM Studio is one of the options DEVONthink supports as a local “model garden”: you install the environment and from there download open-source models. You then configure DT4 to call LM Studio, and it will return the results of your query. Everything stays on your machine; nothing is sent elsewhere.
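
To make the setup above concrete: LM Studio serves an OpenAI-compatible API on localhost, which is what DT4 talks to. Here’s a minimal sketch of the kind of request that flows over that local connection — the endpoint is LM Studio’s default, but the model name is just a placeholder for whatever you’ve loaded:

```python
import json

# LM Studio exposes an OpenAI-compatible API on localhost (default port 1234),
# so requests never leave the machine. The model name below is a placeholder;
# use whatever model you have actually loaded in LM Studio.
ENDPOINT = "http://localhost:1234/v1/chat/completions"

def build_request(prompt: str, model: str = "qwen2.5-7b-instruct") -> dict:
    # Standard OpenAI-style chat-completions payload.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # keep answers focused for summarisation tasks
    }

payload = build_request("Summarise the selected notes in three bullet points.")
print(json.dumps(payload, indent=2))
# To actually send it (LM Studio must be running with a model loaded):
#   curl -s http://localhost:1234/v1/chat/completions \
#        -H "Content-Type: application/json" -d '<the JSON above>'
```

Ollama works the same way, just on a different port, which is why DT4 can treat either as a drop-in local backend.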

The challenge with running a local model like this is speed and complexity. Depending on your machine’s specs, you’ll only be able to run small models, and they’ll probably run quite slowly. The nice thing about LM Studio is that it will advise you, based on your machine’s specs, which models you can use. I’m using a MacBook Air M3 with 24GB of memory and the experience isn’t great.

You could get a free API key from Google and use the Gemini models. I can’t remember how many free tokens they give you, but if you just want to try things out, that is an option. As it’s free, I wouldn’t send anything you didn’t want posted on the internet (or used to train their models). I’ve also set up an Anthropic API key. They require you to buy $5 in credits first, and then your usage consumes from your balance. I believe all the major players state that they will not use data or interactions from these paid APIs for training, but it’s worth checking.

If you’re 100% sure your materials can never leave your machine, I believe the local LLMs are your only choice.

By default, generative AI (via the chat assistant, batch processing, smart rules, or scripts) uses only the selected documents. Optionally, the chat assistant can use a database search (see Settings > AI > Chat), but that search is also limited to the current selection in the item list or, if there is none, in the sidebar.

But DEVONthink never sends your original documents:

  1. Image files are scaled, recompressed, and sent without the original metadata. PDF documents without a text layer are handled likewise; thumbnails of the first n pages (depending on the model) are used.
  2. For text-based documents, only the raw text or excerpts of it are used, again without metadata.
  3. For audio/video files, the transcription (if available) and n video still images (if supported by the chat model) might be used.
  4. Transcribing audio/video files extracts and recompresses the audio track and sends only this to Whisper (if selected in Settings > AI > Transcription).
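
The text-only case (point 2) boils down to a simple principle: the model gets plain text trimmed to a budget, never the file or its metadata. A toy sketch of that idea — the character budget is made up for illustration; the real limits depend on the chosen model’s context window:

```python
def text_excerpt(raw_text: str, budget_chars: int = 8000) -> str:
    """Illustrative only: send plain text trimmed to a budget, never the
    original file or its metadata. The 8000-character default is invented
    for this sketch, not DEVONthink's actual limit."""
    text = raw_text.strip()
    if len(text) <= budget_chars:
        return text
    # Trim to the budget and mark the truncation point.
    return text[:budget_chars].rstrip() + " […]"

print(text_excerpt("  A short note.  "))  # → A short note.
```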

Furthermore, DEVONthink anonymizes links (including email addresses), both to improve privacy and to reduce the likelihood that the response will include invalid links, as LLMs don’t handle things like UUIDs or session identifiers well. This also saves tokens.
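
As a rough sketch of that anonymization idea (not DEVONthink’s actual implementation), links and addresses can be swapped for numbered placeholders before the text reaches the model, and swapped back in the response:

```python
import re

def anonymize_links(text: str):
    """Toy illustration: replace URLs and email addresses with numbered
    placeholders so session IDs, UUIDs, etc. never reach the model; the
    returned map lets you restore the real links in the response."""
    mapping = {}

    def repl(match):
        placeholder = f"[link-{len(mapping) + 1}]"
        mapping[placeholder] = match.group(0)
        return placeholder

    # Deliberately crude patterns, for illustration only.
    pattern = r"https?://\S+|[\w.+-]+@[\w-]+\.[\w.]+"
    return re.sub(pattern, repl, text), mapping

clean, links = anonymize_links(
    "See https://example.com/doc?session=abc123 or mail bob@example.com"
)
print(clean)  # → See [link-1] or mail [link-2]
```

The side benefit mentioned above falls out naturally: a short placeholder costs far fewer tokens than a long URL full of identifiers.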

And finally, in the case of commercial models supporting tool calls (currently all except Perplexity and Gemini 2.0 Flash Thinking), data is only sent on demand, not in advance.

2 Likes

Thanks. As I’m using a 5-6 year old Intel machine with 16GB of memory, I think it’s safe to say my experience would be horrible. I think I will continue to use my own limited intelligence plus whatever DT has built in.

3 Likes

Thanks for the details, that makes my decision much easier to make.

2 Likes

You know, human intelligence still has a lot going for it … 🙂 Good luck!

4 Likes

With DT4 and its new AI-assisted features, I like to think of DT’s traditional search as “semantic search”, in that it can derive meaning from metadata in combination with content added by the user (or, in many cases, automatically).

LLM-assisted features are a different layer again, where thoughtful use can serve both as a database-management utility and as a secondary layer of insight and inquiry.

If you’re using a premium service like Kagi, it provides an API key that can be used to access all the main LLM tools, including the latest reasoning models such as Gemini 2.5 Pro, Claude 3.7 (with extended thinking), and GPT-4o — and of course the faster, lighter chat-style models from all the main players.

The secondary advantage of using a private search engine/service like Kagi is that it adds a layer of privacy on top of the individual LLM providers: Kagi’s privacy practices combined with DT’s give extra confidence when using LLMs. But unless you’re paying for premium API-based LLM services, you need to be aware of each particular provider’s privacy promise. If you use a single premium API provider, e.g. Anthropic, OpenAI, or Google, they offer pretty decent privacy commitments; if you only use free services, you’ll find the privacy promise less satisfactory.

Interesting about kagi, I hadn’t heard of them before. So is it possible to use the API key generated by them directly in DT4? I was under the impression that you could only use keys that were generated by OpenAI, Anthropic, etc. Thanks.

That’s not possible.

I suspected not, thanks for confirming!

To be clear: the API key Kagi provides uses a mix of AI models to power summarisation.

Where Kagi provides real AI value is in what they call their AI Assistant. This is a powerful desk-research tool in the sense that it lets you query multiple AI services within a single research session, with those queries informed by real-time search results from Kagi’s excellent search algorithm. These are the current models you can use interchangeably (apologies for the forum-enforced image resize; view it here instead - Screenshot 2025-04-17 at 07.30.29.png - Droplr):

1 Like

Good to know. Thanks for the additional information!