Feature Request: Support for Anthropic Prompt Caching in Claude AI Integration

Feature Request: Support for Anthropic Prompt Caching in Claude AI Integration

I recently received an email from Anthropic flagging that my prompt cache hit rate is low, and that enabling prompt caching could reduce my API spend by up to 23%.

The fix is straightforward on the API level — you simply add a cache_control field to the request. However, since DEVONthink constructs the API request internally, I have no way to set this myself.

What prompt caching does:
When the same content (typically a large system prompt) is sent repeatedly, Anthropic can cache it server-side. Subsequent requests that include the same prefix pay only 10% of the normal input token price for the cached portion — a 90% reduction on that part of the request.

What it would take to support it:
The change is minimal. Adding a single top-level field to the API request body enables automatic caching:

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "cache_control": {"type": "ephemeral"},
  "system": "...",
  "messages": [...]
}

This is Anthropic’s “automatic caching” mode — no per-block changes needed, the system handles breakpoints automatically.

Alternatively, DEVONthink could allow users to opt in via a checkbox in the AI preferences.

Why it matters:
For users who work with long system prompts — detailed AI personas, large instruction sets, extensive context — and send multiple requests per session, the savings can be significant. Anthropic themselves are now actively notifying users that they are leaving money on the table.

Full documentation: Prompt caching - Claude Platform Docs

Would love to know if this is something the DEVONthink team could consider for an upcoming release.

Ole :denmark:

DEVONthink already uses prompt caching in some scenarios. In many use cases it’s not used and not necessary (e.g. bulk tagging) as the prompt is too small and the content different for each item. Excessive prompt caching could even increase the API costs in the worst case.

some of the work I am doing in regards to GDPR, includes large PDF often 100+ pages, and could be 5-10 PDF.

I assume the reason Anthropic send me the message, where so I could save money :face_with_monocle:

The way I understand the Devonthink API call is that it will send the same data over and over again when I have several long prompts?

Message from Anthropic.

“Your prompt cache hit rate is low - Caching repeated content like system prompts could save Ole‘s Individual Org up to 23% of its API spend”.

Learn how to set up prompt caching in the guide below.

Read the prompt caching guide
— The Anthropic team

Byt maybe I am wrong.

Please provide more details. However, Anthropic’s message is not really helpful as it’s just specific to your case but we have to keep average usage scenarios in mind - caching can both save and waste money!

Using cheaper models for easier tasks, clearing the chat as soon as possible instead of letting it endlessly grow and using only relevant context (e.g. a chapter of a PDF document instead of the complete document) are usually the best options to really save money.

Let me try to explain.

I need to make an overview where I have 10 documents of various sizes. I make a prompt as specific as possible using Claude Sonnet 4.6, and then wait about 5–10 minutes — sometimes much longer — before I get a response, normally as a Markdown document. I then often have to add to or discuss further, so a session of 5–10 exchanges can easily run 1–2 hours.

The way I understand the process is that every time I add to the chat, all 10 documents are re-submitted. Since I need the full context for this specific session, I assumed Claude would use the cache so the documents were not re-submitted in full each time.

Hope that makes sense. I have no idea if other DEVONthink users work with AI in the same way.

I am very happy using DEVONthink with the Claude API since I have zero data retention, which makes my GDPR work easier. The fact that DEVONthink saves the output into the relevant folders for documentation is also a real benefit.

As a Danish/EU-based business operating under GDPR, being able to use a US-based AI provider like Anthropic with zero data retention is essential — it is what makes the workflow legally viable for me in the first place. I suspect there are other EU-based professionals in a similar situation who would value this combination of DEVONthink and a ZDR-enabled AI provider.

So DEVONthink is definitely my go-to app for GDPR client work.

Looking forward to your reply.

Happy sunny sunday :denmark::sun_with_face:

Ole

Clarify what “size” means, e.g., 100MB or 500 pages.

Are the original documents necessary for this prompt and the complete session? Or might e.g. a summary of each document work too? This could save both a lot of tokens and time.

Speaking of Claude and sorry for digressing the topic. Any option to use the Claude subscription model inside DEVONthink too? Not the api model? There are some apps where they did an option to use it within their policy..

No complain on the MCP side , it works well

Legally? Because this spring Anthropic cut off third-party tools like OpenClaw from using subscription credentials and revised their terms of service accordingly.

Yes. There is this very noisy guy on Twitter building a codex /open code type of app with t3 codes and he did some video mentioning the Claude policy with 3rd party apps. The way he implemented it is legal, but might be only temporary since Anthropic is reviewing third party usage policy again..

The most reliable and official option is definitely MCP.

Not large maybe a total of 10 to 50 Mb.

The total document is necessary for the purpose also to make sure the answer is based on the full content.

But how many pages? The amount of textual content in a document matters.

I would really like to see an option in AI config to send the entire content of documents to AI vs let AI optimize that process.

That would be a really good option… until it’s not. :wink:

This could easily lead to longer than necessary processing and token expenditures. Just as it is with OCR, not every document needs it so making it preferential isn’t always the best option.

That’s why I am suggesting make it a prefrence - let the user decide.

There are use cases when it is indeed essential for the AI to have the complete text.

That said - this is a great example why the MCP server is such a nice feature. In order for AI to have access to my entire document, I used to have to export files and manually upload them elsewhere. Now I can use the MCP server instead - in that case the level of detail given to the AI is set by the AI tools I use, which I can change as the situation warrants.

1 Like

Making it a preference does not solve the issue I just stated. A preference says "Always do this., not “Do this when…”. And always doing something is not globally always needed or wanted.

The better option would be to state full document ingestion in a prompt or skill for a specific task.

OK I agree that would work well.

Can I do this in DT4 now and thus override the default which only uploads limited text?

If so how do I do that? That would be quite useful to know.

I don’t believe so. That would be a question for @cgrunenberg.