Feature Request: Support for Anthropic Prompt Caching in Claude AI Integration

oschmitto · June 26, 2026, 11:19am

Feature Request: Support for Anthropic Prompt Caching in Claude AI Integration

I recently received an email from Anthropic flagging that my prompt cache hit rate is low, and that enabling prompt caching could reduce my API spend by up to 23%.

The fix is straightforward on the API level — you simply add a cache_control field to the request. However, since DEVONthink constructs the API request internally, I have no way to set this myself.

What prompt caching does:
When the same content (typically a large system prompt) is sent repeatedly, Anthropic can cache it server-side. Subsequent requests that include the same prefix pay only 10% of the normal input token price for the cached portion — a 90% reduction on that part of the request.

What it would take to support it:
The change is minimal. Adding a single top-level field to the API request body enables automatic caching:

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "cache_control": {"type": "ephemeral"},
  "system": "...",
  "messages": [...]
}

This is Anthropic’s “automatic caching” mode — no per-block changes needed, the system handles breakpoints automatically.

Alternatively, DEVONthink could allow users to opt in via a checkbox in the AI preferences.

Why it matters:
For users who work with long system prompts — detailed AI personas, large instruction sets, extensive context — and send multiple requests per session, the savings can be significant. Anthropic themselves are now actively notifying users that they are leaving money on the table.

Full documentation: Prompt caching - Claude Platform Docs

Would love to know if this is something the DEVONthink team could consider for an upcoming release.

Ole

cgrunenberg · June 26, 2026, 12:06pm

DEVONthink already uses prompt caching in some scenarios. In many use cases it’s not used and not necessary (e.g. bulk tagging) as the prompt is too small and the content different for each item. Excessive prompt caching could even increase the API costs in the worst case.

oschmitto · June 26, 2026, 12:32pm

some of the work I am doing in regards to GDPR, includes large PDF often 100+ pages, and could be 5-10 PDF.

I assume the reason Anthropic send me the message, where so I could save money

The way I understand the Devonthink API call is that it will send the same data over and over again when I have several long prompts?

Message from Anthropic.

“Your prompt cache hit rate is low - Caching repeated content like system prompts could save Ole‘s Individual Org up to 23% of its API spend”.

Learn how to set up prompt caching in the guide below.

Read the prompt caching guide
— The Anthropic team

Byt maybe I am wrong.

cgrunenberg · June 26, 2026, 12:49pm

Please provide more details. However, Anthropic’s message is not really helpful as it’s just specific to your case but we have to keep average usage scenarios in mind - caching can both save and waste money!

Using cheaper models for easier tasks, clearing the chat as soon as possible instead of letting it endlessly grow and using only relevant context (e.g. a chapter of a PDF document instead of the complete document) are usually the best options to really save money.

oschmitto · June 28, 2026, 2:01pm

Let me try to explain.

I need to make an overview where I have 10 documents of various sizes. I make a prompt as specific as possible using Claude Sonnet 4.6, and then wait about 5–10 minutes — sometimes much longer — before I get a response, normally as a Markdown document. I then often have to add to or discuss further, so a session of 5–10 exchanges can easily run 1–2 hours.

The way I understand the process is that every time I add to the chat, all 10 documents are re-submitted. Since I need the full context for this specific session, I assumed Claude would use the cache so the documents were not re-submitted in full each time.

Hope that makes sense. I have no idea if other DEVONthink users work with AI in the same way.

I am very happy using DEVONthink with the Claude API since I have zero data retention, which makes my GDPR work easier. The fact that DEVONthink saves the output into the relevant folders for documentation is also a real benefit.

As a Danish/EU-based business operating under GDPR, being able to use a US-based AI provider like Anthropic with zero data retention is essential — it is what makes the workflow legally viable for me in the first place. I suspect there are other EU-based professionals in a similar situation who would value this combination of DEVONthink and a ZDR-enabled AI provider.

So DEVONthink is definitely my go-to app for GDPR client work.

Looking forward to your reply.

Happy sunny sunday

Ole

BLUEFROG · June 28, 2026, 4:03pm

Clarify what “size” means, e.g., 100MB or 500 pages.

cgrunenberg · June 29, 2026, 7:22am

Are the original documents necessary for this prompt and the complete session? Or might e.g. a summary of each document work too? This could save both a lot of tokens and time.

Nielle · June 29, 2026, 7:36am

Speaking of Claude and sorry for digressing the topic. Any option to use the Claude subscription model inside DEVONthink too? Not the api model? There are some apps where they did an option to use it within their policy..

No complain on the MCP side , it works well

cgrunenberg · June 29, 2026, 7:57am

Legally? Because this spring Anthropic cut off third-party tools like OpenClaw from using subscription credentials and revised their terms of service accordingly.

Nielle · June 29, 2026, 9:40am

Yes. There is this very noisy guy on Twitter building a codex /open code type of app with t3 codes and he did some video mentioning the Claude policy with 3rd party apps. The way he implemented it is legal, but might be only temporary since Anthropic is reviewing third party usage policy again..

cgrunenberg · June 29, 2026, 10:03am

The most reliable and official option is definitely MCP.

oschmitto · June 30, 2026, 8:48pm

Not large maybe a total of 10 to 50 Mb.

oschmitto · June 30, 2026, 8:50pm

The total document is necessary for the purpose also to make sure the answer is based on the full content.

BLUEFROG · July 1, 2026, 1:01am

But how many pages? The amount of textual content in a document matters.

rkaplan · July 1, 2026, 2:04am

I would really like to see an option in AI config to send the entire content of documents to AI vs let AI optimize that process.

BLUEFROG · July 1, 2026, 1:26pm

That would be a really good option… until it’s not.

This could easily lead to longer than necessary processing and token expenditures. Just as it is with OCR, not every document needs it so making it preferential isn’t always the best option.

rkaplan · July 1, 2026, 2:04pm

That’s why I am suggesting make it a prefrence - let the user decide.

There are use cases when it is indeed essential for the AI to have the complete text.

That said - this is a great example why the MCP server is such a nice feature. In order for AI to have access to my entire document, I used to have to export files and manually upload them elsewhere. Now I can use the MCP server instead - in that case the level of detail given to the AI is set by the AI tools I use, which I can change as the situation warrants.

BLUEFROG · July 1, 2026, 2:23pm

Making it a preference does not solve the issue I just stated. A preference says "Always do this., not “Do this when…”. And always doing something is not globally always needed or wanted.

The better option would be to state full document ingestion in a prompt or skill for a specific task.

rkaplan · July 1, 2026, 2:38pm

OK I agree that would work well.

Can I do this in DT4 now and thus override the default which only uploads limited text?

If so how do I do that? That would be quite useful to know.

BLUEFROG · July 1, 2026, 4:35pm

I don’t believe so. That would be a question for @cgrunenberg.