Feature Request: Support for Anthropic Prompt Caching in Claude AI Integration
I recently received an email from Anthropic flagging that my prompt cache hit rate is low, and that enabling prompt caching could reduce my API spend by up to 23%.
The fix is straightforward on the API level — you simply add a cache_control field to the request. However, since DEVONthink constructs the API request internally, I have no way to set this myself.
What prompt caching does:
When the same content (typically a large system prompt) is sent repeatedly, Anthropic can cache it server-side. Subsequent requests that include the same prefix pay only 10% of the normal input token price for the cached portion — a 90% reduction on that part of the request.
What it would take to support it:
The change is minimal. Adding a single top-level field to the API request body enables automatic caching:
This is Anthropic’s “automatic caching” mode — no per-block changes needed, the system handles breakpoints automatically.
Alternatively, DEVONthink could allow users to opt in via a checkbox in the AI preferences.
Why it matters:
For users who work with long system prompts — detailed AI personas, large instruction sets, extensive context — and send multiple requests per session, the savings can be significant. Anthropic themselves are now actively notifying users that they are leaving money on the table.
DEVONthink already uses prompt caching in some scenarios. In many use cases it’s not used and not necessary (e.g. bulk tagging) as the prompt is too small and the content different for each item. Excessive prompt caching could even increase the API costs in the worst case.
Please provide more details. However, Anthropic’s message is not really helpful as it’s just specific to your case but we have to keep average usage scenarios in mind - caching can both save and waste money!
Using cheaper models for easier tasks, clearing the chat as soon as possible instead of letting it endlessly grow and using only relevant context (e.g. a chapter of a PDF document instead of the complete document) are usually the best options to really save money.
I need to make an overview where I have 10 documents of various sizes. I make a prompt as specific as possible using Claude Sonnet 4.6, and then wait about 5–10 minutes — sometimes much longer — before I get a response, normally as a Markdown document. I then often have to add to or discuss further, so a session of 5–10 exchanges can easily run 1–2 hours.
The way I understand the process is that every time I add to the chat, all 10 documents are re-submitted. Since I need the full context for this specific session, I assumed Claude would use the cache so the documents were not re-submitted in full each time.
Hope that makes sense. I have no idea if other DEVONthink users work with AI in the same way.
I am very happy using DEVONthink with the Claude API since I have zero data retention, which makes my GDPR work easier. The fact that DEVONthink saves the output into the relevant folders for documentation is also a real benefit.
As a Danish/EU-based business operating under GDPR, being able to use a US-based AI provider like Anthropic with zero data retention is essential — it is what makes the workflow legally viable for me in the first place. I suspect there are other EU-based professionals in a similar situation who would value this combination of DEVONthink and a ZDR-enabled AI provider.
So DEVONthink is definitely my go-to app for GDPR client work.
Are the original documents necessary for this prompt and the complete session? Or might e.g. a summary of each document work too? This could save both a lot of tokens and time.
Speaking of Claude and sorry for digressing the topic. Any option to use the Claude subscription model inside DEVONthink too? Not the api model? There are some apps where they did an option to use it within their policy..
Legally? Because this spring Anthropic cut off third-party tools like OpenClaw from using subscription credentials and revised their terms of service accordingly.
Yes. There is this very noisy guy on Twitter building a codex /open code type of app with t3 codes and he did some video mentioning the Claude policy with 3rd party apps. The way he implemented it is legal, but might be only temporary since Anthropic is reviewing third party usage policy again..
That would be a really good option… until it’s not.
This could easily lead to longer than necessary processing and token expenditures. Just as it is with OCR, not every document needs it so making it preferential isn’t always the best option.
That’s why I am suggesting make it a prefrence - let the user decide.
There are use cases when it is indeed essential for the AI to have the complete text.
That said - this is a great example why the MCP server is such a nice feature. In order for AI to have access to my entire document, I used to have to export files and manually upload them elsewhere. Now I can use the MCP server instead - in that case the level of detail given to the AI is set by the AI tools I use, which I can change as the situation warrants.
Making it a preference does not solve the issue I just stated. A preference says "Always do this., not “Do this when…”. And always doing something is not globally always needed or wanted.
The better option would be to state full document ingestion in a prompt or skill for a specific task.