Claude Opus 4.7 Warning

I have been using Claude Cowork/Dispatch for about a week or so along with the open source DT MCP server. It has consistently been a game-changing upgrade in how I interact with DT and in the ease of obtaining detailed summaries of large document sets.

Two days ago on 4/16 this changed. The workflow would randomly stop, source links would be malformed or otherwise inoperative, reports were inconsistent in detail and format, and numerous other issues popped up. I would ask Claude to redo, it would “fix” an issue, and then more issues would emerge.

I then realized Opus 4.7 was introduced on the day I started having issues. So I asked Cowork to change the skill to specify Opus 4.6 rather than the most recent version. Problem fixed; the usual impeccable detail returned.

This is Claude’s own analysis - I think it is correct:

The 4.6 production skill works great. The 4.7 test skill keeps failing in different ways despite dozens of iterations. The question is why.

Here’s what I think is actually going on:

The core problem isn’t the skill — it’s the model behavior difference. Opus 4.6 and 4.7 process instructions differently. The production skill was refined over many iterations with 4.6, so its instructions are calibrated to how 4.6 interprets and follows them. When 4.7 reads the same instructions, it interprets them differently — more “helpfully,” more willing to synthesize, less mechanically literal.

Specific failure modes unique to 4.7:

  1. Citation fabrication — 4.7 composes plausible search phrases from clinical understanding instead of mechanically copying substrings. 4.6 follows the “copy verbatim” instruction literally.

  2. UUID hallucination — 4.7 constructs UUIDs that look right but aren’t verified. The v4b links that produce audible errors are likely malformed or wrong UUIDs.

  3. File transfer thrashing — 4.7 kept trying elaborate workarounds (base64, gzip, Python scripts) instead of the simple filesystem write_file path, burning hundreds of turns.

  4. Location/tagging non-compliance — despite explicit instructions, 4.7 skips or doesn’t verify post-import steps.
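The first two failure modes above can be caught after the fact with a mechanical check. What follows is only a minimal sketch, not part of the skill discussed here: the function names are hypothetical, and it assumes that item UUIDs use the standard 8-4-4-4-12 hexadecimal form and that the set of UUIDs actually returned by tool calls has been recorded somewhere.

```python
import re

# Assumption: item UUIDs in the report use the standard 8-4-4-4-12 hex layout.
UUID_RE = re.compile(
    r"[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}"
)


def _normalize(text: str) -> str:
    """Collapse runs of whitespace so line wraps do not break the comparison."""
    return " ".join(text.split())


def is_verbatim(quote: str, source_text: str) -> bool:
    """True only if the quote is a literal substring of the source text."""
    return _normalize(quote) in _normalize(source_text)


def unknown_uuids(report: str, known: set[str]) -> set[str]:
    """Return UUIDs that appear in the report but were never returned by a tool."""
    found = {u.upper() for u in UUID_RE.findall(report)}
    return found - {k.upper() for k in known}
```

Any quote for which `is_verbatim` returns `False`, and any UUID returned by `unknown_uuids`, would be a candidate for a composed rather than copied value.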

Why patches keep failing: Each fix addresses one symptom but 4.7 finds a new way to be “creative” somewhere else. The R-CITE-GENERATE-01 rules fixed one fabrication pattern, but 4.7 started fabricating UUIDs instead. The skill is now 108KB of increasingly desperate guardrails, and 4.7 still finds gaps.

The honest assessment: The 4.6 skill works because 4.6’s behavior matches the skill’s assumptions. Rather than continuing to bolt on rules trying to make 4.7 behave like 4.6, I think we need to either:

  1. Stay on 4.6 — it works, the skill is proven, and there’s no compelling reason to switch

  2. If we want 4.7, rebuild the skill from scratch for 4.7, testing each section iteratively against how 4.7 actually behaves — not porting assumptions from 4.6

Bottom line - I will indeed continue only with Opus 4.6. Then I will follow articles and online discussions about this issue. Someday 4.6 will be deprecated but that is certainly a while off.

I am not sure I would recommend that Devontech support Opus 4.7 until we understand more about why it interprets prompts so differently from 4.6. Or alternatively I would recommend keeping both 4.6 and 4.7 for a while.


It is fascinating to watch the developments. I wonder whether there will eventually be “languages” or “dialects” for communication with LLMs and their interpretation of our prompts. There are already thinking modes and we are probably going to see more new forms of communication, dedicated to speaking to an LLM. Thanks for sharing your experiments. I am still in the very basic use phase, replacing search engines, using it to replace looking up manuals and as a helper writing LaTeX and using TikZ. Exciting times :slight_smile:

Interesting. I noticed item links in a couple of my reports being messed up just the other day and was wondering what the hell was going on because it hadn’t happened before. Reading this, it may very well have been on the 4.7 release date.

Languages? For communicating with computers? Maybe we’ll call them “computer languages?” :roll_eyes:


I did not mean computer languages, although we have now the new communication between LLMs and the computer as well.

Programming languages are another thing as well. The LLMs can have their own “minds” and we are learning and developing new forms of communication specifically with LLMs.

I am sometimes amazed at how simple a prompt I can get away with, the chatbot being able to create the context from earlier conversations. But then, I also realize how sensitive the output can be, depending on the style and formulation of the prompt. And then, there are the differences between versions and what personality and interpretability the companies want their LLMs to pursue.

I find this interesting, what is happening now. We are developing machine learning algorithms, deep learning and generative approaches, to analyse data. So far, we stay away from LLMs (other than using them as assisting tools). Things are however developing rapidly and I would not have predicted what is happening now.

This shouldn’t be an issue in DEVONthink’s chat assistant as links (including email addresses and item links) are both simplified and anonymized for LLMs to improve privacy & reliability. And all tests of Opus 4.7 in DEVONthink have been successful so far (no broken links, no redundant or unnecessary tool calls).

Here is a discussion by Simon Willison about the changes in Claude’s prompt behaviour:

What does your prompt look like? It’s hard to tell whether it’s an issue of the prompt, the third-party MCP integration, of Claude Cowork (which is still a preview) or of the model. But I didn’t notice similar issues so far, even when using MCP:

Usually the more sophisticated a prompt is, the less likely things are going to break when using different or updated models.

The more I read about this issue in Anthropic’s documentation and elsewhere, the risk with Opus 4.7 appears to be fewer tool calls - not unnecessary ones.

In simplest form, Opus 4.7 appears to be intentionally more deterministic - it follows instructions more precisely and makes fewer assumptions. In essence, Opus is evolving to be more like a classic computer language and less like a natural language model as we have come to know them so far.

I am not sure which model I prefer - strict interpretation or loose interpretation of instructions. Probably each one has its place - or maybe there can be an equivalent of a “Temperature” setting regarding how strictly an LLM follows directions.

At present it appears clear that 4.7 behaves differently from 4.6 and most other LLMs in this regard. It’s hard for me to imagine DT4 or any other app will not be impacted by Opus 4.7’s more strict interpretation of prompting.

It is a complex Skill (100K length) - not a simple prompt. So yes it is consistent with your guess that this will be an issue only with more complex situations. I can share it with you privately.

If DT chooses to release an official MCP server then that is probably when it will become most notable.

Update: A new version of the Skill which is more explicit in terms of the procedures to be used and when to call tools works much better.

The new “4.7” version works with Opus 4.6 also - it works best with consistent data (arguably more detailed/precise than 4.6) but is not as good with one-off or edge cases.

The original 4.6 skill works only with Opus 4.6 and not with Opus 4.7; it will likely become my backup for the edge cases that Opus 4.7 cannot handle.

What exactly did you change? Any examples?

These were Claude’s suggestions to change it - I accepted the changes as a test and they worked well on the first try.

Here are my concrete recommendations for fixing the 4.7 MRS skill, based on what I found in Anthropic’s official docs (the Medium article was paywalled, but the official “What’s New” page had the key details):

1. Pin effort to xhigh. This is Anthropic’s explicit recommendation for agentic use cases. The docs say 4.7 makes “fewer tool calls by default, using reasoning more” and that “raising effort increases tool usage.” This directly explains why 4.7 tries to reason through file transfers and citations instead of just calling the tools. The skill should specify xhigh effort in its model configuration.

2. Replace suggestive language with explicit tool-call imperatives. 4.7 “will not silently generalize an instruction from one item to another.” The current skill says things like “use the DEVONthink MCP to retrieve content”; 4.7 needs “Call mcp__devonthink__get_record_content with the UUID. Do not construct UUIDs; only use UUIDs returned by prior tool calls.” Every step that involves a tool call should name the exact tool and its parameters.

3. Add explicit “DO NOT REASON, CALL THE TOOL” guardrails. Since 4.7 defaults to reasoning over tool use, critical steps like citation verification and file transfer need blunt directives: “Do not attempt to infer, reconstruct, or synthesize this value. Call the tool and use the returned value verbatim.”

4. Use task budgets. 4.7 introduces advisory token budgets that the model sees as a countdown. This could prevent the context exhaustion and thrashing we saw in v4. Set a generous but bounded budget for the full agentic loop.

5. Decompose the monolithic workflow into smaller explicit phases. 4.7 spawns “fewer subagents by default” but is “steerable through prompting.” The skill should explicitly define phases (extract → summarize → cite → verify → transfer) with clear entry/exit criteria, rather than relying on the model to self-organize a 100+ document workflow.

6. Lock down the citation pipeline specifically. The biggest failure was citation fabrication. The fix: after generating each citation, add a mandatory verification step that calls mcp__devonthink__get_record_by_identifier with the UUID and checks the result. If the tool returns an error, the citation is invalid. This needs to be an explicit loop, not a suggestion.

7. Remove redundant/conflicting rules. The current 4.7 skill has accumulated patch rules (R-CITE-GENERATE-01, R-CITE-VERIFY-01, R-AUTOTAG-01, etc.) that may actually be confusing 4.7’s more literal parser. Since 4.7 follows instructions more literally, having overlapping rules that say similar things in different ways could cause it to pick one interpretation over another. Consolidate into a single, clean instruction set.
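Recommendations 5 and 6 describe a control flow that can be sketched in ordinary code, which may help when deciding what the skill text has to spell out. This is an illustration under assumptions, not the actual skill: `run_phases` and `make_verify_step` are hypothetical names, and `lookup` is a stand-in for whatever performs the mcp__devonthink__get_record_by_identifier call, assumed here to raise `LookupError` when the tool reports an error.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
# (name, step, exit criterion): the criterion must hold before the next phase starts.
Phase = Tuple[str, Callable[[State], State], Callable[[State], bool]]


def run_phases(phases: List[Phase], state: State) -> State:
    """Run phases strictly in order and stop at the first unmet exit criterion."""
    for name, step, exit_ok in phases:
        state = step(state)
        if not exit_ok(state):
            raise RuntimeError(f"phase {name!r} did not meet its exit criterion")
    return state


def make_verify_step(lookup: Callable[[str], object]) -> Callable[[State], State]:
    """Build the 'verify' phase: look up every cited UUID, one by one.

    `lookup` is a hypothetical wrapper around the record lookup tool; it is
    assumed to raise LookupError for an unknown UUID.
    """
    def verify(state: State) -> State:
        invalid = []
        for uuid in state["cited_uuids"]:
            try:
                lookup(uuid)
            except LookupError:
                invalid.append(uuid)  # the tool failed, so the citation is invalid
        return {**state, "invalid_uuids": invalid}

    return verify
```

A verify phase would then be registered as `("verify", make_verify_step(lookup), lambda s: not s["invalid_uuids"])`, so the transfer phase cannot start while any cited UUID fails the lookup.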

Want me to go ahead and draft these changes into the experimental 4.7 skill? I’d keep it clearly marked as experimental, and the 4.6 production skill stays untouched as always.


If @rkaplan was up to it, I wouldn’t be averse to joining/watching a detailed webinar on how this is being deployed. Fascinating!

I input your response from Claude into Perplexity and left it set to “Best” (Perplexity allows one to choose between itself, Sonar, GPT-5.4, Gemini Pro 3.1 Pro, Claude Sonnet 4.6, and Nemotron 3 Super). I thought you might be interested, so here is what it said:

“That diagnosis is plausible in a general sense: newer model variants often differ in how literally they follow instructions, how aggressively they try to infer missing structure, and how likely they are to “helpfully” normalize or synthesize inputs instead of copying them exactly. If a skill or workflow was tuned around one model’s style, switching models can expose those mismatches.

What stands out in that quote is the pattern, not the specific version number:

A model that is more literal can be better for rigid, rule-bound extraction tasks.

A model that is more inferential can be better for drafting and synthesis, but it may drift when you need exact reproduction of citations, IDs, filenames, or structured handoffs.

If the workflow assumes one behavior and the model changes, you can see failures that look like “hallucination” or “thrashing,” when it is really a mismatch between instructions and model style.

For your use case here, that would explain why some feedback-generation turns felt cleaner than others: when the task is mostly judgment plus formatting, a more inferential model can be fine, but when the process depends on exact file handling, strict rubric mapping, or preserving requested structure, even a small behavior shift can make the output feel less stable. Perplexity’s own help pages also indicate that model selection can change within a thread, so a mid-conversation switch is a real possibility if the dropdown changed.

My practical read is:

The issue may not be “bad data” so much as a model-behavior mismatch with a workflow built around another Claude variant.

For rigid, repeatable tasks, sticking with the model that best matches the workflow is usually safer.

If you do switch models, the safest approach is to test a few representative tasks first and compare whether the model preserves the same structure and citation discipline.
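The comparison suggested in the last point can be partly automated. Here is a rough sketch, assuming the reports are Markdown with `#` headings and that sources are cited as `x-devonthink-item://` links; the function names are made up for illustration, and a real check would need to look at the content too, not just its shape.

```python
import re
from typing import Dict, List

# Assumption: sources are cited as DEVONthink item links with a 36-character UUID.
LINK_RE = re.compile(r"x-devonthink-item://[0-9A-Fa-f-]{36}")


def signature(report: str) -> Dict[str, object]:
    """Reduce a report to its heading outline and its number of item links."""
    headings = [
        line.strip() for line in report.splitlines() if line.lstrip().startswith("#")
    ]
    return {"headings": headings, "links": len(LINK_RE.findall(report))}


def differences(baseline: str, candidate: str) -> List[str]:
    """List structural differences between two reports on the same task."""
    a, b = signature(baseline), signature(candidate)
    found = []
    if a["headings"] != b["headings"]:
        found.append("heading structure differs")
    if a["links"] != b["links"]:
        found.append(f"item link count differs: {a['links']} vs {b['links']}")
    return found
```

Running the same few representative tasks through both models and passing each pair of reports to `differences` gives a quick first signal of whether structure and citation discipline are preserved.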

If you want, I can help you build a simple checklist for comparing model behavior on your feedback-generation workflow.”


OpenAI has now also published a guide on changes in prompt behavior for their latest model:

For the foreseeable future, we probably have to get used to adapting how we talk to LLMs.
