Anthropic Walks Back Claude Fable 5’s Invisible Safeguards

Anthropic says it will make Claude Fable 5’s safeguards on frontier AI development visible after researchers called the old approach sabotage.

Published June 11, 2026

Henry Fox

Anthropic is walking back a policy that quietly degraded Claude Fable 5 for researchers using it to build competing AI systems. The company said in a statement to WIRED that the safeguards will now be visible to users. It also admitted it “made the wrong trade-off” in keeping them hidden.

The reversal comes days after the launch of Claude Fable 5, the first Mythos-class model, and follows criticism that the hidden throttling amounted to “secret sabotage” of independent AI research. The underlying restrictions stay in place: the model is still throttled in some cases, but users will now be told when that happens.

What Anthropic’s Invisible Safeguard Actually Did

Earlier this week, Anthropic released Claude Fable 5, a version of its most powerful model wrapped in additional safety guardrails. Some of those guardrails were openly disclosed: the company said it would reroute users asking about cybersecurity, biology, or chemistry to a less capable model.

The cyber, bio, and chem reroutes were framed as a way to reduce the chance the system gets used to mount a cyberattack or build a bioweapon. For researchers using Fable 5 to develop other AI models, Anthropic took a different approach. The firm would deliberately degrade the model’s performance in ways invisible to the user, effectively sabotaging anyone trying to train competing systems. That work is already banned in Anthropic’s terms of service, but the original policy added silent degradation on top of the ban.

The restriction wasn’t disclosed in the model’s documentation, and the company did not alert users when it kicked in. Affected tasks, according to Engadget, included training competing LLMs, debugging AI code, and optimizing neural architectures. Anthropic has not said how long the silent throttling was in effect before researchers noticed.

Researchers Push Back on the Silent Throttle

When the hidden throttling surfaced publicly, the response from the AI research community was sharply critical. Dean Ball, a senior fellow at the Foundation for American Innovation and a former White House adviser on AI, wrote in a public post that “degrading performance on ML research *without telling the user* is shockingly hostile and a terrible look.”

Will Brown, research lead at the open-source AI startup Prime Intellect, framed the policy as a vote of no confidence in the broader field. “It felt like Anthropic was saying to the public, ‘We don’t trust anybody else to do AI research. We are the only ones who have to do AI research,’” Brown told WIRED. “It feels a bit like they’re starting to pull the ladder up behind them.”

Brown pointed to a practical problem the policy created. Developers would not know whether they were violating Anthropic’s rules, since the company would not alert them when its safeguards triggered. He also flagged the third-party evaluation firms that test frontier models for safety, performance, and reliability. Their work could have been silently degraded, too, leaving safety claims built on Fable 5 harder to verify.

The combined effect, researchers told WIRED, would have been a future in which only a handful of leading AI labs could perform advanced AI research with confidence that the model was performing as advertised. It also would have made safety testing itself a quiet casualty.

The Wider Stakes for Independent AI Research

The criticism went beyond an isolated complaint about a model behaving oddly. Claude’s coding agent has become a favored tool among developers, including those working on open-source AI research projects. The model sits at the center of a fast-growing field of evaluation firms, agentic toolchains, and academic labs. Each of those communities depends on Fable 5 behaving the way its documentation says it does.

In a follow-up post laying out five harms of the policy, Ball argued that the hidden safeguard was anti-competitive behavior justified in the name of AI safety, that it raised the case for treating frontier AI products as utilities whose alignment practices should be a matter of public policy, and that it was imposed unilaterally, “likely motivated largely by self-interest.”

Anthropic’s Defense: Foreign Adversaries and a Narrower Net

According to Anthropic’s launch announcement for Fable 5, the safeguards were designed for a specific threat. “These safeguards prevent foreign adversaries from using our most capable models in ways that pose severe safety risks,” the company said. “The US and its allies hold an edge in frontier chips and the highly optimized software that runs them at full potential. These safeguards ensure Claude isn’t used to erode that advantage, by optimizing chips developed by those adversaries, for example.”

The company also said it had weighed a choice between visible and invisible enforcement. A hidden safeguard, Anthropic argued, is harder to probe and work around, which means the same restrictions can be applied more narrowly.

In a recent blog post, Anthropic said it was concerned that AI could improve its capabilities faster than society can adapt. The company said it would be “good for the world to have the option to slow or temporarily pause frontier AI development to enable societal structures and alignment research to keep up.” The invisible safeguard was one operational expression of that position. Anthropic also runs Mythos 5, an unrestricted twin of the model, through Project Glasswing in collaboration with the US government, with broader “trusted access” promised later.

The Visible Trade-Off: A Wider Net, Less Precise Triggers

Under the new policy, Anthropic will alert users when it suspects they are trying to use Claude to build a highly capable AI. The system will either refuse the request or reroute the user to a less capable model, and the user will know which. The same broad restriction applies, but the way it surfaces is different.

The cost of that visibility, Anthropic said, is precision. A safeguard users can see is one they can probe, and a safeguard that can be probed must be drawn more broadly. Anthropic acknowledged that “more benign requests may trigger its safeguards” and said it is working to sharpen its classifiers. For researchers running legitimate experiments at the edge of what Fable 5 can do, the new policy may produce more false positives than the old one, which simply produced worse answers.

Safeguard	Disclosed up front	User is told when triggered	Trigger
Cyber, bio, chem	Yes	Yes, with the rerouted response	Queries on those subjects
Frontier AI development	Initially no, now yes	Initially no, now yes	Suspected work to build a highly capable AI

The Trust Ledger After the Reversal

The reversal is the most visible change, but the underlying restriction has not been removed. Anthropic says it will continue to refuse or reroute requests it judges to be frontier-AI development work. The company has not yet said how the new visible safeguard will be audited, or whether researchers will be able to appeal a routing decision. Both Ball and Brown have called for clearer documentation of the new triggers, not just the existence of the safeguard.

Fable 5’s benchmark results help explain why the dispute matters. The model scored 80.3% on SWE-bench Pro, ahead of GPT-5.5’s 58.6% and Opus 4.8’s 69.2%, and 29.3% on Cognition’s new Frontier Code benchmark, more than double Opus 4.8’s 13.4%. Stripe reported migrating a 50-million-line Ruby codebase in a day with Fable 5.

Anthropic says fewer than 5% of Fable sessions trigger a safety fallback to the older Opus 4.8, a figure the company has used to argue that a downgrade beats an outright refusal. That low rate, alongside the benchmark lead, is also why researchers are unwilling to give up the model, even when its behavior surprises them. The same dependence is what made the hidden throttling a real problem: a community that relies on a frontier model to do its work has no leverage if the model quietly degrades.

Anthropic’s reversal does not solve the underlying tension: a commercial lab controls the most powerful tool the research community uses, and can shape what work gets done by shaping what the model does. The dispute arrives as the company prepares for an IPO at a valuation near $965 billion, which makes every safeguard decision a public-market event as well.

Frequently Asked Questions

What is Claude Fable 5?

Claude Fable 5 is Anthropic’s first Mythos-class model, released on June 9, 2026. It is positioned as a step above the Opus tier in Anthropic’s lineup, with a focus on agentic coding and longer autonomous runs of a hundred hours or more on a single goal.

What did Anthropic change about the safeguards?

Anthropic said it will no longer silently degrade Claude Fable 5 when it suspects a user is trying to build a competing AI system. Instead, it will either refuse the request or reroute it to a less capable model, and in both cases tell the user that the safeguard was triggered.

Why did the company reverse course?

The original policy drew immediate criticism from AI researchers, who argued that undisclosed throttling of model performance amounted to sabotage of legitimate research work. Anthropic acknowledged it “made the wrong trade-off” in a statement to WIRED and committed to a more transparent approach.

What will the new policy look like in practice?

Anthropic says the visible safeguard will be drawn more broadly, meaning more legitimate requests may be flagged than under the hidden version of the same rule. The company is working to make its classifiers more precise so the wider net produces fewer false positives.

Who criticized the original policy?

Two of the most prominent critics were Dean Ball, a senior fellow at the Foundation for American Innovation, and Will Brown, research lead at the open-source AI startup Prime Intellect. Both framed the move as a step that broke trust with the research community that depends on Fable 5 for frontier work.