AI Models Now Autonomously Jailbreak Each Other at 97%

A study published in Nature Communications has surfaced one of the more unsettling findings in enterprise AI to date: large reasoning models can autonomously jailbreak other AI models with a 97.14% success rate — no human involvement required.

The research, titled “Large Reasoning Models Are Autonomous Jailbreak Agents” and authored by Thilo Hagendorff, Erik Derner, and Nuria Oliver, tested four advanced reasoning models — DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, and Qwen3 235B — as autonomous adversaries against nine widely used target AI systems. Each attacking model received a single system prompt with instructions to break through the target’s safety guardrails. The models then planned and executed their own attack strategies, adapting in real time based on what was working. The result: 97.14% of attempts succeeded across all model combinations.

On average, attackers needed just 5 to 7 prompt iterations to crack a model’s defenses.

Why the Safety Benchmarks Were Wrong

The finding exposed a fundamental flaw in how AI safety is measured. Standard safety benchmarks evaluate how a model responds to a single adversarial prompt. That methodology was never designed for a world where AI systems can carry on multi-turn conversations, refine their approach, and iterate until they find a gap.

Frontier models from OpenAI, Anthropic, Google, xAI, and Amazon all showed significantly worse risk profiles under iterative multi-turn attacks than their single-prompt benchmarks suggested. Safety claims that looked robust on paper fell apart when tested by another AI that kept pushing.

The researchers described this as “alignment regression” — reasoning models are so capable at planning and persuasion that they can systematically erode the safety alignment of other models. A feature designed to make AI more useful in complex tasks turns out to also make it an effective tool for defeating other models’ guardrails.

What This Means for Business

If you’re running AI agents in your business — whether for customer service, internal operations, data analysis, or workflow automation — this research raises a practical question: what happens if a malicious actor uses a capable AI to probe your deployed agents?

Most enterprise AI deployments are not ready for this. Research from multiple security firms shows that only 24% of generative AI projects include any meaningful security safeguards. Only 23% of organizations have formal AI security policies in place. And 68% have already experienced some form of AI-related data leak.

The risk is compounded for businesses using agentic AI systems — agents that can browse the web, query internal databases, send messages, trigger workflows, or execute code. An AI agent with broad permissions and a weak safety posture is a much more attractive target than a basic chatbot. When one AI can instruct another to abandon its safety guidelines, any agentic system becomes a potential entry point.

The study’s authors put it plainly: the persuasive capabilities of reasoning models have “converted jailbreaking into an inexpensive activity accessible to non-experts.” Anyone with access to a capable LLM can now attempt to break your AI’s safety controls without writing a single line of attack code.

The Practical Reality for AI Deployments

None of this means AI agents are too dangerous to deploy. It means they need to be deployed thoughtfully, with security built into the design rather than bolted on after launch.

A few things that actually matter here:

Least-privilege permissions. Your AI agents should only have access to what they need for a specific task. An agent handling customer enquiries should not have write access to your CRM or the ability to trigger financial workflows. If an agent does get manipulated, limited permissions limit the blast radius.

Human checkpoints on high-stakes actions. Fully autonomous AI agents executing irreversible actions — sending money, modifying records, contacting customers at scale — need human review gates before the action fires. This is not about distrust; it’s about operational resilience.

Isolation between agentic systems. If your AI agents talk to each other, think carefully about what one compromised agent can tell another. The Nature Communications research shows that AI-to-AI communication is exactly the vector where safety guardrails break down.

Regular red-teaming. Your agents should be tested adversarially before and during production deployment, not just benchmarked against static prompts. Static evaluations now have documented blind spots.

Vendor transparency. Ask your AI vendors how they test safety under iterative attacks, not just single-prompt evaluations. If they can only point to single-turn benchmarks, push harder.

The Harder Question

The deeper implication of this research is that safety and capability are increasingly in tension. The reasoning models best suited for complex enterprise tasks — the ones that can plan, adapt, and iterate — are also the most effective at defeating safety controls. Getting the most out of AI in your business and protecting against misuse through AI both require the same thing: thoughtful system design by people who understand how these models actually behave.

The businesses that treat AI security as a deployment detail rather than an architecture question are the ones most exposed to the risks this research describes.

If you’re building AI agents into your operations and want to make sure the architecture holds up, that conversation is worth having before you hit production, not after.

Talk to us about responsible AI deployment

Source

Nature Communications

Free Resource

Going deeper with Claude?

Get the free 32-page implementation guide for ANZ teams.

Enterprise DNA Resources