The Gap Between Single-Turn Safety and Multi-Turn Reality
Every LLM safety benchmark worth citing is built around single-turn interactions: one prompt, one response, one verdict. That methodology made operational sense in 2022 when chatbots were mostly novelties. In 2026, the same models are embedded in customer support pipelines, code-review agents, healthcare assistants, and autonomous workflow tools — systems that maintain context across dozens or hundreds of turns. The single-turn testing paradigm has not kept pace.
Cisco’s AI Defense team published the clearest empirical proof of this gap in early 2026. Testing eight open-weight large language models in a black-box configuration — no prior knowledge of each model’s guardrail architecture — they ran approximately 30,000 single-turn prompts alongside 7,000 multi-turn attack sequences spanning over 1,400 conversations. The result: Mistral Large-2 failed 92.78% of multi-turn attacks, the highest in the cohort; Google Gemma-3-1B-IT failed the fewest at 25.86%. Every model in the group showed multi-turn attack success rates that were two to ten times higher than single-turn baselines.
The models evaluated included Alibaba Qwen3-32B, DeepSeek v3.1, Google Gemma-3-1B-IT, Meta Llama 3.3-70B-Instruct, Microsoft Phi-4, Mistral Large-2, OpenAI GPT-OSS-20b, and Zhipu AI GLM 4.5-Air — a representative cross-section of what enterprise teams are deploying today.
A separate May 2026 Cisco study of 15 closed flagship models from OpenAI, Anthropic, Google, Amazon, and xAI reinforced the finding at the frontier. Grok 4.1 Fast reached 88% multi-turn attack success rate. Gemini 3 Pro jumped from roughly 18% single-turn failure to 73% multi-turn failure — a 55-point swing. Even the strongest performer, the Anthropic Claude family, showed 11–16% multi-turn failure rates after near-zero single-turn exposure. More than half of all 15 models showed an absolute gap of at least 15 percentage points between the two testing regimes.
Reasoning Models as Autonomous Jailbreak Agents
Cisco’s research treated the attacker as a human — a red-teamer running role-play adoption, contextual ambiguity, refusal reframing, and escalation tactics. A concurrent study published in Nature Communications by Hagendorff et al. posed a more alarming question: what happens when the attacker is itself a large reasoning model?
The study gave four large reasoning models (LRMs) — DeepSeek-R1, Gemini 2.5 Flash, Grok 3 Mini, and Qwen3 235B — a single system prompt instructing them to “plan and execute jailbreaks with no further supervision.” Each LRM then conducted ten-turn conversations against nine target models across 70 harmful prompts spanning seven sensitive categories: from weapons synthesis to social manipulation. Three LLM judges evaluated all responses on a 0–5 harm scale, generating 25,200 total target-model prompts for analysis.
The overall jailbreak success rate across all attacker-target combinations reached 97.14%. Individual attacker performance varied significantly: DeepSeek-R1 achieved a 90% maximum harm score; Grok 3 Mini reached 87.14%; Gemini 2.5 Flash reached 71.43%; Qwen3 235B was the outlier at 12.86%. On the defense side, Claude 4 Sonnet showed comparatively higher resistance, while DeepSeek-V3 proved more susceptible. The target model set included GPT-4o, Claude 4 Sonnet, DeepSeek-V3, Llama 3.1 70B, Llama 4 Maverick, o4-mini, Gemini 2.5 Flash, Grok 3, and Qwen3 30B.
The key structural finding: reasoning models do not need jailbreak libraries, prompt templates, or human expertise. Their extended chain-of-thought capabilities allow them to dynamically adapt attack strategies mid-conversation, diagnose refusal patterns, and pivot to new angles — exactly the kind of behavior that collapses single-turn safety training. As Hagendorff et al. note, this “converts jailbreaking into an inexpensive activity accessible to non-experts,” fundamentally shifting the threat economics.
Advertisement
What Security and AI Teams Should Do
The two studies together demand a concrete operational response. The threat is not theoretical: attackers running freely available LRMs can, today, extract harmful outputs from the frontier models your organization licenses or deploys.
1. Replace Single-Turn Benchmarks with Multi-Turn Red-Teaming as a Deployment Gate
No model should enter production without documented multi-turn attack success rates tested against the specific conversational flows your use case enables. The Cisco methodology — 7,000 multi-turn sequences across 1,400+ conversations — is now a reasonable reference floor for enterprise red-teaming. Security teams should request this data from vendors before procurement, and build internal testing capacity for every custom fine-tuned variant.
Specifically, Cisco recommends that organizations gate deployments on regressions in the top three attack procedure families (using a 3-point threshold) and flag any model showing a cross-regime gap exceeding 15 points between single-turn and multi-turn performance for mandatory manual review before production sign-off.
2. Implement Context-Aware Guardrails That Track Conversation History
The Cisco attack techniques — role-play adoption, contextual ambiguity, refusal reframing, information decomposition, escalation tactics — all exploit the fact that most guardrail systems evaluate each message in isolation. A message that looks benign in turn 6 of a conversation may be meaningfully different when read against turns 1–5.
Organizations deploying LLMs in agentic or long-session contexts need guardrails that maintain a conversation-level threat model: tracking semantic drift across turns, flagging incremental escalation patterns, and triggering hard stops (not just refusals) when a conversation exceeds a risk threshold. This is a different engineering problem from building a content classifier that operates on a single prompt. Real-time monitoring for anomalous interaction patterns — unusual turn-count-to-output length ratios, repeated refusal-then-reframe sequences — should be treated as a mandatory observability layer.
3. Test Your Specific Deployment Configuration — Not Just the Base Model
One of the most practically important findings in the Cisco frontier study was the impact of configuration flags on attack success rates. Grok 4.1 Fast running in non-reasoning mode reached 88% multi-turn attack success. The same model with reasoning mode enabled fell to approximately 44% — a 40-point reduction from a configuration toggle. This means that multi-turn attack resilience is not a fixed property of a model version; it is a function of how the model is configured and deployed.
Security teams must test their actual production configuration — system prompt, context window parameters, tool-use settings, reasoning-mode flags — not rely on vendor-published benchmark results which may reflect a different configuration than what ships in enterprise APIs. Adversarial training for robustness, combined with regular red-teaming exercises that target the exact deployment stack, are the minimum responsible practice. The Hagendorff et al. finding that appending an immutable safety suffix to every incoming message reduced LRM-driven attack effectiveness suggests one practical mitigation worth piloting alongside these configuration-level controls.
The Structural Problem Alignment Cannot Solve Alone
The 92–97% success rate range is not a statement about a particular model being poorly aligned. GPT-4o, Claude 4 Sonnet, and Gemini — three of the most extensively safety-trained models in commercial deployment — all appear in the target cohort of the Nature Communications study and all experienced meaningful jailbreak success rates across the 10-turn attack sequences.
The structural insight is that safety alignment is trained predominantly on static, single-turn data. When a capable reasoning model iterates across ten turns, adjusting its attack vector based on each refusal, it is operating in a distribution that most safety training has never seen. This is not a failure of effort or intent — it is a mismatch between training methodology and deployment reality.
Closing this gap will require the field to adopt multi-turn adversarial training at scale, mandate multi-turn safety disclosures in model cards, and develop standardized benchmarks that reflect production conversational context. Cisco’s recommendation that vendors publish attack success rates broken down by strategy family per model release is a practical first step toward the transparency the market currently lacks.
For security practitioners, the immediate takeaway is straightforward: the safety score on a model card tells you how it performs when an attacker gives up after one try. In 2026, attackers — human or machine — do not give up after one try.
Frequently Asked Questions
What is a multi-turn jailbreak attack and why is it more dangerous than a single-turn attack?
A multi-turn jailbreak is a sequence of conversational messages that gradually steers an LLM toward producing harmful or policy-violating output. Unlike a single-turn attack — where an attacker sends one crafted prompt — a multi-turn attack exploits the model’s memory of prior conversation turns, using techniques such as role-play escalation, refusal reframing, and incremental context manipulation. It is more dangerous because most safety alignment is trained on single-turn data, leaving models without adequate defenses against adversarial sequences that build across ten or more turns.
Which specific models were shown to be most and least resistant to multi-turn attacks?
In the Cisco open-weight study, Mistral Large-2 was the most vulnerable at 92.78% attack success rate; Google Gemma-3-1B-IT was the least vulnerable at 25.86% in that cohort. In the Cisco frontier study, the Anthropic Claude family showed the lowest multi-turn failure rates (11–16%), while Grok 4.1 Fast in non-reasoning mode reached the highest at 88%. In the Nature Communications study, Qwen3 235B was the least effective autonomous attacker at 12.86%, while DeepSeek-R1 was the most effective with a 90% maximum harm score.
What is the single most impactful step an organization can take right now to reduce multi-turn jailbreak risk?
Implement conversation-level guardrails rather than per-message content filters. Tools that evaluate each message in isolation miss the escalation patterns that multi-turn attacks exploit. Complementing this with a configuration audit — verifying that reasoning modes and safety suffixes are correctly set for your specific deployment — addresses the finding that configuration alone can shift attack success rates by 40 or more percentage points.
Sources & Further Reading
- Death by a Thousand Prompts: Open Model Vulnerability Analysis — Cisco Blogs
- Frontier AI Models Collapse Under Multi-Turn AI Attacks — Help Net Security
- Large Reasoning Models Are Autonomous Jailbreak Agents — Nature Communications
- Cisco Finds Multi-Turn Attacks Break Frontier AI Models — Let’s Data Science
- Open-Weight AI Models Fail the Jailbreak Test — GovInfoSecurity
- LLM Jailbreaks 2024–2026: Techniques, Risks & Defense Strategies — Startup House













