Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models

Conference: AAAI 2026 | arXiv: 2505.17089 | Code: None | Area: LLM Reasoning / Safety Defense | Keywords: adversarial defense, chain-of-thought, jailbreak, seamless response, inference-time defense

TL;DR

This paper proposes ASE (Adversarial Scenario Extrapolation), an inference-time CoT defense framework that has the LLM autonomously simulate adversarial scenarios and formulate defensive strategies before responding. ASE achieves near-zero attack success rates across four categories of safety threats (jailbreak, toxicity, hallucination, and bias) while cutting direct refusal rates to as low as ≤4%, effectively balancing robustness and user experience.

Background & Motivation

Background: LLMs face diverse safety threats including jailbreaks, toxicity, hallucinations, and bias, yet existing defenses typically target only a single threat type.

Limitations of Prior Work: (1) Existing defenses lack transferability across threat types — methods designed against jailbreaks are ineffective against bias or hallucination; (2) Successful defenses typically manifest as blunt refusals ("Sorry, I can't help"), sacrificing user experience and interpretability; (3) Static instruction-level defenses are easily bypassed by adaptive attacks (e.g., "ignore everything before this").

Key Challenge: A fundamental trade-off exists between robustness and seamlessness — detailed responses risk leaking harmful content, while terse refusals degrade user experience.

Goal: Design a unified defense framework that simultaneously improves robustness (against diverse attacks) and seamlessness (avoiding blunt refusals by providing helpful, guided responses).

Key Insight: Internalize adversarial awareness into the LLM's reasoning process — rather than relying on external filtering or detection, the model itself reasons through possible adversarial scenarios and responds in a prepared, defensively-informed manner.

Core Idea: A three-step CoT reasoning process — (1) self-generate adversarial scenarios; (2) formulate defensive strategies; (3) produce a guarded response grounded in the scenario analysis.

Method

Overall Architecture

ASE prepends two CoT reasoning steps as a "warm-up" at inference time: \(x \rightarrow r_{\text{scenario}} \rightarrow r_{\text{defense}} \rightarrow y\). No offline fine-tuning is required; the framework operates entirely at inference time.
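To make the pipeline concrete, here is a minimal sketch of the warm-up chain as sequential chat turns. It assumes an OpenAI-compatible client, and the prompt wording is an illustrative paraphrase of the three steps, not the paper's exact prompts:

```python
# Minimal sketch of the ASE warm-up chain: x -> r_scenario -> r_defense -> y.
# Assumes an OpenAI-compatible client; prompts are illustrative, not the paper's.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder; any chat model works

def chat(prompt: str, history: list) -> str:
    """Send one user turn and append the assistant reply to the shared history."""
    history.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(model=MODEL, messages=history)
    text = reply.choices[0].message.content
    history.append({"role": "assistant", "content": text})
    return text

def ase_respond(query: str) -> str:
    history = []
    # Step 1: threat-agnostic adversarial scenario extrapolation (r_scenario).
    chat(
        "Before answering, list ways the following query could be misused or "
        f"could elicit unsafe, toxic, biased, or false output:\n{query}",
        history,
    )
    # Step 2: defensive strategy formulation (r_defense).
    chat(
        "For each scenario above, state the mitigation strategy you will apply "
        "when answering.",
        history,
    )
    # Step 3: guarded response (y): answer helpfully, soft-refusing only the
    # unsafe parts and offering alternative assistance instead of a blunt denial.
    return chat(f"Now answer the original query, applying those strategies:\n{query}", history)
```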

Key Designs

  1. Step 1: Adversarial Scenario Generation (\(r_{\text{scenario}}\))
     • Function: The LLM autonomously infers potential adversarial scenarios based on the received query.
     • Mechanism: Rather than assuming a specific threat type, the LLM broadly considers ways the query could be misused, even when the query appears benign.
     • Design Motivation: Even when the inferred adversarial scenarios are inaccurate, this process shifts the LLM into a "risk-aware" state, reducing overconfidence.

  2. Step 2: Defensive Strategy Formulation (\(r_{\text{defense}}\))
     • Function: Generate mitigation strategies targeting the inferred adversarial scenarios.
     • Mechanism: The LLM practices formulating defensive responses within adversarial contexts, forming a "defensive cocoon."
     • Design Motivation: This "defense rehearsal" further reinforces the LLM's safety awareness, generating strong defensive momentum.

  3. Step 3: Guarded Response Generation (\(y\))
     • Function: Generate a guarded response to the original query based on the preceding two-step analysis.
     • Mechanism: Responses typically consist of a soft refusal statement, an explanation of the refusal, and alternative assistance available to the user.
     • Design Motivation: Avoids harmful content while refraining from blunt, unhelpful rejections.
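As an illustration of the three-part response shape (our paraphrase, not an output reported in the paper), a guarded reply to a risky query might read: "I can't provide steps for bypassing account security, since they could enable unauthorized access; if you are locked out of your own account, I can instead walk you through the official recovery process."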

Three Core Robustness Advantages

  • Momentum: The defensive momentum generated through deep reasoning cannot be neutralized by "ignore previous instructions"-style attacks.
  • Transferability: Threat-agnostic adversarial extrapolation enables defense coverage across jailbreaks, toxicity, hallucinations, and bias.
  • Self-Detection-Free: Does not rely on the model's pretrained ability to detect attacks, so it remains effective against novel attack patterns.

Key Experimental Results

Main Results (Four Threat Types × Four Models)

| Model      | Jailbreak ASR (Base → ASE) | Direct Refusal (Base → ASE) | Safe Response (Base → ASE) |
|------------|----------------------------|-----------------------------|----------------------------|
| GPT-4o     | 6.25% → 0.68%              | 88.3% → 10.9%               | 5.5% → 88.4%               |
| Llama-3.3  | 62.4% → 3.15%              | 23.2% → 18.1%               | 14.4% → 78.8%              |
| Gemma-2    | 79.5% → 5.97%              | 13.5% → 6.6%                | 7.0% → 87.4%               |
| Claude-3.5 | 10.5% → 2.2%               | 71.4% → 3.95%               | 18.1% → 93.9%              |
  • Claude-3.5 + ASE achieves the best overall performance: 93.9% safe responses, only 3.95% direct refusals, and 2.2% ASR.
  • Gemma-2 exhibits the largest improvement, with ASR dropping dramatically from 79.5% to 5.97%.
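As a quick sanity check, the relative ASR reductions implied by the table can be recomputed from the Base → ASE pairs above:

```python
# Relative jailbreak-ASR reduction per model, using the Base -> ASE values above.
asr = {
    "GPT-4o": (6.25, 0.68),
    "Llama-3.3": (62.4, 3.15),
    "Gemma-2": (79.5, 5.97),
    "Claude-3.5": (10.5, 2.2),
}
for model, (base, ase) in asr.items():
    print(f"{model}: {100 * (1 - ase / base):.1f}% relative reduction")
# GPT-4o: 89.1%, Llama-3.3: 95.0%, Gemma-2: 92.5%, Claude-3.5: 79.0%
```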

Cross-Threat Transferability

| Threat Type   | Metric              | Base → ASE                                       |
|---------------|---------------------|--------------------------------------------------|
| Toxicity      | Toxicity score      | Substantial reduction (~50–80%)                  |
| Hallucination | TruthfulQA accuracy | Llama-3.3: 68.2% → 92.1%; Gemma-2: 71.5% → 99.3% |
| Bias          | Bias score          | 4–10× reduction                                  |

Key Findings

  • ASE reduces direct refusal rates on the most refusal-prone models from 71–88% (GPT-4o, Claude-3.5) to 4–11% while maintaining near-zero ASR.
  • No performance degradation is observed on MMLU or news summarization (GPT-4o MMLU accuracy: 86.8% vs. 86.5%), indicating that ASE does not impair general reasoning capabilities.
  • Two-step ASE (merging the first two CoT steps into one) reduces token consumption by ~30% with minimal performance loss; a sketch of this variant follows this list.
  • ASE's seamless responses consistently comprise three components: (1) a gentle refusal statement; (2) an explanation of the refusal; (3) guidance directing users toward alternative assistance.
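A minimal sketch of the Two-step variant mentioned above, reusing the hypothetical chat helper from the earlier pipeline sketch; the merged prompt is again illustrative rather than the paper's exact wording:

```python
def ase_respond_two_step(query: str) -> str:
    """Two-step ASE variant: merge scenario extrapolation and defense
    formulation into one warm-up turn, trading a little coverage for
    roughly 30% fewer tokens (per the paper's ablation)."""
    history = []
    # Merged warm-up: Steps 1 and 2 in a single turn.
    chat(
        "Before answering, (a) list ways the following query could be misused, "
        f"and (b) state the mitigation you will apply for each:\n{query}",
        history,
    )
    # Guarded response, as in the three-step pipeline.
    return chat(f"Now answer the original query, applying those strategies:\n{query}", history)
```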

Highlights & Insights

  • The dual objective of "robustness + seamlessness" is a valuable contribution — most defenses focus solely on reducing ASR, whereas ASE simultaneously addresses user experience.
  • Internalizing adversarial awareness into the reasoning process is more resistant to circumvention than external filtering — attackers may instruct the model to ignore system prompts, but cannot bypass the model's own reasoning steps.
  • Threat-agnostic unified defense represents a significant contribution — a single method simultaneously addresses four distinct threat types: jailbreaks, toxicity, hallucinations, and bias.

Limitations & Future Work

  • The two-step CoT overhead at inference time increases latency and token costs, though the Two-step variant partially mitigates this issue.
  • ASE's defensive effectiveness depends on the LLM's intrinsic safety knowledge, potentially limiting efficacy for models with insufficient safety training.
  • A residual ASR of 5.97% on Gemma-2 indicates that reasoning momentum is not universally sufficient.
  • Multi-turn dialogue scenarios remain untested — adversaries may gradually erode ASE's defensive momentum across turns.
  • The ASR improvement on GPT-4o (6.25% → 0.68%) is less pronounced than on models with weaker safety baselines.

Comparison with Related Methods

  • vs. Instruction Tuning defenses: Static instructions can be bypassed via "ignore" commands; ASE's reasoning momentum is more persistent.
  • vs. SmoothLLM/Paraphrase: These approaches only sanitize inputs without improving response quality; ASE simultaneously enhances robustness and seamlessness.
  • vs. Constitutional AI/DPO: These methods require offline training; ASE operates entirely at inference time and is plug-and-play.
  • vs. BadThink (co-located paper): BadThink attacks reasoning efficiency, while ASE leverages reasoning to enhance safety; together the two works illuminate the security dimensions of CoT reasoning from opposing attack-defense perspectives.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First to leverage CoT reasoning as a general inference-time safety defense while jointly targeting robustness and seamlessness.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 4 models × 4 threat types × 6 baselines, 20K jailbreak samples, with human annotation validation.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is well-argued and the methodological principles are clearly articulated.
  • Value: ⭐⭐⭐⭐⭐ A highly practical inference-time defense solution with significant implications for the safe deployment of LLMs.