Red Queen: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking¶
Conference: ACL 2025
arXiv: 2409.17458
Code: https://github.com/kriti-hippo/red_queen
Area: AI Safety / RLHF Alignment
Keywords: Jailbreak Attack, Multi-Turn, Theory of Mind, Intent Concealment, DPO Defense
TL;DR¶
This paper proposes Red Queen Attack, the first jailbreak attack method based on the Theory of Mind (ToM) that constructs multi-turn dialogue scenarios to conceal malicious intent, generating 56K multi-turn concealed attack data points and achieving an 87.6% ASR on GPT-4o. Concurrently, the Red Queen Guard defense strategy is introduced to reduce the ASR to <1% through multi-turn DPO data training, without compromising general benchmark performance.
Background & Motivation¶
Background: LLM safety alignment (RLHF/DPO) has achieved effective defense against single-turn direct malicious requests, but in real-world scenarios, attackers can progressively conceal malicious intent through multi-turn dialogues. Existing research on jailbreaking primarily focuses on single-turn attacks (e.g., GCG, AutoDAN) or simple multi-turn strategies.
Limitations of Prior Work: (a) Most existing jailbreak methods are single-turn with highly explicit malicious intent, failing to reflect real-world attack scenarios; (b) existing multi-turn attacks (CoSafe, CoU) and concealed attacks (ArtPrompt, DeepInception) are fragmented, with no work combining "multi-turn + concealment"; (c) the ToM capability of LLMs is weak—they struggle to infer the implicit intentions of users, a vulnerability that is severely underestimated in safety contexts.
Key Challenge: LLMs are trained to be "helpful", but safety mechanisms rely on identifying explicit malicious intent. When a malicious intent is covertly wrapped (such as pretending to be a good person stopping a bad person), the helpfulness tendency of the LLM overrides its harmlessness.
Goal: (a) Systematically evaluate the threat level of multi-turn concealed attacks on LLMs; (b) provide effective defense strategies.
Key Insight: Starting from the ToM theory, the attack is formalized by constructing scenarios where the model's inferred explicit intent \(I_e\) is safe (Safe(\(I_e\))=1), while the implicit malicious intent \(I_i\) is unsafe.
Core Idea: Leverage weak LLM ToM capabilities to construct "pseudo-protector" scenarios across multi-turn dialogues to hide malicious intent for jailbreaking, and defend against it using multi-turn DPO training.
Method¶
Overall Architecture¶
Attack: Construct multi-turn concealed scenarios \(\rightarrow\) combine harmful behaviors \(\rightarrow\) generate 56K attack data \(\rightarrow\) evaluate 10 models.
Defense: Collect attack success/rejection data \(\rightarrow\) construct multi-turn DPO preference pairs \(\rightarrow\) train Red Queen Guard.
Key Designs¶
-
Based on ToM Attack Formalization (Section 3.2):
- Model behavior: \(LLM(S, T, I_e, I_i) = R\), where \(S\) is the scenario and \(T\) is the task
- \(I_e = Infer(S, T)\): The explicit intent inferred by the model (safe "helping to prevent crime")
- \(I_i\): The user's actual implicit malicious intent (obtaining criminal methods)
- Attack conditions: \(Safe(I_e) = 1\) and \(Safe(I_i) = 0\)
- Design Motivation: LLMs excel at responding to explicit requests but struggle to infer implicit intentions.
-
Scenario Construction and Data Generation (Section 3.3-3.4):
- Function: Generate 40 types of multi-turn concealed scenarios using Llama3.1-70B, covering various professions and relationships.
- Two major scenario categories: Professional (authoritative identities like police/teachers, 5 categories) and Relational (friends/parent-child, etc., 5 categories).
- Each scenario is expanded into 1/3/4/5-turn variants, with added turns incorporating more details and trust-building.
- 56K Dataset = 40 scenarios \(\times\) 1400 harmful behaviors (BeaverTails 14 classes \(\times\) K-means sampling of 100 cases per class).
- Design Motivation: Diverse scenarios prevent overfitting; authoritative roles (e.g., police) yield the best results.
-
Red Queen Guard Defense (Section 6):
- Function: Fine-tune the model on multi-turn safety preference data using DPO.
- Sample successful attacks + Llama3.1-405B-generated safe rejections \(\rightarrow\) 11.2K preference dataset.
- DPO fine-tuning enables the model to reject harmful requests even in concealed-intent scenarios.
- Control: 11.5K HH-RLHF general preference data is ineffective, indicating the necessity of specific data tailored to multi-turn concealed attacks.
Loss & Training¶
Standard DPO loss is utilized for defense. Evaluation is conducted using Llama3-70B combined with a customized judging prompt (achieving 96% accuracy, validated as superior to all prior methods on JailbreakBench).
Key Experimental Results¶
Main Results (10 Models Attack Success Rate ASR)¶
| Model | Direct Attack | 1-Turn Concealed | 3-Turn | 5-Turn | Overall |
|---|---|---|---|---|---|
| GPT-4o | 0.6% | 64.7% | 87.6% | 85.2% | 82.1% |
| Llama3-70B | 0.9% | 20.4% | 52.4% | 77.1% | 68.3% |
| Llama3.1-405B | 2.4% | 23.8% | 52.8% | 46.7% | 50.2% |
| Qwen2-72B | 1.3% | 29.6% | 38.3% | 54.1% | 49.2% |
| GPT-4o-mini | 0.6% | 49.1% | 30.6% | 54.8% | 44.9% |
Red Queen outperforms all baselines on 9 out of 10 models, representing a 2%-64% increase in ASR.
Ablation Study (Contribution of Multi-turn vs. Concealment)¶
| Setting | GPT-4o | Llama3-70B | Description |
|---|---|---|---|
| D (Direct) | 0.6% | 0.9% | Baseline |
| C (Concealed Only) | 64.7% | 20.4% | Concealment is the main contributor |
| M&D (Multi-turn Only) | 0.9% | 1.1% | Multi-turn without concealment is almost ineffective |
| M&C | 87.6% | 52.4% | Multi-turn + concealment synergistically enhance effectiveness |
Defense Effectiveness (Red Queen Guard via DPO)¶
| Model | Original ASR | +RQG ASR | MMLU-Pro | AlpacaEval |
|---|---|---|---|---|
| Llama3.1-8B | 19.8% | 1.2% | 48.3→48.3 | 27.8→26.0 |
| Llama3.1-70B | 37.9% | 1.3% | 55.1→55.1 | 34.9→36.8 |
| Llama3.1-405B | 50.2% | 0.6% | 64.5→64.5 | 32.0→Maintained |
Key Findings¶
- Intent concealment is core: Utilizing the concealment strategy alone achieves a 64.7% ASR on GPT-4o, with the multi-turn structure providing auxiliary enhancement.
- Larger models are more vulnerable: Within the same model family, larger models are consistently easier to attack than smaller ones, because larger models have a better understanding of the mock scenarios \(\rightarrow\) leading to more cooperative execution.
- GPT-4o is the most vulnerable: Reaching an 87.6% ASR, challenging the common perception that "GPT-4o is the safest model".
- Authoritative scenarios are the most effective: Professional scenarios such as the police yield the highest ASR.
- Increasing turns generally raises ASR: The 5-turn setting performs best on 6 out of 10 models, as more turns establish higher trust.
- DPO defense is effective with no side-effects: Red Queen Guard reduces ASR to <1% without degrading general capabilities.
Highlights & Insights¶
- ToM-perspective safety analysis: Systematically explaining the attack effectiveness from a cognitive science perspective. The formalization of \(LLM(S,T,I_e,I_i)=R\) can guide further safety research.
- Counter-intuitive finding: larger models are less safe: Models with stronger capabilities are more easily "persuaded" by scenarios, challenging the "scaling = safer" assumption. This has profound implications for AI Safety.
- Complete attack-defense closed loop: 56K attack data + 11.2K DPO defense data + evaluation methodology + open-source release constitute a complete infrastructure for security research.
Limitations & Future Work¶
- Scenario templates can be targeted for defense: If specifically trained on certain types of scenarios, specific templates will fail (though generalized intent concealment remains hard to defend against).
- English-only evaluation: The effectiveness of multi-turn concealed attacks in other languages or cultures is untested.
- Limited defensive generalization: Red Queen Guard is specifically trained against this attack, and its performance on entirely new attack variants remains to be validated.
- Attack generation cost: Requiring an auxiliary LLM to generate scenarios introduces a higher cost compared to simple prompt attacks.
Related Work & Insights¶
- vs. DeepInception: Uses sci-fi writing scenarios to conceal but is single-turn; Red Queen is more effective by combining multi-turn and authoritative roles.
- vs. CoSafe: Utilizes co-reference strategies in multi-turn dialogues but does not conceal intent; Red Queen demonstrates that concealment is the primary driver of ASR gains.
- vs. BadChain/SEED Attack: Manipulates reasoning steps rather than conversational scenarios—revealing LLM security vulnerabilities from a different dimension.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Pioneering systematic study of the ToM perspective + multi-turn concealment; complete attack and defense datasets.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 10 models \(\times\) 56K data points, broad analysis covering ablation, baselines, defense, and model sizes.
- Writing Quality: ⭐⭐⭐⭐⭐ Complete progression from ToM formalization to attack design and defense logic; vivid case studies.
- Value: ⭐⭐⭐⭐⭐ A highly significant contribution to LLM safety research; the finding that "larger models are more vulnerable" has profound implications.
- Common Template: Claiming others are planning harmful behaviors, positioning the user as a protector
- Two categories of scenarios: (a) Professional-based (5 categories such as police, doctors); (b) Relational-based (5 categories such as friends, family)
- Multi-turn expansion: 3-turn, 4-turn, and 5-turn variants + a single-turn control group
- Total of 40 scenarios \(\times\) 1400 harmful behaviors (from 14 BeaverTails classes) = 56K data points
-
Harmful behavior extraction: Extracted by GPT-4o from BeaverTails, with manual validation to ensure quality
-
Methodological Improvements in Evaluation:
- Found that existing evaluation methods (GCG, GPT-4o, BERT, Llama Guard) all have an accuracy of <0.8
- Redesigned the judging prompt to focus on "whether a detailed plan or suspicious tips were provided"
- Llama3-70B + the new judging prompt achieved a 0.96 agreement
-
Red Queen Guard Defense (Section 6):
- Collect success/refusal pairs of attack data to construct multi-turn DPO preference data
- Perform DPO training on Llama3-8B and 70B
- Key: Use preference pairs in a multi-turn dialogue format, rather than single-turn
Key Experimental Results¶
Main Experimental Table (Table 2: ASR across Models and Turns)¶
| Model | Direct | 3-Turn | 4-Turn | 5-Turn | Overall |
|---|---|---|---|---|---|
| GPT-4o | 0.64 | 87.62 | 73.43 | 85.19 | 82.08 |
| GPT-4o-mini | 0.57 | 30.64 | 49.19 | 54.77 | 44.87 |
| Llama3-70B | 0.93 | 52.41 | 75.40 | 77.11 | 68.31 |
| Llama3.1-405B | 2.36 | 52.79 | 51.19 | 46.66 | 50.21 |
| Qwen2-72B | 1.25 | 38.26 | 55.24 | 54.10 | 49.20 |
| Mixtral-8×22B | 22.95 | 28.04 | 45.52 | 46.17 | 39.91 |
- GPT-4o, as the commercial model with the best security, reaches an ASR as high as 87.6% (3-Turn)
- Direct attack ASR is almost 0 (<3%), whereas the Red Queen Attack improves it by 15-81%
Comparison with Prior Attacks (Figure 3)¶
- Red Queen Attack outperforms all existing methods on 9 out of 10 models
- ASR improvement range: 2% to 64%
- Particularly significant outperformance on GPT-4o and Llama3-70B
Key Factor Analysis¶
- Multi-turn vs. Single-turn: The multi-turn structure (M&C) improves ASR by 5-28% compared to single-turn concealment (C)
- Larger models are more vulnerable: Within the same model family, larger models yield higher ASR (Llama3-8B: 19.8% vs 70B: 68.3%)
- Scenario Category: Professional scenarios (e.g., Police) yield the highest ASR, while relational scenarios show relatively balanced performance
Red Queen Guard Defense Effectiveness¶
- Llama3-8B: ASR from 19.8% \(\rightarrow\) 0.52%
- Llama3-70B: ASR from 68.3% \(\rightarrow\) 0.97%
- No performance degradation on general benchmarks such as MT-Bench
Key Findings¶
- Larger models are more vulnerable instead—likely because stronger instruction-following capabilities make them easier to exploit via the "helpful" bias
- The combined effect of multi-turn + concealment is greater than the sum of their individual effects
- Simple multi-turn DPO training can defend effectively, but it requires preference data in a multi-turn format
Highlights & Insights¶
- Novelty of the ToM perspective: The concept of Theory of Mind is applied to jailbreak attack design for the first time, establishing a clear theoretical framework.
- Large-scale attack dataset: 56K multi-turn concealed attack data points covering 14 categories of harmful behaviors \(\times\) 40 scenarios represent an important resource for security research.
- Simple and effective defense: Red Queen Guard requires only DPO training to reduce the ASR to <1%, demonstrating high practicality.
- Counter-intuitive finding: Larger models are easier to jailbreak—challenging the "scale up = safer" assumption.
Limitations & Future Work¶
- The attack data features scenarios generated by Llama3.1-70B and behaviors extracted by GPT-4o, introducing bias toward other models.
- The defense is only validated on Llama3; DPO cannot be directly applied to closed-source models like GPT-4o.
- The attack templates are relatively fixed ("a friend/colleague is doing something bad, I must stop them"); highly targeted defenses might be bypassed.
- More complex adaptive attacks—where attackers are aware of the defense mechanisms—were not evaluated.
Related Work & Insights¶
- Comparison with Cipher-Based Attack (Yuan et al., 2024): The latter is concealed but single-turn, while this work's multi-turn + concealment exhibits stronger effects.
- Comparison with CoSafe (Yu et al., 2024): The latter is multi-turn but unconcealed, resulting in an ASR significantly lower than Red Queen.
- Insight: ToM capability is a critical gap in LLM safety—enhancing the model's intent inference capability may prove more effective than more RLHF data.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (The first ToM + multi-turn concealed jailbreak framework)
- Theoretical Depth: ⭐⭐⭐⭐ (Clear ToM formalization, comprehensive factor analysis)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (10 models \(\times\) 4 families \(\times\) multi-turn variants, comparison with 5+ prior methods)
- Practical Value: ⭐⭐⭐⭐⭐ (Attack dataset + defense strategy, offering both offense and defense abilities)
- Overall Recommendation: ⭐⭐⭐⭐⭐ (A major contribution to the field of LLM safety)