# Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks (DialTree)
Conference: ICLR 2026
arXiv: 2510.02286
Code: None
Area: AI Safety / Red-Teaming
Keywords: Multi-turn jailbreak, red-teaming, reinforcement learning, tree search, dialogue policy optimization
## TL;DR
This paper proposes DialTree, which frames multi-turn red-teaming as a goal-oriented dialogue policy optimization problem. By employing tree-structured rollouts with quality-based pruning to explore the attack trajectory space, combined with an adaptive mask to prevent format forgetting, DialTree achieves an average ASR of 81.5% across 12 target models, 44.2 points above the previous SOTA, and attains 71% ASR even on Claude-4-Sonnet.
## Background & Motivation
Background: Red-teaming is a critical technique for discovering safety vulnerabilities in LLMs. Existing methods fall into single-turn attacks (GCG/PAIR/TAP) and multi-turn attacks (MTSA/ActorAttack/X-Teaming). Research has shown that multi-turn attacks are far more effective than single-turn ones, as they can gradually erode safety boundaries.
Limitations of Prior Work:

- Existing multi-turn methods rely on hand-crafted heuristics or templates and cannot learn long-horizon adaptive strategies.
- The state space of multi-turn dialogue grows exponentially, making effective exploration with standard RL methods difficult.
- Jailbreak rewards come from imperfect proxy models (non-verifiable rewards), leading to unstable guidance signals.
- Format-following capability undergoes catastrophic forgetting during RL training.
Key Challenge: The dialogue space for multi-turn attacks is enormous, while effective attack strategies are sparse and difficult to discover.
Goal: Efficiently explore the multi-turn attack space, learn long-horizon dialogue strategies, and stabilize RL training.
Key Insight: Framing red-teaming as goal-oriented strategic dialogue, using tree search for structured exploration and adaptive masking for training stabilization.
Core Idea: Tree rollout + pruning = structured exploration of the multi-turn attack space; adaptive mask = protecting format tokens from being negatively updated by RL gradients.
## Method

### Overall Architecture
Given an attack goal \(g\), the attacker model \(\pi_\theta\) generates chain-of-thought (CoT) reasoning and an attack query, which is sent to the target model \(\pi_{\text{tgt}}\) for a response. The system then evaluates whether a jailbreak has succeeded before proceeding to the next turn. Training consists of two phases: (1) Cold-Start SFT to initialize format adherence and basic attack capability; and (2) DialTree RL to optimize the attack policy via tree search + GRPO.
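A minimal sketch of this per-turn loop, assuming hypothetical `attacker`, `target`, and `judge_harmful` callables standing in for \(\pi_\theta\), \(\pi_{\text{tgt}}\), and the safety classifier (none of these names come from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    goal: str                                     # attack goal g
    history: list = field(default_factory=list)  # [(query, response), ...]

def run_attack(goal, attacker, target, judge_harmful, max_turns=5):
    """One linear attack episode: the attacker emits (CoT, query), the target
    responds in context, and the episode stops once the judge flags a jailbreak."""
    state = DialogueState(goal)
    for _ in range(max_turns):
        cot, query = attacker(state)              # pi_theta: CoT + attack query
        response = target(state.history, query)  # pi_tgt answers given the history
        state.history.append((query, response))
        if judge_harmful(goal, query, response):  # per-turn success check
            return state, True
    return state, False
```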
### Key Designs
- **Dialogue Tree Rollout + Quality Pruning** (a minimal code sketch follows this list):
    - Function: Expands \(n\) candidate action branches from each dialogue state to form a tree structure for exploring the attack space.
    - Mechanism: Starting from the initial state \(s_0 = (g, \emptyset)\), the model samples \(n\) distinct (CoT, query) pairs at each turn for every active state. Each query is sent to the target model to obtain a response, forming new states. Three pruning criteria are applied: format validation (discarding outputs without CoT or query), topic consistency (discarding outputs that deviate from the goal), and branch limit (retaining at most \(w\) nodes per turn).
    - Design Motivation: Standard GRPO samples independent trajectories and cannot compare the effects of different actions under a shared dialogue context; the tree structure enables exploration of multiple directions from the same state.
    - Default Parameters: \(T_{\max}=5\) turns, branching factor \(n=4\), group size \(G=32\).
- **Adaptive Mask Mechanism** (also sketched after this list):
    - Function: Selectively protects format tokens from gradient updates during RL training.
    - Mechanism: Defines an adaptive mask \(M_t^{(i)} = 1 - \mathbb{I}\big((T_t^{(i)} \in \mathcal{V}_{\text{fmt}}) \land (A^{(i)} < 0)\big)\). When the trajectory advantage \(A < 0\), format tokens are masked to shield them from penalty gradients; when \(A \geq 0\), format tokens are updated normally to reinforce correct formatting.
    - Design Motivation: RL training was found to cause catastrophic forgetting of format-following capability, with malformed outputs surging from near 0% to over 70%. The root cause is that penalty gradients from negative-advantage trajectories inadvertently penalize correct format tokens.
    - vs. Static Mask: A static mask always protects format tokens, but is less effective than the adaptive variant, since format tokens in positive-advantage trajectories also need to be updated to reinforce correct formatting.
- **Red-Teaming Reward Design** (sketched after this list):
    - Function: Uses the HarmAug-Guard safety classifier to assess whether each trajectory achieves a successful jailbreak.
    - Mechanism: \(R = 1\) if any turn's (query, response) pair in the dialogue is judged harmful (\(r_\phi(g; q_t, r_t) > 0.5\)); otherwise \(R = 0\). The binary reward is simple yet effective.
    - Evaluation: A held-out GPT-4o judge, distinct from the HarmAug-Guard model used during training, is used to score attacks and avoid reward hacking.
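A sketch of the tree rollout referenced in the first item above. Here `sample_action`, `target_respond`, `is_well_formed`, and `on_topic` are hypothetical helpers for the paper's three pruning criteria, and the branch limit `w=8` is an assumed value (the text gives \(n=4\) and \(T_{\max}=5\) but no default for \(w\)):

```python
def tree_rollout(goal, sample_action, target_respond,
                 is_well_formed, on_topic, n=4, w=8, max_turns=5):
    """Tree-structured rollout with quality pruning. Each node is a dialogue
    history [(query, response), ...]; every active node expands into n
    children per turn, with at most w nodes kept per turn."""
    frontier = [[]]                                 # start from s0 = (g, empty dialogue)
    for _ in range(max_turns):
        children = []
        for history in frontier:
            for _ in range(n):                      # branching factor n
                cot, query = sample_action(goal, history)
                if not is_well_formed(cot, query):  # prune: format validation
                    continue
                if not on_topic(goal, query):       # prune: topic consistency
                    continue
                response = target_respond(history, query)
                children.append(history + [(query, response)])
        frontier = children[:w]                     # prune: branch limit (naive cut here;
                                                    # the paper keeps the best w nodes)
        if not frontier:                            # all branches pruned
            break
    return frontier                                 # leaf trajectories for the GRPO group
```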
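The adaptive masking rule is easy to state in code. A sketch assuming format-token membership `is_fmt_token` has been precomputed per sequence:

```python
import torch

def adaptive_mask(is_fmt_token: torch.Tensor, advantage: float) -> torch.Tensor:
    """M_t = 1 - I(token in V_fmt AND A < 0).

    is_fmt_token: bool tensor [seq_len], True where a token belongs to the
    format vocabulary (e.g. CoT/query delimiters). The returned float mask
    multiplies the per-token policy-gradient loss.
    """
    if advantage < 0:
        return (~is_fmt_token).float()  # shield format tokens from penalty gradients
    # positive/zero advantage: update all tokens, reinforcing correct formatting
    return torch.ones_like(is_fmt_token, dtype=torch.float32)
```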
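And the binary reward, with `guard_score` a hypothetical wrapper around the HarmAug-Guard classifier:

```python
def trajectory_reward(goal, history, guard_score, tau=0.5):
    """Binary trajectory reward: R = 1 if any turn's (query, response) pair
    is judged harmful by the guard classifier (score > tau), else R = 0."""
    return float(any(guard_score(goal, q, r) > tau for q, r in history))
```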
### Loss & Training
- SFT phase: 397 manually curated red-teaming dialogues with CoT annotations.
- RL phase: Dialogue GRPO, computing group-relative advantages over trajectories collected via tree rollout (a minimal sketch of the advantage computation follows this list). 500 training goals are sampled from AdvBench/DangerousQA/CatQA.
- Attacker model: Llama-3.1-8B-Instruct; target model (during training): Llama-3.2-1B-Instruct.
- Key insight: Training uses only a 1B target model, yet the learned attack strategies transfer to large models such as GPT-4o and Claude-4.
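As referenced in the RL bullet above, a minimal sketch of the group-relative advantage computation (standard GRPO normalization, not code from the paper):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [G] binary trajectory rewards from one goal's rollout group.
    Each trajectory's advantage is its reward standardized within the group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Usage: a group where 2 of 4 trajectories jailbroke the target.
# advantages = group_relative_advantages(torch.tensor([1., 0., 0., 1.]))
```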
## Key Experimental Results

### Main Results: Attack Success Rate (ASR@1, HarmBench)
| Method | GPT-4o | Claude-4-Sonnet | Grok-4 | o3-mini | Llama-3.3-70B | Avg (12 models) |
|---|---|---|---|---|---|---|
| GCG | 12.5 | 0 | 1.0 | 0 | 8.5 | 12.4 |
| PAIR | 18.0 | 2.5 | 8.5 | 11.5 | 25.5 | 17.6 |
| X-Teaming | 48.0 | 9.5 | 10.5 | 19.0 | 50.0 | 37.3 |
| DialTree | 86.0 | 71.0 | 75.0 | 86.5 | 89.5 | 81.5 |
### Ablation Study: Effect of Adaptive Mask
| Mask Strategy | Training Stability | Malformed Trajectory Rate | Reward Trend |
|---|---|---|---|
| No Mask | Training collapse | ~100% (by step 40) | Approaches 0 |
| Static Mask | Partially mitigated | ~100% (after step 60) | Slow decline |
| Adaptive Mask | Stable | <50% | Steady increase |
### Key Findings
- Remarkable cross-model transfer: Trained only against a 1B target model, DialTree achieves 71% ASR on Claude-4-Sonnet (widely regarded as one of the most safety-focused models), far exceeding the best competing method at 26%.
- Tree search contributes substantially: Compared to standard rollout without tree search, the tree-structured approach yields significant ASR gains.
- Adaptive mask is critical: Without masking, training collapses within 40 steps; the adaptive mask is the only approach that sustains stable training.
- High data efficiency: Only 397 SFT samples and 500 RL training goals are sufficient to produce a powerful attacker.
- Multi-turn vastly outperforms single-turn: Multi-turn average ASR 81.5% vs. best single-turn 33.8%.
## Highlights & Insights
- Red-teaming as strategic dialogue: Reframing jailbreaking as a goal-oriented dialogue decision problem, rather than simple prompt optimization, enables long-horizon strategic planning.
- Discovery and resolution of format forgetting: Catastrophic forgetting of format-following capability during RL training is a widespread but underappreciated issue. The paper identifies the root cause (format tokens penalized by negative-advantage gradients) and proposes the adaptive mask as an elegant solution, transferable to any RL setting that requires structured output formats.
- Small-model training → large-model transfer: Training with a 1B target model yields attack strategies that remain effective against GPT-4o/Claude-level models at inference time, indicating strong cross-model transferability of attack strategies—a serious warning for defenders.
## Limitations & Future Work
- Absence of a defensive perspective: The paper focuses exclusively on attacks and does not explore how the discovered vulnerabilities could inform improved defenses.
- Reward model reliability: HarmAug-Guard as a proxy reward may have blind spots, potentially leading to reward hacking.
- Computational overhead: Tree search combined with multi-turn interaction incurs high rollout costs.
- Future directions: Evaluating DialTree against reasoning-augmented defenses, such as ReSA's Answer-Then-Check strategy, would be a natural next step.
## Related Work & Insights
- vs. X-Teaming: X-Teaming uses multi-agent collaborative planning for multi-turn attacks (37.3% ASR), whereas DialTree uses a single agent with tree-search RL (81.5% ASR), demonstrating that learned policy optimization outperforms heuristic planning.
- vs. PAIR/TAP: These methods iteratively optimize single-turn prompts; DialTree extends the paradigm to multi-turn dialogue policy space, achieving a dramatic performance leap.
- vs. ActorAttack: ActorAttack guides attacks through semantically related entities; DialTree directly learns dialogue policies, offering greater flexibility and superior performance.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ The framework combining tree-search RL with adaptive masking is novel; the discovery of format forgetting has independent value.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on 12 target models (including GPT-4o/Claude-4/Grok-4), 8 baselines, with detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear, though some notation could be simplified.
- Value: ⭐⭐⭐⭐⭐ Significant implications for understanding multi-turn safety vulnerabilities in LLMs and for improving defenses.