From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents
Conference: ICML 2026
arXiv: 2601.22607
Code: https://github.com/inclusionAI/AReaL/tree/main/examples/tau2 (available)
Area: LLM Agent / Reinforcement Learning / Tool Use
Keywords: Multi-turn tool use, verifiable reward RL, self-evolving synthetic data, GRPO, user simulator fine-tuning
TL;DR
To address two major bottlenecks in post-training "multi-turn interactive tool-using agents" (the high cost of quality data, and RL signal corruption from user simulation noise), the authors propose "self-evolving multi-agent data synthesis (AReaL-SEA)" paired with executable verifiers as rewards. Combined with an RL recipe of "first SFT the user model, then large batch + dynamic filtering GRPO," this approach pushes Qwen3-235B to Airline 73.0 / Telecom 98.3 pass^1 on τ²-bench, matching or surpassing Claude-Sonnet-4.5, Gemini 3.0 Pro, and GPT-5 on those two domains, while Retail (75.0) still trails all three.
Background & Motivation
Background: LLMs are evolving from "question-answering machines" to "task completion assistants," requiring them to communicate with humans and interact with environments (APIs/tools) within dialogues to accomplish complex tasks (e.g., τ²-bench Airline's "reschedule → query → policy check → execute" flow). While foundational models like ReAct/Toolformer/OpenVLA exist for tool-using agents, multi-turn interactive agents add the "user-in-the-loop" dimension, making them much more challenging than single-turn tool use.
Limitations of Prior Work: Training open-source models into competitive interactive agents faces two bottlenecks. (1) Data: Multi-turn tool-use dialogue data is extremely hard to scale—manual annotation is costly, and automatic synthesis struggles to simultaneously meet "complex domain rules + simulated user private info + sufficient RL task difficulty." (2) RL Instability: Interactive tasks require user-driven RL rollouts, necessitating user simulators. The authors find open-source models as user simulators are highly unstable—in τ²-bench's dual-control scenarios, users also issue tool calls, but open-source models often misuse tools or ignore instructions, causing rollout failures and misattributing rewards to the agent.
Key Challenge: RL is needed to train the agent, but RL requires stable rollouts, which require stable user simulation, which in turn needs good training data—yet good data depends on agent + user co-rollouts. This forms a circular dependency.
Goal: (i) Design a scalable, verifiable multi-turn tool-use data synthesis pipeline; (ii) Develop an RL recipe for interactive agents robust to unstable user simulation.
Key Insight: Make data synthesis a "hierarchical multi-agent system + self-evolving feedback loop," enabling the system to learn from its own failures; fine-tune the user simulator on synthetic data via SFT before RL rollouts to suppress user noise at the source; use large batch + dynamic filtering to absorb remaining reward variance.
Core Idea: Data = self-evolving multi-agent + executable verifier; RL = first stabilize the user simulator, then train the agent with stable GRPO; the two tightly integrate into a cyclically improvable post-training pipeline.
Method
Overall Architecture
Two main modules. AReaL-SEA Data Synthesis (§4): The meta-planner first generates \(N\) diverse (synthesis plan, evaluation plan) pairs, each running an independent pipeline (task synthesis → task verification → trajectory rollout → trajectory verification). Failure cases are aggregated into a reflection module to iteratively update plans, looping for \(K\) rounds. RL Recipe (§5): First, SFT the user simulator with synthetic data; then train the agent with GRPO (group-relative advantage + dynamic filtering + large batch), using rewards from a verifier comparing final vs. ground-truth states.
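The control flow of the synthesis loop described above can be sketched as follows. This is a minimal illustration, not the AReaL-SEA implementation: `Plan`, `Failure`, `evolve`, `toy_pipeline`, and `toy_reflect` are hypothetical names, and the toy stand-ins replace what are LLM-driven agents in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Plan:
    synthesis: str   # stands in for the synthesis plan P_s
    evaluation: str  # stands in for the evaluation plan P_e


@dataclass(frozen=True)
class Failure:
    reason: str
    attribution: str  # "task" or "trajectory"


Pipeline = Callable[[Plan], Tuple[List[str], List[Failure]]]
Reflector = Callable[[Plan, List[Failure]], Plan]


def evolve(plans: List[Plan], run_pipeline: Pipeline,
           reflect: Reflector, rounds: int):
    """Run each plan's pipeline for `rounds` rounds (K in the text),
    updating the plan from its own failures after every round."""
    evolved = list(plans)
    accepted: List[str] = []
    for n, plan in enumerate(evolved):
        for _ in range(rounds):
            samples, failures = run_pipeline(plan)
            accepted.extend(samples)
            if failures:
                plan = reflect(plan, failures)
        evolved[n] = plan
    return accepted, evolved


def toy_pipeline(plan: Plan):
    # Hypothetical stand-in: a plan fails until it mentions policy rules.
    if "policy" not in plan.synthesis:
        return [], [Failure("task ignores domain policy", "task")]
    return [f"sample from: {plan.synthesis}"], []


def toy_reflect(plan: Plan, failures: List[Failure]) -> Plan:
    notes = "; ".join(f.reason for f in failures)
    return Plan(plan.synthesis + " + enforce policy",
                plan.evaluation + f" [checked: {notes}]")


plans = [Plan("airline rebooking", "state match"),
         Plan("telecom policy reset", "state match")]
accepted, evolved = evolve(plans, toy_pipeline, toy_reflect, rounds=3)
print(len(accepted), evolved[0].synthesis)
```

In this toy run the first plan fails once, is revised by the reflector, and only then starts producing samples, which is the behaviour the reflection loop is meant to provide.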
Key Designs
- AReaL-SEA Self-Evolving Data Synthesis Pipeline:
    - Function: Generates diverse, complex, and verifiable multi-turn tool-use training samples.
    - Mechanism: (a) Diversified Plan Generation: The meta-planner sequentially generates \(N\) non-overlapping plan pairs, each specifying different domains/complexity/tool modes/user styles, explicitly constructing diversity without relying on randomness. (b) Four-stage Agent Pipeline: Task Synthesis Agent generates structured task tuples \(q = (u, t, a^*)\) via multi-turn tool use; Task Verification Agent checks task quality; Trajectory Rollout simulates full dialogues with user + assistant; Trajectory Verification Agent evaluates trajectory quality and assigns attribution tags (failure due to task or trajectory). (c) Reflection Loop: Failures and attributions are aggregated to a reflection agent, which updates \((\mathcal{P}_s, \mathcal{P}_e)\) for more precise plans and calibrated rubrics in the next round, forming a closed loop: \((\mathcal{P}_s^{(n,k+1)}, \mathcal{P}_e^{(n,k+1)}) = \text{Reflect}(\mathcal{P}_s^{(n,k)}, \mathcal{P}_e^{(n,k)}, \{\text{failures}\})\).
    - Design Motivation: Previous pipelines like APIGen-MT / TOUCAN are static and cannot learn from their own errors; this work makes data generation an evolvable multi-agent system, enabling domain-specific rule iteration. Ablations show: removing the evolution loop drops performance from 56.0 → 44.0; reducing prompt diversity from 64 → 4 drops from 56.0 → 42.5. Both components are key contributors.
- Executable Per-instance Verifier as RL Reward:
    - Function: Each synthetic task comes with an executable check function, serving as a sparse RL reward signal.
    - Mechanism: During task synthesis, generate the ground-truth final state and a verifier function; after an RL trajectory, the verifier compares \(s_T\) to the ground truth for key entities and actions. A full match yields 1, anything else 0, forming a binary outcome reward. Reward function: \(\mathcal{R}(s_t, a_t) = R(s_T)\) for \(t = T\), else 0.
    - Design Motivation: Using LLM-as-judge for interactive agent rewards is noisy and expensive; generating deterministic verifiers during synthesis is fast and accurate, implementing the RLVR paradigm for agent tasks.
- GRPO + User Model SFT + Large Batch + Dynamic Filtering:
    - Function: Stabilizes RL training under user simulation noise.
    - Mechanism: (a) User Model SFT: SFT the user simulator (based on Qwen3-30B-A3B-2507) with AReaL-SEA dialogue data to ensure stable instruction following and role-based tool use. In the Telecom ablation, RL with a base user model drops pass^1 from the SFT checkpoint's 85.4 to 75.6, whereas RL with the SFT user model raises it to 95.6, a 20-point gap between the two RL runs. (b) GRPO: For each task, sample \(G\) independent trajectories, compute group-normalized advantage \(\hat{A}(\tau^{(g)}) = \frac{R(\tau^{(g)}) - \mu_G}{\sigma_G}\); token-level clipping surrogate loss. (c) Large Batch: Ablation shows increasing total batch from 256 → 512 raises Airline pass^1 from 64.0-66.0 to 70.5, providing more stable advantage estimates. (d) Dynamic Filtering: Tasks with all-success or all-failure in a group yield \(\hat{A} = 0\) (no learning signal) and are filtered out, retaining only differentiated groups; removing this step drops performance from 70.5 to 65.0.
    - Design Motivation: User simulation noise is unique to this problem, making user model SFT both novel and necessary; the remaining trio (GRPO + large batch + dynamic filtering) ensures the limited reward signal drives learning as stably as possible.
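The binary verifier reward, the group-normalized advantage, and the dynamic filter from the list above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, states are modelled as flat dictionaries, the population standard deviation is used for \(\sigma_G\), and the small `eps` in the denominator is an added numerical safeguard that does not appear in the formula above.

```python
from statistics import mean, pstdev
from typing import Dict, List


def verifier_reward(final_state: dict, expected: dict) -> float:
    """Binary outcome reward: 1.0 only if every ground-truth entry
    is reproduced in the final state, else 0.0."""
    matched = all(final_state.get(k) == v for k, v in expected.items())
    return 1.0 if matched else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantage (R - mu_G) / sigma_G for one task's
    G sampled trajectories. `eps` is an assumption to avoid dividing by zero."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dynamic_filter(groups: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Drop tasks whose rewards are identical across the group
    (all-success or all-failure), since their advantages carry no signal."""
    return {task: r for task, r in groups.items() if len(set(r)) > 1}


reward = verifier_reward(
    {"status": "rebooked", "seat": "12A", "log": "ok"},
    {"status": "rebooked", "seat": "12A"},
)
groups = {
    "task_a": [1.0, 1.0, 1.0, 1.0],  # all success: filtered out
    "task_b": [1.0, 0.0, 0.0, 1.0],  # mixed: kept
    "task_c": [0.0, 0.0, 0.0, 0.0],  # all failure: filtered out
}
kept = dynamic_filter(groups)
advantages = {task: group_advantages(r) for task, r in kept.items()}
print(reward, advantages)
```

Only `task_b` survives filtering in this example, and its successful trajectories receive positive advantages while the failed ones receive negative advantages of equal magnitude.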
Loss & Training
The RL objective is \(\mathcal{J}_\text{RL}(\theta) = \mathbb{E}_{q \sim \mathcal{D}}[\frac{1}{\sum_g N_G}\sum_g \sum_t \sum_i \mathcal{L}_{t,i}^{(g)}(\theta)]\), where \(\mathcal{L}_{t,i}^{(g)} = \min(\rho_{t,i}^{(g)} \hat{A}^{(g)}, \text{clip}(\rho_{t,i}^{(g)}, 1-\epsilon, 1+\epsilon)\hat{A}^{(g)})\), and the token-level importance ratio is \(\rho_{t,i}^{(g)} = \pi_\theta / \pi_{\theta_\text{old}}\). SFT uses standard cross-entropy. The 30B model is trained on 64 H200 GPUs and the 235B model on 80 H200 GPUs.
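A small numerical sketch of the clipped surrogate may help make the objective concrete. The assumptions are labeled: the function names are hypothetical, the normalizer is read as the total token count over the group, per-token log-probabilities are passed in directly instead of coming from a model, and `eps=0.2` is an illustrative value (the summary does not state \(\epsilon\)).

```python
import math
from typing import List, Tuple


def clipped_token_objective(logp_new: float, logp_old: float,
                            advantage: float, eps: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one token,
    where rho = pi_theta / pi_theta_old is computed from log-probabilities."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def group_objective(trajectories: List[Tuple[float, List[Tuple[float, float]]]],
                    eps: float = 0.2) -> float:
    """Average the per-token terms over all tokens in the group.
    Each trajectory is (advantage, [(logp_new, logp_old), ...])."""
    total_tokens = sum(len(tokens) for _, tokens in trajectories)
    total = sum(
        clipped_token_objective(new, old, adv, eps)
        for adv, tokens in trajectories
        for new, old in tokens
    )
    return total / total_tokens


# Ratio 1.5 with a positive advantage is clipped to 1.2;
# with a negative advantage the unclipped (more pessimistic) term is kept.
up = clipped_token_objective(math.log(1.5), 0.0, advantage=1.0)
down = clipped_token_objective(math.log(1.5), 0.0, advantage=-1.0)
batch = group_objective([
    (1.0, [(0.0, 0.0), (0.0, 0.0)]),
    (-1.0, [(0.0, 0.0)]),
])
print(up, down, batch)
```

The asymmetry between `up` and `down` is the point of the `min`: the objective never rewards moving the ratio further than the clip range allows, but it still fully penalizes moves in the wrong direction.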
Key Experimental Results
Main Results
Results on τ²-bench across three domains (Airline / Retail / Telecom). pass^k requires all k independent attempts on a task to succeed, which is stricter than pass@k:
| Model | Airline pass^1 | Retail pass^1 | Telecom pass^1 |
|---|---|---|---|
| Claude-Sonnet-4.5 | 70.0 | 86.2 | 98.0 |
| Gemini 3.0 Pro | 73.0 | 85.3 | 98.0 |
| GPT-5 | 62.5 | 81.6 | 95.8 |
| Qwen3-235B baseline | 58.0 | 59.9 | 53.7 |
| Qwen3-235B + SFT | 64.0 | 71.5 | 87.9 |
| Qwen3-235B + RL | 73.0 | 75.0 | 98.3 |
| Qwen3-30B-A3B-2507 baseline | 56.0 | 54.2 | 28.5 |
| Qwen3-30B-A3B-2507 + SFT | 60.0 | 69.1 | 85.4 |
| Qwen3-30B-A3B-2507 + RL | 70.5 | 75.0 | 95.6 |
The 235B version matches Gemini 3.0 Pro on Airline (73.0) and surpasses all three frontier models on Telecom (98.3 vs. at most 98.0). Retail shows the largest remaining gap: Claude-Sonnet-4.5 leads at 86.2, while both open-source RL models reach 75.0. The 30B version is also highly competitive, with Telecom 95.6 close to GPT-5's 95.8.
Mix Training (combining data from all three domains) enables Qwen3-235B to achieve an average pass^1 of 81.3%, surpassing Qwen3-Max-Thinking (80.7) and GPT-5 (80.0); on the stricter pass^4 metric, 68.5% also exceeds Max-Thinking (66.8) and GPT-5 (64.0).
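To make the pass^k metric concrete, the sketch below estimates it from repeated trials per task. The summary only states the definition (all k attempts must succeed); the combinatorial estimator \(\binom{c}{k} / \binom{n}{k}\) used here, with c successes out of n trials, is an assumption about how the benchmark computes it, and the function names are hypothetical.

```python
from math import comb
from typing import List, Tuple


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimated chance that k attempts drawn from the n recorded trials
    all succeed: C(c, k) / C(n, k). Returns 0.0 when c < k."""
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (successes, trials) pairs."""
    return sum(pass_hat_k(c, n, k) for c, n in results) / len(results)


# Two tasks with four trials each: one always solved, one solved 3 of 4 times.
results = [(4, 4), (3, 4)]
p1 = benchmark_pass_hat_k(results, k=1)
p4 = benchmark_pass_hat_k(results, k=4)
print(p1, p4)
```

A single failed trial on the second task leaves pass^1 high but removes that task's entire contribution to pass^4, which is why pass^4 figures sit well below pass^1 figures for the same model.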
Ablation Study
| Configuration | Airline pass^1 (SFT) | Notes |
|---|---|---|
| Qwen3-30B baseline | 38.0 | Starting point |
| Human Expert data | 52.0 | Manually designed workflow |
| AReaL-SEA Full (64 plans, all components) | 56.0 | Surpasses manual |
| w/o Validation | 50.0 | Lacks quality filtering, -6 points |
| w/o Evolution | 44.0 | Lacks reflection loop, -12 points |
| 4 prompt sets only | 42.5 | Lacks diversity, -13.5 points |

| User Model | Telecom pass^1 (RL) | Notes |
|---|---|---|
| SFT checkpoint | 85.4 | Before RL |
| RL + base user model | 75.6 | Drops 10 points |
| RL + SFT user model | 95.6 | Gains 10 points |

| RL Configuration | Airline pass^1 | Notes |
|---|---|---|
| 8×32 (total 256) | 64.0 | Small batch |
| 16×16 (total 256) | 66.0 | Prompt vs. trajs split less important |
| 8×64 (total 512) | 70.5 | Large batch is key |
| 8×64 + no dynamic filtering | 65.0 | Filtering is essential |
Key Findings
- Automatic synthesis ≥ human expert: AReaL-SEA full at 56.0 surpasses human expert data at 52.0, showing self-evolution not only saves labor but also raises the data quality ceiling.
- User model SFT is a hidden key to RL success: Using a base user model cannot even maintain SFT checkpoint performance (75.6 < 85.4), a failure mode rarely emphasized in prior literature. Figure 2 case study shows base users ignore instructions and misuse tools, passing wrong signals to the agent.
- Total batch size matters more than prompt:traj split: 8×32 vs 16×16 are similar (64 vs 66), but 8×64 vs 8×32 is significant (70.5 vs 64.0), indicating GRPO advantage estimation stability mainly depends on total sample size.
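- Mix training is roughly neutral for large models but harms small ones: For 30B, mix training drops average pass^1 from 71.5 to 63.7 (Telecom drops 15 points), while 235B remains nearly unchanged (74.5 vs 74.7). The 71.5 and 74.5 figures equal the three-domain means of the SFT rows in the main table, so this comparison appears to be made at the SFT stage. The result supports the intuition that small models lack capacity for multi-domain absorption, and it can guide per-domain split strategies for deployment.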
Highlights & Insights
- "User simulator SFT" is the most underrated contribution: Prior agent RL work typically treats the user model as given (whether GPT-4.1 or an open-source base model). This work explicitly demonstrates that "user simulator quality directly determines RL improvement," with a 20-point empirical gap on Telecom (75.6 vs. 95.6), which is a key warning for interactive agent RL research.
- Self-evolving data synthesis is a general paradigm: Making "task generation → verification → trajectory rollout → verification → reflection → plan update" a closed loop enables LLMs to learn from failures in data synthesis, more scalable than static pipelines like APIGen-MT/TOUCAN. This architecture can transfer to other domains needing complex synthetic data (e.g., reasoning chains, long-context QA).
- Verifiable reward + agent RL as a paradigm: Extending RLVR from math/code to multi-turn tool-using agents, the key is generating verifiers during synthesis, avoiding LLM judges at training time—this "data with verifier" design can transfer to any task with programmatically checkable final states.
- Mix vs. Separate depends on model scale: A practical but overlooked finding, directly informing engineering decisions on "training a single general agent vs. per-domain experts" for enterprise deployment.
Limitations & Future Work
- Evaluation is limited to three τ²-bench domains; on Retail, the open-source models (75.0) still lag behind Claude-Sonnet-4.5 (86.2).
- The number of reflection loop steps \(K\) in AReaL-SEA is not systematically ablated; optimal convergence rounds remain open.
- No discussion of the distribution gap between synthetic and real production dialogues—τ²-bench's synthetic user styles may not cover real users.
- The RL recipe relies heavily on infrastructure (80 H200s for 235B), making replication challenging for smaller teams; lightweight extensions (e.g., distillation to small models) are a natural direction.
- Tool-use safety is not deeply discussed (impact statement briefly mentions "potential misuse"); real-world deployment requires dedicated permission/audit layers.
Related Work & Insights
- vs APIGen-MT (Prabhakar et al.): Also synthesizes multi-turn tool use, but APIGen-MT uses static reviewer-style validation; AReaL-SEA adds self-evolution and co-generation of verifiers, making it more RL-friendly.
- vs TOUCAN (Xu et al.): TOUCAN focuses on scale (1.5M trajectories), while this work pursues "small, high-quality + self-evolution," showing 64 high-quality plan sets can surpass human experts.
- vs ToolRL / Search-R1: These address single-turn tool-use RL; this work targets multi-turn interactive settings, introducing the critical dimension of user simulator quality.
- vs π₀ / GR00T: Also post-training for agents that act in an environment, but robotics uses real environments as ground truth; this work uses synthetic verifiers, lowering cost but with potential fidelity gaps.
- vs ARENA-RL (tournament RL): The latter uses relative ranking to address reward sparsity; this work uses dynamic filtering + large batch to address advantage noise—complementary approaches.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of self-evolving data synthesis + user model SFT + verifier-based RL is new for agent post-training, especially the finding that "user model SFT is key."
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three domains × three model scales × separate/mix × data ablation + user model ablation + RL algorithm ablation, with comparisons against mainstream commercial frontier models.
- Writing Quality: ⭐⭐⭐⭐ Clear narrative (data problem + RL problem → two solutions), concise formulas and figures; appendix provides solid training details.
- Value: ⭐⭐⭐⭐⭐ Open-source models matching or surpassing frontier models on the τ²-bench Airline and Telecom domains is a strong result, and the framework is reproducible (open code + detailed hyperparameters), directly valuable for industrial deployment of tool-using agents.