From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents¶

Conference: ICML 2026
arXiv: 2601.22607
Code: https://github.com/inclusionAI/AReaL/tree/main/examples/tau2 (available)
Area: LLM Agent / Reinforcement Learning / Tool Use
Keywords: Multi-turn tool use, verifiable reward RL, self-evolving synthetic data, GRPO, user simulator fine-tuning

TL;DR¶

Post-training multi-turn interactive tool-using agents faces two major bottlenecks: high-quality data is expensive, and noise from the user simulator corrupts the RL signal. The authors propose self-evolving multi-agent data synthesis (AReaL-SEA) paired with executable verifiers as rewards. Combined with an RL recipe of "first SFT the user model, then GRPO with large batches and dynamic filtering," this approach pushes Qwen3-235B to 73.0 (Airline) and 98.3 (Telecom) pass^1 on τ²-bench, matching or surpassing Claude-Sonnet-4.5, Gemini 3.0 Pro, and GPT-5 on those two domains while still trailing them on Retail (75.0).

Background & Motivation¶

Background: LLMs are evolving from "question-answering machines" to "task completion assistants," which requires them to communicate with humans and interact with environments (APIs/tools) within a dialogue to accomplish complex tasks (e.g., the "reschedule → query → policy check → execute" flow in τ²-bench Airline). Foundational work such as ReAct, Toolformer, and OpenVLA exists for tool-using agents, but multi-turn interactive agents add a user-in-the-loop dimension that makes them much more challenging than single-turn tool use.

Limitations of Prior Work: Training open-source models into competitive interactive agents faces two bottlenecks. (1) Data: Multi-turn tool-use dialogue data is extremely hard to scale. Manual annotation is costly, and automatic synthesis struggles to satisfy three requirements at once: complex domain rules, private information held by the simulated user, and tasks difficult enough for RL. (2) RL Instability: Interactive tasks require user-driven RL rollouts, which in turn require a user simulator. The authors find that open-source models are highly unstable as user simulators: in τ²-bench's dual-control scenarios the user also issues tool calls, and open-source models often misuse tools or ignore instructions, so rollouts fail for reasons unrelated to the agent and the resulting reward is wrongly attributed to it.

Key Challenge: RL is needed to train the agent, but RL requires stable rollouts, which require stable user simulation, which in turn needs good training data—yet good data depends on agent + user co-rollouts. This forms a circular dependency.

Goal: (i) Design a scalable, verifiable multi-turn tool-use data synthesis pipeline; (ii) Develop an RL recipe for interactive agents robust to unstable user simulation.

Key Insight: Make data synthesis a hierarchical multi-agent system with a self-evolving feedback loop, so the system learns from its own failures; fine-tune the user simulator on synthetic data (SFT) before RL rollouts to suppress user noise at the source; use large batches and dynamic filtering to absorb the remaining reward variance.

Core Idea: Data = self-evolving multi-agent + executable verifier; RL = first stabilize the user simulator, then train the agent with stable GRPO; the two tightly integrate into a cyclically improvable post-training pipeline.

Method¶

Overall Architecture¶

Two main modules. AReaL-SEA Data Synthesis (§4): The meta-planner first generates \(N\) diverse (synthesis plan, evaluation plan) pairs, each running an independent pipeline (task synthesis → task verification → trajectory rollout → trajectory verification). Failure cases are aggregated into a reflection module to iteratively update plans, looping for \(K\) rounds. RL Recipe (§5): First, SFT the user simulator with synthetic data; then train the agent with GRPO (group-relative advantage + dynamic filtering + large batch), using rewards from a verifier comparing final vs. ground-truth states.

Key Designs¶

AReaL-SEA Self-Evolving Data Synthesis Pipeline:
- Function: Generates diverse, complex, and verifiable multi-turn tool-use training samples.
- Mechanism: (a) Diversified Plan Generation: The meta-planner sequentially generates \(N\) non-overlapping plan pairs, each specifying different domains/complexity/tool modes/user styles, explicitly constructing diversity without relying on randomness. (b) Four-stage Agent Pipeline: Task Synthesis Agent generates structured task tuples \(q = (u, t, a^*)\) via multi-turn tool use; Task Verification Agent checks task quality; Trajectory Rollout simulates full dialogues with user + assistant; Trajectory Verification Agent evaluates trajectory quality and assigns attribution tags (failure due to task or trajectory). (c) Reflection Loop: Failures and attributions are aggregated to a reflection agent, which updates \((\mathcal{P}_s, \mathcal{P}_e)\) for more precise plans and calibrated rubrics in the next round, forming a closed loop: \((\mathcal{P}_s^{(n,k+1)}, \mathcal{P}_e^{(n,k+1)}) = \text{Reflect}(\mathcal{P}_s^{(n,k)}, \mathcal{P}_e^{(n,k)}, \{\text{failures}\})\).
- Design Motivation: Previous pipelines such as APIGen-MT and TOUCAN are static and cannot learn from their own errors; this work makes data generation an evolvable multi-agent system, enabling domain-specific rules to be refined iteratively. Ablations on Airline pass^1 after SFT show that removing the evolution loop drops performance from 56.0 to 44.0, and cutting the number of plan/prompt sets from 64 to 4 drops it from 56.0 to 42.5, so both components contribute substantially.
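
The reflection loop described above can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the authors' code: the names `PlanPair`, `Failure`, `evolve`, `toy_pipeline`, and `toy_reflect` are hypothetical, and the toy pipeline stands in for the four LLM-driven stages (task synthesis, task verification, trajectory rollout, trajectory verification).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PlanPair:
    synthesis_plan: str   # P_s: what kind of tasks to generate
    evaluation_plan: str  # P_e: rubric used by the verification agents


@dataclass(frozen=True)
class Failure:
    attribution: str  # "task" or "trajectory", as tagged by trajectory verification
    reason: str


Pipeline = Callable[[PlanPair], tuple[list[str], list[Failure]]]
Reflect = Callable[[PlanPair, list[Failure]], PlanPair]


def evolve(plans: list[PlanPair], run_pipeline: Pipeline,
           reflect: Reflect, rounds: int) -> tuple[list[str], list[PlanPair]]:
    """Run `rounds` (K) iterations; each plan pair is updated from its own failures."""
    accepted: list[str] = []
    for _ in range(rounds):
        updated = []
        for plan in plans:
            samples, failures = run_pipeline(plan)
            accepted.extend(samples)
            # Only plans that produced failures are rewritten for the next round.
            updated.append(reflect(plan, failures) if failures else plan)
        plans = updated
    return accepted, plans


# Hypothetical stand-ins for the LLM agents, deterministic so the loop is easy to follow.
def toy_pipeline(plan: PlanPair) -> tuple[list[str], list[Failure]]:
    if "cite policy" not in plan.synthesis_plan:
        return [], [Failure("task", "task ignores domain policy")]
    return [f"verified sample <- {plan.synthesis_plan}"], []


def toy_reflect(plan: PlanPair, failures: list[Failure]) -> PlanPair:
    notes = "; ".join(f.reason for f in failures)
    return PlanPair(plan.synthesis_plan + ", cite policy",
                    plan.evaluation_plan + f" [check: {notes}]")


initial = [PlanPair("airline rebooking", "rubric v0"),
           PlanPair("telecom outage", "rubric v0")]
data, final_plans = evolve(initial, toy_pipeline, toy_reflect, rounds=3)
print(len(data), [p.synthesis_plan for p in final_plans])
```

In this toy run the first round yields only failures, the reflection step rewrites both plans, and later rounds produce accepted samples, which mirrors the intended behaviour of the closed loop at a very small scale.
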
Executable Per-instance Verifier as RL Reward:
- Function: Each synthetic task comes with an executable check function, serving as a sparse RL reward signal.
- Mechanism: During task synthesis, the pipeline generates a ground-truth final state and a verifier function; after each RL rollout, the verifier compares the final state \(s_T\) against the ground truth on key entities and actions. A full match yields 1, anything else 0, giving a binary outcome reward: \(\mathcal{R}(s_t, a_t) = R(s_T)\) for \(t = T\), else 0.
- Design Motivation: Using LLM-as-judge for interactive agent rewards is noisy and expensive; generating deterministic verifiers during synthesis is fast and accurate, implementing the RLVR paradigm for agent tasks.
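
A per-instance verifier of this kind might look like the sketch below. It assumes the environment state can be represented as a flat dictionary and that the synthesized task names the fields that matter; `make_verifier`, `per_step_rewards`, and the example field names are hypothetical and not taken from the paper or from τ²-bench.

```python
from typing import Any, Callable

State = dict[str, Any]


def make_verifier(ground_truth: State, key_fields: list[str]) -> Callable[[State], float]:
    """Build a deterministic check for one task: 1.0 on a full match, else 0.0."""
    def verify(final_state: State) -> float:
        matched = all(
            field in final_state and final_state[field] == ground_truth[field]
            for field in key_fields
        )
        return 1.0 if matched else 0.0
    return verify


def per_step_rewards(num_steps: int, final_state: State,
                     verify: Callable[[State], float]) -> list[float]:
    """Sparse outcome reward: zero everywhere except the terminal step T."""
    rewards = [0.0] * num_steps
    if num_steps > 0:
        rewards[-1] = verify(final_state)
    return rewards


# Hypothetical airline task: the booking must end up on the new flight.
target = {"booking_status": "rebooked", "flight": "F123", "notes": "ignored"}
verify = make_verifier(target, key_fields=["booking_status", "flight"])

good = per_step_rewards(3, {"booking_status": "rebooked", "flight": "F123"}, verify)
bad = per_step_rewards(3, {"booking_status": "cancelled", "flight": "F123"}, verify)
print(good, bad)
```

Because the check is plain code over the final state, it can be run on every rollout at negligible cost, which is the property the RLVR setup depends on.
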
GRPO + User Model SFT + Large Batch + Dynamic Filtering:
- Function: Stabilizes RL training under user simulation noise.
- Mechanism: (a) User Model SFT: SFT the user simulator (based on Qwen3-30B-A3B-2507) on AReaL-SEA dialogue data so that it follows instructions reliably and uses tools in role. In the ablation, RL with a base user model drops Telecom pass^1 from the SFT checkpoint's 85.4 to 75.6, while RL with the SFT user model raises it to 95.6, a 20-point gap between the two user models. (b) GRPO: For each task, sample \(G\) independent trajectories and compute the group-normalized advantage \(\hat{A}(\tau^{(g)}) = \frac{R(\tau^{(g)}) - \mu_G}{\sigma_G}\), where \(\mu_G\) and \(\sigma_G\) are the mean and standard deviation of the rewards within the group; optimize a token-level clipped surrogate loss. (c) Large Batch: Increasing the total batch from 256 to 512 trajectories raises Airline pass^1 from 64.0-66.0 to 70.5, consistent with more stable advantage estimates. (d) Dynamic Filtering: A task whose group is all-success or all-failure yields \(\hat{A} = 0\) (no learning signal) and is filtered out, so only groups with mixed outcomes are kept; removing this step drops Airline pass^1 from 70.5 to 65.0.
- Design Motivation: User simulation noise is unique to this problem, making user model SFT both novel and necessary; the remaining trio (GRPO + large batch + dynamic filtering) ensures the limited reward signal drives learning as stably as possible.
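
The group-normalized advantage and the dynamic filtering step can be sketched in a few lines. This is a minimal illustration under stated assumptions: it uses the population standard deviation and adds a small `eps` to the denominator for numerical safety, neither of which is confirmed by the summary above, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantage (R - mu_G) / sigma_G for one task's G rollouts.

    eps is an assumed stabilizer so an all-equal group gives 0 instead of 0/0.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dynamic_filter(groups: dict[str, list[float]]) -> dict[str, list[float]]:
    """Drop tasks whose rollouts all received the same reward (no learning signal)."""
    return {task: rs for task, rs in groups.items() if len(set(rs)) > 1}


# Binary verifier rewards for three hypothetical tasks, four rollouts each.
groups = {
    "all_success": [1.0, 1.0, 1.0, 1.0],
    "all_failure": [0.0, 0.0, 0.0, 0.0],
    "mixed": [1.0, 0.0, 0.0, 1.0],
}
kept = dynamic_filter(groups)
advantages = {task: group_advantages(rs) for task, rs in kept.items()}
print(list(kept), advantages["mixed"])
```

Only the mixed group survives filtering; its successful rollouts get a positive advantage and its failed ones a negative advantage of equal magnitude. With binary rewards, filtered groups would have contributed exactly zero gradient anyway, so filtering mainly keeps the effective batch from being diluted.
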

Loss & Training¶

The RL objective is \(\mathcal{J}_\text{RL}(\theta) = \mathbb{E}_{q \sim \mathcal{D}}[\frac{1}{\sum_g N_G}\sum_g \sum_t \sum_i \mathcal{L}_{t,i}^{(g)}(\theta)]\), where \(\mathcal{L}_{t,i}^{(g)} = \min(\rho_{t,i}^{(g)} \hat{A}^{(g)}, \text{clip}(\rho_{t,i}^{(g)}, 1-\epsilon, 1+\epsilon)\hat{A}^{(g)})\), and \(\rho_{t,i}^{(g)} = \pi_\theta / \pi_{\theta_\text{old}}\) is the token-level importance ratio. SFT uses standard cross-entropy. The 30B model is trained on 64 H200 GPUs and the 235B model on 80 H200 GPUs.
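
To make the objective above concrete, here is a minimal pure-Python sketch of the clipped surrogate for one group. It makes several assumptions that the text does not confirm: turns and tokens are flattened into one token list per trajectory, the normalizer is read as the total token count across the group, and `clip_eps=0.2` is a placeholder value. The function name is hypothetical.

```python
import math


def clipped_surrogate(logp_new: list[list[float]], logp_old: list[list[float]],
                      advantages: list[float], clip_eps: float = 0.2) -> float:
    """Token-level clipped objective averaged over all tokens in the group.

    logp_new / logp_old: per-trajectory lists of per-token log-probabilities
    under the current and the rollout policy. advantages: one value per trajectory.
    """
    total, n_tokens = 0.0, 0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new, old):
            ratio = math.exp(lp_new - lp_old)  # rho = pi_theta / pi_theta_old
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    if n_tokens == 0:
        raise ValueError("group contains no tokens")
    return total / n_tokens


# On-policy case: ratios are 1, so the objective is a token-weighted mean advantage.
on_policy = clipped_surrogate([[0.0, 0.0], [0.0]], [[0.0, 0.0], [0.0]], [1.0, -1.0])
# Ratio e with positive advantage is clipped at 1 + clip_eps.
clipped_up = clipped_surrogate([[1.0]], [[0.0]], [1.0])
# Ratio e with negative advantage is not clipped (the min keeps the pessimistic term).
unclipped_down = clipped_surrogate([[1.0]], [[0.0]], [-1.0])
print(on_policy, clipped_up, unclipped_down)
```

Normalizing by the total token count means longer trajectories contribute more terms than shorter ones, which is the usual consequence of a token-level rather than sequence-level average.
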

Key Experimental Results¶

Main Results¶

τ²-bench, three domains (Airline / Retail / Telecom), pass^k means all k independent attempts must succeed (stricter than pass@k):

Model	Airline pass^1	Retail pass^1	Telecom pass^1
Claude-Sonnet-4.5	70.0	86.2	98.0
Gemini 3.0 Pro	73.0	85.3	98.0
GPT-5	62.5	81.6	95.8
Qwen3-235B baseline	58.0	59.9	53.7
Qwen3-235B + SFT	64.0	71.5	87.9
Qwen3-235B + RL	73.0	75.0	98.3
Qwen3-30B-A3B-2507 baseline	56.0	54.2	28.5
Qwen3-30B-A3B-2507 + SFT	60.0	69.1	85.4
Qwen3-30B-A3B-2507 + RL	70.5	75.0	95.6

The 235B version matches Gemini 3.0 Pro on Airline (73.0) and surpasses all three frontier models on Telecom (98.3 vs. 98.0 at best). Retail shows the largest remaining gap: the open-source models reach 75.0, while Claude-Sonnet-4.5 leads at 86.2. The 30B version is also competitive, with Telecom at 95.6, close to GPT-5's 95.8.

Mix Training (combining data from all three domains) enables Qwen3-235B to achieve an average pass^1 of 81.3%, surpassing Qwen3-Max-Thinking (80.7) and GPT-5 (80.0); on the stricter pass^4 metric, 68.5% also exceeds Max-Thinking (66.8) and GPT-5 (64.0).
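
The difference between pass^k and pass@k is easy to see with the combinatorial estimators commonly used for them. The sketch below assumes each task is run for `n` independent trials with `c` successes and uses the estimators C(c, k)/C(n, k) for pass^k and 1 - C(n - c, k)/C(n, k) for pass@k; whether the paper computes its numbers exactly this way is an assumption, and the function names are hypothetical.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimated probability that k trials drawn without replacement ALL succeed."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    return comb(num_successes, k) / comb(num_trials, k)


def pass_at_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimated probability that AT LEAST ONE of k drawn trials succeeds."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    return 1.0 - comb(num_trials - num_successes, k) / comb(num_trials, k)


# A task solved in 3 of 4 trials.
for k in (1, 2, 4):
    print(k, pass_hat_k(4, 3, k), pass_at_k(4, 3, k))
```

For a task solved in 3 of 4 trials, pass^1 and pass@1 are both 0.75, but pass^4 falls to 0 while pass@4 rises to 1, which is why pass^k is the stricter measure of reliability.
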

Ablation Study¶

Configuration	Airline pass^1 (SFT)	Notes
Qwen3-30B baseline	38.0	Starting point
Human Expert data	52.0	Manually designed workflow
AReaL-SEA Full (64 plans, all components)	56.0	Surpasses manual
w/o Validation	50.0	Lacks quality filtering, -6 points
w/o Evolution	44.0	Lacks reflection loop, -12 points
4 prompt sets only	42.5	Lacks diversity, -13.5 points

User Model	Telecom pass^1 (RL)	Notes
SFT checkpoint	85.4	Before RL
RL + base user model	75.6	-9.8 points vs. SFT checkpoint
RL + SFT user model	95.6	+10.2 points vs. SFT checkpoint

RL Configuration	Airline pass^1	Notes
8×32 (total 256)	64.0	Small batch
16×16 (total 256)	66.0	Prompt vs. trajs split less important
8×64 (total 512)	70.5	Large batch is key
8×64 + no dynamic filtering	65.0	Filtering is essential

Key Findings¶

Automatic synthesis ≥ human expert: AReaL-SEA full at 56.0 surpasses human expert data at 52.0, showing self-evolution not only saves labor but also raises the data quality ceiling.
User model SFT is a hidden key to RL success: Using a base user model cannot even maintain SFT checkpoint performance (75.6 < 85.4), a failure mode rarely emphasized in prior literature. Figure 2 case study shows base users ignore instructions and misuse tools, passing wrong signals to the agent.
Total batch size matters more than prompt:traj split: 8×32 vs 16×16 are similar (64 vs 66), but 8×64 vs 8×32 is significant (70.5 vs 64.0), indicating GRPO advantage estimation stability mainly depends on total sample size.
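Mix training is roughly neutral for the large model but harms the small one: For 30B, mix training drops average pass^1 from 71.5 to 63.7 (Telecom drops 15 points), while 235B is nearly unchanged (74.5 vs 74.7). The separate-training averages (71.5 and 74.5) match the SFT rows of the main table, so this comparison appears to be at the SFT stage. The result is consistent with the intuition that small models lack the capacity to absorb multiple domains at once, and it suggests training per-domain models at small scale.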

Highlights & Insights¶

"User simulator SFT" is the most underrated contribution: Prior agent RL work typically takes the user model as given (whether GPT-4.1 or an open-source base model); this work appears to be the first to explicitly demonstrate that user simulator quality directly determines RL improvement, with a 20-point gap on Telecom (75.6 vs. 95.6), a key warning for interactive agent RL research.
Self-evolving data synthesis is a general paradigm: Making "task generation → verification → trajectory rollout → verification → reflection → plan update" a closed loop enables LLMs to learn from failures in data synthesis, more scalable than static pipelines like APIGen-MT/TOUCAN. This architecture can transfer to other domains needing complex synthetic data (e.g., reasoning chains, long-context QA).
Verifiable reward + agent RL as a paradigm: Extending RLVR from math/code to multi-turn tool-using agents, the key is generating verifiers during synthesis, avoiding LLM judges at training time—this "data with verifier" design can transfer to any task with programmatically checkable final states.
Mix vs. Separate depends on model scale: A practical but overlooked finding, directly informing engineering decisions on "training a single general agent vs. per-domain experts" for enterprise deployment.

Limitations & Future Work¶

Evaluation is limited to three τ²-bench domains; on Retail, the best open-source result (75.0) still lags well behind Claude-Sonnet-4.5 (86.2).
The number of reflection loop steps \(K\) in AReaL-SEA is not systematically ablated; optimal convergence rounds remain open.
No discussion of the distribution gap between synthetic and real production dialogues—τ²-bench's synthetic user styles may not cover real users.
The RL recipe relies heavily on infrastructure (80 H200s for 235B), making replication challenging for smaller teams; lightweight extensions (e.g., distillation to small models) are a natural direction.
Tool-use safety is not deeply discussed (impact statement briefly mentions "potential misuse"); real-world deployment requires dedicated permission/audit layers.

Comparison with Related Work¶

vs APIGen-MT (Prabhakar et al.): Also synthesizes multi-turn tool use, but APIGen-MT uses static reviewer-style validation; AReaL-SEA adds self-evolution and co-generation of verifiers, making it more RL-friendly.
vs TOUCAN (Xu et al.): TOUCAN focuses on scale (1.5M trajectories), while this work pursues "small, high-quality + self-evolution," showing 64 high-quality plan sets can surpass human experts.
vs ToolRL / Search-R1: These address single-turn tool-use RL; this work targets multi-turn interactive settings, introducing the critical dimension of user simulator quality.
vs π₀ / GR00T: Also post-train agents that act in an environment, but robotics uses real environments as ground truth; this work uses synthetic verifiers, which lowers cost but risks fidelity gaps.
vs ARENA-RL (tournament RL): The latter uses relative ranking to address reward sparsity; this work uses dynamic filtering + large batch to address advantage noise—complementary approaches.

Rating¶

Novelty: ⭐⭐⭐⭐ The combination of self-evolving data synthesis + user model SFT + verifier-based RL is new for agent post-training, especially the finding that "user model SFT is key."
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three domains × three model scales × separate/mix training, plus data, user-model, and RL-algorithm ablations, with comparisons against the main commercial frontier models.
Writing Quality: ⭐⭐⭐⭐ Clear narrative (data problem + RL problem → two solutions), concise formulas and figures; appendix provides solid training details.
Value: ⭐⭐⭐⭐⭐ Open-source models achieving or surpassing frontier models on τ²-bench is genuine SOTA, and the framework is reproducible (open code + detailed hyperparameters), directly valuable for industrial deployment of tool-using agents.