T\(^2\)PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
Conference: ICML 2026
arXiv: 2605.02178
Code: https://github.com/WillDreamer/T2PO (Available)
Area: LLM Reasoning / Agentic RL / Multi-turn Reinforcement Learning
Keywords: Multi-turn RL, Training Collapse, Self-calibrated Uncertainty, Token-level Thinking Intervention, Turn-level Dynamic Resampling
TL;DR
T\(^2\)PO attributes the training collapse of multi-turn agentic RL to "hesitation"—characterized by over-thinking at the token level and redundant, ineffective turns at the turn level. By utilizing a self-calibrated uncertainty signal \(M_t\), which fuses entropy and confidence, T\(^2\)PO simultaneously drives Token-level Thinking Intervention (dynamically truncating thought segments) and Turn-level Dynamical Sampling (resampling ineffective turns). This approach consistently outperforms PPO, GRPO, and GiGPO on WebShop, ALFWorld, and Search QA with superior stability.
Background & Motivation
Background: Multi-turn agentic RL—where agents interact multiple times with environments like WebShop or ALFWorld and self-evolve—is a core paradigm for building reasoning-capable LLM agents. Prevailing methods include PPO, GRPO, and GiGPO (group-based critic-free), often combined with rejection-sampling fine-tuning (RFT) for cold starts and length penalties.
Limitations of Prior Work: All SOTA baselines suffer from "training collapse," where success rates drop sharply and both KL divergence and gradient norms explode across different seeds. Existing mitigation strategies (fine-grained credit assignment, internal reward shaping, trajectory filtering) are either too coarse-grained (trajectory-level) or rely on indirect reward shaping, making training dynamics highly sensitive to hyperparameters.
Key Challenge: Current research treats training efficiency and training stability as a trade-off: accelerating rollouts introduces off-policy drift, while dense reward shaping distorts the RL objective. The paper argues that the two are not in conflict once the root cause of collapse is identified.
Goal: 1) Explain the poor stability by identifying a unified failure mechanism; 2) design dual-scale interventions at the token and turn levels; 3) improve efficiency and stability together without introducing additional reward shaping.
Key Insight: Analysis of training trajectories reveals that collapse stems from low exploration efficiency, manifested as two types of hesitation: (i) token-level over-thinking, where long reasoning chains yield saturated information gain, and (ii) turn-level redundancy, where agents repeatedly attempt similar ineffective turns. This represents a systematic violation of the exploration-exploitation balance.
Core Idea: A self-calibrated signal \(M_t = \alpha \tilde{H}_t + (1 - \alpha)(1 - \tilde{C}_t)\) captures both distribution sharpness and top-1 confidence, and the method monitors how quickly this signal changes. If \(M_t\) varies too little within a reasoning segment (signaling information saturation), the segment is truncated at the token level; if the turn-level aggregate \(\Phi^k\) barely changes between consecutive turns, the turn is resampled.
Method
Overall Architecture
T\(^2\)PO inserts two uncertainty-guided intervention modules into the standard multi-turn RL pipeline: TTI (Token-level Thinking Intervention), which dynamically truncates thinking segments during rollout, and TDS (Turn-level Dynamical Sampling), which identifies and resamples ineffective turns. Both modules are grounded in \(M_t\). Training is conducted using a memory context window (limited to the last \(P\) turns), turn-level discounted returns \(R(\tau^k) = \sum_{j=k}^K \beta^{j-k} r^j\), strict formatting penalties, and GRPO-style policy updates.
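The turn-level discounted return \(R(\tau^k) = \sum_{j=k}^K \beta^{j-k} r^j\) can be computed in one backward pass over the per-turn rewards. The following is a minimal sketch, not the authors' code; the function name `turn_level_returns` and the example reward sequence are illustrative assumptions.

```python
from typing import List


def turn_level_returns(rewards: List[float], beta: float) -> List[float]:
    """Return R(tau^k) = sum_{j=k}^{K} beta^(j-k) * r^j for every turn k.

    `rewards[k]` is the reward r^k of turn k; `beta` is the turn-level discount.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Backward recursion: R^k = r^k + beta * R^(k+1), with R^(K+1) = 0.
    for k in range(len(rewards) - 1, -1, -1):
        running = rewards[k] + beta * running
        returns[k] = running
    return returns


if __name__ == "__main__":
    # Hypothetical sparse-reward episode: only the final turn is rewarded.
    print(turn_level_returns([0.0, 0.0, 1.0], beta=0.9))
```

With a sparse terminal reward, earlier turns receive geometrically smaller returns, so turns closer to the success carry more of the credit.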
Key Designs
-
Self-calibrated Uncertainty Signal \(M_t\):
- Function: Provides a scalar signal that can distinguish "nearly uniform" from "highly sharp" distributions while remaining sensitive to tail probabilities in large vocabularies (e.g., 152K in Qwen3).
- Mechanism: Shannon entropy \(H_t = -\sum_i p_t^{(i)} \log p_t^{(i)}\) lacks discriminative power at extremes in large vocabularies, while top-\(j\) confidence \(C_t = -\frac{1}{j} \sum_{i=1}^j \log p_t^{(i)}\) ignores the tail. \(M_t\) is constructed by normalizing these signals (\(\tilde{H}_t\) and \(\tilde{C}_t\)) and fusing them: \(M_t = \alpha \tilde{H}_t + (1 - \alpha)(1 - \tilde{C}_t)\). This retains both tail sensitivity and top-1 stratification.
- Design Motivation: Single indicators have blind spots; the fused \(M_t\) acts as a reliable scalar for "local distribution stability," allowing threshold rules to maintain consistent semantics across tokens and turns.
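The fusion of entropy and confidence described above can be sketched for a single next-token distribution as follows. This is an illustration under stated assumptions, not the paper's implementation: the summary does not say how \(\tilde{H}_t\) and \(\tilde{C}_t\) are normalized, so the min-max scaling, the default entropy range \([0, \log V]\), the bound `C_MAX`, and all function names are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

EPS = 1e-12             # floor to avoid log(0); an assumption of this sketch
C_MAX = -math.log(EPS)  # hypothetical upper bound used to normalize C_t


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy H_t = -sum_i p_i log p_i (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_j_confidence(probs: Sequence[float], j: int) -> float:
    """C_t = -(1/j) * sum of log p over the j most likely tokens (j >= 1)."""
    top = sorted(probs, reverse=True)[:j]
    return -sum(math.log(max(p, EPS)) for p in top) / len(top)


def min_max(x: float, lo: float, hi: float) -> float:
    """Scale x into [0, 1] with clamping; returns 0 for a degenerate range."""
    if hi <= lo:
        return 0.0
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def fused_uncertainty(
    probs: Sequence[float],
    alpha: float = 0.5,
    j: int = 2,
    h_range: Optional[Tuple[float, float]] = None,
) -> float:
    """M_t = alpha * H~_t + (1 - alpha) * (1 - C~_t)."""
    # Assumed default range for H: [0, log V]; running estimates of
    # H_min and H_max could be passed in through h_range instead.
    h_lo, h_hi = h_range if h_range is not None else (0.0, math.log(len(probs)))
    h_tilde = min_max(entropy(probs), h_lo, h_hi)
    c_tilde = min_max(top_j_confidence(probs, j), 0.0, C_MAX)
    return alpha * h_tilde + (1.0 - alpha) * (1.0 - c_tilde)


if __name__ == "__main__":
    uniform = [1.0 / 8] * 8        # maximally uncertain toy distribution
    peaked = [0.93] + [0.01] * 7   # confident toy distribution
    print(fused_uncertainty(uniform), fused_uncertainty(peaked))
```

Under these assumptions the near-uniform distribution scores higher than the peaked one, which is the ordering a threshold rule on \(M_t\) relies on.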
-
TTI (Token-level Thinking Intervention) — Adaptive Termination of Reasoning:
- Function: Dynamically determines when reasoning has saturated and injects the reasoning terminator `</think>` into the logits to stop over-thinking.
- Mechanism: After a minimum prefix length \(L_{\min}\), the algorithm monitors the change \(\Delta_t^k = |M_t^k - M_{t-1}^k|\). When the average change within window \(N\) falls below threshold \(\varepsilon\), the `</think>` token logit is set to \(+\infty\). A fixed queue \(\mathcal{Q} = [\texttt{</think>}, \backslash n, \texttt{<action>}]\) is then injected to ensure structured output.
- Design Motivation: Unlike fixed-length truncation or indirect reward control, TTI provides direct, adaptive, token-level hard truncation. It uses a sliding window to smooth spikes and avoids premature truncation of critical task-specific tokens.
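The TTI stopping rule can be sketched as a sliding-window check over a stream of \(M_t\) values. This is a simplified illustration, not the released code: the names `tti_stop_index`, `force_only`, and `FORCED_QUEUE` are hypothetical, and instead of setting one logit to \(+\infty\) the sketch masks every other logit to \(-\infty\), which selects the same token while avoiding NaNs in a softmax.

```python
from collections import deque
from typing import Deque, Iterable, List, Optional

# Tokens injected once the rule fires, mirroring the fixed queue Q.
FORCED_QUEUE = ["</think>", "\n", "<action>"]


def tti_stop_index(
    m_values: Iterable[float], l_min: int, window: int, eps: float
) -> Optional[int]:
    """Return the first token index at which thinking should be terminated.

    The rule fires when at least `l_min` tokens have been generated and the
    mean of |M_t - M_{t-1}| over the last `window` steps is below `eps`.
    Returns None if the rule never fires.
    """
    deltas: Deque[float] = deque(maxlen=window)
    prev: Optional[float] = None
    for t, m in enumerate(m_values):
        if prev is not None:
            deltas.append(abs(m - prev))
        prev = m
        # Prefix protection and a full window are both required.
        if t + 1 < l_min or len(deltas) < window:
            continue
        if sum(deltas) / window < eps:
            return t
    return None


def force_only(logits: List[float], token_id: int) -> List[float]:
    """Return logits under which `token_id` is the only sampleable token."""
    forced = [float("-inf")] * len(logits)
    forced[token_id] = 0.0
    return forced


if __name__ == "__main__":
    # Toy hump-shaped trajectory: rise, peak, then a flat tail.
    hump = [0.2, 0.5, 0.9, 0.6, 0.4, 0.31, 0.30, 0.30, 0.30, 0.30]
    print(tti_stop_index(hump, l_min=3, window=3, eps=0.02))
```

Averaging over a window means a single flat step near the peak does not trigger truncation; the rule only fires once the signal has stayed flat for several consecutive tokens.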
-
TDS (Turn-level Dynamical Sampling) — Resampling Ineffective Turns:
- Function: Detects ineffective interactions at the turn level (turns where the agent's state remains stagnant) and resamples them to make better use of the rollout budget.
- Mechanism: A turn-level signal \(\Phi^k = (\prod_{t=1}^T M_t^k)^{1/T}\) is computed as the geometric mean of the token-level signal over the \(T\) tokens of turn \(k\). If the change between consecutive turns \(\Gamma^k = |\Phi^k - \Phi^{k-1}|\) is less than \(\eta\) (indicating no belief shift), the agent discards the turn and rolls out again from the same state, up to \(B_{\max}\) times.
- Design Motivation: Repeated ineffective turns are a primary cause of multi-turn RL collapse. TDS removes them during the rollout phase, stabilizing gradient signals and saving computation.
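The TDS resampling loop can be sketched as below. Several details are assumptions of this sketch, not statements about the paper: `sample_turn` is a hypothetical callback that rolls out one turn from the current state and returns it with its per-token \(M_t\) values, \(B_{\max}\) is interpreted as the number of resamples (not total attempts), and the first turn, which has no predecessor, is accepted unconditionally.

```python
import math
from typing import Any, Callable, Optional, Sequence, Tuple


def turn_signal(m_values: Sequence[float]) -> float:
    """Phi^k: geometric mean of the per-token signal, computed in log space."""
    floor = 1e-12  # avoids log(0); an assumption of this sketch
    return math.exp(sum(math.log(max(m, floor)) for m in m_values) / len(m_values))


def sample_turn_with_tds(
    sample_turn: Callable[[], Tuple[Any, Sequence[float]]],
    phi_prev: Optional[float],
    eta: float,
    b_max: int,
) -> Tuple[Any, float, int]:
    """Resample the turn while |Phi^k - Phi^(k-1)| < eta, at most b_max times.

    Returns the accepted turn, its Phi^k, and the number of resamples used.
    """
    turn, m_values = sample_turn()
    phi = turn_signal(m_values)
    resamples = 0
    while (
        phi_prev is not None
        and abs(phi - phi_prev) < eta
        and resamples < b_max
    ):
        # Discard the stagnant turn and roll out again from the same state.
        turn, m_values = sample_turn()
        phi = turn_signal(m_values)
        resamples += 1
    return turn, phi, resamples


if __name__ == "__main__":
    # Canned rollouts: two stagnant turns followed by one with a belief shift.
    canned = iter([("a", [0.5, 0.5]), ("b", [0.5, 0.5]), ("c", [0.9, 0.9])])
    print(sample_turn_with_tds(lambda: next(canned), phi_prev=0.5, eta=0.05, b_max=3))
```

If the budget is exhausted, the last sampled turn is kept even though it is still stagnant, so the rollout always makes progress.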
Loss & Training
The framework uses an RFT cold start, a memory context window for VRAM efficiency, and turn-level discounted returns. Policy updates are GRPO-style and critic-free. TTI and TDS intervene only during rollouts and do not modify the core policy update equations.
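As a reminder of what "GRPO-style, critic-free" typically means, the sketch below normalizes returns within a group of rollouts so that no learned value function is needed. It illustrates the generic group-relative advantage only; how T\(^2\)PO forms groups over turns is not specified in this summary, and the function name and the use of the population standard deviation are assumptions.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    returns: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Advantage of each rollout relative to its group: (R - mean) / (std + eps)."""
    mu = fmean(returns)
    sigma = pstdev(returns)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in returns]


if __name__ == "__main__":
    # Hypothetical group of four rollouts from the same prompt: two succeed.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

A group in which every rollout has the same return yields all-zero advantages and therefore no gradient signal, which is one reason removing stagnant turns at rollout time can matter for this family of updates.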
Key Experimental Results
Main Results
Evaluation on WebShop and ALFWorld using Qwen3-4B + RFT; RL results are reported as mean ± std over 5 seeds:
| Method | WebShop Task Score | WebShop Success Rate | ALFWorld Success Rate |
|---|---|---|---|
| GPT-4o (Prompting) | 31.8 | 23.7 | 48.0 |
| Gemini-2.5-Pro (Prompting) | 42.5 | 35.9 | 60.3 |
| Claude Sonnet 4 (Prompting) | 45.6 | 39.8 | 63.7 |
| Qwen3-4B + SFT | 70.91 | 26.56 | 64.06 |
| PPO | 70.34 ± 8.63 | 61.93 ± 5.93 | 75.39 ± 3.81 |
| GRPO | 80.02 ± 7.94 | 68.56 ± 4.11 | 77.35 ± 0.62 |
| GiGPO | 86.03 ± 4.18 | 73.83 ± 3.04 | 80.47 ± 2.43 |
| T\(^2\)PO (Ours) | Highest & Lowest Std | Highest | Highest |
T\(^2\)PO achieves the best performance across all tasks with significantly lower variance across seeds, directly addressing training collapse.
Ablation Study
| Configuration | Key Observation |
|---|---|
| Full T\(^2\)PO | Optimal performance and stability; TTI + TDS synergistic. |
| TTI Only | Shorter reasoning segments, improved average stability. |
| TDS Only | Fewer ineffective turns, higher rollout efficiency. |
| Pure Entropy \(H_t\) | Threshold rules fail due to poor discriminative power in large vocabularies. |
| Pure Confidence \(C_t\) | Loss of tail information; TTI causes mis-truncation. |
| Truncate at \(M_t\) Peak | Performance decreases—critical task tokens are removed. |
Key Findings
- The \(M_t\) trajectory typically follows a "hump" shape. The peak area contains task-specific tokens (e.g., product names in WebShop), while the post-peak "convergence area" contains the redundant reasoning suitable for TTI.
- The combination of one-time activation, \(L_{\min}\) prefix protection, and sliding windows ensures TTI does not degrade performance.
- Using the geometric mean for \(\Phi^k\) is more stable than the arithmetic mean, as it prevents extreme token outliers from skewing the Turn-level belief state.
Highlights & Insights
- Unified Perspective: Using a common self-calibrated uncertainty signal for dual-scale interventions (TTI/TDS) provides an elegant solution to previously disjoint control problems.
- Hard Truncation vs. Soft Penalty: Implementing stop-gradient hard truncation and token injection during rollouts is a more robust engineering tool than indirect signals like length penalties.
- Strategic Truncation: The insight that one should not truncate at the peak of \(M\) demonstrates a nuanced understanding of reasoning traces—peaks represent high information density, not over-thinking.
- Broad Applicability: The TDS mechanism, based on belief shifts, can be generalized to various multi-turn tasks (tool-use, dialogue, coding) as a trajectory quality controller.
Limitations & Future Work
- The abundance of hyperparameters (\(\varepsilon, \eta, L_{\min}, N, B_{\max}\)) requires tuning across different tasks; an automated tuning method is not yet provided.
- The self-calibrated signal depends on estimates of \(H_{\min}\) and \(H_{\max}\), which may drift in long horizons.
- Experiments were conducted on 4B-scale models; scalability to larger models (70B+) and hyper-complex environments like SWE-Bench remains to be tested.
Related Work & Insights
- vs. trajectory-level filters (SimpleTIR/rStar2-Agent): These filter whole trajectories post-hoc; T\(^2\)PO resamples at the turn level during rollouts, maintaining finer granularity and preserving valid data segments.
- vs. GiGPO/DAPO: While they focus on advantage estimation, T\(^2\)PO modifies the rollout itself, allowing the methods to be orthogonal and combinable.
- vs. Internal Reward methods (SEED-GRPO): Instead of feeding internal signals back into rewards (which can pollute training dynamics), T\(^2\)PO uses them for explicit behavioral control, offering a cleaner logical framework.
Rating
- Novelty: ⭐⭐⭐⭐ (Dual-scale hesitation perspective + unified signal)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Multi-task coverage + stability analysis)
- Writing Quality: ⭐⭐⭐⭐ (Clear logical flow and progression)
- Value: ⭐⭐⭐⭐ (Plug-and-play stabilization tool for agentic RL)