# AutoTool: Automatic Scaling of Tool-Use Capabilities in RL via Decoupled Entropy Constraints
Conference: ICLR 2026 | arXiv: 2603.13348 | Code: None | Area: Reinforcement Learning | Keywords: Tool Use, Reinforcement Learning, Test-Time Scaling, Entropy Constraint, Automatic Reasoning Scaling
## TL;DR
This paper proposes AutoTool, which addresses reasoning collapse in direct RL training for LLM tool use and the overthinking problem in distilled models via a decoupled adaptive entropy constraint strategy. AutoTool enables automatic switching between long and short reasoning modes based on problem difficulty, achieving a 9.8% accuracy improvement while reducing reasoning token overhead by ~81%.
## Background & Motivation
- Background: Integration of LLMs with external tools is a key capability for AI agents. RLVR (Reinforcement Learning with Verifiable Rewards) has successfully enabled test-time scaling on math and code tasks, but its effectiveness in tool use remains unvalidated.
- Limitations of Prior Work: (1) Direct RL training in tool-use settings leads to "reasoning collapse" — the model fails to sufficiently extend its reasoning length to solve complex problems; (2) Distilled models generate lengthy reasoning for all problems, wasting substantial tokens on simple queries.
- Key Challenge: While RL training on math tasks naturally increases reasoning length, it shortens reasoning length in tool-use tasks. The root cause is that low entropy causes the model to prematurely converge to short-reasoning strategies.
- Goal: Design a training method that automatically selects reasoning modes based on problem difficulty — extended thinking for complex problems and direct answers for simple ones.
- Key Insight: The paper identifies a strong positive correlation between low information entropy and reasoning collapse, while also finding that a naive entropy constraint is extremely sensitive to its coefficient.
- Core Idea: Decouple the policy losses for long and short reasoning, applying an adaptive entropy constraint to long reasoning to maintain exploration capability, while applying a fixed constraint to short reasoning to prevent over-exploration.
## Method
### Overall Architecture
A three-stage pipeline: (1) Data preparation — constructing the PubTool mixed dataset; (2) Warm-up SFT — mixing long and short reasoning data to make the model difficulty-aware; (3) Decoupled adaptive entropy constraint RL — GRPO combined with decoupled entropy regularization.
### Key Designs
- Warm-up SFT with Mixed Reasoning Data:
    - Function: Give the model an initial perception of problem difficulty.
    - Mechanism: Training data is annotated by sampling 8 times each from a non-thinking model (Qwen2.5-7B-Instruct) and a thinking model (Qwen3-32B). If the non-thinking model answers correctly, its short reasoning is used as the label; otherwise, the thinking model's long reasoning is used (a selection sketch appears after this list). An auto-thinking template is designed to support mode switching.
    - Design Motivation: Direct RL training leads to collapse; SFT warm-up is needed to establish initial long- and short-reasoning capabilities.
- Decoupled Adaptive Entropy Constraint:
    - Function: Differentially control the exploration capacity of long and short reasoning.
    - Mechanism: The entropy coefficient \(\beta_i\) in the policy loss \(\mathcal{L}_p\) is decoupled by reasoning mode: short reasoning uses a fixed \(\beta_s\), while long reasoning uses an adaptive \(\beta_l\). \(\beta_l\) is adjusted dynamically via the auxiliary loss \(\mathcal{L}_\beta^l = \frac{1}{N}\sum_i (1-m_i)\,\beta_l\,(H_i - H_l)\), where \(m_i\) marks short-reasoning samples and \(H_l\) is the target entropy; gradient descent on this loss increases \(\beta_l\) when \(H_i < H_l\), encouraging exploration (a minimal update sketch follows the list).
    - Design Motivation: Low entropy in direct RL leads to reasoning collapse, yet globally high entropy causes over-exploration on simple problems.
- Asymmetric Reward Design:
    - Function: Encourage short reasoning for simple problems and long reasoning for complex ones.
    - Mechanism: Correct + no-thinking: +1.0; correct + thinking: +0.5; incorrect + thinking: −0.5; incorrect + no-thinking: −1.0 (see the reward sketch after this list).
    - Design Motivation: Higher rewards for correct short reasoning encode the efficiency preference; heavier penalties for incorrect short reasoning encourage the model to think when needed.
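The SFT label-selection rule above is straightforward to express in code. Below is a hypothetical sketch; the helper functions (`sample_short`, `sample_long`, `is_correct`) and the drop-if-unsolved fallback are illustrative assumptions, not APIs or details from the paper.

```python
# Hypothetical sketch of the mixed-reasoning label selection: prefer the
# non-thinking model's short trace when it already solves the problem;
# otherwise fall back to the thinking model's long trace.

def annotate(question, gold, sample_short, sample_long, is_correct, k=8):
    """sample_short/sample_long: fn(question, k) -> k reasoning traces
    (e.g., from Qwen2.5-7B-Instruct and Qwen3-32B respectively).
    is_correct: fn(trace, gold) -> bool. All helpers are assumed."""
    short_traces = sample_short(question, k)
    ok = [t for t in short_traces if is_correct(t, gold)]
    if ok:
        return {"mode": "short", "label": ok[0]}  # simple problem: short label
    long_traces = sample_long(question, k)
    ok = [t for t in long_traces if is_correct(t, gold)]
    if ok:
        return {"mode": "long", "label": ok[0]}   # hard problem: long label
    return None  # unsolved by both models; dropping them is our assumption
```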
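The decoupled entropy constraint can likewise be sketched. The snippet below is a minimal PyTorch illustration, assuming per-sample mean token entropies `H`, a mode mask `m` (1 = short, 0 = long, matching the \((1-m_i)\) factor), and a log-space parameterization of \(\beta_l\) to keep it positive — that last choice is our assumption, not necessarily the paper's. The hyperparameter values are illustrative.

```python
import torch

beta_s, H_l, lr_beta = 1e-3, 1.5, 1e-4          # illustrative values
log_beta_l = torch.zeros(1, requires_grad=True)  # beta_l = exp(log_beta_l) > 0
opt_beta = torch.optim.SGD([log_beta_l], lr=lr_beta)

def entropy_bonus(H, m):
    """Decoupled entropy term added to the policy objective:
    fixed beta_s for short reasoning, adaptive beta_l for long."""
    beta_l = log_beta_l.exp().detach()           # constant inside L_p
    beta_i = m * beta_s + (1.0 - m) * beta_l     # per-sample coefficient
    return (beta_i * H).mean()

def update_beta_l(H, m):
    """One gradient step on L_beta^l = (1/N) sum_i (1-m_i) * beta_l * (H_i - H_l).
    When long-reasoning entropy falls below target (H_i < H_l), the gradient
    is negative, so descent raises beta_l and strengthens exploration."""
    loss_beta = ((1.0 - m) * log_beta_l.exp() * (H.detach() - H_l)).mean()
    opt_beta.zero_grad()
    loss_beta.backward()
    opt_beta.step()
```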
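The asymmetric reward reduces to a four-way case split; a minimal sketch:

```python
def reward(correct: bool, used_thinking: bool) -> float:
    """Asymmetric reward: correct short answers earn the most,
    incorrect short answers are penalized hardest."""
    if correct:
        return 0.5 if used_thinking else 1.0
    return -0.5 if used_thinking else -1.0

assert reward(True, False) == 1.0    # correct + no-thinking
assert reward(False, False) == -1.0  # incorrect + no-thinking
```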
### Loss & Training
- Base model: Qwen2.5-7B-Instruct
- RL algorithm: GRPO
- Data: PubTool (8.2k SFT + 7k RL), sourced from ToolACE + xLAM + Hermes
- Data quality optimization: Removal of trivially easy or excessively hard samples, plus filtering based on reward variance across multiple training rounds (a variance-filtering sketch follows this list).
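A hypothetical sketch of the reward-variance filtering, under the assumption that prompts whose group rewards never vary (always solved or never solved) are dropped because zero-variance groups contribute no GRPO advantage signal; the threshold and data layout are illustrative, not from the paper.

```python
import statistics

def keep_sample(reward_history, min_var=1e-6):
    """reward_history: rewards for one prompt across rollouts/rounds.
    Keep only prompts whose rewards actually vary."""
    return statistics.pvariance(reward_history) > min_var

dataset = [
    {"prompt": "trivial query", "rewards": [1.0, 1.0, 1.0, 1.0]},    # dropped
    {"prompt": "useful query", "rewards": [1.0, -0.5, 0.5, -1.0]},   # kept
]
filtered = [d for d in dataset if keep_sample(d["rewards"])]
```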
## Key Experimental Results
### Main Results
| Model | BFCL Overall | Non-Live | Live | Multi-Turn |
|---|---|---|---|---|
| Qwen2.5-7B (Base) | 53.69 | 86.46 | 67.44 | 7.62 |
| PubTool-SFT | 58.17 | 88.98 | 77.28 | 9.68 |
| PubTool-Distilled | 60.30 | 87.73 | 78.64 | 15.65 |
| AutoTool-7B | 70.12 | 89.76 | 80.22 | 38.18 |
| GPT-4o | 70.42 | 87.67 | 79.88 | 43.00 |
### Ablation Study
| Configuration | Overall | Change |
|---|---|---|
| Full method | 70.12 | — |
| w/o data refine | 63.69 | −6.43 |
| w/o decouple | 64.23 | −5.89 |
| w/o adapt coeff | 67.78 | −2.34 |
### Key Findings
- The thinking rate reaches 45% in Multi-Turn settings and 0% in Non-Live settings, demonstrating that the model has learned to automatically judge problem difficulty.
- Reasoning trajectories for complex problems are extended by 5×, while simple problems remain concise.
- Data quality optimization is the most critical component (a 6.43-point drop in Overall when removed).
- AutoTool-7B approaches GPT-4o on BFCL (70.12 vs. 70.42).
## Highlights & Insights
- This work is the first to identify and address the "reasoning collapse" phenomenon in RL training for tool use.
- The collapse mechanism is analyzed through the lens of information entropy, showing that collapse correlates strongly with low policy entropy rather than with the difficulty distribution of the training data.
- The asymmetric reward design elegantly encodes efficiency preference.
- The 7B model outperforms most SFT/RLVR models of comparable scale and approaches frontier model performance.
## Limitations & Future Work
- Validation is limited to tool-use scenarios and has not been extended to other agent tasks.
- The warm-up SFT stage relies on an external reasoning model (Qwen3-32B) to generate long reasoning data.
- The target entropy value for the adaptive entropy constraint still requires manual pre-specification.
- Data sources are limited to mixtures of public datasets; effectiveness on domain-specific tools has not been validated.
## Related Work & Insights
- vs. DeepSeek-R1: R1 successfully scales reasoning on math/code tasks but faces collapse in tool-use settings.
- vs. Thinkless: Thinkless also performs reasoning mode switching, but relies on SFT distillation rather than RL.
- vs. AdaCtrl: AdaCtrl performs adaptive reasoning control; AutoTool achieves better automation through decoupled entropy constraints.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ The discovery of reasoning collapse and the decoupled entropy constraint solution represent significant contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three benchmarks, multiple baselines, and detailed ablation studies.
- Writing Quality: ⭐⭐⭐⭐ The logical chain of preliminary analysis → findings → method is clearly articulated.
- Value: ⭐⭐⭐⭐⭐ Directly informative for LLM agent training practices.