Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments
Conference: ACL 2026
arXiv: 2508.08791
Code: https://github.com/bytedance/FTRL
Area: Reinforcement Learning / Tool Use
Keywords: Tool Calling, Reinforcement Learning, Automated Environment Construction, Verifiable Rewards, LLM Training
TL;DR
This paper proposes FTRL, a framework that builds stable, controllable tool-use training environments through a five-stage automated pipeline. It pairs a verifiable reward mechanism, which balances tool-calling precision against task completion, with preference optimization RL algorithms. The approach improves average tool-use performance by over 10% on 7B-14B models, and several of the trained models surpass closed-source models such as GPT-4o and Claude-4.0-Sonnet.
Background & Motivation
Background: The tool-use capability of LLMs is essential for executing complex real-world tasks. Current methods to enhance this capability primarily involve fine-tuning open-source models on interaction trajectories generated by proprietary models or allowing models to learn through interaction with environments using RL methods.
Limitations of Prior Work: RL-based tool-use training frameworks face two core limitations: (1) Stable training environments are hard to build: frameworks that rely on online tools are exposed to API rate limits and service interruptions, which makes standardized deployment costly; (2) Verifiable reward signals are lacking: because tool interactions are complex and many different trajectories can be valid, rewards are often assigned by a strong LLM judge, which introduces model bias and reduces training efficiency and algorithmic stability.
Key Challenge: Effective tool-use RL training needs to simultaneously satisfy "environment stability/controllability" and "reliable reward signals," but existing solutions fail to address both.
Goal: (1) Automate the generation of large-scale, high-quality tool-use training environments; (2) Design verifiable reward mechanisms relying solely on environmental feedback; (3) Seamlessly integrate with standard RL algorithms for feedback-driven training.
Key Insight: Deconstruct tool environment construction into five automated stages (scene decomposition \(\to\) document generation \(\to\) function integration \(\to\) complexity scaling \(\to\) local deployment). All tools are implemented locally in code to avoid external dependencies.
Core Idea: Automated construction of locally executable tool environments + F1-style precision-completion balanced rewards = Stable, verifiable tool-use RL training.
Method
Overall Architecture
FTRL consists of two core components: (1) A five-stage automated environment construction pipeline that decomposes user input into sub-problems and generates corresponding toolsets, documentation, and locally executable implementations; (2) A feedback-driven model training framework that iteratively improves the model's tool-use capability through verifiable reward mechanisms and preference optimization RL algorithms.
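As a rough illustration of what a "locally executable implementation" can look like, the sketch below pairs a tool document with a plain Python function and routes calls through a small registry, so feedback never depends on an external API. Everything in it is hypothetical: the `get_population` tool, the fictional city names and values, the `TOOL_DOCS` / `TOOL_IMPLS` names, the document schema, and the error format are assumptions made for illustration, not FTRL's actual code.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical tool document; the schema is an assumption, not FTRL's format.
TOOL_DOCS: Dict[str, Dict[str, Any]] = {
    "get_population": {
        "description": "Return the population of a city.",
        "parameters": {"city": {"type": "string", "required": True}},
    }
}


def get_population(city: str) -> int:
    """Local stand-in for an online API: answers come from a fixed toy table."""
    table = {"Avalon": 1000, "Zenith": 2500}  # fictional cities, made-up values
    if city not in table:
        raise KeyError(f"unknown city: {city}")
    return table[city]


# Registry mapping each documented tool name to its local implementation.
TOOL_IMPLS: Dict[str, Callable[..., Any]] = {"get_population": get_population}


def call_tool(name: str, arguments_json: str) -> Dict[str, Any]:
    """Run a tool call locally and return structured, deterministic feedback."""
    if name not in TOOL_IMPLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"bad arguments: {exc}"}
    if not isinstance(args, dict):
        return {"ok": False, "error": "arguments must be a JSON object"}
    # Check required parameters against the tool document before executing.
    params = TOOL_DOCS[name]["parameters"]
    missing = [k for k, spec in params.items() if spec.get("required") and k not in args]
    if missing:
        return {"ok": False, "error": f"missing parameters: {missing}"}
    try:
        return {"ok": True, "result": TOOL_IMPLS[name](**args)}
    except Exception as exc:  # tool-level failures become feedback, not crashes
        return {"ok": False, "error": str(exc)}
```

Because every call resolves against local code, the same call always yields the same feedback, which is the property the pipeline relies on for stable training.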
Key Designs
- Five-Stage Automated Environment Construction Pipeline:
  - Function: Automatically generates diverse and stable tool-use training environments.
  - Mechanism: (a) Scene Decomposition: defines four tool-use scenarios (single-hop, parallel single-hop, multi-hop, parallel multi-hop) that cover the different logical relations between sub-problems; (b) Document Generation: generates one tool document per sub-problem, in a one-to-one mapping; (c) Function Integration: merges tools with overlapping functions, reducing \(n\) tools to \(m \leq n\); (d) Complexity Scaling: increases difficulty through functional generalization, parameter expansion, type generalization, and toolset expansion; (e) Local Deployment: maps each tool document to a Python function that runs locally, so feedback stays stable.
  - Design Motivation: Scene decomposition ensures training-data diversity, local deployment removes the dependency on external APIs, and complexity scaling improves generalization to complex tools.
- Verifiable Reward Mechanism:
  - Function: Provides precise reward signals for tool-use behavior that do not carry the bias of an LLM judge.
  - Mechanism: The reward is inspired by the F1 score and balances tool-calling precision against task completion. Let \(p\) be the number of tool calls, \(q\) the number of sub-problems solved successfully, \(t\) the number of sub-problems still unsolved, and \(a\) the correct answer. When \(p > 0\), the reward is \(R = \frac{2q}{p+1}\). Empty outputs and formatting errors are penalized with \(-0.5\) and \(-0.3\), respectively. If the model's answer is correct (that is, it agrees with \(a\)), the reward is \(\frac{1}{t+1}\).
  - Design Motivation: Optimizing only for precision leads to incomplete tasks, while optimizing only for completion leads to tool abuse. The F1-style reward balances both and relies solely on environment feedback, with no external model evaluation.
- Preference Optimization Training Flow:
  - Function: Optimizes the model's tool-use policy using collected trajectories and reward signals.
  - Mechanism: Model \(\mathcal{M}\) samples interaction trajectories in the constructed environment, recording tool calls, intermediate results, and final answers. Combined with verifiable rewards, policy optimization is performed using preference optimization RL algorithms such as Reinforce++ or GRPO. Training trajectories are re-sampled each epoch to expand the exploration space.
  - Design Motivation: Eliminates the need for hand-annotated solution paths; the model autonomously discovers effective tool-use strategies through interaction.
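The verifiable reward described in the list above can be sketched as a small function. This is a minimal reading of the summary, not the paper's implementation: the `StepOutcome` container, its field names, the order in which the cases are checked, and the `0.0` fallback for a wrong answer are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    """Environment feedback for one model output (hypothetical container)."""
    is_empty: bool = False          # model produced no output
    has_format_error: bool = False  # output could not be parsed
    num_calls: int = 0              # p: tool calls made in this output
    num_solved: int = 0             # q: sub-problems those calls solved
    num_unsolved: int = 0           # t: sub-problems still unsolved
    answer_correct: bool = False    # final answer agrees with a


def step_reward(o: StepOutcome) -> float:
    """Reward from environment feedback only; case ordering is an assumption."""
    if o.is_empty:
        return -0.5
    if o.has_format_error:
        return -0.3
    if o.num_calls > 0:
        # F1-style term: rises with solved sub-problems, falls with extra calls.
        return 2 * o.num_solved / (o.num_calls + 1)
    if o.answer_correct:
        # A correct answer earns less when sub-problems were left unsolved.
        return 1 / (o.num_unsolved + 1)
    return 0.0  # assumption: the summary does not state the wrong-answer reward
```

Under this reading, one call that solves one sub-problem scores `2 * 1 / (1 + 1) = 1.0`, while three calls that solve one sub-problem score `2 * 1 / (3 + 1) = 0.5`, so redundant calls are penalized without any model-based judge.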
Loss & Training
Models are trained with the VeRL framework using a learning rate of \(1\times10^{-6}\), a batch size of 512, a mini-batch size of 32, and 16 rollouts per update. The maximum response length is 1024 in non-reasoning mode and 8192 in reasoning mode. Training runs for 3 epochs, with trajectories re-sampled at the start of each epoch, on 8 A100 GPUs.
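To make the role of the 16 rollouts concrete: GRPO is commonly described as scoring each rollout relative to the other rollouts sampled for the same prompt, by normalizing rewards with the group mean and standard deviation. The sketch below shows only that normalization step. The function name, the use of the population standard deviation, and the `eps` value are assumptions, and this is not VeRL's implementation.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize one prompt's rollout rewards by the group's mean and std.

    Assumption: population std plus a small eps; implementations differ here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example group of 16 rollouts: half earn reward 1.0, half earn 0.0.
example_rewards = [1.0] * 8 + [0.0] * 8
example_advantages = group_relative_advantages(example_rewards)
```

Rollouts that beat the group average receive positive advantages and the rest negative ones; if every rollout in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.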
Key Experimental Results
Main Results
Tool-use performance of different model scales (Solve-F1 / Avg across benchmarks)
| Model | Base Model Avg | FTRL-Reinforce++ | FTRL-GRPO |
|---|---|---|---|
| Qwen2.5-7B | 26.52 | 37.09 (+10.57) | 37.80 (+11.28) |
| Qwen2.5-14B | 34.33 | 44.25 (+9.92) | 41.23 (+6.90) |
| Qwen3-8B (Non-Reasoning) | 31.01 | 42.41 (+11.40) | 45.43 (+14.42) |
| Qwen3-14B (Non-Reasoning) | 33.34 | 44.14 (+10.80) | 44.90 (+11.56) |
| GPT-4o | 42.79 | — | — |
| Claude-4.0-Sonnet | 42.71 | — | — |
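The gains in the table can be cross-checked with a few lines of arithmetic. The snippet below only re-derives numbers from the rows above; the variable names are illustrative, and "Non-Reasoning" is dropped from the Qwen3 keys for brevity.

```python
# Scores copied from the table above.
BASELINE = {"Qwen2.5-7B": 26.52, "Qwen2.5-14B": 34.33, "Qwen3-8B": 31.01, "Qwen3-14B": 33.34}
TRAINED = {
    "Qwen2.5-7B": {"Reinforce++": 37.09, "GRPO": 37.80},
    "Qwen2.5-14B": {"Reinforce++": 44.25, "GRPO": 41.23},
    "Qwen3-8B": {"Reinforce++": 42.41, "GRPO": 45.43},
    "Qwen3-14B": {"Reinforce++": 44.14, "GRPO": 44.90},
}
CLOSED_SOURCE = {"GPT-4o": 42.79, "Claude-4.0-Sonnet": 42.71}

# Absolute gain of every trained configuration over its own base model.
gains = [
    score - BASELINE[model]
    for model, runs in TRAINED.items()
    for score in runs.values()
]
average_gain = sum(gains) / len(gains)

# Configurations whose score exceeds both closed-source models.
best_closed = max(CLOSED_SOURCE.values())
beats_closed = [
    (model, algo)
    for model, runs in TRAINED.items()
    for algo, score in runs.items()
    if score > best_closed
]
```

With these numbers the average gain is about 10.86 points, and four of the eight trained configurations exceed both closed-source scores: Qwen2.5-14B with Reinforce++, Qwen3-8B with GRPO, and Qwen3-14B with either algorithm.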
Ablation Study
Comparison of different reward mechanisms on Solve-F1 (Qwen2.5-7B)
| Reward Design | Effect | Description |
|---|---|---|
| \(R_{\text{Solve-P}} = q/p\) | High Solve-P, Low Solve-R | Optimizes only precision; incomplete tasks |
| \(R_{\text{Solve-R}} = q\) | High Solve-R, Low Solve-P | Optimizes only completion; tool abuse |
| \(R_{\text{Solve-PR}} = q^2/p\) | Unstable | Discrete reward distribution hinders training |
| \(R = 2q/(p+1)\) | Optimal Balance | Balanced improvement in precision and completion |
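The qualitative differences between these reward designs show up in simple arithmetic on the formulas. The sketch below evaluates the four expressions from the table on three hypothetical outcomes, written as (calls \(p\), solved \(q\)); it does not reproduce the ablation itself, and the function names and example outcomes are illustrative.

```python
def solve_p(p: int, q: int) -> float:
    """Precision-only reward q / p."""
    return q / p


def solve_r(p: int, q: int) -> float:
    """Completion-only reward q (ignores the number of calls)."""
    return float(q)


def solve_pr(p: int, q: int) -> float:
    """Product-style reward q^2 / p."""
    return q * q / p


def f1_style(p: int, q: int) -> float:
    """F1-style reward 2q / (p + 1)."""
    return 2 * q / (p + 1)


# Hypothetical outcomes: (1 call, 1 solved), (3 calls, 3 solved), (6 calls, 3 solved).
OUTCOMES = [(1, 1), (3, 3), (6, 3)]
REWARD_TABLE = {
    fn.__name__: [fn(p, q) for p, q in OUTCOMES]
    for fn in (solve_p, solve_r, solve_pr, f1_style)
}
```

The precision-only reward cannot distinguish (1, 1) from (3, 3), so it gives no incentive to solve more sub-problems. The completion-only reward cannot distinguish (3, 3) from (6, 3), so redundant calls cost nothing. The F1-style reward ranks the three outcomes differently: 1.0, 1.5, and roughly 0.86.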
Key Findings
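- Several FTRL-trained open-source models exceed the closed-source models GPT-4o (42.79) and Claude-4.0-Sonnet (42.71) in average score: Qwen2.5-14B with Reinforce++ (44.25), Qwen3-8B with GRPO (45.43), and Qwen3-14B with either algorithm (44.14 and 44.90).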
- Parameter-level analysis reveals that performance gains primarily stem from updates to MLP parameters in the lowest layers (layers 0-2), indicating that the method enhances the model's understanding and representation of contextual information rather than simply overfitting.
- Reasoning mode performs better in complex scenarios (multi-hop, parallel multi-hop) but shows degraded performance in simple scenarios (single-hop), suggesting that current reasoning mechanisms are not yet optimized for tool use.
- Post-training models show no performance loss on general capability benchmarks like MMLU, BBH, and GSM8K.
Highlights & Insights
- The five-stage environment construction pipeline is elegantly designed—forming a closed loop from scene decomposition to local deployment, ensuring both environmental diversity and feedback stability. This is transferable to other RL scenarios requiring stable interactions.
- The F1-style reward design is simple yet efficient—simultaneously constraining precision and completion with one formula, avoiding the complexity of multi-objective optimization.
- The discovery of "lower-layer MLP-driven improvement" is insightful—suggesting that tool-use capability improvements are rooted in better contextual understanding rather than surface-level pattern matching.
Limitations & Future Work
- The method primarily improves tool-calling behavior but does not optimize the model's underlying reasoning process; a gap remains between the reasoning mode of current open-source models and tool-use tasks.
- Due to resource constraints, verification was limited to 7B-14B models; the effect on larger-scale models remains unknown.
- Environment construction relies on GPT-4o assistance; future work could explore fully automated generation schemes.
- Training data lacks multi-turn user interactions and noisy environments, though the model remains effective on multi-turn benchmarks like \(\tau\)-bench, indicating strong generalization.
Related Work & Insights
- vs SFT-based methods (e.g., ToolLLaMA): SFT methods rely on proprietary models to generate trajectories for supervised fine-tuning. FTRL enables autonomous learning through RL interaction, removing dependence on proprietary model trajectories.
- vs existing RL tool-use methods: Existing methods rely on online APIs and LLM-as-judge rewards. FTRL's local environments and verifiable rewards solve the issues of stability and reward reliability.
- vs Ye et al. (2024): Their controllable environment construction is limited to multi-hop scenarios and test data. FTRL covers four scenarios and supports training.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of a five-stage environment pipeline and F1-style rewards is a solid engineering-oriented system innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive, covering multiple model families, RL algorithms, benchmarks, reward ablations, and parameter analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with deep experimental analysis, though technical details of environment construction could be more granular.
- Value: ⭐⭐⭐⭐⭐ Provides a complete, usable tool-use RL training framework; open-source 8B-14B models surpassing GPT-4o holds significant practical value.