Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments
Conference: ACL 2026
arXiv: 2508.08791
Code: https://github.com/bytedance/FTRL
Area: Reinforcement Learning / Tool Use
Keywords: Tool Calling, Reinforcement Learning, Automated Environment Construction, Verifiable Reward, LLM Training
TL;DR
This paper proposes FTRL, a framework that constructs stable and controllable tool-use training environments through a five-stage automated pipeline and designs a verifiable, F1-inspired reward that balances tool-call precision against task completion. Combined with policy-optimization RL algorithms (Reinforce++ and GRPO), FTRL improves 7B–14B models by more than 10 points on average across tool-use benchmarks, with the best-trained models surpassing even the strongest closed-source models.
Background & Motivation
Background: Tool-use capability is critical for LLMs to handle complex real-world tasks. Current approaches to improving tool use include fine-tuning open-source models on interaction trajectories generated by proprietary models, and RL-based methods that train models through environment interaction.
Limitations of Prior Work: RL-based tool-use training frameworks face two core limitations: (1) difficulty constructing stable training environments — frameworks relying on numerous online tools are susceptible to API rate limits and service interruptions, with high standardization and deployment costs; (2) lack of verifiable reward signals — the complexity of tool interactions and the diversity of valid trajectories typically necessitate LLM-based evaluation, introducing model bias and reducing training efficiency and algorithmic stability.
Key Challenge: Effective RL training for tool use requires two things at once: a stable, controllable environment and a reliable reward signal. Existing approaches fail to satisfy both.
Goal: (1) Automatically generate large quantities of high-quality tool-use training environments; (2) design a verifiable reward mechanism relying solely on environment feedback; (3) integrate seamlessly with standard RL algorithms for feedback-driven training.
Key Insight: Decompose tool environment construction into five automated stages (scenario decomposition → documentation generation → functionality consolidation → complexity scaling → local deployment), with all tools executed locally as code to eliminate external dependencies.
Core Idea: Automated construction of locally executable tool environments + an F1-inspired precision–completion balanced reward = stable, verifiable RL training for tool use.
Method
Overall Architecture
FTRL comprises two core components: (1) a five-stage automated environment construction pipeline that decomposes user inputs into sub-problems and generates corresponding tool sets, documentation, and locally executable implementations; and (2) a feedback-driven model training framework that iteratively improves tool-use capability via a verifiable reward mechanism and policy-optimization RL algorithms.
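To make the "locally executable implementations" concrete, here is a minimal, self-contained sketch of how a generated tool document might be backed by a plain Python function and dispatched by name. The tool, its document schema, and the dispatcher are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a locally deployed tool (stage (e) of the pipeline).
# The tool, document schema, and dispatcher below are assumptions for
# demonstration, not the paper's actual implementation.
import json

# A "tool document" as the pipeline might emit it.
CONVERT_CURRENCY_DOC = {
    "name": "convert_currency",
    "description": "Convert an amount between two currencies using a fixed rate table.",
    "parameters": {"amount": "number", "from_currency": "string", "to_currency": "string"},
}

# Local implementation mapped one-to-one to the document above.
_RATES = {("USD", "EUR"): 0.92, ("EUR", "USD"): 1.09}

def convert_currency(amount: float, from_currency: str, to_currency: str) -> float:
    if from_currency == to_currency:
        return amount
    return round(amount * _RATES[(from_currency, to_currency)], 4)

# Registry the environment uses to dispatch a model's tool call by name.
LOCAL_TOOLS = {"convert_currency": convert_currency}

def execute_tool_call(call_json: str) -> str:
    """Run one tool call emitted by the model and return the observation string."""
    call = json.loads(call_json)
    result = LOCAL_TOOLS[call["name"]](**call["arguments"])
    return json.dumps({"tool": call["name"], "result": result})

# Example observation fed back to the model after a call.
print(execute_tool_call(
    '{"name": "convert_currency", "arguments":'
    ' {"amount": 100, "from_currency": "USD", "to_currency": "EUR"}}'
))  # -> {"tool": "convert_currency", "result": 92.0}
```

Because every tool is just local code, rollouts never hit API rate limits or service interruptions, which is exactly the stability property the training framework relies on.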
Key Designs
- Five-Stage Automated Environment Construction Pipeline:
- Function: Automatically generate diverse and stable tool-use training environments.
- Mechanism: (a) Scenario Decomposition — defines four tool-use scenarios (single-hop, parallel single-hop, multi-hop, parallel multi-hop) covering different logical relationships among sub-problems; (b) Documentation Generation — generates a corresponding tool document for each sub-problem, establishing a one-to-one mapping; (c) Functionality Consolidation — merges functionally overlapping tools, reducing the count from \(n\) to \(m \leq n\); (d) Complexity Scaling — increases difficulty through functional generalization, parameter expansion, type generalization, and tool-set expansion; (e) Local Deployment — maps each tool document to a Python function executed locally to ensure stable feedback.
- Design Motivation: Scenario decomposition ensures diversity in training data; local deployment eliminates dependence on external APIs; complexity scaling enhances model generalization to complex tools.
- Verifiable Reward Mechanism:
- Function: Provide precise, model-bias-free reward signals for tool-use behavior.
- Mechanism: Inspired by the F1 score, the reward balances tool-call precision and task completion. Let \(p\) denote the number of tool calls, \(q\) the number of successfully resolved sub-problems, \(t\) the number of sub-problems left unsolved, and \(a\) the reference answer. For \(p>0\) the reward is \(R = \frac{2q}{p+1}\); an empty output is penalized with \(-0.5\), a format error with \(-0.3\), and a final answer matching \(a\) earns a bonus of \(\frac{1}{t+1}\) (see the sketch after this list).
- Design Motivation: Optimizing precision alone leads to incomplete task execution; optimizing completion alone encourages tool overuse. The F1-style reward achieves a balance between the two while relying solely on environment feedback without requiring external model evaluation.
- Policy-Optimization Training Procedure:
- Function: Optimize the model's tool-use policy using collected trajectories and reward signals.
- Mechanism: The model \(\mathcal{M}\) samples trajectories through multi-step interaction within the constructed environments, recording tool calls, intermediate results, and final answers at each step. Using the verifiable rewards, the policy is optimized with RL algorithms such as Reinforce++ or GRPO. Training trajectories are re-sampled at the beginning of each epoch to expand the exploration space.
- Design Motivation: No manual annotation of solution paths is required; the model autonomously discovers effective tool-use strategies through environment interaction.
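The sketch referenced under the Verifiable Reward Mechanism item: a minimal implementation of the F1-inspired reward using only the quantities defined above. How the \(p = 0\) case without an empty output is handled is not spelled out in the summary and is assumed to yield zero here.

```python
# Sketch of the F1-inspired verifiable reward: balances tool-call precision and
# task completion, with penalties for empty or ill-formatted outputs and a bonus
# for a correct final answer. Edge-case handling (p == 0) is an assumption.

def tool_use_reward(p: int, q: int, t: int,
                    empty_output: bool, format_error: bool,
                    answer_correct: bool) -> float:
    """p: tool calls made; q: sub-problems resolved; t: sub-problems left unsolved."""
    if empty_output:
        return -0.5                                # penalty for producing no output
    if format_error:
        return -0.3                                # penalty for malformed calls/answers
    reward = 2 * q / (p + 1) if p > 0 else 0.0     # F1-style precision/completion balance
    if answer_correct:
        reward += 1 / (t + 1)                      # bonus when the final answer is correct
    return reward

# Example: 3 calls resolve 2 of 3 sub-problems and the final answer is correct.
print(tool_use_reward(p=3, q=2, t=1, empty_output=False,
                      format_error=False, answer_correct=True))  # 2*2/4 + 1/2 = 1.5
```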
Loss & Training
Training is conducted using the VeRL framework with a learning rate of \(1\times10^{-6}\), batch size of 512, mini-batch size of 32, and 16 rollouts per update. The maximum response length is 1024 tokens in non-reasoning mode and 8192 in reasoning mode. Training proceeds for 3 epochs, with trajectories re-sampled using the current model at the start of each epoch. All experiments are conducted on 8 A100 GPUs.
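For quick reference, the reported hyperparameters gathered into a single structure; the key names are illustrative and do not follow VeRL's actual configuration schema.

```python
# Reported FTRL training hyperparameters in one place (key names are
# illustrative, not VeRL's real config schema).
TRAIN_CONFIG = {
    "learning_rate": 1e-6,
    "batch_size": 512,
    "mini_batch_size": 32,
    "rollouts_per_update": 16,
    "max_response_tokens": {"non_reasoning": 1024, "reasoning": 8192},
    "epochs": 3,            # trajectories re-sampled with the current model each epoch
    "gpus": "8x A100",
}
```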
Key Experimental Results
Main Results
Tool-Use Performance Across Model Scales (Solve-F1, Averaged Across Benchmarks)
| Model | Baseline Avg | FTRL-Reinforce++ | FTRL-GRPO |
|---|---|---|---|
| Qwen2.5-7B | 26.52 | 37.09 (+10.57) | 37.80 (+11.28) |
| Qwen2.5-14B | 34.33 | 44.25 (+9.92) | 41.23 (+6.90) |
| Qwen3-8B (Non-Reasoning) | 31.01 | 42.41 (+11.40) | 45.43 (+14.42) |
| Qwen3-14B (Non-Reasoning) | 33.34 | 44.14 (+10.80) | 44.90 (+11.56) |
| GPT-4o | 42.79 | — | — |
| Claude-4.0-Sonnet | 42.71 | — | — |
Ablation Study
Solve-F1 Comparison Under Different Reward Designs (Qwen2.5-7B)
| Reward Design | Effect | Note |
|---|---|---|
| \(R_{\text{Solve-P}} = q/p\) | High Solve-P but low Solve-R | Precision-only; incomplete task execution |
| \(R_{\text{Solve-R}} = q\) | High Solve-R but low Solve-P | Completion-only; tool overuse |
| \(R_{\text{Solve-PR}} = q^2/p\) | Unstable | Discrete reward distribution impedes training |
| \(R = 2q/(p+1)\) | Optimal balance | Balanced improvement in both precision and completion |
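To see why the balanced form wins, here is a toy comparison of the four reward shapes on two hypothetical trajectories that solve the same three sub-problems with different numbers of tool calls (all numbers are made up for illustration):

```python
# Toy comparison of the ablated reward shapes. "efficient" solves 3 sub-problems
# in 3 calls; "wasteful" solves the same 3 sub-problems in 9 calls.
for label, p, q in [("efficient", 3, 3), ("wasteful", 9, 3)]:
    solve_p  = q / p            # precision-only: pushes the model to make fewer calls
    solve_r  = q                # completion-only: blind to how many calls were made
    solve_pr = q ** 2 / p       # product form: coarse, discrete reward distribution
    f1_style = 2 * q / (p + 1)  # FTRL's balanced reward
    print(f"{label:9s}  Solve-P={solve_p:.2f}  Solve-R={solve_r}  "
          f"Solve-PR={solve_pr:.2f}  2q/(p+1)={f1_style:.2f}")
```

The completion-only reward assigns both trajectories the same score, so extra tool calls are never penalized, while \(2q/(p+1)\) still credits completion but prefers the more precise trajectory.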
Key Findings
- The best FTRL-trained open-source models (8B–14B) surpass the strongest closed-source models, including GPT-4o (42.79) and Claude-4.0-Sonnet (42.71), on average benchmark scores.
- Parameter-level analysis reveals that performance gains are primarily driven by updates to lower-layer MLP parameters (layers 0–2), suggesting the method enhances the model's ability to understand and represent contextual information rather than simply overfitting.
- Reasoning mode performs better in complex scenarios (multi-hop, parallel multi-hop) but degrades on simple scenarios (single-hop), indicating that current reasoning mechanisms are not yet optimized for tool-use tasks.
- After training, the models show no performance degradation on general-purpose benchmarks such as MMLU, BBH, and GSM8K.
Highlights & Insights
- The five-stage environment construction pipeline is elegantly designed — forming a closed loop from scenario decomposition to local deployment, ensuring both environmental diversity and feedback stability, and readily transferable to other RL training scenarios requiring stable interactive environments.
- The F1-style reward is concise and effective — a single formula simultaneously constrains precision and completion, avoiding the complexity of multi-objective optimization.
- The finding that "lower-layer MLP updates drive performance gains" is thought-provoking — improvements in tool-use capability are rooted in better contextual understanding rather than surface-level pattern matching.
Limitations & Future Work
- The method primarily improves tool-calling behavior without optimizing the model's underlying reasoning process; an alignment gap remains between current open-source reasoning modes and tool-use tasks.
- Due to resource constraints, validation is limited to 7B–14B models; the effectiveness on larger-scale models remains unknown.
- Environment construction relies on GPT-4o assistance for generation; future work could explore fully automated generation pipelines.
- Training data lacks multi-turn user interactions and noisy environments; nevertheless, the trained models generalize well to multi-turn benchmarks such as τ-bench.
Related Work & Insights
- vs. SFT-based methods (e.g., ToolLlama): SFT methods rely on proprietary model-generated trajectories for supervised fine-tuning, whereas FTRL autonomously learns through environment interaction via RL, eliminating dependence on proprietary models.
- vs. Existing RL tool-use methods: Existing methods depend on online APIs and LLM-as-judge rewards; FTRL's local environment and verifiable rewards address the challenges of stability and reward reliability.
- vs. Ye et al. (2024): Their controllable environment construction is limited to multi-hop scenarios and test data, whereas FTRL covers four scenario types and supports training.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of the five-stage environment construction pipeline and F1-style reward represents an engineering-oriented systemic innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers multiple model families, RL algorithms, benchmarks, reward ablations, and parameter-level analysis — highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with in-depth experimental analysis; technical details of the environment construction pipeline could be elaborated further.
- Value: ⭐⭐⭐⭐⭐ Provides a complete, deployable RL training framework for tool use; open-source models in the 8B–14B range surpassing GPT-4o demonstrates strong practical value.