Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments
Conference: ACL 2026
arXiv: 2508.08791
Code: https://github.com/bytedance/FTRL
Area: Reinforcement Learning / Tool Use
Keywords: Tool Calling, Reinforcement Learning, Automated Environment Construction, Verifiable Reward, LLM Training
TL;DR
This paper proposes FTRL, a framework that constructs stable and controllable tool-use training environments through a five-stage automated pipeline and designs a verifiable, F1-inspired reward that balances tool-call precision against task completion. Combined with policy-optimization RL algorithms (Reinforce++ and GRPO), FTRL improves 7B–14B models by more than 10 points on average across tool-use benchmarks, with the best-trained models surpassing even the strongest closed-source models.
Background & Motivation
Background: Tool-use capability is critical for LLMs to handle complex real-world tasks. Current approaches to improving tool use include fine-tuning open-source models on interaction trajectories generated by proprietary models, and RL-based methods that train models through environment interaction.
Limitations of Prior Work: RL-based tool-use training frameworks face two core limitations: (1) difficulty constructing stable training environments — frameworks relying on numerous online tools are susceptible to API rate limits and service interruptions, with high standardization and deployment costs; (2) lack of verifiable reward signals — the complexity of tool interactions and the diversity of valid trajectories typically necessitate LLM-based evaluation, introducing model bias and reducing training efficiency and algorithmic stability.
Key Challenge: Effective RL training for tool use requires two things at once: a stable, controllable environment and a reliable reward signal. Existing approaches fail to satisfy both.
Goal: (1) Automatically generate large quantities of high-quality tool-use training environments; (2) design a verifiable reward mechanism relying solely on environment feedback; (3) integrate seamlessly with standard RL algorithms for feedback-driven training.
Key Insight: Decompose tool environment construction into five automated stages (scenario decomposition → documentation generation → functionality consolidation → complexity scaling → local deployment), with all tools executed locally as code to eliminate external dependencies.
Core Idea: Automated construction of locally executable tool environments + an F1-inspired precision–completion balanced reward = stable, verifiable RL training for tool use.
Method
Overall Architecture
FTRL comprises two core components: (1) a five-stage automated environment construction pipeline that decomposes user inputs into sub-problems and generates corresponding tool sets, documentation, and locally executable implementations; and (2) a feedback-driven model training framework that iteratively improves tool-use capability via a verifiable reward mechanism and policy-optimization RL algorithms.
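To make the "locally executable implementations" concrete, here is a minimal, self-contained sketch of how a generated tool document might be backed by a plain Python function and dispatched by name. The tool, its document schema, and the dispatcher are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a locally deployed tool (stage (e) of the pipeline).
# The tool, document schema, and dispatcher below are assumptions for
# demonstration, not the paper's actual implementation.
import json

# A "tool document" as the pipeline might emit it.
CONVERT_CURRENCY_DOC = {
    "name": "convert_currency",
    "description": "Convert an amount between two currencies using a fixed rate table.",
    "parameters": {"amount": "number", "from_currency": "string", "to_currency": "string"},
}

# Local implementation mapped one-to-one to the document above.
_RATES = {("USD", "EUR"): 0.92, ("EUR", "USD"): 1.09}

def convert_currency(amount: float, from_currency: str, to_currency: str) -> float:
    if from_currency == to_currency:
        return amount
    return round(amount * _RATES[(from_currency, to_currency)], 4)

# Registry the environment uses to dispatch a model's tool call by name.
LOCAL_TOOLS = {"convert_currency": convert_currency}

def execute_tool_call(call_json: str) -> str:
    """Run one tool call emitted by the model and return the observation string."""
    call = json.loads(call_json)
    result = LOCAL_TOOLS[call["name"]](**call["arguments"])
    return json.dumps({"tool": call["name"], "result": result})

# Example observation fed back to the model after a call.
print(execute_tool_call(
    '{"name": "convert_currency", "arguments":'
    ' {"amount": 100, "from_currency": "USD", "to_currency": "EUR"}}'
))  # -> {"tool": "convert_currency", "result": 92.0}
```

Because every tool is just local code, rollouts never hit API rate limits or service interruptions, which is exactly the stability property the training framework relies on.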
Key Designs
- Five-Stage Automated Environment Construction Pipeline:
- Function: Automatically generate diverse and stable tool-use training environments.
- Mechanism: (a) Scenario Decomposition — defines four tool-use scenarios (single-hop, parallel single-hop, multi-hop, parallel multi-hop) covering different logical relationships among sub-problems; (b) Documentation Generation — generates a corresponding tool document for each sub-problem, establishing a one-to-one mapping; (c) Functionality Consolidation — merges functionally overlapping tools, reducing the count from \(n\) to \(m \leq n\); (d) Complexity Scaling — increases difficulty through functional generalization, parameter expansion, type generalization, and tool-set expansion; (e) Local Deployment — maps each tool document to a Python function executed locally to ensure stable feedback.
- Design Motivation: Scenario decomposition ensures diversity in training data; local deployment eliminates dependence on external APIs; complexity scaling enhances model generalization to complex tools.
- Verifiable Reward Mechanism:
- Function: Provide precise, model-bias-free reward signals for tool-use behavior.
- Mechanism: Inspired by the F1 score, the reward balances tool-call precision and task completion. Let \(p\) denote the number of tool calls, \(q\) the number of successfully resolved sub-problems, \(t\) the number of sub-problems left unsolved, and \(a\) the reference answer. For \(p>0\) the reward is \(R = \frac{2q}{p+1}\); an empty output is penalized with \(-0.5\), a format error with \(-0.3\), and a final answer matching \(a\) earns a bonus of \(\frac{1}{t+1}\) (see the sketch after this list).
- Design Motivation: Optimizing precision alone leads to incomplete task execution; optimizing completion alone encourages tool overuse. The F1-style reward achieves a balance between the two while relying solely on environment feedback without requiring external model evaluation.
- Policy-Optimization Training Procedure:
- Function: Optimize the model's tool-use policy using collected trajectories and reward signals.
- Mechanism: The model \(\mathcal{M}\) samples trajectories through multi-step interaction within the constructed environments, recording tool calls, intermediate results, and final answers at each step. Using the verifiable rewards, the policy is optimized with RL algorithms such as Reinforce++ or GRPO. Training trajectories are re-sampled at the beginning of each epoch to expand the exploration space.
- Design Motivation: No manual annotation of solution paths is required; the model autonomously discovers effective tool-use strategies through environment interaction.
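The sketch referenced under the Verifiable Reward Mechanism item: a minimal implementation of the F1-inspired reward using only the quantities defined above. How the \(p = 0\) case without an empty output is handled is not spelled out in the summary and is assumed to yield zero here.

```python
# Sketch of the F1-inspired verifiable reward: balances tool-call precision and
# task completion, with penalties for empty or ill-formatted outputs and a bonus
# for a correct final answer. Edge-case handling (p == 0) is an assumption.

def tool_use_reward(p: int, q: int, t: int,
                    empty_output: bool, format_error: bool,
                    answer_correct: bool) -> float:
    """p: tool calls made; q: sub-problems resolved; t: sub-problems left unsolved."""
    if empty_output:
        return -0.5                                # penalty for producing no output
    if format_error:
        return -0.3                                # penalty for malformed calls/answers
    reward = 2 * q / (p + 1) if p > 0 else 0.0     # F1-style precision/completion balance
    if answer_correct:
        reward += 1 / (t + 1)                      # bonus when the final answer is correct
    return reward

# Example: 3 calls resolve 2 of 3 sub-problems and the final answer is correct.
print(tool_use_reward(p=3, q=2, t=1, empty_output=False,
                      format_error=False, answer_correct=True))  # 2*2/4 + 1/2 = 1.5
```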
Loss & Training
Training is conducted using the VeRL framework with a learning rate of \(1\times10^{-6}\), batch size of 512, mini-batch size of 32, and 16 rollouts per update. The maximum response length is 1024 tokens in non-reasoning mode and 8192 in reasoning mode. Training proceeds for 3 epochs, with trajectories re-sampled using the current model at the start of each epoch. All experiments are conducted on 8 A100 GPUs.
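For quick reference, the reported hyperparameters gathered into a single structure; the key names are illustrative and do not follow VeRL's actual configuration schema.

```python
# Reported FTRL training hyperparameters in one place (key names are
# illustrative, not VeRL's real config schema).
TRAIN_CONFIG = {
    "learning_rate": 1e-6,
    "batch_size": 512,
    "mini_batch_size": 32,
    "rollouts_per_update": 16,
    "max_response_tokens": {"non_reasoning": 1024, "reasoning": 8192},
    "epochs": 3,            # trajectories re-sampled with the current model each epoch
    "gpus": "8x A100",
}
```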
Key Experimental Results
Main Results
Tool-Use Performance Across Model Scales (Solve-F1, Averaged Across Benchmarks)
| Model | Baseline Avg | FTRL-Reinforce++ | FTRL-GRPO |
|---|---|---|---|
| Qwen2.5-7B | 26.52 | 37.09 (+10.57) | 37.80 (+11.28) |
| Qwen2.5-14B | 34.33 | 44.25 (+9.92) | 41.23 (+6.90) |
| Qwen3-8B (Non-Reasoning) | 31.01 | 42.41 (+11.40) | 45.43 (+14.42) |
| Qwen3-14B (Non-Reasoning) | 33.34 | 44.14 (+10.80) | 44.90 (+11.56) |
| GPT-4o | 42.79 | — | — |
| Claude-4.0-Sonnet | 42.71 | — | — |
Ablation Study
Solve-F1 Comparison Under Different Reward Designs (Qwen2.5-7B)
| Reward Design | Effect | Note |
|---|---|---|
| \(R_{\text{Solve-P}} = q/p\) | High Solve-P but low Solve-R | Precision-only; incomplete task execution |
| \(R_{\text{Solve-R}} = q\) | High Solve-R but low Solve-P | Completion-only; tool overuse |
| \(R_{\text{Solve-PR}} = q^2/p\) | Unstable | Discrete reward distribution impedes training |
| \(R = 2q/(p+1)\) | Optimal balance | Balanced improvement in both precision and completion |
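To see why the balanced form wins, here is a toy comparison of the four reward shapes on two hypothetical trajectories that solve the same three sub-problems with different numbers of tool calls (all numbers are made up for illustration):

```python
# Toy comparison of the ablated reward shapes. "efficient" solves 3 sub-problems
# in 3 calls; "wasteful" solves the same 3 sub-problems in 9 calls.
for label, p, q in [("efficient", 3, 3), ("wasteful", 9, 3)]:
    solve_p  = q / p            # precision-only: pushes the model to make fewer calls
    solve_r  = q                # completion-only: blind to how many calls were made
    solve_pr = q ** 2 / p       # product form: coarse, discrete reward distribution
    f1_style = 2 * q / (p + 1)  # FTRL's balanced reward
    print(f"{label:9s}  Solve-P={solve_p:.2f}  Solve-R={solve_r}  "
          f"Solve-PR={solve_pr:.2f}  2q/(p+1)={f1_style:.2f}")
```

The completion-only reward assigns both trajectories the same score, so extra tool calls are never penalized, while \(2q/(p+1)\) still credits completion but prefers the more precise trajectory.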
Key Findings
- The best FTRL-trained open-source models (8B–14B) surpass the strongest closed-source models, including GPT-4o (42.79) and Claude-4.0-Sonnet (42.71), on average benchmark scores.
- Parameter-level analysis reveals that performance gains are primarily driven by updates to lower-layer MLP parameters (layers 0–2), suggesting the method enhances the model's ability to understand and represent contextual information rather than simply overfitting.
- Reasoning mode performs better in complex scenarios (multi-hop, parallel multi-hop) but degrades on simple scenarios (single-hop), indicating that current reasoning mechanisms are not yet optimized for tool-use tasks.
- After training, the models show no performance degradation on general-purpose benchmarks such as MMLU, BBH, and GSM8K.
Highlights & Insights
- The five-stage environment construction pipeline is elegantly designed — forming a closed loop from scenario decomposition to local deployment, ensuring both environmental diversity and feedback stability, and readily transferable to other RL training scenarios requiring stable interactive environments.
- The F1-style reward is concise and effective — a single formula simultaneously constrains precision and completion, avoiding the complexity of multi-objective optimization.
- The finding that "lower-layer MLP updates drive performance gains" is thought-provoking — improvements in tool-use capability are rooted in better contextual understanding rather than surface-level pattern matching.
Limitations & Future Work
- The method primarily improves tool-calling behavior without optimizing the model's underlying reasoning process; an alignment gap remains between current open-source reasoning modes and tool-use tasks.
- Due to resource constraints, validation is limited to 7B–14B models; the effectiveness on larger-scale models remains unknown.
- Environment construction relies on GPT-4o assistance for generation; future work could explore fully automated generation pipelines.
- Training data lacks multi-turn user interactions and noisy environments; nevertheless, the trained models generalize well to multi-turn benchmarks such as τ-bench.
Related Work & Insights
- vs. SFT-based methods (e.g., ToolLlama): SFT methods rely on proprietary model-generated trajectories for supervised fine-tuning, whereas FTRL autonomously learns through environment interaction via RL, eliminating dependence on proprietary models.
- vs. Existing RL tool-use methods: Existing methods depend on online APIs and LLM-as-judge rewards; FTRL's local environment and verifiable rewards address the challenges of stability and reward reliability.
- vs. Ye et al. (2024): Their controllable environment construction is limited to multi-hop scenarios and test data, whereas FTRL covers four scenario types and supports training.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of the five-stage environment construction pipeline and F1-style reward represents an engineering-oriented systemic innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers multiple model families, RL algorithms, benchmarks, reward ablations, and parameter-level analysis — highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with in-depth experimental analysis; technical details of the environment construction pipeline could be elaborated further.
- Value: ⭐⭐⭐⭐⭐ Provides a complete, deployable RL training framework for tool use; open-source models in the 8B–14B range surpassing GPT-4o demonstrates strong practical value.