ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"
Conference: ACL2026
arXiv: 2508.04086
Code: https://github.com/zhongyi-zhou/toolgrad
Area: llm_agent / Tool-use Data Generation
Keywords: Tool Calling, Synthetic Data, Textual Gradients, Answer-first, API Workflow
TL;DR
ToolGrad reverses the tool-use data generation process from "query-first with DFS tool-chain search" to "generating executable tool chains first, then back-inferring user queries." By employing an API selection loop similar to textual gradients to construct ToolGrad-500, the framework achieves a 99.8% pass rate. Small models like Gemma-3 trained on this data outperform various strong closed-source models in single-turn tool calling.
Background & Motivation
Background: Tool calling allows LLMs to access search, databases, code execution, and various APIs, providing a critical path to reducing hallucinations, enhancing factuality, and executing complex tasks. The key to training such models is not just the API list, but a large number of "user request - tool calling chain - final answer" supervised samples.
Limitations of Prior Work: Mainstream synthesis schemes typically have an LLM generate a user query from a set of APIs, then use an agent to find a feasible tool chain via DFS or ReAct-style exploration. This query-first process is expensive and has a high failure rate, and failed samples waste a substantial number of tool calls. Worse, even when DFS finds a successful path, low-quality or incorrect tool steps taken during exploration may remain in the trajectory and contaminate the model when that trajectory is used as "ground truth" for training.
Key Challenge: Real user problems are naturally fuzzy, while tool chains are concrete and verifiable. Searching for an answer from a fuzzy query requires expensive exploration; however, if an executable tool chain is already available, back-inferring a query that can be solved by that chain is much easier. The problem lies in how to directly generate complex and effective tool chains from an 8k-scale API database.
Goal: The authors aim to design a data generation framework with a high pass rate, low tool calling cost, and the ability to produce complex multi-API workflows. They seek to verify whether small models trained on this inexpensive synthetic data can acquire authentic tool-calling capabilities on ToolBench and BFCL.
Key Insight: The paper draws on the "textual gradient" concept from TextGrad but changes the object being optimized from a prompt to a data sample. Instead of having a critic write natural-language suggestions at each step, the LLM selects the most valuable API from the candidate API execution reports. This discrete selection is regarded as the "gradient" of the data generation process.
Core Idea: Construct a successful tool answer first, then generate the corresponding user query. Tool chain construction is completed through a four-step iteration of API proposal, execution, selection, and workflow update, avoiding the large-scale failed exploration inherent in query-first searching.
Method
Overall Architecture
The data samples generated by ToolGrad are triplets \((q, \mathcal{W}, r)\): \(q\) is the user query, \(\mathcal{W}\) is a workflow composed of multiple API chains, and \(r\) is the final answer given to the user based on the workflow. Unlike ReAct-style step-by-step calling, the reasoning model trained here predicts the complete set of tool calls in a single output, which is why the data needs structured API workflows.
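To make the triplet concrete, here is a minimal sketch of how a \((q, \mathcal{W}, r)\) sample could be represented. The class and field names (`ApiCall`, `Sample`, `append_call`) and the example API names are hypothetical and chosen for illustration; the source does not specify a storage schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ApiCall:
    """One executed API call (hypothetical schema, not from the paper)."""
    name: str
    arguments: dict
    result: str = ""


@dataclass
class Sample:
    """A (q, W, r) triplet: query, workflow of API chains, final response."""
    query: str = ""
    workflow: List[List[ApiCall]] = field(default_factory=list)
    response: str = ""

    def append_call(self, call: ApiCall, chain_index: Optional[int] = None) -> None:
        # Start a new chain when no valid chain index is given,
        # otherwise extend the chosen chain.
        if chain_index is None or chain_index >= len(self.workflow):
            self.workflow.append([call])
        else:
            self.workflow[chain_index].append(call)

    def num_tools(self) -> int:
        # Total number of API calls across all chains.
        return sum(len(chain) for chain in self.workflow)


sample = Sample()
sample.append_call(ApiCall("search_flights", {"to": "Tokyo"}))
sample.append_call(ApiCall("get_weather", {"city": "Tokyo"}), chain_index=0)
sample.append_call(ApiCall("convert_currency", {"src": "USD", "dst": "JPY"}))
print(sample.num_tools(), len(sample.workflow))
```

Keeping the workflow as a list of chains mirrors the description above, where a new API is appended to one specific chain rather than to a single flat sequence.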
The generation process starts from the current workflow and draws a random API mini-batch in each round. The API Proposer first selects a few potentially useful APIs, together with usage instructions, from the mini-batch; several API Executors then actually call these APIs in parallel and produce execution reports; the API Selector compares the reports and chooses the API most worth adding to the workflow, along with the chain it should be appended to; the Workflow Updater writes this API into the workflow and asks the LLM to generate a new user query and final answer based on the updated workflow. After several iterations, an answer-first sample is complete.
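The loop described above can be sketched as follows. This is an illustrative skeleton under strong assumptions: every LLM or tool call is replaced by a deterministic stand-in, all function names are hypothetical, and the defaults (`iterations=10`, `bs=50`, `m=3`) follow the values this summary reports for ToolGrad-500. It is not the authors' implementation.

```python
import random


def propose(batch, workflow, m=3):
    """Stand-in for the LLM API Proposer: keep at most m APIs not yet used."""
    used = {name for chain in workflow for name in chain}
    return [api for api in batch if api not in used][:m]


def execute(api):
    """Stand-in for an API Executor; a real one would actually call the tool."""
    ok = not api.endswith("_broken")
    return {"api": api, "success": ok}


def select(reports, workflow):
    """Stand-in for the API Selector: first successful API, appended to the last chain."""
    for report in reports:
        if report["success"]:
            chain_index = len(workflow) - 1 if workflow else None
            return report["api"], chain_index
    return None


def update(workflow, api, chain_index):
    """Deterministic append, then placeholder query/response regeneration."""
    if chain_index is None:
        workflow.append([api])
    else:
        workflow[chain_index].append(api)
    names = [name for chain in workflow for name in chain]
    query = "Please help with a task that needs: " + ", ".join(names)
    response = "Answer composed from: " + ", ".join(names)
    return query, response


def generate_sample(api_pool, iterations=10, bs=50, m=3, seed=0):
    rng = random.Random(seed)
    workflow, query, response = [], "", ""
    for _ in range(iterations):
        batch = rng.sample(api_pool, min(bs, len(api_pool)))
        candidates = propose(batch, workflow, m)
        reports = [execute(api) for api in candidates]  # run sequentially here
        choice = select(reports, workflow)
        if choice is None:
            continue  # no usable API this round; the workflow is unchanged
        query, response = update(workflow, *choice)
    return query, workflow, response


pool = [f"api_{i}" for i in range(200)]
q, w, r = generate_sample(pool)
print(len(w), len(w[0]))
```

Because the query and response are regenerated only after a successful API has been appended, every intermediate state of the sketch is a consistent triplet, which is the property the answer-first design relies on.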
Key Designs
- Answer-first Toolchain Generation:
  - Function: Ensure the tool calling chain is executable and successful first, then generate the user request that can be answered by that chain.
  - Mechanism: ToolGrad does not start from fuzzy queries to search for answers; instead, it uses valid API calls as generation anchors. Every time an API is added to the workflow, the system updates the corresponding user query and response to maintain triplet consistency.
  - Design Motivation: Tool chains are verifiable structured objects, whereas queries are natural language descriptions. Back-inferring a query from a tool chain is easier than searching for a tool chain from a query, and it prevents "unsolvable queries" and failed tool steps from entering the training set at the source.
- Four-module Textual-gradient Loop:
  - Function: Progressively expand complex workflows within large-scale API databases while controlling the tool call cost per round.
  - Mechanism: The API Proposer uses a standard LLM to propose at most \(m=3\) candidates from an API batch of size \(bs=50\); the API Executor uses an LLM agent supporting tool use to actually execute candidate APIs and return success/failure along with call history; the API Selector reads execution reports to choose the most valuable API and its position in the chain; the Workflow Updater deterministically appends the API and lets the LLM generate a new query/response.
  - Design Motivation: Since tool execution is the truly expensive part, the proposer is used first to filter out most irrelevant APIs. The discrete choice of the Selector is equivalent to telling the system "in which API direction to optimize the data sample," which serves as the textual gradient in ToolGrad.
- Negative Tool Sampling and Single-turn Function Call Formatting:
  - Function: Enable training samples to simulate real-world deployment scenarios where "visible APIs outnumber required APIs" and adapt to BFCL-style single-turn tool calling training.
  - Mechanism: Given a positive API in the workflow, the authors sample additional negative APIs based on embedding similarity, forcing the model to face a top-\(p\) set of candidate tools rather than just seeing the correct one. In the generation configuration, each sample runs for 10 iterations, the number of negative tools is set to \(p=10\), and gemini-1.5-flash-lite is used to generate 500 samples with different seeds, forming ToolGrad-500.
  - Design Motivation: If only correct tools are provided during training, the model cannot learn tool selection; providing the full 8k APIs is unrealistic. Similar negative samples provide a harder training environment closer to RAG retrieval-based tool selection.
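The negative tool sampling described above can be illustrated with a small similarity-based selector. This is a sketch under assumptions: the toy two-dimensional embeddings and API names are invented, and taking the \(p\) nearest non-positive APIs deterministically is one plausible reading of "sample based on embedding similarity", not a confirmed detail of the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sample_negatives(positives, embeddings, p=10):
    """Return the p APIs most similar to any positive API, excluding positives."""
    scores = {}
    for name, vec in embeddings.items():
        if name in positives:
            continue
        scores[name] = max(cosine(vec, embeddings[pos]) for pos in positives)
    # Sort by descending similarity; break ties by name for determinism.
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    return ranked[:p]


# Toy embeddings (hypothetical): weather-like APIs cluster together.
embeddings = {
    "get_weather": [1.0, 0.0],
    "get_forecast": [0.9, 0.1],
    "get_air_quality": [0.8, 0.3],
    "convert_currency": [0.0, 1.0],
    "stock_price": [0.1, 0.9],
}
negatives = sample_negatives({"get_weather"}, embeddings, p=2)
print(negatives)
```

Choosing near neighbours rather than random APIs is what makes the distractors hard: the model has to tell apart tools whose descriptions look alike, as it would after a retrieval step.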
Loss & Training
ToolGrad is a data generation framework and does not train the generator itself. After generating ToolGrad-500, the authors use supervised fine-tuning to train Gemma-3 1B, 4B, and 12B models to output Python-style tool use given OpenAI-style tool definitions. Comparative data includes ToolBench-generated data, and baseline models include the Gemini-1.5, Claude-3.5, and GPT-4o series closed-source models, as well as tool-calling models like ToolACE and Hammer. Evaluation is primarily conducted on ToolBench-I3 single-turn tool use and BFCL v1/v2 single-turn tool calling.
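As an illustration of the input/output format mentioned above, the sketch below turns one sample into a supervised fine-tuning pair: OpenAI-style tool definitions go into the prompt, and the target is a list of Python-style calls. The prompt template, the helper names, and the example tool are assumptions for illustration; the exact template used by the authors is not given in this summary.

```python
import json


def render_call(name, arguments):
    """Render one tool call as Python-style source, e.g. f(a=1, b='x')."""
    args = ", ".join(f"{key}={value!r}" for key, value in arguments.items())
    return f"{name}({args})"


def build_sft_example(query, tools, calls):
    """Build a prompt/target pair (hypothetical template)."""
    prompt = "Tools:\n" + json.dumps(tools, indent=2) + "\n\nUser: " + query
    target = "[" + ", ".join(render_call(n, a) for n, a in calls) + "]"
    return {"prompt": prompt, "target": target}


# One tool in the OpenAI chat-completions "tools" layout.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather forecast for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "days": {"type": "integer"},
                },
                "required": ["city"],
            },
        },
    }
]
example = build_sft_example(
    "What will the weather be in Paris over the next 3 days?",
    tools,
    [("get_weather", {"city": "Paris", "days": 3})],
)
print(example["target"])
```

In a real pipeline the `tools` list would also contain the sampled negative tools, so the target calls use only a subset of what the prompt exposes.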
Key Experimental Results
Main Results
The following table compares the data generation efficiency of query-first DFS and ToolGrad. ToolGrad not only succeeds more often but also generates more complex tool chains.
| Data Generation Method | Pass rate ↑ | Avg. GT Tools ↑ | LLM cost ↓ | Tool cost ↓ |
|---|---|---|---|---|
| DFS / ToolBench Style | 63.8% | 2.1 | 64.5 | 34.3 |
| ToolGrad | 99.8% | 3.4 | 63.9 | 20.0 |
This table provides the most convincing evidence: ToolGrad's LLM call cost is essentially unchanged (64.5 vs. 63.9), while the tool call cost drops from 34.3 to 20.0, the pass rate rises from 63.8% to 99.8%, and the average number of ground-truth tools per sample increases from 2.1 to 3.4. The authors also checked the failure logs and found that the failures among the 500 generations were API execution failures that produced empty samples, a failure rate of approximately 0.2%, consistent with the 99.8% pass rate.
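The figures in the table can be turned into relative savings with a few lines of arithmetic. The sketch below uses only the table values; the "cost per passing sample" line additionally assumes that the reported costs are averaged over all generation attempts, which this summary does not confirm.

```python
# Values copied from the table above.
dfs = {"pass_rate": 0.638, "llm_cost": 64.5, "tool_cost": 34.3}
toolgrad = {"pass_rate": 0.998, "llm_cost": 63.9, "tool_cost": 20.0}

# Relative reduction in tool-call cost and LLM-call cost.
tool_cost_reduction = (dfs["tool_cost"] - toolgrad["tool_cost"]) / dfs["tool_cost"]
llm_cost_reduction = (dfs["llm_cost"] - toolgrad["llm_cost"]) / dfs["llm_cost"]

# Assumption: costs are per attempt, so dividing by the pass rate
# estimates the tool cost spent per sample that actually passes.
dfs_tool_per_pass = dfs["tool_cost"] / dfs["pass_rate"]
toolgrad_tool_per_pass = toolgrad["tool_cost"] / toolgrad["pass_rate"]

print(f"tool cost reduction: {tool_cost_reduction:.1%}")
print(f"LLM cost reduction:  {llm_cost_reduction:.1%}")
print(f"tool cost per passing sample: {dfs_tool_per_pass:.1f} vs {toolgrad_tool_per_pass:.1f}")
```

Under that assumption, the gap widens once failed attempts are accounted for, because the query-first pipeline pays for roughly one failed attempt in every three.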
Ablation Study
The authors further compared the absolute judge scores of small models trained on ToolGrad-500 against closed-source models on ToolBench single-turn tool use.
| Model / Data | Score | Remarks |
|---|---|---|
| ToolGrad-Gemma-3-1B | 14.1 | 1B model already exceeds Gemini-1.5-flash |
| ToolGrad-Gemma-3-4B | 17.6 | Second highest in the table |
| ToolGrad-Gemma-3-12B | 19.6 | Highest in the table |
| Gemini-1.5-flash | 6.9 | Teacher for ToolGrad data generation |
| Gemini-1.5-pro | 11.4 | Closed-source strong model baseline |
| Claude-3.5-sonnet | 15.4 | Closed-source strong model baseline |
| GPT-4o-mini | 14.7 | Closed-source strong model baseline |
Training comparisons within the same set of Gemma models also support data effectiveness: ToolGrad improves Gemma-3-1B from 1.0 to 14.1, 4B from 11.2 to 17.6, and 12B from 9.8 to 19.6. The paper also reports that ToolGrad models bring overall score improvements of +8.1, +8.0, and +6.3 to 1B, 4B, and 12B respectively on BFCL, with larger improvements in the non-live synthetic subset and gains of +1.93, +4.74, and +4.22 in the live subset.
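As a quick consistency check on the ToolBench numbers quoted above, the following sketch computes the per-size gain from ToolGrad training and lists which closed-source baselines each fine-tuned model exceeds. All scores are copied from this section; the variable names are illustrative.

```python
# Base and ToolGrad-tuned Gemma-3 scores, as quoted in the text.
base = {"1B": 1.0, "4B": 11.2, "12B": 9.8}
tuned = {"1B": 14.1, "4B": 17.6, "12B": 19.6}

# Closed-source baseline scores from the table.
closed = {
    "Gemini-1.5-flash": 6.9,
    "Gemini-1.5-pro": 11.4,
    "Claude-3.5-sonnet": 15.4,
    "GPT-4o-mini": 14.7,
}

# Absolute score gain per model size, rounded to one decimal place.
gains = {size: round(tuned[size] - base[size], 1) for size in base}

# Baselines strictly outperformed by each tuned model.
beaten = {
    size: sorted(name for name, score in closed.items() if tuned[size] > score)
    for size in tuned
}

for size in tuned:
    print(size, gains[size], beaten[size])
```

The 1B model clears both Gemini baselines but not GPT-4o-mini or Claude-3.5-sonnet, while the 4B and 12B models exceed all four listed baselines.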
Key Findings
- The answer-first process significantly reduces contamination from unsolvable queries and failed trajectories. Compared to query-first, ToolGrad is anchored by executable tool chains during generation, making samples naturally easier to verify.
- The fact that small models can exceed the teacher model is a strong signal. Although Gemini-1.5-flash participated in data generation, the trained ToolGrad-Gemma-3-12B outperforms it on ToolBench and BFCL, indicating that the data structure itself provides additional supervisory value.
- Scaling does not lead to infinite improvement. Pass rates tend to saturate around 8-12 iterations; increasing the sample size from 100 to 500/1k shows benefits, but performance decreases beyond that. The authors suggest the core reason is the lack of cross-sample memory, leading to repetitive generated tool usage patterns.
Highlights & Insights
- The "answer first, query second" reversal is highly intuitive from an engineering perspective. In tool-calling scenarios, executable chains are easier to verify than natural language queries. Ensuring the validity of the answer before back-inferring the question converts the hardest search problem into a more controllable generation problem.
- The adaptation of textual gradients in ToolGrad is noteworthy. Instead of having the LLM write vague suggestions, it has the LLM select an API within an execution report. This discrete action is both interpretable and directly changes the direction of data generation.
- The paper evaluates data generation efficiency alongside downstream model capability, avoiding merely proving that "generation is cheap." Crucially, small models trained on cheaper data demonstrate generalization on OOD toolsets.
Limitations & Future Work
- The current training format is biased towards single-turn, one-time output of complete tool calls and does not directly cover multi-step interactions like ReAct/DFS or agent frameworks with intermediate reasoning.
- The paper only verifies the effect of SFT using ToolGrad data and does not explore the value of these high pass-rate tool chain data in RL or preference optimization.
- Generated queries are still back-inferred by an LLM, which may not match real user expression habits regarding language style, ambiguity, or missing context.
- The scaling plateau is a clear bottleneck. The lack of global memory results in different samples repeatedly exploring similar API combinations. Future work could incorporate shared memory, coverage constraints, or DPP-style diversity selection to improve data scaling efficiency.
Related Work & Insights
- vs ToolBench / ToolLLM: ToolBench generates queries first then uses DFS to search for tool chains, which is comprehensive but expensive and prone to failure. ToolGrad generates tool chains first, sacrificing some query-first naturalness for high solvability and pass rates.
- vs TextGrad: TextGrad uses natural language feedback to optimize prompts. ToolGrad borrows the "textual gradient" concept, but the gradient is manifested as the API Selector's discrete choice based on execution reports, used to optimize data samples rather than prompts.
- vs ToolACE / Hammer: ToolACE and Hammer lean more towards training or constructing robust tool-calling models. ToolGrad focuses on the data generation mechanism and can serve as a post-training data source for these models, particularly suitable for quickly bootstrapping tool capabilities in small models.
Rating
- Novelty: ⭐⭐⭐⭐☆ The answer-first reversal is simple but effective, and applying textual gradients to API selection is distinctive.
- Experimental Thoroughness: ⭐⭐⭐⭐☆ Data efficiency, ToolBench, BFCL, and scaling studies are all covered, though multi-turn agent and RL usage experiments are still missing.
- Writing Quality: ⭐⭐⭐⭐☆ The motivation is clear, and the framework diagrams and four-module descriptions are easy to understand.
- Value: ⭐⭐⭐⭐⭐ Tool-use data generation is a core bottleneck in agent training; this paper provides a low-cost, reproducible, and strong baseline.