FuncBenchGen: A Contamination-Free Controllable Evaluation Framework for Reliable Benchmarking
Conference: ICLR 2026 arXiv: 2509.26553 Area: Video Understanding Keywords: Tool-augmented LLM, multi-step function calling, benchmark, data contamination, DAG traversal
TL;DR
This paper proposes FuncBenchGen, a framework that models multi-step function calling as DAG traversal, enabling contamination-free and finely controllable evaluation of LLM tool-use capabilities. The framework also reveals critical failure modes of reasoning models on long call chains and in the presence of connected irrelevant functions.
Background & Motivation
Existing benchmarks for tool-augmented language models (TaLMs) suffer from two core issues:
Data contamination risk: QA pairs in existing benchmarks (e.g., API-Bank, BFCLv4, ToolBench) may be leaked through pretraining data or test-time web search, rendering evaluation results unreliable.
Uncontrollable task complexity: Existing benchmarks lack fine-grained control over task difficulty, making it impossible to systematically analyze which factors most significantly affect model performance.
| Benchmark | Contamination-Free | Function Set Size Control | Dependency Depth Control | Distractor Type Control |
|---|---|---|---|---|
| API-Bank | ✗ | ✗ | ✗ | ✗ |
| BFCLv4 | ✗ | ✓ | ✗ | ✗ |
| ToolBench | ✗ | ✓ | ✗ | ✗ |
| FuncBenchGen | ✓ | ✓ | ✓ | ✓ |
Method
Overall Architecture
FuncBenchGen formalizes multi-step function calling as a Directed Acyclic Graph (DAG) traversal problem. Given a function set \(\mathcal{F}=\{f_1, f_2, \ldots, f_n\}\), an input variable set \(\mathcal{V}_{input}\), and a target variable \(v_T\), the LLM must determine the value of \(v_T\) by iteratively executing a sequence of function calls.
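As a concrete illustration, the formalization can be sketched in a few lines of Python: each function node consumes some variables and produces exactly one, and solving the task amounts to traversing the DAG by repeatedly calling any function whose inputs are already known. This is a hypothetical sketch, not the authors' code; `Func` and `solve` are names introduced here for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Func:
    name: str
    inputs: tuple[str, ...]  # variables this function consumes
    output: str              # the single variable it produces (one output per function)

def solve(funcs, known, target):
    """Greedy DAG traversal: repeatedly call any function whose inputs are all known."""
    known = dict(known)
    funcs = list(funcs)
    while target not in known:
        ready = [f for f in funcs
                 if f.output not in known and all(v in known for v in f.inputs)]
        if not ready:
            return None  # target unreachable from the known variables
        f = ready[0]
        # Stand-in computation; in FuncBenchGen each variable holds a fixed
        # three-digit integer and the executor returns its ground-truth value.
        known[f.output] = sum(known[v] for v in f.inputs)
        funcs.remove(f)
    return known[target]
```

With `funcs = [Func("f1", ("a",), "b"), Func("f2", ("b", "c"), "t")]` and `known = {"a": 1, "c": 2}`, two calls suffice to obtain the target `t`; the benchmark asks the LLM to discover this traversal itself from the function schemas.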
Key Designs
1. Graph structure generation: Accepts four control parameters:
   - \(n^{\text{core}}\): number of core nodes (functions required to solve the task)
   - \(d\): dependency depth
   - \(n^{\text{conn}}\): number of connected irrelevant nodes (CINs), which share type-compatible variables with core nodes
   - \(n^{\text{dis}}\): number of disconnected irrelevant nodes (DINs), which have no connections to core nodes
2. Function schema creation: Each DAG node is converted into a function definition comprising a randomly generated function name, typed input/output parameters, and a natural language description. Functions are linked via semantic type and subtype matching.
3. Deterministic execution: Each variable is assigned a three-digit random integer value. A function returns the correct output only when all input values are exactly correct; otherwise it returns a random incorrect value, simulating the silent failure behavior of real-world APIs.
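The deterministic-execution rule can be sketched as follows (a hypothetical illustration, not the authors' implementation; `make_executor` is a name introduced here): every variable has a fixed three-digit ground-truth value, and a call returns the correct output only if all supplied input values match exactly.

```python
import random

def make_executor(truth, seed=0):
    """truth: {variable_name: ground-truth three-digit integer}.
    Returns a call(output_var, supplied_inputs) function that mimics the
    paper's execution rule: correct output only on exactly correct inputs,
    otherwise a random wrong value (a silent failure, as with real APIs)."""
    rng = random.Random(seed)

    def call(output_var, supplied_inputs):
        # supplied_inputs: {var_name: value the model passed}
        if all(truth.get(v) == val for v, val in supplied_inputs.items()):
            return truth[output_var]
        wrong = rng.randrange(100, 1000)
        while wrong == truth[output_var]:
            wrong = rng.randrange(100, 1000)
        return wrong  # silently incorrect: no error is raised

    return call
```

Because failures are silent, a single stale or guessed input value propagates downstream undetected, which is exactly the state-tracking pressure the benchmark is designed to measure.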
Mitigation Strategy
To address the most prevalent failure mode (use of unknown/incorrect values), the paper proposes a simple variable value restatement strategy: upon each function return, the response includes not only the output value but also a list of all currently known variable values.
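The restatement strategy can be sketched in a few lines (hypothetical code; `respond_with_restatement` is a name introduced here): each tool response repeats every variable value known so far, adding no new information but offloading the model's state tracking onto the context.

```python
def respond_with_restatement(output_var, value, known):
    """Build a tool response that restates all currently known variable values.
    known: {var_name: value} accumulated from earlier calls."""
    known = {**known, output_var: value}
    restated = ", ".join(f"{k}={v}" for k, v in sorted(known.items()))
    return f"{output_var} = {value}. Known values so far: {restated}.", known
```

For example, after a call producing `v1 = 123` with `v0 = 7` already known, the response reads "v1 = 123. Known values so far: v0=7, v1=123."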
Key Experimental Results
Main Results: Success Rate Under Varying Core Node Counts
| Model | 5 Core Nodes | 10 Core Nodes | 20 Core Nodes |
|---|---|---|---|
| GPT-5 | 72.5% | 38.2% | 15.0% |
| Gemini-2.5-Pro | 46.5% | 14.4% | 6.0% |
| GPT-5-mini | 16.0% | 7.6% | 4.2% |
| Qwen3 | 11.0% | 8.2% | 3.8% |
| GPT-4.1 | 12.0% | 2.2% | 0.2% |
Failure Type Analysis
| Failure Type | GPT-5 | Gemini-2.5-Pro | Qwen3 | GPT-4.1 |
|---|---|---|---|---|
| Non-existent function | 0.0% | 2.4% | 0.0% | 0.0% |
| Wrong number of input arguments | 0.0% | 0.2% | 0.1% | 0.0% |
| Use of unknown values | 79.6% | 69.1% | 74.0% | 73.2% |
| Use of incorrect values | 20.4% | 28.3% | 25.8% | 26.8% |
Effect of Dependency Depth
- GPT-5 achieves close to 90% success rate at depth 1 (star structure), dropping to below 30% at depths 4–8.
- Path structures (depth 8–9) show marginal improvement over moderately branched structures (depth 5–7), suggesting that serialized call chains with fewer branches are easier to handle.
- Larger thinking budgets (medium vs. minimal) substantially improve performance in complex scenarios.
Key Findings
- Reasoning models substantially outperform general-purpose models: GPT-5 achieves 72.5% at 5 core nodes, while GPT-4.1 reaches only 12.0%.
- Performance degrades sharply with sequence length: GPT-5 drops from 72.5% (5 nodes) to 15.0% (20 nodes).
- Connected irrelevant nodes (CIN) are the most harmful: Shared type-compatible variables make it difficult for models to distinguish relevant from irrelevant functions.
- The mitigation strategy is highly effective: Variable restatement improves GPT-5's success rate from 62.5% to 81.3%.
- GPT-5 exhibits low call efficiency: Even on successful runs, it issues roughly 10% more function calls than strictly necessary.
- Sufficient reasoning budget is critical: Under minimal thinking budget, GPT-5's success rate falls below 20% in the presence of distractor functions.
Highlights & Insights
- Elegant formalization: Abstracting tool use as DAG traversal enables orthogonal decomposition of evaluation dimensions.
- Insightful failure analysis: The study reveals that the primary bottleneck across all models is state tracking rather than syntactic comprehension — 79.6% of GPT-5 errors stem from the use of unknown variable values.
- Simple yet effective mitigation: Merely restating known variable values (without providing new information) substantially improves performance, indicating that working memory is the core bottleneck in multi-step tool use.
- Warning for the MCP ecosystem: Even disconnected distractor functions severely degrade GPT-5 performance (<10%) when the function set grows to 40, suggesting current LLMs are not yet ready to handle large-scale MCP servers.
- Failure mode differences reveal model characteristics: When failing, GPT-5 tends to retry repeatedly (making more function calls), whereas Gemini-2.5-Flash tends to give up (making fewer calls).
Limitations & Future Work
- A gap exists between synthetic functions and real-world APIs, where function semantics are considerably more complex.
- Only DAG structures are considered; more complex control flows such as conditional logic and loops are not covered.
- Each function is fixed to a single output variable; multi-output functions are not supported.
- The capabilities of open-source small models on this task are not evaluated.
- Functions are connected via type matching, leaving the evaluation of natural language semantic reasoning underexplored.
- Model recovery and retry behavior following failed calls is not examined.
Rating ⭐⭐⭐⭐
This is a systematic and analytically rigorous evaluation framework. Its core contribution lies in revealing the state-tracking bottleneck in LLM multi-step tool use, offering important guidance for the design of agent systems. The DAG-based abstraction is elegant, and the mitigation strategy, though simple, yields deep insight. Limitations include a remaining gap between synthetic tasks and real-world scenarios, and a mismatch between the stated area of video understanding and the paper's actual focus on LLM agent evaluation.