FuncBenchGen: A Contamination-Free Controllable Evaluation Framework for Reliable Benchmarking

Conference: ICLR 2026 arXiv: 2509.26553 Area: Video Understanding Keywords: Tool-augmented LLM, multi-step function calling, benchmark, data contamination, DAG traversal

TL;DR

This paper proposes FuncBenchGen, a framework that models multi-step function calling as DAG traversal, enabling contamination-free and finely controllable evaluation of LLM tool-use capabilities. Experiments with the framework further reveal critical failure modes of reasoning models under long call chains and in the presence of connected irrelevant functions.

Background & Motivation

Existing benchmarks for tool-augmented language models (TaLMs) suffer from two core issues:

Data contamination risk: QA pairs in existing benchmarks (e.g., API-Bank, BFCLv4, ToolBench) may be leaked through pretraining data or test-time web search, rendering evaluation results unreliable.

Uncontrollable task complexity: Existing benchmarks lack fine-grained control over task difficulty, making it impossible to systematically analyze which factors most significantly affect model performance.

| Benchmark | Contamination-Free | Function Set Size Control | Dependency Depth Control | Distractor Type Control |
|---|---|---|---|---|
| API-Bank | ✗ | ✗ | ✗ | ✗ |
| BFCLv4 | ✗ | ✗ | ✗ | ✗ |
| ToolBench | ✗ | ✗ | ✗ | ✗ |
| FuncBenchGen | ✓ | ✓ | ✓ | ✓ |

Method

Overall Architecture

FuncBenchGen formalizes multi-step function calling as a Directed Acyclic Graph (DAG) traversal problem. Given a function set \(\mathcal{F}=\{f_1, f_2, \ldots, f_n\}\), an input variable set \(\mathcal{V}_{input}\), and a target variable \(v_T\), the LLM must determine the value of \(v_T\) by iteratively executing a sequence of function calls.
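As an illustration of this formalization (not the paper's evaluation harness, which queries an LLM at each step), a minimal oracle solver can resolve \(v_T\) by repeatedly invoking any function whose inputs are already known — a sketch with illustrative names:

```python
def solve(functions, known, target):
    """Greedy DAG-traversal sketch: `functions` maps a name to
    (input_vars, output_var, callable); `known` holds the input
    variable values; `target` is v_T. Repeatedly call any function
    whose inputs are all known until the target is resolved."""
    progressed = True
    while target not in known and progressed:
        progressed = False
        for name, (inputs, output, fn) in functions.items():
            if output not in known and all(v in known for v in inputs):
                known[output] = fn(**{v: known[v] for v in inputs})
                progressed = True
    return known.get(target)
```

The loop terminates once the target is known or no further function becomes callable, mirroring how a successful agent must order its calls along the dependency edges.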

Key Designs

1. Graph structure generation: Accepts four control parameters:
  • \(n^{\text{core}}\): number of core nodes (functions required to solve the task)
  • \(d\): dependency depth
  • \(n^{\text{conn}}\): number of connected irrelevant nodes (CIN), which share type-compatible variables with core nodes
  • \(n^{\text{dis}}\): number of disconnected irrelevant nodes (DIN), which have no connections to core nodes

2. Function schema creation: Each DAG node is converted into a function definition comprising a randomly generated function name, typed input/output parameters, and a natural language description. Functions are linked via semantic type and subtype matching.

3. Deterministic execution: Each variable is assigned a three-digit random integer value. A function returns the correct output only when all input values are exactly correct; otherwise it returns a random incorrect value, simulating the silent failure behavior of real-world APIs.
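The silent-failure semantics of the execution design can be sketched as follows; the helper name and signature are illustrative, not from the paper:

```python
import random

def make_function(input_vars, output_var, ground_truth, rng):
    """Build a callable with FuncBenchGen-style silent-failure semantics:
    the true output value is returned only when *every* input value is
    exactly correct; otherwise a random (wrong) three-digit integer is
    returned, with no error signal to the caller."""
    def call(**kwargs):
        if all(kwargs.get(v) == ground_truth[v] for v in input_vars):
            return ground_truth[output_var]
        # Silent failure: a plausible-looking but incorrect value.
        wrong = rng.randrange(100, 1000)
        while wrong == ground_truth[output_var]:
            wrong = rng.randrange(100, 1000)
        return wrong
    return call
```

Because a wrong value looks just as valid as a correct one, any upstream mistake propagates invisibly down the call chain — the property that makes state tracking the dominant failure mode.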

Mitigation Strategy

To address the most prevalent failure mode (use of unknown/incorrect values), the paper proposes a simple variable value restatement strategy: upon each function return, the response includes not only the output value but also a list of all currently known variable values.
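A minimal sketch of this restatement strategy, with an assumed response format (the paper specifies only that all known values are included):

```python
def respond_with_restatement(output_var, output_val, known_values):
    """Record the new output in the known-variable state, then echo the
    full state in the tool response, so the model does not have to track
    variable values across many turns of context."""
    known_values[output_var] = output_val
    restated = ", ".join(f"{k}={v}" for k, v in sorted(known_values.items()))
    return f"{output_var} = {output_val}. Known values so far: {restated}"
```

The response adds no new information; it only re-surfaces state the model has already seen, which is what makes its effectiveness diagnostic of a working-memory bottleneck.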

Key Experimental Results

Main Results: Success Rate Under Varying Core Node Counts

| Model | 5 Core Nodes | 10 Core Nodes | 20 Core Nodes |
|---|---|---|---|
| GPT-5 | 72.5% | 38.2% | 15.0% |
| Gemini-2.5-Pro | 46.5% | 14.4% | 6.0% |
| GPT-5-mini | 16.0% | 7.6% | 4.2% |
| Qwen3 | 11.0% | 8.2% | 3.8% |
| GPT-4.1 | 12.0% | 2.2% | 0.2% |

Failure Type Analysis

| Failure Type | GPT-5 | Gemini-2.5-Pro | Qwen3 | GPT-4.1 |
|---|---|---|---|---|
| Non-existent function | 0.0% | 2.4% | 0.0% | 0.0% |
| Wrong number of input arguments | 0.0% | 0.2% | 0.1% | 0.0% |
| Use of unknown values | 79.6% | 69.1% | 74.0% | 73.2% |
| Use of incorrect values | 20.4% | 28.3% | 25.8% | 26.8% |

Effect of Dependency Depth

  • GPT-5 achieves close to 90% success rate at depth 1 (star structure), dropping to below 30% at depths 4–8.
  • Path structures (depth 8–9) show marginal improvement over moderately branched structures (depth 5–7), suggesting that serialized call chains with fewer branches are easier to handle.
  • Larger thinking budgets (medium vs. minimal) substantially improve performance in complex scenarios.

Key Findings

  1. Reasoning models substantially outperform general-purpose models: GPT-5 achieves 72.5% at 5 core nodes, while GPT-4.1 reaches only 12.0%.
  2. Performance degrades sharply with sequence length: GPT-5 drops from 72.5% (5 nodes) to 15.0% (20 nodes).
  3. Connected irrelevant nodes (CIN) are the most harmful: Shared type-compatible variables make it difficult for models to distinguish relevant from irrelevant functions.
  4. The mitigation strategy is highly effective: Variable restatement improves GPT-5's success rate from 62.5% to 81.3%.
  5. GPT-5 exhibits low call efficiency: Even on successful runs, it makes roughly 10% more function calls than strictly necessary.
  6. Sufficient reasoning budget is critical: Under minimal thinking budget, GPT-5's success rate falls below 20% in the presence of distractor functions.

Highlights & Insights

  1. Elegant formalization: Abstracting tool use as DAG traversal enables orthogonal decomposition of evaluation dimensions.
  2. Insightful failure analysis: The study reveals that the primary bottleneck across all models is state tracking rather than syntactic comprehension — 79.6% of GPT-5 errors stem from the use of unknown variable values.
  3. Simple yet effective mitigation: Merely restating known variable values (without providing new information) substantially improves performance, indicating that working memory is the core bottleneck in multi-step tool use.
  4. Warning for the MCP ecosystem: Even disconnected distractor functions severely degrade GPT-5's performance (success rate below 10%) when the function set grows to 40, suggesting current LLMs are not yet ready to handle large-scale MCP servers.
  5. Failure mode differences reveal model characteristics: When failing, GPT-5 tends to retry repeatedly (making more function calls), whereas Gemini-2.5-Flash tends to give up (making fewer calls).

Limitations & Future Work

  1. A gap exists between synthetic functions and real-world APIs, where function semantics are considerably more complex.
  2. Only DAG structures are considered; more complex control flows such as conditional logic and loops are not covered.
  3. Each function is fixed to a single output variable; multi-output functions are not supported.
  4. The capabilities of open-source small models on this task are not evaluated.
  5. Functions are connected via type matching, leaving the evaluation of natural language semantic reasoning underexplored.
  6. Model recovery and retry behavior following failed calls is not examined.

Rating ⭐⭐⭐⭐

This is a systematic and analytically rigorous evaluation framework. Its core contribution lies in revealing the state-tracking bottleneck in LLM multi-step tool use, offering important guidance for the design of agent systems. The DAG-based abstraction is elegant, and the mitigation strategy, though simple, yields deep insight. Limitations include a remaining gap between synthetic tasks and real-world scenarios, and a mismatch between the stated area of video understanding and the paper's actual focus on LLM agent evaluation.