DICE-Bench: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues¶

Conference: ACL 2025
arXiv: 2506.22853
Code: snuhcc/DICE-Bench
Area: LLM / NLP
Keywords: function-calling, benchmark, multi-party dialogue, multi-round, tool-use evaluation

TL;DR¶

This work proposes DICE-Bench, a function-calling benchmark targeting multi-round, multi-party dialogue scenarios. It comprises 1,607 high-quality dialogue instances and the DICE-Score metric, which quantifies information dispersion, revealing the limitations of current LLMs in tool invocation during complex dialogues.

Background & Motivation¶

Limitations of Prior Work: Existing function-calling benchmarks (e.g., APIBench and ToolLLM) mostly target single-turn interactions in which all API parameters reside in a single user instruction. They neglect the complexity of real-world group chats, where information is scattered across multi-turn, multi-party dialogues.
Real-world Needs: In practical applications, virtual assistants need to track multi-turn, multi-party conversations in group chat scenarios, aggregating information from dispersed contexts to complete API calls (e.g., booking hotels and flights based on group chat discussions).
Evaluation Gap: There is a lack of a metric to quantitatively measure the degree of dispersion of tool-related information in a dialogue, making it difficult to systematically evaluate the function-calling capabilities of LLMs under realistic conditions.
Ours: This work constructs the DICE-Bench benchmark and proposes the DICE-Score metric, generating multi-turn, multi-party dialogue data via multi-agent simulation to systematically evaluate the tool-calling capabilities of 19 LLMs.

Method¶

Overall Architecture¶

The data construction of DICE-Bench consists of three stages: (1) Tool Graph Construction: Collecting 124 tool nodes and 270 directed edges from TaskBench and ToolEyes to model inter-tool dependencies; (2) Scenario Configuration: Sampling tool chains via DFS and configuring dialogue types (persuasive negotiation, advisory information seeking, or eristic debate), the number of participants (2–4), and independent personas; (3) Dialogue Generation: Simulating dialogues using a multi-agent system, with an orchestrator controlling the turn-taking order to iteratively generate \(N\)-turn dialogues.
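
The chain-sampling step in stage (2) can be sketched as a randomized depth-first search over a directed tool graph. This is only an illustration under stated assumptions: the four tool names, the `TOOL_GRAPH` dictionary, and the `sample_tool_chain` helper are hypothetical and are not taken from the DICE-Bench code or from TaskBench/ToolEyes.

```python
import random

# Hypothetical toy tool graph (not the 124-node graph from the paper):
# an edge A -> B means tool B can consume the output of tool A.
TOOL_GRAPH = {
    "search_flights": ["book_flight"],
    "book_flight": ["book_hotel", "send_itinerary"],
    "book_hotel": ["send_itinerary"],
    "send_itinerary": [],
}


def sample_tool_chain(graph, start, length, rng):
    """Depth-first search for a simple path of `length` tools starting at `start`.

    Neighbours are explored in random order so repeated calls can yield
    different chains. Returns None when no path of that length exists.
    """
    def dfs(path):
        if len(path) == length:
            return path
        # Skip tools already on the path so the chain never revisits a tool.
        neighbours = [t for t in graph.get(path[-1], []) if t not in path]
        rng.shuffle(neighbours)
        for nxt in neighbours:
            found = dfs(path + [nxt])
            if found is not None:
                return found
        return None  # dead end: backtrack to the caller

    return dfs([start])


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
chain = sample_tool_chain(TOOL_GRAPH, "search_flights", 3, rng)
print(chain)
```

Each sampled chain would then supply one tool per round, so a chain of length 4 corresponds to a four-round dialogue.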

Key Designs¶

DICE-Score Metric: Quantifies how dispersed the tool-related information is across a dialogue. The formula is \(\text{DICE}(S,T) = \frac{\min(|S_{\neq 0}|, T) \cdot \sqrt{|S| \cdot T}}{\sum_{i \in S} \ln(1 + \alpha \times S_i)}\), where \(S\) is the vector of per-turn counts of tool-related information, \(|S|\) is its length (the number of turns), \(|S_{\neq 0}|\) is the number of turns with a nonzero count, \(T\) is the total number of distinct tool items to be identified, and \(\alpha = e^2\) controls the repetition penalty. A higher score indicates more dispersed information and a more difficult task.
Three-Stage Validation Pipeline: Stage 1 uses G-Eval (GPT-4o) to automatically filter low-quality dialogues based on six-dimensional criteria; Stage 2 applies rule-based filtering (such as refusal response detection); Stage 3 involves human annotators scoring across three dimensions—dialogue quality, functional integration, and real-world applicability—comprising 15 sub-criteria to eliminate low-scoring instances.
Tool Graph Dependency Modeling: Directed edges between tools explicitly encode cross-round dependencies, in which one round's tool output serves as a parameter of the next round's call. This keeps the multi-round scenarios realistic.
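
The DICE-Score formula above can be evaluated directly. The following is a minimal sketch, assuming \(|S|\) is the length of the count vector and \(|S_{\neq 0}|\) is the number of turns with a nonzero count; the function name `dice_score` and the two example count vectors are illustrative choices, not values from the paper.

```python
import math

ALPHA = math.e ** 2  # repetition-penalty constant alpha = e^2 from the formula


def dice_score(counts, num_items, alpha=ALPHA):
    """Evaluate DICE(S, T) for one dialogue.

    counts[i]  -- number of tool-related items mentioned in turn i (vector S)
    num_items  -- number of distinct tool items to identify (T)
    """
    nonzero_turns = sum(1 for c in counts if c > 0)  # |S_{!=0}|
    # Turns with a zero count contribute ln(1 + 0) = 0 to the denominator.
    denom = sum(math.log(1 + alpha * c) for c in counts)
    if denom == 0:
        raise ValueError("dialogue contains no tool-related information")
    return min(nonzero_turns, num_items) * math.sqrt(len(counts) * num_items) / denom


# Same three items in a six-turn dialogue: all in one turn vs. spread over three.
concentrated = dice_score([3, 0, 0, 0, 0, 0], num_items=3)
dispersed = dice_score([1, 0, 1, 0, 1, 0], num_items=3)
print(f"concentrated={concentrated:.3f}  dispersed={dispersed:.3f}")
```

In this toy comparison the dispersed dialogue receives the higher score, which matches the stated interpretation that higher DICE-Score means more scattered information.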

Loss & Training¶

DICE-Bench is an evaluation benchmark, so no training objective is involved. Exact Match (EM) is the primary evaluation metric: a prediction counts as correct only if the function name and all parameter values are predicted accurately.
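
A minimal sketch of an exact-match check is given below. It assumes the model emits one JSON object per call with hypothetical keys `"function"` and `"parameters"`; the actual DICE-Bench output schema and scoring script may differ.

```python
import json


def parse_call(raw):
    """Strictly parse a model output string as a JSON object; None if malformed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return call if isinstance(call, dict) else None


def exact_match(raw_prediction, gold):
    """Return 1 only if the function name and every parameter value match, else 0."""
    pred = parse_call(raw_prediction)
    if pred is None:
        return 0  # unparseable output is scored as a miss
    same_name = pred.get("function") == gold["function"]
    same_params = pred.get("parameters") == gold["parameters"]
    return int(same_name and same_params)


# Hypothetical gold call, for illustration only.
gold = {"function": "book_hotel", "parameters": {"city": "Seoul", "nights": 2}}

ok = exact_match('{"function": "book_hotel", "parameters": {"nights": 2, "city": "Seoul"}}', gold)
wrong_value = exact_match('{"function": "book_hotel", "parameters": {"city": "Seoul", "nights": 3}}', gold)
bad_format = exact_match("{'function': 'book_hotel'}", gold)  # single quotes are not valid JSON
print(ok, wrong_value, bad_format)
```

Under strict parsing of this kind, a semantically correct answer in a non-JSON format scores zero, and a single wrong parameter value fails the whole call.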

Experiments¶

Main Results¶

Model	Round 1 EM (%)	Round 2 EM (%)	Round 3 EM (%)	Round 4 EM (%)	Average EM (%)
GPT-4o	74.12	61.00	61.65	59.18	63.99
Gemini 2 Flash	74.47	59.45	59.40	58.73	63.01
Phi4-15B	71.29	57.06	58.02	56.44	60.70
GLM4-9B-Chat	58.24	47.55	47.24	46.03	49.76
Qwen2.5-32B	67.76	56.76	57.23	55.92	59.42
ToolAce-8B	2.47	0.66	0.33	0.51	0.99
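
As a sanity check on the table, the Average column can be recomputed as the arithmetic mean of the four rounds, together with the Round 1 to Round 4 drop for each model. The snippet only restates the numbers listed above; the helper name `summarize` is an illustrative choice.

```python
# Per-round scores and reported averages copied from the table above.
ROUND_EM = {
    "GPT-4o": [74.12, 61.00, 61.65, 59.18],
    "Gemini 2 Flash": [74.47, 59.45, 59.40, 58.73],
    "Phi4-15B": [71.29, 57.06, 58.02, 56.44],
    "GLM4-9B-Chat": [58.24, 47.55, 47.24, 46.03],
    "Qwen2.5-32B": [67.76, 56.76, 57.23, 55.92],
    "ToolAce-8B": [2.47, 0.66, 0.33, 0.51],
}
REPORTED_AVG = {
    "GPT-4o": 63.99,
    "Gemini 2 Flash": 63.01,
    "Phi4-15B": 60.70,
    "GLM4-9B-Chat": 49.76,
    "Qwen2.5-32B": 59.42,
    "ToolAce-8B": 0.99,
}


def summarize(scores):
    """Return (mean over rounds, Round 1 minus Round 4) for one model."""
    average = sum(scores) / len(scores)
    drop = scores[0] - scores[-1]
    return average, drop


for model, scores in ROUND_EM.items():
    average, drop = summarize(scores)
    print(f"{model:15s} avg={average:6.2f} reported={REPORTED_AVG[model]:6.2f} R1-R4 drop={drop:5.2f}")
```

For the six models listed here, the recomputed means agree with the reported averages to within rounding.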

Ablation Study¶

Analysis Dimension	Key Findings
DICE-Score vs. Performance	Pearson correlation coefficient \(r \approx -0.984\); higher DICE-Score correlates with worse performance
Alignment with Human Evaluation	Human accuracy decreases from 80.5% in Round 1 to 49.3% in Round 4, showing a strong negative correlation with DICE-Score (increasing from 1.42 to 5.36)
Impact of Dialogue Types	Eristic dialogues yield significantly lower EM due to frequent stance switching
Tool-Specific Models	Specialized models such as ToolAce-8B and CALM-8B underperform general-purpose dialogue models by a wide margin

Key Findings¶

The performance of all models drops significantly from Round 1 to Round 4. Average performance in Round 4 is approximately 15 percentage points lower than in Round 1, indicating that aggregating information across rounds is a primary bottleneck for current LLMs.
The open-source 15B-parameter Phi4 comes within about 3.3 points of the closed-source GPT-4o (average EM of 60.70 vs. 63.99); the 128K context window of Qwen2.5 benefits long-dialogue scenarios.
Models specifically fine-tuned for single-turn function calling perform poorly in multi-party dialogue scenarios (ToolAce-8B scoring only around 1%), suggesting that single-turn training data does not transfer well to multi-turn, multi-party contexts.

Highlights & Insights¶

The first function-calling benchmark covering both multi-turn and multi-party dialogues, filling a critical gap in existing evaluations.
The proposed DICE-Score metric correlates strongly and negatively with performance (\(r \approx -0.984\)), which supports its use as an interpretable measure of task difficulty.
Rigorous three-stage filtering (automated + rule-based + human), selecting 1,607 high-quality instances from 1,800 candidates.

Limitations & Future Work¶

The dialogue length in Round 4 may exceed the 4K-token limit of some models, limiting the ability to evaluate all systems.
Some models, despite producing semantically correct content, are penalized due to output formats failing to strictly conform to JSON specifications.
The multi-agent orchestrator (GPT-4o) exhibits limited capabilities in dynamically assigning turn-taking orders, often defaulting to repetitive, structured turns.
The benchmark only covers daily-life scenarios, lacking professional domain-specific tools in fields like law, finance, and medicine.

Related Work¶

Function-Calling Benchmarks: APIBench, ToolAlpaca, ToolLLM, API-Bank, MetaTool, TaskBench, etc., all focus on single-turn or single-user scenarios.
Interactive Dialogue Systems: Research on multi-turn, multi-party interactions for LLM-integrated virtual assistants is still in its infancy.
Dialogue Type Theory: The paper builds scenario diversity on Walton & Krabbe's seven-type dialogue classification framework.

Rating¶

Dimension	Score (1-5)
Novelty	4
Practicality	4
Experimental Thoroughness	4
Writing Quality	4
Overall Rating	4.0