MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents¶

Basic Information¶

Conference: ACL 2025
Code: MultiagentBench/MARBLE
Institution: University of Illinois Urbana-Champaign
Area: LLM Agent / Multi-Agent Systems
Keywords: multi-agent, benchmark, collaboration, competition, coordination protocol, emergent behavior

TL;DR¶

This paper proposes the MultiAgentBench benchmark and the MARBLE framework to systematically evaluate the performance of LLM multi-agent systems in collaborative and competitive scenarios. Covering 6 interactive environments (Research, Minecraft, Database, Coding, Bargaining, and Werewolf), the study introduces milestone-based KPI metrics and coordination scores. The evaluation reveals that GPT-4o-mini achieves the highest overall task score, graph-structured coordination protocols perform best in research scenarios, and cognitive planning improves milestone completion rates by 3%.

Background & Motivation¶

Limitations of Single-Agent Evaluation: Existing benchmarks such as AgentBench, GAIA, and ToolBench primarily evaluate the reasoning and generation capabilities of individual LLM agents, neglecting coordination and competitive dynamics in multi-agent interactions.
Rise of Multi-Agent Systems: LLM multi-agent systems have demonstrated great potential in software development (e.g., MetaGPT, ChatDev), scientific research, and gaming, but lack systematic evaluation standards.
Insufficient Evaluation Dimensions: Evaluation must cover not only task completion but also coordination quality, communication efficiency, and planning capabilities.
Goal: To construct a comprehensive multi-agent evaluation framework spanning both collaborative and competitive scenarios.

Method¶

Overall Architecture: MARBLE (Multi-agent cooRdination Backbone with LLM Engine)¶

MARBLE consists of four core modules:

1. Agent Graph Module¶

Models agent relationships as a graph \(G = (\mathcal{A}, E)\):
- \(\mathcal{A} = \{a_1, a_2, \dots, a_n\}\): The set of agents.
- Each edge \((a_i, r, a_j) \in E\) links two agents and carries a relationship type \(r\): collaborates, supervises, or negotiates.
- Communication and coordination occur only between agents with explicit relationships.
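The graph definition above can be sketched as a small data structure. This is a minimal illustration, not MARBLE's actual code: the class name `AgentGraph` and its methods are hypothetical, and treating an edge as permitting communication in both directions is an assumption.

```python
from dataclasses import dataclass, field

# Relationship labels named in the text above.
RELATIONS = {"collaborates", "supervises", "negotiates"}


@dataclass
class AgentGraph:
    """Hypothetical container for agents and typed edges (a_i, r, a_j)."""
    agents: set[str] = field(default_factory=set)
    edges: set[tuple[str, str, str]] = field(default_factory=set)

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.agents.update((src, dst))
        self.edges.add((src, relation, dst))

    def can_communicate(self, a: str, b: str) -> bool:
        # Assumption: any edge between a and b allows messages both ways.
        return any({s, d} == {a, b} for s, _, d in self.edges)


graph = AgentGraph()
graph.add_edge("a1", "supervises", "a2")
graph.add_edge("a2", "collaborates", "a3")
```

Here `a1` and `a3` share no edge, so `graph.can_communicate("a1", "a3")` is false, which mirrors the rule that only explicitly related agents interact.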

2. Cognitive Module¶

Maintains each agent's internal state: persona, inter-agent relationships, and reasoning strategies.
Integrates Theory of Mind (ToM) and social intelligence.
Supports reasoning strategies such as CoT and ReAct.
Simulates the human process of continuously updating mental models based on social cues.

3. Coordination Engine¶

Supports four coordination protocols:

Protocol	Type	Characteristics
Star	Centralized	Single planner assigns tasks, strong supervision but limited scalability
Tree	Centralized	Hierarchical structure, top-level planner delegates to downstream planners
Graph-Mesh	Decentralized	Direct communication between agents, concurrent planning, and distributed decision-making
Chain	Decentralized	Sequential propagation of decisions, suitable for tasks with dependencies
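One way to picture the four protocols is as communication topologies over the same list of agents. The sketch below is illustrative only: the function names are hypothetical, the first agent is assumed to be the planner in Star and the root in Tree, the tree is assumed to have a fixed branching factor, and Graph-Mesh is assumed to be fully connected.

```python
from itertools import combinations


def star(agents):
    # One central planner (assumed to be the first agent) talks to all others.
    hub, *rest = agents
    return [(hub, a) for a in rest]


def tree(agents, branching=2):
    # The agent at index i reports to the agent at index (i - 1) // branching.
    return [(agents[(i - 1) // branching], agents[i]) for i in range(1, len(agents))]


def graph_mesh(agents):
    # Every pair of agents can communicate directly (assumed full mesh).
    return list(combinations(agents, 2))


def chain(agents):
    # Each agent passes its decision on to the next one in sequence.
    return list(zip(agents, agents[1:]))


agents = ["a1", "a2", "a3", "a4"]
topologies = {
    "star": star(agents),
    "tree": tree(agents),
    "graph-mesh": graph_mesh(agents),
    "chain": chain(agents),
}
```

With four agents the mesh has six links against three for each of the other layouts, which hints at why decentralized communication costs grow faster as agents are added.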

4. Planner Design¶

Four planning strategies:

Vanilla Prompting: Direct zero-shot prompt-based plan generation.
Chain-of-Thought (CoT): Step-by-step reasoning using input tasks, agent profiles, and interaction history.
Group Discussion: Multi-agent collaborative deliberation sharing insights and constraints.
Cognitive Self-Evolving Planning: Generates expected outcomes, stores them in memory, compares them with actual performance, and iteratively refines them (similar to Reflexion).
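The expect, compare, and refine loop described for Cognitive Self-Evolving Planning can be sketched as follows. This is a simplified illustration rather than the paper's implementation: `propose` and `execute` are hypothetical callables standing in for LLM calls and environment steps, and comparing outcomes by exact equality is an assumption made to keep the example runnable.

```python
from typing import Callable


def self_evolving_plan(
    task: str,
    propose: Callable[[str, list[str]], tuple[str, str]],  # (task, memory) -> (plan, expected outcome)
    execute: Callable[[str], str],                          # plan -> actual outcome
    max_rounds: int = 3,
) -> tuple[str, list[str]]:
    memory: list[str] = []  # records of mismatches between expectation and reality
    plan = ""
    for _ in range(max_rounds):
        plan, expected = propose(task, memory)
        actual = execute(plan)
        if actual == expected:
            break  # the plan behaved as predicted; stop refining
        # Store the mismatch so the next proposal can take it into account.
        memory.append(f"plan={plan!r} expected={expected!r} actual={actual!r}")
    return plan, memory


# Toy stand-ins: the plan only succeeds after two recorded failures.
def toy_propose(task, memory):
    return f"v{len(memory)}", "ok"


def toy_execute(plan):
    return "ok" if plan == "v2" else "fail"


final_plan, history = self_evolving_plan("build", toy_propose, toy_execute)
```

In the toy run the planner settles on `v2` after recording two failed attempts, showing how stored expectations drive each refinement.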

Benchmark Task Design¶

Shared-Goal Scenarios (Collaboration)¶

Scenario	Description	Scale
Research	Multi-agent collaborative writing of research proposals	100 test cases
Minecraft	Collaborative structure construction	100 test cases
Database	5 agents diagnosing different root causes	100 test cases
Coding	Collective programming and module development	100 test cases

Adversarial-Goal Scenarios (Competition)¶

Scenario	Description
Werewolf	Two opposing groups involving deceptive strategies
Bargaining	Resource negotiation to maximize individual payoff

Evaluation Metrics¶

Task Completion Metrics¶

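KPI: Milestone-based Key Performance Indicators. With \(N\) agents, \(M\) milestones in total, and \(n_j\) the number of milestones agent \(j\) contributes to, \(\text{KPI}_{\text{overall}} = \frac{1}{NM}\sum_{j=1}^{N} n_j\).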
Task Score (TS): Final output quality score (evaluated via LLMs or heuristics-based rules).
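A short worked example may make the KPI formula concrete. The sketch assumes \(N\) is the number of agents, \(M\) the total number of milestones, and \(n_j\) the number of milestones agent \(j\) contributed to; the function name and the sample numbers are hypothetical.

```python
def kpi_overall(milestones_per_agent: list[int], total_milestones: int) -> float:
    """Compute (1 / (N * M)) * sum_j n_j for N agents and M milestones."""
    n_agents = len(milestones_per_agent)
    if n_agents == 0 or total_milestones <= 0:
        raise ValueError("need at least one agent and one milestone")
    return sum(milestones_per_agent) / (n_agents * total_milestones)


# Three agents, four milestones; the agents contributed to 3, 2, and 1 of them.
example_kpi = kpi_overall([3, 2, 1], total_milestones=4)  # (3 + 2 + 1) / (3 * 4) = 0.5
```

Under this reading the overall KPI is the mean of the per-agent fractions \(n_j / M\), so it rewards contribution spread across agents and not only completion of the task.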

Coordination Metrics¶

Communication Score (\(C_{\text{score}}\)): Evaluates communication quality using an LLM (5-point scale).
Planning Score (\(P_{\text{score}}\)): Evaluates task organization, role maintenance, and strategy adjustment (5-point scale).
Coordination Score (CS): The average of the above two scores.
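The coordination score is a plain average, which can be written out as below. The function name is hypothetical, and the assumption that both inputs lie in the range 1 to 5 is mine; these notes only say the scale has five points.

```python
def coordination_score(c_score: float, p_score: float) -> float:
    """Average the communication and planning scores (assumed range: 1 to 5)."""
    for score in (c_score, p_score):
        if not 1 <= score <= 5:
            raise ValueError("scores are assumed to lie on a 1-5 scale")
    return (c_score + p_score) / 2


example_cs = coordination_score(4, 3.5)  # (4 + 3.5) / 2 = 3.75
```

Coordination scores quoted later in these notes (for example 75.00) appear to be on a 0-100 scale; the notes do not state how the 5-point average is converted, so no rescaling is applied here.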

Experiments¶

Experimental Setup¶

Models: Meta-Llama-3.1-8B, Meta-Llama-3.1-70B, Meta-Llama-3.3-70B, GPT-3.5-turbo, GPT-4o-mini.
Parameters: max_token_num=1024, temperature=0.7, top_p=1.0.
Iterations: 5 rounds for Research and 20 rounds for Minecraft, with a maximum of 5 communication rounds.
Default Protocol: Graph-Mesh.
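The setup above could be captured in a small configuration object like the one below. Only the values and the parameter names `max_token_num`, `temperature`, and `top_p` come from these notes; the class name `RunConfig` and the remaining field names are hypothetical and are not MARBLE's actual configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the experimental settings listed above."""
    model: str
    max_token_num: int = 1024
    temperature: float = 0.7
    top_p: float = 1.0
    max_communication_rounds: int = 5
    protocol: str = "graph-mesh"  # default coordination protocol


# Iteration counts stated for two of the scenarios.
ITERATIONS = {"research": 5, "minecraft": 20}

MODELS = [
    "Meta-Llama-3.1-8B",
    "Meta-Llama-3.1-70B",
    "Meta-Llama-3.3-70B",
    "GPT-3.5-turbo",
    "GPT-4o-mini",
]

# One configuration per evaluated model, sharing the same decoding parameters.
configs = [RunConfig(model=m) for m in MODELS]
```

Freezing the dataclass keeps the decoding parameters identical across models, so that score differences reflect the model and not the sampling settings.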

Main Results I: Model Performance Comparison¶

Model	Research TS	Minecraft TS	Database TS	Coding TS	Bargaining TS	Werewolf TS
Llama-3.1-8B	80.87	6.12	34.00	59.90	72.81	12.64
Llama-3.1-70B	80.80	0.21	53.00	62.10	72.13	19.82
Llama-3.3-70B	80.00	9.15	28.50	56.60	73.15	36.33
GPT-3.5-turbo	70.20	5.05	45.00	55.50	71.67	15.69
GPT-4o-mini	84.13	33.60	45.00	65.10	74.47	14.06

Key Findings:
1. GPT-4o-mini Achieves the Highest Task Score: It leads in Research, Minecraft, Coding, and Bargaining.
2. Coordination Score \(\neq\) Task Score: Llama-3.1-70B obtains a high coordination score of 75.00 in Minecraft, but its task score is only 0.21.
3. Capabilities Vary by Task: Llama-3.3-70B achieves the highest TS in Werewolf (36.33), but does not perform prominently in other tasks.
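The per-task leaders named in the key findings can be checked directly against the table. The snippet below copies the task scores from the table above and picks the best model in each column; the variable names are illustrative.

```python
from collections import Counter

TASKS = ["Research", "Minecraft", "Database", "Coding", "Bargaining", "Werewolf"]

# Task scores (TS) copied from the results table, in the order of TASKS.
SCORES = {
    "Llama-3.1-8B":  [80.87, 6.12, 34.00, 59.90, 72.81, 12.64],
    "Llama-3.1-70B": [80.80, 0.21, 53.00, 62.10, 72.13, 19.82],
    "Llama-3.3-70B": [80.00, 9.15, 28.50, 56.60, 73.15, 36.33],
    "GPT-3.5-turbo": [70.20, 5.05, 45.00, 55.50, 71.67, 15.69],
    "GPT-4o-mini":   [84.13, 33.60, 45.00, 65.10, 74.47, 14.06],
}

# For each task, select the model with the highest score in that column.
best_per_task = {
    task: max(SCORES, key=lambda model: SCORES[model][i])
    for i, task in enumerate(TASKS)
}

# Count how many tasks each model leads.
wins = Counter(best_per_task.values())
```

GPT-4o-mini leads four of the six columns, while Database and Werewolf go to Llama-3.1-70B and Llama-3.3-70B respectively, which matches findings 1 and 3.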

Main Results II: Coordination Protocol Comparison¶

Graph implementation performs best in the research scenario (highest task score and planning efficiency).
Tree implementation performs the worst, suffering from high token consumption and the lowest task and coordination scores.
Star and Graph exhibit similar task scores.

Planning Strategy Comparison¶

Cognitive Evolving Planning achieves the best coordination score, with a task score comparable to CoT.
Group Discussion unexpectedly performs the worst: excessively large planning groups tend to hinder efficiency, much as they do in large real-world organizations.

Ablation Study¶

Number of Iterations:
- Task and coordination scores rise as the number of iterations increases from 1 to 7, then drop sharply at 10. At 20 iterations the task score recovers, while the coordination score remains unchanged from its level at 10.
- Excessive iterations can degrade coordination (due to communication overhead or instruction conflicts).

Number of Agents:
- KPI decreases as the number of agents increases (owing to more complex coordination).
- Coordination scores improve significantly from 1 to 3 agents and plateau thereafter.
- The task score grows more gradually.

Emergent Behavior Analysis¶

Three key emergent patterns:
1. Strategic Information Sharing: Agents selectively disclose key information (e.g., the Seer concealing inspection results in Werewolf).
2. Trust-Polarized Collaboration: Role identities shape who cooperates with whom; over-suspicious Villagers may end up attacking their own allies.
3. Role-Driven Strategy Iteration: Characters (e.g., Seer, Witch) adaptively adjust their strategies during gameplay.

Highlights & Insights¶

First Systematic Multi-Agent Evaluation Framework: Covers collaboration and competition across 6 scenarios and multiple coordination protocols, filling an important evaluation gap.
Innovative Milestone-Based KPI: Moves beyond binary task success/failure, tracking intermediate progress and individual contributions.
Counter-Intuitive Findings: Group Discussion performs the worst, high coordination scores do not equate to high task scores, and more agents are not always beneficial.
Discovery of Emergent Behaviors: LLM agents exhibit human-like social behavior patterns under information asymmetry and role-based conflicts.
Model Capabilities Remain Critical: Improvements in coordination cannot compensate for deficiencies in the underlying model's capabilities.

Limitations & Future Work¶

Limited Scenario Coverage: Does not cover highly complex environments such as open worlds or task-oriented dialogues.
Incomplete Model Coverage: Excludes recent models like DeepSeek.
Insufficiently Deep Ablation: Does not adequately study memory mechanisms (long-term, short-term, or shared memory) or different workflow methodologies.
Simplistic Competitive Mechanics: Fails to cover complex dynamics like multi-party negotiation, repeated games, or stochastic factors.
Dependence on LLM-Based Evaluation: KPI milestone detection and coordination scoring rely heavily on LLM judges, which may introduce bias.

Related Work¶

Multi-Agent Systems: MetaGPT (Hong et al., 2024), AgentVerse (Chen et al., 2023), ChatDev (Li et al., 2023)
Multi-Agent Collaboration: Cognitive expansion (Zhuge et al.), collective expansion (Qian et al., 2024)
Agents in Games: GameNGen (Valevski et al., 2024), CUISINEWORLD (Gong et al., 2023), Voyager (Wang et al., 2023)
Single-Agent Benchmarks: AgentBench (Liu et al., 2023), GAIA (Mialon et al., 2023)

Rating¶

⭐⭐⭐⭐ (4/5)

Novelty: The first systematic evaluation benchmark for multi-agent systems, featuring a comprehensive design (+1)
Experimental Thoroughness: 6 scenarios \(\times\) 5 models \(\times\) 4 coordination protocols \(\times\) 4 planning strategies (+0.5)
Depth of Insight: Emergent behavior analysis and counter-intuitive findings add to the academic contribution (+0.5)
Value: The open-source framework is highly beneficial for the community (+0.5)
Deductions: Reliability of LLM-based evaluation is questionable, model coverage is not broad enough, and some task designs are overly simplistic (-1)