Beyond Frameworks: Unpacking Collaboration Strategies in Multi-Agent Systems

Conference: ACL 2025
arXiv: 2505.12467
Code: None
Area: Other
Keywords: multi-agent collaboration, governance, interaction patterns, context management, LLM agents

TL;DR

This paper systematically decomposes multi-agent collaboration into four dimensions (governance mode, participation control, interaction pattern, context management). Through extensive experiments on two context-dependent tasks, it demonstrates that the combination of centralized governance + instructor-controlled participation + ordered interaction + instructor summarization is optimal, reducing token consumption by up to 93% while maintaining or even improving accuracy.

Background & Motivation

Background: Multi-agent LLM systems are increasingly used for complex tasks such as medical diagnosis, scientific discovery, and software development, but research has focused primarily on high-level architectural frameworks (e.g., CAMEL, MetaGPT) and role assignment.

Limitations of Prior Work:
- Existing frameworks often adopt fixed interaction patterns and do not analyze fine-grained collaboration mechanisms such as "who speaks, when to speak, to whom, and using what context."
- Most systems assume sequential pipeline operations, ignoring the iterative discussion and consensus-building found in real teams.
- Quantitative analysis of how collaboration strategies trade off performance against computational cost is lacking.

Key Challenge: The actual effectiveness of a multi-agent system depends heavily on its choice of interaction strategies, yet the community lacks a systematic analysis of those strategies and guidance on which combinations work best.

Goal: To formalize the four dimensions of multi-agent collaboration and quantify the impact of each strategy through controlled experiments.

Key Insight: Instead of proposing a new framework, this work deeply analyzes the "mechanism design" inside frameworks, decomposing collaboration strategies into independently controllable dimensions.

Core Idea: The performance of multi-agent systems depends more on "how to collaborate" than "what framework to use"—centralized governance + selective participation + ordered interaction + summary management is the optimal combination.

Method

Overall Architecture

Define 4 collaboration dimensions × 2–4 strategies per dimension \(\rightarrow\) combine to form all valid configurations \(\rightarrow\) evaluate on 2 tasks \(\rightarrow\) measure the efficiency-accuracy trade-off using TAR (Token-Accuracy Ratio).
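The enumeration step in this pipeline can be sketched as follows. This is a minimal illustration, not the paper's code: the strategy labels come from the Key Designs list, the only validity rule modelled is the governance/participation pairing stated there (P1 and P2 under G1, P3 under G2), and `is_valid` and `enumerate_configs` are hypothetical names. The paper may exclude further combinations, so the count produced here need not match its configuration count.

```python
from itertools import product

# Strategy labels as listed under Key Designs.
GOVERNANCE = ["G1", "G2"]
PARTICIPATION = ["P1", "P2", "P3"]
INTERACTION = ["I1", "I2", "I3", "I4"]
CONTEXT = ["C1", "C2", "C3"]

# Participation strategies permitted under each governance mode.
ALLOWED_PARTICIPATION = {"G1": {"P1", "P2"}, "G2": {"P3"}}


def is_valid(g, p, i, c):
    """Hypothetical validity check: only the G/P pairing rule is modelled."""
    return p in ALLOWED_PARTICIPATION[g]


def enumerate_configs():
    """Return configuration names such as 'G2-P3-I2-C3'."""
    return [
        "-".join(combo)
        for combo in product(GOVERNANCE, PARTICIPATION, INTERACTION, CONTEXT)
        if is_valid(*combo)
    ]


configs = enumerate_configs()
print(len(configs), configs[:3])
```

Any additional logical conflicts between dimensions would be added as extra conditions in `is_valid`.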

Key Designs

Governance Mode:
- G1 Decentralized: Agents self-organize, autonomously deciding when to participate and how to interact, ultimately reaching decisions through majority voting or consensus.
- G2 Centralized: An instructor agent coordinates the entire process—deciding who speaks, when to speak, managing context, and determining when to terminate.
- Design Motivation: This is the most fundamental dimension, determining the available strategy space for the other three dimensions.
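For the decentralized case, the final decision by majority voting might look like the sketch below. The tie-handling rule (return `None` so the caller can run another round) is an assumption of this example; the summary does not say how ties or consensus are resolved, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer, or None when the top answers tie."""
    if not answers:
        return None
    ranked = Counter(answers).most_common(2)
    # Assumption: a tie between the two leading answers means no decision yet.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


print(majority_vote(["discharge", "stay", "discharge"]))
```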
Participation Control:
- P1 All Participate (G1): All agents speak in every round—high diversity but high redundancy.
- P2 Selective Participation (G1): Agents determine by themselves whether to speak—highly efficient but potentially missing crucial information.
- P3 Instructor-Controlled Participation (G2): The instructor decides who speaks in each round—most precise but highly dependent on the instructor's judgment.
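A centralized round loop with instructor-controlled participation could be sketched as below. Everything here is illustrative: the agents are stub functions standing in for LLM calls, and `run_centralized`, `select_speakers`, and `should_stop` are hypothetical names for the instructor's decisions about who speaks and when to terminate.

```python
from typing import Callable, Dict, List

Agent = Callable[[List[str]], str]  # maps the transcript so far to a reply


def run_centralized(
    agents: Dict[str, Agent],
    select_speakers: Callable[[int, List[str]], List[str]],
    should_stop: Callable[[int, List[str]], bool],
    max_rounds: int = 5,
) -> List[str]:
    """Run rounds in which the instructor policy picks who speaks (P3)."""
    transcript: List[str] = []
    for round_idx in range(max_rounds):
        for name in select_speakers(round_idx, transcript):
            reply = agents[name](transcript)
            transcript.append(f"{name}: {reply}")
        if should_stop(round_idx, transcript):
            break
    return transcript


# Stub agents standing in for LLM calls; each holds one kind of evidence.
agents = {
    "labs": lambda transcript: "lab values are within normal range",
    "meds": lambda transcript: "no new medications started",
    "social": lambda transcript: "patient has support at home",
}

# Hypothetical instructor policy: a fixed schedule, then stop after round 1.
schedule = [["labs", "meds"], ["social"]]
transcript = run_centralized(
    agents,
    select_speakers=lambda r, t: schedule[r],
    should_stop=lambda r, t: r >= len(schedule) - 1,
)
print("\n".join(transcript))
```

In a real system both policy callables would themselves be LLM calls made by the instructor agent, which is why overall quality depends so strongly on the instructor's judgment.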
Interaction Patterns:
- I1 Synchronous: All agents generate responses simultaneously and broadcast them to everyone—parallel but prone to conflicts.
- I2 Ordered Turn-Take: Speak sequentially in a predefined order—subsequent speakers can view prior outputs for progressive improvement.
- I3 Random Turn-Take: Speak in a random order—avoiding ordering bias.
- I4 Selective Peer-to-Peer: Agents autonomously choose who to speak to—high relevance but fragmented context.
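The difference between the first three patterns comes down to how one round is scheduled, which the sketch below illustrates. It assumes a round can be described as a list of batches, where agents in the same batch reply without seeing each other's output; `speaking_order` is a hypothetical helper, and I4 is left out because its routing depends on choices the agents make at run time.

```python
import random


def speaking_order(agents, pattern, rng=None):
    """Return a list of batches; agents in one batch reply without seeing each other."""
    if pattern == "I1":  # synchronous: everyone responds in a single batch
        return [list(agents)]
    if pattern == "I2":  # ordered turn-take: one agent per batch, fixed order
        return [[agent] for agent in agents]
    if pattern == "I3":  # random turn-take: one agent per batch, shuffled order
        rng = rng or random.Random()
        shuffled = list(agents)
        rng.shuffle(shuffled)
        return [[agent] for agent in shuffled]
    raise ValueError(f"pattern {pattern!r} is not modelled in this sketch")


team = ["labs", "meds", "social"]
print(speaking_order(team, "I1"))
print(speaking_order(team, "I2"))
print(speaking_order(team, "I3", rng=random.Random(0)))
```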
Context Management:
- C1 Full Log of Previous Rounds: Retain the complete conversation history—rich context but causing a token explosion.
- C2 Self-Summarization: Each agent summarizes the previous rounds on their own—distributed compression.
- C3 Instructor Summarization: The instructor provides a unified summary—high consistency but has an information bottleneck.
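The three context strategies differ only in what each agent receives at the start of a round, as the following sketch shows. The `summarize` callable is a stand-in for an LLM summarization call (here it simply truncates each message), and `build_context` is a hypothetical helper, not an API from the paper.

```python
def build_context(agent, history, strategy, summarize):
    """Build the context handed to `agent` for the next round.

    history: list of (speaker, message) pairs from previous rounds.
    summarize: callable(author, messages) -> str, standing in for an LLM call.
    """
    messages = [f"{speaker}: {message}" for speaker, message in history]
    if strategy == "C1":  # full log: every previous message, verbatim
        return "\n".join(messages)
    if strategy == "C2":  # self-summarization: each agent writes its own summary
        return summarize(agent, messages)
    if strategy == "C3":  # instructor summarization: one shared summary
        return summarize("instructor", messages)
    raise ValueError(f"unknown strategy {strategy!r}")


def truncating_summarizer(author, messages):
    """Toy summarizer (assumption): keep the first 20 characters of each message."""
    return f"[{author} summary] " + " | ".join(m[:20] for m in messages)


history = [
    ("labs", "creatinine normal, white cell count trending down over three days"),
    ("meds", "antibiotic course finished yesterday with no adverse reactions noted"),
]
for strategy in ("C1", "C2", "C3"):
    context = build_context("social", history, strategy, truncating_summarizer)
    print(strategy, len(context))
```

Under C1 the context grows with every round, which is the source of the token growth noted above; under C3 all agents read the same compressed text, so anything the instructor leaves out is lost to everyone.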
Token-Accuracy Ratio (TAR): A newly proposed evaluation metric that considers both accuracy and token consumption simultaneously, where \(\text{TAR} = \text{Accuracy} / \text{Total Tokens}\), used to compare the efficiency-quality trade-off of different configurations.
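A minimal sketch of the TAR computation follows. It assumes that "Total Tokens" means input plus output tokens, which the summary does not spell out, and it plugs in the rounded DEI figures reported in this note purely for illustration; `token_accuracy_ratio` is a hypothetical helper name.

```python
def token_accuracy_ratio(accuracy, input_tokens, output_tokens):
    """TAR = Accuracy / Total Tokens, with total = input + output (assumption)."""
    total_tokens = input_tokens + output_tokens
    if total_tokens <= 0:
        raise ValueError("total token count must be positive")
    return accuracy / total_tokens


# Rounded DEI figures from this note: (accuracy %, input tokens, output tokens).
runs = {
    "G2-P3-I2-C3": (67.5, 2_000, 500),
    "G1-P1-I1-C1": (62.3, 28_000, 3_000),
}
tar = {name: token_accuracy_ratio(*values) for name, values in runs.items()}
for name, value in sorted(tar.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.5f}")
```

Because TAR divides by raw token counts, its absolute values are only meaningful when comparing configurations on the same task and model.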

Task Design

DEI (Distributed Evidence Integration): Patient discharge prediction on the MIMIC-III dataset. Five agents each hold a different type of clinical information (history of present illness, procedures, lab results, medications, social history) and must collaborate to integrate this information and reach a judgment.
SES (Structured Evidence Synthesis): Fact-checking on the AmbiFC dataset. Only a few of the agents hold relevant evidence, so the agents with crucial evidence must convince the others.

Key Experimental Results

DEI Task (Patient Discharge Prediction)

Configuration	Acc (%)	Input Tokens	Output Tokens	Rounds
Best Single Agent (BHC)	60.8	541	109	1
G2-P3-I2-C3 (Centralized+Instructor-Controlled+Ordered+Instructor Summary)	67.5	~2K	~500	~3
G1-P1-I1-C1 (Decentralized+All+Synchronous+Full Log)	62.3	~28K	~3K	~5
Token Savings (G2-P3-I2-C3 vs. G1-P1-I1-C1)		~93%		
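As a quick check of the savings figure, assuming the ~93% refers to input tokens and taking the rounded values from the table (~28K for G1-P1-I1-C1, ~2K for G2-P3-I2-C3):

\[
\text{Savings} = 1 - \frac{2\text{K}}{28\text{K}} \approx 0.93
\]

Adding the rounded output tokens (~3K and ~500) gives \(1 - 2.5\text{K} / 31\text{K} \approx 0.92\), so the saving is of similar size on total tokens under the same rounding.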

SES Task (Fact-Checking)

Configuration	Acc (%)	Rounds
Theoretical Upper Bound (A_consistent alone)	88.7	1
G2-P3-I2-C3	85.2	~3
G1-P1-I1-C1	78.5	~5

Key Findings

Centralized governance consistently matches or outperforms decentralized governance: On both tasks, G2 configurations achieved higher or comparable accuracy with significantly lower token consumption.
Ordered interaction (I2) outperforms synchronous (I1): Subsequent agents can see prior outputs, avoiding duplication and enabling progressive improvement.
Instructor summarization (C3) is the key to efficiency: Compared to retaining full logs in C1, C3 reduces tokens by up to 93% without losing accuracy.
Selective participation is crucial on SES: When most agents hold irrelevant information, letting the instructor filter speakers avoids noise.
The TAR metric reveals that the highest accuracy is not necessarily the optimal choice: After considering computational costs, configurations with moderate accuracy but low token usage might be more practical.

Highlights & Insights

Decomposing into four dimensions provides a solid framework for multi-agent collaboration research: It operationalizes the vague concept of "collaboration style" into controllable variables, offering the community a standardized analytical framework. This can be transferred to the design and evaluation of any multi-agent system.
The finding that "centralization is actually more efficient" has practical value: Contrary to the intuition that "decentralization is more flexible," a capable coordinator/instructor is more effective than agent self-organization in LLM multi-agent scenarios, because LLM agents have limited self-judgment and coordination capabilities.
The TAR metric fills the gap in efficiency evaluation: Pure accuracy evaluation neglects API costs, making TAR highly valuable for practical deployment scenarios.

Limitations & Future Work

Experiments were conducted only on GPT-4o: Models differ in their collaboration capabilities, so the optimal strategy may differ for other models.
Only two tasks were tested: DEI and SES represent two extreme paradigms, leaving intermediate hybrid scenarios uncovered.
Centralization relies on the quality of the instructor: If the instructor agent has poor judgment, overall performance can drop sharply (the paper acknowledges this as a single-point-of-failure risk).
The space of strategy combinations is not fully exhausted: Certain combinations were excluded due to logical conflicts, but more flexible hybrid strategies may exist.

vs CAMEL (Li et al., 2023): CAMEL fixes an instructor-agent two-party dialogue paradigm, whereas this work analyzes more participants and interaction patterns.
vs MetaGPT (Hong et al., 2024): MetaGPT uses a pipeline architecture, while this work focuses on non-pipelined, discussion-style collaboration.
vs Debate (Du et al., 2024): Debate is a special case of the G1-P1-I1 configuration in this work, and the experiments here show that this configuration is not the optimal choice.

Rating

Novelty: ⭐⭐⭐⭐ The four-dimensional decomposition framework is systematic; although not an entirely new concept, it is highly formalized.
Experimental Thoroughness: ⭐⭐⭐⭐ Detailed ablation experiments across various strategy combinations, though limited to GPT-4o and two tasks.
Writing Quality: ⭐⭐⭐⭐⭐ Extremely clear structure, intuitive diagrams, and complete taxonomy.
Value: ⭐⭐⭐⭐ Offers direct, practical guidance for the design of multi-agent systems.