Beyond Frameworks: Unpacking Collaboration Strategies in Multi-Agent Systems
Conference: ACL 2025
arXiv: 2505.12467
Code: None
Area: Other
Keywords: multi-agent collaboration, governance, interaction patterns, context management, LLM agents
TL;DR
This paper systematically decomposes multi-agent collaboration into four dimensions (governance mode, participation control, interaction pattern, context management). Through extensive experiments on two context-dependent tasks, it demonstrates that the combination of centralized governance + instructor-controlled participation + ordered interaction + instructor summarization is optimal, reducing token consumption by up to 93% while maintaining or even improving accuracy.
Background & Motivation
Background: Multi-agent LLM systems are increasingly used for complex tasks (medical diagnosis, scientific discovery, software development, etc.), but research primarily focuses on high-level architectural frameworks (e.g., CAMEL and MetaGPT) and role assignment.
Limitations of Prior Work:
- Existing frameworks often adopt fixed interaction patterns, lacking analysis of fine-grained collaboration mechanisms such as "who speaks, when to speak, to whom, and using what context."
- Most systems assume sequential pipeline operations, ignoring iterative discussions and consensus-building processes in real teams.
- There is a lack of quantitative trade-off analysis of collaboration strategies on performance and computational efficiency.
Key Challenge: The actual effectiveness of multi-agent systems highly depends on the choice of interaction strategies, yet the community lacks systematic strategic analysis and guidance on optimal combinations.
Goal: To formalize the four dimensions of multi-agent collaboration and quantify the impact of each strategy through controlled experiments.
Key Insight: Instead of proposing a new framework, this work deeply analyzes the "mechanism design" inside frameworks, decomposing collaboration strategies into independently controllable dimensions.
Core Idea: The performance of multi-agent systems depends more on "how to collaborate" than "what framework to use"—centralized governance + selective participation + ordered interaction + summary management is the optimal combination.
Method
Overall Architecture
Define 4 collaboration dimensions × 2–4 strategies per dimension \(\rightarrow\) combine to form all valid configurations \(\rightarrow\) evaluate on 2 tasks \(\rightarrow\) measure the efficiency-accuracy trade-off using TAR (Token-Accuracy Ratio).
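As a rough illustration of how the configuration space is assembled, the following Python sketch enumerates label combinations using the strategy labels defined under Key Designs. The only validity rule it encodes is the one stated in this note (P1 and P2 apply under G1, P3 under G2). The helper `is_valid` and the list names are hypothetical, and the paper's further exclusions for logical conflicts are not reproduced, so the resulting count should not be read as the number of configurations the paper evaluated.

```python
from itertools import product

# Strategy labels as used in this note.
GOVERNANCE = ["G1", "G2"]
PARTICIPATION = ["P1", "P2", "P3"]
INTERACTION = ["I1", "I2", "I3", "I4"]
CONTEXT = ["C1", "C2", "C3"]

# Assumption: the only constraint modelled is the governance/participation pairing.
ALLOWED_PARTICIPATION = {"G1": {"P1", "P2"}, "G2": {"P3"}}


def is_valid(governance, participation):
    """Return True if the participation strategy is available under this governance mode."""
    return participation in ALLOWED_PARTICIPATION[governance]


def enumerate_configs():
    """List configuration labels such as 'G2-P3-I2-C3' that satisfy is_valid."""
    return [
        "-".join((g, p, i, c))
        for g, p, i, c in product(GOVERNANCE, PARTICIPATION, INTERACTION, CONTEXT)
        if is_valid(g, p)
    ]


configs = enumerate_configs()
print(len(configs), "candidate configurations under this single rule")
print(configs[0], "...", configs[-1])
```

Each surviving label can then be run on both tasks and scored, which is the grid the rest of this note refers to.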
Key Designs
- Governance Mode:
  - G1 Decentralized: Agents self-organize, autonomously deciding when to participate and how to interact, and reach a final decision through majority voting or consensus.
  - G2 Centralized: An instructor agent coordinates the entire process: it decides who speaks and when, manages context, and determines when to terminate.
  - Design Motivation: This is the most fundamental dimension, since it determines which strategies are available in the other three dimensions.
- Participation Control:
  - P1 All Participate (under G1): All agents speak in every round; diversity is high, but so is redundancy.
  - P2 Selective Participation (under G1): Each agent decides for itself whether to speak; this is efficient but may miss crucial information.
  - P3 Instructor-Controlled Participation (under G2): The instructor decides who speaks in each round; this is the most precise option but depends heavily on the instructor's judgment.
- Interaction Patterns:
  - I1 Synchronous: All agents generate responses simultaneously and broadcast them to everyone; this runs in parallel but is prone to conflicts.
  - I2 Ordered Turn-Take: Agents speak sequentially in a predefined order, so later speakers can view prior outputs and improve on them.
  - I3 Random Turn-Take: Agents speak in a random order, which avoids ordering bias.
  - I4 Selective Peer-to-Peer: Agents autonomously choose whom to address; relevance is high, but context becomes fragmented.
- Context Management:
  - C1 Full Log of Previous Rounds: Retain the complete conversation history; context is rich, but token usage explodes.
  - C2 Self-Summarization: Each agent summarizes the previous rounds on its own (distributed compression).
  - C3 Instructor Summarization: The instructor provides a unified summary; consistency is high, but the summary becomes an information bottleneck.
- Token-Accuracy Ratio (TAR): A newly proposed evaluation metric that considers accuracy and token consumption together, where \(\text{TAR} = \text{Accuracy} / \text{Total Tokens}\). It is used to compare the efficiency-quality trade-off of different configurations.
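The interaction patterns listed above differ mainly in what an agent can read from its peers within the same round. The sketch below is one minimal way to express that for I1 through I3; the helper `same_round_view` is hypothetical and not taken from the paper, and I4 is left out because it depends on each agent's own choice of recipient.

```python
import random


def same_round_view(pattern, agents, rng=None):
    """Map each agent to the peers whose current-round output it can read before speaking.

    Assumption: visibility of earlier rounds is handled separately by context management.
    """
    if pattern == "I1":
        # Synchronous: everyone answers at once, so nobody sees same-round output.
        return {agent: [] for agent in agents}
    if pattern == "I2":
        # Ordered turn-take: the predefined order is the order of the input list.
        order = list(agents)
    elif pattern == "I3":
        # Random turn-take: a fresh shuffle of the speaking order.
        order = list(agents)
        (rng or random).shuffle(order)
    else:
        raise ValueError(f"pattern {pattern!r} is not modelled in this sketch")
    # Each speaker sees everyone who spoke before it in this round.
    return {agent: order[:position] for position, agent in enumerate(order)}


agents = ["history", "labs", "medications"]
print(same_round_view("I1", agents))
print(same_round_view("I2", agents))
print(same_round_view("I3", agents, random.Random(0)))
```

Under I2 the first speaker reads nothing from the current round and the last speaker reads every earlier output, which is the progressive-improvement property described for ordered turn-taking.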
Task Design
- DEI (Distributed Evidence Integration): Patient discharge prediction on the MIMIC-III dataset. Five agents each hold a different type of clinical information (history of present illness, procedures, lab results, medications, social history) and must collaborate to integrate it and make a judgment.
- SES (Structured Evidence Synthesis): Fact-checking on the AmbiFC dataset. Only a few of the agents hold relevant evidence, and those agents must convince the others.
Key Experimental Results
DEI Task (Patient Discharge Prediction)
| Configuration | Acc | Input Token | Output Token | Rounds |
|---|---|---|---|---|
| Best Single Agent (BHC) | 60.8 | 541 | 109 | 1 |
| G2-P3-I2-C3 (Centralized+Ordered+Instructor Summary) | 67.5 | ~2K | ~500 | ~3 |
| G1-P1-I1-C1 (Decentralized+All+Synchronous+Full Log) | 62.3 | ~28K | ~3K | ~5 |

Token savings (G2-P3-I2-C3 vs. G1-P1-I1-C1): ~93%
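The ~93% saving and the TAR comparison can be roughly checked from the approximate numbers in the DEI table. The sketch below reads entries such as "~2K" as round numbers, which is an assumption, and uses the TAR definition given in this note (accuracy divided by total tokens); the function names are illustrative only.

```python
# Approximate figures from the DEI table: (accuracy %, input tokens, output tokens).
# Entries written as "~2K", "~28K", etc. are treated as exact round numbers here.
DEI_ROWS = {
    "G2-P3-I2-C3": (67.5, 2_000, 500),
    "G1-P1-I1-C1": (62.3, 28_000, 3_000),
}


def tar(accuracy, input_tokens, output_tokens):
    """Token-Accuracy Ratio as defined in this note: accuracy / total tokens."""
    return accuracy / (input_tokens + output_tokens)


def saving(new_tokens, old_tokens):
    """Fraction of tokens saved when moving from old_tokens to new_tokens."""
    return 1.0 - new_tokens / old_tokens


best = DEI_ROWS["G2-P3-I2-C3"]
base = DEI_ROWS["G1-P1-I1-C1"]

print(f"input-token saving: {saving(best[1], base[1]):.1%}")
print(f"total-token saving: {saving(best[1] + best[2], base[1] + base[2]):.1%}")
for name, (acc, tok_in, tok_out) in DEI_ROWS.items():
    # Scaled to accuracy points per 1K tokens for readability.
    print(f"{name}: TAR = {1000 * tar(acc, tok_in, tok_out):.2f} per 1K tokens")
```

With these rounded inputs the input-token saving works out to about 92.9% and the total-token saving to about 91.9%, both in line with the reported ~93%.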
SES Task (Fact-Checking)
| Configuration | Acc | Rounds |
|---|---|---|
| Theoretical Upper Bound (A_consistent alone) | 88.7 | 1 |
| G2-P3-I2-C3 | 85.2 | ~3 |
| G1-P1-I1-C1 | 78.5 | ~5 |
Key Findings
- Centralized governance matches or outperforms decentralized governance: On both tasks, G2 configurations achieved higher or comparable accuracy with significantly lower token consumption.
- Ordered interaction (I2) outperforms synchronous (I1): Subsequent agents can see prior outputs, avoiding duplication and enabling progressive improvement.
- Instructor summarization (C3) is the key to efficiency: Compared to retaining full logs in C1, C3 reduces tokens by up to 93% without losing accuracy.
- Selective participation is crucial on SES: When most agents hold irrelevant information, letting the instructor filter speakers avoids noise.
- The TAR metric reveals that the highest accuracy is not necessarily the optimal choice: After considering computational costs, configurations with moderate accuracy but low token usage might be more practical.
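To see why context management dominates token cost in the findings above, the following back-of-the-envelope model compares input tokens under C1 (full log) and C3 (instructor summary). All sizes are hypothetical and chosen only for illustration; the model ignores the task prompt and is not the paper's accounting.

```python
def input_tokens_full_log(n_agents, n_rounds, msg_tokens):
    """C1: in each round, every agent re-reads all messages from earlier rounds."""
    total = 0
    for round_index in range(n_rounds):
        log_size = round_index * n_agents * msg_tokens  # everything said so far
        total += n_agents * log_size
    return total


def input_tokens_instructor_summary(n_agents, n_rounds, msg_tokens, summary_tokens):
    """C3: the instructor reads the last round once and agents read only its summary."""
    total = 0
    for round_index in range(1, n_rounds):
        prior_summary = summary_tokens if round_index > 1 else 0
        # Instructor input: last round's messages plus its own previous summary.
        total += n_agents * msg_tokens + prior_summary
        # Agent input: one fixed-size summary each.
        total += n_agents * summary_tokens
    return total


# Hypothetical sizes: 5 agents, 5 rounds, 200-token messages, 150-token summaries.
full_log = input_tokens_full_log(5, 5, 200)
summarized = input_tokens_instructor_summary(5, 5, 200, 150)
print(f"C1 full log:           {full_log} input tokens")
print(f"C3 instructor summary: {summarized} input tokens")
print(f"reduction:             {1 - summarized / full_log:.0%}")
```

In this toy model the full-log cost grows quadratically with the number of rounds, while the summary-based cost grows linearly, which is consistent in direction with the reported savings.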
Highlights & Insights
- Decomposing collaboration into four dimensions gives multi-agent research a common analytical framework: It turns the vague notion of "collaboration style" into controllable variables, which can be carried over to the design and evaluation of other multi-agent systems.
- The finding that "centralization is actually more efficient" has practical guiding significance: Contrary to the intuitive belief that "decentralization is more flexible," in LLM multi-agent scenarios, a capable coordinator/instructor is more effective than agent self-organization—since LLM agents have limited self-judgment and coordination capabilities.
- The TAR metric fills the gap in efficiency evaluation: Pure accuracy evaluation neglects API costs, making TAR highly valuable for practical deployment scenarios.
Limitations & Future Work
- Experiments were conducted only on GPT-4o: Differences in the collaboration capabilities of different models might lead to different optimal strategies.
- Only two tasks were tested: DEI and SES represent two extreme paradigms, leaving intermediate hybrid scenarios uncovered.
- Centralization relies on the quality of the instructor: If the instructor agent has poor judgment, overall performance can drop sharply (the paper acknowledges this as a single-point-of-failure risk).
- The space of strategy combinations is not fully exhausted: Certain combinations were excluded due to logical conflicts, but more flexible hybrid strategies may exist.
Related Work & Insights
- vs CAMEL (Li et al., 2023): CAMEL fixes an instructor-agent two-party dialogue paradigm, whereas this work analyzes more participants and interaction patterns.
- vs MetaGPT (Hong et al., 2024): MetaGPT uses a pipeline architecture, while this work focuses on non-pipelined, discussion-style collaboration.
- vs Debate (Du et al., 2024): Debate is a special case of the G1-P1-I1 configuration in this work, and the experiments here indicate that it is not the optimal choice.
Rating
- Novelty: ⭐⭐⭐⭐ The four-dimensional decomposition framework is systematic; although not an entirely new concept, it is highly formalized.
- Experimental Thoroughness: ⭐⭐⭐⭐ Detailed ablation experiments across various strategy combinations, though limited to GPT-4o and two tasks.
- Writing Quality: ⭐⭐⭐⭐⭐ Extremely clear structure, intuitive diagrams, and complete taxonomy.
- Value: ⭐⭐⭐⭐ Offers directly actionable guidance for the design of multi-agent systems.