Voting or Consensus? Decision-Making in Multi-Agent Debate¶

Conference: ACL 2025
arXiv: 2502.19130
Code: GitHub
Area: Others
Keywords: Multi-Agent Debate, Decision Protocols, Voting, Consensus, Answer Diversity, AAD, CI

TL;DR¶

This work systematically compares 7 decision protocols (4 voting, 3 consensus) in multi-agent debate (MAD). Consensus protocols outperform voting by an average of 2.8% on knowledge tasks, while voting protocols outperform consensus by an average of 13.2% on reasoning tasks. Two new methods, All-Agents Drafting (AAD) and Collective Improvement (CI), are proposed to increase answer diversity, yielding performance gains of 3.3% and 7.4%, respectively.

Background & Motivation¶

Core Problem: The success of Multi-Agent Debate (MAD) depends heavily on parameter choices. Among these, the decision protocol (how multiple agents converge on a final answer from the discussion) has a substantial impact on outcomes, yet prior studies treat it as a fixed setting rather than a factor to be optimized.
Limitations of Prior Work:
- Lack of Systematic Comparison: Exchange-of-Thought (Yin et al. 2023) only utilizes consensus methods, Yang et al. 2024 focuses solely on voting protocols, and ReConcile (Chen et al. 2023) mixes both without analyzing the individual contribution of each protocol. Consequently, the fundamental question of "which decision protocol is optimal for a specific task type" remains unanswered.
- Parameter Confounding: Prior works simultaneously vary multiple parameters in experiments (decision protocol + discussion rounds + agent count + response generator), failing to isolate the direct impact of the decision protocol itself on performance, which leads to poor experimental comparability.
- Unquantified Task Adaptability: Intuitively, knowledge tasks and reasoning tasks may require different decision strategies, but this hypothesis has not been quantitatively verified; current methods indiscriminately apply the same protocol across all task types.
Goal: Through rigorous single-variable controlled experiments—varying only the decision protocol—this work systematically evaluates the performance differences of 4 voting protocols and 3 consensus protocols across 3 knowledge tasks and 3 reasoning tasks, and proposes new methods to promote answer diversity.

Method¶

Overall Architecture¶

A multi-agent debate system is built on Llama 3 (8B / 70B): three expert personas are generated automatically, and a final answer is reached via a designated decision protocol after multiple rounds of discussion. The framework consists of three core components: the discussion paradigm (defining the inter-agent communication structure and round rules), the decision protocol (defining when to terminate the discussion and how to select the final solution), and the response generator (defining the agent reply style, such as neutral, critical, or reasoning-only). Each agent generates one response per round, and only the messages from the last two rounds are kept in context to control its length.
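The two-round context window described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the message layout (`round`, `agent`, `text`) and the function name `visible_messages` are hypothetical, and the sketch assumes that "last two rounds" means the two rounds completed before the current one.

```python
def visible_messages(history, current_round, window=2):
    """Return messages from the last `window` completed rounds.

    Assumption: the current round is excluded, so an agent speaking in
    round 4 sees rounds 2 and 3 only.
    """
    first_round = max(1, current_round - window)
    return [m for m in history if first_round <= m["round"] < current_round]


# Hypothetical transcript: three agents, three completed rounds.
history = [
    {"round": r, "agent": a, "text": f"round {r} message from {a}"}
    for r in (1, 2, 3)
    for a in ("A", "B", "C")
]

context = visible_messages(history, current_round=4)
rounds_in_context = sorted({m["round"] for m in context})
```

With three agents, the context passed to round 4 holds six messages, and it stays that size however long the debate runs.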

Key Designs¶

Unified Evaluation Framework for 7 Decision Protocols: Three consensus protocols (majority consensus >50%, supermajority consensus >66%, and unanimity 100%) and four voting protocols (plurality voting—one vote per agent, Borda count/ranked-choice voting—weighted by ranking, approval voting—multiple votes allowed per agent, and cumulative voting—distributing 25 points) are implemented. Consensus protocols require agents to gradually converge during discussion until reaching the protocol threshold, whereas voting protocols have all agents vote from candidate solutions after 3 rounds of discussion to select the final answer. The key distinction is that consensus is a "negotiation-convergence" process, while voting is an "exploration-selection" process.
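To make the seven protocols concrete, the sketch below implements the three consensus thresholds and the four voting rules as plain functions. It is an illustration under stated assumptions rather than the paper's code: the function names are hypothetical, the Borda weights (n-1 points for first place down to 0) are one common convention, and returning `None` on a tie is an assumed way to represent "no decision reached".

```python
from collections import Counter

# Thresholds as stated in the text: >50%, >66%, and 100% agreement.
CONSENSUS_THRESHOLDS = {"majority": 0.50, "supermajority": 0.66}


def consensus_reached(answers, protocol):
    """Return the agreed answer if the protocol threshold is met, else None."""
    answer, count = Counter(answers).most_common(1)[0]
    if protocol == "unanimity":
        return answer if count == len(answers) else None
    share = count / len(answers)
    return answer if share > CONSENSUS_THRESHOLDS[protocol] else None


def _unique_winner(scores):
    """Top-scoring candidate, or None on a tie (assumed to mean no decision)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def plurality(votes):
    """One vote per agent; `votes` is a list of candidate labels."""
    return _unique_winner(Counter(votes))


def borda(rankings):
    """Each ranking lists candidates best-first; position sets the weight."""
    scores = Counter()
    for ranking in rankings:
        for position, candidate in enumerate(ranking):
            scores[candidate] += len(ranking) - 1 - position
    return _unique_winner(scores)


def approval(approvals):
    """Each agent approves any number of candidates (a set per agent)."""
    scores = Counter()
    for approved in approvals:
        scores.update(approved)
    return _unique_winner(scores)


def cumulative(allocations, budget=25):
    """Each agent distributes exactly `budget` points across candidates."""
    scores = Counter()
    for allocation in allocations:
        if sum(allocation.values()) != budget:
            raise ValueError("each agent must distribute the full budget")
        scores.update(allocation)
    return _unique_winner(scores)
```

Under these assumptions, three agents answering `["A", "A", "B"]` satisfy majority and supermajority consensus but not unanimity. If every agent approves every candidate, `approval` returns `None`, which is the kind of undecided outcome that agreeable agents can produce.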
All-Agents Drafting (AAD): To address the issue in default settings where subsequent agents are biased by the first agent's answer, AAD forces all agents to independently draft their initial solutions in the first round without seeing other agents' outputs. Normal discussion resumes from the second round onwards. This ensures diversity in the initial answer pool and avoids groupthink. AAD is compatible with all 7 decision protocols.
Collective Improvement (CI): Based on the independent drafting of AAD, CI further restricts communication by eliminating direct message exchanges between agents. At the end of each round, agents can only see the set of solutions from the previous round (instead of the discussion history) and must independently improve existing solutions or propose new ones. Decoupled from consensus building (which is why it is designed specifically for voting protocols), CI maintains answer diversity by suppressing excessive interaction, keeping the voting pool rich throughout the entire discussion process.
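The difference between the default setting, AAD, and CI comes down to what each agent is allowed to see before it responds. The sketch below expresses that as a single visibility function. The names (`visible_context`, `transcript`, `solutions`) and data layout are hypothetical, and the context-length window is ignored for brevity.

```python
def visible_context(mode, round_idx, transcript, solutions):
    """Return what an agent may read before responding.

    mode       -- "default", "aad", or "ci" (labels assumed for this sketch)
    round_idx  -- 1-based index of the round being generated
    transcript -- list of (round, agent, message) tuples produced so far
    solutions  -- dict mapping round index to that round's solution strings
    """
    # AAD and CI: first-round drafts are written without seeing anyone else.
    if mode in ("aad", "ci") and round_idx == 1:
        return []
    # CI: later rounds see only the previous round's solutions, no messages.
    if mode == "ci":
        return list(solutions.get(round_idx - 1, []))
    # Default, and AAD from round 2 onwards: the normal discussion history.
    return [message for (_, _, message) in transcript]


# Hypothetical state: agent A has already answered in round 1.
transcript = [(1, "A", "A thinks the answer is yes")]
solutions = {1: ["yes", "no", "yes"]}

default_view = visible_context("default", 1, transcript, solutions)
aad_view = visible_context("aad", 1, transcript, solutions)
ci_view = visible_context("ci", 2, transcript, solutions)
```

In the default setting, the second agent in round 1 already reads the first agent's message, which is the source of the bias that AAD is meant to remove.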

Experiments¶

Benchmarks¶

Dataset	Task Type	Details	Sample Size
MMLU	Knowledge	Multiple-choice test covering a broad range of subjects	Subset sampling
MMLU-Pro	Knowledge	Harder, expert-focused multiple-choice questions	Subset sampling
GPQA	Knowledge	Graduate-level, Google-proof Q&A	Subset sampling
SQuAD 2.0	Reasoning	Reading comprehension (containing unanswerable questions)	Subset sampling
StrategyQA	Reasoning	Multi-step reasoning yes/no questions	Subset sampling
MuSR	Reasoning	Multi-step reasoning over long narratives (e.g., murder mysteries)	Subset sampling

Main Results: Comparison of Decision Protocols (Llama 3 8B, Mean ± Std over 3 Runs)¶

Decision Protocol Category	MMLU	MMLU-Pro	GPQA	SQuAD 2.0	StrategyQA	MuSR
Voting Mean	Lower	Lower	Lower	+13.1%	+0.2%	+26.4%
Consensus Mean	+2.3%	+4.9%	+1.3%	Lower	Lower	Lower
CoT Baseline	Below MAD	Below MAD	Below MAD	Below MAD	Below MAD	Below MAD

Key Findings: Consensus protocols consistently outperform voting on all 3 knowledge tasks (average +2.8%), while voting protocols significantly outperform consensus on all 3 reasoning tasks (average +13.2%), and both outperform the single-agent CoT baseline. Consensus takes an average of 1.42 rounds to reach a decision, whereas voting requires 3.38 rounds. Approval voting fails to reach a decision in 59% of cases due to excessive agent sycophancy.
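As a quick check, the reported averages can be reproduced from the per-dataset gains in the table above. The snippet only re-derives numbers already given; the variable names are illustrative.

```python
from statistics import mean

# Per-dataset gains taken from the results table (percentage points).
consensus_gain = {"MMLU": 2.3, "MMLU-Pro": 4.9, "GPQA": 1.3}
voting_gain = {"SQuAD 2.0": 13.1, "StrategyQA": 0.2, "MuSR": 26.4}

knowledge_avg = round(mean(consensus_gain.values()), 1)  # 8.5 / 3
reasoning_avg = round(mean(voting_gain.values()), 1)     # 39.7 / 3
```

The reasoning-task average is dominated by MuSR and SQuAD 2.0; on StrategyQA the gap between voting and consensus is only 0.2%.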

Scaling Analysis (StrategyQA, Plurality Voting Protocol)¶

Scaling Dimension	Range of Variation	Trend	Interpretation
Increase Agent Count	1 → 10	Accuracy increases linearly ↑	Similar to self-consistency multi-sampling, larger knowledge base
Increase Discussion Rounds	1 → 10	Accuracy decreases linearly ↓	Problem drift leading to deviation from the original task
Challenge Round (providing discussion history)	Extra +1 round	Challenge rate decreases by 10%, no positive effect	Agents tend to agree with existing discussions, failing to perform self-correction

Answer Diversity Experiments (StrategyQA)¶

Method	Answer Cosine Similarity	Average Accuracy	vs Baseline
Baseline	0.888	58.3%	—
AAD	0.870	62.8%	+3.3%
CI	0.845	65.7%	+7.4%
Critical Response	0.843	59.4%	+1.1%
Reasoning Response	0.916	51.9%	-6.4%

Answer diversity (lower cosine similarity) positively correlates with task accuracy. CI achieves the highest accuracy (65.7%) when similarity is at its lowest (0.845). However, directly altering diversity via prompting styles (critical/reasoning-only) yields unstable and occasionally detrimental results.
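One way to compute a similarity score like the one in the table is the mean pairwise cosine similarity over the agents' answer vectors. The sketch below assumes the answers have already been embedded by some model, which is not specified here; the function names are hypothetical, and the exact aggregation used in the paper may differ.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs of answers.

    Lower values indicate a more diverse answer pool.
    """
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D "embeddings": two agents agree, one proposes something different.
answers = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
similarity = mean_pairwise_similarity(answers)
```

In this toy case only one of the three pairs is identical, so the score is 1/3; a pool of identical answers would score 1.0.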

Key Findings¶

Task Type Dictates the Optimal Decision Protocol: Knowledge tasks benefit from consensus (multi-agent cross-verification reduces factual errors), whereas reasoning tasks benefit from voting (allowing parallel exploration of multiple paths before selecting the best).
Scaling Agents Outperforms Scaling Rounds: Increasing the number of agents functions like self-consistency multi-sampling, bringing linear performance gains. Conversely, increasing discussion rounds degrades performance due to problem drift, challenging the intuition that "more discussion equals better results."
Structured Communication Beats Prompt Engineering: AAD/CI enhances diversity reliably by modifying the communication structure rather than shifting prompt tones. Critical or reasoning-constrained prompts may instead degrade discussion quality.
Consensus is More Decision-Efficient: Consensus protocols require only 1.42 rounds on average (compared to 3.38 rounds for voting), achieving higher performance on knowledge tasks with lower computational costs.
Approval Voting Fails in LLM Agents: The sycophantic nature of LLM agents causes approval voting to fail in reaching a decision 59% of the time, revealing the limitations of directly transferring human decision protocols to LLM systems.

Rating¶

Dimension	Score (1-10)	Description
Novelty	6	First systematic comparison of voting vs. consensus and proposes AAD/CI, though core ideas (independent sampling, restricted communication) are not entirely new.
Experimental Thoroughness	8	6 datasets × 7 protocols × multiple ablations, rigorous control of variables, 3 experimental runs.
Value	8	Provides clear guidelines on protocol selection and reproducible practical recommendations, with open-source code and data.
Writing Quality	8	Well-structured with rich tables and figures; clear correspondence between experimental designs and findings.

Highlights & Insights¶

First systematic comparison of 7 decision protocols across both knowledge and reasoning tasks, establishing a clear task-protocol selection matrix.
Rigorous single-variable controlled experimental design: changing only the decision protocol at a time, eliminating parameter confounding.
The AAD and CI methods are simple and elegant—requiring no model fine-tuning or prompt content modifications, achieving significant gains solely by adjusting communication structures.
Findings reveal that agent scaling (increasing quantity) is more effective than discussion scaling (increasing rounds), providing clear guidance for resource allocation in multi-agent systems.
Quantitatively reveals the positive correlation between answer diversity and task performance (cosine similarity vs. accuracy).
Unveils the extreme manifestation of agent sycophancy in approval voting (59% failure to decide), sounding an alarm for protocol design.

Limitations & Future Work¶

Experiments are limited to Llama 3 (8B / 70B), without validating generalization to proprietary models such as GPT-4 or Claude.
MAD incurs high computational overhead (approx. 5× for consensus and 10× for voting compared to the CoT baseline); the ROI between performance gains and resource consumption needs careful evaluation.
Due to compute limits, dataset subsets are sampled (95% confidence level); although 3 runs are evaluated, some statistical volatility may remain.
The focus is restricted to the decision protocol dimension, without exploring the interaction effects between persona design, prompt engineering, and decision protocols.
Agent sycophancy remains a fundamental limitation; AAD and CI merely mitigate rather than cure the issue.
The study does not consider the decision dynamics of heterogeneous agents (teams composed of different models), whereas heterogeneous teams might be more common in practical deployments.

Related Work¶

Multi-Agent Debate: Du et al. 2023 (Improving Factuality & Reasoning), Exchange-of-Thought (Yin et al. 2023, consensus method), ReConcile (Chen et al. 2023, hybrid voting+consensus), Liang et al. 2024 (encouraging divergent thinking).
LLM Agent Enhancement: Self-consistency (Wang et al. 2023, multi-path sampling voting), CoT reasoning (Wei et al. 2022), persona-based prompting (Jiang et al. 2024), Self-Refine (Madaan et al. 2023).
Decision Theory & Voting Mechanisms: Social choice theory (List 2022), consensus vs. voting (Jones 1994), Yang et al. 2024 (comparison of multiple voting protocols for LLMs).
MALLM Framework: Becker et al. 2025 propose a multi-agent LLM collaboration framework.

Rating¶

Novelty: ⭐⭐⭐⭐ — First systematic comparison using controlled variables; solid methodological contribution.
Value: ⭐⭐⭐⭐⭐ — Provides clear task-protocol selection guidelines; high practical value.
Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple experimental runs with reported standard deviations, though evaluations are on dataset subsets.
Overall: ⭐⭐⭐⭐
