Ranked Voting based Self-Consistency of Large Language Models¶

Conference: ACL 2025
arXiv: 2505.10772
Code: https://github.com/szu-tera/RankedVotingSC
Area: LLM Reasoning
Keywords: Self-Consistency, Ranked Voting, Borda Count, Instant Runoff Voting, MRR

TL;DR¶

Upgrades the majority voting in Self-Consistency to ranked voting: for each reasoning path, the LLM generates a preference ranking over multiple candidate answers instead of a single answer. Three ranked voting methods (IRV/BCV/MRRV) aggregate the rankings across reasoning paths, consistently outperforming traditional SC on six datasets, with a maximum improvement of 12.46%.

Background & Motivation¶

Background: Self-Consistency (SC) allows LLMs to perform Chain-of-Thought (CoT) reasoning multiple times and select the final answer via majority voting, representing a mainstream method for improving reasoning accuracy.

Limitations of Prior Work: Traditional SC only selects one answer per reasoning path for voting, discarding the model's preference information over other candidate answers, which wastes "second-choice" and "third-choice" information.

Key Challenge: Majority voting uses only top-1 information, so a correct answer that frequently appears in the top-2/3 but rarely at top-1 is ignored.

Goal: Fully utilize the preference ranking information of multiple candidate answers from the model in each reasoning run.

Key Insight: Drawing on ranked voting methods from social choice/voting theory (IRV, Borda Count, MRR), the method upgrades the voting granularity in SC from "single choice" to "ranking".

Core Idea: Have the LLM generate a ranking of candidate answers instead of a single answer, replacing majority voting with ranked voting for Self-Consistency aggregation.

Method¶

Overall Architecture¶

The method differs from standard SC in only two ways: (1) few-shot prompts guide the LLM to output \(m\) ranked candidates \(\mathcal{A}^r = \{\mathcal{A}^{r_1} \succ \mathcal{A}^{r_2} \succ \cdots \succ \mathcal{A}^{r_m}\}\); (2) the results of \(k\) reasoning runs are aggregated with ranked voting instead of majority voting.
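The two-step structure can be sketched as follows. This is a minimal illustration, not the authors' code: `ranked_self_consistency`, `sample_ranking`, and `fake_sampler` are hypothetical names, and the sampler is a canned stand-in for a real LLM call so that the sketch runs offline. The standard SC baseline is included as one possible aggregator.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List, Sequence

Ranking = List[str]  # candidates from one reasoning path, most likely first


def majority_vote(rankings: Sequence[Ranking]) -> str:
    """Standard SC baseline: only the first choice of each path is counted."""
    counts = Counter(r[0] for r in rankings if r)
    return counts.most_common(1)[0][0]


def ranked_self_consistency(
    question: str,
    sample_ranking: Callable[[str], Ranking],  # hypothetical: one LLM call -> one ranking
    aggregate: Callable[[Sequence[Ranking]], str],
    k: int = 8,
) -> str:
    """Step (1): sample k ranked outputs. Step (2): aggregate them into one answer."""
    rankings = [sample_ranking(question) for _ in range(k)]
    return aggregate(rankings)


# Stand-in for the LLM so the sketch runs offline (not a real model call).
_canned = cycle([["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]])


def fake_sampler(question: str) -> Ranking:
    return list(next(_canned))


answer = ranked_self_consistency("toy question", fake_sampler, majority_vote, k=3)
print(answer)  # A
```

Swapping `majority_vote` for a ranked aggregator changes only step (2); the sampling loop and the budget \(k\) stay the same.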

Key Designs¶

Three Ranked Voting Methods:
- IRV (Instant Runoff Voting): Iteratively eliminates the candidate with the fewest first-choice votes and transfers those ballots to their next-ranked remaining candidate, until one candidate holds more than 50% of the votes. Suitable for scenarios with dispersed votes, finding the most universally accepted answer through multiple rounds of screening.
- Borda Count (BCV): \(\text{BordaCount}(\mathcal{A}) = \sum_{i=1}^{k}(m - \text{rank}_\mathcal{A}(\mathcal{A}_i^r) + 1)\), which awards points by rank: 1st place gets \(m\) points, 2nd place gets \(m-1\) points, and so on, decaying linearly.
- MRRV (Mean Reciprocal Rank Voting): \(\text{MRR}(\mathcal{A}) = \frac{1}{k}\sum_{i=1}^{k}\frac{1}{\text{rank}_\mathcal{A}(\mathcal{A}_i^r)}\), assigning higher weights to top ranks. The 1st place has a weight of 1.0, 2nd has 0.5, 3rd has 0.33, etc. This non-linear decay emphasizes top position information.
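
The three rules above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the released implementation: the function names are made up here, a candidate missing from a path's ranking is assumed to contribute 0, and IRV elimination ties are broken alphabetically purely to keep the example deterministic.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Ranking = List[str]  # candidates from one reasoning path, most likely first


def borda_scores(rankings: Sequence[Ranking], m: int) -> Dict[str, float]:
    """Rank r (1-based) earns m - r + 1 points; unranked candidates earn 0 (assumption)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, cand in enumerate(ranking[:m], start=1):
            scores[cand] += m - rank + 1
    return dict(scores)


def mrr_scores(rankings: Sequence[Ranking]) -> Dict[str, float]:
    """Mean over the k paths of 1/rank; a path that omits a candidate adds 0 (assumption)."""
    k = len(rankings)
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, cand in enumerate(ranking, start=1):
            scores[cand] += 1.0 / rank / k
    return dict(scores)


def irv_winner(rankings: Sequence[Ranking]) -> str:
    """Eliminate the weakest candidate each round until one holds a strict majority."""
    remaining = {cand for ranking in rankings for cand in ranking}
    while True:
        tallies = {cand: 0 for cand in remaining}
        for ranking in rankings:
            # Each path votes for its highest-ranked candidate still in the race.
            for cand in ranking:
                if cand in remaining:
                    tallies[cand] += 1
                    break
        total = sum(tallies.values())
        leader = max(remaining, key=lambda c: (tallies[c], c))
        if tallies[leader] * 2 > total or len(remaining) == 1:
            return leader
        # Alphabetical tie-break among the weakest is an assumption of this sketch.
        loser = min(remaining, key=lambda c: (tallies[c], c))
        remaining.remove(loser)


rankings = [
    ["A", "B", "C"],
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["B", "C", "A"],
    ["C", "B", "A"],
]
borda = borda_scores(rankings, m=3)   # A: 10, B: 12, C: 8
mrr = mrr_scores(rankings)            # B has the highest mean reciprocal rank
winner = irv_winner(rankings)         # C is eliminated, its ballot moves to B
```

In this toy input, first choices are split 2/2/1 between A, B, and C, so majority voting ties between A and B, while all three ranked rules select B because B is never ranked last.
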
Ranked Answer Generation:
- Function: Guide the LLM to output multiple candidates and their ranking using few-shot examples.
- Mechanism: Show examples in prompt, requiring the model to reason first, then output "The ranking of options by likelihood is: A > B > D > C".
- Design Motivation: Simple yet effective, with no changes to the model architecture; only the prompt and the voting method are modified.
- For multiple-choice questions: Rank all options; for open-ended QA: Generate \(m\) most likely answers and rank them.
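
Turning such a completion into a ranking is a small parsing step. The sketch below assumes the model reproduces the sentence "The ranking of options by likelihood is:" verbatim and separates candidates with `>`; the function name `parse_ranking` and the choice to use the last matching line are assumptions of this example, not details given in the source.

```python
import re
from typing import List, Optional

# Matches the final-answer sentence and captures the rest of that line.
_RANKING_LINE = re.compile(
    r"The ranking of options by likelihood is:\s*(.+)", re.IGNORECASE
)


def parse_ranking(completion: str) -> Optional[List[str]]:
    """Return the candidates in ranked order, or None if no ranking line is found."""
    matches = _RANKING_LINE.findall(completion)
    if not matches:
        return None
    # Use the last occurrence, since reasoning text precedes the final ranking.
    tail = matches[-1].strip().rstrip(".")
    ranking: List[str] = []
    for part in tail.split(">"):
        candidate = part.strip()
        # Drop empty fragments and repeated candidates, keeping the first position.
        if candidate and candidate not in ranking:
            ranking.append(candidate)
    return ranking or None


completion = (
    "Let's reason step by step about each option.\n"
    "The ranking of options by likelihood is: A > B > D > C"
)
parsed = parse_ranking(completion)
```

Returning `None` for unparseable completions lets the caller decide whether to drop the path or resample it.
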
Tie-Breaking:
- Function: Decision-making mechanism when multiple candidates obtain identical scores.
- Mechanism: Calculate a token-probability confidence score \(\mathcal{S}_i = \sum_{t=1}^{n} \log(p(\mathcal{C}_{i,t}))\) for each tied candidate \(i\), summing the log-probabilities of its \(n\) tokens \(\mathcal{C}_{i,t}\), and select the candidate with the highest confidence.
- Result: Ranked voting itself significantly reduces ties (from 5.08% to 2.29%).
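
A possible shape for this tie-breaking step is sketched below. It assumes each candidate comes with one list of token log-probabilities; how log-probabilities from several reasoning paths are combined is not specified here, so that input format, like the name `break_tie`, is an assumption of the example.

```python
import math
from typing import Dict, Sequence


def confidence(token_logprobs: Sequence[float]) -> float:
    """S_i: sum of the log-probabilities of a candidate's tokens."""
    return sum(token_logprobs)


def break_tie(
    scores: Dict[str, float],
    candidate_logprobs: Dict[str, Sequence[float]],  # assumed format: candidate -> token log-probs
) -> str:
    """Return the top-scoring candidate, using confidence only when scores tie."""
    best = max(scores.values())
    tied = [cand for cand, score in scores.items() if math.isclose(score, best)]
    if len(tied) == 1:
        return tied[0]
    return max(tied, key=lambda cand: confidence(candidate_logprobs[cand]))


scores = {"A": 0.6, "B": 0.6, "C": 0.5}
logprobs = {"A": [-0.2, -0.4], "B": [-0.1, -0.3], "C": [-0.05]}
chosen = break_tie(scores, logprobs)  # A and B tie; B has the higher summed log-prob
```

Because the score is a plain sum of log-probabilities, longer candidates tend to score lower; whether any length normalization is applied is not stated in the source.
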
Few-Shot Prompt Construction:
- Core Guideline: Ensure strong semantic correlation between questions and candidate answers.
- Expansion Strategy: Manually construct template examples, then use an LLM to automatically generate more examples, which need only minimal human validation.

Loss & Training¶

Zero-training method, applied purely at inference time.
Applicable to both open-source and closed-source LLMs.

Key Experimental Results¶

Main Results¶

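| Model | Method | AQUA-RAT | CommonsenseQA | ARC-C | Average (all datasets) |
|---|---|---|---|---|---|
| LLaMA-3.2-3B | SC | 61.81% | 73.46% | 80.54% | 62.47% |
| LLaMA-3.2-3B | MRRV | 71.26% | 74.45% | 81.40% | 65.79% |
| Qwen-2.5-3B | SC | 77.95% | 77.89% | 76.96% | 60.13% |
| Qwen-2.5-3B | IRV | 79.13% | 78.95% | 83.45% | 64.76% |
| LLaMA-3-8B | SC | 66.93% | 78.71% | 86.77% | 68.04% |
| LLaMA-3-8B | MRRV | 75.20% | 79.36% | 87.63% | 71.55% |
| Phi-3-4B | SC | 73.62% | 75.84% | 90.13% | 67.73% |
| Phi-3-4B | MRRV | 75.20% | 78.95% | 90.44% | 69.53% |
| GPT-3.5-turbo | SC | - | - | - | 71.36% |
| GPT-3.5-turbo | MRRV | - | - | - | 76.69% |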

Ablation Study¶

| Configuration | Finding |
|---|---|
| \(k = 2 \rightarrow 16\) | Ranked voting consistently outperforms majority voting across all \(k\) values |
| Single ranked answer (no voting) | No consistent improvement \(\rightarrow\) shows that the improvement stems from ranked voting aggregation |
| Randomly shuffle few-shot | Ranked voting exhibits lower variance and is more robust |
| Candidate count \(c = 1 \rightarrow 5\) | Outperforms SC when \(c \ge 4\) |

Key Findings¶

Largest improvement on AQUA-RAT (+9.45 points for LLaMA-3.2-3B and +8.27 points for LLaMA-3-8B in the Main Results table): Distractors in multiple-choice questions often land in the top-2/3 positions.
Diminishing gains on stronger models: Only +0.48% on GPT-4-turbo, while +5.33% on GPT-3.5.
MRRV performs the best overall: Its non-linear decay weights make fuller use of ranking information.
Reduced tie rate: Ranked voting reduces tie rates from 5.08% to 2.29% (using IRV).
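
The average gains can be re-derived from the Average column of the Main Results table. The check below uses only numbers listed in that table (GPT-4-turbo is omitted because its scores are not listed there); the dictionary name `averages` is just illustrative.

```python
# (SC average, ranked-voting average) per model, copied from the Main Results table.
averages = {
    "LLaMA-3.2-3B": (62.47, 65.79),
    "Qwen-2.5-3B": (60.13, 64.76),
    "LLaMA-3-8B": (68.04, 71.55),
    "Phi-3-4B": (67.73, 69.53),
    "GPT-3.5-turbo": (71.36, 76.69),
}

# Absolute gain in percentage points, rounded to two decimals.
gains = {model: round(ranked - sc, 2) for model, (sc, ranked) in averages.items()}

for model, gain in gains.items():
    print(f"{model}: +{gain:.2f}")
```

The GPT-3.5-turbo entry reproduces the +5.33 figure quoted above, and the remaining gains range from +1.80 (Phi-3-4B) to +4.63 (Qwen-2.5-3B).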

Highlights & Insights¶

Cross-domain transfer from voting theory to LLM reasoning: Introducing established voting methods from social choice theory to SC. The strategy is simple yet effective.
Zero-cost improvement: No modifications to the model, no extra training required, only modifications to the prompt and post-processing.
Reveals information waste in SC: Majority voting uses only top-1 information, an observation relevant to every method built on SC.

Limitations & Future Work¶

Limited candidates: The diversity of ranked candidates in open-ended QA depends heavily on the model's generation capacity.
Small gains on stronger models: GPT-4-level models already have very high top-1 accuracy, leading to diminishing marginal returns from ranking information.
Only three voting methods explored: More complex voting methods, such as the Schulze method or Copeland method, remain unexplored.
Potential improvements: Combining weighted or adaptive voting strategies.

Comparison with Related Work¶

vs Self-Consistency (Wang et al., 2023): SC uses only top-1 votes, whereas this work uses the complete ranking. It achieves an average gain of 2-5% on SC benchmarks without additional sampling cost (the sample size \(k\) is unchanged).
vs Universal SC (Chen et al., 2023): USC also improves SC but focuses on answer equivalence determination (e.g., treating "1/2" and "0.5" as identical), which is orthogonal to and can be combined with ranked voting.
vs Adaptive-SC: Adaptive-SC dynamically adjusts the number of samples but still relies on majority voting. Ranked voting is consistently better under the same sampling budget.
vs Best-of-N: Selects the response with the highest model probability directly without voting. Ranked voting outperforms Best-of-N across all models.

Rating¶

Novelty: ⭐⭐⭐⭐ Simple and effective cross-domain transfer of voting theory.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 6 datasets, multiple model sizes, and detailed ablations.
Writing Quality: ⭐⭐⭐⭐ Clear formulas and comprehensive experiments.
Value: ⭐⭐⭐⭐ Highly practical, plug-and-play upgrade of SC.