
Scalable Best-of-N Selection for Large Language Models via Self-Certainty

Conference: NeurIPS 2025 · arXiv: 2502.18581 · Code: GitHub · Area: LLM Inference & Selection Strategies · Keywords: Best-of-N, self-certainty, distributional quantification, reward-model-free, inference scaling

TL;DR

This paper proposes Self-Certainty, a metric that quantifies model confidence via the token probability distribution of LLM outputs, enabling scalable Best-of-N selection without any auxiliary reward model. The approach achieves performance comparable to or exceeding reward-model-based methods.

Background & Motivation

Cost: Existing Best-of-N methods rely on reward models (ORM/PRM), incurring high training and inference costs.

Distribution Shift: Reward models are susceptible to distributional shift and prone to reward hacking.

Limitations of Prior Work: Self-consistency is restricted to tasks with deterministic answers, limiting its generality.

Opportunity: The token probability distribution of LLMs inherently encodes model confidence and can be leveraged directly.

Method

Overall Architecture

Self-Certainty is grounded in distributional confidence quantification. The core idea:

- A more concentrated token probability distribution indicates higher model confidence.
- The distance between the output distribution and the uniform distribution quantifies token-level confidence.
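
One way to make this precise (using the same notation as the Self-Certainty formula below): the per-token distance from the uniform distribution \(U\) over a vocabulary of size \(V\) is

\[D_{\mathrm{KL}}\bigl(U \,\|\, p(\cdot \mid x, y_{<i})\bigr) = \sum_{j=1}^{V} \frac{1}{V} \log\frac{1/V}{p(j \mid x, y_{<i})} = -\frac{1}{V} \sum_{j=1}^{V} \log\bigl(V \cdot p(j \mid x, y_{<i})\bigr)\]

which equals zero for a uniform (maximally uncertain) next-token distribution and grows as the distribution concentrates; averaging over the \(n\) generated tokens yields the Self-Certainty metric defined next.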

Key Designs

1. Self-Certainty Metric: defined as the average KL divergence from the uniform distribution over the vocabulary (equivalently, a cross-entropy against the uniform distribution):

\[\text{Self-Certainty} = -\frac{1}{nV} \sum_{i=1}^{n} \sum_{j=1}^{V} \log\bigl(V \cdot p(j \mid x, y_{<i})\bigr)\]

where \(n\) is the response length, \(V\) is the vocabulary size, and \(p(j \mid x, y_{<i})\) is the probability assigned to vocabulary token \(j\) given the prompt \(x\) and the previously generated tokens \(y_{<i}\); a code sketch of the full selection pipeline follows this list.

2. Baseline Metrics for Comparison:
   - AvgLogP: average log-probability of the generated tokens
   - Perplexity: exponential of the negative average log-probability
   - Entropy: information-theoretic entropy of the per-token distribution, averaged over tokens
   - Gini Impurity: impurity measure borrowed from decision trees
   - DP: distributional perplexity

3. Borda Voting Fusion:
   - Rank the \(N\) samples by confidence score
   - Assign weighted votes \(v(r) = (N - r + 1)^p\), where \(r\) is the rank (1 = most confident)
   - Aggregate answers by accumulating votes; the highest-voted answer wins
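
A minimal sketch of the selection pipeline described above, assuming access to each sampled response's full-vocabulary log-probabilities (an `(n_tokens, vocab_size)` array, e.g. taken from the model's logits) and a parsed final answer; all function and field names here are illustrative, not from the paper's released code:

```python
import numpy as np
from collections import defaultdict

def self_certainty(logprobs: np.ndarray) -> float:
    """Average KL divergence from the uniform distribution:
    -1/(nV) * sum_i sum_j log(V * p(j | x, y_<i)).
    logprobs: (n_tokens, vocab_size) array of log p(j | x, y_<i)."""
    n_tokens, vocab_size = logprobs.shape
    return float(-(np.log(vocab_size) + logprobs).mean())

def avg_logp(chosen_logprobs: np.ndarray) -> float:
    """Baseline: mean log-probability of the tokens actually generated."""
    return float(chosen_logprobs.mean())

def neg_perplexity(chosen_logprobs: np.ndarray) -> float:
    """Baseline: negative perplexity (higher means more confident)."""
    return float(-np.exp(-chosen_logprobs.mean()))

def borda_vote(answers, scores, p: float = 1.2):
    """Rank the N samples by confidence and give rank r (1 = most confident)
    a vote of (N - r + 1) ** p; return the answer with the largest total."""
    N = len(answers)
    order = sorted(range(N), key=lambda i: scores[i], reverse=True)
    votes = defaultdict(float)
    for rank, idx in enumerate(order, start=1):
        votes[answers[idx]] += (N - rank + 1) ** p
    return max(votes, key=votes.get)

def select(responses, p: float = 1.2):
    """Reward-model-free selection: score each sampled response by
    Self-Certainty, then fuse the parsed answers with Borda voting.
    responses: list of dicts with 'answer' and 'logprobs' fields (illustrative)."""
    scores = [self_certainty(r["logprobs"]) for r in responses]
    return borda_vote([r["answer"] for r in responses], scores, p=p)
```

For plain Best-of-N selection without answer fusion, one can instead simply return the response with the highest Self-Certainty score.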

Key Experimental Results

Confidence Metric Comparison (LiveBench-Math)

| Metric | N=8 | N=16 | N=32 | N=64 | Trend with N |
|---|---|---|---|---|---|
| AvgLogP | 17.66% | 17.5% | 18.2% | 18.3% | Flat |
| Perplexity | 20.44% | 18.3% | 16.5% | 15.8% | Declining |
| Entropy | | | | | Complementary |
| KL-Divergence | 22.1% | 25.2% | 28.1% | 29.8% | Rising ✓ |
| Self-Certainty | 20.87% | 22.01% | 27.5% | 28.5% | Strongly rising ✓ |

Main Results (Table 1)

| Method | LiveBench-Math | GSM8K | MATH | CRUXEval-O | Avg. |
|---|---|---|---|---|---|
| Greedy | 12.23% | 47.96% | 46.02% | 39.88% | 36.5% |
| Self-Consistency | 22.50% | 89.42% | 58.60% | 47.58% | 56.15% |
| Self-Certainty | 20.87% | 87.32% | 54.63% | 45.38% | 52.71% |
| Borda Voting (\(p{=}1.2\)) | 23.21% | 89.51% | 59.04% | 47.93% | 56.51% |

Ablation Study: Effect of \(p\) in Borda Voting

| \(p\) | N=8 | N=16 | N=32 | N=64 | Characteristics |
|---|---|---|---|---|---|
| 0.0 | 23.02% | 22.5% | 22.5% | 26.25% | Majority voting |
| 0.3 | 23.69% | 26.5% | 26.5% | 26.47% | Recommended for small \(N\) |
| 0.7–1.2 | 23.21% | 26.69% | 26.69% | 26.41% | Optimal range |
| 2.0+ | 22.45% | 26.41% | 24.1% | 18.2% | Degradation |
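
As a rough illustration of how \(p\) reweights the ranks (back-of-the-envelope arithmetic, not a result from the paper): with \(N = 8\), the top-ranked sample receives \(8^p\) votes against \(1\) for the last rank, i.e. \(1{:}1\) at \(p = 0\) (plain majority voting), about \(1.9{:}1\) at \(p = 0.3\), about \(12.1{:}1\) at \(p = 1.2\), and \(64{:}1\) at \(p = 2\); a large \(p\) thus collapses toward trusting only the single most confident sample, consistent with the degradation reported in the last row.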

Open-Ended Task Performance (LiveCodeBench)

| Model | Greedy | USC | Self-Certainty | Borda Voting |
|---|---|---|---|---|
| Llama-8B | 42.93% | 43.78% | 45.83% | 50.85% |
| Qwen-32B | 78.6% | 76.8% | 79.5% | 81.2% |

Highlights & Insights

  1. Intrinsic Signal Theory: LLM token distributions inherently encode confidence, requiring no external annotation.
  2. No Distributional Bias: Self-Certainty relies solely on native probabilities, avoiding the biases of learned reward models.
  3. Length Robustness: Unlike negative perplexity, Self-Certainty does not inflate scores for longer sequences.
  4. Open-Ended Generalization: Overcomes the deterministic-answer constraint of self-consistency, extending applicability to tasks such as code generation.

Limitations & Future Work

  1. Spurious Confidence: Certain generations may receive high scores due to superficial confidence despite being factually incorrect.
  2. Performance Gap: Self-consistency retains a marginal ~0.5% advantage on closed-form MATH tasks.
  3. Hyperparameter Sensitivity: The Borda exponent \(p\) requires tuning with respect to \(N\) and task type.
  4. Theoretical Gap: The deeper theoretical explanation for why KL divergence outperforms other metrics remains incomplete.

Related Work

  • Best-of-N: Self-consistency, USC, ORMs/PRMs
  • Confidence Estimation: BSDetector, TrustScore, self-evaluation methods
  • Inference Scaling: Test-time compute scaling, multi-path reasoning

Rating

⭐⭐⭐⭐⭐