ConfTuner: Training Large Language Models to Express Their Confidence Verbally

Conference: NeurIPS 2025
arXiv: 2508.18847
Code: GitHub
Area: LLM Evaluation
Keywords: Confidence Calibration, Verbalized Confidence, Brier Score, Proper Scoring Rule, LLM Overconfidence

TL;DR

ConfTuner proposes a tokenized Brier score loss function (theoretically proven to be a proper scoring rule) that requires only 2000 samples and 4 minutes of LoRA fine-tuning to enable LLMs to output calibrated verbalized confidence (e.g., "I am 80% confident"), reducing ECE by up to 60.9% and supporting downstream applications like self-correction and model cascading.

Background & Motivation

Background: Reliable deployment of LLMs in high-risk domains (medicine, law, science) requires knowing "how certain the model is." Existing approaches include logit-based calibration (which is inapplicable to API models) and verbalized confidence (prompting the model to state its confidence).

Limitations of Prior Work: LLMs suffer from severe overconfidence—frequently asserting "I am 100% confident" even when the answer is incorrect. Methods like SaySelf require large-scale data (9 million samples), while LACIE requires 100k samples and long training times. A key theoretical question remains: what kind of loss function can guarantee confidence calibration?

Key Challenge: Verbalized confidence consists of discrete tokens (e.g., "80%" is a token rather than a numeric probability), while the classical Brier score is defined on continuous probabilities. How can proper scoring rule theory be extended to the token space?

Goal: (1) Design a theoretically-guaranteed confidence calibration loss; (2) achieve calibration at an extremely low cost (2000 samples / 4 minutes).

Key Insight: Generalize the Brier score from scoring probability values to scoring "probability distributions over confidence tokens"—if the model assigns more probability mass to the "80%" token, and the true accuracy is indeed close to 80%, the loss is minimized.

Core Idea: Tokenized Brier score = a proper scoring rule over confidence tokens, forcing the model to place the highest probability on the token closest to the true accuracy.

Method

Overall Architecture

Two stages: (1) extract the logits of the confidence tokens from the SFT model and apply a softmax to obtain the distribution \(\mathbf{q}\); (2) fine-tune the model using LoRA with the tokenized Brier score \(\ell(\mathbf{q}, y) = \sum_{i=0}^{N} q_i(y - i/N)^2\), where \(y \in \{0, 1\}\) indicates whether the answer is correct.
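
The loss in stage (2) can be sketched in a few lines. This is a plain-Python illustration of the formula only, not the authors' training code (which would need an autodiff framework). The function names and the assumption that index \(i\) of the logit list corresponds to confidence level \(i/N\) are introduced here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tokenized_brier_loss(conf_logits, correct):
    """Tokenized Brier score for a single example.

    conf_logits: logits of the N+1 confidence tokens; index i is ASSUMED
                 to stand for confidence level i/N.
    correct:     1 if the model's answer was right, else 0 (the indicator y).
    """
    n = len(conf_logits) - 1      # N in the formula
    q = softmax(conf_logits)      # distribution over confidence tokens
    # Expected squared gap between the stated level i/N and the outcome y.
    return sum(q_i * (correct - i / n) ** 2 for i, q_i in enumerate(q))
```

With eleven tokens (\(N = 10\)), a distribution peaked on the last token gives a loss near 0 for a correct answer and near 1 for a wrong one, which is the penalty on overconfidence that the fine-tuning relies on.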

Key Designs

Tokenized Brier Score (Core Contribution):
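- Function: Defines a loss over the softmax distribution \(\mathbf{q}\) on the confidence token set \(\mathcal{T}_N\), whose \(N+1\) tokens stand for the confidence levels \(0, 1/N, \ldots, 1\) (e.g., \(N = 10\) gives \(\{0\%, 10\%, ..., 100\%\}\))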
- Formula: \(\ell(\mathbf{q}, y) = \sum_{i=0}^{N} q_i (y - i/N)^2\)
- Theoretical guarantee (Theorem 1): This is a proper scoring rule—the optimal strategy for the model to minimize the loss is to concentrate probability on the token closest to the true conditional accuracy \(\eta(x)\)
- Design Motivation: The classical Brier score evaluates a single probability value, but LLMs output a distribution over tokens—hence the need for generalization
Minimalist LoRA Fine-Tuning:
- Function: Fine-tunes only the query/value projections with rank-8 LoRA
- Needs only 2000 samples and 4 minutes of training (single GPU)
- Comparison: SaySelf requires 9M samples/120 minutes, LACIE requires 100k/26 minutes
- Design Motivation: Preserve the model's original capabilities with minimal modifications
Downstream Applications (Self-Correction + Cascading):
- Self-Correction: Refusing to answer or rethinking when confidence is low \(\rightarrow\) 3-10% accuracy improvement
- Model Cascading: Routing to GPT-4o when confidence is low \(\rightarrow\) +9.3% on HotpotQA
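
The cascading idea above reduces to a threshold rule on the stated confidence. The sketch below is a minimal illustration under stated assumptions: `small_model` and `strong_model` are hypothetical callables standing in for the calibrated model and the stronger fallback, and the threshold value of 0.5 is arbitrary, not taken from the paper.

```python
from typing import Callable, Tuple

def cascade_answer(
    question: str,
    small_model: Callable[[str], Tuple[str, float]],  # hypothetical: returns (answer, confidence in [0, 1])
    strong_model: Callable[[str], str],               # hypothetical: stronger fallback model
    threshold: float = 0.5,                           # illustrative value, not from the paper
) -> Tuple[str, str]:
    """Keep the small model's answer unless its stated confidence is low."""
    answer, confidence = small_model(question)
    if confidence >= threshold:
        return answer, "small"
    # Low confidence: escalate the question to the stronger model.
    return strong_model(question), "strong"
```

Swapping `strong_model` for a second call to the same model with a "reconsider your answer" prompt would give the self-correction variant. In both cases the routing is only as good as the calibration of the confidence it reads.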

Loss & Training

Training combines the tokenized Brier score with LoRA. LLaMA uses \(\mathcal{T}_{100}\) (integer percentages from 0% to 100%); Qwen and Ministral use \(\mathcal{T}_9\) (ten discrete levels, 0 to 9).
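
Why this loss is proper (Theorem 1) can be seen from a short calculation. What follows is a sketch derived only from the formula given in this note, not the paper's proof. Here \(\eta = \Pr(y = 1 \mid x)\) is the true conditional accuracy, and \(i^\star\) is notation introduced for this sketch. Taking the expectation over \(y\) and using \(\sum_i q_i = 1\):

\[
\begin{aligned}
\mathbb{E}_{y}\big[\ell(\mathbf{q}, y)\big]
&= \sum_{i=0}^{N} q_i \Big[\eta \big(1 - \tfrac{i}{N}\big)^2 + (1 - \eta)\big(\tfrac{i}{N}\big)^2\Big] \\
&= \sum_{i=0}^{N} q_i \big(\eta - \tfrac{i}{N}\big)^2 + \eta(1 - \eta).
\end{aligned}
\]

The term \(\eta(1 - \eta)\) does not depend on \(\mathbf{q}\), and the remaining sum is a weighted average of squared distances to \(\eta\). It is therefore minimized by placing all mass on \(i^\star = \arg\min_i |\eta - i/N|\), the token closest to the true accuracy.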

Key Experimental Results

Main Results (Average ECE↓ / AUROC↑ on 5 Reasoning Benchmarks)

Model	ECE (Base)	ECE (ConfTuner)	Gain	AUROC (Base)	AUROC (ConfTuner)
LLaMA	0.2768	0.1082	-60.9%	0.5923	0.6740
Qwen	0.3781	0.2872	-23.9%	0.6155	0.6861
Ministral	0.4393	0.1884	-57.1%	0.5216	0.6810
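
For readers unfamiliar with the ECE column, the metric can be sketched as follows. This uses the standard equal-width binning definition; the number of bins and the binning scheme used in the paper are not stated in this note, so `n_bins=10` is an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1] (n_bins=10 is an assumed default).

    confidences: stated confidence per answer, each in [0, 1].
    correct:     1 if the corresponding answer was right, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        # Weight each bin's |confidence - accuracy| gap by its share of samples.
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece
```

A model that always says 100% but is right half the time scores an ECE of 0.5; one that says 50% in the same situation scores 0.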

Ablation Study

Configuration	Key Findings	Description
2000 vs 5000 vs 10000 samples	2000 is sufficient	Extremely low data requirement
4 min vs SaySelf 120 min	30x speedup	Highly computationally efficient
Verbalized variants (high/medium/low)	Effectively generalizes	Not limited to numerical confidence
Implicit confidence (expressed in CoT)	Also effective	Non-explicit "X%" can also be calibrated
Black-box models (GPT-4o)	Also improves	Via prompting framework
Self-correction	+3-10% accuracy	Practical downstream application
Model cascading	+9.3% (HotpotQA)	Confidence used for routing

Key Findings

The 60.9% ECE reduction on LLaMA (0.2768 to 0.1082) is the most prominent figure; AUROC also rises from 0.5923 to 0.6740, so the stated confidence separates correct from incorrect answers better than before.
2000 samples and 4 minutes of training set a very low cost bar: 30x less training time than SaySelf (120 minutes) and 6.5x less than LACIE (26 minutes).
Cascading applications demonstrate the practical utility of confidence—knowing when to query a stronger model.

Highlights & Insights

The Power of Theory: The proper scoring rule guarantees that the model cannot "cheat" (e.g., predicting 50% for all answers) to lower the loss—the sole optimal strategy is honest calibration. This offers a fundamental advantage over heuristically designed loss functions.
"2000 samples, 4 minutes": The extremely low cost means any team utilizing LLMs can perform confidence calibration, democratizing uncertainty estimation.
From Calibration to Decision-making: Calibration is not the ultimate end; self-correction and cascading are. The paper presents a complete "calibration \(\rightarrow\) utilization" pipeline.
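
The "cannot cheat" point in the first highlight can be checked numerically. The sketch below evaluates the expected tokenized Brier score for every point-mass strategy, assuming a hypothetical question type on which the model is right 80% of the time (\(\eta = 0.8\)) and an eleven-token set (\(N = 10\)). The helper names are illustrative.

```python
def expected_loss(q, eta):
    """Expected tokenized Brier score when the answer is correct with probability eta."""
    n = len(q) - 1
    return sum(
        q_i * (eta * (1 - i / n) ** 2 + (1 - eta) * (i / n) ** 2)
        for i, q_i in enumerate(q)
    )

def point_mass(index, n):
    """Distribution that puts all probability on confidence token `index`."""
    return [1.0 if i == index else 0.0 for i in range(n + 1)]

eta = 0.8   # assumed true accuracy for this illustration
n = 10      # tokens 0%, 10%, ..., 100%
losses = {i: expected_loss(point_mass(i, n), eta) for i in range(n + 1)}
best = min(losses, key=losses.get)   # index of the lowest expected loss
```

Here `best` is 8 (the "80%" token) with an expected loss of about 0.16, while always answering "50%" costs 0.25, so blanket hedging is strictly worse than stating the true accuracy.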

Limitations & Future Work

Restricted to a fixed confidence token set; flexible natural language expressions (e.g., "I am fairly sure but not entirely certain") are not handled.
A gap remains between theoretical guarantees and practice—data quality and optimization dynamics affect actual calibration.
Validated only on reasoning tasks; calibration for tasks like creative writing or translation remains unexplored.
The proper scoring rule assumes independent samples—dependencies in multi-turn dialogues are not considered.

Related Work

vs SaySelf (Xu et al., 2024): SaySelf iteratively improves confidence via self-training but requires 9 million samples; ConfTuner is theory-driven and requires only 2000 samples.
vs LACIE (Stengel-Eskin et al., 2024): LACIE fine-tunes on preference data in which a listener model judges the speaker's answers; ConfTuner directly calibrates the token distribution of the model itself.
vs Temperature Scaling: TS calibrates logit probabilities but does not change the verbalized output; ConfTuner directly calibrates verbalized expressions.

Rating

Novelty: ⭐⭐⭐⭐⭐ The theoretical generalization of the Tokenized Brier score is elegant and practically meaningful.
Experimental Thoroughness: ⭐⭐⭐⭐ 3 models × 5 benchmarks × downstream applications × efficiency comparisons.
Writing Quality: ⭐⭐⭐⭐⭐ Closely integrates theory with experimental narratives.
Value: ⭐⭐⭐⭐⭐ The 2000-sample + 4-minute calibration scheme has direct practical utility for LLM deployment.
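vs CalibrateBeforeUse: CBU corrects output-token probabilities at inference time (contextual calibration), which requires access to those probabilities; ConfTuner calibrates the confidence the model states in text, so the result can be read from plain output, which suits API scenarios better.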
vs Ensemble Methods (Self-Consistency): Estimating confidence by sampling multiple times for consistency is expensive (requiring \(N\) inferences); ConfTuner internalizes this knowledge into a single-pass output.
Inspiration for Other Modalities: The tokenized proper scoring rule framework can be generalized to visual confidence in VLMs, reliability in speech recognition, etc.
Implications for RLHF: ConfTuner's calibration can improve reward models in RLHF—calibrated confidence serves as a better training signal.
Value for Agent Systems: Autonomous agents need to know when to seek human help—calibrated confidence provides a reliable basis for such decisions.
Relationship to Bayesian Methods: ConfTuner's calibration can be viewed as mapping LLM outputs to approximate Bayesian posterior probabilities.
Multilingual Extensions: Natural expressions of confidence vary across languages; cross-lingual calibration remains an open problem.
Long-text Scenarios: When a response contains multiple claims, individual confidence for each claim is more valuable than a global confidence score.