ConfTuner: Training Large Language Models to Express Their Confidence Verbally
Conference: NeurIPS 2025
arXiv: 2508.18847
Code: GitHub
Area: LLM Evaluation
Keywords: Confidence Calibration, Verbalized Confidence, Brier Score, Proper Scoring Rule, LLM Overconfidence
TL;DR
ConfTuner proposes a tokenized Brier score loss function (theoretically proven to be a proper scoring rule) that requires only 2000 samples and 4 minutes of LoRA fine-tuning to enable LLMs to output calibrated verbalized confidence (e.g., "I am 80% confident"), reducing ECE by up to 60.9% and supporting downstream applications like self-correction and model cascading.
Background & Motivation
Background: Reliable deployment of LLMs in high-risk domains (medicine, law, science) requires knowing "how certain the model is." Existing approaches include logit-based calibration (which is inapplicable to API models) and verbalized confidence (prompting the model to state its confidence).
Limitations of Prior Work: LLMs suffer from severe overconfidence—frequently asserting "I am 100% confident" even when the answer is incorrect. Methods like SaySelf require large-scale data (9 million samples), while LACIE requires 100k samples and long training times. A key theoretical question remains: what kind of loss function can guarantee confidence calibration?
Key Challenge: Verbalized confidence consists of discrete tokens (e.g., "80%" is a token rather than a numeric probability), while the classical Brier score is defined on continuous probabilities. How can proper scoring rule theory be extended to the token space?
Goal: (1) Design a confidence calibration loss with a theoretical guarantee; (2) achieve calibration at an extremely low cost (2000 samples / 4 minutes).
Key Insight: Generalize the Brier score from scoring probability values to scoring "probability distributions over confidence tokens"—if the model assigns more probability mass to the "80%" token, and the true accuracy is indeed close to 80%, the loss is minimized.
Core Idea: Tokenized Brier score = a proper scoring rule over confidence tokens, forcing the model to place the highest probability on the token closest to the true accuracy.
Method
Overall Architecture
Two stages: (1) Extract the logits of the confidence tokens from the SFT model and normalize them with a softmax into a distribution \(\mathbf{q}\); (2) fine-tune the model using LoRA with the tokenized Brier score \(\ell(\mathbf{q}, y) = \sum_{i=0}^{N} q_i (y - i/N)^2\), where \(y \in \{0, 1\}\) is the correctness indicator.
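To make the loss concrete, here is a minimal pure-Python sketch of the formula above. It assumes the logits of the confidence tokens have already been extracted from the model; the function names and the example logits are hypothetical and are not taken from the paper's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tokenized_brier_loss(conf_logits, y):
    """Tokenized Brier score: sum_i q_i * (y - i/N)^2.

    conf_logits: logits of the N+1 confidence tokens (index i means confidence i/N).
    y: 1 if the model's answer was correct, 0 otherwise.
    """
    n = len(conf_logits) - 1
    q = softmax(conf_logits)
    return sum(q_i * (y - i / n) ** 2 for i, q_i in enumerate(q))


# Hypothetical logits over 11 tokens (0%, 10%, ..., 100%), peaked at "80%".
logits = [0.0] * 11
logits[8] = 5.0
loss_if_correct = tokenized_brier_loss(logits, 1)
loss_if_wrong = tokenized_brier_loss(logits, 0)
```

With these logits, a correct answer gives a much smaller loss than an incorrect one, because most of the probability mass sits on a high-confidence token.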
Key Designs
- Tokenized Brier Score (Core Contribution):
  - Function: Defines a loss over the softmax distribution \(\mathbf{q}\) on the confidence token set \(\mathcal{T}_N\) of \(N+1\) levels, where token \(i\) stands for confidence \(i/N\) (e.g., \(\{0\%, 10\%, \ldots, 100\%\}\) for \(N = 10\))
  - Formula: \(\ell(\mathbf{q}, y) = \sum_{i=0}^{N} q_i (y - i/N)^2\), where \(y \in \{0, 1\}\) indicates whether the answer is correct
  - Theoretical guarantee (Theorem 1): This is a proper scoring rule: the expected loss is minimized by concentrating probability on the token closest to the true conditional accuracy \(\eta(x)\)
  - Design Motivation: The classical Brier score evaluates a single probability value, but LLMs output a distribution over tokens, hence the need for generalization
- Minimalist LoRA Fine-Tuning:
  - Function: Fine-tunes only the query/value projections with rank-8 LoRA
  - Cost: Needs only 2000 samples and 4 minutes of training (single GPU)
  - Comparison: SaySelf requires 9M samples / 120 minutes; LACIE requires 100k samples / 26 minutes
  - Design Motivation: Preserve the model's original capabilities with minimal modifications
- Downstream Applications (Self-Correction + Cascading):
  - Self-Correction: Refusing to answer or rethinking when confidence is low \(\rightarrow\) 3-10% accuracy improvement
  - Model Cascading: Routing to GPT-4o when confidence is low \(\rightarrow\) +9.3% on HotpotQA
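The cascading idea can be sketched as a simple confidence-gated router. This is an illustrative sketch only: the `cascade` function, the stub models, and the `threshold` value of 0.5 are assumptions made for the example, not the routing rule used in the paper.

```python
def cascade(question, small_model, large_model, threshold=0.5):
    """Answer with the small model unless its stated confidence is below threshold.

    small_model(question) -> (answer, confidence in [0, 1])
    large_model(question) -> answer
    The threshold is a hypothetical choice and would need tuning on held-out data.
    """
    answer, confidence = small_model(question)
    if confidence >= threshold:
        return answer, "small"
    return large_model(question), "large"


# Stub models standing in for a calibrated LLM and a stronger fallback model.
def stub_small(question):
    known = {"capital of France?": ("Paris", 0.9)}
    return known.get(question, ("unsure", 0.2))


def stub_large(question):
    return "answer from the stronger model"


confident = cascade("capital of France?", stub_small, stub_large)
escalated = cascade("a hard multi-hop question", stub_small, stub_large)
```

Routing like this only helps if low stated confidence actually tracks errors, which is why calibration comes first.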
Loss & Training
Training minimizes the tokenized Brier score with LoRA. LLaMA uses \(\mathcal{T}_{100}\) (confidence levels 0-100%), while Qwen and Ministral use \(\mathcal{T}_9\) (discrete levels 0-9).
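A short worked derivation shows why minimizing this loss pushes mass toward the token nearest the true accuracy. It is a sketch of the standard argument, not a reproduction of the paper's proof of Theorem 1. Write \(\eta = \eta(x) = P(y = 1 \mid x)\) and take the expectation of the loss over \(y\), using \(\sum_i q_i = 1\):

\[
\mathbb{E}_{y}\big[\ell(\mathbf{q}, y)\big]
= \sum_{i=0}^{N} q_i \Big[\eta \,(1 - i/N)^2 + (1 - \eta)\,(i/N)^2\Big]
= \sum_{i=0}^{N} q_i \,(i/N - \eta)^2 + \eta(1 - \eta).
\]

The term \(\eta(1 - \eta)\) does not depend on \(\mathbf{q}\), and the remaining sum is a weighted average of the squared distances \((i/N - \eta)^2\). It is therefore smallest when all probability sits on the index \(i\) that minimizes \(|i/N - \eta|\).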
Key Experimental Results
Main Results (Average ECE↓ / AUROC↑ on 5 Reasoning Benchmarks)
| Model | ECE (Base) | ECE (ConfTuner) | Relative ECE Change | AUROC (Base) | AUROC (ConfTuner) |
|---|---|---|---|---|---|
| LLaMA | 0.2768 | 0.1082 | -60.9% | 0.5923 | 0.6740 |
| Qwen | 0.3781 | 0.2872 | -24.0% | 0.6155 | 0.6861 |
| Ministral | 0.4393 | 0.1884 | -57.1% | 0.5216 | 0.6810 |
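For readers unfamiliar with the metric, the sketch below computes ECE with equal-width bins, along with the relative reduction reported in the table. The 10-bin scheme and the function names are assumptions made for illustration; the summary does not state which binning the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |accuracy - mean confidence|.

    confidences: stated confidences in [0, 1].
    correct: 1/0 correctness labels, same length as confidences.
    """
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += len(b) / total * abs(accuracy - avg_conf)
    return ece


def relative_reduction(base, tuned):
    """Fractional drop from base to tuned, e.g. 0.5 means the value was halved."""
    return (base - tuned) / base


# An overconfident model: always says 100% but is right only half the time.
overconfident_ece = expected_calibration_error([1.0] * 4, [1, 1, 0, 0])
# A calibrated model: says 50% and is right half the time.
calibrated_ece = expected_calibration_error([0.5] * 4, [1, 0, 1, 0])
llama_reduction = relative_reduction(0.2768, 0.1082)
```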
Ablation Study
| Configuration | Key Findings | Description |
|---|---|---|
| 2000 vs 5000 vs 10000 samples | 2000 is sufficient | Extremely low data requirement |
| 4 min vs SaySelf 120 min | 30x speedup | Highly computationally efficient |
| Verbalized variants (high/medium/low) | Effectively generalizes | Not limited to numerical confidence |
| Implicit confidence (expressed in CoT) | Also effective | Non-explicit "X%" can also be calibrated |
| Black-box models (GPT-4o) | Also improves | Via prompting framework |
| Self-correction | +3-10% accuracy | Practical downstream application |
| Model cascading | +9.3% (HotpotQA) | Confidence used for routing |
Key Findings
- The 60.9% ECE reduction (LLaMA) is the most prominent figure: verbalized confidence moves from a weak signal (base AUROC 0.52-0.62, close to chance) to a more meaningful one (AUROC 0.67-0.69 after ConfTuner).
- 2000 samples and 4 minutes of training set a very low cost bar: both compared methods (SaySelf, LACIE) need far more data and training time.
- Cascading applications demonstrate the practical utility of confidence—knowing when to query a stronger model.
Highlights & Insights
- The Power of Theory: The proper scoring rule guarantees that the model cannot "cheat" (e.g., predicting 50% for all answers) to lower the loss—the sole optimal strategy is honest calibration. This offers a fundamental advantage over heuristically designed loss functions.
- "2000 samples, 4 minutes": The extremely low cost means any team utilizing LLMs can perform confidence calibration, democratizing uncertainty estimation.
- From Calibration to Decision-making: Calibration is not the ultimate end; self-correction and cascading are. The paper presents a complete "calibration \(\rightarrow\) utilization" pipeline.
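The "cannot cheat" claim can be checked numerically. The sketch below assumes a hypothetical true accuracy of 0.73 and 11 confidence tokens, then compares the expected tokenized Brier loss of every one-hot prediction against two hedging strategies; all names and values are illustrative.

```python
def expected_loss(q, eta):
    """Expected tokenized Brier loss when the answer is correct with probability eta."""
    n = len(q) - 1
    return sum(
        q_i * (eta * (1 - i / n) ** 2 + (1 - eta) * (i / n) ** 2)
        for i, q_i in enumerate(q)
    )


def one_hot(index, n):
    """Distribution over n+1 tokens with all mass on one token."""
    return [1.0 if i == index else 0.0 for i in range(n + 1)]


eta = 0.73  # hypothetical true conditional accuracy
n = 10      # tokens 0%, 10%, ..., 100%

losses = [expected_loss(one_hot(i, n), eta) for i in range(n + 1)]
best_token = min(range(n + 1), key=losses.__getitem__)

always_fifty = expected_loss(one_hot(5, n), eta)          # "predict 50% for everything"
uniform_hedge = expected_loss([1 / (n + 1)] * (n + 1), eta)  # spread mass evenly
```

Here `best_token` is the index of the "70%" token, the one closest to 0.73, and both hedging strategies incur a higher expected loss than committing to it.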
Limitations & Future Work
- Restricted to a fixed confidence token set; flexible natural language expressions (e.g., "I am fairly sure but not entirely certain") are not handled.
- A gap remains between theoretical guarantees and practice—data quality and optimization dynamics affect actual calibration.
- Validated only on reasoning tasks; calibration for tasks like creative writing or translation remains unexplored.
- The proper scoring rule assumes independent samples—dependencies in multi-turn dialogues are not considered.
Related Work & Insights
- vs SaySelf (Xu et al., 2024): SaySelf iteratively improves confidence via self-training but requires 9 million samples; ConfTuner is theory-driven and requires only 2000 samples.
- vs LACIE (Stengel-Eskin et al., 2024): LACIE trains an auxiliary model to predict confidence; ConfTuner directly calibrates the token distribution of the model itself.
- vs Temperature Scaling: TS calibrates logit probabilities but does not change the verbalized output; ConfTuner directly calibrates verbalized expressions.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The theoretical generalization of the Tokenized Brier score is elegant and practically meaningful.
- Experimental Thoroughness: ⭐⭐⭐⭐ 3 models × 5 benchmarks × downstream applications × efficiency comparisons.
- Writing Quality: ⭐⭐⭐⭐⭐ Closely integrates theory with experimental narratives.
- Value: ⭐⭐⭐⭐⭐ The 2000-sample + 4-minute calibration scheme has direct practical utility for LLM deployment.
- vs CalibrateBeforeUse: CBU performs temperature scaling at inference time; ConfTuner directly calibrates the semantics of the output tokens, making it more suitable for API scenarios.
- vs Ensemble Methods (Self-Consistency): Estimating confidence by sampling multiple times for consistency is expensive (requiring \(N\) inferences); ConfTuner internalizes this knowledge into a single-pass output.
- Inspiration for Other Modalities: The tokenized proper scoring rule framework can be generalized to visual confidence in VLMs, reliability in speech recognition, etc.
- Implications for RLHF: ConfTuner's calibration can improve reward models in RLHF—calibrated confidence serves as a better training signal.
- Value for Agent Systems: Autonomous agents need to know when to seek human help—calibrated confidence provides a reliable basis for such decisions.
- Relationship to Bayesian Methods: ConfTuner's calibration can be viewed as mapping LLM outputs to approximate Bayesian posterior probabilities.
- Multilingual Extensions: Natural expressions of confidence vary across languages; cross-lingual calibration remains an open problem.
- Long-text Scenarios: When a response contains multiple claims, individual confidence for each claim is more valuable than a global confidence score.
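For contrast with the self-consistency comparison above, here is a minimal sketch of how sampling-based confidence is commonly estimated: draw several answers and use the majority answer's vote share as the confidence. The function name and the sample answers are hypothetical; the point is that this needs \(N\) model calls, whereas a verbalized confidence comes from a single pass.

```python
from collections import Counter


def self_consistency_confidence(sampled_answers):
    """Return the majority answer and its vote share across the samples."""
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


# Four hypothetical samples from the same prompt (each one costs a model call).
majority_answer, agreement = self_consistency_confidence(["A", "A", "B", "A"])
```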