Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator¶
Conference: NeurIPS 2025 · arXiv: 2505.16690 · Code: GitHub · Area: LLM Evaluation · Keywords: confidence calibration, temperature scaling, pre-trained LM, post-trained LM, unsupervised calibration, DACA
TL;DR¶
This paper identifies that post-training (SFT/RLHF/DPO) degrades the confidence calibration of pre-trained language models, and proposes DACA, a method that exploits the well-calibrated nature of pre-trained models by aligning the post-trained model's confidence distribution to the pre-trained model's exclusively on prediction-consistent samples. The result is label-free calibration of post-trained models, with up to 15.08% ECE improvement.
Background & Motivation¶
Problem Definition¶
The standard LLM training paradigm follows a "pre-training → post-training" pipeline. Pre-trained language models (PLMs) generally exhibit well-calibrated confidence (i.e., the model's output confidence faithfully reflects its accuracy), but after post-training procedures such as SFT, RLHF, and DPO, models become overconfident—assigning high confidence scores to both correct and incorrect outputs.
Limitations of Prior Work¶
- Temperature Scaling is the most practical post-hoc calibration method, but it requires labeled data (a minimal sketch of labeled TS follows this list).
- Obtaining labels is extremely costly and time-consuming in domains such as mathematical reasoning and medical diagnosis.
- Large quantities of unlabeled data remain unused in real-world deployments.
- Prompt-based calibration methods (verbalization) yield limited effectiveness.
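For context, classic temperature scaling (Guo et al. 2017) fits a single temperature \(\tau\) by minimizing negative log-likelihood on a labeled held-out set. A minimal sketch (the search bounds are my choice, not from the paper):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature_labeled(logits, labels):
    """Classic temperature scaling: pick the tau that minimizes
    negative log-likelihood on a *labeled* validation set."""
    logits = np.asarray(logits)   # (n, k) model logits
    labels = np.asarray(labels)   # (n,) ground-truth class indices

    def nll(tau):
        z = logits / tau
        z = z - z.max(axis=1, keepdims=True)          # numerical stability
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(labels)), labels].mean()

    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x
```

DACA's goal is to recover a comparably good \(\tau\) without the `labels` argument.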
Core Insight¶
Pre-trained models are inherently well-calibrated: can the confidence of PLMs be leveraged to calibrate overconfident post-trained language models (PoLMs)? A key finding emerges: naive direct alignment leads to underconfidence, and the root cause is interference from prediction-disagreement examples.
Method¶
Overall Architecture¶
DACA (Disagreement-Aware Confidence Alignment) operates via the following mechanism:
- Detect whether the PLM and PoLM agree in their predictions.
- Align the confidence distributions of the two models exclusively on agreement samples.
- Solve for the optimal temperature parameter by minimizing KL divergence (a minimal sketch follows this list).
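A minimal NumPy/SciPy sketch of this procedure, assuming the PLM's probabilities and the PoLM's logits on the unlabeled set have already been computed (array names and optimization bounds are my choices, not the paper's):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits, tau=1.0):
    """Row-wise temperature-scaled softmax."""
    z = logits / tau
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def daca_temperature(plm_probs, polm_logits):
    """Fit a temperature for the PoLM by aligning its confidence
    distribution to the PLM's on prediction-agreement samples only.

    plm_probs:   (n, k) PLM probabilities on unlabeled inputs
    polm_logits: (n, k) PoLM logits on the same inputs
    """
    # Step 1: keep only samples where both models predict the same class.
    agree = plm_probs.argmax(axis=1) == polm_logits.argmax(axis=1)
    f = plm_probs[agree]
    z = polm_logits[agree]

    # Step 2: minimize mean KL(f || softmax(z / tau)) over tau.
    def objective(tau):
        g = softmax(z, tau)
        return np.mean(np.sum(f * (np.log(f + 1e-12) - np.log(g + 1e-12)), axis=1))

    return minimize_scalar(objective, bounds=(0.05, 20.0), method="bounded").x
```

At inference time the fitted temperature is simply applied to the PoLM's logits, i.e., confidences are read off `softmax(z, tau)`.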
Why Naive Alignment Fails¶
Naive confidence alignment minimizes the KL divergence between the PLM \(f\) and the temperature-scaled PoLM \(g_{\tau}\) over all unlabeled data:
\[\tau^{\text{naive}} \;=\; \arg\min_{\tau>0}\; \mathbb{E}_{\boldsymbol{x}\sim\mathcal{D}_u}\!\left[\mathrm{KL}\!\left(f(\boldsymbol{x})\,\|\,g_{\tau}(\boldsymbol{x})\right)\right]\]
Problem: When the two models disagree, the PLM's confidence reflects its own prediction accuracy rather than the accuracy of the PoLM's prediction. Since post-training typically improves the PoLM's accuracy, the PLM's confidence underestimates the PoLM's true correctness rate, pushing the temperature parameter \(\tau\) toward excessively large values.
Theoretical Analysis¶
Proposition 3.2: Even when the PLM is perfectly calibrated (\(ECE_f = 0\)), the perfectly aligned PoLM retains a nonzero ECE determined by \(\pi\), the proportion of disagreement samples. Conclusion: due to prediction disagreement, zero ECE cannot be achieved even under ideal alignment.
Proposition 3.3: For disagreement samples where the PoLM predicts class \(c\) but the PLM assigns probability \(< 1/k\) to \(c\), the optimal temperature is \(\tau^* = \infty\). Implication: on such disagreement samples the KL objective decreases monotonically in \(\tau\), so minimizing it keeps pushing the temperature upward without bound.
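This effect is easy to reproduce numerically. The toy values below (chosen for illustration, not taken from the paper) construct a single disagreement sample over \(k = 4\) classes in which the PLM puts less than \(1/k\) mass on the PoLM's predicted class; the KL objective then decreases monotonically as \(\tau\) grows:

```python
import numpy as np

def kl(p, q):
    """KL divergence KL(p || q) between discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def softmax(z, tau):
    """Temperature-scaled softmax."""
    e = np.exp(z / tau - np.max(z / tau))
    return e / e.sum()

# PoLM logits favor class 0; the PLM predicts class 1 and assigns
# class 0 only 0.10 < 1/k = 0.25, matching the Prop. 3.3 condition.
polm_logits = np.array([4.0, 1.0, 0.5, 0.0])
plm_probs   = np.array([0.10, 0.60, 0.20, 0.10])

# KL(f || g_tau) keeps shrinking as tau grows, so the minimizer
# would drive the temperature toward infinity on this sample.
for tau in [1.0, 2.0, 5.0, 10.0, 100.0]:
    print(f"tau={tau:6.1f}  KL={kl(plm_probs, softmax(polm_logits, tau)):.4f}")
```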
Key Design: DACA Loss Function¶
\[\tau^{\star} \;=\; \arg\min_{\tau>0}\; \mathbb{E}_{\boldsymbol{x}\sim\mathcal{D}_u}\!\left[\mathbf{1}\{\hat{y}=\hat{y}'\}\,\mathrm{KL}\!\left(f(\boldsymbol{x})\,\|\,g_{\tau}(\boldsymbol{x})\right)\right],\]
where \(\hat{y} = \arg\max_i f_i(\boldsymbol{x})\) is the PLM prediction and \(\hat{y}' = \arg\max_i g_i(\boldsymbol{x})\) is the PoLM prediction.
Core Operation: The indicator function \(\mathbf{1}\{\hat{y} = \hat{y}'\}\) filters out disagreement samples, computing KL divergence exclusively on agreement samples.
General Extension¶
The method generalizes to arbitrary post-hoc calibration approaches (e.g., vector scaling, matrix scaling): the temperature-scaled PoLM \(g_{\tau}\) is replaced by any parametric calibration map, and the same agreement-filtered KL objective is minimized over that map's parameters.
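For instance, a vector-scaling variant swaps the single \(\tau\) for a per-class scale and bias. A sketch under the same assumptions as the temperature-scaling code above (the parameterization details here are my choices):

```python
import numpy as np
from scipy.optimize import minimize

def daca_vector_scaling(plm_probs, polm_logits):
    """Agreement-filtered alignment with vector scaling:
    g(x) = softmax(w * z + b) instead of softmax(z / tau)."""
    agree = plm_probs.argmax(axis=1) == polm_logits.argmax(axis=1)
    f, z = plm_probs[agree], polm_logits[agree]
    k = z.shape[1]

    def objective(params):
        w, b = params[:k], params[k:]
        s = z * w + b
        s = s - s.max(axis=1, keepdims=True)
        g = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
        return np.mean(np.sum(f * (np.log(f + 1e-12) - np.log(g + 1e-12)), axis=1))

    x0 = np.concatenate([np.ones(k), np.zeros(k)])  # start at the identity map
    return minimize(objective, x0, method="L-BFGS-B").x
```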
Open-Domain QA Extension¶
For open-ended question answering, the P(True) method is adopted to obtain confidence scores: the model is prompted to judge whether its own generated answer is correct, and \(p(\text{Yes}|x, f)\) is used as the confidence score.
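A sketch of this confidence extraction, assuming a hypothetical `next_token_logprobs` helper on the model wrapper (the prompt wording is illustrative, not the paper's exact template):

```python
import math

# Illustrative P(True)-style prompt; not the paper's exact wording.
PTRUE_TEMPLATE = (
    "Question: {question}\n"
    "Proposed answer: {answer}\n"
    "Is the proposed answer correct? Answer Yes or No:"
)

def p_true_confidence(model, question, answer):
    """Return p(Yes) renormalized over {Yes, No} as the model's
    confidence in its own generated answer."""
    prompt = PTRUE_TEMPLATE.format(question=question, answer=answer)
    logprobs = model.next_token_logprobs(prompt)  # hypothetical helper API
    p_yes = math.exp(logprobs["Yes"])
    p_no = math.exp(logprobs["No"])
    return p_yes / (p_yes + p_no)
```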
Key Experimental Results¶
Main Results: Average Calibration Performance on MMLU (57 Subjects)¶
| Model | Method | ECE(%) ↓ | MCE(%) ↓ | AECE(%) ↓ | Brier ↓ |
|---|---|---|---|---|---|
| Qwen3-8B | Vanilla | 16.38 | 38.19 | 24.99 | 0.179 |
| Qwen3-8B | CAPE | 11.52 | 31.74 | 17.61 | 0.157 |
| Qwen3-8B | DACA | 8.39 | 23.70 | 12.60 | 0.144 |
| Qwen3-8B | TS (labeled) | 8.66 | 28.11 | 14.55 | 0.146 |
| Gemma-3-12B-IT | Vanilla | 23.68 | 48.51 | 35.89 | 0.235 |
| Gemma-3-12B-IT | DACA | 8.60 | 27.02 | 13.55 | 0.154 |
| Gemma-3-12B-IT | TS (labeled) | 9.75 | 29.80 | 15.60 | 0.159 |
| Yi-1.5-34B-Chat | Vanilla | 16.20 | 33.82 | 20.35 | 0.199 |
| Yi-1.5-34B-Chat | DACA | 9.47 | 19.90 | 11.70 | 0.174 |
| Llama-3-70B-IT | Vanilla | 12.87 | 36.87 | 23.84 | 0.143 |
| Llama-3-70B-IT | DACA | 7.84 | 24.28 | 13.16 | 0.120 |
Highlight: Under the label-free setting, DACA matches or surpasses labeled temperature scaling (e.g., Gemma-3-12B: 8.60 vs. 9.75).
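For reference, ECE, the headline metric in these tables, is the binned gap between confidence and accuracy. A minimal implementation (ten equal-width bins, a common default) might look like:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence, then average
    |accuracy - mean confidence| per bin, weighted by bin size."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```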
API Model Calibration (GPT-4o + Different PLMs)¶
| Calibration PLM | PLM ECE(%) | GPT-4o ECE(%) ↓ |
|---|---|---|
| None (Vanilla) | — | 21.23 |
| Llama-3-8B | 9.45 | 7.98 |
| Qwen2.5-7B | 6.99 | 7.82 |
| Gemma-3-12B | 4.42 | 6.99 |
Finding: Better-calibrated PLMs yield more effective DACA alignment. Smaller PLMs can be used to calibrate larger or closed-source models.
Ablation Study: Different Post-Training Strategies¶
| Post-Training | Vanilla ECE(%) | DACA ECE(%) |
|---|---|---|
| SFT | 14.85 | 4.57 |
| SFT + DPO | 25.12 | 5.42 |
| SFT + DPO + RLVR | 25.19 | 5.99 |
Finding: More aggressive post-training (with DPO/RLVR) leads to more severe overconfidence, yet DACA effectively calibrates all variants.
Open-Domain QA and Selective Classification¶
- On TruthfulQA, Qwen2.5-32B-Instruct ECE decreases from 30.96% to 5.24%.
- In selective classification, accuracy of high-confidence predictions improves significantly across all thresholds (0.5–0.95) after calibration (a sketch of this evaluation follows the list).
- The advantage is more pronounced at higher thresholds, where overconfidence is most detrimental.
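A minimal sketch of that selective-classification evaluation (the threshold grid is illustrative):

```python
import numpy as np

def selective_accuracy(confidences, correct, thresholds=(0.5, 0.7, 0.9, 0.95)):
    """Accuracy restricted to predictions whose confidence clears a threshold."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    out = {}
    for t in thresholds:
        kept = confidences >= t
        out[t] = correct[kept].mean() if kept.any() else float("nan")
    return out
```

Well-calibrated confidences make the kept subset genuinely more accurate as the threshold rises, which is exactly where overconfident vanilla models fail.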
Key Findings¶
- Verbalization-based methods (Elicitation series) perform substantially worse than logit-based methods.
- Larger models exhibit lower Vanilla ECE, consistent with prior findings.
- DACA is robust across model scales, architectures, and post-training strategies.
- When the temperature is instead fit on disagreement samples, it grows to extreme magnitudes, empirically validating the theoretical analysis (Proposition 3.3).
Highlights & Insights¶
- Profound insight: The identification of prediction disagreement as the root cause of naive alignment failure is a simple yet highly explanatory observation.
- Theory and experiments reinforce each other: Propositions 3.2 and 3.3 clearly explain why direct alignment leads to underconfidence, and the experiments perfectly validate the theory.
- Highly practical:
- Requires no labeled data.
- Applicable across architectures (small PLM calibrating large PoLM).
- Applicable to closed-source API models (GPT-4o, DeepSeek-V3).
- Minimal computational overhead (single inference pass + temperature optimization).
- Minimal yet effective method: The core contribution is essentially an indicator function filtering out disagreement samples, yet the empirical gains are remarkable.
- DACA without labels outperforms labeled TS, demonstrating that PLM calibration information is remarkably informative.
Limitations & Future Work¶
- Additional inference cost: An extra PLM inference pass is required to determine prediction agreement.
- Disagreement samples are discarded: Information from filtered disagreement samples is entirely unused.
- Logit accessibility required: The method is not applicable to fully black-box models that do not return logits.
- Calibration target is MCQA: Although extended to open-ended QA, the main experiments are limited to multiple-choice scenarios.
- Future directions:
- Design methods that leverage disagreement samples (e.g., weighting rather than complete filtering).
- Explore applicability to generative tasks.
- Integrate calibration into the RLHF training process as an in-training objective.
Related Work & Insights¶
- Relation to Temperature Scaling (Guo et al. 2017): TS is the foundational method but requires labels; DACA substitutes PLM confidence for labels, representing the first label-free post-hoc calibration approach.
- Comparison with CAPE (Jiang et al. 2023): CAPE calibrates by permuting option order as a form of prompt engineering; DACA is a principled method that achieves superior performance.
- Comparison with Shen et al. 2024 (Thermometer): Thermometer trains an auxiliary model to predict temperature, requiring labels and training; DACA requires neither.
- Insight: The calibration properties of PLMs constitute an underutilized asset that can be more effectively exploited within the post-training pipeline.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First label-free post-hoc calibration method for LLMs; the core insight is deep and elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive coverage of multiple model families, scales, post-training strategies, datasets, and both open- and closed-source models.
- Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are clear, experimental presentation is well-structured, and the transition from motivation to method is natural.
- Value: ⭐⭐⭐⭐⭐ — Addresses a core pain point in LLM deployment; the method is simple, practical, and immediately deployable.