
Towards Harmonized Uncertainty Estimation for Large Language Models

Conference: ACL2025
arXiv: 2505.19073
Code: O-L1RU1/Corrector4UE
Area: LLM/NLP
Keywords: Uncertainty Estimation, Reliable LLM Deployment, Calibration, Lightweight Corrector

TL;DR

The paper proposes CUE, a framework that trains a lightweight classifier (the Corrector) to predict the target LLM's performance and uses its output to correct existing uncertainty estimation scores. It reports harmonized improvements across three dimensions (indicativeness, precision-recall balance, and calibration), with performance gains of up to 60%.

Background & Motivation

  1. LLM Hallucination Risk: Large language models often make "confident mistakes" by generating hallucinations and factual errors, making it difficult for users to judge whether the output is reliable. Consequently, uncertainty estimation is required to quantify output trustworthiness.
  2. Trade-offs for Existing Methods across Three Metrics: The authors systematically evaluated existing methods from a classification perspective (AUROC for indicativeness, F1 for precision-recall balance) and a calibration perspective (ECE). They found highly unbalanced performance across these three dimensions—methods excelling in one dimension usually perform poorly in others.
  3. Poor Indicativeness of Basic Methods: Basic methods such as Lexical Similarity, Verbal Confidence, P(True), and Predictive Entropy show AUROC values close to random guessing (0.5), struggling to effectively distinguish between reliable and unreliable answers.
  4. Extremely Low F1 in Logit-based Methods: Enhanced methods like SAR and SE, despite improving AUROC, yield extremely low F1 scores and fail to balance precision and recall, leading to severe false positives or false negatives in practical scenarios.
  5. Common Neglect of Calibration: The vast majority of existing methods exhibit poor ECE. There is a severe mismatch between uncertainty scores and true probabilities, violating human intuitive understanding of probabilities.
  6. Poor Complementarity Among Methods: The authors combined uncertainty scores from different methods via a weighted sum, yet observed almost no improvement and sometimes degradation. This indicates that existing methods are homogeneous and lack complementary information sources.
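
The calibration metric in item 5 can be made concrete with a small example. The following is a minimal sketch of expected calibration error (ECE) with equal-width bins. It assumes each uncertainty score has already been turned into a confidence in \([0,1]\) (for example, one minus the normalized uncertainty) and that correctness is a 0/1 label. The function name and the default of 10 bins are illustrative choices, not details taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp the index so that conf == 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A method is well calibrated when answers given with confidence 0.9 are correct about 90% of the time; the sketch measures the average gap between those two quantities.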

Method

The CUE (Corrector for Uncertainty Estimation) framework consists of three steps:

1. Dataset Crafting

  • Extract QA pairs \(\mathcal{D}=\{(q_i,a_i)\}\) from existing datasets and have the target model \(M\) generate answers \(r_i\).
  • Employ a hybrid evaluation strategy to judge correctness: a rule-based method (correct if ROUGE-L > 0.7) and an LLM-based method (GPT-3.5 directly judging semantic equivalence), combined using an OR operation.
  • Assign a binary correctness label \(c_i\) to each sample (1 = correct), then invert it so that the training target represents uncertainty: \(\mathcal{D}^*_{\text{cor}}=\{(q_i, 1-c_i)\}\).
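
The labeling rule above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: ROUGE-L is computed as an F1 score over lowercased whitespace tokens (the exact ROUGE-L variant and tokenization are not given here), and `llm_judge` is a hypothetical callable standing in for the GPT-3.5 equivalence check.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 on lowercased whitespace tokens (a simplifying assumption)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def uncertainty_label(reference, response, llm_judge, threshold=0.7):
    """Return 1 - c_i: 0 if either check accepts the response, 1 otherwise."""
    # llm_judge is a hypothetical callable (reference, response) -> bool.
    correct = rouge_l_f1(reference, response) > threshold or llm_judge(reference, response)
    return 0 if correct else 1
```

Because the two checks are combined with OR, a response is labeled uncertain (1) only when both the rule-based and the LLM-based check reject it.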

2. Corrector Training

  • Construct a binary classifier using a lightweight encoder (e.g., RoBERTa or DeBERTa) and a fully connected layer.
  • The input is the [CLS] representation of the question text, and the output is mapped to a \([0,1]\) probability value via a sigmoid function.
  • Train the model by minimizing binary cross-entropy loss, enabling it to learn to predict whether the target LLM will make a mistake on the given question.
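
To illustrate the training objective, here is a minimal sketch of the classification head only: a single fully connected layer followed by a sigmoid, fitted by stochastic gradient descent on binary cross-entropy. It assumes the [CLS] representations are already available as plain feature vectors; the encoder itself (RoBERTa or DeBERTa in the paper) is not modeled, and the function names, learning rate, and epoch count are illustrative.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def bce_loss(p, y, eps=1e-12):
    """Binary cross-entropy for one prediction p and one label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def train_head(features, labels, lr=0.5, epochs=200):
    """Fit weights w and bias b of one linear layer by SGD on BCE."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # derivative of BCE w.r.t. the logit for a sigmoid output
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def correction_score(w, b, x):
    """C(x): predicted probability that the target LLM answers incorrectly."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

With labels defined as \(1-c_i\), a high output means the corrector expects the target LLM to fail on that question.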

3. Uncertainty Correcting

  • Normalize the original uncertainty score to \([0,1]\) using Min-Max scaling: \(U_{\text{norm}}(x)=\frac{U(x)-\min(U)}{\max(U)-\min(U)}\).
  • Combine the normalized score with the correction score \(C(x)\) output by the corrector as a weighted sum: \(U_{\text{cor}}(x)=w^*\cdot U_{\text{norm}}(x)+(1-w^*)\cdot C(x)\).
  • The optimal weight \(w^*\) is determined via grid search on the development set.
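
The three bullets above can be sketched end to end. This is a sketch under assumptions rather than the reference implementation: the grid-search criterion is taken to be AUROC on the development set (the selection metric is not stated here), the grid uses 21 evenly spaced weights, labels use 1 for an incorrect answer, and a constant score list is mapped to zeros to avoid division by zero.

```python
def min_max_normalize(scores):
    """U_norm(x) = (U(x) - min U) / (max U - min U)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # assumption: degenerate case maps to 0
    return [(s - lo) / (hi - lo) for s in scores]


def auroc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def corrected_scores(u_norm, c, w):
    """U_cor(x) = w * U_norm(x) + (1 - w) * C(x)."""
    return [w * u + (1 - w) * ci for u, ci in zip(u_norm, c)]


def grid_search_weight(u_norm, c, labels, steps=20):
    """Pick w in {0, 1/steps, ..., 1} that maximizes dev-set AUROC."""
    best_w, best_auc = 0.0, -1.0
    for k in range(steps + 1):
        w = k / steps
        auc = auroc(corrected_scores(u_norm, c, w), labels)
        if auc > best_auc:
            best_w, best_auc = w, auc
    return best_w, best_auc
```

Since \(w=1\) recovers the original normalized score and \(w=0\) the corrector alone, the selected weight can never score below either on the development set under the chosen criterion.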

Core Insight: The corrector provides global alignment information that is orthogonal to existing UE methods. It predicts the target model's failure probability directly from the question text, instead of relying on the target LLM's logits or on linguistic features.

Key Experimental Results

Experimental Setup: Target models include OPT-6.7B and LLaMA-3-8B-Instruct; datasets are TriviaQA (95K QA pairs) and SciQA (2565 QA pairs); baselines cover 9 methods across four categories (logit, verbalized, consistency, and internal state).

Table 1: AUROC and ECE Improvements (LLaMA-3-8B-Instruct, partial)

| Method | TriviaQA AUROC (Orig.→+Corrector) | TriviaQA ECE (Orig.→+Corrector) | SciQA AUROC (Orig.→+Corrector) |
|---|---|---|---|
| LS | 19.57→69.82 (+50.25) | 70.25→7.41 (-62.84) | 53.67→65.38 (+11.71) |
| VC | 62.34→74.89 (+12.55) | 23.41→16.78 (-6.63) | 68.22→72.15 (+3.93) |
| SE | 80.92→82.12 (+1.20) | 13.07→12.76 (-0.31) | 71.59→72.93 (+1.34) |
| SAR | 80.92→81.90 (+0.98) | 16.17→13.76 (-2.41) | 73.88→75.19 (+1.31) |

Table 2: Ablation Study (LLaMA-3-8B-Instruct + TriviaQA)

| Method | AUROC (↑) | ECE (↓) |
|---|---|---|
| Corrector Only | 69.87 | 6.73 |
| Original Best Method | 80.92 | 11.53 |
| +Corrector (Probability Value) | 82.12 | 10.46 |
| +Corrector (Label Value) | 80.92 | 11.53 |
| +GPT-4o Scoring | 80.92 | 11.53 |

Key Findings:

  • AUROC improved on average by 0.27 (TriviaQA) and 0.09 (SciQA).
  • F1 score improved by 38.97% on average.
  • The largest improvement was observed on weak baselines like LS (+50.25 AUROC), with stable gains also achieved on strong baselines such as SE.
  • Calibration using continuous probability values is far superior to using discrete label values or GPT-4o scoring.

Highlights & Insights

  • Simple and Universal Design: Corrector is orthogonal to all existing UE methods, plug-and-play, and does not require access to the target model's internal states.
  • Harmonized Improvements across Three Dimensions: Concurrently improves indicativeness, precision-recall balance, and calibration, rather than optimizing only a single metric.
  • Lightweight and Efficient: Employs small models like RoBERTa/DeBERTa as the Corrector, resulting in extremely low training and inference costs.
  • Solid Empirical Analysis: Systematically exposes the limitations of existing methods prior to presenting targeted solutions.

Limitations & Future Work

  • Dependence on Labeled Data: Requires domain-specific QA pairs associated with correctness labels from the target model to train the Corrector, leading to a relatively high cold-start cost.
  • Uncertain Cross-Domain Generalization: A Corrector trained on one domain may suffer from performance degradation when transferred to other knowledge domains.
  • Evaluation Limited to White-Box Models: Experiments were conducted solely on open-source models; validation on closed-source API models (such as GPT-4) is missing.
  • Normalization Relying on Global Statistics: Min-Max normalization requires the minimum and maximum of the entire test set, which poses challenges for real-time online deployment.

vs Semantic Entropy (SE)

SE estimates uncertainty by clustering semantically equivalent responses to calculate entropy. It performs well on AUROC but poorly on F1 and ECE. As a post-processing calibration framework, CUE can further enhance all three metrics on top of SE, with the two being highly complementary—SE leverages internal semantic information, while CUE exploits external alignment information.
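
For readers unfamiliar with SE, the clustering-then-entropy idea can be sketched as below. This is a simplified, frequency-based variant: cluster probabilities are estimated from how often sampled responses fall into each cluster, and `equivalent` is a hypothetical callable that decides whether two responses mean the same thing (SE itself uses a bidirectional entailment check for this). The greedy comparison against one representative per cluster is also a simplification.

```python
import math


def cluster_by_meaning(responses, equivalent):
    """Greedily group responses; each joins the first cluster it matches."""
    clusters = []
    for r in responses:
        for cluster in clusters:
            # Compare against the cluster's first member as its representative.
            if equivalent(cluster[0], r):
                cluster.append(r)
                break
        else:
            clusters.append([r])
    return clusters


def semantic_entropy(responses, equivalent):
    """Entropy over meaning clusters, using cluster frequency as probability."""
    clusters = cluster_by_meaning(responses, equivalent)
    n = len(responses)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)
```

If all sampled responses share one meaning the entropy is zero; the more the samples disagree in meaning, the higher the score. Note that this signal comes entirely from the target model's own outputs, which is why a question-side corrector can add information it lacks.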

vs SAR (Shifting Attention to Relevance)

SAR improves predictive entropy by shifting attention to the contributions of relevant tokens, serving as one of the current state-of-the-art methods. However, SAR also suffers from extremely low F1 and poor calibration. CUE brings an additional 0.98-3.35 AUROC improvement and significant ECE reductions on top of SAR, demonstrating that the information provided by the two methods is indeed orthogonal.

vs LARS (Learnable Response Scoring)

LARS also utilizes supervision signals to learn token probability dependencies, but it improves the scoring function itself. In contrast, CUE approaches the problem entirely from the question side to predict the target model's reliability without relying on the target LLM's output logits, which grants it broader applicability.

Rating

  • Novelty: ⭐⭐⭐ — The core idea of using an external classifier to calibrate uncertainty scores is simple and direct; the novelty is moderate but highly practical.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — 2 models × 2 datasets × 9 baselines; the ablation studies cover formats, models, and retrieval methods.
  • Writing Quality: ⭐⭐⭐⭐ — The problem analysis is crystal clear, the three-dimensional evaluation framework is convincing, and the figures are rich.
  • Value: ⭐⭐⭐⭐ — As a general-purpose post-processing module, it holds direct practical value for the reliable deployment of LLMs.