CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought
Conference: ACL 2025
arXiv: 2502.17214
Code: https://github.com/ZBox1005/CoT-UQ
Area: LLM Reasoning
Keywords: Uncertainty Quantification, Chain-of-Thought, Keyword Extraction, Confidence Calibration, Overconfidence
TL;DR
To address the overconfidence of LLMs in reasoning tasks, this paper proposes the CoT-UQ framework, which integrates keyword extraction and importance scoring from CoT reasoning steps into the uncertainty quantification process, achieving an average AUROC improvement of 5.9% on logical and mathematical reasoning tasks.
Background & Motivation
Background: LLMs possess powerful reasoning capabilities, but they struggle to accurately quantify the uncertainty of their generated responses. Existing UQ methods mainly fall into two categories: (a) Aggregated Probabilities (AP) based on token probability; (b) Self-Evaluation (SE), such as P(True).
Limitations of Prior Work: (a) Most UQ methods are prompt-wise rather than response-wise, requiring multiple response samples for the same question, which is computationally expensive; (b) LLMs suffer from severe overconfidence, especially when using Chain-of-Thought (CoT) reasoning, where the model tends to exhibit higher confidence in incorrect answers; (c) AP methods treat all tokens equally, allowing redundant tokens to interfere with uncertainty estimation.
Key Challenge: Although CoT reasoning improves response accuracy, it concurrently leads the model to be more "confident" in its own outputs (as the reasoning chain makes the answer appear more plausible), making UQ even more challenging.
Goal: How to leverage the LLM's own reasoning steps to improve uncertainty estimation, instead of merely looking at the probability of the final answer?
Key Insight: Keywords in the reasoning chain carry the most meaningful information, and different keywords contribute differently to the final answer. Accurate confidence estimation can be achieved by extracting and weighting these key terms.
Core Idea: Extract keywords from CoT reasoning steps and assess their importance, then replace raw token probabilities with weighted keyword probabilities to estimate uncertainty.
Method
Overall Architecture
CoT-UQ is a two-stage, four-step response-wise UQ framework. The inputs are the question and the response with CoT reasoning generated by the LLM, and the output is a calibrated confidence score. The first stage (Steps 1-3) extracts and evaluates key information from the reasoning process, and the second stage (Step 4) integrates this information into existing UQ strategies.
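As a rough illustration of the input the framework works on, the sketch below splits a step-formatted response into reasoning steps \(s_1, \ldots, s_k\) and a final answer \(a\). The `Final Answer:` marker and the function name `split_reasoning` are assumptions made for this example; the source only specifies the "Let's think step by step. Step 1:" prefix.

```python
import re

# Prefix quoted in the source; it is appended to the prompt in Step 1.
COT_PREFIX = "Let's think step by step. Step 1:"


def split_reasoning(response: str, answer_marker: str = "Final Answer:"):
    """Split a step-formatted response into (steps, answer).

    `answer_marker` is a hypothetical convention, not taken from the paper.
    Text before the first "Step n:" marker (if any) is kept as the first step,
    which covers the case where "Step 1:" was part of the prompt and the
    model's continuation starts directly with the content of step 1.
    """
    body, _, answer = response.partition(answer_marker)
    parts = re.split(r"Step\s+\d+\s*:", body)
    steps = [p.strip() for p in parts if p.strip()]
    return steps, answer.strip()


generated = (
    "Step 1: Tom has 3 boxes with 4 apples each, so 3 * 4 = 12 apples. "
    "Step 2: He eats 2 apples, so 12 - 2 = 10 remain. "
    "Final Answer: 10"
)
steps, answer = split_reasoning(generated)
```

Here `steps` holds the two reasoning steps and `answer` is `"10"`; the later steps (keyword extraction, importance scoring) would operate on each element of `steps`.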
Key Designs
- Step 1 - Reasoning Extraction:
    - Function: Guide the LLM to generate structured step-by-step reasoning.
    - Mechanism: Add the prefix "Let's think step by step. Step 1:" to the prompt so that the output consists of multi-step reasoning \(s_{1 \sim k} = s_1, \ldots, s_k\) followed by the final answer \(a\).
    - Design Motivation: Structured outputs facilitate subsequent step-by-step keyword extraction.
- Step 2 - Keywords Extraction:
    - Function: Extract keywords from each reasoning step.
    - Mechanism: Leverage the LLM's own information extraction capabilities to extract \(n_i\) keywords from each step \(s_i\), building the keyword set \(\mathcal{K} = \bigcup_{i=1}^{k} \{w_j^i\}_{j=1}^{n_i}\).
    - Design Motivation: Keywords are the most meaningful elements of the reasoning steps, so restricting attention to them removes interference from redundant tokens. Prior methods that average or take the minimum over all tokens let irrelevant tokens add noise, which degrades UQ accuracy.
- Step 3 - Importance Scoring:
    - Function: Have the LLM assess the importance of each keyword to the final answer.
    - Mechanism: Under a few-shot setup, provide the LLM with the full context (question, reasoning steps, answer, keywords) and let it score each keyword from 1 to 10. The updated keyword set is \(\mathcal{K} = \bigcup_{i=1}^{k} \{(w_j^i, t_j^i)\}_{j=1}^{n_i}\), where \(t_j^i\) is the importance score of keyword \(w_j^i\).
    - Design Motivation: Different keywords contribute differently to the correctness of the answer: directly relevant numerical values or entities are more critical, while auxiliary descriptive words are less important.
- Step 4 - Reasoning-Enhanced UQ Strategies:
    - Enhanced AP Strategy: An importance-weighted average of keyword probabilities replaces raw token-probability aggregation. For a keyword \(w\) made up of tokens \(w_1, \ldots, w_l\), the keyword probability is aggregated from token-level probabilities as \(p(\hat{w}) = \text{Aggr}_{m=1}^{l}\big(\mathbb{P}(w_m \mid p, w_1, \ldots, w_{m-1})\big)\), where \(p\) in the conditioning denotes the context preceding the keyword and \(\text{Aggr}\) is the aggregation of the underlying AP method (e.g., the minimum for Probas-min). The final confidence is the importance-weighted average \(c = \frac{\sum_{i,j} t_j^i \cdot p(\hat{w}_j^i)}{\sum_{i,j} t_j^i}\).
    - Enhanced SE Strategy: Four methods are proposed to inject reasoning information into the self-evaluation prompt: ALLSteps (adding all reasoning steps), ALLKeywords (adding all keywords), KEYStep (adding only the reasoning step with the highest mean keyword importance, \(s^* = s_{i^*}\) with \(i^* = \arg\max_i \frac{1}{n_i}\sum_{j=1}^{n_i} t_j^i\)), and KEYKeywords (adding only keywords whose importance reaches a threshold \(\tau\), \(\mathcal{K}^* = \{(w,t) \in \mathcal{K} \mid t \geq \tau\}\)).
    - Design Motivation: Enhanced AP reduces noise by focusing on key tokens, while enhanced SE assists the model in self-correction by providing extra reasoning context.
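The enhanced AP formula above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Keyword` container, the function names, and the example numbers are all hypothetical, and the token probabilities are assumed to have been read off the model's output distribution beforehand.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass
class Keyword:
    """Hypothetical container for one extracted keyword w_j^i."""
    text: str
    token_probs: Sequence[float]  # P(w_m | context, w_1..w_{m-1}) for each token
    importance: float             # LLM-assigned score t_j^i in [1, 10]


def keyword_prob(kw: Keyword, aggr: Callable = min) -> float:
    """Aggregate token probabilities into one keyword probability p(w_hat)."""
    return aggr(kw.token_probs)


def ap_confidence(keywords: Sequence[Keyword], aggr: Callable = min) -> float:
    """Importance-weighted average of keyword probabilities (confidence c)."""
    total = sum(k.importance for k in keywords)
    if total == 0:
        raise ValueError("need at least one keyword with non-zero importance")
    return sum(k.importance * keyword_prob(k, aggr) for k in keywords) / total


# Made-up numbers, purely for illustration.
kws = [
    Keyword("12 apples", token_probs=[0.9, 0.8], importance=9),
    Keyword("boxes", token_probs=[0.6], importance=2),
    Keyword("10", token_probs=[0.5], importance=10),
]
c_min = ap_confidence(kws, aggr=min)    # (9*0.8 + 2*0.6 + 10*0.5) / 21
c_mean = ap_confidence(kws, aggr=mean)  # (9*0.85 + 2*0.6 + 10*0.5) / 21
```

With `aggr=min` the example gives 13.4 / 21 ≈ 0.638. Setting every `importance` to the same value reduces the score to a plain average over keywords, which is one way to picture a variant without importance scoring.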
Loss & Training
- Training-free: This is an inference-time method without requiring additional training or fine-tuning.
- All steps are executed on the original LLM via prompting.
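Because everything is done through prompting, the four enhanced SE variants differ only in which context is placed into the self-evaluation prompt. The sketch below shows one way that selection could look. The tuple layout, the default threshold `tau=5`, and the prompt wording in `build_se_prompt` are assumptions for illustration, not the paper's templates.

```python
from collections import defaultdict


def select_context(strategy, steps, keywords, tau=5):
    """Pick the reasoning context for a self-evaluation prompt.

    steps:    list of reasoning-step strings s_1..s_k
    keywords: list of (step_index, word, importance) tuples
    tau:      hypothetical importance threshold for KEYKeywords
    """
    if strategy == "ALLSteps":
        return list(steps)
    if strategy == "ALLKeywords":
        return [w for _, w, _ in keywords]
    if strategy == "KEYStep":
        scores = defaultdict(list)
        for i, _, t in keywords:
            scores[i].append(t)
        if not scores:
            raise ValueError("KEYStep needs at least one scored keyword")
        # Step with the highest mean keyword importance; ties go to the
        # earliest step that contributed a keyword.
        best = max(scores, key=lambda i: sum(scores[i]) / len(scores[i]))
        return [steps[best]]
    if strategy == "KEYKeywords":
        return [w for _, w, t in keywords if t >= tau]
    raise ValueError(f"unknown strategy: {strategy}")


def build_se_prompt(question, answer, context):
    """Assemble a P(True)-style prompt; the wording is illustrative only."""
    ctx = "\n".join(f"- {c}" for c in context)
    return (
        f"Question: {question}\n"
        f"Reasoning context:\n{ctx}\n"
        f"Proposed answer: {answer}\n"
        "Is the proposed answer correct? (A) True (B) False\n"
        "Answer:"
    )


steps = ["3 * 4 = 12 apples in total.", "12 - 2 = 10 apples remain."]
keywords = [(0, "3 boxes", 6), (0, "12 apples", 8),
            (1, "eats 2", 7), (1, "10 remain", 10)]
key_step = select_context("KEYStep", steps, keywords)
key_words = select_context("KEYKeywords", steps, keywords, tau=8)
prompt = build_se_prompt("How many apples remain?", "10", key_words)
```

In this toy input the second step has the higher mean importance (8.5 versus 7), so KEYStep selects it, while KEYKeywords with `tau=8` keeps "12 apples" and "10 remain".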
Key Experimental Results
Main Results (Llama 3.1-8B, AUROC ↑)
| Method | HotpotQA | 2WikiMHQA | GSM8K | SVAMP | ASDiv |
|---|---|---|---|---|---|
| Probas-min | 58.34 | 56.81 | 54.95 | 54.79 | 58.69 |
| + CoT-UQ | 64.37 | 70.02 | 63.09 | 60.49 | 64.84 |
| TOKENSAR | 53.57 | 56.92 | 54.46 | 55.01 | 58.71 |
| + CoT-UQ | 61.07 | 65.38 | 65.10 | 62.11 | 66.91 |
| P(True) | 62.39 | 53.56 | 48.15 | 51.58 | 47.23 |
| + CoT-UQ | 63.10 | 57.77 | 52.60 | 60.00 | 53.20 |
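CoT-UQ is reported to improve Probas-min by up to 16.8%. Within this table, the largest gain is also on Probas-min: on 2WikiMHQA, Probas-min + CoT-UQ rises from 56.81 to 70.02 (+13.21 points), while TOKENSAR + CoT-UQ on the same dataset rises from 56.92 to 65.38 (+8.46 points).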
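For readers less familiar with the metric: AUROC here measures how well the confidence score separates correct from incorrect responses, and the table values are presumably this quantity scaled to 0-100. The following is a minimal pairwise sketch with made-up inputs, not the paper's evaluation code. It relies on the standard equivalence between AUROC and the probability that a randomly chosen correct response receives higher confidence than a randomly chosen incorrect one.

```python
def auroc(confidences, correct):
    """AUROC of confidence scores against binary correctness labels.

    Counts, over all (correct, incorrect) response pairs, how often the
    correct response has the higher confidence; ties count as half.
    """
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect response")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: four responses with their confidence and correctness.
conf = [0.9, 0.8, 0.4, 0.3]
is_correct = [True, False, True, False]
score = auroc(conf, is_correct)  # 3 of 4 pairs ranked correctly -> 0.75
```

A value of 0.5 corresponds to confidence scores that carry no information about correctness, which puts baseline entries such as 48.15 or 47.23 in the table at or below chance level. The quadratic pairwise loop is fine for illustration; for large evaluation sets a rank-based implementation would be preferable.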
Ablation Study
| Configuration | Impact on AUROC |
|---|---|
| Full CoT-UQ (AP-Probas-min) | Baseline |
| w/o Importance Scoring | Drops by ~2-4%, validating the necessity of weighting |
| Logical Reasoning using KEYKeywords | Best performance (keywords contain high information content) |
| Mathematical Reasoning using ALLSteps/KEYStep | Best performance (mathematical keywords are overly simple, like solitary digits) |
| Random masking instead of keywords | Degraded performance |
Key Findings
- AP strategy benefits more: The average improvement of CoT-UQ on AP (+10.3%) is significantly higher than on SE (+4.4%), because AP directly filters noise at the probability level.
- Different optimal strategies for logical vs. mathematical reasoning: Logical reasoning benefits more from keyword-level strategies (KEYKeywords) because reasoning steps have redundant information while keywords retain logical relations. Mathematical reasoning favors step-level strategies (ALLSteps/KEYStep) because mathematical keywords are often single numbers lacking sufficient context.
- Importance scoring is consistently effective: Performance drops across all configurations when importance weighting is removed, confirming that different keywords contribute differently to UQ.
Highlights & Insights
- "Evaluating reasoning using reasoning" is simple yet effective: UQ is handled purely through prompting, without model modification, multiple response sampling, or training. This inference-time paradigm can be applied to any LLM that exposes token-level probabilities (logits).
- Keyword extraction as information compression: Aggregating probabilities over keywords rather than over all tokens filters out noise from redundant tokens while keeping the tokens that carry the reasoning.
- Task type dictates the optimal information granularity: Logical reasoning suits keyword-level analysis, while mathematical reasoning suits step-level analysis. This finding serves as a useful reference for other CoT-related methodologies.
Limitations & Future Work
- Requires access to token logits: Not applicable to purely black-box APIs that hide token probabilities (though several major commercial APIs now expose logprobs).
- Limited to closed-ended QA: Rigorous evaluation requires definitive correct answers; open-ended generation is not yet validated.
- Additional inference overhead: Keyword extraction and importance scoring require extra LLM calls (2-3 times), increasing computational costs.
- Limited model sizes: Validated only on 8B and 13B models; whether the approach is still needed for larger models is yet to be explored.
- Potential improvements: (a) Training a lightweight keyword extractor to replace LLM self-extraction, thereby reducing overhead; (b) Extending to UQ in other tasks such as code generation.
Related Work & Insights
- vs Semantic Entropy (Kuhn et al., 2023): Semantic Entropy requires multiple sampled responses per question (prompt-wise), while CoT-UQ scores a single generated response (response-wise). It still needs a few extra prompting calls for keyword extraction and importance scoring, but avoids repeated sampling of full responses.
- vs TOKENSAR (Duan et al., 2024): TOKENSAR evaluates answer token relevance and weights them, but exhibits limited effectiveness in short-answer scenarios (e.g., math problems). CoT-UQ focuses on keywords during the reasoning process, offering richer information.
- vs P(True) (Kadavath et al., 2022): P(True) directly queries the model on answer correctness and is severely affected by overconfidence. CoT-UQ helps the model make more informed self-evaluations by providing reasoning context.
Rating
- Novelty: ⭐⭐⭐⭐ The idea of using the reasoning process as a UQ signal is novel, though the implementation relies heavily on prompt engineering.
- Experimental Thoroughness: ⭐⭐⭐⭐ Five datasets, two models, detailed ablation studies, and case studies, although the lack of validation on larger models is a limitation.
- Writing Quality: ⭐⭐⭐⭐ The framework description is clear, and the illustrations are intuitive.
- Value: ⭐⭐⭐⭐ Highly practical for deploying reliable LLMs, as this method can be directly applied to confidence-based filtering in production environments.