Query-Level Uncertainty in Large Language Models¶
Conference: ICLR 2026 arXiv: 2506.09669 Code: GitHub Area: Information Retrieval Keywords: uncertainty estimation, knowledge boundary, adaptive inference, training-free, internal confidence
TL;DR¶
This paper introduces the concept of Query-Level Uncertainty and proposes an Internal Confidence method that estimates, prior to generation (via a single forward pass), whether an LLM is capable of answering a given query. The approach is training-free and enables efficient adaptive inference strategies including RAG triggering, model cascading, and abstention.
Background & Motivation¶
- LLMs have inherent knowledge boundaries and cannot accurately answer all queries; awareness of these boundaries is critical for building trustworthy and efficient AI systems.
- Existing uncertainty estimation methods are predominantly answer-level (evaluated post-generation), incurring substantial computational overhead due to the requirement of full answer generation.
- Adaptive inference strategies (e.g., RAG, slow thinking, model cascading) require pre-generation signals to determine whether to invoke additional resources.
- Existing query-level approaches require training probes or fine-tuning (e.g., IDK token, R-Tuning), limiting their generalizability.
- The internal hidden states of LLMs encode rich information about knowledge reachability, and cross-layer consistency can improve output quality.
Method¶
Mechanism¶
A yes/no self-evaluation prompt is used to elicit the LLM's judgment on whether it can answer the query. The probability \(P(\text{Yes})\) is extracted as the confidence score without generating any answer. The prompt follows the format: "Respond only with 'Yes' or 'No' to indicate whether you are capable of answering the {Query} accurately."
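As an illustration, the confidence extraction reduces to a softmax over the two answer-token logits. This is a minimal sketch: `build_prompt` mirrors the paper's template, while the step that produces the two logits from a real model forward pass is left out as a placeholder.

```python
import math

def build_prompt(query: str) -> str:
    """Self-evaluation prompt following the paper's template."""
    return ("Respond only with 'Yes' or 'No' to indicate whether you are "
            f"capable of answering the {query} accurately.")

def p_yes(logit_yes: float, logit_no: float) -> float:
    """P(Yes): softmax restricted to the {Yes, No} tokens.

    logit_yes / logit_no would come from projecting a hidden state
    through the unembedding matrix at the two answer-token ids.
    """
    m = max(logit_yes, logit_no)          # subtract max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)
```

Note that no answer tokens are generated: the score is read off directly from the logits of a single forward pass over the prompt.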
From P(Yes) to Internal Confidence¶
Basic P(Yes): Uses only the hidden state at the last token of the last layer, \(\mathbf{h}_N^{(L)}\), projected onto the {Yes, No} vocabulary via the unembedding matrix followed by softmax — analogous to training-free linear probing.
Internal Confidence Extension: Rather than relying solely on the final position, \(P(\text{Yes})\) is computed across all layers and token positions and aggregated via a weighted average: $$\text{IC}(\mathbf{h}) = \sum_{n=1}^{N}\sum_{l=1}^{L} w_n^{(l)}\, P(\text{Yes} \mid \mathbf{h}_n^{(l)})$$
Weights are computed using Attenuated Encoding: \(\delta_j^{(i)} = \exp(-\alpha|i-j|^2) / Z\), which assigns Gaussian-decaying weights as a function of squared distance from the "decision center" (by default, the last token of the last layer). The parameter \(\alpha=1.0\) controls locality: larger values concentrate the weights more tightly around the center.
Design Motivation¶
- AUROC heatmaps reveal that a "decision center" indeed exists: the most discriminative position is not necessarily the last token of the last layer, though it serves as a close approximation.
- Fixing the decision center eliminates the need for a validation set to identify the optimal position, preserving the training-free property.
- Cross-layer aggregation exploits rich knowledge encoded in intermediate layers rather than relying solely on the final representation.
Application Scenarios¶
- Adaptive RAG: Retrieval augmentation is triggered only when IC is low; queries with high IC are answered directly, reducing RAG invocations by 50%+ with negligible performance loss.
- Model Cascading: Queries on which the small model yields low IC are forwarded to a larger model, achieving an optimal cost-quality trade-off.
- Abstention Strategy: Queries with high uncertainty are declined, improving system reliability.
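The three scenarios share a simple thresholded routing pattern on the IC score. The sketch below uses illustrative threshold values; in practice they would be tuned per task and model, and "retrieve" could equally mean escalating to a larger model in the cascading setting.

```python
def route(ic: float,
          rag_threshold: float = 0.5,
          abstain_threshold: float = 0.2) -> str:
    """Route a query based on its pre-generation Internal Confidence.

    Thresholds are hypothetical defaults for illustration only.
    """
    if ic < abstain_threshold:
        return "abstain"            # too uncertain: decline to answer
    if ic < rag_threshold:
        return "retrieve"           # trigger RAG (or escalate to a larger model)
    return "answer_directly"        # confident: skip the extra resources
```

Because IC costs a single forward pass, this routing decision is effectively free relative to generation, which is what makes the reported 50%+ reduction in retrieval calls possible.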
Key Experimental Results¶
Main Results (Cross-Model and Cross-Task)¶
| Method | Phi-3.8B AUROC | Llama-8B AUROC | Qwen-14B AUROC | Avg AUROC |
|---|---|---|---|---|
| Max(-log p) | 54.0 | 56.3 | 57.8 | 56.0 |
| Predictive Entropy | 57.9 | 60.1 | 62.4 | 60.1 |
| Semantic Entropy | 55.6 | 59.7 | 60.0 | 58.4 |
| P(Yes) top-right | 57.3 | 60.5 | 60.9 | 59.6 |
| Internal Confidence | 60.8 | 64.7 | 67.1 | 64.2 |
Comparison with Answer-Level Methods (Speed)¶
| Method | GSM8K AUROC | ms/sample |
|---|---|---|
| IC (Ours) | 66.8 | 0.3 |
| Predictive Entropy | 61.0 | 9.8 |
| Min-K Entropy | 60.4 | 3.8 |
| Semantic Entropy | 60.0 | 151.8 |
Key Findings:
1. IC consistently outperforms all baselines across 3 datasets and 3 models.
2. IC is 32×–602× faster than answer-level methods while achieving higher accuracy.
3. In RAG settings, IC reduces retrieval calls by 50%+ with negligible performance degradation.
4. The effectiveness of IC scales with model size, as larger models exhibit stronger self-awareness of their knowledge boundaries.
5. Layers and token positions near the decision center are most discriminative, consistent with observations from the AUROC heatmaps.
Highlights & Insights¶
- This work is the first to formally define query-level uncertainty, shifting uncertainty estimation from a post-hoc to a pre-generation paradigm.
- The method is entirely training-free, requires only a single forward pass, and is highly practical.
- The cross-layer, cross-token weighted aggregation strategy (Attenuated Encoding) is both elegant and effective.
- The paper demonstrates significant efficiency-quality trade-off advantages in RAG and model cascading scenarios.
Limitations & Future Work¶
- The decision center is fixed at the last token of the last layer, which is suboptimal — a trade-off made to preserve the training-free property.
- Validation is limited to tasks with well-defined answers (factual QA and mathematical reasoning); open-ended generation is not addressed.
- Using greedy decoding as a proxy for the knowledge boundary is conservative and may underestimate model capability.
- Discriminative performance is relatively weaker on reasoning-intensive tasks (e.g., multi-step mathematical reasoning) compared to factual QA.
Related Work & Insights¶
- Answer-level uncertainty: Semantic Entropy (Kuhn et al. 2023), P(True) (Kadavath et al. 2022).
- Knowledge boundary detection: IDK token (Cohen et al. 2024), R-Tuning (Zhang et al. 2024a) — both require training.
- Internal state probing: Gottesman & Geva 2024 train lightweight probes; Semantic Entropy Probes (Kossen et al. 2024).
Rating¶
- Novelty: ⭐⭐⭐⭐ (the query-level concept is novel and the method is concise)
- Experimental Thoroughness: ⭐⭐⭐⭐ (3 models, 3 datasets, multiple application scenarios)
- Writing Quality: ⭐⭐⭐⭐⭐ (clear problem formulation, intuitive illustrations)
- Value: ⭐⭐⭐⭐ (direct practical value for adaptive inference)