Query-Level Uncertainty in Large Language Models

Conference: ICLR 2026
arXiv: 2506.09669
Code: GitHub
Area: Information Retrieval
Keywords: uncertainty estimation, knowledge boundary, adaptive inference, training-free, internal confidence

TL;DR

This paper introduces the concept of Query-Level Uncertainty and proposes an Internal Confidence method that estimates, prior to generation (via a single forward pass), whether an LLM is capable of answering a given query. The approach is training-free and enables efficient adaptive inference strategies including RAG triggering, model cascading, and abstention.

Background & Motivation

LLMs have inherent knowledge boundaries and cannot accurately answer all queries; awareness of these boundaries is critical for building trustworthy and efficient AI systems.
Existing uncertainty estimation methods are predominantly answer-level: they are evaluated after generation, so they incur substantial computational overhead because the full answer must be generated first.
Adaptive inference strategies (e.g., RAG, slow thinking, model cascading) require pre-generation signals to determine whether to invoke additional resources.
Existing query-level approaches require training probes or fine-tuning (e.g., IDK token, R-Tuning), limiting their generalizability.
The internal hidden states of LLMs encode rich information about knowledge reachability, and cross-layer consistency can improve output quality.

Method

Mechanism

A yes/no self-evaluation prompt is used to elicit the LLM's judgment on whether it can answer the query. The probability $P(\text{Yes})$ is extracted as the confidence score without generating any answer. The prompt follows the format: "Respond only with 'Yes' or 'No' to indicate whether you are capable of answering the {Query} accurately."

From P(Yes) to Internal Confidence

Basic P(Yes): Uses only the hidden state at the last token of the last layer, $\mathbf{h}_N^{(L)}$, projected onto the {Yes, No} vocabulary via the unembedding matrix followed by softmax — analogous to training-free linear probing.
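
The basic P(Yes) score can be illustrated with a minimal sketch. It assumes the softmax is taken over the two logits for "Yes" and "No" only, and it skips any final normalization layer a real model might apply before unembedding. The function name `p_yes` and the arguments `unembed_yes` and `unembed_no` (the rows of the unembedding matrix for the two tokens) are hypothetical names chosen for illustration, and the numbers are made up.

```python
import math

def p_yes(hidden_state, unembed_yes, unembed_no):
    """Return P(Yes) from a two-way softmax over the Yes and No logits.

    hidden_state: list of floats, one hidden vector h_n^(l).
    unembed_yes / unembed_no: hypothetical unembedding rows for the
    "Yes" and "No" tokens (assumed inputs, not a real library API).
    """
    logit_yes = sum(h * w for h, w in zip(hidden_state, unembed_yes))
    logit_no = sum(h * w for h, w in zip(hidden_state, unembed_no))
    # Subtract the larger logit before exponentiating for numerical stability.
    top = max(logit_yes, logit_no)
    exp_yes = math.exp(logit_yes - top)
    exp_no = math.exp(logit_no - top)
    return exp_yes / (exp_yes + exp_no)

# Toy example with made-up numbers (not taken from any real model).
score = p_yes([1.0, 0.0], [2.0, 0.0], [0.0, 0.0])
```

With two candidate tokens, the softmax reduces to a sigmoid of the logit difference, so equal logits give a score of 0.5.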

Internal Confidence Extension: Rather than relying solely on the final position, $P(\text{Yes})$ is computed across all layers and token positions and aggregated via a weighted average: $$\text{IC}(\mathbf{h}) = \sum_{n=1}^{N}\sum_{l=1}^{L}w_n^{(l)} P(\text{Yes}|\mathbf{h}_n^{(l)})$$

Weights are computed using Attenuated Encoding: $\delta_j^{(i)} = \exp(-\alpha|i-j|^2) / Z$, where $Z$ is a normalizing constant. The weights decay exponentially in the squared distance from the "decision center" (defaulting to the last token of the last layer). The parameter $\alpha$ (default $1.0$) controls locality: larger values concentrate weights more tightly around the center.
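
The weighting and aggregation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the attenuated weights are computed separately along the layer axis and the token axis and then multiplied, and that each axis is normalized to sum to 1. The names `attenuated_weights` and `internal_confidence` and the grid layout `p_yes_grid[l][n]` are hypothetical.

```python
import math

def attenuated_weights(length, center, alpha=1.0):
    """Weights proportional to exp(-alpha * |i - center|^2), normalized to sum to 1."""
    raw = [math.exp(-alpha * abs(i - center) ** 2) for i in range(length)]
    z = sum(raw)  # normalizing constant Z
    return [r / z for r in raw]

def internal_confidence(p_yes_grid, alpha=1.0):
    """Weighted average of P(Yes) over layers and token positions.

    p_yes_grid[l][n] holds P(Yes | h_n^(l)) for layer l and token n.
    The decision center is fixed at the last token of the last layer.
    Assumption: the 2-D weight is the product of a layer weight and a
    token weight.
    """
    num_layers = len(p_yes_grid)
    num_tokens = len(p_yes_grid[0])
    layer_w = attenuated_weights(num_layers, num_layers - 1, alpha)
    token_w = attenuated_weights(num_tokens, num_tokens - 1, alpha)
    return sum(
        layer_w[l] * token_w[n] * p_yes_grid[l][n]
        for l in range(num_layers)
        for n in range(num_tokens)
    )

# Toy grid with made-up values: 3 layers x 2 token positions.
toy_grid = [[0.40, 0.50], [0.55, 0.60], [0.65, 0.80]]
ic = internal_confidence(toy_grid, alpha=1.0)
```

Because the weights form a convex combination, the result always lies between the smallest and largest P(Yes) in the grid, and a larger `alpha` pulls it toward the value at the decision center.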

Design Motivation

AUROC heatmaps reveal that a "decision center" indeed exists: the most discriminative position is not necessarily the last token of the last layer, though that position serves as a close approximation.
Fixing the decision center eliminates the need for a validation set to identify the optimal position, preserving the training-free property.
Cross-layer aggregation exploits rich knowledge encoded in intermediate layers rather than relying solely on the final representation.

Application Scenarios

Adaptive RAG: Retrieval augmentation is triggered only when IC is low; queries with high IC are answered directly, reducing RAG invocations by 50%+ with negligible performance loss.
Model Cascading: Queries on which the small model yields low IC are forwarded to a larger model, giving a favorable cost-quality trade-off.
Abstention Strategy: Queries with high uncertainty are declined, improving system reliability.
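
All three scenarios share the same decision rule: compare the IC score against a threshold and fall back to a more expensive or more cautious action when it is low. The sketch below assumes a single threshold of 0.5, which is an illustrative value rather than one reported here; the function name `route_query` and the action labels are hypothetical.

```python
def route_query(ic_score, threshold=0.5, fallback="retrieve"):
    """Pick an action for a query based on its Internal Confidence score.

    fallback selects the scenario:
      "retrieve" -> adaptive RAG (trigger retrieval)
      "escalate" -> model cascading (forward to a larger model)
      "abstain"  -> abstention (decline to answer)
    The threshold is an assumed value and would need tuning in practice.
    """
    allowed = {"retrieve", "escalate", "abstain"}
    if fallback not in allowed:
        raise ValueError(f"fallback must be one of {sorted(allowed)}")
    return "answer_directly" if ic_score >= threshold else fallback

# Made-up scores for illustration.
actions = [route_query(s, fallback="retrieve") for s in (0.82, 0.31)]
```

Raising the threshold sends more queries to the fallback path, trading extra cost for fewer unsupported direct answers.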

Key Experimental Results

Main Results (Cross-Model and Cross-Task)

Method	Phi-3.8B AUROC	Llama-8B AUROC	Qwen-14B AUROC	Avg AUROC
Max(-log p)	54.0	56.3	57.8	56.0
Predictive Entropy	57.9	60.1	62.4	60.1
Semantic Entropy	55.6	59.7	60.0	58.4
P(Yes) top-right	57.3	60.5	60.9	59.6
Internal Confidence	60.8	64.7	67.1	64.2
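
For readers unfamiliar with the metric, AUROC here measures how well a confidence score separates queries the model answers correctly from those it does not. The sketch below computes it by pairwise comparison. It assumes binary labels where 1 means the model's greedy answer was correct, and it returns a value in [0, 1], whereas the table appears to report AUROC multiplied by 100. The function name `auroc` and the toy data are illustrative.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    scores: confidence values (e.g. IC) for each query.
    labels: 1 if the model answered the query correctly, else 0.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up scores: a perfect ranking and an uninformative one.
perfect = auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
chance = auroc([0.9, 0.1, 0.8, 0.3], [1, 1, 0, 0])
```

A value of 0.5 corresponds to chance-level ranking, which is the reference point for reading the table.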

Comparison with Answer-Level Methods (Speed)

Method	GSM8K AUROC	ms/sample
IC (Ours)	66.8	0.3
Predictive Entropy	61.0	9.8
Min-K Entropy	60.4	3.8
Semantic Entropy	60.0	151.8
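
The per-sample latencies in this table translate into speedup ratios by simple division. The snippet below uses only the GSM8K numbers above; ratios quoted elsewhere may be computed over other datasets, models, or baselines, so they need not match these values.

```python
# Latencies in ms/sample, copied from the GSM8K table above.
latency_ms = {
    "IC": 0.3,
    "Predictive Entropy": 9.8,
    "Min-K Entropy": 3.8,
    "Semantic Entropy": 151.8,
}

# Speedup of IC relative to each answer-level baseline.
speedup = {
    name: round(ms / latency_ms["IC"], 1)
    for name, ms in latency_ms.items()
    if name != "IC"
}
# Predictive Entropy: about 32.7x, Min-K Entropy: about 12.7x,
# Semantic Entropy: about 506x
```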

Key Findings:
1. IC consistently outperforms all baselines across 3 datasets and 3 models.
2. IC is 32×–602× faster than answer-level methods while achieving higher accuracy.
3. In RAG settings, IC reduces retrieval calls by 50%+ with negligible performance degradation.
4. The effectiveness of IC scales with model size, as larger models exhibit stronger self-awareness of their knowledge boundaries.
5. Layers and token positions near the decision center are most discriminative, consistent with observations from the AUROC heatmaps.

Highlights & Insights

This work is the first to formally define query-level uncertainty, shifting uncertainty estimation from a post-hoc to a pre-generation paradigm.
The method is entirely training-free, requires only a single forward pass, and is highly practical.
The cross-layer, cross-token weighted aggregation strategy (Attenuated Encoding) is both elegant and effective.
The paper demonstrates significant efficiency-quality trade-off advantages in RAG and model cascading scenarios.

Limitations & Future Work

The decision center is fixed at the last token of the last layer, which is suboptimal — a trade-off made to preserve the training-free property.
Validation is limited to tasks with well-defined answers (factual QA and mathematical reasoning); open-ended generation is not addressed.
Using greedy decoding as a proxy for the knowledge boundary is conservative and may underestimate model capability.
Discriminative performance is relatively weaker on reasoning-intensive tasks (e.g., multi-step mathematical reasoning) compared to factual QA.

Related Work

Answer-level uncertainty: Semantic Entropy (Kuhn et al. 2023), P(True) (Kadavath et al. 2022).
Knowledge boundary detection: IDK token (Cohen et al. 2024), R-Tuning (Zhang et al. 2024a) — both require training.
Internal state probing: Gottesman & Geva 2024 train lightweight probes; Semantic Entropy Probes (Kossen et al. 2024).

Rating

Novelty: ⭐⭐⭐⭐ (the query-level concept is novel and the method is concise)
Experimental Thoroughness: ⭐⭐⭐⭐ (3 models, 3 datasets, multiple application scenarios)
Writing Quality: ⭐⭐⭐⭐⭐ (clear problem formulation, intuitive illustrations)
Value: ⭐⭐⭐⭐ (direct practical value for adaptive inference)