SelfReflect: Can LLMs Communicate Their Internal Answer Distribution?

Conference: ICLR 2026
arXiv: 2505.20295
Code: apple/ml-selfreflect
Area: Causal Reasoning
Keywords: LLM uncertainty, internal distribution, information-theoretic distance, faithfulness metric, uncertainty communication

TL;DR

This paper proposes SelfReflect, an information-theoretic distance metric that measures the discrepancy between an LLM's self-reported uncertainty summary and its internal answer distribution, approximated by repeated sampling. The study finds that the LLMs evaluated are broadly unable to reflect their internal uncertainty on their own, but that faithful uncertainty summaries can be generated by sampling multiple outputs and feeding them back into the context.

Background & Motivation

Communicating LLM uncertainty is central to building trustworthy AI. A common approach is to append percentage figures or hedging phrases (e.g., "I'm not entirely sure, but…") to model responses. However, this approach has a fundamental limitation: it decorates a single answer rather than genuinely reflecting the model's full internal belief distribution.

Core Problem: A truly transparent LLM must be able to introspect its internal belief distribution and output a summary of all plausible options along with their probabilities. The question is: are LLMs capable of doing this?

Limitations of Prior Work:
1. Existing uncertainty quantification methods (e.g., logit calibration, confidence estimation) target developers and are inaccessible to end users.
2. Hedging language ("approximately," "possibly") is too coarse to express precise uncertainty.
3. There is no standardized metric to measure how faithfully an LLM's self-described uncertainty matches its true internal distribution.

Key Challenge: We want LLMs to faithfully communicate their uncertainty, yet we lack tools to evaluate such faithfulness, and it remains unknown whether LLMs possess this self-reflective capability at all.

Key Insight: Design an information-theoretic metric (the SelfReflect score) that measures the distance between a natural-language uncertainty summary (e.g., "60% answer A, 30% answer B, 10% other") and the LLM's internal answer distribution, then systematically evaluate modern LLMs under this metric.

Method

Overall Architecture

The SelfReflect evaluation pipeline consists of three steps:
1. Generate the answer distribution: Given a question, sample the LLM multiple times to obtain an empirical answer distribution as a proxy for its internal distribution.
2. Generate the uncertainty summary: Apply various strategies (direct prompting, CoT reasoning, fine-tuning, etc.) to elicit a natural-language summary of the LLM's uncertainty.
3. Compute the SelfReflect score: Measure the information-theoretic distance between the summary and the sampled answer distribution.

Key Designs

Definition of the SelfReflect Metric: SelfReflect is an information-theoretic distance that quantifies the discrepancy between a given summary string and the answer distribution. Concretely, it is based on a divergence (e.g., a variant of KL divergence) between the probability distribution described in the summary and the empirically sampled answer distribution. Lower scores indicate more faithful summaries. A key property of the metric is its sensitivity to even slight deviations, providing a fine-grained faithfulness measure.
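As a purely illustrative worked example, suppose the summary has already been parsed into a distribution Q and the sampled answers give an empirical distribution P. The symbols P and Q, the direction of the divergence, and the use of plain KL divergence are all assumptions made here; the text above only says the score is based on a variant of such a divergence, so the actual SelfReflect computation may differ.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{a} P(a)\,\ln\frac{P(a)}{Q(a)}

% Example: P = (0.6, 0.3, 0.1) from sampling, Q = (0.5, 0.4, 0.1) stated in the summary
D_{\mathrm{KL}}(P \,\|\, Q)
  = 0.6\ln\frac{0.6}{0.5} + 0.3\ln\frac{0.3}{0.4} + 0.1\ln\frac{0.1}{0.1}
  \approx 0.1094 - 0.0863 + 0
  \approx 0.023 \text{ nats}
```

Under this simplified reading, a summary that states exactly the sampled proportions would score 0, and a 10-point misstatement between two answers already yields a nonzero score.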
Empirical Approximation of the Internal Distribution: For each question, the same query is issued to the LLM multiple times (e.g., 50 times), and the resulting answers and their frequencies are collected. The empirical distribution of these sampled answers serves as a proxy for the LLM's "internal answer distribution." This is a principled approximation, as autoregressive sampling from an LLM inherently draws from its learned conditional distribution.
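The sampling proxy described above can be sketched in a few lines of Python. This is a minimal sketch, not the repository's code: the `normalize` helper and its string-matching rule are hypothetical, and free-form answers would likely need semantic grouping rather than exact matching.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Crude canonicalization (an assumption): trim, drop trailing periods, lowercase."""
    return answer.strip().rstrip(".").lower()


def empirical_distribution(samples):
    """Map each distinct normalized answer to its relative frequency."""
    counts = Counter(normalize(s) for s in samples)
    total = sum(counts.values())
    # most_common() orders answers from most to least frequent
    return {answer: count / total for answer, count in counts.most_common()}


# 50 hypothetical sampled answers to a single question
samples = ["Paris"] * 30 + ["paris."] * 5 + ["Lyon"] * 10 + ["Marseille"] * 5
dist = empirical_distribution(samples)
print(dist)
```

Here the two spellings of "Paris" are merged, so the most frequent answer receives 35 of the 50 samples.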
Multiple Uncertainty Summarization Strategies:
- Greedy decoding: Generates the single most likely answer (baseline; conveys no uncertainty).
- Direct prompting: Prompts the LLM to describe its uncertainty about the answer.
- Chain-of-thought (CoT) reasoning: Instructs the LLM to reason before providing an uncertainty description.
- Explicit fine-tuning: Fine-tunes the LLM specifically on uncertainty description tasks.
- Sample-and-Summarize: Samples multiple answers first, feeds them back into the context, and asks the LLM to summarize the resulting distribution.
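The Sample-and-Summarize strategy in the last bullet can be sketched as follows. The prompt wording, the function names, and the `generate` callable (standing in for any sampled LLM call) are hypothetical and not taken from the paper or its code; the toy `fake_generate` exists only so the sketch runs without a model.

```python
from typing import Callable, List


def build_summary_prompt(question: str, samples: List[str]) -> str:
    """Format sampled answers into a prompt asking for a distribution summary."""
    listing = "\n".join(f"{i}. {s}" for i, s in enumerate(samples, start=1))
    return (
        f"Question: {question}\n"
        f"Below are {len(samples)} answers sampled for this question:\n"
        f"{listing}\n"
        "Summarize them: list each distinct answer with its approximate share."
    )


def sample_and_summarize(
    question: str, generate: Callable[[str], str], n_samples: int = 10
) -> str:
    """Sample n answers, feed them back into the context, and request a summary."""
    samples = [generate(question) for _ in range(n_samples)]
    return generate(build_summary_prompt(question, samples))


def fake_generate(prompt: str) -> str:
    """Toy stand-in for a model call (assumption): fixed answer or fixed summary."""
    return "SUMMARY" if "answers sampled" in prompt else "Paris"


prompt = build_summary_prompt("Capital of France?", ["Paris", "Paris", "Lyon"])
result = sample_and_summarize("Capital of France?", fake_generate, n_samples=3)
print(prompt)
```

Note that this costs `n_samples + 1` model calls per question, which is the inference overhead listed among the limitations.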
Validating Metric Effectiveness: The validity of the SelfReflect metric is verified through interventional studies and human evaluations. In the interventional studies, probabilities in the summaries are artificially perturbed to verify that the metric detects the changes. The human studies confirm that the metric aligns with human judgments of summary faithfulness.
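The interventional check can be illustrated with a toy sensitivity test: move probability mass between two answers in the summary and confirm that a divergence against the sampled distribution grows. This sketch assumes the summary is already parsed into a dictionary and uses plain KL divergence as a stand-in; neither is claimed to match the paper's actual procedure, and `shift_mass` is a hypothetical helper.

```python
import math


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) in nats; eps guards against division by zero for missing keys."""
    return sum(
        p_k * math.log((p_k + eps) / (q.get(k, 0.0) + eps))
        for k, p_k in p.items()
        if p_k > 0.0
    )


def shift_mass(dist, src, dst, delta):
    """Return a copy of dist with `delta` probability moved from src to dst."""
    out = dict(dist)
    out[src] -= delta
    out[dst] = out.get(dst, 0.0) + delta
    return out


# Hypothetical sampled distribution, and summaries perturbed by increasing amounts
empirical = {"A": 0.6, "B": 0.3, "other": 0.1}
deltas = (0.0, 0.05, 0.1, 0.2)
scores = [kl_divergence(empirical, shift_mass(empirical, "A", "B", d)) for d in deltas]

for d, s in zip(deltas, scores):
    print(f"shift={d:.2f}  score={s:.4f}")
```

A metric that is sensitive in the sense described above should, like this toy score, increase as the perturbation grows.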
Implementation: Built on vLLM, the framework supports arbitrary LLMs. The pipeline covers: generating the answer distribution → generating uncertainty summaries → computing the SelfReflect score. A LogitProcessor hook is used to obtain the full logit vector, ensuring accurate probability computation.

Loss & Training

The SelfReflect metric itself does not involve training. For the explicit fine-tuning strategy, standard supervised fine-tuning is applied, training the LLM on data containing correct uncertainty descriptions.

Key Experimental Results

Main Results

Different uncertainty summarization strategies are evaluated by their SelfReflect scores (lower is better) across multiple LLMs and datasets:

Summarization Strategy	Overall Performance	Notes
Greedy (baseline)	High score	Single answer only; no uncertainty information
Direct prompting	Close to baseline	LLMs cannot autonomously reflect uncertainty
CoT reasoning	Close to baseline	Reasoning does not help reflect uncertainty
Explicit fine-tuning	Close to baseline	Fine-tuning also fails to teach effective self-reflection
Sample-and-Summarize	Significantly lower	The only effective approach

Ablation Study

Configuration	Key Metric	Notes
Interventional study (perturbed probabilities)	Metric sensitivity	SelfReflect detects subtle deviations
Human study	Human–metric agreement	Strongly aligned with human judgments
Different LLMs	Cross-model consistency	All models fail to autonomously reflect uncertainty
Different datasets	Cross-task consistency	Includes QA datasets such as NQ

Key Findings

Core finding (negative result): Across the models and datasets evaluated, LLMs fail to disclose their uncertainty on their own. Direct prompting, chain-of-thought reasoning, and explicit fine-tuning all score close to the greedy baseline under the SelfReflect metric.
The only effective approach: The Sample-and-Summarize method — sampling multiple outputs and asking the LLM to summarize them — is the only strategy that produces faithful uncertainty summaries.
Metric validity: The SelfReflect metric is sensitive to even minor deviations and aligns closely with human judgments.
Implication: These results suggest that current LLMs lack a reliable self-reflective capability and cannot directly access or report their internal uncertainty states.

Highlights & Insights

Precise problem formulation: Transforms the vague intuition of "can LLMs communicate uncertainty" into a quantifiable scientific question.
Elegant metric design: SelfReflect is a fine-grained information-theoretic measure capable of detecting subtle deviations overlooked by conventional approaches.
Impactful findings: The results provide strong evidence against an intrinsic ability of current LLMs to self-report their uncertainty, which is an important negative result.
Pragmatic solution: Although simple, Sample-and-Summarize identifies a viable path for uncertainty communication.
From Apple: The code is open-sourced at apple/ml-selfreflect, reflecting industry attention to LLM trustworthiness.
Bridges LLM capability evaluation and uncertainty quantification: Provides a standardized evaluation tool for future research.

Limitations & Future Work

The internal distribution is approximated via repeated sampling; a limited number of samples (e.g., 50) may be insufficient for long-tail distributions.
The SelfReflect metric requires a large number of samples per question to establish the reference distribution, incurring substantial computational overhead.
Evaluation is primarily conducted on question-answering tasks; applicability to open-ended generation (e.g., creative writing) remains unverified.
The Sample-and-Summarize method requires multiple inference calls, increasing inference cost.
More complex uncertainty representations (e.g., calibration curves, confidence intervals) are not explored.
The effect of model scale on self-reflection capability is not analyzed — it remains unclear whether larger models perform better.
The SelfReflect metric relies on vLLM's LogitProcessor hook and is not fully applicable to closed-source API models (e.g., GPT-4).
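The first limitation above, about long-tail answers, can be made concrete with a short calculation. Assuming samples are independent draws, an answer with per-sample probability p is missed entirely in n samples with probability (1 - p)^n; the function name below is hypothetical.

```python
def miss_probability(p: float, n: int) -> float:
    """Chance that an answer with per-sample probability p never appears in n i.i.d. samples."""
    return (1.0 - p) ** n


for p in (0.10, 0.05, 0.01):
    print(f"p={p:.2f}: absent from 50 samples with probability {miss_probability(p, 50):.3f}")
```

With 50 samples, an answer the model gives 1% of the time is absent from the reference distribution in roughly 60% of runs, while a 10% answer is almost never missed.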

Related Work & Insights

LLM uncertainty quantification (token-level entropy, conformal prediction, etc.): Developer-facing techniques; this paper focuses on user-facing uncertainty communication.
LLM calibration: Concerned with whether confidence in a single answer is accurate; this paper addresses the faithful communication of the full distribution.
Self-Consistency (Wang et al., 2022): Samples multiple outputs and selects the answer by majority vote; this paper goes further by requiring the LLM to summarize the sampled results.
Chain-of-Thought reasoning: Shown to be ineffective at helping LLMs reflect their internal uncertainty.
Insight: The Sample-and-Summarize paradigm may be the only currently viable approach for LLM uncertainty communication; future work could explore embedding it into interactive dialogue systems.

Rating

Novelty: ⭐⭐⭐⭐⭐ (Entirely new problem definition and metric design)
Experimental Thoroughness: ⭐⭐⭐⭐ (Multiple models, multiple strategies, interventional and human studies)
Writing Quality: ⭐⭐⭐⭐ (Clear motivation, high-impact findings)
Value: ⭐⭐⭐⭐⭐ (Provides a critical tool and key insights for LLM trustworthiness research)