👻 Hallucination Detection

🧪 ICML2026 · 21 paper notes

📌 Same area in other venues: 📷 CVPR2026 (33) · 🔬 ICLR2026 (40) · 💬 ACL2026 (28) · 🤖 AAAI2026 (15) · 🧠 NeurIPS2025 (17) · 📹 ICCV2025 (5)

🔥 Top topics: Multimodal/VLM ×7 · LLM ×3 · Adversarial Robustness ×2

A Unified Definition of Hallucination: It's The World Model, Stupid!: This is a position paper advocating that "hallucinations" across various tasks—translation, summarization, open-domain QA, RAG, multimodal, and agents—be unified as one phenomenon: user-observable, inaccurate world modeling relative to a "reference world model." Every scenario is simply a different configuration of the "\((W, V, P)\)" triplet (Reference World \(W\), View Function \(V\), Conflict Policy \(P\)), converging fragmented definitions into a universal template for generating large-scale, comparable benchmarks.
Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models: This paper proposes RUDDER, which extracts per-sample visual evidence directions from residual updates during the prefill stage of LVLMs and adaptively injects them via a Beta Gate during decoding, mitigating object hallucinations with overhead close to a single forward pass.
Automatic Layer Selection for Hallucination Detection: This paper proposes FEPoID (First Effective Peak of Intrinsic Dimension), a training-free criterion for automatic layer selection. Combined with the First Sentence Truncation (FST) strategy, it consistently selects near-optimal intermediate layers across various QA and summarization hallucination detection benchmarks, significantly outperforming existing baseline methods.
Building Reliable Long-Form Generation via Hallucination Rejection Sampling: This paper proposes the SHARS framework, which detects and rejects hallucinated content sentence-by-sentence during inference, retaining only verified factual segments to continue generation. Combined with an improved semantic entropy detector, HalluSE, it improves factual precision by approximately 20–26% on FactScore while maintaining or increasing the volume of factual information in the output.
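The sentence-by-sentence rejection loop described for SHARS can be sketched as follows. This is a minimal illustration, not the authors' implementation: `rejection_sample_long_form`, `propose`, `is_hallucinated`, `max_sentences`, and `max_tries` are hypothetical names, and the toy detector below stands in for a real detector such as HalluSE.

```python
from typing import Callable, List


def rejection_sample_long_form(
    propose: Callable[[List[str]], str],
    is_hallucinated: Callable[[List[str], str], bool],
    max_sentences: int = 5,
    max_tries: int = 4,
) -> List[str]:
    """Grow an answer one sentence at a time, keeping only accepted sentences.

    `propose` stands in for the generator (it sees the accepted prefix) and
    `is_hallucinated` stands in for the detector. Both are assumptions.
    """
    accepted: List[str] = []
    for _ in range(max_sentences):
        for _ in range(max_tries):
            candidate = propose(accepted)
            if not is_hallucinated(accepted, candidate):
                accepted.append(candidate)  # continue from the verified prefix
                break
        else:
            # Every candidate was rejected: stop instead of emitting unverified text.
            break
    return accepted


# Toy stand-ins so the sketch runs end to end.
_candidates = iter([
    "Paris is in France.",
    "[bad] Paris has 90 million residents.",
    "The Seine runs through Paris.",
])


def toy_propose(prefix: List[str]) -> str:
    return next(_candidates)


def toy_detector(prefix: List[str], sentence: str) -> bool:
    return sentence.startswith("[bad]")


result = rejection_sample_long_form(toy_propose, toy_detector, max_sentences=2)
print(result)
```

The retry budget (`max_tries`) is a design choice of this sketch: it bounds the extra inference cost per sentence and ends generation early when no candidate passes the detector.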
Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation: This paper proposes GIFT, which constructs a visual saliency map by tracking positive changes in visual attention ("gaze shifts") as the VLM interprets user queries. During decoding, it simultaneously enhances attention to both visual and query tokens to keep cross-modal fusion balanced, achieving up to 20.7% improvement on CHAIR with only 1.13× latency overhead.
Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy: This paper identifies that LVLM hallucinations originate from "insufficient attention + forgetting during generation" regarding correct visual evidence. Observing a significant Inter-Layer Visual Attention Discrepancy (ILVAD) for visual evidence, the authors propose a training-free, plug-and-play method: it constructs a visual evidence saliency map via inter-layer differencing, then continuously up-weights visual evidence tokens and "evidence-grounded" text tokens during generation. This consistently reduces hallucinations across 5 LVLMs and 5 hallucination and comprehensive benchmarks.
From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity: This paper shifts LLM hallucination detection from "analyzing output probabilities" to "analyzing loss landscape curvature": it adds Gaussian noise to the embeddings and measures the resulting perturbations in gradient direction and magnitude. Serving as a cheap proxy for the Hessian spectral radius, the method outperforms baselines such as Entropy, Semantic Entropy, and EigenScore in AUROC across 12 model-dataset combinations.
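To make the gradient-sensitivity idea concrete, here is a small sketch on a toy loss. It is an illustration under stated assumptions, not the paper's scoring rule: `gradient_sensitivity`, `grad_fn`, `sigma`, and `n_samples` are hypothetical names, the score folds direction and magnitude change into a single norm of the gradient difference, and a quadratic loss with an analytic gradient replaces an actual LLM.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def gradient_sensitivity(
    grad_fn: Callable[[Vector], Vector],
    embedding: Vector,
    sigma: float = 0.01,
    n_samples: int = 32,
    seed: int = 0,
) -> float:
    """Average change in the gradient under Gaussian noise on the embedding.

    Larger values suggest a sharper (higher-curvature) loss surface.
    """
    rng = random.Random(seed)
    g0 = grad_fn(embedding)
    total = 0.0
    for _ in range(n_samples):
        noisy = [x + rng.gauss(0.0, sigma) for x in embedding]
        g1 = grad_fn(noisy)
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(g1, g0)))
        total += diff / sigma  # normalise by the noise scale
    return total / n_samples


def quadratic_grad(curvature: float) -> Callable[[Vector], Vector]:
    """Gradient of the toy loss 0.5 * curvature * ||e||^2."""
    return lambda e: [curvature * x for x in e]


embedding = [1.0, -2.0, 0.5]
flat = gradient_sensitivity(quadratic_grad(0.1), embedding)
sharp = gradient_sensitivity(quadratic_grad(10.0), embedding)
print(flat < sharp)
```

For this toy loss the Hessian is `curvature` times the identity, so the score grows with the curvature, which is the intuition behind using gradient perturbations as a proxy for the Hessian spectral radius.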
From Out-of-Distribution Detection to Hallucination Detection: A Geometric View: This paper treats LLM next-token prediction as a classification task on a massive vocabulary. By migrating two lightweight OOD detectors—NCI (proximity of features to weight vectors) and fDBD (distance from features to decision boundaries)—with two adaptations ("analytical proxy \(\mu_G\) for training feature means" and "calculating boundary distance only on top-\(k\) candidate tokens"), it derives a training-free, single-sample inference-time hallucination detector. It consistently outperforms baselines such as Perplexity, Semantic Entropy, and SelfCheckGPT on CSQA, GSM8K, and AQuA.
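The "boundary distance on top-\(k\) candidates" adaptation can be illustrated with a linear output head. This is a rough sketch under assumptions: `topk_boundary_distance` is a hypothetical name, the distance is measured to the hyperplane separating the predicted token from each rival token, and the feature-mean normalisation (the \(\mu_G\) proxy) is left out because the summary does not specify it.

```python
import math
from typing import List, Sequence


def topk_boundary_distance(
    feature: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
    k: int = 3,
) -> float:
    """Mean distance from `feature` to the decision hyperplanes between the
    predicted token and the other top-k tokens of a linear output head."""
    logits = [
        sum(w_i * f_i for w_i, f_i in zip(w, feature)) + b
        for w, b in zip(weights, biases)
    ]
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    pred, rivals = order[0], order[1:k]
    distances: List[float] = []
    for c in rivals:
        dw = [a - b for a, b in zip(weights[pred], weights[c])]
        norm = math.sqrt(sum(x * x for x in dw))
        if norm == 0.0:
            continue  # identical weight rows define no boundary
        distances.append((logits[pred] - logits[c]) / norm)
    if not distances:
        raise ValueError("need k >= 2 and at least one distinct rival token")
    return sum(distances) / len(distances)


# Toy 3-token vocabulary with 2-dimensional features.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b = [0.0, 0.0, 0.0]
confident = topk_boundary_distance([3.0, 0.0], W, b)
uncertain = topk_boundary_distance([1.0, 0.9], W, b)
print(confident > uncertain)
```

A small distance means a tiny change in the feature would flip the predicted token, which a detector of this kind would read as a sign of an unreliable prediction.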
Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing: This paper formalizes "LLMs memorizing random facts" as a membership testing problem with continuous confidence scores. It proves that in the sparse limit of facts, the optimal memory cost exactly equals the minimum KL divergence between fact and non-fact output distributions—a "rate-distortion theorem." It further concludes that under the log-loss objective and given limited memory, the optimal strategy is neither abstention nor forgetting, but rather mapping a certain proportion of non-facts and facts to the same high-confidence point, identifying hallucination as the information-theoretically optimal error form.
Hallucinations Undermine Trust; Metacognition is a Way Forward: This position paper argues that "totally eliminating LLM hallucinations" is theoretically impossible without incurring a "utility tax" (discrimination gap); the authors advocate shifting the goal from "eliminating hallucinations" to faithful uncertainty and treating this metacognition as an indispensable control layer for agentic LLMs when calling tools.
Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping: This paper proposes ARS for hallucination detection in Large Reasoning Models (LRMs). Instead of perturbing reasoning traces in the text space, ARS applies small perturbations directly to the latent representations at the end of the trace to decode counterfactual answers. Using "answer agreement" as a label, a lightweight contrastive head is trained to shape trace-conditioned answer embeddings, enabling embedding-based detectors to better distinguish hallucinations from truthful responses (\(66.85 \to 86.64\) AUROC on TruthfulQA).
Honest Lying: Understanding Memory Confabulation in Reflexive Agents: This paper uncovers a systematic failure mode in Reflexion-style agents termed "memory confabulation": agents write incorrect task understandings into reflective memory and reuse them across trials. The authors quantify this phenomenon using the Reflection Repetition Rate (RRR) and replace open-ended self-diagnosis with programmatic feedback extraction, which increases the correct object mention rate from 0% to 86% and reduces RRR from 0.64 to 0.10 on ALFWorld.
Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models: The study identifies that middle-layer embeddings of instruction tokens in MLLMs naturally filter out misleading information from the visual side. Based on this, a training-free InsLen score (Calibrated Local Score + Context Consistency Score) is proposed, which improves the AUROC of object hallucination detection by up to 13.81% across 5 MLLMs and 4 benchmarks.
Learning from Fine-Grained Visual Discrepancies: Mitigating Multimodal Hallucinations via In-Context Visual Contrastive Optimization: By concatenating the original image and a contrastive negative image into a shared multi-image context and using anchor instructions to specify which image to observe, the partition functions of visual-preference DPO are automatically aligned to produce a theoretically consistent contrastive objective. Combined with surgically edited hard negative samples, this significantly reduces multimodal hallucinations in VLMs.
Mitigating Hallucinations in Large Vision-Language Models via Causal Route Gating: CRG performs a precise linear decomposition of each attention head's output into visual and textual routes. It estimates the causal "do-effect" of both routes on the current token through one forward and one backward pass. It then systematically mitigates language prior hallucinations in LVLMs without training by suppressing only the textual routes of heads where visual and textual signs conflict and the VRI is low (i.e., prior-dominated).
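The linear decomposition into visual and textual routes follows from the fact that a head's output is an attention-weighted sum of value vectors. The sketch below shows only that split, assuming a single query position and a boolean mask marking visual tokens; `split_head_output` and its arguments are hypothetical names, and the do-effect estimation and gating steps are not reproduced.

```python
from typing import List, Sequence, Tuple


def split_head_output(
    attn: Sequence[float],
    values: Sequence[Sequence[float]],
    is_visual: Sequence[bool],
) -> Tuple[List[float], List[float]]:
    """Split one head's output sum_j attn[j] * values[j] into the part
    contributed by visual tokens and the part contributed by text tokens."""
    dim = len(values[0])
    visual = [0.0] * dim
    textual = [0.0] * dim
    for a, v, vis in zip(attn, values, is_visual):
        target = visual if vis else textual
        for d in range(dim):
            target[d] += a * v[d]
    return visual, textual


attn = [0.5, 0.3, 0.2]                       # attention weights over 3 keys
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
is_visual = [True, True, False]              # first two keys are image tokens

visual_route, textual_route = split_head_output(attn, values, is_visual)
full_output = [v + t for v, t in zip(visual_route, textual_route)]
print(visual_route, textual_route, full_output)
```

Because the output projection that follows a head is also linear, the two routes stay separable after it, so suppressing the textual route amounts to scaling `textual_route` before the sum.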
MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue: This paper proposes the MM-Snowball benchmark (4992 trajectories of 6-turn adversarial dialogues) to systematically characterize the "hallucination snowballing" phenomenon in Multimodal Large Language Models (MLLMs) during long dialogues. Building on it, the authors design a training-free Conflict-Aware Visual Rectification (CAVR) method that refreshes visual signals at the representation layer and adjudicates text-visual conflicts at the logit layer, significantly flattening the performance collapse curve in later dialogue stages.
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations: REALISTA constructs an "input-dependent editing direction dictionary" in the LLM latent space, transforming adversarial prompt optimization into a continuous problem under simplex constraints. This approach maintains the semantic equivalence and coherence of discrete methods like SECA while offering the search flexibility of continuous methods like LARGO, successfully inducing hallucinations in the free-form outputs of closed-source reasoning models like GPT-5 for the first time.
Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models: This paper redefines LVLM hallucination as "missing visual information suppressed by language priors." It uses orthogonal projection to extract a "pure visual vector" by stripping language priors from the raw visual direction. Then, via a risk-gating mechanism, it performs sparse intervention on a single layer at the optimal depth. This training-free approach reduces the CHAIRS hallucination rate by ~19% while preserving the general reasoning capabilities of MM-Vet.
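The orthogonal-projection step can be sketched in a few lines. This assumes the language prior is summarised by a single direction vector, which the summary does not state; `strip_language_prior`, `visual_dir`, and `prior_dir` are hypothetical names, and the risk gating and layer selection are not shown.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def strip_language_prior(
    visual_dir: Sequence[float], prior_dir: Sequence[float]
) -> List[float]:
    """Remove from `visual_dir` its component along `prior_dir`, leaving a
    vector orthogonal to the language-prior direction."""
    denom = dot(prior_dir, prior_dir)
    if denom == 0.0:
        raise ValueError("prior_dir must be non-zero")
    scale = dot(visual_dir, prior_dir) / denom
    return [v - scale * p for v, p in zip(visual_dir, prior_dir)]


raw_visual = [3.0, 4.0]
language_prior = [1.0, 0.0]
pure_visual = strip_language_prior(raw_visual, language_prior)
print(pure_visual, dot(pure_visual, language_prior))
```

Adding a multiple of `pure_visual` to a hidden state then changes nothing along the prior direction, which is one way to read "stripping language priors from the raw visual direction".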
TAG: Tangential Amplifying Guidance for Hallucination-Resistant Sampling: TAG decomposes each diffusion update step into "radial + tangential" components along the direction of the current latent variable. By applying an amplification factor \(\eta \ge 1\) solely to the tangential component, it is proved via first-order Taylor expansion that this is equivalent to monotonically increasing the log-likelihood gain. This pulls samples toward high-density regions of the data manifold, mitigating semantic hallucinations in diffusion models with almost zero extra computational cost.
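The radial and tangential split described for TAG is plain vector geometry and can be sketched as below. The function name `tangential_amplify` and the toy vectors are illustrative assumptions; a real sampler would apply this to the latent tensor at each diffusion step.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def tangential_amplify(
    latent: Sequence[float], update: Sequence[float], eta: float = 1.5
) -> List[float]:
    """Decompose `update` into a radial part (along `latent`) and a
    tangential part (orthogonal to it), then scale only the tangential part."""
    if eta < 1.0:
        raise ValueError("eta is expected to be >= 1")
    norm_sq = dot(latent, latent)
    if norm_sq == 0.0:
        raise ValueError("latent must be non-zero")
    coef = dot(update, latent) / norm_sq
    radial = [coef * x for x in latent]
    tangential = [u - r for u, r in zip(update, radial)]
    return [r + eta * t for r, t in zip(radial, tangential)]


latent = [2.0, 0.0]
update = [1.0, 1.0]
amplified = tangential_amplify(latent, update, eta=2.0)
unchanged = tangential_amplify(latent, update, eta=1.0)
print(amplified, unchanged)
```

With \(\eta = 1\) the update is returned unchanged, and for any \(\eta\) the component along the latent is preserved, so only the movement orthogonal to the current latent is amplified.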
When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets (CAIA): CAIA establishes the first "adversarial high-stakes" agent benchmark, evaluating 17 frontier LLMs on 178 temporally anchored real-world cryptocurrency tasks. Key findings: without tools, all models achieve only 12–28% accuracy (close to random guessing); with tools, the strongest model, GPT-5, reaches only 67.4% vs. 80% for junior human analysts. Critically, 55.5% of tool calls are biased toward "unreliable web searches" that bypass authoritative on-chain data, and the Pass@k metric systematically masks dangerous "trial-and-error" behaviors.
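A short calculation shows how Pass@k can hide trial-and-error behavior. The sketch uses the standard unbiased Pass@k estimator from code-generation evaluation; whether CAIA computes Pass@k in exactly this way is an assumption, and the numbers are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn without replacement
    from n recorded attempts of which c were correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical agent: 2 correct answers out of 10 attempts on a task.
single_try = pass_at_k(n=10, c=2, k=1)
five_tries = pass_at_k(n=10, c=2, k=5)
print(round(single_try, 3), round(five_tries, 3))
```

In this made-up case an agent that is right one time in five scores about 0.78 at k = 5, even though in a high-stakes setting each of the failed attempts could be a costly action.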
Zero-source LLM Hallucination Detection with Human-like Criteria Probing: HCPD treats "zero-source hallucination detection" (where only Q&A pairs are available, without access to internal model states or external knowledge bases) as a multi-criteria probe mimicking human evaluation. An LLM agent adaptively generates a set of interpretable evaluation criteria, assigns weights, scores per criterion, and computes a weighted trustworthiness score. Using weakly supervised semantic consistency and GRPO to train the agent, and multi-sampling aggregation during inference, the method significantly outperforms existing approaches in AUROC across four QA datasets and multiple target models.
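The scoring step described for HCPD (weights, per-criterion scores, a weighted trustworthiness score, and aggregation over several samples) can be sketched as follows. The criterion names, the \([0, 1]\) score range, the weight normalisation, and the use of a plain mean for aggregation are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# One probe: criterion name -> (weight, score in [0, 1]). Names are hypothetical.
Probe = Dict[str, Tuple[float, float]]


def trust_score(probe: Probe) -> float:
    """Weighted average of per-criterion scores."""
    total_weight = sum(w for w, _ in probe.values())
    if total_weight <= 0.0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in probe.values()) / total_weight


def aggregate(probes: List[Probe]) -> float:
    """Average the trust scores of several independently sampled probes."""
    if not probes:
        raise ValueError("need at least one probe")
    return sum(trust_score(p) for p in probes) / len(probes)


probes = [
    {"answers the question": (2.0, 1.0), "internally consistent": (1.0, 0.4)},
    {"plausible entities": (1.0, 0.5), "specific, not evasive": (1.0, 0.7)},
]
overall = aggregate(probes)
print(round(overall, 3))
```

A low aggregated score would flag the answer as a likely hallucination; the threshold for that decision is left open here.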