👻 Hallucination Detection
🧪 ICML2026 · 19 paper notes
📌 Same area in other venues: 💬 ACL2026 (27) · 📷 CVPR2026 (18) · 🔬 ICLR2026 (9) · 🤖 AAAI2026 (15) · 🧠 NeurIPS2025 (17) · 📹 ICCV2025 (4)
🔥 Top topics: Multimodal/VLM ×7 · LLM ×2 · Adversarial Robustness ×2
- Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models
This paper proposes RUDDER, which extracts per-sample visual evidence directions from residual updates during the prefill phase of LVLMs and adaptively injects them via a Beta Gate during decoding to reduce object hallucinations with overhead close to a single forward pass.
- Automatic Layer Selection for Hallucination Detection
FEPoID (First Effective Peak of Intrinsic Dimension) is proposed as a training-free automatic layer selection criterion. Combined with the First Sentence Truncation (FST) strategy, it consistently identifies near-optimal intermediate layers across various QA and summarization hallucination detection benchmarks, significantly outperforming existing baseline methods.
- Building Reliable Long-Form Generation via Hallucination Rejection Sampling
The SHARS framework is proposed to detect and reject hallucinated content sentence-by-sentence during inference, only retaining verified factual segments for continuous generation. Combined with an improved semantic entropy detector, HalluSE, it improves factual precision by approximately 20–26% on FactScore while maintaining or even increasing the volume of factual information.
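The sentence-level accept/reject loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SHARS implementation: `propose_sentence` and `hallucination_score` are hypothetical callables standing in for the generator and the detector, and the threshold and retry budget are made-up defaults.

```python
from typing import Callable, List


def rejection_sample_generate(
    propose_sentence: Callable[[List[str]], str],
    hallucination_score: Callable[[List[str], str], float],
    max_sentences: int = 5,
    max_retries: int = 3,
    threshold: float = 0.5,
) -> List[str]:
    """Build a response sentence by sentence, keeping only accepted sentences.

    All names and default values are illustrative assumptions.
    """
    accepted: List[str] = []
    for _ in range(max_sentences):
        for _ in range(max_retries):
            candidate = propose_sentence(accepted)
            if not candidate:
                return accepted  # generator signals the end of the response
            if hallucination_score(accepted, candidate) < threshold:
                accepted.append(candidate)  # accepted text becomes new context
                break
            # Otherwise reject and resample from the same accepted prefix.
        else:
            break  # every retry was rejected: stop instead of emitting it
    return accepted
```

Because rejected sentences never enter `accepted`, later sentences are conditioned only on text the detector let through, which is the property the summary attributes to the framework.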
- Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation
The GIFT method is proposed, which constructs a visual saliency map by tracking positive changes in visual attention ("gaze shifts") when a VLM understands user queries, and enhances both visual and query token attention during the decoding phase to maintain cross-modal fusion balance. It achieves up to a 20.7% improvement on CHAIR with only a 1.13× latency increase.
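One way to read "tracking positive changes in visual attention" is to accumulate only the increases in attention each visual token receives as successive query tokens are processed. The sketch below follows that reading; the input layout (one attention vector over visual tokens per query step) and the final normalization are assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def gaze_shift_saliency(attn_steps: Sequence[Sequence[float]]) -> List[float]:
    """Sum the positive step-to-step attention changes per visual token.

    attn_steps[t][i] is the attention on visual token i at query step t
    (an assumed layout). The result is normalized to sum to 1 when possible.
    """
    num_tokens = len(attn_steps[0])
    saliency = [0.0] * num_tokens
    for prev, curr in zip(attn_steps, attn_steps[1:]):
        for i in range(num_tokens):
            saliency[i] += max(0.0, curr[i] - prev[i])  # keep increases only
    total = sum(saliency)
    return [s / total for s in saliency] if total > 0.0 else saliency
```

Tokens whose attention only decreases contribute nothing, so the map highlights regions the model turned toward while reading the query.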
- Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy
This paper finds that LVLM hallucinations stem from two failures around the correct visual evidence: insufficient attention to it, and forgetting it during generation. Observing a significant Inter-Layer Visual Attention Discrepancy (ILVAD) for visual evidence, the authors propose a training-free, plug-and-play method: it constructs a visual evidence saliency map from inter-layer differences, then continuously weights visual evidence tokens and "evidence-grounded" text tokens during generation. This consistently reduces hallucinations across 5 LVLMs and 5 benchmarks.
- From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity
This paper shifts LLM hallucination detection from "analyzing output probabilities" to "analyzing loss landscape curvature." By adding Gaussian noise to embeddings and measuring perturbations in gradient direction and magnitude as a cheap proxy for the Hessian spectral radius, the method achieves AUROC results that consistently outperform baselines such as entropy, Semantic Entropy, and EigenScore across 12 model-dataset combinations.
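The idea of probing curvature through noisy embeddings can be illustrated with a finite-difference sketch: perturb the embedding with Gaussian noise and measure how much the gradient vector changes relative to the noise size. The function below is an assumption-laden toy, not the paper's estimator; `grad_fn` is a hypothetical callable that returns the loss gradient with respect to the embedding, and `sigma`, `num_samples`, and `seed` are illustrative defaults.

```python
import math
import random
from typing import Callable, List, Sequence


def gradient_sensitivity(
    grad_fn: Callable[[Sequence[float]], List[float]],
    embedding: Sequence[float],
    sigma: float = 0.01,
    num_samples: int = 8,
    seed: int = 0,
) -> float:
    """Mean of ||g(x + noise) - g(x)|| / ||noise|| over Gaussian noise draws.

    For a smooth loss this ratio approximates ||H noise|| / ||noise||,
    which is bounded above by the Hessian spectral radius.
    """
    rng = random.Random(seed)
    base = grad_fn(embedding)
    ratios: List[float] = []
    for _ in range(num_samples):
        noise = [rng.gauss(0.0, sigma) for _ in embedding]
        noise_norm = math.sqrt(sum(n * n for n in noise))
        if noise_norm == 0.0:
            continue  # skip a degenerate draw
        perturbed = grad_fn([e + n for e, n in zip(embedding, noise)])
        change = math.sqrt(sum((p - b) ** 2 for p, b in zip(perturbed, base)))
        ratios.append(change / noise_norm)
    return sum(ratios) / len(ratios) if ratios else 0.0
```

On a toy quadratic loss with gradient `a * x`, the score equals `a`, so a sharper loss surface gives a larger score, matching the "flat facts, sharp hallucinations" framing in the title.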
- From Out-of-Distribution Detection to Hallucination Detection: A Geometric View
This paper treats LLM next-token prediction as a classification task over a massive vocabulary. It migrates two lightweight OOD detectors, NCI (proximity between features and weight vectors) and fDBD (distance from features to decision boundaries), and introduces two adaptations: an "analytical proxy \(\mu_G\) for training feature means" and "calculating boundary distances only on top-\(k\) candidate tokens." This results in a training-free, single-sample hallucination detector for reasoning tasks that consistently outperforms baselines like Perplexity, Semantic Entropy, and SelfCheckGPT on CSQA, GSM8K, and AQuA.
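The "boundary distances only on top-\(k\) candidate tokens" adaptation can be sketched for a bias-free linear output head. For the predicted token \(c\) and a rival \(j\), the distance from feature \(z\) to the pairwise hyperplane is \((w_c - w_j)^\top z / \lVert w_c - w_j \rVert\). The code below averages this over the top-\(k\) rivals. It is a simplified illustration: the function name and the default `k` are assumptions, and any normalization involving \(\mu_G\) is omitted because its form is not given here.

```python
import math
from typing import List, Sequence


def topk_boundary_distance(
    feature: Sequence[float],
    weights: Sequence[Sequence[float]],
    k: int = 3,
) -> float:
    """Mean distance from `feature` to the pairwise decision hyperplanes
    between the top-1 token and the other top-k candidates (no bias term)."""
    logits = [sum(w_i * z_i for w_i, z_i in zip(w, feature)) for w in weights]
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    top, rivals = order[0], order[1:k]
    distances: List[float] = []
    for j in rivals:
        diff = [a - b for a, b in zip(weights[top], weights[j])]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm > 0.0:
            distances.append((logits[top] - logits[j]) / norm)
    return sum(distances) / len(distances) if distances else 0.0
```

Under the usual OOD reading, a feature that sits close to the boundaries of its top candidates is treated as less reliable, so a small value would flag a possible hallucination. Restricting the loop to `k` rivals avoids a pass over the full vocabulary.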
- Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing
This paper formalizes "LLM memorization of random facts" as a membership testing problem with continuous confidence scores. It proves that in the sparse limit of facts, the optimal memory overhead equals the minimum KL divergence between the output distributions of facts and non-facts, which establishes a "Rate-Distortion Theorem." It further demonstrates that under a log-loss objective with finite memory, the optimal strategy is neither refusal nor forgetting, but rather mapping a proportion of non-facts and facts to the same high-confidence point; thus, hallucination is an information-theoretically optimal error pattern.
- Hallucinations Undermine Trust; Metacognition is a Way Forward
This position paper argues that the goal of "completely eliminating LLM hallucinations" is fundamentally limited by a "discrimination gap", which translates into a utility tax. The authors advocate shifting the goal from "eliminating hallucinations" to faithful uncertainty, and view such metacognition as an indispensable control layer when agentic LLMs invoke tools.
- Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping
This paper proposes ARS for hallucination detection in large reasoning models (LRMs): instead of perturbing the reasoning trace at the text level, it directly applies small perturbations to the latent representation at the end of the trace and continues decoding to obtain counterfactual answers. Using "answer agreement" as a label, a lightweight contrastive head is trained to shape the trace-conditioned answer embedding, enabling subsequent embedding-based detectors to better separate hallucinations from truthful answers (AUROC on TruthfulQA improves from 66.85 to 86.64).
- Honest Lying: Understanding Memory Confabulation in Reflexive Agents
This paper uncovers a systematic failure mode in Reflexion-style agents termed "memory confabulation": agents write incorrect task understandings into reflective memory and reuse them across trials. The authors quantify this phenomenon using the Reflection Repetition Rate (RRR) and replace open-ended self-diagnosis with programmatic feedback extraction, increasing the correct object mention rate from 0% to 86% and reducing RRR from 0.64 to 0.10 on ALFWorld.
- Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models
This paper finds that the middle-layer embeddings of instruction tokens in MLLMs naturally filter out misleading information introduced from the visual side. Based on this, it proposes the training-free InsLen score (composed of a Calibrated Local Score and a Context Consistency Score), which improves the AUROC of object hallucination detection by up to 13.81% across 5 MLLMs and 4 benchmarks.
- Learning from Fine-Grained Visual Discrepancies: Mitigating Multimodal Hallucinations via In-Context Visual Contrastive Optimization
The original image and contrastive negative image are concatenated into a shared multi-image context. Anchoring instructions are then used to specify which image to observe, allowing the partition functions of visual preference DPO to align automatically and achieve theoretically consistent contrastive goals. Combined with hard negative samples generated through fine-grained surgery-like editing, this approach significantly reduces multimodal hallucinations in VLMs.
- Mitigating Hallucinations in Large Vision-Language Models via Causal Route Gating
CRG performs an exact linear decomposition of each attention head's output into visual and textual routes. It estimates the causal "do-effect" of these routes on the current token using one forward pass and one backward (gradient) pass. By suppressing the textual routes only in heads where the visual and textual signs conflict and the VRI is low (indicating prior dominance), it systematically mitigates language prior hallucinations in LVLMs without requiring training.
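The route decomposition rests on a simple fact: a head's output is a weighted sum of value vectors, so it splits exactly into the part contributed by visual tokens and the part contributed by text tokens. The sketch below shows only that split for a single query position. The function name and the `is_visual` mask are illustrative assumptions, and the gating step is not shown because the VRI is not defined in this summary.

```python
from typing import List, Sequence, Tuple


def split_head_output(
    attn: Sequence[float],
    values: Sequence[Sequence[float]],
    is_visual: Sequence[bool],
) -> Tuple[List[float], List[float]]:
    """Split sum_j attn[j] * values[j] into visual and textual partial sums."""
    dim = len(values[0])
    visual = [0.0] * dim
    textual = [0.0] * dim
    for weight, value, vis in zip(attn, values, is_visual):
        target = visual if vis else textual  # route chosen by token type
        for d in range(dim):
            target[d] += weight * value[d]
    return visual, textual
```

Adding the two returned vectors recovers the full head output, and because the output projection that follows is linear, the same split carries through it.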
- MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue
This paper introduces the MM-Snowball benchmark (4,992 six-turn adversarial dialogues) to systematically characterize the "hallucination snowballing" phenomenon in multimodal large models during long dialogues. It designs a training-free CAVR method that refreshes visual signals at the representation layer and adjudicates text-visual conflicts at the logit layer, significantly flattening the performance collapse curve in late-stage dialogues.
- REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA constructs an "input-dependent dictionary of editing directions" in the LLM latent space to transform adversarial prompt optimization into a continuous problem under a simplex constraint. This approach maintains the semantic equivalence and coherence of discrete methods like SECA while achieving the search flexibility of continuous methods like LARGO. It represents the first successful induction of hallucinations in the free-form outputs of closed-source reasoning models such as GPT-5.
- Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models
This paper redefines LVLM hallucinations as "missing visual information suppressed by language priors." It uses orthogonal projection to remove language priors from raw visual directions to obtain a "pure visual vector," then applies risk-gating for sparse intervention at a single optimal depth. This training-free method reduces the CHAIR\(_S\) hallucination rate by ~19% while preserving MM-Vet general capabilities.
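The two ingredients named above, orthogonal projection and gated intervention, can be sketched in a few lines. This is a hedged illustration, not the Revis procedure: the function names, the scalar `risk` signal, and the `threshold` and `alpha` defaults are all assumptions made for the example.

```python
from typing import List, Sequence


def purify_direction(visual: Sequence[float], prior: Sequence[float]) -> List[float]:
    """Remove from `visual` its component along `prior` (orthogonal projection)."""
    norm_sq = sum(p * p for p in prior)
    if norm_sq == 0.0:
        return list(visual)
    coeff = sum(v * p for v, p in zip(visual, prior)) / norm_sq
    return [v - coeff * p for v, p in zip(visual, prior)]


def gated_steer(
    hidden: Sequence[float],
    direction: Sequence[float],
    risk: float,
    threshold: float = 0.5,
    alpha: float = 1.0,
) -> List[float]:
    """Add `alpha * direction` to `hidden` only when `risk` reaches `threshold`."""
    if risk < threshold:
        return list(hidden)  # low risk: leave the hidden state untouched
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

The purified direction is orthogonal to the prior direction by construction, so steering along it does not push the hidden state further along the language prior. The gate keeps the intervention sparse by skipping low-risk steps.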
- TAG: Tangential Amplifying Guidance for Hallucination-Resistant Sampling
TAG decomposes each diffusion update step into "radial + tangential" components relative to the direction of the current latent variable. It applies an additional amplification factor \(\eta \ge 1\) only to the tangential component. A first-order Taylor expansion shows that this is equivalent to monotonically increasing the log-likelihood gain, thereby pulling samples toward high-density regions of the data manifold and mitigating semantic hallucinations in diffusion models with almost zero extra computational cost.
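The radial/tangential split can be written directly from the description: project the update onto the current latent to get the radial part, take the remainder as the tangential part, and scale only the remainder by \(\eta\). The sketch below assumes flat vectors and a hypothetical function name; how the update is produced by the sampler is outside its scope.

```python
from typing import List, Sequence


def amplify_tangential(
    latent: Sequence[float],
    update: Sequence[float],
    eta: float = 1.5,
) -> List[float]:
    """Return radial + eta * tangential, where the split is taken with
    respect to the direction of `latent`."""
    norm_sq = sum(x * x for x in latent)
    if norm_sq == 0.0:
        return list(update)  # no direction to project onto
    coeff = sum(u * x for u, x in zip(update, latent)) / norm_sq
    radial = [coeff * x for x in latent]
    tangential = [u - r for u, r in zip(update, radial)]
    return [r + eta * t for r, t in zip(radial, tangential)]
```

With `eta = 1` the update is returned unchanged, and the extra work is one dot product and a few vector operations per step, which is consistent with the near-zero overhead claim.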
- When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets (CAIA)
CAIA establishes the first "adversarial high-stakes" agent benchmark, evaluating 17 cutting-edge large models on 178 time-anchored real-world cryptocurrency tasks. Key findings: without tools, all models achieve only 12–28% accuracy (near random guessing); with tools, even the strongest model, GPT-5, reaches only 67.4%, versus 80% for human junior analysts. More critically, 55.5% of model tool calls prefer "unreliable web search" over authoritative on-chain data, and Pass@k metrics systematically mask this dangerous "trial-and-error luck" behavior.
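A short calculation shows why Pass@k can hide unreliable behavior. If each attempt succeeds independently with probability \(p\), the chance that at least one of \(k\) attempts succeeds is \(1 - (1 - p)^k\). The independence assumption and the value of `p` used below are hypothetical and are not taken from the CAIA results.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds,
    given per-attempt success probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p) ** k


# Hypothetical agent that is right only 20% of the time per attempt.
single = pass_at_k(0.2, 1)   # 0.2
retried = pass_at_k(0.2, 10)  # about 0.893
```

An agent that is wrong four times out of five on any single attempt still scores close to 90% at `k = 10`, which is acceptable in a sandbox but not where each wrong attempt has a real cost.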