
👻 Hallucination Detection

🔬 ICLR2026 · 40 paper notes

📌 Same area in other venues: 📷 CVPR2026 (33) · 💬 ACL2026 (28) · 🧪 ICML2026 (21) · 🤖 AAAI2026 (15) · 🧠 NeurIPS2025 (17) · 📹 ICCV2025 (5)

🔥 Top topics: Multimodal/VLM ×9 · LLM ×7 · Reasoning ×5 · Alignment/RLHF ×2

AFTER: Mitigating Object Hallucinations in LVLMs with Adaptive Fact-guided Activation Editing: AFTER textualizes ground-truth image annotations into three categories of facts (category, attribute, and relationship). It constructs positive vision-text editing directions based on the activation difference between these factual descriptions and the original images. A lightweight estimator is then trained to estimate per-query offsets, adaptively pushing LVLM activations toward factual semantics, reducing hallucinations by up to 16.3% on the AMBER benchmark.
BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs: Addressing the tendency of Large Reasoning Models (LRMs) to "hallucinate rather than admit ignorance" in factual QA, this paper identifies two pathological reasoning patterns triggered by "factual overthinking." It proposes BARREL, a three-stage training framework (Knowledge Boundary Labeling \(\rightarrow\) Boundary-Aware SFT \(\rightarrow\) GRPO with Reliability Rewards). BARREL improves the reliability of DeepSeek-R1-Distill-Llama-8B from 39.33% to 61.48% while simultaneously increasing accuracy.
Beyond In-Domain Detection: SpikeScore for Cross-Domain Hallucination Detection: The authors discovered that multi-turn self-dialogues elicited from hallucinated answers exhibit uncertainty score fluctuations with far more intense "spikes" than those from truthful answers. They quantify this volatility as SpikeScore (the maximum second-order difference of the score sequence). By using a single threshold, SpikeScore enables hallucination detection across multiple domains while being trained only on a single domain. Its cross-domain AUROC consistently outperforms specialized methods like PRISM and ICR Probe across four LLMs and six benchmarks.
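The SpikeScore definition above (the maximum second-order difference of the score sequence) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the optional `absolute` flag, and the thresholding helper are assumptions introduced here, and the per-turn uncertainty scores are taken as given.

```python
from typing import Sequence


def spike_score(scores: Sequence[float], absolute: bool = False) -> float:
    """Maximum second-order difference of a per-turn uncertainty-score sequence.

    The second-order difference at turn t is s[t+1] - 2*s[t] + s[t-1].
    `absolute` is a hypothetical option (not stated in the summary) that
    measures spike magnitude regardless of direction.
    """
    if len(scores) < 3:
        raise ValueError("need at least three scores to form a second-order difference")
    second_diffs = [
        scores[t + 1] - 2 * scores[t] + scores[t - 1]
        for t in range(1, len(scores) - 1)
    ]
    if absolute:
        second_diffs = [abs(d) for d in second_diffs]
    return max(second_diffs)


def is_hallucination(scores: Sequence[float], threshold: float) -> bool:
    """Flag an answer when its self-dialogue scores spike above a single threshold."""
    return spike_score(scores) > threshold
```

A smoothly drifting sequence such as `[0.1, 0.2, 0.3, 0.4]` yields a score near zero, while an oscillating one such as `[0.1, 0.9, 0.1, 0.9]` yields a large score; the threshold value itself would have to be chosen on the training domain.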
Cat-PO: Cross-modal Adaptive Token-rewards for Preference Optimization in Truthful Multimodal LLMs: Addressing the hallucination issue in MLLMs, this paper proposes Cat-PO: using only the model's internal cross-modal attention and similarity, it calculates a three-tier visual relevance (global, local, and semantic) for each generated token. These are fused into a smooth token reward to reweight the DPO loss along with a token-level KL regularization for fine-grained hallucination correction, outperforming existing SOTAs by 7%–15% on benchmarks like AMBER-Generation and MM-Hal.
ChainMPQ: Interleaved Text-Image Reasoning Chains for Mitigating Relation Hallucinations: ChainMPQ is a training-free reasoning framework that decomposes "Subject-Relation-Object" questions into five complementary sub-questions. These are fed sequentially to Large Vision-Language Models (LVLMs), passing textual answers and visual attention memory to subsequent steps to form an interleaved text-image reasoning chain, consistently reducing relation hallucinations across multiple LVLMs and benchmarks.
CoFact: Conformal Factuality Guarantees for Language Models under Covariate Shift: CoFact replaces the fixed "conformal threshold" in LLM factuality control with an adaptive threshold that adjusts to online test distribution drifts. By using online density ratio estimation to dynamically reweight the calibration set, the method ensures that the hallucination rate does not exceed a user-defined \(\alpha\), even in realistic scenarios with continuous covariate shift in prompt streams and unavailable test labels.
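The idea of reweighting the calibration set with density ratios can be illustrated with a generic weighted conformal quantile. This is a sketch under stated assumptions rather than CoFact itself: the function name is hypothetical, the density ratios are taken as already estimated, and the extra point mass that standard weighted conformal prediction assigns to the test point is omitted for brevity.

```python
from typing import Sequence


def weighted_conformal_threshold(
    scores: Sequence[float], weights: Sequence[float], alpha: float
) -> float:
    """Smallest calibration score whose cumulative normalized weight reaches 1 - alpha.

    `scores` are calibration nonconformity scores; `weights` are (assumed
    pre-computed) density ratios p_test(x) / p_calib(x) for each calibration point.
    """
    if len(scores) != len(weights) or len(scores) == 0:
        raise ValueError("scores and weights must be non-empty and of equal length")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have positive total mass")

    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight / total
        if cumulative >= 1.0 - alpha:
            return score
    return max(scores)  # guard against floating-point shortfall
```

With uniform weights this reduces to an ordinary empirical quantile; when the weights shift mass toward high-score calibration points, the threshold moves up accordingly, which is the adaptive behaviour the entry describes.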
Copy-Paste to Mitigate Large Language Model Hallucinations: The authors propose the Copy-Paste generation paradigm, which trains LLMs to prioritize directly copying segments from the retrieval context rather than free paraphrasing. Combined with DPO training for high copy preference, this approach improves faithfulness from 80.2% to 92.8% on counterfactual RAG benchmarks.
Critical Confabulations: Can LLMs Hallucinate for Social Good?: This paper reframes "hallucination" as a viable resource: it proposes critical confabulation, where LLMs "fill in" structural gaps in historical archives under evidentiary constraints. By evaluating 19 models on a "narrative cloze" task using unpublished Black history corpora, the authors demonstrate that controlled, well-defined hallucinations can serve knowledge production without collapsing into falsehood.
Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models: This paper proposes Dynamic Multimodal Activation Steering (DMAS), which dynamically selects relevant steering vectors from a semantic-based truthfulness database and vision-aware vectors to intervene in critical attention heads during inference. It significantly mitigates LVLM hallucinations without training, improving MME by 94.66 points and reducing the CHAIR hallucination rate by 20.2%.
EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models: EmotionHallucer is a hallucination evaluation benchmark for MLLM emotion understanding. It decomposes emotion hallucinations into two primary dimensions: "Emotional Psychology Knowledge" and "Real Multimodal Emotion Perception." Using paired basic/hallucinated binary QA, it detects whether models can both make fundamental emotional judgments and reject plausible but incorrect emotional descriptions. Furthermore, the proposed PEP-MEK inference framework improves model performance on the multimodal emotion perception subset by an average of 9.90%.
Enhancing Hallucination Detection through Noise Injection: By injecting uniform noise into the MLP activations of intermediate LLM layers to approximate the Bayesian posterior, this method captures epistemic uncertainty. It complements aleatoric uncertainty captured via sampling temperature, improving the hallucination detection AUROC on GSM8K from 71.56 to 76.14.
Estimating Semantic Alphabet Size for LLM Uncertainty Quantification: This paper identifies that the classic "Discrete Semantic Entropy" (DSE) systematically underestimates true semantic entropy in low-sample regimes. Drawing from the "unseen species" problem in population ecology, the authors propose a hybrid semantic alphabet size estimator and apply coverage correction to DSE. This allows black-box uncertainty estimation to match or exceed complex SOTA methods like KLE and SNNE while maintaining superior interpretability.
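To make the underestimation issue concrete, the sketch below computes plain discrete semantic entropy from cluster labels, together with the classic Good-Turing sample-coverage estimate from the unseen-species literature. The Good-Turing estimate is a stand-in chosen for illustration; it is not claimed to be the paper's hybrid estimator, and the function names are hypothetical. Cluster labels (one per sampled answer) are assumed to come from an upstream semantic-equivalence step.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def discrete_semantic_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Plug-in entropy (in nats) of the empirical distribution over semantic clusters."""
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def good_turing_coverage(cluster_ids: Sequence[Hashable]) -> float:
    """Good-Turing estimate of the probability mass covered by observed clusters.

    Coverage = 1 - (number of clusters seen exactly once) / (number of samples).
    Low coverage suggests many semantic clusters are still unseen.
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    counts = Counter(cluster_ids)
    singletons = sum(1 for c in counts.values() if c == 1)
    return 1.0 - singletons / len(cluster_ids)
```

For four samples falling into clusters `["a", "a", "b", "c"]`, the plug-in entropy is 1.5 ln 2 while the estimated coverage is only 0.5, which hints at how much probability mass a small sample can miss.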
FREAK: A Fine-grained Hallucination Evaluation Benchmark for Advanced MLLMs: FREAK utilizes an automated "generate-then-edit" pipeline to create 1,786 photorealistic counter-commonsense (CCS) images and 1,799 questions. It specifically targets the fine-grained visual perception hallucinations of SOTA MLLMs—even the strongest models achieve only 45% accuracy, significantly lower than the human baseline of 86.71%, while confirming that CoT reasoning tends to degrade performance on such tasks.
GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs: GHOST moves away from evaluating object hallucinations in Multimodal LLMs (MLLMs) using fixed static benchmarks. Instead, it actively generates a set of images that appear natural and object-free to humans but trick models into believing a target object is present. This approach increases the hallucination success rate from approximately 1% in existing methods to over 28% and reveals that these images are highly transferable across different models.
Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation: This paper investigates the problem of hallucination in Sign Language Translation (SLT) for the first time, proposing a token-level "reliability" score. By using feature sensitivity and counterfactual perturbations, it quantifies whether the decoder is "grounding on the video" or "guessing based on language priors," thereby predicting hallucination risk without reference translations and revealing why gloss-free models are more prone to hallucinations.
Hallucination-aware Intermediate Representation Edit in Large Vision-Language Models: HIRE does not perform retraining or double forward passes. Instead, it executes "in-place editing" of intermediate representations in LVLMs. Using a dual encoder, it disentangles hallucination components from semantics and shifts them along a "de-hallucination direction." A lightweight Router is employed to intervene only on high-risk tokens. HIRE achieves SOTA results across three benchmarks with inference overhead close to the original model, while also supporting controllable generation to amplify hallucinations via a single hyperparameter.
Hallucination Begins Where Saliency Drops: This paper proposes LVLMs-Saliency, a gradient-aware diagnostic framework that quantifies the visual anchoring strength of each output token. It uncovers a critical pattern: "hallucination occurs when the saliency of previous output tokens regarding the next token prediction decreases." Based on this, a dual-mechanism inference-time framework, SGRS (Saliency-Guided Rejection Sampling) + LocoRE (Local Consistency Restoration), is designed to significantly reduce hallucination rates across multiple LVLMs.
Hallucination Reduction with CASAL: Contrastive Activation Steering for Amortized Learning: CASAL "amortizes" inference-time activation steering into model weights—by training only a sub-module of a single layer using only representation loss (without cross-entropy), the LLM learns to "answer what it knows and abstain from what it doesn't." This reduces hallucination rates by 30%–40% on multiple short-form QA benchmarks while requiring ~30× less compute and ~20× less data than LoRA-style baselines.
HalluGuard: Demystifying Data-Driven and Reasoning-Driven Hallucinations in LLMs: This paper proposes a unified theoretical framework called the "Hallucination Risk Bound," which decomposes the hallucination risk of LLMs into a data-driven term (representational bias during training) and a reasoning-driven term (instability during decoding) using the triangle inequality. Based on this, the authors design HalluGuard, an NTK-based spectral proxy score that requires no external references or hallucination annotations, achieving consistent SOTA performance across 10 benchmarks, 11 baselines, and 9 backbones.
HARP: Hallucination Detection via Reasoning Subspace Projection: HARP decomposes the LLM hidden state space into "Semantic Subspace ⊕ Reasoning Subspace." By performing SVD on the Unembedding layer to identify basis vectors for the reasoning subspace and projecting hidden states onto this subspace (occupying only ~5% of dimensions) as a hallucination detection feature, it pushes AUROC to 92.8% on TriviaQA (7.5 percentage points higher than the previous best).
High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning: HALT decomposes answers generated by the pre-trained model into "fact snippets" during the finetuning stage and uses a ground-truth-based evaluator to verify each snippet. It retains only the parts the model can correctly generate, replacing the rest with "Unsure from here," thereby training a reliable LLM that "only says what it knows." With an adjustable threshold to balance completeness and accuracy, a single Llama3-70B improved its average accuracy across four domains from 51% to 87%.
Imitating the Truth: Attention-aware Truth-Guided Enhancement for Hallucination Mitigation in Large Vision-Language Models: This paper discovers that LVLMs exhibit phased and model-specific attention differences when generating "truth tokens" versus "hallucinated tokens." It proposes AGE, a training-free framework that "calibrates" visual and textual attention during inference to mimic the attention patterns of truth tokens, thereby mitigating hallucinations without retraining or compromising fluency.
Learning to Reason for Hallucination Span Detection: This paper proposes RL4HS: using reinforcement learning (GRPO based on span-F1 rewards) to train 7B/14B models to perform "reason-then-locate" for precise hallucination span detection. It introduces Class-Aware Policy Optimization (CAPO) to correct systematic reward biases towards the "no-hallucination" class, outperforming SFT and proprietary large reasoning models (GPT-5, o3) on RAGTruth.
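A span-F1 reward of the kind mentioned above can be sketched as character-level overlap between predicted and gold spans. The exact reward definition used by RL4HS is not given in this summary, so the character-level granularity, the half-open `(start, end)` span convention, and the score of 1.0 when both sides are empty are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int]  # half-open character offsets: [start, end)


def _covered_chars(spans: Iterable[Span]) -> Set[int]:
    """Set of character positions covered by any span (overlaps are merged)."""
    return {i for start, end in spans for i in range(start, end)}


def span_f1(predicted: Iterable[Span], gold: Iterable[Span]) -> float:
    """Character-level F1 between predicted and gold hallucination spans."""
    pred_chars = _covered_chars(predicted)
    gold_chars = _covered_chars(gold)
    if not pred_chars and not gold_chars:
        return 1.0  # assumed convention: correctly predicting "no hallucination"
    overlap = len(pred_chars & gold_chars)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)
```

For example, a predicted span `(0, 10)` against a gold span `(5, 15)` shares five characters, giving precision and recall of 0.5 each and an F1 of 0.5.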
Leveraging Pretrained Knowledge at Inference Time: LoRA-Gated Contrastive Decoding for Multilingual Factual Language Generation in Adapted LLMs: LGCD utilizes SVD to decompose the FFN weight difference between the "original pretrained model vs. language-adapted model" into a set of LoRA matrices. During decoding, it dynamically triggers contrastive decoding based on token confidence to "re-inject" factual knowledge lost during the adaptation process—without requiring training or access to the original pre-training data.
Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation: The AIR (Adaptive vIsual Reinforcement) framework is proposed to reduce MLLM hallucinations during inference without training by utilizing prototype distance-based token pruning and Optimal Transport-guided selective patch reinforcement (LLaVA-1.5-7B CHAIR_S: 22→18.4, POPE Accuracy +5.3%), while maintaining general multimodal performance.
LUMINA: Detecting Hallucinations in RAG System with Context-Knowledge Signals: This paper proposes the Lumina framework, which detects hallucinations in RAG systems via "context-knowledge signals": MMD is used to measure the extent of external context utilization, while cross-layer token prediction evolution measures internal knowledge utilization. The method generalizes without hyperparameter tuning.
Mechanistic Detection and Mitigation of Hallucination in Large Reasoning Models: This paper proposes a mechanistic interpretability-based Reasoning Score (using LogitLens to measure the distribution drift of late-layer logits to characterize "reasoning depth"). Based on this, it reveals three internal patterns of reasoning hallucinations, constructs the RHD detection framework, and adapts GRPO into GRPO-R using potential-based reward shaping to mitigate hallucinations.
Micro-Macro Retrieval: Reducing Long-Form Hallucination in Large Language Models: M2R proposes a "Macro + Micro Retrieval" dual-layer framework: during the reasoning phase, coarse-grained evidence is retrieved from external sources and answer-aligned key information is stored in a key-value bank; during the answering phase, micro-retrieval is used to extract these key facts and place them directly next to the answer tokens. Trained via GRPO and curriculum learning, it fundamentally mitigates hallucinations in long-form generation.
Mitigating Hallucination in Vision-Language Model with Depth and Spatial-aware Key-Value Refinement: The authors observe that visual hallucinations in VLMs stem from the "loss of coherence and isotropic divergence of Key vectors for adjacent visual tokens." Consequently, they propose DSCR, a training-free method that utilizes monocular depth and 2D spatial proximity to regroup Key/Value vectors of the same object and push apart those across different surfaces. Without fine-tuning, this redirects cross-modal attention back to relevant regions, achieving up to a 41.6% accuracy improvement across five hallucination benchmarks.
NDAD: Negative-Direction Aware Decoding for Large Language Models via Controllable Hallucination Signal Injection: NDAD takes an unconventional approach: instead of "mining" factual signals from early layers to boost, it actively masks important attention heads to induce hallucination signals. These signals are then used as a "negative direction" and subtracted from the final output distribution, enhancing the factual reliability of LLMs without retraining or external knowledge.
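Subtracting a "negative direction" from the output distribution can be illustrated with a generic contrastive-decoding rule in log-probability space. This is a sketch, not NDAD's actual formula: the combination rule `(1 + alpha) * base - alpha * negative`, the `alpha` parameter, and the function names are assumptions, and the hallucination-induced logits (from a pass with important attention heads masked) are taken as given.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def subtract_negative_direction(
    base_logits: Sequence[float],
    hallucinated_logits: Sequence[float],
    alpha: float = 1.0,
) -> List[float]:
    """Return next-token probabilities after penalizing the hallucination-induced pass.

    Tokens that the head-masked (hallucination-prone) pass favors are pushed down;
    alpha = 0 recovers the ordinary softmax of the base logits.
    """
    if len(base_logits) != len(hallucinated_logits):
        raise ValueError("both passes must score the same vocabulary")
    base = log_softmax(base_logits)
    negative = log_softmax(hallucinated_logits)
    contrasted = [(1 + alpha) * b - alpha * n for b, n in zip(base, negative)]
    return [math.exp(x) for x in log_softmax(contrasted)]
```

If the base pass is undecided between two tokens, e.g. logits `[2.0, 2.0, 0.0]`, and the masked pass strongly prefers the second, e.g. `[0.0, 3.0, 0.0]`, the contrasted distribution shifts toward the first token.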
Neural Message-Passing on Attention Graphs for Hallucination Detection: The authors treat internal attention matrices and activations of LLMs as an "attributed directed graph" (tokens as nodes, attention flow as edges). A GNN is used for message passing to detect hallucinations. It is theoretically proven that this framework encompasses previous attention-based heuristics while empirically surpassing them.
P2-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization: P2-DPO enables Large Vision-Language Models (LVLMs) to automatically generate on-policy, vision-grounded preference pairs (focus enhancement + noise resistance) targeting their own perceptual weaknesses. By utilizing a calibrated DPO loss to align the causal relationship between visual signals and text generation, the method outperforms strong baselines trained on expensive human feedback across hallucination benchmarks without relying on any manual annotation.
PostAlign: Multimodal Grounding as a Corrective Lens for MLLMs: PostAlign treats "visual grounding (localization boxes/masks) + textual grounding (reasoning rationale)" as a corrective lens post-positioned on MLLMs. It uses a <REJ> rejection token to empower the model to reject non-existent objects and employs <SIMPLE>/<COMPLEX> routing signals to decide whether to generate intermediate reasoning based on question difficulty, significantly reducing hallucinations on benchmarks like POPE and HaloQuest while preserving general reasoning capabilities.
Seeing What's Wrong: A Trajectory-Guided Approach to Caption Error Detection: This paper proposes TRACED: instead of judging caption correctness via a single image-text similarity score, it iteratively edits the caption to maximize alignment, generating a "caption trajectory." Features derived from the improvement magnitude and semantic shifts of this trajectory are used to train a classifier. TRACED improves detection accuracy by up to 2.8% on MS COCO, Flickr30k, and MM-IMDb, localizes specific erroneous words, and guides VLMs to increase corrected caption alignment scores by up to 14.5%.
Semantic Uncertainty Quantification of Hallucinations in LLMs: A Quantum Tensor Network Based Method: To address the blind spot where semantic entropy ignores the "stochastic fluctuations of token sequence probabilities themselves," this paper embeds the kernel mean embedding (KME) of sequence probabilities as a wave function of a Quantum Tensor Network (QTN). It employs perturbation theory to calculate the local uncertainty of each probability in a one-shot manner. These probabilities are then calibrated via "entropy maximization + KL penalty weighted by inverse uncertainty" to derive an interpretable Semantic Rényi Entropy that is more sensitive to confabulation. Across 116 experimental setups involving 4 datasets, 8 models, and 3 quantization levels, the method consistently outperforms SOTA in AUROC/AURAC.
SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense: This work systematically traces LVLM object hallucinations back to the visual encoder for the first time, identifying three major issues: statistical bias (over-emphasis on high-frequency pattern tokens), inherent bias (residual representations of dominant pre-training objects), and vulnerability (feature distortion caused by minor perturbations). It proposes SHIELD, a training-free framework that synergistically defends against these via token re-weighting, token subtraction, and contrastive decoding, outperforming methods like VCD and OPERA on LLaVA-1.5, InstructBLIP, and Qwen-VL.
Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding: This paper proposes Token-Guard, a token-level hallucination control method based on self-checking decoding. Through token-level/segment-level scoring in latent space and an iterative refinement mechanism, it detects and suppresses hallucination generation during the decoding process, achieving an average F1 improvement of 16.3%.
TraceDet: Hallucination Detection from the Decoding Trace of Diffusion Large Language Models: Focusing on hallucination signals exposed during the multi-step denoising process of Diffusion Large Language Models (D-LLMs), this paper models the denoising trace as an "action trajectory." It utilizes the Information Bottleneck principle to automatically select sub-trajectories that are most informative regarding hallucinations to train a classifier, improving the hallucination detection AUROC by an average of 15.2% across two open-source D-LLMs and three QA datasets.
VeriTrail: Closed-Domain Hallucination Detection with Traceability: VeriTrail is the first closed-domain hallucination detection method to provide traceability for multi-generative-step (MGS) processes. It models the generation process as a Directed Acyclic Graph (DAG) and verifies facts layer by layer along paths. The authors also establish the first MGS datasets containing all intermediate outputs and human annotations.
Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow: This paper identifies "hallucination snowballing" in VLM Multi-Agent Systems (MAS)—where a visual misjudgment by one agent is progressively amplified by subsequent agents via pure text streams. Through turn-wise, layer-wise, and token-level attention analysis, the authors locate "middle-layer unimodal visual tokens" as the critical carriers of visual evidence. They propose ViF: establishing an additional "visual flow" between agents using these visual relay tokens combined with attention reallocation. This model-agnostic approach mitigates snowballing and achieves consistent 2.4–3.8% improvements across 8 benchmarks, 4 MAS structures, and 10 backbones.