
👻 Hallucination Detection

🧠 NeurIPS2025 · 17 paper notes

📌 Same area in other venues: 📷 CVPR2026 (33) · 🔬 ICLR2026 (40) · 💬 ACL2026 (28) · 🧪 ICML2026 (21) · 🤖 AAAI2026 (15) · 📹 ICCV2025 (5)

🔥 Top topics: Multimodal/VLM ×6 · LLM ×4 · Reasoning ×3 · Alignment/RLHF ×2 · Adversarial Robustness ×2

Auditing Meta-Cognitive Hallucinations in Reasoning Large Language Models: This paper systematically audits the generation and propagation mechanisms of hallucinations in reasoning large language models (RLLMs), finding that reflection in long CoT amplifies hallucinations through metacognitive bias rather than correcting them. Even targeted interventions at the hallucination source fail to alter final outputs (chain disloyalty), exposing critical shortcomings of existing hallucination detection methods in multi-step reasoning scenarios.
Benford's Curse: Tracing Digit Bias to Numerical Hallucination in LLMs: This paper traces numerical hallucinations in LLMs to the digit frequency distribution of pretraining corpora, which follows Benford's Law (digit 1 appears with ~30% probability, while digit 9 appears with only ~5%), and shows that this bias is internalized by specific "digit-selective neurons" in the later FFN layers. It proposes a Digit Selectivity Coefficient (DSC) to localize the biased neurons; pruning 0.01% of neurons corrects 1.36–3.49% of erroneous predictions.
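The ~30% and ~5% figures quoted above match the textbook first-digit form of Benford's Law, \(P(d) = \log_{10}(1 + 1/d)\), which gives about 30.1% for digit 1 and about 4.6% for digit 9. The short sketch below only reproduces that distribution; it is not the paper's DSC computation, and the function name is a hypothetical chosen for illustration.

```python
import math

def benford_first_digit_probs():
    """Return P(d) = log10(1 + 1/d) for leading digits d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

probs = benford_first_digit_probs()
for digit, p in probs.items():
    # Print each leading digit with its Benford probability as a percentage.
    print(f"digit {digit}: {p:.1%}")
```

The nine probabilities sum to 1, since the product of the terms (1 + 1/d) for d = 1..9 telescopes to 10.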
Beyond Token Probes: Hallucination Detection via Activation Tensors with ACT-ViT: This paper organizes all hidden-layer activations of an LLM into an "activation tensor" (layers × tokens × hidden dimension), treats it analogously to an image, and processes it with a ViT-based architecture (ACT-ViT) that supports joint training across multiple LLMs. The method consistently outperforms conventional probing approaches across 15 LLM–dataset combinations and demonstrates strong zero-shot/few-shot transfer to unseen datasets and unseen LLMs.
Causal-LLaVA: Causal Disentanglement for Mitigating Hallucination in Multimodal Large Language Models: This paper identifies the root cause of object hallucination in MLLMs at the representation level—semantic entanglement induced by dataset co-occurrence bias—and proposes a dual-path causal disentanglement framework (Causal-Driven Projector + Causal Intervention Module). By applying backdoor adjustment at both the projector and the final Transformer layer to decouple co-occurring object representations, the method achieves a 22.6% improvement on MME-Perception.
Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers: This paper argues that LLM generalization and hallucination share a common mechanism — out-of-context reasoning (OCR) — and provides theoretical guarantees on a single-layer attention model: the factorized parameterization \((W_O, W_V)\) can perform OCR due to the nuclear norm implicit bias of gradient descent, whereas the merged parameterization \(W_{OV}\) cannot due to its Frobenius norm bias. Moreover, OCR is sample-efficient (requiring only \(m_{\text{train}}>0\)).
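For readers less familiar with the two norms named above, the following standard matrix-analysis facts (general background, not a statement of the paper's proof) give some intuition for why a factorized parameterization can carry a nuclear-norm bias. Here \(\sigma_i(W)\) denotes the singular values of \(W\), and the minimum is taken over factorizations whose inner dimension is at least \(\operatorname{rank}(W)\):

\[
\|W\|_* = \sum_i \sigma_i(W) = \min_{W = W_O W_V} \tfrac{1}{2}\left(\|W_O\|_F^2 + \|W_V\|_F^2\right),
\qquad
\|W\|_F = \sqrt{\sum_i \sigma_i(W)^2}.
\]

Keeping the factors \((W_O, W_V)\) small in Frobenius norm therefore corresponds to keeping the product small in nuclear norm, which favors low-rank solutions, whereas penalizing \(\|W_{OV}\|_F\) directly does not favor low rank.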
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling: This paper proposes REVERSE, the first framework to unify generation adjustment and post-hoc verification within a single VLM. Through hallucination-aware training on 1.3M semi-synthetic samples combined with inference-time retrospective resampling, REVERSE enables a VLM to automatically detect and correct hallucinations during generation, achieving a 12% reduction on CHAIR-MSCOCO and a 34% improvement on HaloQuest.
GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity: GLSim is a training-free object hallucination detection method for LVLMs that combines a global scene similarity score (cosine similarity between the object token and the last instruction token) and a local visual grounding similarity score (cosine similarity between the object token and the Top-K image patch embeddings localized via Visual Logit Lens). It achieves 83.7% AUROC on MSCOCO, surpassing SVAR by 9% and Internal Confidence by 10.8%.
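The two-score idea can be sketched as follows. This is a minimal illustration under stated assumptions, not GLSim's implementation: all names are hypothetical, `patch_relevance` stands in for whatever per-patch scores the Visual Logit Lens localization would produce, averaging over the Top-K patches is an assumption, and the weighted sum with `w` is an assumed way of combining the global and local scores.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def global_local_score(obj_emb, instr_emb, patch_embs, patch_relevance, k=4, w=0.5):
    """Combine a global and a local similarity score for one object token.

    obj_emb:         embedding of the object token (hypothetical input)
    instr_emb:       embedding of the last instruction token
    patch_embs:      array of image patch embeddings, shape (num_patches, dim)
    patch_relevance: per-patch relevance scores (assumed stand-in for the
                     Visual Logit Lens localization)
    w:               mixing weight between the two scores (an assumption)
    """
    # Global score: object token vs. last instruction token.
    s_global = cosine(obj_emb, instr_emb)
    # Local score: mean similarity to the k most relevant patches.
    k = min(k, len(patch_embs))
    top = np.argsort(patch_relevance)[-k:]
    s_local = float(np.mean([cosine(obj_emb, patch_embs[i]) for i in top]))
    return w * s_global + (1 - w) * s_local
```

Under this sketch a lower combined score would flag an object mention as more likely hallucinated; the decision threshold would have to be chosen on held-out data.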
Hallucination as an Upper Bound: A New Perspective on Text-to-Image Evaluation: This paper proposes a definition of hallucination in text-to-image (T2I) models as bias-driven deviation, establishes a taxonomy of three hallucination categories—attribute, relation, and object—and argues that hallucination evaluation serves as an "upper bound" for prompt alignment evaluation, thereby revealing hidden model biases.
Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats: This paper proposes AllPath, a multi-path hallucination intervention framework grounded in the Transformer causal architecture. It is the first to demonstrate that hallucinations in LVLMs do not stem from a single causal path but from the interaction of three paths — image-to-input-text, image-to-output-text, and text-to-text — and that models adaptively rely on different paths depending on the question-answer alignment format. By designing lightweight key-head identification methods for each path and performing adaptive intervention, AllPath consistently reduces hallucinations across four benchmarks covering different alignment formats: POPE, MCQ-POPE, CHAIR, and MME.
Mitigating Hallucination Through Theory-Consistent Symmetric Multimodal Preference Optimization: This paper proposes SymMPO (Symmetric Multimodal Preference Optimization), which addresses two key limitations of existing vision-augmented DPO methods—namely, theoretically unsound objective functions and indirect preference supervision—through symmetric paired preference learning over contrastive images and preference margin consistency regularization. Consistent performance gains are achieved across five hallucination benchmarks.
Reasoning Models Hallucinate More: Factuality-Aware Reinforcement Learning for Large Reasoning Models: This paper reveals that RL-trained reasoning models (e.g., DeepSeek-R1) hallucinate significantly more than non-reasoning models, theoretically identifies three root causes (high-variance gradients, entropy constraints, and spurious local optima), and proposes the FSPO algorithm, which adjusts token-level advantages via step-level factuality verification to reduce hallucination while maintaining or even improving reasoning capability.
Robust Hallucination Detection in LLMs via Adaptive Token Selection: HaMI frames hallucination detection as a Multiple Instance Learning (MIL) problem, treating each generated sequence as a bag of token instances. By jointly optimizing token selection and hallucination detection, it adaptively identifies the most informative tokens, improving AUROC over existing methods by up to 11.9% across four QA benchmarks.
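The bag-of-tokens framing can be illustrated with a generic top-k MIL pooling step. This is a sketch of the general MIL technique under assumptions, not HaMI's training procedure: the linear scorer (`w`, `b`), the top-k mean aggregation, and all names are hypothetical choices for illustration.

```python
import numpy as np

def bag_score(token_feats, w, b=0.0, k=3):
    """Score one generated sequence (a bag) from its token features.

    token_feats: array of shape (num_tokens, dim), one row per token instance
    w, b:        parameters of an assumed linear per-token scorer
    k:           number of highest-scoring tokens to aggregate
    Returns the bag-level score and the indices of the selected tokens.
    """
    # Per-token (instance-level) scores.
    logits = token_feats @ w + b
    # Select the k tokens with the highest scores.
    k = min(k, len(logits))
    top_idx = np.argsort(logits)[-k:]
    # Bag-level score: mean of the selected instance scores.
    return float(logits[top_idx].mean()), top_idx
```

Only the selected tokens contribute to the sequence-level score, so in a trained version of this setup the gradient would flow through the tokens the scorer itself considers most informative.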
SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations: This paper proposes SECA (Semantically Equivalent and Coherent Attacks), a realistic prompt perturbation framework that elicits LLM hallucinations while preserving semantic equivalence and coherence, achieving higher attack success rates on multiple-choice QA tasks with near-zero semantic errors.
Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models: This paper addresses OCR hallucinations in MLLMs under degraded document conditions. It introduces KIE-HVQA, the first benchmark for evaluating hallucinations in degraded document scenarios, and proposes a multi-objective reward reinforcement learning framework based on GRPO. The resulting 7B-parameter model achieves approximately 28% higher hallucination-suppression accuracy than GPT-4o.
Systematic Reward Gap Optimization for Mitigating VLM Hallucinations: This paper proposes Topic-level Preference Rewriting (TPR), which systematically optimizes the reward gap configuration in preference data through fine-grained semantic control at the topic level, combined with a curriculum learning strategy that progressively increases the difficulty of negative samples, achieving approximately 93% hallucination reduction across multiple hallucination benchmarks.
Teaming LLMs to Detect and Mitigate Hallucinations: This paper generalizes single-model consistency methods (Self-Consistency + Semantic Entropy) to a multi-model "consortium" setting comprising heterogeneous LLMs. By aggregating responses from models with diverse training backgrounds, the approach breaks the consistent hallucinations that arise within a single model. Evaluating a large number of consortium combinations over a pool of 15 LLMs, the paper finds that well-matched strong-model consortia outperform the strongest single-model baseline in 92% of cases while incurring lower inference cost.
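The consistency signal can be sketched as an entropy over answer clusters pooled from several models. This is a simplified illustration, not the paper's pipeline: semantic-entropy methods typically cluster answers by meaning, whereas the `normalize` helper below uses normalized string matching as an assumed stand-in, and all names are hypothetical.

```python
import math
from collections import Counter

def normalize(answer):
    """Crude stand-in for semantic clustering: case and whitespace folding."""
    return " ".join(answer.lower().strip().rstrip(".").split())

def consortium_vote(answers):
    """Aggregate answers from several models.

    answers: list of answer strings, one per model (or per sample)
    Returns the majority cluster and the entropy (in nats) of the
    cluster distribution; higher entropy means more disagreement.
    """
    clusters = Counter(normalize(a) for a in answers)
    total = sum(clusters.values())
    entropy = -sum((c / total) * math.log(c / total) for c in clusters.values())
    top_answer, _ = clusters.most_common(1)[0]
    return top_answer, entropy
```

With answers drawn from heterogeneous models, a hallucination that one model repeats consistently is less likely to dominate the clusters, which is the intuition behind pooling models rather than resampling a single one.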
When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations: This paper identifies a "semantic hallucination" problem in Large Multimodal Models (LMMs) for scene text recognition—where non-semantic text is misread as semantically plausible words. Analysis reveals that Transformer layers whose attention is more focused on text regions are less prone to hallucination. Based on this finding, the authors propose a training-free framework, ZoomText + Grounded Layer Correction, achieving approximately 4–5% improvement on TextHalu-Bench and approximately 4% on ST-VQA.