
When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations

Conference: NeurIPS 2025 | arXiv: 2506.05551 | Code: GitHub | Area: Multimodal VLM | Keywords: semantic hallucination, scene text recognition, large multimodal models, attention correction, training-free

TL;DR

This paper identifies a "semantic hallucination" problem in Large Multimodal Models (LMMs) for scene text recognition—where non-semantic text is misread as semantically plausible words. Analysis reveals that Transformer layers whose attention is more focused on text regions are less prone to hallucination. Based on this finding, the authors propose a training-free framework, ZoomText + Grounded Layer Correction, achieving approximately 4–5% improvement on TextHalu-Bench and approximately 4% on ST-VQA.

Background & Motivation

LMMs demonstrate strong visual perception and reasoning capabilities, yet tend to produce "semantic hallucinations" when handling visually ambiguous or non-semantic scene text: they generate semantically plausible but visually incorrect answers. For example, "MMOTEL" (a semantically meaningless edited string) may be recognized as "MOTEL," and "PULLa" as "PULL."

Key Challenge: Pre-training on large-scale, semantically coherent text instills strong semantic priors in these models, causing them to rely on semantic guessing rather than genuine visual grounding in OCR tasks. Existing hallucination mitigation work focuses primarily on object and factual hallucinations; OCR-specific semantic hallucination remains largely unexplored.

Core Idea: Exploit the attention differences across the LLM's internal layers (layers that attend more to text regions are less prone to hallucination) to guide the decoding process.

Method

Overall Architecture

A training-free semantic hallucination mitigation framework consisting of two modules:

  • Input: Image + text query
  • Output: Corrected scene text recognition/understanding result
  • Pipeline: ① ZoomText localizes scene text regions (no external detector required) → ② Grounded Layer Correction selects the hidden states from the optimal layer and fuses them into the decoding process

Key Designs

  1. Analysis of Semantic Hallucination Causes:

    • Hallucination Tendency Score: For each Transformer layer \(\ell\), the output probabilities of the hallucinated token \(y_{hal}\) and the ground-truth token \(y_{gt}\) are compared: \(S_{hal}^{\ell} = P_{hal}^{\ell} / (P_{hal}^{\ell} + P_{gt}^{\ell})\)
    • Text Region Attention Score: Defined as \(A_{\ell} = \frac{\sum_{i \in \mathcal{I}} \sum_{j \in \mathcal{T}} \alpha_{i,j}^{\ell}}{\sum_{i \in \mathcal{I}} \sum_{j \in \mathcal{I}} \alpha_{i,j}^{\ell}}\), where \(\mathcal{I}\) indexes the image tokens and \(\mathcal{T}\) the text-region tokens, measuring the proportion of attention that layer \(\ell\) allocates to text regions
    • Key Finding: Spearman correlation analysis reveals a strong negative correlation between hallucination tendency and text-region attention—layers that concentrate more attention on text regions are less prone to hallucination
  2. ZoomText (Coarse-to-Fine Text Region Localization):

    • Glimpse Step: Extracts query-to-image cross-attention from the last LLM layer, averaged across all heads and query tokens to obtain a global image attention map \(A_{text} = \frac{1}{HQ}\sum_{h=1}^{H}\sum_{q=1}^{Q} A_{q2v}^{(h,q)}\); the top-\(K\) tokens are selected as coarse text region candidates
    • Refocus Step: Computes a normalized shift score between the self-attention of the first and last Transformer layers \(A_{text}^{normalized} = (A_{v2v}^{(L)} - A_{v2v}^{(1)}) / (A_{v2v}^{(1)} + \epsilon)\), filtering out non-semantic tokens whose attention patterns remain stable across layers (global-context "registers") and retaining genuine text regions
  3. Grounded Layer Correction (GLC):

    • Selects the layer with the strongest text-region attention: \(\ell^{\star} = \arg\max_{\ell} A_{\ell}\)
    • Three correction strategies are proposed:
      • Replacement: Directly replaces the final layer's hidden states with those from layer \(\ell^{\star}\)
      • Selective Replacement: Applies replacement only to text-region tokens
      • Fusion (default): Weighted fusion \(\hat{H}_i = (1-w) \cdot H_i^{(L)} + w \cdot H_i^{(\ell^{\star})}\), with \(w=0.1\)
    • The Fusion strategy achieves the best balance between hallucination mitigation and preservation of semantic capability (minimal code sketches of ZoomText and the layer correction follow this list)
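To make the pipeline concrete, here is a minimal PyTorch-style sketch of the two ZoomText steps. It assumes the per-head query-to-image cross-attention and the first/last-layer image self-attention maps have already been extracted from the LMM; the tensor names and shapes, the mean aggregation in the refocus step, and the median filter threshold are assumptions for illustration, not details taken from the paper.

```python
import torch

def zoomtext(attn_q2v, attn_v2v_first, attn_v2v_last, k=128):
    """Coarse-to-fine localization of scene-text image tokens (sketch).

    attn_q2v:        [H, Q, N] cross-attention from the Q query tokens to the
                     N image tokens in the last LLM layer (H heads).
    attn_v2v_first:  [N, N] image-token self-attention of the first layer,
                     already averaged over heads (assumed layout).
    attn_v2v_last:   [N, N] image-token self-attention of the last layer.
    Returns the indices of image tokens judged to cover text regions.
    """
    # Glimpse: average over heads and query tokens -> one score per image token,
    # then keep the top-K tokens as coarse text-region candidates.
    a_text = attn_q2v.mean(dim=(0, 1))            # [N]
    coarse = a_text.topk(k).indices

    # Refocus: normalized attention shift between the first and last layers.
    # Tokens whose attention barely changes across depth behave like
    # global-context "registers" and are filtered out; genuine text tokens
    # shift strongly. The per-token aggregation (mean over incoming attention)
    # and the median threshold below are assumptions of this sketch.
    eps = 1e-6
    shift = (attn_v2v_last - attn_v2v_first) / (attn_v2v_first + eps)
    token_shift = shift.mean(dim=0)               # [N] per-token shift score
    keep = token_shift[coarse] > token_shift[coarse].median()
    return coarse[keep]
```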
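A matching sketch of the diagnostic scores from the analysis and the default Fusion correction, under the same assumptions about how attention maps and per-layer hidden states are exposed (the argmax layer selection and the fusion formula follow the paper; the tensor layout is assumed):

```python
def hallucination_tendency(p_hal, p_gt):
    """Analysis-only diagnostic: S_hal^l = P_hal / (P_hal + P_gt) at layer l."""
    return p_hal / (p_hal + p_gt)

def text_attention_score(attn, image_idx, text_idx):
    """A_l: share of image-to-image attention mass that lands on text-region tokens.

    attn: [H, S, S] self-attention of one layer; image_idx / text_idx index the
    image tokens and the text-region tokens returned by zoomtext().
    """
    a = attn.mean(dim=0)                          # [S, S], averaged over heads
    to_text = a[image_idx][:, text_idx].sum()
    to_image = a[image_idx][:, image_idx].sum()
    return (to_text / (to_image + 1e-8)).item()

def grounded_layer_correction(hidden_states, attn_per_layer, image_idx, text_idx, w=0.1):
    """Fusion strategy: H_hat = (1 - w) * H^(L) + w * H^(l*), with l* = argmax_l A_l."""
    scores = [text_attention_score(a, image_idx, text_idx) for a in attn_per_layer]
    l_star = max(range(len(scores)), key=scores.__getitem__)
    return (1.0 - w) * hidden_states[-1] + w * hidden_states[l_star]
```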

Loss & Training

The method is entirely training-free and operates as a test-time adaptive plugin. ZoomText uses \(K=128\) (top-\(K\) image tokens), and Grounded Layer Correction uses a fusion weight of \(w=0.1\). No additional modules or trainable parameters are introduced, and the framework can be integrated directly into existing LMMs such as Mini-Monkey, Qwen2.5-VL, and LLaVA-NeXT.
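For illustration only, a hypothetical wiring of the two sketches above at decode time; `forward_with_internals`, `lm_head`, and the attribute names are placeholders rather than an existing API, while the defaults \(K=128\) and \(w=0.1\) follow the paper:

```python
# Hypothetical decode-time integration; hook and attribute names are placeholders.
K, W = 128, 0.1   # paper defaults: top-K image tokens and fusion weight

def corrected_next_token_logits(model, image, query):
    out = model.forward_with_internals(image, query)   # assumed hook exposing attention and hidden states
    text_idx = zoomtext(out.attn_q2v, out.attn_v2v[0], out.attn_v2v[-1], k=K)
    h_hat = grounded_layer_correction(out.hidden_states, out.self_attn_per_layer,
                                      out.image_token_idx, text_idx, w=W)
    return model.lm_head(h_hat[-1])                     # next-token logits from the corrected state
```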

Key Experimental Results

Main Results

| Model | TextHalu-Bench | ST-VQA | TextVQA | GOT | SEED-Bench |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | 45.3 | - | 71.0 | - | 70.2 |
| Mini-Monkey (baseline) | 46.5 | 66.7 | 74.1 | 88.8 | 83.3 |
| Mini-Monkey + Ours | 50.6 (+4.1) | 70.6 (+3.9) | 75.0 (+0.9) | 89.2 (+0.4) | 84.5 (+1.2) |
| Qwen2.5-VL (baseline) | 48.3 | 67.3 | 79.1 | 85.2 | 66.7 |
| Qwen2.5-VL + Ours | 53.8 (+5.5) | 67.6 (+0.3) | 80.3 (+1.2) | 86.0 (+0.8) | 70.2 (+3.5) |
| LLaVA-NeXT (baseline) | 27.9 | 65.1 | 65.3 | 41.9 | 50.0 |
| LLaVA-NeXT + Ours | 28.5 (+0.6) | 65.2 (+0.1) | 65.5 (+0.2) | 42.0 (+0.1) | 51.2 (+1.2) |

Ablation Study

| Configuration | TextHalu-Bench | ST-VQA | Notes |
| --- | --- | --- | --- |
| Baseline (Mini-Monkey) | 46.5 | 66.7 | -- |
| Adversarial Training | 47.5 (+1.0) | 66.8 (+0.1) | Training-based approach yields limited gains |
| Chain-of-Thought | 46.8 (+0.3) | 68.2 (+1.5) | CoT helps general tasks but does not address the root cause |
| Ours (Fusion) | 50.6 (+4.1) | 70.6 (+3.9) | Best configuration |
| External text detector replacing ZoomText | 50.4 (+3.9) | 70.8 (+4.1) | ZoomText approaches external detector performance |
| w/o Glimpse | 50.2 (+3.7) | 70.2 (+3.5) | Glimpse contributes meaningfully |
| w/o Refocus | 49.8 (+3.3) | 69.5 (+2.8) | Refocus noise filtering is important |
| Replacement strategy | Decline | Decline | Direct replacement disrupts semantics |
| Selective Replacement | Moderate gain | Decline on general tasks | Over-writing hurts alignment |
| Fusion (w=0.1) | Best | Best | Mild fusion is optimal |

Key Findings

  1. Semantic hallucination is a fundamental deficiency of LMMs: Even GPT-4o scores only 45.3 on TextHalu-Bench, far below the human score of 96.8
  2. Hallucination tendency varies significantly across Transformer layers: Middle layers often predict ground-truth tokens more accurately than the final layers
  3. Attention concentration and hallucination exhibit a strong negative correlation: Layers that allocate more attention to text regions exhibit lower hallucination probability
  4. Gains are larger for models with stronger OCR capability: Mini-Monkey and Qwen2.5-VL show substantial improvements, whereas LLaVA-NeXT, which has weaker OCR ability, benefits less
  5. ZoomText achieves comparable performance without external detectors: The glimpse-refocus strategy effectively leverages the model's own attention

Highlights & Insights

  • This work is the first to systematically define and study "semantic hallucination" in LMMs, with a clear problem formulation that is highly relevant to real-world applications
  • The methodology of analyzing hallucination causes through attention mechanisms is instructive—directly quantifying the relationship between layer-wise attention and hallucination probability
  • ZoomText's glimpse-refocus strategy provides an elegant text region localization solution requiring no external modules
  • The fusion strategy is simple and effective (a single weight \(w=0.1\)), with minimal computational overhead

Limitations & Future Work

  • The fusion weight \(w\) and top-\(K\) value require manual tuning and may need adjustment for different models
  • The method provides limited benefit for models with inherently weak OCR capability (e.g., LLaVA-NeXT), indicating a dependency on underlying visual encoding ability
  • ZoomText assumes text appears on semantically meaningful backgrounds (e.g., signs, posters) and may not generalize well to document-style images containing only text
  • TextHalu-Bench comprises only 1,740 samples, offering limited scene coverage
  • Complementary to hallucination mitigation methods such as VCD (Visual Contrastive Decoding)—VCD targets object hallucinations, whereas this work targets textual hallucinations
  • Layer-wise attention analysis can generalize to other tasks requiring precise visual grounding (e.g., fine-grained recognition)
  • The hidden state fusion mechanism in Grounded Layer Correction can be applied to other scenarios involving model-self-guided decoding
  • The evaluation paradigm of TextHalu-Bench (contrasting semantic vs. non-semantic text) offers useful insights for assessing OCR model robustness

Rating

  • Novelty: ⭐⭐⭐⭐ The definition of semantic hallucination and the analytical approach are novel, though the solution (hidden state fusion) is relatively conventional
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated across multiple benchmarks with extensive ablations and comparisons against various hallucination mitigation methods
  • Writing Quality: ⭐⭐⭐⭐⭐ Problem definition is clear, visualizations are rich, and the argumentation is logically coherent
  • Value: ⭐⭐⭐⭐ Reveals an important deficiency of LMMs in OCR tasks; the training-free solution offers strong practical utility