Spotlight and Shadow: Attention-Guided Dual-Anchor Introspective Decoding for MLLM Hallucination Mitigation

Conference: ACL 2026
arXiv: 2604.10071
Code: None
Area: Hallucination Detection
Keywords: Multimodal Hallucination, Contrastive Decoding, Layer-wise Analysis, Visual Attention, Training-free

TL;DR

This paper proposes DaID (Dual-Anchor Introspective Decoding), which exploits differences in visual perception across MLLM layers: it amplifies visual signals through the Spotlight layer and suppresses linguistic inertia through the Shadow layer, mitigating hallucination within a single forward pass.

Background & Motivation

Background: Multimodal Large Language Models (MLLMs) perform excellently in reasoning tasks but suffer from significant hallucination issues, where generated text is inconsistent with visual content.

Limitations of Prior Work: Existing contrastive decoding methods (e.g., VCD, ICD) have two major flaws: (1) they require an additional forward pass per step to obtain a negative sample distribution, increasing inference latency by 1.83×; (2) they rely on heuristic external perturbations (e.g., visual masking) to construct negative distributions, introducing random noise that causes semantic shift.

Key Challenge: The uncertainty of external perturbations may lead to the erroneous suppression of correct visual signals (e.g., VCD replacing the correct "yellow" with an incorrect "red").

Goal: To shift from an external intervention paradigm to an internal introspective paradigm, utilizing the perception differences of the model's own intermediate layers as the source of contrastive signals.

Key Insight: A layer-wise diagnosis of MLLMs reveals that shallow layers have strong hallucination tendencies (visual agnosia), intermediate layers exhibit the strongest visual perception (peak fidelity), and deep layers see visual signals overwritten by linguistic priors ("see then forget").

Core Idea: Use Visual Attention Score (VAS) to dynamically locate the Spotlight layer (peak visual perception) and the Shadow layer (dominated by linguistic noise) for each token, achieving hallucination suppression through contrastive calibration in a single forward pass.

Method

Overall Architecture

During the standard MLLM decoding process, DaID utilizes the attention distribution over visual tokens at each layer to select two anchor layers in real-time: the Spotlight layer (highest visual attention \(\rightarrow\) amplifies visual signals) and the Shadow layer (lowest visual attention before the Spotlight layer \(\rightarrow\) suppresses linguistic priors). The final logits are calibrated via a dual-anchor contrastive formula. This process requires no additional forward passes.
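
The anchor selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes VAS is the attention mass the current query token places on visual positions, summed over those positions and averaged over heads. The function names `visual_attention_score` and `select_anchors`, and the fallback used when the Spotlight is the very first layer, are hypothetical.

```python
from typing import List, Tuple


def visual_attention_score(attn: List[List[float]], visual_idx: List[int]) -> float:
    """Head-averaged attention mass on visual tokens for one layer.

    attn[h][k] is the attention weight that head h assigns, for the current
    query token, to key position k. visual_idx lists the visual positions.
    (Assumed reading of "average attention weight of all heads on visual tokens".)
    """
    per_head = [sum(head[k] for k in visual_idx) for head in attn]
    return sum(per_head) / len(per_head)


def select_anchors(vas_per_layer: List[float]) -> Tuple[int, int]:
    """Return (spotlight, shadow) layer indices for one decoding step."""
    layers = range(len(vas_per_layer))
    # Spotlight: layer with the highest visual attention.
    spotlight = max(layers, key=lambda l: vas_per_layer[l])
    if spotlight == 0:
        # No earlier layer exists; reusing layer 0 is an assumption of this sketch.
        return 0, 0
    # Shadow: lowest visual attention among layers strictly before the Spotlight.
    shadow = min(range(spotlight), key=lambda l: vas_per_layer[l])
    return spotlight, shadow


# Toy example with five layers of made-up VAS values.
toy_vas = [0.05, 0.02, 0.10, 0.30, 0.12]
spotlight_layer, shadow_layer = select_anchors(toy_vas)
```

With the toy values, the Spotlight is layer 3 (VAS 0.30) and the Shadow is layer 1 (VAS 0.02). Layer 4 is never considered for the Shadow because it comes after the Spotlight.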

Key Designs

  1. Visual Attention Score (VAS) and Dynamic Anchoring:

    • Function: Locates the optimal contrastive anchor layers for each token without training.
    • Mechanism: \(VAS_t(l)\) is the average attention weight, over all heads, that layer \(l\) places on visual tokens at decoding step \(t\). \(\text{Spotlight} = \arg\max_{l} VAS_t(l)\) (peak visual perception), and \(\text{Shadow} = \arg\min_{l < \text{Spotlight}} VAS_t(l)\) (restricted to layers before the Spotlight, representing pure linguistic noise).
    • Design Motivation: Experimental validation shows that visual attention is highly synchronized with object recognition accuracy/hallucination rates, serving as a reliable training-free proxy for the model's cognitive state.
  2. Dual-Anchor Contrastive Decoding:

    • Function: Simultaneously enhances visual signals and suppresses linguistic inertia in the final logits.
    • Mechanism: \(L_{DaID} = [L_{final} + \alpha \cdot L_{spotlight}] \cdot (1+\beta) - \beta \cdot L_{shadow}\). \(\alpha\) controls visual enhancement intensity, and \(\beta\) controls linguistic suppression intensity.
    • Design Motivation: Neither enhancement nor suppression alone is sufficient; effective mitigation requires simultaneous "brightening + denoising."
  3. Adaptive Plausibility Constraint:

    • Function: Prevents intermediate layer logits from introducing syntactically implausible tokens.
    • Mechanism: Dual-anchor calibration is applied only to candidate tokens whose probability in the final-layer distribution is \(\ge \gamma \cdot \text{max\_prob}\); all other tokens are excluded (assigned zero probability).
    • Design Motivation: As the Spotlight layer is an intermediate layer, its logits may contain visually relevant but syntactically inappropriate tokens, requiring a constraint on the candidate space.
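
The calibration formula and the plausibility constraint from designs 2 and 3 can be combined in a short Python sketch. It is illustrative only. It assumes the Spotlight and Shadow logits are already available as vocabulary-sized vectors (for example, by applying the output head to intermediate hidden states, which the summary does not specify), and it implements "set to zero" as masking the logit to negative infinity so the token gets zero probability after softmax. The name `daid_logits` is hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def daid_logits(final: List[float], spotlight: List[float], shadow: List[float],
                alpha: float = 0.8, beta: float = 0.2, gamma: float = 0.1) -> List[float]:
    """Dual-anchor calibration restricted to plausible candidates.

    Applies (L_final + alpha * L_spotlight) * (1 + beta) - beta * L_shadow
    to tokens whose final-layer probability is >= gamma * max_prob.
    """
    probs = softmax(final)
    threshold = gamma * max(probs)
    calibrated = []
    for p, lf, lsp, lsh in zip(probs, final, spotlight, shadow):
        if p >= threshold:
            calibrated.append((lf + alpha * lsp) * (1 + beta) - beta * lsh)
        else:
            # Implausible under the final layer: masked out (assumed -inf logit).
            calibrated.append(float("-inf"))
    return calibrated


# Toy three-token vocabulary with made-up logits.
final_logits = [2.0, 1.0, -5.0]
spotlight_logits = [1.0, 3.0, 9.0]
shadow_logits = [2.0, 0.0, 0.0]
out = daid_logits(final_logits, spotlight_logits, shadow_logits)
best_token = max(range(len(out)), key=lambda i: out[i])
```

In this toy case the final layer prefers token 0, but the Spotlight favors token 1 and the Shadow favors token 0, so the calibrated argmax moves to token 1. Token 2 has the largest Spotlight logit yet stays masked, because its final-layer probability falls below the \(\gamma\) threshold; this is exactly the situation the plausibility constraint is meant to handle.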

Loss & Training

DaID is a training-free inference-time method. Key hyperparameters: \(\alpha=0.8\) (visual enhancement), \(\beta=0.2\) (linguistic suppression), \(\gamma=0.9\) (POPE) / \(0.1\) (others).
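
Substituting the reported defaults \(\alpha=0.8\) and \(\beta=0.2\) into the dual-anchor formula from Key Designs shows the effective weight on each layer's logits. This is a plain algebraic expansion of the stated formula, with no further assumptions:

\[
L_{DaID} = (1+\beta)\,L_{final} + \alpha(1+\beta)\,L_{spotlight} - \beta\,L_{shadow} = 1.2\,L_{final} + 0.96\,L_{spotlight} - 0.2\,L_{shadow}
\]

Under these settings the Spotlight logits carry nearly as much weight as the final-layer logits, while the Shadow term is a comparatively small correction.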

Key Experimental Results

Main Results

Hallucination Benchmarks on LLaVA-1.5-7B:

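| Method | POPE Acc↑ | POPE F1↑ | CHAIR_S↓ | CHAIR_I↓ | MME Total↑ |
|--------|-----------|----------|----------|----------|------------|
| Greedy | 81.38 | 82.20 | 49.6 | 14.4 | 559.48 |
| VCD | 84.66 | 84.52 | 49.2 | 14.8 | 603.66 |
| OPERA | 84.88 | 85.21 | 45.4 | 12.7 | 549.00 |
| SID | 84.82 | 85.50 | 44.2 | 12.2 | 599.80 |
| EAZY | 84.97 | 85.78 | 38.8 | 11.4 | 596.16 |
| DaID | 85.08 | 85.92 | 35.9 | 11.3 | 633.68 |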

Hallucination Benchmarks on LLaVA-NeXT:

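| Method | POPE Acc↑ | POPE F1↑ | CHAIR_S↓ | CHAIR_I↓ | MME Total↑ |
|--------|-----------|----------|----------|----------|------------|
| Greedy | 83.78 | 82.24 | 32.8 | 9.1 | 580.92 |
| EAZY | 84.91 | 85.40 | 26.8 | 8.3 | 611.14 |
| DaID | 85.32 | 85.76 | 24.2 | 8.2 | 644.40 |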

Ablation Study

Hyperparameter Analysis (LLaVA-1.5):

  • As \(\alpha\) increases from \(0.4\) to \(0.8\), POPE Acc rises from 83.44% to 85.08%; performance drops when \(\alpha > 0.8\) (excessive visual signal disrupts syntax).
  • \(\beta=0.2\) is optimal: compared to \(\beta=0\) (no suppression), F1 improves by 0.93% and Acc by 0.51%; excessive suppression (\(\beta > 0.2\)) degrades performance.

General Reasoning Ability: Across five benchmarks (GQA, VQAv2, MMB, SeedI, VizWiz), DaID not only maintains but consistently improves performance at both 7B and 13B scales (e.g., +2.1% on SeedI).

Key Findings

  • Layer-wise diagnosis reveals the "see then forget" phenomenon: object recognition accuracy peaks in intermediate layers and drops significantly in deep layers (11.12% drop in LLaVA-NeXT).
  • Visual attention is precisely synchronized with object recognition accuracy (both peak at layer 25 in LLaVA-1.5), validating VAS as a reliable proxy for cognitive state.
  • DaID outperforms methods requiring extra forward passes (VCD, OPERA, etc.) while using only a single forward pass.
  • Generalization experiments across 5 MLLM architectures confirm the consistent effectiveness of the method.

Highlights & Insights

  • The paradigm shift from "external perturbation" to "internal introspection" is elegant—it avoids the introduction of external noise while reducing computational overhead.
  • The "Spotlight + Shadow" dual-anchor concept is intuitive: shallow layers = linguistic noise \(\rightarrow\) Shadow, intermediate layers = visual peak \(\rightarrow\) Spotlight.
  • The layer-wise diagnostic analysis itself holds significant scientific value; the "see then forget" phenomenon and the attention proxy mechanism can inspire further research.
  • The method is entirely training-free and can be applied plug-and-play to any MLLM.

Limitations & Future Work

  • Hyperparameters require benchmark-specific settings (e.g., \(\gamma=0.9\) for POPE vs. \(0.1\) for others), and there is no automated selection procedure.
  • The layer analysis is based on the LLaVA series; optimal layers for other architectures (e.g., Qwen2-VL) may differ.
  • While boasting a single forward pass, there may be overhead in attention extraction and layer-wise logit computation; specific latency was not reported.
  • Extensions to more complex multimodal scenarios, such as video understanding, remain to be explored.

Related Work & Insights

  • Comparison with VCD is highly illustrative: where VCD uses external perturbations to construct negative distributions, DaID uses shallow linguistic priors as a natural negative distribution, which is more elegant and effective.
  • Like DoLa, DaID utilizes layer-wise contrast; however, DoLa only contrasts early and late layers for factual knowledge, whereas DaID’s dual-anchor + VAS dynamic selection is more granular.
  • Implications for MLLM design: consider incorporating auxiliary objectives for visual retention in intermediate layers during training to mitigate "see then forget" at its source.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The internal introspection paradigm is novel, the dual-anchor dynamic selection mechanism is cleverly designed, and the "see then forget" discovery is valuable.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive analysis across multiple hallucination and general benchmarks, multiple MLLMs, and full hyperparameter analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ The paper is well-structured, flowing seamlessly from motivation to observation to method.