Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs¶
Conference: AAAI 2026 arXiv: 2511.09018 Code: https://github.com/CikZ2023/OWL Area: Causal Inference Keywords: Object Hallucination, Causal Inference, Attention Intervention, Contrastive Decoding, Large Vision-Language Models
TL;DR¶
This paper proposes Owl, a framework that models visual and textual attention as mediating variables within a structural causal model, introduces the VTACR metric to quantify cross-modal attention imbalance, and designs VTACR-guided adaptive attention modulation combined with a dual-path contrastive decoding strategy, achieving state-of-the-art hallucination mitigation on POPE and CHAIR benchmarks.
Background & Motivation¶
LVLMs (LLaVA, MiniGPT-4, Shikra, etc.) demonstrate strong performance in image captioning and VQA, yet remain severely affected by object hallucination—generating objects not present in the image. Existing solutions fall into three categories: (1) human preference alignment (e.g., RLHF), which is costly; (2) post-hoc detection/correction (LURE, Woodpecker), which does not address root causes; and (3) decoding optimization (VCD, PAI, OPERA), which typically manipulates attention in only a single modality.
Core Observation: Prior methods either enhance visual attention or suppress textual attention, but both neglect the interaction imbalance between the two. The authors find that:
- Independently enhancing visual attention reduces hallucination (higher TCE) but leads to shorter outputs.
- Independently enhancing textual attention aggravates hallucination but produces longer outputs.
- Hallucinated tokens consistently exhibit low VTACR (Visual-to-Textual Attention Contribution Ratio), indicating excessive reliance on textual priors.
Core Problem¶
How can the contributions of visual and textual attention be dynamically balanced during decoding, so that the model neither hallucinates from over-reliance on textual priors nor truncates its outputs from over-emphasizing visual signals?
Method¶
Overall Architecture¶
Owl (Bi-mOdal attention reWeighting for Layer-wise hallucination mitigation) comprises three core components:
1. Structural Causal Model (SCM): models visual and textual attention as mediating variables.
2. VTACR-Guided Adaptive Attention Modulation: dynamically adjusts attention weights per layer and per token.
3. Dual-Path Contrastive Decoding (DCD): constructs a vision-preferred path and a text-preferred path, suppressing hallucinations via contrastive comparison.
Key Designs¶
- VTACR Metric (Visual-to-Textual Attention Contribution Ratio):
- Visual token attention contribution: \(\nu^{(\ell)} = \frac{1}{N|\mathcal{V}|} \sum_{j \in \mathcal{V}} \sum_{i=1}^{N} \mathbf{A}_{i,j}^{(\ell)}\)
- Textual token attention contribution: \(\tau^{(\ell)} = \frac{1}{N|\mathcal{T}|} \sum_{k \in \mathcal{T}} \sum_{i=1}^{N} \mathbf{A}_{i,k}^{(\ell)}\)
- Layer-wise VTACR: \(\text{VTACR}^{(\ell)} = \nu^{(\ell)} / \tau^{(\ell)}\)
- Quantifies the ratio of attention contributions from visual vs. textual tokens to the current generated token at each layer.
- Hallucinated tokens exhibit abnormally low VTACR, indicating over-dependence on the textual modality.
- Structural Causal Model and Mediating Variable Intervention:
- Causal graph: \(X_V \to A_V \to Y_T\), \(X_T \to A_T \to Y_T\)
- Priors \(P_V, P_T\) cannot be directly intervened upon but indirectly influence generation through mediating variables \(A_V, A_T\).
- Soft interventions: \(do(A_V = A_V^*)\), \(do(A_T = A_T^*)\)
- The TCE metric evaluates intervention effects by measuring the average change in hallucination behavior following attention modification.
- Adaptive Attention Modulation:
- 2,000 hallucination samples are drawn from MSCOCO to compute the per-layer VTACR distribution.
- The baseline score \(V_b^{(\ell)}\) is defined as the \(\tau\)-th percentile (default: 80) of this distribution.
- When \(V^{(\ell)} < V_b^{(\ell)}\) (insufficient visual grounding), the modulation coefficient is increased.
- \(\tilde{T}^{(\ell)} = \mathbb{I}(V^{(\ell)} < V_b^{(\ell)}) \cdot \min\!\left(T \cdot \frac{V_b^{(\ell)} - V^{(\ell)}}{V_b^{(\ell)}},\; T\right)\), so the boost is positive whenever visual grounding is insufficient and capped at \(T\).
- Dynamically adjusts \(\tilde{\alpha}^{(\ell)} = \alpha + \tilde{T}^{(\ell)}\) and \(\tilde{\beta}^{(\ell)} = \beta + \tilde{T}^{(\ell)}\).
- Dual-Path Attention Intervention + Contrastive Decoding (DCD):
- Vision-preferred path: Amplifies visual token attention and suppresses textual token attention.
- \(\tilde{\mathbf{A}}_{i,j}^{(\ell)} = \mathbf{A}_{i,j}^{(\ell)} + \tilde{\alpha}^{(\ell)} \cdot |\mathbf{A}_{i,j}^{(\ell)}|, \quad j \in \mathcal{V}\)
- \(\tilde{\mathbf{A}}_{i,k}^{(\ell)} = \mathbf{A}_{i,k}^{(\ell)} - \tilde{\beta}^{(\ell)} \cdot |\mathbf{A}_{i,k}^{(\ell)}|, \quad k \in \mathcal{T}\)
- Text-preferred path: Suppresses visual attention and amplifies textual attention (simulating a hallucination-prone scenario).
- Contrastive decoding: \(P_{\text{DCD}}(Y|X_V, X_T) = \text{Softmax}[(1+\lambda) \cdot \log p_\theta(y|X_V^\uparrow, X_T^\downarrow) - \lambda \cdot \log p_\theta(y|X_V^\downarrow, X_T^\uparrow)]\)
- Widens the probability gap between faithful tokens and hallucinated tokens via contrastive comparison.
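The three steps above (layer-wise VTACR, adaptive modulation, dual-path contrast) can be sketched directly from the equations. The following is a minimal NumPy illustration, not the released OWL implementation; all function and variable names are my own, and the modulation term is written so the boost is positive when visual grounding falls below the baseline.

```python
# Minimal sketch of Owl's three steps (VTACR, adaptive modulation, DCD),
# based on the equations in this note. Shapes and names are illustrative.
import numpy as np

def vtacr(attn, vis_idx, txt_idx):
    """Layer-wise VTACR: mean attention mass on visual vs. textual tokens.

    attn: (heads, N, N) attention matrix for one layer (rows = query tokens).
    """
    heads, N, _ = attn.shape
    A = attn.mean(axis=0)                            # average over heads -> (N, N)
    nu = A[:, vis_idx].sum() / (N * len(vis_idx))    # visual contribution v
    tau = A[:, txt_idx].sum() / (N * len(txt_idx))   # textual contribution tau
    return nu / tau

def modulation(v, v_b, T=0.2):
    """Adaptive boost T~: active only when VTACR v falls below the baseline v_b."""
    if v >= v_b:
        return 0.0
    # sign chosen so the boost is positive for insufficient grounding, capped at T
    return min(T * (v_b - v) / v_b, T)

def reweight(attn, vis_idx, txt_idx, alpha, beta, boost, vision_preferred=True):
    """Dual-path attention intervention: amplify one modality, suppress the other."""
    a, b = alpha + boost, beta + boost
    A = attn.copy()
    if vision_preferred:
        A[..., vis_idx] += a * np.abs(A[..., vis_idx])
        A[..., txt_idx] -= b * np.abs(A[..., txt_idx])
    else:  # text-preferred (hallucination-prone) path
        A[..., vis_idx] -= a * np.abs(A[..., vis_idx])
        A[..., txt_idx] += b * np.abs(A[..., txt_idx])
    return A

def dcd(logp_vis, logp_txt, lam=0.2):
    """Dual-path contrastive decoding over next-token log-probabilities."""
    z = (1 + lam) * logp_vis - lam * logp_txt
    z -= z.max()                                     # numerical stability
    p = np.exp(z)
    return p / p.sum()
```

In a full pipeline, `reweight` would be applied inside every decoder layer on both paths, and `dcd` would combine the two paths' next-token distributions at each decoding step.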
Loss & Training¶
No training is required. Owl is a purely inference-stage decoding strategy that does not modify model parameters:
- Hyperparameters: \(\alpha\) and \(\beta\) are tuned per backbone (LLaVA-1.5: 0.4/0.5; MiniGPT-4: 0.2/0.3; Shikra: 0.5/0.3).
- Contrastive strength \(\lambda = 0.2\), modulation coefficient \(T = 0.2\), percentile threshold \(\tau = 80\).
- Experiments are conducted on 500 images from MSCOCO val2014 using 4× RTX 3090 GPUs.
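For quick reference, the reported hyperparameters can be collected into a small config; this is an illustrative sketch (the key names are mine, the values are from the paper as summarized above).

```python
# Per-backbone Owl hyperparameters as reported in the paper (illustrative layout).
OWL_CONFIG = {
    "llava-1.5": {"alpha": 0.4, "beta": 0.5},
    "minigpt-4": {"alpha": 0.2, "beta": 0.3},
    "shikra":    {"alpha": 0.5, "beta": 0.3},
}

# Shared across backbones: contrastive strength, modulation cap, percentile.
SHARED = {"lambda": 0.2, "T": 0.2, "percentile": 80}
```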
Key Experimental Results¶
CHAIR Benchmark (hallucination rate, lower is better):
| Model | Method | C_S | C_I | Len |
|---|---|---|---|---|
| LLaVA-1.5 | PAI | 31.8 | 10.3 | 85.2 |
| LLaVA-1.5 | Owl | 26.2 | 8.1 | 98.4 |
| MiniGPT-4 | PAI | 24.8 | 9.3 | 65.9 |
| MiniGPT-4 | Owl | 21.2 | 6.2 | 73.6 |
| Shikra | PAI | 37.6 | 12.9 | 94.7 |
| Shikra | Owl | 29.3 | 9.7 | 108.2 |
- Compared to PAI, \(C_S\) decreases by 17.6% and \(C_I\) by 21.4% on LLaVA-1.5.
- \(C_I\) decreases by 33.3% on MiniGPT-4 (9.3 → 6.2, the largest relative gain).
- \(C_S\) decreases by 22.1% on Shikra.
- Generation length increases rather than decreases (output richness is not sacrificed).
POPE Benchmark (accuracy, higher is better):
| Model | Method | Random | Popular | Adversarial |
|---|---|---|---|---|
| LLaVA-1.5 | Owl | 90.2 | 88.1 | 90.5 |
| MiniGPT-4 | Owl | 82.2 | 78.4 | 79.0 |
| Shikra | Owl | 85.2 | 82.3 | 83.4 |
- Particularly strong under the Adversarial setting; Owl achieves the highest scores across all three splits on Shikra.
GPT-4V Evaluation: On LLaVA-1.5, Correctness improves from 5.58 to 6.70 (+20.1%) and Detailedness from 5.30 to 5.90 (+11.3%).
VQA Preservation: VizWiz +7.6% (48.8→52.5), TextVQA +3.7%, VQAv2 drops only 2.3%.
Ablation Study¶
- α (visual attention coefficient): Increasing α reduces hallucination, but excessively large values compress useful content (F1 degrades), indicating a trade-off.
- β (textual attention coefficient): Increasing β steadily reduces hallucination with minimal impact on F1, suggesting that suppressing textual attention is the safer intervention.
- λ (contrastive decoding strength): Effective and stable in the range 0.1–0.4; excessively high values destabilize decoding.
- The three hyperparameters are complementary: α controls visual amplification, β controls textual suppression, and λ controls contrastive intensity.
Highlights & Insights¶
- Novel causal perspective: This is the first work to simultaneously model both visual and textual attention as mediating variables in an SCM, providing an interpretable analytical framework for hallucination.
- VTACR metric: Concisely and effectively quantifies cross-modal attention imbalance, and can serve as a standalone hallucination detection signal.
- Training-free: A purely inference-stage method that is plug-and-play without requiring model retraining.
- No degradation in generation quality: Hallucination is reduced while generation length increases, in contrast to methods such as PAI that tend to produce shorter outputs.
- Elegant dual-path design: Constructing a "vision-preferred" and a "text-preferred" path for contrastive comparison is intuitively clear and empirically effective.
Limitations & Future Work¶
- Model-specific hyperparameters: \(\alpha\) and \(\beta\) require separate tuning for each backbone, limiting generalizability.
- Additional inference overhead: DCD requires two forward passes (one per path), roughly halving inference speed.
- Validated only on limited backbones: LLaVA-1.5, MiniGPT-4, and Shikra are relatively early-generation models; effectiveness on stronger LVLMs (e.g., LLaVA-Next, InternVL2) has not been verified.
- VTACR baseline distribution depends on sampled data: The 2,000 hallucination samples are drawn from MSCOCO; adaptability to other data distributions is not discussed.
- POPE Popular setting: Owl slightly underperforms PAI on MiniGPT-4 and LLaVA-1.5, suggesting that further tuning may be needed for high-frequency object scenarios.
Related Work & Insights¶
| Method | Intervention | Modality | Training | Key Distinction |
|---|---|---|---|---|
| VCD | Visual contrastive decoding | Single (visual) | No | Constructs negative samples by perturbing visual input |
| PAI | Perplexity-aware attention gating | Single | No | Fixed scaling; does not account for layer-wise differences |
| OPERA | Rollback + attention suppression | Single (textual) | No | Suppresses repetition; does not address cross-modal imbalance |
| CausalMM | Causal graph + counterfactual reasoning | Dual | No | Intervenes at visual encoder and LLM decoder, but amplifies hallucination signals |
| Owl | VTACR-guided dual-path contrastive | Dual (explicitly decoupled) | No | Layer- and token-wise adaptive; widens gap between faithful and hallucinated signals |
Implications and Connections:
- VTACR as a general hallucination detector: The metric can serve as a token-level proxy for hallucination probability, enabling early stopping or selective post-processing.
- Dual-path contrastive approach is broadly applicable: The paradigm is not limited to visual/textual attention and can be extended to other dimensions of multimodal fusion (e.g., temporal/spatial attention).
- Connection to token pruning/compression: Layers and tokens with low VTACR may indicate redundant visual tokens and could be combined with visual token compression methods.
- Causal mediation analysis framework: Beyond hallucination, the framework can be applied to analyze any visual-textual imbalance in VLMs, including bias and faithfulness.
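The "VTACR as a general hallucination detector" idea above suggests a simple token-level filter: flag tokens whose VTACR falls in the low tail of the observed distribution. A hypothetical sketch (threshold choice and names are my assumptions, not from the paper):

```python
# Hypothetical use of per-token VTACR as a hallucination flag, following the
# detector idea above. The percentile threshold is an illustrative choice.
import numpy as np

def flag_hallucination_candidates(vtacr_per_token, baseline_pct=20):
    """Flag tokens whose VTACR falls in the bottom `baseline_pct` percent."""
    scores = np.asarray(vtacr_per_token, dtype=float)
    thresh = np.percentile(scores, baseline_pct)
    return scores < thresh  # boolean mask; True = candidate for post-processing
```

Flagged tokens could then trigger early stopping or selective post-hoc verification, rather than re-decoding the entire sequence.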
Rating¶
| Dimension | Score (1–10) | Notes |
|---|---|---|
| Novelty | 7 | Causal mediating variable perspective is novel; contrastive decoding framework has precedents |
| Technical Depth | 8 | Complete SCM formulation, clear VTACR definition, fine-grained adaptive mechanism |
| Experimental Thoroughness | 7 | Three backbones and multiple benchmarks, but models are outdated; newer LVLMs are absent |
| Practical Value | 8 | Training-free, plug-and-play, open-source code |
| Writing Quality | 7 | Architecture diagrams are clear; numerous equations but logically coherent |
| Overall | 7.5 | Solid hallucination mitigation work; causal modeling combined with dual-path contrastive decoding offers meaningful insights |