Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression

Basic Information

  • Conference: CVPR 2026
  • arXiv: 2603.10470
  • Code: Project Page
  • Area: Causal Reasoning / Multimodal Hallucination Suppression
  • Keywords: Large Vision-Language Models, Hallucination Suppression, Counterfactual Reasoning, Diffusion Models, Feature Projection, Training-Free

TL;DR

This paper proposes CIPHER, a training-free test-time hallucination suppression method. In the offline phase, a diffusion model is used to generate counterfactual images, constructing the OHC-25K dataset, from which visual hallucination subspaces are extracted via SVD. During inference, hidden states are projected onto the orthogonal complement of this subspace, significantly reducing visual hallucinations in LVLMs without modifying model parameters or incurring additional inference overhead.

Background & Motivation

Large Vision-Language Models (LVLMs) such as LLaVA, MiniGPT-4, and mPLUG-Owl2 achieve strong performance on multimodal tasks, yet frequently exhibit hallucinations—generating descriptions inconsistent with visual inputs (e.g., non-existent objects, incorrect attributes). Hallucination sources fall into two categories:

Text-induced hallucinations: stemming from autoregressive generation biases and language priors of the underlying LLM.

Vision-induced hallucinations: stemming from weak visual grounding and insufficient modality alignment.

Existing test-time methods (e.g., Nullu) primarily perturb the text modality to extract hallucination directions, neglecting hallucinations induced by the visual modality. Through linear probing experiments, the authors find that hallucination signals derived from text perturbations are weak and unstable in the hidden representation space (accuracy 0.73–0.80), whereas visual perturbations generated by diffusion models exhibit high separability and cross-layer stability (accuracy 0.86–0.89). This finding motivates the construction of visual counterfactuals to more precisely localize hallucination directions.
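The probing setup can be approximated with a simple linear classifier over pooled hidden states. The sketch below is illustrative, not the authors' code: it assumes the per-sample representations have already been extracted, and the array names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_separability(h_orig: np.ndarray, h_pert: np.ndarray) -> float:
    """Cross-validated accuracy of a linear probe separating original vs. perturbed
    hidden states (each array: (num_samples, hidden_dim) pooled states from one layer)."""
    X = np.concatenate([h_orig, h_pert], axis=0)
    y = np.concatenate([np.zeros(len(h_orig)), np.ones(len(h_pert))])
    clf = LogisticRegression(max_iter=1000)
    # Higher accuracy indicates that the perturbation induces a more linearly
    # separable (and hence more reliably removable) signal in representation space.
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
```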

Method

Overall Architecture

CIPHER (Counterfactual Image Perturbations for Hallucination Extraction and Removal) consists of two stages:

  • Offline Phase: Construct the counterfactual dataset OHC-25K → Extract hallucination directions → Estimate the hallucination subspace.
  • Inference Phase: Apply orthogonal projection to hidden states during generation to suppress hallucination components.

Key Design 1: Counterfactual Dataset Construction (OHC-25K)

\(M=5000\) image-caption pairs \(\{(\boldsymbol{I}_i, \mathcal{C}_i)\}_{i=1}^M\) are randomly sampled from the MSCOCO training set, and counterfactual images are generated via the following pipeline:

Step 1: Caption perturbation. GPT-3.5 is used to generate hallucinated captions \(\tilde{\mathcal{C}}_i\) from each ground-truth caption \(\mathcal{C}_i\), injecting plausible but non-existent object descriptions (e.g., adding "a bunch of grapes" to a dining table scene).

Step 2: Latent encoding and forward diffusion noise injection. The image is encoded into the latent space using the VAE encoder of Stable Diffusion v1.5, followed by \(t_h\) steps of forward diffusion to introduce Gaussian noise:

\[\boldsymbol{z}_0 = \mathcal{E}(\boldsymbol{I}_i)\]
\[\tilde{\boldsymbol{z}}_{t_h} = \sqrt{\bar{\alpha}_{t_h}} \boldsymbol{z}_0 + \sqrt{1 - \bar{\alpha}_{t_h}} \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, I)\]

where \(\bar{\alpha}_{t_h}\) is the cumulative product of noise schedule coefficients. Setting \(t_h = 0.5T\) preserves the global structure while allowing semantic content to be modified.
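In code, this closed-form noising step amounts to a single reparameterized sample; a minimal sketch, assuming access to the scheduler's cumulative \(\bar{\alpha}\) schedule (e.g., the `alphas_cumprod` tensor exposed by diffusers schedulers):

```python
import torch

def add_forward_noise(z0: torch.Tensor, alphas_cumprod: torch.Tensor, t_h: int) -> torch.Tensor:
    """Sample z_{t_h} ~ q(z_{t_h} | z_0) in closed form."""
    abar = alphas_cumprod[t_h]                     # cumulative noise-schedule product at step t_h
    eps = torch.randn_like(z0)                     # epsilon ~ N(0, I)
    return abar.sqrt() * z0 + (1.0 - abar).sqrt() * eps
```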

Step 3: Conditional reverse denoising. Reverse denoising is performed conditioned on the hallucinated caption \(\tilde{\mathcal{C}}_i\), guiding the noisy latent toward an image aligned with the hallucinated description:

\[\tilde{\boldsymbol{z}}_{t-1} = f_\theta(\tilde{\boldsymbol{z}}_t, t, \tilde{\mathcal{C}}_i), \quad t = t_h, \dots, 1\]

Step 4: Decoding to obtain counterfactual images. The final latent is decoded into the counterfactual image \(\tilde{\boldsymbol{I}}_{i,j} = \mathcal{D}(\tilde{\boldsymbol{z}}_0)\), where \(j=1,\dots,B\) (with \(B=5\)) indexes variants generated from different Gaussian noise seeds. Each counterfactual image is paired with the original ground-truth caption, creating a semantic conflict:

\[\textbf{OHC-25K} = \{(\tilde{\boldsymbol{I}}_{i,j}, \mathcal{C}_i) \mid i=1,\dots,M,\; j=1,\dots,B\}\]
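Steps 2–4 closely mirror an SDEdit-style image-to-image pass. Below is a hedged sketch of how one such counterfactual batch could be generated with the diffusers img2img pipeline; the file path and caption are placeholders, `strength=0.5` plays the role of \(t_h = 0.5T\), and `guidance_scale=7.5` matches the setting reported in the implementation details.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = Image.open("coco_image.jpg").convert("RGB")             # original image I_i (placeholder path)
hallucinated_caption = "a dining table with a bunch of grapes"  # perturbed caption from Step 1

# strength=0.5 noises the latent to roughly half of the diffusion trajectory before
# denoising conditioned on the hallucinated caption, preserving global structure
# while substituting semantic content.
counterfactuals = [
    pipe(prompt=hallucinated_caption, image=image,
         strength=0.5, guidance_scale=7.5).images[0]
    for _ in range(5)  # B = 5 variants, each from a fresh Gaussian noise seed
]
```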

Key Design 2: Hallucination Subspace Estimation

For each original pair \((\boldsymbol{I}_i, \mathcal{C}_i)\) and its \(B\) counterfactual variants \((\tilde{\boldsymbol{I}}_{i,j}, \mathcal{C}_i)\), intermediate hidden representations are extracted from the frozen LVLM. Let \(\boldsymbol{h}_{\ell,k}^{(i)}\) denote the hidden state at layer \(\ell\) for caption token \(k\). Mean pooling over the \(N\) caption tokens yields a per-sample representation, and the token-averaged states of the \(B\) counterfactual variants, \(\tilde{\boldsymbol{h}}_\ell^{(i,j)}\), are further averaged over variants:

\[\boldsymbol{h}_\ell^{(i)} = \frac{1}{N}\sum_{k=1}^{N} \boldsymbol{h}_{\ell,k}^{(i)}, \quad \tilde{\boldsymbol{h}}_\ell^{(i)} = \frac{1}{B}\sum_{j=1}^{B} \tilde{\boldsymbol{h}}_\ell^{(i,j)}\]

The hallucination direction vector for sample \(i\) at layer \(\ell\) is computed as:

\[\boldsymbol{\delta}_\ell^{(i)} = \tilde{\boldsymbol{h}}_\ell^{(i)} - \boldsymbol{h}_\ell^{(i)}\]

All hallucination direction vectors are stacked into a difference matrix \(\boldsymbol{\Delta}_\ell \in \mathbb{R}^{M \times d}\), on which singular value decomposition is performed:

\[\boldsymbol{\Delta}_\ell = \boldsymbol{U}_\ell \boldsymbol{\Sigma}_\ell \boldsymbol{V}_\ell^\top\]

The top-\(r\) right singular vectors \(\boldsymbol{V}_{\ell,r} = [\boldsymbol{v}_{\ell,1}, \dots, \boldsymbol{v}_{\ell,r}]\) are retained as the Hallucination Basis Bank for each layer. The subspace spanned by these vectors captures the principal directions of vision-induced hallucinations.
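A minimal sketch of the per-layer subspace estimation, assuming the token-averaged hidden states have already been collected (array names are illustrative):

```python
import torch

def hallucination_basis(h_orig: torch.Tensor, h_cf: torch.Tensor, r: int) -> torch.Tensor:
    """
    h_orig: (M, d)    token-averaged hidden states of original pairs at layer l
    h_cf:   (M, B, d) token-averaged hidden states of the B counterfactual variants
    Returns V_{l,r} of shape (d, r): the top-r right singular vectors spanning
    the estimated hallucination subspace at this layer.
    """
    delta = h_cf.mean(dim=1) - h_orig                         # (M, d) per-sample hallucination directions
    _, _, Vh = torch.linalg.svd(delta, full_matrices=False)   # economy SVD of the difference matrix
    return Vh[:r].T                                           # keep the top-r right singular vectors
```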

Key Design 3: Test-Time Hallucination Removal

At inference time, at each autoregressive decoding step \(k\) and selected layer \(\ell\), the hidden state is projected onto the orthogonal complement of the hallucination subspace:

\[\boldsymbol{h}_{\ell,k}^{\text{clean}} = \boldsymbol{h}_{\ell,k}^{\text{test}} - \sum_{j=1}^{r} \langle \boldsymbol{h}_{\ell,k}^{\text{test}}, \boldsymbol{v}_{\ell,j} \rangle \boldsymbol{v}_{\ell,j}\]

Equivalently, expressed via a projection matrix:

\[\boldsymbol{h}_{\ell,k}^{\text{clean}} = \boldsymbol{P}_\ell \boldsymbol{h}_{\ell,k}^{\text{test}}, \quad \boldsymbol{P}_\ell = \boldsymbol{I} - \boldsymbol{V}_{\ell,r} \boldsymbol{V}_{\ell,r}^\top\]

This operation is applied before each token decoding step, removing components aligned with the hallucination directions while preserving core semantic information.
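One way to realize this in practice is a forward hook on the selected decoder layers. The sketch below is an assumption about how the projection could be wired into a HuggingFace-style LVLM; the attribute path to the decoder layers is model-specific and hypothetical.

```python
import torch

def make_projection_hook(V_r: torch.Tensor):
    """V_r: (d, r) hallucination basis for one layer. The returned hook projects
    the layer's hidden states onto the orthogonal complement of span(V_r)."""
    P = torch.eye(V_r.shape[0], device=V_r.device, dtype=V_r.dtype) - V_r @ V_r.T  # P_l = I - V V^T

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq_len, d)
        cleaned = hidden @ P                                         # P is symmetric, so right-multiplication applies P to each state
        return (cleaned, *output[1:]) if isinstance(output, tuple) else cleaned

    return hook

# Hypothetical registration on the upper decoder layers (indices follow the paper's 16-32 range):
# for idx in range(16, 32):
#     layer = model.language_model.model.layers[idx]  # model-specific attribute path
#     layer.register_forward_hook(make_projection_hook(basis_bank[idx]))
```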

Implementation Details

  • Intervention layer selection: Projection is applied at upper Transformer layers (layers 16–32).
  • Rank selection: \(r=8\) for LLaVA-1.5, \(r=64\) for MiniGPT-4, \(r=32\) for mPLUG-Owl2 (determined via grid search).
  • Diffusion timestep: \(t_h = 0.5T\), balancing structural preservation and semantic substitution.
  • Classifier-free guidance scale: 7.5.
  • Decoding settings: beam size=3; maximum 64 tokens for CHAIR, 256 tokens for OPOPE.
  • Zero additional inference overhead: Projection is a lightweight matrix operation; throughput is identical to greedy decoding.

Key Experimental Results

Main Results 1: CHAIR Benchmark (Hallucination Rate + Fluency)

| Method | LLaVA CHAIR\(_S\)↓ | CHAIR\(_I\)↓ | BLEU↑ | MiniGPT-4 CHAIR\(_S\)↓ | CHAIR\(_I\)↓ | BLEU↑ | mPLUG-Owl2 CHAIR\(_S\)↓ | CHAIR\(_I\)↓ | BLEU↑ |
|---|---|---|---|---|---|---|---|---|---|
| Greedy | 20.40 | 7.08 | 15.72 | 32.40 | 12.20 | 14.57 | 22.90 | 8.62 | 15.01 |
| DoLa | 20.20 | 6.75 | 15.68 | 31.90 | 12.15 | 14.54 | 22.40 | 8.36 | 15.13 |
| OPERA | 17.50 | 6.07 | 16.02 | 29.70 | 11.96 | 14.82 | 20.07 | 7.18 | 15.41 |
| VCD | 20.30 | 7.28 | 14.53 | 29.00 | 12.64 | 14.42 | 22.80 | 8.68 | 15.14 |
| Woodpecker | 23.85 | 7.50 | 17.05 | 28.87 | 10.20 | 15.30 | 26.33 | 8.43 | 16.43 |
| LURE | 19.48 | 6.50 | 15.97 | 27.88 | 10.20 | 15.03 | 21.27 | 7.67 | 15.65 |
| HALC | 16.90 | 5.72 | 16.02 | 25.20 | 9.42 | 14.91 | 18.80 | 7.00 | 15.33 |
| Nullu | 15.20 | 5.30 | 15.69 | 21.40 | 8.99 | 14.81 | 15.60 | 5.77 | 15.45 |
| CIPHER | 13.05 | 4.53 | 15.82 | 18.48 | 8.33 | 15.10 | 13.60 | 4.92 | 16.25 |

CIPHER achieves the lowest hallucination rate across all three models. On LLaVA, CHAIR\(_S\) drops by 2.15 points relative to Nullu and by 7.35 points relative to Greedy; on MiniGPT-4, the reduction versus Greedy reaches 13.92 points. BLEU scores are maintained or improved, indicating that hallucination suppression does not sacrifice generation fluency.

Main Results 2: OPOPE Benchmark (Object Hallucination Detection)

| Method | LLaVA Acc↑ | Prec↑ | F1↑ | MiniGPT-4 Acc↑ | Prec↑ | F1↑ | mPLUG-Owl2 Acc↑ | Prec↑ | F1↑ |
|---|---|---|---|---|---|---|---|---|---|
| Greedy | 79.14 | 91.98 | 90.45 | 71.22 | 93.72 | 90.04 | 76.46 | 88.85 | 87.29 |
| Nullu | 79.52 | 93.46 | 91.79 | 71.92 | 95.96 | 92.07 | 77.09 | 92.83 | 90.80 |
| CIPHER | 80.05 | 93.72 | 92.11 | 72.25 | 96.50 | 92.58 | 77.87 | 92.93 | 90.95 |

CIPHER consistently outperforms all baselines on the near-saturated OPOPE benchmark, with particularly notable improvements in precision.

Ablation Study: Hallucination Source Analysis

| Text Perturbation | Image Perturbation | CHAIR\(_S\)↓ | CHAIR\(_I\)↓ | BLEU↑ |
|---|---|---|---|---|
| Yes | No | 15.20 | 5.30 | 15.69 |
| No | Yes | 13.05 | 4.53 | 15.82 |
| Yes | Yes | 15.71 | 5.32 | 15.66 |

Image perturbation alone yields the best results; joint perturbation slightly underperforms, suggesting that the two hallucination directions may interfere with each other.

Inference Efficiency Comparison

| Metric | Greedy | OPERA | HALC | Nullu | CIPHER |
|---|---|---|---|---|---|
| CHAIR\(_S\)↓ | 20.40 | 17.50 | 16.90 | 15.20 | 13.05 |
| Throughput↑ (items/s) | 0.70 | 0.10 | 0.05 | 0.70 | 0.70 |

CIPHER maintains throughput identical to standard greedy decoding (0.70 items/s), running 7× faster than OPERA and 14× faster than HALC.

Key Findings

  • Linear probing confirms visual perturbations outperform text perturbations: hidden representations under text perturbation achieve linear-probe accuracy of only 0.73–0.80 and are unstable across layers, whereas visual perturbations reach 0.86–0.89 with stable behavior across layers.
  • Optimal diffusion timestep \(t_h=0.5T\): Smaller values retain too much original semantics; larger values destroy structural consistency.
  • Optimal rank \(r=8\) for LLaVA: Simultaneously optimal on both CHAIR and BLEU.
  • Strong robustness to visual noise: CIPHER consistently outperforms baselines across a wide range of noise levels (\(\sigma\) from 0 to 1), with greater advantages under high-noise conditions.
  • Consistent improvements across all MMHal-Bench categories: Attributes, environment, holistic descriptions, and adversarial scenarios benefit most.
  • LLaVA-Bench: Accuracy improves from 6.79 to 7.08; detail score improves from 6.33 to 6.75.

Highlights & Insights

  • Novel perspective via visual counterfactuals: CIPHER is the first to extract hallucination directions from "hallucinated images" rather than "hallucinated text." Linear probing experiments robustly demonstrate that visual hallucination signals are stronger and more stable.
  • Zero additional inference overhead: Only a single forward pass plus lightweight matrix projection is required; throughput is identical to the no-intervention baseline.
  • Training-free, plug-and-play: No model retraining, additional annotations, or architectural modifications are needed—only a one-time offline subspace computation.
  • Cross-model generalizability: Consistent and significant improvements are demonstrated across three architectures: LLaVA-1.5, MiniGPT-4, and mPLUG-Owl2.
  • Mathematical elegance: The pipeline from counterfactuals → difference vectors → SVD → orthogonal projection forms a complete and interpretable logical chain.

Limitations & Future Work

  • The offline phase relies on Stable Diffusion and GPT-3.5 for counterfactual data generation, incurring a non-trivial one-time construction cost.
  • The hallucination subspace is statically fixed; the same projection matrix is applied to all test inputs, lacking input-adaptive capability.
  • The rank \(r\) and intervention layers require grid search for each model, with substantial variation in optimal configurations across models (\(r\) ranging from 8 to 64).
  • Joint text and image perturbation underperforms image perturbation alone; the paper does not provide sufficiently deep analysis of this phenomenon.
  • Evaluation primarily targets object-level hallucinations; effectiveness on fine-grained attribute/relational hallucinations warrants further exploration.

Rating

⭐⭐⭐⭐ — The method is concise and elegant with effective experimental validation. The counterfactual idea of "using hallucinated images to combat hallucinations" is novel, and comprehensive evaluation across four benchmarks is convincing. The primary weaknesses are the lack of adaptivity in the static projection and insufficient analysis of why joint perturbation underperforms.