SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense
Conference: ICLR 2026 | arXiv: 2510.16596 | Code: GitHub | Area: AI Safety | Keywords: LVLM hallucination, visual encoder, statistical bias, inherent bias, adversarial robustness, contrastive decoding, training-free
TL;DR
This work is the first to systematically trace object hallucinations in LVLMs back to the visual encoder, identifying three core issues: statistical bias (over-emphasis on high-frequency pattern tokens), inherent bias (residual representations of pre-training dominant objects), and vulnerability (feature distortion under minimal perturbations). It proposes SHIELD—a fully training-free framework that jointly addresses these issues via token reweighting, token subtraction, and contrastive decoding, achieving comprehensive improvements over VCD and OPERA on LLaVA-1.5, InstructBLIP, and Qwen-VL.
Background & Motivation
Background: Large vision-language models (LVLMs) demonstrate strong performance on cross-modal tasks, yet object hallucination—where models generate plausible but image-inconsistent object descriptions—severely limits their deployment in safety-critical domains such as medical imaging, autonomous driving, and robotics.
Limitations of Prior Work: Existing hallucination mitigation methods fall into two categories: training-based methods (CLIP-DPO, LURE, LLaVA-RLHF) are resource-intensive; training-free methods (VCD via blurred image contrast, OPERA via over-trust penalty, HALC via adaptive focal contrastive decoding) are more efficient, but nearly all focus on the LLM component, leaving the role of the visual encoder largely unexplored.
Key Finding — Statistical Bias: Due to imbalanced pre-training data distributions, CLIP visual encoders over-emphasize tokens corresponding to high-frequency visual patterns (exhibiting abnormally high L2 norms), causing downstream LLM attention to be "hijacked" by these over-activated tokens and distorting fine-grained perception. Experiments show that higher peak-to-average L2 ratios correlate with higher proportions of hallucinated samples.
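As a rough illustration of this diagnostic (not the authors' code), the peak-to-average L2 ratio can be computed directly from the encoder's patch tokens; `visual_tokens` below is a stand-in tensor:

```python
import torch

# Stand-in for real encoder output: (N, d) patch tokens from CLIP's ViT.
visual_tokens = torch.randn(576, 1024)

l2_norms = visual_tokens.norm(dim=-1)            # per-token L2 norm, shape (N,)
peak_to_avg = (l2_norms.max() / l2_norms.mean()).item()
print(f"peak-to-average L2 ratio: {peak_to_avg:.2f}")
# Per the paper's finding, higher ratios correlate with a higher
# proportion of hallucinated samples.
```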
Key Finding — Inherent Bias: The encoder develops "ghost representations" of dominant objects in pre-training data—even when fed pure random noise, LLaVA-1.5 still identifies high-frequency objects such as "car," "chair," and "table" as present, demonstrating that the encoder itself carries input-independent erroneous priors.
Key Finding — Vulnerability: Visual encoders do not acquire sufficient noise/perturbation robustness during pre-training. Experiments show that on the POPE COCO subset, just a few steps of PGD adversarial attack reduce the F1 score from approximately 87 to below 70, indicating that minor perturbations cause severe feature distortion.
Mechanism: Each problem corresponds to a dedicated solution—token reweighting corrects statistical bias, token subtraction eliminates inherent bias, and adversarial contrastive decoding addresses vulnerability—forming a complete encoder-side hallucination defense.
Method
Overall Architecture
SHIELD is a fully training-free framework that operates on visual tokens during the LVLM inference stage. Given an input image and a query, the visual tokens produced by the encoder are affected by three categories of issues: statistical bias causing certain tokens to be over-emphasized, inherent bias introducing erroneous representations, and vulnerability causing feature instability. SHIELD processes these three issues sequentially through three dedicated modules.
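A minimal training-free sketch of this pipeline in PyTorch-style Python. All model methods (`generate_naive_caption`, `clip_text_tokens`, `encode_image`, `clip_image_embed`, `clip_text_embed`) are hypothetical placeholders standing in for the corresponding LVLM components, not the authors' released code:

```python
import torch
import torch.nn.functional as F


def pgd_attack(model, image, caption, budget=0.02, steps=3):
    """Few-step PGD that minimizes image-text cosine similarity,
    steering the image toward the encoder's most vulnerable regions."""
    text_emb = F.normalize(model.clip_text_embed(caption), dim=-1).detach()
    delta = torch.zeros_like(image, requires_grad=True)
    step = budget / steps
    for _ in range(steps):
        img_emb = F.normalize(model.clip_image_embed(image + delta), dim=-1)
        sim = (img_emb * text_emb).sum()          # global cosine similarity
        sim.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()     # descend on similarity
            delta.clamp_(-budget, budget)         # L-inf budget, e.g. l = 0.02
        delta.grad.zero_()
    return (image + delta).detach()


@torch.no_grad()
def shield_visual_tokens(model, image, noise_bias):
    """Encoder-side SHIELD pipeline: reweight -> subtract -> attacked view.

    `noise_bias` is the mean of encoder outputs over K random noise images;
    it depends only on encoder weights, so it can be precomputed and cached.
    """
    # Step 0: naive description from the unmodified LVLM.
    caption = model.generate_naive_caption(image)
    text_tok = model.clip_text_tokens(caption)                  # (P, d)

    # Step 1: token reweighting against statistical bias.
    vis_tok = model.encode_image(image)                         # (N, d)
    sim = F.normalize(vis_tok, dim=-1) @ F.normalize(text_tok, dim=-1).T
    weights = sim.max(dim=-1).values                            # (N,)
    weights = (weights - weights.min()) / (weights.max() - weights.min() + 1e-6)
    vis_tok = vis_tok + weights.unsqueeze(-1) * vis_tok         # residual add

    # Step 2: token subtraction against inherent bias.
    vis_tok = vis_tok - noise_bias

    # Step 3: adversarial view for contrastive decoding downstream.
    with torch.enable_grad():
        adv_image = pgd_attack(model, image, caption)
    attacked_tok = model.encode_image(adv_image)
    return vis_tok, attacked_tok
```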
Key Design 1: Token Reweighting — Mitigating Statistical Bias
- Function: Redistributes the weights of visual tokens so that the model attends to tokens more relevant to genuine objects, rather than being dominated by a few high-L2-norm tokens.
- Core Idea: (1) Use the original LVLM to generate a naïve description \(\mathbf{c}^{\text{naive}}\) for the image; (2) encode the description into \(P\) text tokens \(\mathbf{c}\) via the CLIP text encoder; (3) compute the cosine similarity matrix \(\mathbf{M} \in \mathbb{R}^{N \times P}\) between visual tokens \(\mathbf{x}^v\) and text tokens \(\mathbf{c}\); (4) take the maximum similarity of each visual token across all text tokens and normalize to obtain weights \(\mathbf{W}^v\); (5) apply the weights to the original tokens via residual addition (a plausible form is sketched after this list).
- Design Motivation: Although the naïve description may contain hallucinated objects, those objects cannot match any visual token with high similarity in the similarity matrix, and thus are not erroneously amplified. This self-cleaning property ensures that reweighting only reinforces genuinely present objects.
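A plausible form of steps (4)–(5), assuming element-wise residual scaling (the paper's exact normalization may differ):

\[
\mathbf{W}^v_i = \operatorname{norm}\!\Big(\max_{1 \le j \le P} \mathbf{M}_{ij}\Big), \qquad \hat{\mathbf{x}}^v_i = \mathbf{x}^v_i + \mathbf{W}^v_i\,\mathbf{x}^v_i, \quad i = 1,\dots,N.
\]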
Key Design 2: Token Subtraction — Eliminating Inherent Bias
- Function: Estimates and removes the "ghost" representations inherently carried by the encoder due to pre-training data distribution, yielding visual tokens that more faithfully reflect the current input.
- Core Idea: Feed \(K\) random noise images into the visual encoder, average the resulting tokens as an estimate of inherent bias, and subtract that estimate from the reweighted tokens (sketched after this list).
- Design Motivation: Inherent bias depends solely on encoder parameters (independent of input), so the average output over noise inputs reliably estimates these erroneous representations. This estimate can be precomputed and cached, incurring negligible additional inference overhead.
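In symbols, with \(E\) the visual encoder and \(\mathbf{n}_k\) the \(k\)-th noise image, the subtraction step can plausibly be written as (assumed notation, consistent with the description above):

\[
\mathbf{b} = \frac{1}{K}\sum_{k=1}^{K} E(\mathbf{n}_k), \qquad \tilde{\mathbf{x}}^v = \hat{\mathbf{x}}^v - \mathbf{b},
\]

where \(\hat{\mathbf{x}}^v\) are the reweighted tokens from the previous module and \(\mathbf{b}\) can be computed once offline and cached.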
Key Design 3: Adversarial Attack + Contrastive Decoding — Addressing Vulnerability
- Function: Exposes vulnerability-induced hallucinations via adversarial perturbations, then suppresses them through contrastive decoding at inference time.
- Core Idea: (1) Construct an adversarial perturbation \(\delta^*\) by minimizing the cosine similarity between the global representation of the perturbed image and the naïve description; (2) add the perturbation to the original image to produce "attacked" visual tokens \(\bar{\mathbf{x}}^v = E(\mathbf{v}+\delta^*)\); (3) contrast the two sets of logits during decoding (both formulas are sketched after this list).
- Design Motivation: Adversarial attacks precisely expose the semantic regions where the encoder is most susceptible to deception, i.e., the highest-vulnerability areas. Contrastive decoding then leverages the discrepancy between the attacked and normal versions to accurately suppress vulnerability-induced hallucinated outputs while preserving correct content. An adaptive plausibility constraint (\(\beta\) truncation) further prevents the generation of implausible tokens.
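A hedged reconstruction of the two formulas referenced above, writing \(g(\cdot)\) and \(t(\cdot)\) for the global image and text embeddings (assumed notation; \(\alpha\) is the contrast strength and \(l\) the perturbation budget; the paper's exact objective may differ):

\[
\delta^* = \arg\min_{\|\delta\|_\infty \le l} \cos\!\big(g(\mathbf{v}+\delta),\, t(\mathbf{c}^{\text{naive}})\big), \qquad \bar{\mathbf{x}}^v = E(\mathbf{v}+\delta^*),
\]

\[
p_{\text{SHIELD}}(y_t) \propto \operatorname{softmax}\!\big[(1+\alpha)\,\operatorname{logit}(y_t \mid \tilde{\mathbf{x}}^v) - \alpha\,\operatorname{logit}(y_t \mid \bar{\mathbf{x}}^v)\big],
\]

following the standard contrastive-decoding form used by VCD, with \(\tilde{\mathbf{x}}^v\) the corrected tokens from the first two modules.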
Key Design 4: Adaptive Plausibility Constraint
- Function: After contrastive decoding, retains only tokens whose probability is at least \(\beta\) times the maximum token probability, setting all others to zero (sketched below).
- Design Motivation: Prevents contrastive decoding from introducing implausible low-probability tokens, thereby preserving output quality.
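In symbols, this is the adaptive plausibility constraint familiar from VCD-style contrastive decoding (with \(\beta = 0.35\) here):

\[
\mathcal{V}_{\text{head}} = \Big\{\, y : p_\theta(y \mid \cdot) \ge \beta \max_{y'} p_\theta(y' \mid \cdot) \,\Big\},
\]

with the probability of every token outside \(\mathcal{V}_{\text{head}}\) set to zero before sampling.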
Key Experimental Results
Table 1: CHAIR Hallucination Evaluation (500 COCO images, long descriptions)
| LVLM | Method | \(C_S\)↓ | \(C_I\)↓ |
|---|---|---|---|
| LLaVA-1.5 | Vanilla | 48.8 | 14.2 |
| LLaVA-1.5 | VCD | 46.8 | 13.2 |
| LLaVA-1.5 | OPERA | 44.6 | 12.8 |
| LLaVA-1.5 | SHIELD | 36.6 | 10.3 |
| InstructBLIP | Vanilla | 54.6 | 24.8 |
| InstructBLIP | VCD | 44.0 | 13.6 |
| InstructBLIP | OPERA | 46.4 | 14.2 |
| InstructBLIP | SHIELD | 40.4 | 10.9 |
| Qwen-VL | Vanilla | 49.2 | 13.1 |
| Qwen-VL | VCD | 46.4 | 11.9 |
| Qwen-VL | OPERA | 34.6 | 9.5 |
| Qwen-VL | SHIELD | 28.9 | 9.2 |
On LLaVA-1.5, SHIELD reduces \(C_S\) by approximately 18% (44.6→36.6) and \(C_I\) by approximately 20% compared to the second-best method OPERA.
Table 2: POPE Hallucination Evaluation (COCO subset, Accuracy)
| LVLM | Method | Random Acc↑ | Popular Acc↑ | Adversarial Acc↑ | Avg Acc↑ |
|---|---|---|---|---|---|
| LLaVA-1.5 | Vanilla | 83.2 | 81.8 | 78.9 | 81.3 |
| LLaVA-1.5 | VCD | 87.7 | 85.3 | 80.8 | 84.6 |
| LLaVA-1.5 | OPERA | 89.1 | 86.0 | 79.1 | 84.7 |
| LLaVA-1.5 | SHIELD | 91.3 | 87.4 | 82.5 | 87.0 |
| Qwen-VL | Vanilla | 84.7 | 84.1 | 82.2 | 83.6 |
| Qwen-VL | VCD | 88.6 | 87.1 | 84.2 | 86.6 |
| Qwen-VL | SHIELD | 89.2 | 87.6 | 84.3 | 87.0 |
SHIELD's advantage is most pronounced on the Adversarial split, suggesting that encoder bias and vulnerability are the primary sources of hallucination in adversarial settings.
Table 3: MME Hallucination Subset Evaluation
| LVLM | Method | Existence↑ | Count↑ | Position↑ | Color↑ | Total↑ |
|---|---|---|---|---|---|---|
| LLaVA-1.5 | Vanilla | 175.6 | 124.6 | 114.0 | 151.0 | 565.3 |
| LLaVA-1.5 | VCD | 184.6 | 138.3 | 128.6 | 153.0 | 604.6 |
| LLaVA-1.5 | OPERA | 180.6 | 133.3 | 123.3 | 155.0 | 592.3 |
| LLaVA-1.5 | SHIELD | 195.0 | 141.6 | 148.3 | 183.3 | 668.3 |
| Qwen-VL | Vanilla | 155.0 | 127.6 | 131.6 | 173.0 | 587.3 |
| Qwen-VL | SHIELD | 180.0 | 170.0 | 128.3 | 190.0 | 668.3 |
Improvements in Position and Color are particularly notable (LLaVA-1.5: Position 114→148, Color 151→183), demonstrating that mitigating statistical bias substantially enhances the model's fine-grained attribute perception.
Table 4: Ablation Study (CHAIR, LLaVA-1.5)
| Module Configuration | \(C_S\)↓ | \(C_I\)↓ |
|---|---|---|
| Vanilla | 48.8 | 14.2 |
| + Adaptive plausibility constraint | 50.2 | 13.8 |
| + Adversarial vulnerability defense | 46.4 | 12.8 |
| + Statistical bias mitigation | 40.4 | 11.0 |
| + Inherent bias elimination (full SHIELD) | 36.6 | 10.3 |
The modules are complementary. Statistical bias mitigation yields the largest individual contribution (\(C_S\) reduced from 46.4 to 40.4, approximately 13%), and inherent bias elimination provides a further reduction of approximately 10%. Note that the plausibility constraint alone slightly raises \(C_S\) (48.8→50.2) while lowering \(C_I\).
Key Findings
- The encoder is a critical source of hallucination: All prior training-free methods focus exclusively on the LLM side; SHIELD is the first to demonstrate that bias and vulnerability within the visual encoder constitute an independent and significant source of hallucination, surpassing existing methods across all benchmarks.
- Statistical bias is the dominant driver of hallucination: Ablation results show that mitigating statistical bias yields the largest performance gain; the over-emphasis of high-L2 tokens has a particularly pronounced impact on hallucination in long-description scenarios.
- SHIELD does not sacrifice general capability: Full MME evaluation shows Perception improving from 1279 to 1473 (+194) and Total from 1632 to 1811 (+179), indicating that mitigating encoder bias not only reduces hallucination but also improves general perceptual abilities such as OCR and poster recognition.
- Attribute-level hallucination sees the greatest improvement: Position improves by over 30% and Color by over 21%, indicating that encoder bias most severely impairs fine-grained attribute perception, and correction yields the greatest gains in these dimensions.
- Limited gains on InstructBLIP: The Q-Former module restricts the propagation of modified visual features, resulting in smaller gains for SHIELD, which indirectly confirms that SHIELD's mechanism indeed operates at the visual token level.
Highlights & Insights
- A new paradigm of "encoder-side hallucination": Prior work universally attributes hallucination to LLM overconfidence or data bias; SHIELD is the first to systematically localize the problem to the visual encoder, opening an entirely new research direction.
- The persuasive power of the noise-input experiment: Feeding pure noise into the encoder and still having the model "perceive" cars and chairs reveals not comprehension but the imprint of the pre-training data distribution, a concise yet highly insightful experimental design.
- Self-cleaning property of the naïve description: Token reweighting relies on the naïve description, yet hallucinated objects naturally fail to match high-similarity visual tokens in the similarity matrix and are therefore not amplified—an elegant self-correcting mechanism.
- Orthogonality of the triple defense: The three modules each address independent problem dimensions (distributional bias → residual representation → robustness), and ablation experiments confirm they are complementary and additively effective.
- Precomputable noise estimate: The inherent bias estimate depends only on encoder parameters and can be computed offline and cached, incurring near-zero additional overhead at inference time.
Limitations & Future Work
- Increased inference cost: The framework requires generating a naïve description (one forward pass), computing the CLIP similarity matrix, sampling noise inputs, and running adversarial attacks, which is expected to increase inference latency by 2–3×.
- Dependence on CLIP encoder architecture: Both token reweighting and adversarial attack strategies rely on CLIP's vision-text alignment; applicability to LVLMs that do not use CLIP encoders (e.g., EVA-CLIP variants or native ViT architectures) has not been validated.
- Limited effectiveness on InstructBLIP: The Q-Former bottleneck restricts the propagation of modified visual features, potentially reducing SHIELD's efficacy on architectures with intermediate adapters.
- Hyperparameter sensitivity: Fixed values of \(\alpha=2, \beta=0.35, K=32, l=0.02\) are used across all models; optimal hyperparameters may differ across models and tasks.
- Evaluation scope: Evaluation is primarily conducted on COCO-based datasets; generalization to out-of-distribution scenarios (medical, remote sensing, industrial) has not been verified.
Related Work & Insights
vs. VCD (Visual Contrastive Decoding)
VCD suppresses hallucinations by contrasting outputs from natural and blurred images, operating fundamentally at the LLM decoding stage. SHIELD directly corrects visual tokens at the encoder side and employs adversarial perturbations (rather than simple blurring) for contrastive decoding. SHIELD reduces \(C_S\) by approximately 22% compared to VCD on CHAIR (46.8→36.6) and outperforms it by approximately 1.7 points on POPE Adversarial (80.8→82.5). VCD's blurring is semantically agnostic uniform degradation, whereas SHIELD's adversarial attack is semantically targeted, enabling more precise exposure of vulnerabilities.
vs. OPERA
OPERA avoids over-reliance on specific tokens by adding an over-trust penalty in beam search, also operating at the LLM decoding stage. SHIELD reduces \(C_S\) by approximately 18% compared to OPERA on LLaVA-1.5 CHAIR (44.6→36.6) and achieves a 76-point higher total score on MME hallucination (592→668). OPERA indirectly mitigates statistical bias but does not address its root cause; SHIELD directly reweights visual tokens for a more fundamental correction.
vs. MARINE / VTI
MARINE introduces image-text alignment guidance from external visual models; VTI adjusts latent representations at test time to stabilize visual features. Both target feature-level correction but do not analyze the root causes of bias and vulnerability. SHIELD provides a more systematic root-cause analysis and a corresponding three-strategy defense framework.
Rating
- Novelty: ⭐⭐⭐⭐⭐ — First systematic attribution of encoder-side hallucination (statistical bias + inherent bias + vulnerability); the three-strategy defense framework is original
- Experimental Thoroughness: ⭐⭐⭐⭐ — 5 hallucination benchmarks (CHAIR/POPE/MME/AMBER/GPT-4o) × 3 LVLM families + complete ablation + visualization
- Writing Quality: ⭐⭐⭐⭐ — The logical chain from problem analysis → root-cause identification → solution is clear; figures and tables are well-designed
- Value: ⭐⭐⭐⭐ — Opens a new encoder-side direction for LVLM hallucination research; training-free nature has practical utility