
Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models

Conference: ICLR 2026 arXiv: 2602.21704 Code: None Area: Multimodal VLM Keywords: hallucination mitigation, activation engineering, attention head intervention, training-free method, large vision-language models

TL;DR

This paper proposes Dynamic Multimodal Activation Steering (DMAS), a training-free method that constructs a semantics-clustered database of truthfulness steering vectors together with a visual perception steering vector, and dynamically selects the most relevant vectors at inference time to intervene on critical attention heads. DMAS significantly mitigates hallucinations in LVLMs, achieving a 94.66-point gain on MME and a 20.2-point reduction in the CHAIR_S hallucination rate.

Background & Motivation

Large vision-language models (LVLMs) achieve strong performance on tasks such as VQA and image captioning, but suffer from severe hallucinations: fabricating non-existent objects or incorrectly describing image content. Existing approaches fall into two categories: training-based methods (e.g., LRV, RLHF-V) require carefully annotated data and substantial computational resources, while decoding-based methods (e.g., VCD, ICD) avoid training but often degrade generation quality.

Recent activation engineering approaches (e.g., ICT, VTI) attempt to reduce hallucinations by intervening in the model's internal representations, but exhibit critical limitations: ICT focuses solely on visual-level intervention, ignoring the multimodal nature of LVLMs, while VTI employs fixed steering vectors, overlooking how the appropriate steering direction varies across semantic contexts.

Core Findings: Through analysis of LLaVAv1.5's attention patterns, the authors identify two key phenomena: (1) truthfulness and visual perception capabilities primarily activate different subsets of attention heads (truthfulness concentrated in layer 30, visual perception in layer 31); (2) truthfulness steering vectors vary substantially across different semantic contexts (t-SNE visualizations reveal clearly separated semantic clusters). These two findings directly motivate the design of DMAS.
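
As a concrete illustration of finding (2), below is a minimal sketch of the kind of t-SNE check the authors describe. It assumes per-sample truthfulness difference vectors `diff_vectors` (shape `(N, d)`) and semantic cluster labels `semantic_labels` have already been extracted; both names are hypothetical, not from the paper's code.

```python
# Hypothetical check: do truthfulness steering directions cluster by semantics?
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# diff_vectors: (N, d) per-sample A_pos - A_neg at a chosen head (assumed given)
xy = TSNE(n_components=2, perplexity=30, init="pca",
          random_state=0).fit_transform(diff_vectors)
plt.scatter(xy[:, 0], xy[:, 1], c=semantic_labels, s=8, cmap="tab10")
plt.title("t-SNE of truthfulness steering directions")
plt.show()
```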

Method

Overall Architecture

DMAS operates in three steps: (1) constructing a dynamic truthfulness steering vector database; (2) computing visual perception steering vectors; (3) applying dynamic intervention on different attention heads at inference time. The entire method requires no training and operates in a plug-and-play manner.

Key Designs

  1. Truthfulness Steering Vector Database: Samples from the AMBER and SEED datasets are partitioned into 4 semantic clusters. For each sample, a correct/hallucinated answer pair is constructed and fed into the LVLM to extract activations \(A_{pos}\) and \(A_{neg}\) at the last token position across attention heads in each layer. The intra-cluster average activation difference forms the steering vector \(D_i = \frac{1}{|C_i|}\sum_{j \in C_i}(A_{pos,j} - A_{neg,j})\), with PCA applied for denoising. A Key-Value database is then built with the mean cluster embedding as the Key and the steering vector as the Value; at inference time, a sentence transformer scores semantic similarity between the input and each Key to dynamically retrieve the most relevant steering vector.

  2. Visual Perception Steering Vector: Given an original image \(V\) and a noise-perturbed image \(V'\) (obtained via the forward diffusion process), YOLOv11 detects objects to generate description templates. The activation difference between the original input \((V, T+Y_O)\) and the perturbed input \((V', T+Y_{O'})\) is computed as: \(D_v = A_v - A_{v'}\), with PCA applied to extract principal components. This design enhances the model's attention to visual information.

  3. Dynamic Inference Intervention: At inference time, the semantically most relevant truthfulness steering vector \(D_f\) is retrieved via cosine similarity, and binary masks \(M_f\) and \(M_v\) restrict the intervention to the Top-K attention heads with the largest activation differences. The modified attention computation is: \(\mathbf{x}^{(l+1)} = \mathbf{x}^{(l)} + \text{Concat}[\text{Attn}^{(l,h)}(\mathbf{x}^{(l)}) + \alpha \cdot M_f^{(l,h)} \cdot D_f^{(l,h)} + \beta \cdot M_v^{(l,h)} \cdot D_v^{(l,h)}] \cdot \mathbf{W}_o^{(l)}\), where \(\alpha\) and \(\beta\) control intervention strength. A minimal code sketch of this pipeline follows the list.
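
Pulling the three designs together, here is a minimal NumPy/scikit-learn sketch of the pipeline: clustering into a Key-Value database, cosine-similarity retrieval, and masked Top-K head intervention. All names, shapes, and the exact PCA-denoising step are assumptions for illustration; the paper releases no code.

```python
# Minimal sketch of DMAS-style steering; all helper names and shapes are
# illustrative assumptions, not the authors' implementation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def pca_denoise(diffs):
    """Denoise per-sample activation differences by projecting their mean onto
    the top principal direction of each head (one reading of the paper's
    'PCA for denoising' step). diffs: (n, L, H, d_head)."""
    n, L, H, d = diffs.shape
    flat = diffs.reshape(n, L * H, d)
    out = np.zeros((L * H, d))
    for i in range(L * H):
        mean = flat[:, i, :].mean(axis=0)
        direction = PCA(n_components=1).fit(flat[:, i, :]).components_[0]
        out[i] = (mean @ direction) * direction  # projection of the mean
    return out.reshape(L, H, d)


def build_truthfulness_db(sample_embeddings, pos_acts, neg_acts, n_clusters=4):
    """Cluster samples semantically; per cluster, average the last-token
    activation differences A_pos - A_neg into one steering vector.
    Returns a list of (key_embedding, steering_vector) pairs."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(sample_embeddings)
    db = []
    for c in range(n_clusters):
        idx = labels == c
        key = sample_embeddings[idx].mean(axis=0)      # mean cluster embedding
        db.append((key, pca_denoise(pos_acts[idx] - neg_acts[idx])))
    return db


def retrieve(query_emb, db):
    """Cosine-similarity lookup of the most relevant steering vector."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return max(db, key=lambda kv: cos(query_emb, kv[0]))[1]


def topk_head_mask(steering, k):
    """Binary mask over (L, H) selecting the K heads whose steering-vector
    norm (a proxy for activation difference) is largest."""
    norms = np.linalg.norm(steering, axis=-1).ravel()
    mask = np.zeros_like(norms)
    mask[np.argsort(norms)[-k:]] = 1.0
    return mask.reshape(steering.shape[:2])


def steer_heads(head_out, D_f, D_v, alpha, beta, k):
    """Per-head intervention: head_out + alpha*M_f*D_f + beta*M_v*D_v,
    with all tensors shaped (L, H, d_head)."""
    M_f = topk_head_mask(D_f, k)[..., None]
    M_v = topk_head_mask(D_v, k)[..., None]
    return head_out + alpha * M_f * D_f + beta * M_v * D_v
```

In deployment, `steer_heads` would run inside a forward hook on each attention layer, adding the steering terms to the per-head outputs before they are concatenated and projected through \(\mathbf{W}_o^{(l)}\).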

Loss & Training

No training is required. Hyperparameters \(\alpha, \beta \in \{0.5, 1, ..., 10\}\) and \(K \in \{32, 64, ..., 1024\}\) are determined via grid search. Temperature is set to 0 and top_p to 1.
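
The grid search itself is straightforward; the sketch below assumes a hypothetical `evaluate(alpha, beta, k)` that runs the steered model on a validation set and returns a scalar score, and reads \(\{0.5, 1, \ldots, 10\}\) as a 0.5-step grid (the exact step is not stated).

```python
# Hypothetical hyperparameter grid search; `evaluate` is assumed given.
from itertools import product

import numpy as np

alphas = betas = np.arange(0.5, 10.5, 0.5)   # {0.5, 1.0, ..., 10}
ks = [2 ** i for i in range(5, 11)]          # {32, 64, ..., 1024}

best_alpha, best_beta, best_k = max(product(alphas, betas, ks),
                                    key=lambda cfg: evaluate(*cfg))
```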

Notable implementation details:

  • Key embeddings in the steering vector database are obtained via a sentence transformer (all-mpnet-base-v2)
  • YOLOv11 detects objects in the image for the visual perception component; objects from the same category that are absent from the image are randomly sampled from a category bank as contrast
  • PCA denoising is applied separately to the truthfulness and visual perception steering vectors to extract the most salient principal components
  • All experiments are conducted on an NVIDIA RTX 4090 (48GB) GPU
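
These details suggest a construction for the visual perception vector \(D_v\) of Key Design 2, sketched below. Here `detect_objects`, `sample_absent_objects`, and `get_head_activations` are hypothetical stand-ins for the YOLOv11 detector, the category-bank sampler, and an LVLM activation hook, and the linear beta schedule in `add_diffusion_noise` is an assumption (the paper only says "forward diffusion").

```python
# Sketch of D_v = A_v - A_v' from clean vs. diffusion-noised image inputs.
import numpy as np


def add_diffusion_noise(img, t, T=1000):
    """Closed-form DDPM forward process q(x_t | x_0) with a linear beta
    schedule (assumed; the paper does not specify the schedule)."""
    betas = np.linspace(1e-4, 0.02, T)
    alpha_bar = np.cumprod(1.0 - betas)[t]
    noise = np.random.randn(*img.shape)
    return np.sqrt(alpha_bar) * img + np.sqrt(1.0 - alpha_bar) * noise


def visual_steering_vector(images, template, model, t=500):
    diffs = []
    for img in images:
        objs = detect_objects(img)            # YOLOv11 in the paper
        fakes = sample_absent_objects(objs)   # absent same-category objects
        A_v = get_head_activations(model, img, template.format(objs=objs))
        A_vp = get_head_activations(model, add_diffusion_noise(img, t),
                                    template.format(objs=fakes))
        diffs.append(A_v - A_vp)              # (L, H, d_head) per sample
    # Mean difference; the paper additionally applies PCA denoising.
    return np.stack(diffs).mean(axis=0)
```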

Key Experimental Results

Main Results

MME Results:

Model Method Existence↑ Count↑ Position↑ Color↑ Total↑
LLaVAv1.5 Regular 175.67 124.67 114.00 151.00 565.33
LLaVAv1.5 ICT 190.00 160.43 128.67 170.00 649.10
LLaVAv1.5 DMAS 195.00 158.33 133.33 173.33 659.99
QwenVL Regular 155.00 127.67 131.67 173.00 587.33
QwenVL VAF 165.00 155.00 133.33 175.00 628.33
QwenVL DMAS 170.00 145.00 133.33 185.00 633.33

CHAIR Results (LLaVAv1.5):

Method CHAIR_S↓ CHAIR_I↓
Regular 51.0 15.2
VTI 35.8 11.1
DMAS 30.8 11.4

Ablation Study

Method CHAIR_S↓ CHAIR_I↓ POPE Acc↑ POPE F1↑
Full DMAS 30.8 11.4 81.70 82.47
Truthfulness vector only 34.2 11.7 81.67 82.42
Visual vector only 42.4 13.2 81.40 82.01
No intervention 51.0 15.2 75.08 76.06

Key Findings

  • Dynamic semantic matching for steering vector selection significantly outperforms fixed steering vectors; on QwenVL's Position subtask, fixed vectors even underperform the original model
  • A cluster count of 4 achieves optimal performance on both models; too few clusters leads to excessively coarse semantic granularity
  • The method yields substantial improvements on entirely different dataset types such as ScienceQA and ViQuAE (LLaVAv1.5 on ScienceQA: 52.75%→62.27%), demonstrating strong generalization
  • On POPE with MSCOCO, LLaVAv1.5 achieves +5.43% Accuracy and +7.14% F1; on GQA, +6.94% Accuracy and +6.5% F1
  • Negative values of \(\alpha\) and \(\beta\) (i.e., steering toward hallucination) degrade F1; excessively large values impair the model's base capabilities
  • Too few intervened attention heads yields negligible effect, while too many also leads to performance degradation

Highlights & Insights

  • The paper reveals that truthfulness and visual perception activate distinct subsets of attention heads in LVLMs, providing an important empirical foundation for future research
  • The dynamic semantic matching design is principled and effective, avoiding the limitations of a one-size-fits-all intervention strategy
  • The method is entirely training-free and can be plug-and-play integrated into different LVLM architectures
  • The experimental design is comprehensive: discriminative tasks (MME, POPE) + generative tasks (CHAIR) + generalization validation (ScienceQA, ViQuAE), forming a complete evaluation framework
  • Visualization analyses (attention head activation maps, t-SNE cluster plots, hyperparameter sensitivity curves) provide strong support for the method's motivation and effectiveness

Limitations & Future Work

  • The construction of the steering vector database relies on specific choices of the AMBER and SEED datasets; larger and more diverse data sources may further improve performance
  • The cluster count is currently fixed at 4; adaptively determining the optimal number of clusters is a worthwhile direction
  • Hyperparameters \(\alpha\), \(\beta\), and \(K\) require grid search; automated hyperparameter tuning merits investigation
  • Validation is limited to 7B-scale models; the effectiveness on larger models (e.g., 13B, 70B) remains to be verified
  • Constructing the steering vector database incurs non-trivial preprocessing cost (activation extraction over 3,000 samples)
  • The method assumes that the functional specialization of attention heads is consistent across different LVLM architectures; the universality of this assumption requires further validation
  • Relationship to ICT: ICT enhances attention by adding noise to objects in images, whereas DMAS intervenes along both truthfulness and visual perception dimensions simultaneously and supports dynamic semantic matching
  • Relationship to VTI: VTI uses fixed steering vectors; DMAS demonstrates the necessity of dynamic selection
  • Insight: The activation engineering paradigm warrants further exploration across a broader range of multimodal tasks, such as visual reasoning and multimodal dialogue

Rating

  • Novelty: ⭐⭐⭐⭐ The idea of dynamic semantic matching for steering vector selection is innovative, though activation engineering itself builds on prior foundational work
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers three benchmarks (MME/POPE/CHAIR), two models, complete ablations, and sufficient generalization validation
  • Writing Quality: ⭐⭐⭐⭐ Writing is clear with strong visualizations and naturally motivated problem framing
  • Value: ⭐⭐⭐⭐ High practical value as a training-free method, though clustering and hyperparameter search add deployment complexity