From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models

Conference: ACL 2026 | arXiv: 2604.17941 | Code: github | Area: Multimodal VLM | Keywords: neuron attribution, causal analysis, multi-task VLM, attention heads, model interpretability

TL;DR

This paper proposes HONES, a framework that first localizes task-critical attention heads and then uses them as conditions to guide FFN neuron attribution. This yields a unified, gradient-free, neuron-level causal analysis and enables lightweight task performance improvement across heterogeneous tasks in multi-task VLMs.

Background & Motivation

Background: Large vision-language models (VLMs) achieve strong performance across diverse tasks such as VQA, OCR, and image captioning, yet their internal decision-making processes remain opaque — multiple capabilities are entangled within shared parameters, hindering error attribution and controllable deployment. Neuron-level analysis can provide fine-grained, actionable insights.

Limitations of Prior Work: (1) Existing neuron analysis methods focus primarily on single-task settings and cannot compare neuron importance across heterogeneous tasks (e.g., question answering vs. image-text matching). (2) Most approaches score neurons independently, ignoring the task-dependent routing effect of attention heads, which inflates the importance scores of polysemantic neurons.

Key Challenge: How can task-critical neurons be accurately identified within a parameter space shared across different tasks, while avoiding the noise introduced by polysemanticity?

Goal: To design a unified cross-task neuron attribution framework and leverage the identified critical neurons for lightweight task performance improvement.

Key Insight: The framework adopts a structured causal view of the Transformer: attention heads select and route task-relevant inputs, while FFN neurons write the routed information into the residual stream. Accordingly, HONES first localizes the routing nodes (attention heads), then attributes FFN neurons conditioned on them.

Core Idea: The task importance of a neuron should be measured by its "write contribution" along the routing path of task-critical attention heads, rather than by simple activation magnitude.

Method

Overall Architecture

HONES operates in two stages. Discovery stage: (1) Task-critical attention heads \(\mathcal{H}_t^*\) are localized via mean-ablation intervention; (2) conditioned on the identified attention heads, the causal write contribution of each FFN neuron to the task objective is measured via Direct Vocabulary Projection (DVP), and Top-K neurons are selected. Steering stage: The backbone is frozen, and sparse scaling factors are learned exclusively on the critical neurons, with KL regularization enabling controllable task improvement.
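The stage-1 mean-ablation intervention can be sketched in a few lines. This is a minimal illustrative sketch, assuming per-head outputs are stacked as an (H, d) array; the helper names `mean_ablate` and `head_importance` are assumptions, not the paper's code:

```python
import numpy as np

def mean_ablate(head_outputs: np.ndarray, h: int) -> np.ndarray:
    """Replace head h's output with the mean of the remaining H-1 heads.

    head_outputs: (H, d) array of per-head outputs at one layer/position.
    """
    ablated = head_outputs.copy()
    others = np.delete(head_outputs, h, axis=0)  # the H-1 untouched heads
    ablated[h] = others.mean(axis=0)
    return ablated

def head_importance(perf_clean: float, perf_ablated: float) -> float:
    """S_t(h): task performance drop caused by ablating head h."""
    return perf_clean - perf_ablated
```

Running the task once per head with `mean_ablate` patched in and ranking heads by `head_importance` would yield the Top-\(K_h\) set \(\mathcal{H}_t^*\).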

Key Designs

  1. Causal Attention Head Localization:

    • Function: Identifies task-critical "routing nodes" to constrain the downstream neuron search space.
    • Mechanism: Mean ablation intervention is applied — the output of a target head is replaced by the mean of the remaining \(H-1\) heads' outputs, and the resulting task performance drop \(S_t(h)\) is measured. The Top-\(K_h\) heads are selected to form \(\mathcal{H}_t^*\).
    • Design Motivation: Mean ablation reduces out-of-distribution artifacts compared to zero ablation; localizing routing nodes first effectively isolates valid computation paths.
  2. Head-Guided Neuron Attribution (Causal Write Effect):

    • Function: Scores each FFN neuron conditioned on the task routing context.
    • Mechanism: For each neuron \((l,i)\), the vector \(\Delta \mathbf{r}_i^{(l)}\) written into the residual stream via the down-projection is computed, then projected onto the unembedding vector of the target token via DVP to obtain a write contribution \(c_{l,i}\). Interventions are then applied to each critical head, and the contribution drop \(\Delta c\) before and after intervention is aggregated, weighted by head importance, into a final score \(I_{l,i}\).
    • Design Motivation: Activation-based independent scoring is easily confounded by polysemanticity; head-guided conditioning ensures that only contributions along the task routing path are counted.
  3. Lightweight Neuron Steering:

    • Function: Enhances task performance by modulating the activations of critical neurons.
    • Mechanism: All backbone parameters are frozen, and a scaling factor \(\lambda_{l,i}\) is learned for each critical neuron. The optimization objective includes a task loss and a KL divergence regularization term: \(\min_{\lambda_t} \mathcal{L}_t + \beta \text{KL}(p_\theta \| p_{\theta_{\lambda_t}})\).
    • Design Motivation: KL regularization prevents excessive deviation from the original model behavior; learning only sparse scaling factors incurs minimal parameter overhead.
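The three designs above reduce to small vector operations. The sketch below is illustrative only: the names `write_contribution`, `head_guided_score`, and `steering_loss`, and the toy discrete KL term, are assumptions rather than the paper's implementation:

```python
import numpy as np

def write_contribution(w_down_col, activation, u_target):
    """c_{l,i}: DVP score -- project the neuron's residual-stream write
    (its activation times its down-projection column) onto the target
    token's unembedding vector."""
    delta_r = activation * np.asarray(w_down_col)
    return float(delta_r @ np.asarray(u_target))

def head_guided_score(c_clean, c_ablated_per_head, head_weights):
    """I_{l,i}: contribution drops under each critical-head intervention,
    aggregated with head-importance weights."""
    drops = c_clean - np.asarray(c_ablated_per_head)
    return float(np.asarray(head_weights) @ drops)

def steering_loss(task_loss, p_base, p_steered, beta):
    """Stage-2 objective: task loss plus beta * KL(p_base || p_steered),
    shown here for toy discrete distributions."""
    kl = float(np.sum(p_base * np.log(p_base / p_steered)))
    return task_loss + beta * kl
```

The Top-K neurons by `head_guided_score` would form the critical set, and only their scaling factors \(\lambda_{l,i}\) are then trained under the `steering_loss` objective.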

Loss & Training

The discovery stage uses a discovery split of 7K images; the steering stage uses a dev split of 2K images to learn scaling factors, with 3K images reserved for testing. For open-ended objectives (e.g., captioning), IDF-weighted aggregation of token unembedding vectors is applied.
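For open-ended objectives, the IDF-weighted aggregation mentioned above can be sketched as follows. The vectors and weights are toy values and `idf_weighted_target` is an illustrative name, not the paper's API:

```python
import numpy as np

def idf_weighted_target(unembed_vecs, idf_weights):
    """Aggregate token unembedding vectors into a single DVP target,
    down-weighting frequent (low-IDF) tokens."""
    w = np.asarray(idf_weights, dtype=float)   # (T,) IDF weight per token
    U = np.asarray(unembed_vecs, dtype=float)  # (T, d) unembedding vectors
    return (w[:, None] * U).sum(axis=0) / w.sum()
```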

Key Experimental Results

Main Results (Performance Drop % after Masking Top-1% Neurons)

Method   VQA     OCR     Caption  Retrieval  Avg.
AP       11.33   10.40    8.65     0.50       7.72
MA        6.82   15.50   11.90     1.35       8.89
APE       3.20   -1.87   12.20     0.90       3.61
HONES    27.30   19.00   19.80     7.43      18.38

Steering Results (LLaVA-1.5-7B)

Method        VQA    OCR    Caption  Retrieval  Avg.
Base          0.652  0.576  0.129    0.933      0.572
Grid-Search   0.666  0.594  0.132    0.956      0.587
HONES         0.673  0.602  0.141    0.963      0.595

Key Findings

  • HONES consistently outperforms activation-based methods across all tasks and both VLMs, achieving average performance drops of 18.38% (LLaVA) and 21.91% (Qwen).
  • Critical neurons exhibit task-dependent layer preferences: retrieval tasks concentrate in middle layers (visual-text alignment), while other tasks favor deeper layers (answer decoding).
  • VQA shared neurons show the highest cross-task salience, exhibiting a "hub" effect — VQA-related neurons underpin a broad range of visual-language tasks.
  • In OOD experiments, directly transferring in-domain learned scaling factors (zero-shot) yields consistent improvements.

Highlights & Insights

  • The coarse-to-fine attribution strategy from attention heads to neurons is both elegant and efficient — head-guided conditioning effectively suppresses polysemantic noise.
  • A unified cross-task scoring interface (DVP + IDF weighting) addresses the comparability challenge posed by heterogeneous task outputs.
  • The discovery of VQA as a cross-task "hub" carries significant implications for model understanding.
  • The steering method learns only sparse scaling factors, incurring minimal parameter overhead while remaining transferable to OOD settings.

Limitations & Future Work

  • Experiments are limited to dense models at the 7B scale; validation on larger models or MoE architectures remains to be conducted.
  • The four coarse-grained task categories may obscure sub-task-level differences (e.g., counting vs. spatial reasoning within VQA).
  • Causal analysis requires multiple forward passes, resulting in high computational cost that limits scalability on large datasets.
  • Complementary integration with feature-level methods such as SAEs has not been explored.
Comparison with Related Methods

  • vs. AP/MA/APE (activation-statistics methods): Examining activation magnitude alone cannot disambiguate polysemantic neurons; HONES's head-guided conditioning provides more accurate attribution.
  • vs. QRNCA (gradient-based methods): HONES is gradient-free and more efficient, with faster localization.
  • vs. SAE: HONES operates directly on the original model without additional feature learning, supporting both causal attribution and lightweight steering.
  • vs. MultEdit: MultEdit edits knowledge in MLP blocks, whereas HONES analyzes the neuron-sharing structure across tasks.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The head-guided neuron attribution framework is novel in design, and the unified cross-task scoring interface addresses a practical bottleneck.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four tasks × two models, extensive controlled variants and ablation studies, with OOD validation.
  • Writing Quality: ⭐⭐⭐⭐ The framework is clearly described, with rich findings and insights.
  • Value: ⭐⭐⭐⭐⭐ Significant advancement in VLM interpretability and controllability; the steering method has strong practical value.