From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models

Conference: ACL 2026 | arXiv: 2604.17941 | Code: github | Area: Multimodal VLM | Keywords: neuron attribution, causal analysis, multi-task VLM, attention heads, model interpretability

TL;DR

This paper proposes HONES, a framework that first localizes task-critical attention heads and then uses them as conditions to guide FFN neuron attribution. This yields a unified, gradient-free, neuron-level causal analysis and enables lightweight task performance improvement across heterogeneous tasks in multi-task VLMs.

Background & Motivation

Background: Large vision-language models (VLMs) achieve strong performance across diverse tasks such as VQA, OCR, and image captioning, yet their internal decision-making processes remain opaque — multiple capabilities are entangled within shared parameters, hindering error attribution and controllable deployment. Neuron-level analysis can provide fine-grained, actionable insights.

Limitations of Prior Work: (1) Existing neuron analysis methods focus primarily on single-task settings and cannot compare neuron importance across heterogeneous tasks (e.g., question answering vs. image-text matching). (2) Most approaches score neurons independently, ignoring the task-dependent routing effect of attention heads, which inflates the importance scores of polysemantic neurons.

Key Challenge: How can task-critical neurons be accurately identified within a parameter space shared across different tasks, while avoiding the noise introduced by polysemanticity?

Goal: To design a unified cross-task neuron attribution framework and leverage the identified critical neurons for lightweight task performance improvement.

Key Insight: The framework adopts a structured causal view of the Transformer: attention heads select and route task-relevant inputs, while FFN neurons write the routed information into the residual stream. Accordingly, HONES first localizes the routing nodes (attention heads), then attributes FFN neurons conditioned on them.

Core Idea: The task importance of a neuron should be measured by its "write contribution" along the routing path of task-critical attention heads, rather than by simple activation magnitude.

Method

Overall Architecture

HONES operates in two stages. Discovery stage: (1) Task-critical attention heads \(\mathcal{H}_t^*\) are localized via mean-ablation intervention; (2) conditioned on the identified attention heads, the causal write contribution of each FFN neuron to the task objective is measured via Direct Vocabulary Projection (DVP), and Top-K neurons are selected. Steering stage: The backbone is frozen, and sparse scaling factors are learned exclusively on the critical neurons, with KL regularization enabling controllable task improvement.
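The stage-1 mean-ablation intervention can be sketched in a few lines. This is a minimal illustrative sketch, assuming per-head outputs are stacked as an (H, d) array; the helper names `mean_ablate` and `head_importance` are assumptions, not the paper's code:

```python
import numpy as np

def mean_ablate(head_outputs: np.ndarray, h: int) -> np.ndarray:
    """Replace head h's output with the mean of the remaining H-1 heads.

    head_outputs: (H, d) array of per-head outputs at one layer/position.
    """
    ablated = head_outputs.copy()
    others = np.delete(head_outputs, h, axis=0)  # the H-1 untouched heads
    ablated[h] = others.mean(axis=0)
    return ablated

def head_importance(perf_clean: float, perf_ablated: float) -> float:
    """S_t(h): task performance drop caused by ablating head h."""
    return perf_clean - perf_ablated
```

Running the task once per head with `mean_ablate` patched in and ranking heads by `head_importance` would yield the Top-\(K_h\) set \(\mathcal{H}_t^*\).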

Key Designs

  1. Causal Attention Head Localization:

    • Function: Identifies task-critical "routing nodes" to constrain the downstream neuron search space.
    • Mechanism: Mean ablation intervention is applied — the output of a target head is replaced by the mean of the remaining \(H-1\) heads' outputs, and the resulting task performance drop \(S_t(h)\) is measured. The Top-\(K_h\) heads are selected to form \(\mathcal{H}_t^*\).
    • Design Motivation: Mean ablation reduces out-of-distribution artifacts compared to zero ablation; localizing routing nodes first effectively isolates valid computation paths.
  2. Head-Guided Neuron Attribution (Causal Write Effect):

    • Function: Scores each FFN neuron conditioned on the task routing context.
    • Mechanism: For each neuron \((l,i)\), the vector \(\Delta \mathbf{r}_i^{(l)}\) written into the residual stream via the down-projection is computed, then projected onto the unembedding vector of the target token via DVP to obtain a write contribution \(c_{l,i}\). Interventions are then applied to each critical head, and the contribution drop \(\Delta c\) before and after intervention is aggregated, weighted by head importance, into a final score \(I_{l,i}\).
    • Design Motivation: Activation-based independent scoring is easily confounded by polysemanticity; head-guided conditioning ensures that only contributions along the task routing path are counted.
  3. Lightweight Neuron Steering:

    • Function: Enhances task performance by modulating the activations of critical neurons.
    • Mechanism: All backbone parameters are frozen, and a scaling factor \(\lambda_{l,i}\) is learned for each critical neuron. The optimization objective includes a task loss and a KL divergence regularization term: \(\min_{\lambda_t} \mathcal{L}_t + \beta \text{KL}(p_\theta \| p_{\theta_{\lambda_t}})\).
    • Design Motivation: KL regularization prevents excessive deviation from the original model behavior; learning only sparse scaling factors incurs minimal parameter overhead.
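The three designs above reduce to small vector operations. The sketch below is illustrative only: the names `write_contribution`, `head_guided_score`, and `steering_loss`, and the toy discrete KL term, are assumptions rather than the paper's implementation:

```python
import numpy as np

def write_contribution(w_down_col, activation, u_target):
    """c_{l,i}: DVP score -- project the neuron's residual-stream write
    (its activation times its down-projection column) onto the target
    token's unembedding vector."""
    delta_r = activation * np.asarray(w_down_col)
    return float(delta_r @ np.asarray(u_target))

def head_guided_score(c_clean, c_ablated_per_head, head_weights):
    """I_{l,i}: contribution drops under each critical-head intervention,
    aggregated with head-importance weights."""
    drops = c_clean - np.asarray(c_ablated_per_head)
    return float(np.asarray(head_weights) @ drops)

def steering_loss(task_loss, p_base, p_steered, beta):
    """Stage-2 objective: task loss plus beta * KL(p_base || p_steered),
    shown here for toy discrete distributions."""
    kl = float(np.sum(p_base * np.log(p_base / p_steered)))
    return task_loss + beta * kl
```

The Top-K neurons by `head_guided_score` would form the critical set, and only their scaling factors \(\lambda_{l,i}\) are then trained under the `steering_loss` objective.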

Loss & Training

The discovery stage uses a discovery split of 7K images; the steering stage uses a dev split of 2K images to learn scaling factors, with 3K images reserved for testing. For open-ended objectives (e.g., captioning), IDF-weighted aggregation of token unembedding vectors is applied.
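For open-ended objectives, the IDF-weighted aggregation mentioned above can be sketched as follows. The vectors and weights are toy values and `idf_weighted_target` is an illustrative name, not the paper's API:

```python
import numpy as np

def idf_weighted_target(unembed_vecs, idf_weights):
    """Aggregate token unembedding vectors into a single DVP target,
    down-weighting frequent (low-IDF) tokens."""
    w = np.asarray(idf_weights, dtype=float)   # (T,) IDF weight per token
    U = np.asarray(unembed_vecs, dtype=float)  # (T, d) unembedding vectors
    return (w[:, None] * U).sum(axis=0) / w.sum()
```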

Key Experimental Results

Main Results (Performance Drop % after Masking Top-1% Neurons)

Method   VQA     OCR     Caption  Retrieval  Avg.
AP       11.33   10.40    8.65     0.50       7.72
MA        6.82   15.50   11.90     1.35       8.89
APE       3.20   -1.87   12.20     0.90       3.61
HONES    27.30   19.00   19.80     7.43      18.38

Steering Results (LLaVA-1.5-7B)

Method        VQA    OCR    Caption  Retrieval  Avg.
Base          0.652  0.576  0.129    0.933      0.572
Grid-Search   0.666  0.594  0.132    0.956      0.587
HONES         0.673  0.602  0.141    0.963      0.595

Key Findings

  • HONES consistently outperforms activation-based methods across all tasks and both VLMs, achieving average performance drops of 18.38% (LLaVA) and 21.91% (Qwen).
  • Critical neurons exhibit task-dependent layer preferences: retrieval tasks concentrate in middle layers (visual-text alignment), while other tasks favor deeper layers (answer decoding).
  • VQA shared neurons show the highest cross-task salience, exhibiting a "hub" effect — VQA-related neurons underpin a broad range of visual-language tasks.
  • In OOD experiments, directly transferring in-domain learned scaling factors (zero-shot) yields consistent improvements.

Highlights & Insights

  • The coarse-to-fine attribution strategy from attention heads to neurons is both elegant and efficient — head-guided conditioning effectively suppresses polysemantic noise.
  • A unified cross-task scoring interface (DVP + IDF weighting) addresses the comparability challenge posed by heterogeneous task outputs.
  • The discovery of VQA as a cross-task "hub" carries significant implications for model understanding.
  • The steering method learns only sparse scaling factors, incurring minimal parameter overhead while remaining transferable to OOD settings.

Limitations & Future Work

  • Experiments are limited to dense models at the 7B scale; validation on larger models or MoE architectures remains to be conducted.
  • The four coarse-grained task categories may obscure sub-task-level differences (e.g., counting vs. spatial reasoning within VQA).
  • Causal analysis requires multiple forward passes, resulting in high computational cost that limits scalability on large datasets.
  • Complementary integration with feature-level methods such as SAEs has not been explored.
Comparison with Related Methods

  • vs. AP/MA/APE (activation-statistics methods): Examining activation magnitude alone cannot disambiguate polysemantic neurons; HONES's head-guided conditioning provides more accurate attribution.
  • vs. QRNCA (gradient-based methods): HONES is gradient-free and more efficient, with faster localization.
  • vs. SAE: HONES operates directly on the original model without additional feature learning, supporting both causal attribution and lightweight steering.
  • vs. MultEdit: MultEdit edits knowledge in MLP blocks, whereas HONES analyzes the neuron-sharing structure across tasks.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The head-guided neuron attribution framework is novel in design, and the unified cross-task scoring interface addresses a practical bottleneck.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four tasks × two models, extensive controlled variants and ablation studies, with OOD validation.
  • Writing Quality: ⭐⭐⭐⭐ The framework is clearly described, with rich findings and insights.
  • Value: ⭐⭐⭐⭐⭐ Significant advancement in VLM interpretability and controllability; the steering method has strong practical value.