DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles¶

Conference: CVPR 2026
arXiv: 2603.01111
Code: GitHub
Area: Multimodal VLM
Keywords: Prompt Learning, VLM Adaptation, Attention Head Role Decomposition, CLIP, Zero-shot Generalization

TL;DR¶

This paper proposes DeAR, which uses a Concept Entropy metric to classify the deep-layer attention heads of a ViT into three functional roles (attribute heads, generalization heads, and mixed heads) and applies a role-based attention mask that controls which tokens each head may attend to. The paper reports the best balance between task adaptation and zero-shot generalization across 15 datasets.

Background & Motivation¶

Core challenge of CLIP adaptation: Pre-trained VLMs must be adapted to downstream tasks, but full fine-tuning causes catastrophic forgetting and sacrifices strong zero-shot generalization.

Limitations of Prior Work — oversimplified layer-level view in prompt learning: Existing methods assume that shallow layers capture general features and deep layers handle task-specific knowledge, but this layer-level perspective overlooks the functional diversity among attention heads within a single layer.

Uncontrolled token interactions: Due to the self-attention mechanism, inserted learnable tokens interact indiscriminately with original tokens, and task-specific knowledge may corrupt the generalization core.

Contradictions in layer-level strategies: MaPLe injects prompts into early layers while MMRL injects into deep layers—conflicting strategies reveal the absence of fine-grained injection principles.

Insights from interpretability research: VLM interpretability studies have found functional specialization among attention heads, providing a theoretical basis for fine-grained control.

Core hypothesis: Functional specialization within VLMs resides not between layers, but among attention heads within deep layers.

Method¶

Overall Architecture¶

DeAR consists of three components: (1) attention head functional role identification based on Concept Entropy; (2) multimodal attribute-aware prompt learning with role-based attention masking; and (3) task-adaptive fusion inference.

Key Designs¶

Concept Entropy-Based Functional Role Classification¶

For each attention head in the last four layers (layers 9–12) of ViT-B/16, TEXTSPAN generates the top-N descriptive texts. These texts are encoded with SBERT and clustered with HDBSCAN to discover concept clusters automatically, yielding five core attribute categories: color, shape, texture, object, and position. Concept Entropy then quantifies how specialized each head is, where \(P_{(l,h)}\) denotes the distribution of head \(h\) in layer \(l\) over the concept clusters \(c_j\):

\[H(P_{(l,h)}) = -\sum_j P_{(l,h)}(c_j) \log_2 P_{(l,h)}(c_j)\]

Low entropy → attribute head (focused on a single attribute); high entropy → generalization head (general-purpose function); intermediate → mixed head.
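The entropy formula and the three-way split above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: it assumes that \(P_{(l,h)}(c_j)\) is estimated as the fraction of a head's descriptive texts that land in cluster \(c_j\), and the `low` and `high` thresholds in `assign_role` are hypothetical values chosen only for the toy data.

```python
import math
from collections import Counter


def concept_entropy(cluster_labels):
    """Shannon entropy (in bits) of a head's concept-cluster distribution.

    cluster_labels: one cluster name per descriptive text of the head.
    Assumption: P(c_j) is the empirical frequency of cluster c_j.
    """
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def assign_role(entropy, low=0.8, high=1.8):
    """Map an entropy value to a role. Thresholds are hypothetical."""
    if entropy <= low:
        return "attribute"
    if entropy >= high:
        return "generalization"
    return "mixed"


# Toy heads: cluster labels of their top-N descriptive texts.
color_head = ["color"] * 9 + ["shape"]
broad_head = ["color", "shape", "texture", "object", "position"] * 2
middle_head = ["color"] * 5 + ["texture"] * 3 + ["object"] * 2

for name, labels in [("color_head", color_head),
                     ("broad_head", broad_head),
                     ("middle_head", middle_head)]:
    h = concept_entropy(labels)
    print(name, round(h, 3), assign_role(h))
```

With five clusters the entropy is bounded by \(\log_2 5 \approx 2.32\) bits, which is the value the uniformly spread `broad_head` reaches.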

Role-Based Attention Mask¶

Generalization heads & other specialist heads: Strict isolation — attribute tokens and original tokens are fully blocked from each other (\(\mathbf{M}[i,j] = -\infty\)), preserving generalization capacity.
Core attribute heads: Corresponding attribute tokens are routed to dedicated expert heads, with other attribute tokens masked out, enabling focused learning.
Mixed heads: All tokens are allowed to interact freely (\(\mathbf{M}[i,j] = 0\)).
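The three masking rules above can be illustrated with an additive attention mask built per head. The sketch below is a simplified reading under stated assumptions, not the paper's implementation: the token layout (original tokens first, attribute tokens last), the function name `build_role_mask`, and the choice to let every blocked attribute token still attend to itself (so that each softmax row keeps a finite entry) are all hypothetical.

```python
NEG_INF = float("-inf")


def build_role_mask(n_orig, attr_names, role, expert_attr=None):
    """Additive attention mask for one head (0.0 = allowed, -inf = blocked).

    Token order (assumed): n_orig original tokens, then one token per
    entry in attr_names.
    role: "mixed", "generalization", or "attribute".
    expert_attr: for an attribute head, the attribute routed to it.
    """
    n = n_orig + len(attr_names)
    if role == "mixed":
        # Mixed heads: every token may attend to every other token.
        return [[0.0] * n for _ in range(n)]
    if role == "attribute":
        if expert_attr not in attr_names:
            raise ValueError("expert_attr must be one of attr_names")
        routed = {n_orig + attr_names.index(expert_attr)}
    elif role == "generalization":
        routed = set()  # strict isolation: no attribute token gets through
    else:
        raise ValueError(f"unknown role: {role}")

    # Tokens that may interact freely in this head.
    open_group = set(range(n_orig)) | routed
    mask = [[NEG_INF] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Assumption: blocked tokens keep self-attention only.
            if i == j or (i in open_group and j in open_group):
                mask[i][j] = 0.0
    return mask


attrs = ["color", "shape", "texture", "object", "position"]
gen_mask = build_role_mask(3, attrs, "generalization")
color_mask = build_role_mask(3, attrs, "attribute", expert_attr="color")
mixed_mask = build_role_mask(3, attrs, "mixed")
```

Such a mask would be added to the attention logits before the softmax, so a `-inf` entry drives the corresponding attention weight to zero.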

Multimodal Attribute Tokens¶

On the visual side, five learnable attribute tokens are injected starting from layer \(J=9\), with a \(\beta\) parameter controlling inter-layer information retention. On the text side, \(K\) learnable tokens are symmetrically injected to ensure cross-modal alignment.
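The summary does not give the exact update rule behind \(\beta\). One common way to realize inter-layer retention is a convex blend between the token carried over from the previous layer and a fresh learnable token for the current layer. The sketch below shows only that assumed form; `blend_tokens` and its arguments are hypothetical names, and the paper's actual rule may differ.

```python
def blend_tokens(carried, fresh, beta):
    """Assumed retention rule: beta * carried + (1 - beta) * fresh.

    carried: token values output by the previous layer.
    fresh:   learnable token values introduced at the current layer.
    beta:    retention weight in [0, 1]; 1 keeps only the carried token.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if len(carried) != len(fresh):
        raise ValueError("token dimensions must match")
    return [beta * c + (1.0 - beta) * f for c, f in zip(carried, fresh)]


carried = [1.0, 2.0]
fresh = [3.0, 4.0]
print(blend_tokens(carried, fresh, 0.5))
```

Under this reading, a larger \(\beta\) keeps more of what the attribute token accumulated in earlier layers, while a smaller \(\beta\) lets each layer's own prompt dominate.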

Loss & Training¶

\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda_{\text{reg}} \mathcal{L}_{\text{reg}} + \lambda_{\text{fusion}} \mathcal{L}_{\text{fusion}}\]

This includes a classification loss, a self-regularization loss (constraining features to remain close to the frozen CLIP representations), and a fusion weight regularization loss (encouraging the primary features to maintain high weights).
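To make the weighted sum concrete, here is a minimal single-sample sketch. Only the three-term structure comes from the formula above. The concrete forms of the regularizers are assumptions for illustration: \(\mathcal{L}_{\text{reg}}\) is taken as a mean absolute difference to the frozen CLIP features, \(\mathcal{L}_{\text{fusion}}\) as a squared penalty on the primary fusion weight falling below 1, and the default \(\lambda\) values are placeholders.

```python
import math


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one sample."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]


def mean_abs_diff(adapted, frozen):
    """Assumed L_reg: mean absolute gap to the frozen CLIP features."""
    return sum(abs(a - f) for a, f in zip(adapted, frozen)) / len(adapted)


def fusion_penalty(primary_weight):
    """Assumed L_fusion: penalize a primary fusion weight below 1."""
    return (1.0 - primary_weight) ** 2


def total_loss(logits, target, adapted, frozen, primary_weight,
               lambda_reg=1.0, lambda_fusion=1.0):
    """L_total = L_CE + lambda_reg * L_reg + lambda_fusion * L_fusion."""
    return (cross_entropy(logits, target)
            + lambda_reg * mean_abs_diff(adapted, frozen)
            + lambda_fusion * fusion_penalty(primary_weight))


# Features identical to frozen CLIP and full primary weight: only CE remains.
baseline = total_loss([0.0, 0.0], 0, [1.0, 2.0], [1.0, 2.0], 1.0)
# Drifting away from the frozen features raises the loss.
drifted = total_loss([0.0, 0.0], 0, [2.0, 2.0], [1.0, 2.0], 1.0)
```

In this toy setup both regularizers vanish when the adapted features match the frozen ones and the primary weight is 1, which mirrors the stated intent of keeping the model close to the pre-trained CLIP behavior.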

Key Experimental Results¶

Main Results: Base-to-Novel Generalization (Average over 11 Datasets)¶

Method	Base Acc	Novel Acc	HM
CLIP	69.34	74.22	71.70
CoOp	82.69	63.22	71.66
MaPLe	82.28	75.14	78.55
PromptSRC	84.26	76.10	79.97
DeAR (Ours)	84.50+	77.00+	80.60+
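The HM column is the harmonic mean of Base and Novel accuracy, \(\text{HM} = 2 \cdot \text{Base} \cdot \text{Novel} / (\text{Base} + \text{Novel})\), the standard summary metric in base-to-novel evaluation. The short check below recomputes it for the four baseline rows of the table. The DeAR row is left out because its values are marked with "+" and are therefore not exact.

```python
def harmonic_mean(base, novel):
    """Harmonic mean of base-class and novel-class accuracy."""
    return 2 * base * novel / (base + novel)


# (Base Acc, Novel Acc) copied from the table above.
rows = {
    "CLIP": (69.34, 74.22),
    "CoOp": (82.69, 63.22),
    "MaPLe": (82.28, 75.14),
    "PromptSRC": (84.26, 76.10),
}

for name, (base, novel) in rows.items():
    print(f"{name}: {harmonic_mean(base, novel):.2f}")
# CLIP: 71.70
# CoOp: 71.66
# MaPLe: 78.55
# PromptSRC: 79.97
```

Because the harmonic mean is pulled toward the smaller of its two inputs, CoOp's large Base gain does not raise its HM above zero-shot CLIP: its Novel accuracy drops by about 11 points.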

Ablation Study¶

Ablation setting	Observed effect
Remove Role-Based Mask	Significant drop in Novel accuracy
Remove attribute tokens	Drop in both Base and Novel accuracy
Remove fusion regularization	Over-reliance on attribute features
Apply mask to generalization heads only	Effectively protects generalization

Key Findings¶

Attribute-conditioned image retrieval validates that attribute tokens genuinely capture corresponding semantic concepts (e.g., color retrieval returns images of the same color).
Comprehensive validation across 15 datasets, including domain generalization and cross-dataset transfer settings.
The method substantially improves Novel class generalization while maintaining Base performance.

Highlights & Insights¶

Concept Entropy is proposed to quantify attention head functional specialization from a data-driven perspective, avoiding subjective categorization.
The Role-Based Attention Mask controls information flow at the level of individual heads and token groups, a "surgical-level" precision that layer-level prompt injection does not offer.
Attribute-conditioned retrieval experiments intuitively validate the effectiveness of the design.
The work combines theoretical innovation (head-level functional decomposition) with engineering practicality (plug-and-play usability).

Limitations & Future Work¶

The analysis targets ViT-B/16 only; generalizability to other architectures (e.g., ViT-L/14) remains to be verified.
The five attribute categories are manually selected; different tasks may require different attribute definitions.
The role-based attention masking introduces additional computational overhead at inference time.
Validation is limited to classification tasks; extension to detection and segmentation has not been explored.

Related Work & Insights¶

Compared to multimodal prompt learning methods such as MaPLe and MMRL, DeAR is the first to introduce head-level functional analysis.
There is conceptual overlap with the attribute structure in ATPrompt, but DeAR achieves finer-grained control through attention masking.
Skip Tuning is conceptually related but operates at a different granularity (layer-level vs. head-level).
The analysis of VLM internal mechanisms provides a new perspective for subsequent interpretability research.

Rating¶

Novelty: ⭐⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐⭐
Value: ⭐⭐⭐⭐