
RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models (Oral)

Conference: AAAI 2026
arXiv: 2512.06811
Code: Not mentioned
Area: Vision-Language Model / Parameter-Efficient Fine-Tuning
Keywords: CLIP, Adapter, Reconstruction, few-shot learning, Vision-Language Model

TL;DR

This paper proposes RMAdapter, a dual-branch adapter architecture that augments the standard adaptation branch with a reconstruction branch (analogous to an AutoEncoder). By sharing the down-projection layer and applying per-layer local reconstruction losses, RMAdapter balances task-specific adaptation against general knowledge retention in few-shot CLIP fine-tuning, and it outperforms state-of-the-art methods (including prompt-based approaches) on three benchmarks: base-to-novel generalization, cross-dataset transfer, and domain generalization.

Background & Motivation

Pre-trained VLMs (e.g., CLIP) face a fundamental tension in few-shot downstream adaptation — the adaptation-generalization trade-off:

Prompt Learning: The CoOp → CoCoOp → MaPLe → PromptSRC → CoPrompt line has advanced rapidly, yet it lacks an explicit knowledge-retention mechanism: learned prompts are highly discriminative on seen classes but strongly biased against unseen ones.

Underexplored Adapter Direction: Compared to prompt-based methods, adapter approaches remain significantly underexplored. Existing adapters (e.g., MMA) employ only a single branch focused on adaptation, lacking structured designs to govern the balance between discriminability and generalizability.

Key Observation — Structural Isomorphism Between Adapters and AutoEncoders: The down-projection → up-projection structure of an adapter is isomorphic to the encoder → decoder structure of an AE, naturally motivating the addition of a reconstruction branch to constrain the feature space from drifting away from the original distribution.

Method

Overall Architecture

RMAdapter inserts dual-branch adapters into the upper layers (last \(k\) layers) of both the visual and text encoders of CLIP. The entire CLIP backbone is frozen; only adapter parameters are trained. The two branches share the down-projection layer and perform task adaptation and feature reconstruction respectively, with residual connections fusing their outputs with the original CLIP representations.
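
To make the wiring concrete, here is a minimal PyTorch sketch of how such adapters might be attached. The attribute names (`visual_blocks`, `text_blocks`) and the `make_adapter` factory are hypothetical placeholders, since no code is released; they only illustrate "freeze everything, train adapters in the last \(k\) layers".

```python
import torch.nn as nn

def attach_adapters(clip_model: nn.Module, make_adapter, k: int = 3) -> nn.Module:
    """Freeze the CLIP backbone and register a dual-branch adapter on the
    last k transformer blocks of both encoders. Attribute names are
    illustrative and depend on the CLIP implementation actually used."""
    for p in clip_model.parameters():
        p.requires_grad = False                     # backbone stays frozen
    for blocks in (clip_model.visual_blocks, clip_model.text_blocks):
        for block in blocks[-k:]:                   # upper k layers only
            block.adapter = make_adapter()          # only adapter params are trained
    return clip_model
```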

Key Designs

  1. Adaptation Branch (RMAdapter_base): Standard adapter structure, \(x_{down} = \sigma(x W_{down} + b_{down})\), \(\text{output} = x_{down} W_{up}^{base} + b_{up}^{base}\), injecting task-specific knowledge.
  2. Reconstruction Branch (RMAdapter_rec): A two-layer up-projection structure, \(\text{output} = \sigma(x_{down} W_{up1}^{rec} + b_{up1}^{rec}) W_{up2}^{rec} + b_{up2}^{rec}\), reconstructing the latent representation back to the original feature space; an L2 reconstruction loss constrains general knowledge retention.
  3. Shared Down-Projection: Both branches share \(W_{down}\), enabling a Pareto-optimal adaptation-reconstruction trade-off. The shared projection forces both branches to operate in the same low-rank space: the adaptation branch learns task-relevant features while the reconstruction branch ensures this space does not deviate from the original distribution — a natural mutual regularization (see the code sketch after this list).
  4. Per-Layer Local Reconstruction Loss: \(\mathcal{L}_{rec}^V = \sum_{i=k}^K \|[c_i, E_i] - \text{RMAdapter}_{rec}([c_i, E_i])\|^2\), computing L2 loss independently at each layer without cross-layer backpropagation, yielding computational efficiency.
  5. Consistency Constraint: \(\mathcal{L}_{con} = \lambda_3 \|x^a - x\|_1 + \lambda_4 \|w^a - w\|_1\), constraining the L1 distance between adapted features and original CLIP features.
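
The sketch below shows designs 1–4 in PyTorch: the shared down-projection feeding both branches, the two-layer reconstruction decoder, and a locally computed L2 reconstruction term. The feature width, bottleneck size, GELU activation, and residual fusion details are assumptions for illustration, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class RMAdapterSketch(nn.Module):
    """Dual-branch adapter sketch: shared down-projection, an adaptation
    branch, and an AE-style reconstruction branch with a local L2 loss."""

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)          # shared down-projection
        self.act = nn.GELU()
        self.up_base = nn.Linear(bottleneck, dim)       # adaptation branch (single up-projection)
        self.up_rec = nn.Sequential(                    # reconstruction branch (two-layer decoder)
            nn.Linear(bottleneck, bottleneck), nn.GELU(),
            nn.Linear(bottleneck, dim),
        )

    def forward(self, x: torch.Tensor):
        z = self.act(self.down(x))                      # shared low-rank code
        adapted = self.up_base(z)                       # task-specific signal
        recon = self.up_rec(z)                          # reconstruction of the input features
        rec_loss = (recon - x).pow(2).mean()            # per-layer local L2 reconstruction loss
        return x + adapted, rec_loss                    # residual fusion + local loss term
```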

Loss & Training

\[\mathcal{L} = \mathcal{L}_{ce} + \mathcal{L}_{con} + \mathcal{L}_{rec}\]

Cross-entropy (classification supervision) + consistency constraint (preventing deviation from original features) + reconstruction loss (preserving general knowledge). L2, L1, and cosine reconstruction objectives are evaluated; L2 proves most stable.
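
A hedged sketch of the combined objective follows; the lambda weights and the mean-reduced L1/L2 terms are illustrative defaults rather than the paper's reported hyperparameters.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, labels, x_a, x0, w_a, w0, rec_losses,
               lam3: float = 1.0, lam4: float = 1.0):
    """Combine L_ce + L_con + L_rec for one training step.
    x_a / w_a: adapted image / text features; x0 / w0: frozen CLIP features;
    rec_losses: per-layer reconstruction losses returned by the adapters."""
    l_ce = F.cross_entropy(logits, labels)              # classification supervision
    l_con = lam3 * (x_a - x0).abs().mean() \
          + lam4 * (w_a - w0).abs().mean()              # L1 consistency to frozen CLIP features
    l_rec = torch.stack(rec_losses).sum()               # summed per-layer L2 terms
    return l_ce + l_con + l_rec
```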

Key Experimental Results

Main Results: Base-to-Novel Generalization (Average over 11 Datasets)

| Method | Type | Base Acc | Novel Acc | HM |
| --- | --- | --- | --- | --- |
| CLIP (zero-shot) | | 69.34 | 74.22 | 71.70 |
| CoOp | Prompt | 82.69 | 63.22 | 71.66 |
| CoCoOp | Prompt | 80.47 | 71.69 | 75.83 |
| MaPLe | Prompt | 82.28 | 75.14 | 78.55 |
| PromptSRC | Prompt | 84.26 | 76.10 | 79.97 |
| MMA | Adapter | 83.20 | 76.80 | 79.87 |
| CoPrompt | Prompt | 84.00 | 77.23 | 80.48 |
| RMAdapter | Adapter | 84.52 | 77.36 | 80.62 |

Ablation Study: Contribution of Key Designs to HM

| Configuration | HM | Note |
| --- | --- | --- |
| Single-branch adapter (MMA) | 79.87 | No reconstruction constraint |
| + Reconstruction branch (non-shared) | ~80.1 | Independent parameters, limited gain |
| + Shared down-projection | ~80.4 | Pareto-optimal trade-off |
| + Consistency constraint | 80.62 | Full model |
| Reconstruction branch: 1-layer up-projection | Slightly lower | Insufficient capacity |
| Reconstruction branch: 2-layer up-projection | Optimal | Sweet spot |
| Reconstruction branch: 3-layer up-projection | Degraded | Few-shot overfitting |

Key Findings

  • Cross-dataset generalization (average over 10 datasets): RMAdapter 67.56% vs. CoPrompt 67.00% vs. MMA 66.61%
  • Domain generalization (average over 4 ImageNet variants): RMAdapter 60.71% vs. PromptSRC 60.65% vs. CoPrompt 60.42%
  • The reconstruction branch adds only ~320K parameters, +3% GPU memory and +5% training time, yet yields significant performance gains
  • An adapter-based method surpasses prompt-based methods across all three benchmarks for the first time, suggesting that the adapter direction has been substantially underestimated

Highlights & Insights

  • Structural Analogy Between Adapters and AutoEncoders: A particularly elegant observation — the adapter's down-projection → up-projection is isomorphic to an AE's encoder → decoder, making the addition of a reconstruction branch both natural and principled. This approach of "discovering new connections within existing structures" is methodologically instructive.
  • Intuition Behind Shared Down-Projection: Both branches operate in the same low-rank space; the adaptation branch learns task-specific features while the reconstruction branch ensures the low-rank space does not drift — natural mutual regularization achieving a Pareto-optimal trade-off.
  • No Reliance on Data Augmentation or Complex Prompt Engineering: The approach is simpler than methods such as CoPrompt.
  • Reconstruction as Regularization: The reconstruction objective serves as a knowledge retention mechanism, conceptually similar to knowledge distillation but more lightweight — no teacher model forward pass is required.

Limitations & Future Work

  • Experiments are conducted on ViT-B/16 CLIP only; ViT-L, SigLIP, and EVA-CLIP variants are not evaluated.
  • Validation is limited to classification tasks; extension to detection, segmentation, and other downstream tasks is not explored.
  • The reconstruction branch uses a simple MSE objective; more structured retention strategies (e.g., feature direction preservation, subspace projection) could be explored.
  • Integration with other PEFT methods such as LoRA and Prefix Tuning is not discussed.

Method Comparison

| Method Category | Representative Methods | Core Mechanism | Knowledge Retention Strategy |
| --- | --- | --- | --- |
| Prompt Learning | CoOp, CoCoOp, MaPLe | Learnable prompt tokens | Implicit (no explicit design) |
| Regularized Prompt | PromptSRC, KgCoOp | Prompt + constraints | Self-regularization, text embedding distance |
| Hybrid Prompt + Adapter | CoPrompt | Prompt + adapter | Consistency constraint |
| Single-branch Adapter | CLIP-Adapter, MMA | Adaptation branch only | None |
| Dual-branch Adapter (Ours) | RMAdapter | Adaptation + reconstruction branch | Explicit reconstruction loss + shared down-projection |

Rating

  • Novelty: ⭐⭐⭐⭐ Elegant AE-Adapter analogy; dual-branch design is principled and clean
  • Experimental Thoroughness: ⭐⭐⭐⭐ 11-dataset + cross-dataset + domain generalization + comprehensive ablation
  • Writing Quality: ⭐⭐⭐⭐ Problem motivation and methodological derivation are logically coherent
  • Value: ⭐⭐⭐⭐ Demonstrates the underestimation of the adapter direction; provides a generalizable PEFT design paradigm