Efficient and Versatile Robust Fine-Tuning of Zero-shot Models
Conference: ECCV 2024
arXiv: 2408.05749
Code: Yes (cvlab.postech.ac.kr/research/R-Adapter)
Area: Image Segmentation / Vision-Language Model Fine-Tuning
Keywords: Robust Fine-Tuning, Parameter-Efficient Fine-Tuning, Self-Ensemble, CLIP, Out-of-Distribution Generalization
TL;DR
R-Adapter introduces a lightweight adapter module into the CLIP model along with three self-ensemble strategies (Adapter Dropping, Weight Accumulation, and Weight-Scaling Re-parameterization). While fine-tuning only 13% of the parameters, it simultaneously achieves high ID accuracy and strong OOD robustness, and extends robust fine-tuning to cross-modal retrieval and open-vocabulary segmentation tasks for the first time.
Background & Motivation
Key Challenge
Large-scale vision-language pre-trained models (such as CLIP) exhibit strong zero-shot classification and cross-distribution generalization. However, fine-tuning them on downstream tasks faces two conflicting challenges:
- Robustness Degradation: Full fine-tuning disrupts pre-trained knowledge, leading to a substantial drop in out-of-distribution (OOD) accuracy. For example, on CLIP ViT-B/16, full fine-tuning raises ImageNet (ID) accuracy from 68.3% to 80.7% (+12.4 points) but lowers the average OOD accuracy from the zero-shot 58.4% to 52.8% (-5.6 points).
- Prohibitive Computational Cost: Full fine-tuning requires updating all parameters (e.g., 305M parameters in ViT-L/14), so memory and storage overhead grow with model scale.
Limitations of Prior Work
| Method Type (Representatives) | Approach | Pros | Cons |
|---|---|---|---|
| Robust Fine-Tuning (WiSE-FT, Mask-Fill) | Weight-space interpolation / regularization | OOD robust | Requires full fine-tuning (80M+ parameters); limited to classification tasks |
| Parameter-Efficient Fine-Tuning, PEFT (AdaptFormer, MaPLe) | Fine-tuning only a few parameters | Efficient | Poor OOD robustness; performance drops sharply under distribution shifts |
Key Insight: No existing method addresses both challenges at once. Robust fine-tuning methods are inefficient, while PEFT methods lack robustness. The core idea of R-Adapter is to introduce self-ensembling into the PEFT framework, obtaining the robustness benefits of weight-space ensembling at low cost.
Inspiration from Ensembling
Weight-space ensembling (such as WiSE-FT, which linearly interpolates pre-trained and fine-tuned weights) has been shown to improve OOD generalization. However, traditional methods require storing two complete models. The novelty of R-Adapter lies in compressing the weight-space ensemble into a single model via adapter re-parameterization, requiring only the storage of the adapter weights (approx. 13% of the total parameters).
Method
Overall Architecture
R-Adapter is built on the Vision Transformer and Text Transformer of CLIP:
- A lightweight adapter module is inserted after the MHA and FFN of each Transformer layer.
- All pre-trained parameters are frozen, and only the adapter weights are trained.
- Three self-ensemble strategies are combined to enhance OOD robustness.
- The MPM-NCE loss replaces the standard InfoNCE loss.
Key Designs
1. Adapter Module Design
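A simplified version of the Houlsby Adapter (without non-linear layers and bias terms) is adopted, structured as a residual linear transformation:
\[ h(X) = X W_{\text{adp}} + X \]
where \(X\) is the output of the preceding pre-trained block and \(W_{\text{adp}} \in \mathbb{R}^{d \times d}\) is the adapter weight matrix. With the full training set, a full-rank \(W_{\text{adp}}\) is used, while for few-shot learning, a low-rank decomposition \(W_{\text{adp}} = BA\) (\(B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times d}, r \ll d\)) can be used.
Since the adapter has no non-linear layers, it can be merged through re-parameterization with the preceding pre-trained layer (weight \(W_{\text{org}}\), bias \(b_{\text{org}}\), input \(X_{\text{in}}\)), so inference incurs no extra computational overhead:
\[ h(X_{\text{in}} W_{\text{org}} + b_{\text{org}}) = X_{\text{in}} W_{\text{org}} (W_{\text{adp}} + I) + b_{\text{org}} (W_{\text{adp}} + I) = X_{\text{in}} W_{\text{rep}} + b_{\text{rep}} \]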
Design Motivation: The linear adapter enables re-parameterization, which is the cornerstone for the subsequent weight-space ensemble—eliminating the need to store two separate models.
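The merge can be checked numerically. The following is a minimal NumPy sketch, assuming the adapter computes `X @ W_adp + X` and follows a frozen linear layer with weight `W_org` and bias `b_org`; the sizes and variable names are illustrative and not taken from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d, n = 6, 4, 3                      # hypothetical sizes

W_org = rng.normal(size=(d_in, d))        # frozen pre-trained projection
b_org = rng.normal(size=d)                # frozen pre-trained bias
W_adp = 0.01 * rng.normal(size=(d, d))    # trainable adapter weight
X_in = rng.normal(size=(n, d_in))

def adapter(X, W_adp):
    """Residual linear adapter: h(X) = X W_adp + X."""
    return X @ W_adp + X

# Training-time path: frozen layer followed by the adapter.
out_two_step = adapter(X_in @ W_org + b_org, W_adp)

# Inference-time path: fold the adapter into the frozen layer.
I = np.eye(d)
W_rep = W_org @ (W_adp + I)
b_rep = b_org @ (W_adp + I)
out_merged = X_in @ W_rep + b_rep

merge_matches = np.allclose(out_two_step, out_merged)
```

Because `W_rep` has the same shape as `W_org`, the merged layer costs the same at inference as the original one.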
2. Three Self-Ensemble Strategies
(a) Dynamic Ensemble by Adapter Dropping
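During training, each adapter module is randomly deactivated with probability \(p\), so the adapter output \(h(X)\) becomes:
\[ h(X) = \frac{\gamma}{1-p} \cdot X W_{\text{adp}} + X, \qquad \gamma \sim \text{Bernoulli}(1-p) \]
The \(1/(1-p)\) factor keeps the expected output equal to that of the undropped adapter.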
No dropping is applied during inference. This is equivalent to implicitly ensembling multiple subnetworks (with different combinations of active/inactive adapters).
Fundamental Difference from Dropout/Drop-path: Dropout creates feature sparsity and Drop-path reduces effective model depth, whereas Adapter Dropping randomly switches each adapted block between its pre-trained features and its adapted features, so the pre-trained path is always preserved. In the ablation, AD alone improves the OOD average by 1.9 points (E3 vs. E0).
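A minimal sketch of Adapter Dropping for a single adapter is given below. It assumes one Bernoulli draw per forward call and inverted-dropout-style rescaling of the adapter branch; the drop granularity and the function name `adapter_with_dropping` are illustrative assumptions.

```python
import numpy as np

def adapter_with_dropping(X, W_adp, p, rng, training=True):
    """Residual linear adapter with whole-adapter dropping (illustrative).

    Assumes one Bernoulli draw per call and inverted-dropout-style rescaling.
    """
    if not training:
        return X @ W_adp + X              # inference: adapter always active
    if rng.random() < p:                  # dropped with probability p
        return X.copy()                   # fall back to pre-trained features
    return (X @ W_adp) / (1.0 - p) + X    # kept: rescale the adapter branch

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))
W_adp = 0.1 * rng.normal(size=(4, 4))

full = X @ W_adp + X                      # undropped adapter output
outs = [adapter_with_dropping(X, W_adp, p=0.2, rng=rng) for _ in range(2000)]
mean_out = np.mean(outs, axis=0)          # averages out close to `full`
```

Every training-time output is either the pure pre-trained feature `X` or the rescaled adapted feature, which is the switching behaviour that distinguishes this from element-wise Dropout.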
(b) Temporal Ensemble by Accumulation
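The adapter weights from successive training iterations are accumulated with an exponential moving average (EMA) of momentum \(m\):
\[ \tilde{W}_{\text{adp}} \leftarrow m \cdot \tilde{W}_{\text{adp}} + (1 - m) \cdot W_{\text{adp}} \]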
The accumulated weights \(\tilde{W}_{\text{adp}}\) are used during inference. This represents an ensemble in the temporal dimension, capturing the combined information of all models during training at almost zero extra cost.
Design Motivation: Performing EMA only on the adapter parameters (rather than the entire model) is extremely memory-efficient.
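The accumulation step can be sketched as follows. Initializing the accumulated copy from the initial adapter weights, and the random perturbation standing in for an optimizer step, are assumptions made for illustration only.

```python
import numpy as np

def ema_update(W_acc, W_adp, m=0.999):
    """One accumulation step: W_acc <- m * W_acc + (1 - m) * W_adp."""
    return m * W_acc + (1.0 - m) * W_adp

rng = np.random.default_rng(0)
d = 4
W_adp = np.zeros((d, d))        # live adapter weights (updated by the optimizer)
W_acc = W_adp.copy()            # accumulated copy, used at inference

for step in range(100):
    # Stand-in for an optimizer step on the adapter weights.
    W_adp = W_adp + 0.01 * rng.normal(size=(d, d))
    W_acc = ema_update(W_acc, W_adp)
```

Only one extra `d x d` matrix per adapter is kept, which is why the overhead stays small compared with keeping an EMA copy of the whole backbone.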
(c) Weight-space Ensemble by Re-scaling
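Linear interpolation between pre-trained and fine-tuned weights is obtained by scaling the adapter weights with a coefficient \(\alpha\) before re-parameterization:
\[ W_{\text{rep}} = W_{\text{org}} (\alpha \tilde{W}_{\text{adp}} + I) = \alpha\, W_{\text{org}} (\tilde{W}_{\text{adp}} + I) + (1 - \alpha)\, W_{\text{org}} = \alpha W_{\text{ft}} + (1 - \alpha) W_{\text{org}} \]
where \(W_{\text{org}}\) is the frozen pre-trained weight and \(W_{\text{ft}} = W_{\text{org}} (\tilde{W}_{\text{adp}} + I)\) is the fully adapted weight.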
Core Advantage: While traditional WiSE-FT requires storing two full models to perform weight interpolation, R-Adapter only needs to store the adapter weights, achieving an equivalent weight-space ensemble within a single model through re-parameterization. \(\alpha\) controls the degree of pre-trained knowledge preservation (using RS alone boosts OOD by 5.8%, E5 vs E0).
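The equivalence between scaling the adapter and interpolating full weight matrices can be verified numerically. This sketch assumes a residual linear adapter merged as `W_org @ (W_adp + I)`; the matrices are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d = 6, 4                               # hypothetical sizes
W_org = rng.normal(size=(d_in, d))           # frozen pre-trained weight
W_adp = 0.1 * rng.normal(size=(d, d))        # (accumulated) adapter weight
I = np.eye(d)
alpha = 0.5

# Scaling only the adapter, then merging it into the frozen layer.
W_rescaled = W_org @ (alpha * W_adp + I)

# WiSE-FT-style interpolation between fully adapted and pre-trained weights.
W_ft = W_org @ (W_adp + I)
W_interp = alpha * W_ft + (1.0 - alpha) * W_org

rescale_matches = np.allclose(W_rescaled, W_interp)
```

Setting `alpha = 0` recovers the pre-trained weight and `alpha = 1` the fully adapted one, so a single stored adapter covers the whole interpolation path.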
3. MPM-NCE Loss Function
Designed for downstream vision-language tasks, addressing two issues in the standard InfoNCE:
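(a) Multi-Positive Soft Labels: In classification tasks, multiple text templates correspond to the same category, but InfoNCE treats only a single pair as positive, which causes semantic conflict. MPM-NCE addresses this with soft label assignment:
\[ \tilde{y}_{ij} = \frac{(1 - \epsilon)\, y_{ij}}{|P(i)|} + \frac{\epsilon\, (1 - y_{ij})}{B - |P(i)|} \]
where \(y_{ij} \in \{0, 1\}\) indicates whether samples \(i\) and \(j\) form a positive pair, \(B\) is the batch size, \(P(i)\) is the set of positive samples for sample \(i\), and \(\epsilon\) is the label smoothing noise.
(b) Angular Margin: An angular margin \(\delta\) is applied to negative pairs to enhance discriminative power:
\[ \mathcal{L} = -\sum_{i,j=1}^{B} \left( \tilde{y}_{ij} \log \frac{\exp\big((f_i \cdot g_j + \delta_{ij}) / \tau\big)}{\sum_{k=1}^{B} \exp\big((f_i \cdot g_k + \delta_{ik}) / \tau\big)} + \tilde{y}_{ji} \log \frac{\exp\big((f_j \cdot g_i + \delta_{ji}) / \tau\big)}{\sum_{k=1}^{B} \exp\big((f_k \cdot g_i + \delta_{ki}) / \tau\big)} \right) \]
where \(f_i\) and \(g_j\) are the normalized image and text embeddings, \(\tilde{y}_{ij}\) is the soft label from (a), \(\delta_{ij} = 0\) (for positive pairs) or \(\delta\) (for negative pairs), and the temperature is \(\tau = 0.01\).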
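The two ingredients can be combined in a short NumPy sketch. It assumes paired image/text batches with a symmetric boolean positive mask, at least one positive per row, a margin added to negative-pair similarities, and a loss averaged over the batch; the normalization and the helper names `soft_targets` and `mpm_nce` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def soft_targets(pos_mask, eps):
    """Spread (1 - eps) over positives and eps over negatives, row by row."""
    pos = pos_mask.astype(float)
    n_pos = pos.sum(axis=1, keepdims=True)          # |P(i)|, assumed >= 1
    n_neg = pos.shape[1] - n_pos                    # B - |P(i)|
    return (1.0 - eps) * pos / n_pos + eps * (1.0 - pos) / np.maximum(n_neg, 1.0)

def log_softmax(z):
    """Numerically stable row-wise log-softmax."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def mpm_nce(img, txt, pos_mask, delta=0.05, tau=0.01, eps=0.0):
    """Multi-positive, margin-augmented contrastive loss (illustrative)."""
    sim = img @ txt.T                               # cosine similarities
    logits = (sim + delta * (~pos_mask)) / tau      # margin on negatives only
    i2t = -(soft_targets(pos_mask, eps) * log_softmax(logits)).sum()
    t2i = -(soft_targets(pos_mask.T, eps) * log_softmax(logits.T)).sum()
    return (i2t + t2i) / len(img)                   # batch averaging is an assumption

rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 2])                     # two samples share class 0
pos_mask = labels[:, None] == labels[None, :]
img = rng.normal(size=(4, 8))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(4, 8))
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

loss_margin = mpm_nce(img, txt, pos_mask, delta=0.05)
loss_plain = mpm_nce(img, txt, pos_mask, delta=0.0)
```

With `eps = 0`, raising the negative logits by the margin can only increase the loss, so the model has to separate positives from negatives by more than `delta` to reach the same loss value.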
Training Strategy
- Optimizer: AdamW (no weight decay)
- Trained for 10 epochs with a learning rate of \(5 \times 10^{-4}\) using a cosine schedule with 500 warmup steps
- Hyperparameters: \(p=0.2\) (drop probability), \(m=0.999\) (EMA momentum), \(\delta=0.05\) (margin), \(\alpha=0.5\) (for classification tasks)
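The learning-rate schedule listed above can be sketched as a plain function. Linear warmup and decay to zero are assumptions (the notes only state "cosine schedule with 500 warmup steps"), and `total_steps` is a hypothetical value that depends on dataset and batch size.

```python
import math

def lr_at(step, total_steps, base_lr=5e-4, warmup_steps=500):
    """Cosine schedule with linear warmup (illustrative)."""
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

total_steps = 10_000                      # hypothetical: 10 epochs of 1,000 steps
lr_start = lr_at(0, total_steps)
lr_peak = lr_at(500, total_steps)
lr_end = lr_at(total_steps, total_steps)
```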
Key Experimental Results
Main Results
ImageNet classification + 5 OOD datasets (CLIP ViT-B/16):
| Method | Trainable Parameters | IN (ID) | OOD avg | IN-V2 | IN-R | IN-Sketch | ObjectNet | IN-A |
|---|---|---|---|---|---|---|---|---|
| Zero-Shot | 0 | 68.3 | 58.4 | 61.9 | 77.6 | 48.3 | 54.0 | 50.1 |
| Fine-Tuning | 86.7M | 80.7 | 52.8 | 70.4 | 64.0 | 45.1 | 49.1 | 35.2 |
| WiSE-FT | 86.7M | 81.7 | 63.0 | 72.8 | 78.7 | 53.9 | 57.3 | 52.2 |
| Mask-Fill | 86.7M | 82.4 | 63.3 | 73.4 | 78.1 | 53.4 | 57.9 | 53.5 |
| R-Adapter (Ours) | 20.5M | 82.0 | 64.8 | 73.6 | 79.1 | 53.9 | 59.7 | 57.5 |
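R-Adapter attains the best OOD average among the compared methods (64.8, +1.5 points over Mask-Fill) while training roughly a quarter of the parameters (20.5M vs. 86.7M). The advantage is most pronounced on IN-A (+4.0 points over Mask-Fill), which represents the largest distribution shift. Its ID accuracy (82.0) is slightly below that of Mask-Fill (82.4).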
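The trainable-parameter figure can be sanity-checked with simple arithmetic. The sketch below assumes one full-rank `d x d` adapter after each MHA and each FFN block, and the standard CLIP encoder sizes (ViT-B/16: 12 vision layers of width 768 and 12 text layers of width 512; ViT-L/14: 24 vision layers of width 1024 and 12 text layers of width 768). The helper `adapter_params` is hypothetical.

```python
def adapter_params(width, layers, adapters_per_layer=2, rank=None):
    """Count adapter parameters; rank=None means a full-rank d x d adapter."""
    per_adapter = width * width if rank is None else 2 * width * rank
    return layers * adapters_per_layer * per_adapter

# Vision encoder + text encoder for each CLIP variant.
vit_b16 = adapter_params(768, 12) + adapter_params(512, 12)
vit_l14 = adapter_params(1024, 24) + adapter_params(768, 12)

# Low-rank variant (W_adp = B A) for comparison, with a hypothetical rank of 4.
vit_b16_low_rank = adapter_params(768, 12, rank=4) + adapter_params(512, 12, rank=4)
```

Under these assumptions the ViT-B/16 count comes to 20,447,232, close to the 20.5M reported in the table, and the ViT-L/14 count to 64,487,424, in line with the 64.5M quoted for the full-rank adapter on ViT-L. This sketch does not account for the small remaining difference.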
Ablation Study
Contribution of each component to ImageNet classification (ViT-B/32):
| Configuration | ID | OOD avg | Description |
|---|---|---|---|
| E0: Base Adapter | 77.5 | 47.7 | Baseline |
| E3: + Adapter Dropping | 77.8 (+0.3) | 49.6 (+1.9) | Dynamic ensembling significantly improves OOD |
| E5: + Re-scaling | 76.5 (-1.0) | 53.5 (+5.8) | Weight interpolation substantially boosts OOD |
| E7: AD + AC + RS | 76.6 (-0.9) | 53.7 (+6.0) | Combined three-strategy |
| E9: + MPM-NCE | 77.5 (±0) | 53.9 (+6.2) | Loss function improvement |
| E10: + Label Smooth | 77.7 (+0.2) | 54.3 (+6.6) | Final configuration |
| B1: AdaptFormer | 77.2 | 48.5 | Existing PEFT baseline |
| B2: RepAdapter | 77.2 | 48.3 | Existing PEFT baseline |
Hyperparameter sensitivity analysis (ViT-B/32):
| margin δ | ID | OOD | Description |
|---|---|---|---|
| 0 | 77.1 | 54.0 | No margin |
| 0.05 (Default) | 77.7 | 54.3 | Optimal balance |
| 0.1 | 77.8 | 53.8 | Excessive margin drops OOD |
| w/ Single Positive | 77.2 | 47.0 | Poor performance with single positive + margin |
Key Findings
- Adapter Dropping ≠ Dropout: Dropout (E1, +1.1 OOD) and Drop-path (E2, +0.2 OOD) are far less effective than Adapter Dropping (E3, +1.9 OOD), as the latter switches between pre-trained and adapted knowledge.
- Re-scaling is Key to OOD Robustness: Using RS alone yields a +5.8% OOD improvement, but slightly decreases ID performance (-1.0%), necessitating its combination with AD/AC to balance.
- Synergy of MPM-NCE + Label Smoothing: InfoNCE + LS degrades ID accuracy due to semantic conflicts, but MPM-NCE's multi-positive mechanism avoids this, with the combination improving both ID and OOD.
- Broad Generalizability Across Tasks: R-Adapter is effective in few-shot classification (outperforming CLIPood by +1.2% OOD avg), cross-modal retrieval, and open-vocabulary segmentation.
- Insensitivity to α: Compared to WiSE-FT, R-Adapter's performance curve remains flatter across different \(\alpha\) values.
Highlights & Insights
- Conceptual Unification: Unifies parameter-efficient fine-tuning and robust fine-tuning in a single framework for the first time, showing that the two are complementary rather than conflicting.
- Lightweight Ensemble via Re-parameterization: Through the linear structure of the adapter combined with re-parameterization, the dual-model weight interpolation of WiSE-FT is compressed into a single-model operation, reducing the storage cost from 2× models to the adapter size.
- Benchmark Expansion: Extends the evaluation of robust fine-tuning from classification to cross-modal retrieval and open-vocabulary segmentation for the first time, encouraging more systematic evaluation in the field.
- Generality of MPM-NCE: The design of multi-positive + margin losses can be widely applied to vision-language tasks with many-to-many mapping relationships.
Limitations & Future Work
- The full-rank adapter still has 64.5M parameters on ViT-L (relatively large), and the performance of the low-rank version drops.
- The \(\alpha\) value needs to be manually tuned for different tasks (0.5 for classification, 0.8 for retrieval, 0.4 for segmentation).
- More complex adapter architectures (e.g., with gating mechanisms) were not explored, which might further improve the ID-OOD balance.
- The evaluation is primarily based on CLIP, leaving its effectiveness on other VLMs (e.g., ALIGN, SigLIP) unverified.
- OOD evaluation is limited to natural distribution shifts, without addressing adversarial attack scenarios.
Related Work & Insights
- The weight interpolation concept of WiSE-FT is elegantly integrated into the PEFT framework here.
- The Adapter Dropping strategy can inspire robustness improvements in other PEFT methods (such as LoRA).
- The MPM-NCE loss has potential applications in CLIP-based open-world detection/segmentation.
Rating
- Novelty: ⭐⭐⭐⭐ The three self-ensemble strategies are highly creative, and the idea of achieving weight-space ensembling via re-parameterization is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extensive evaluations across 4 ViT scales and 5 different tasks, with comprehensive ablation and hyperparameter analyses.
- Writing Quality: ⭐⭐⭐⭐ Clearly structured with complete mathematical derivations, though some formula notation is slightly heavy.
- Value: ⭐⭐⭐⭐⭐ Parameter-efficient and robust, well suited to real-world deployment scenarios.