Efficient Transfer Learning for Video-language Foundation Models¶

Conference: CVPR 2025
arXiv: 2411.11223
Code: https://github.com/chenhaoxing/ETL4Video
Area: Video Understanding
Keywords: Parameter-Efficient Fine-Tuning, Video Action Recognition, Multimodal Adapter, Transfer Learning, Generalization Ability

TL;DR¶

The authors propose a Multimodal Spatiotemporal Adapter (MSTA) that transfers video-language foundation models to downstream tasks efficiently, training only 2-7% of the parameters, by combining a vision-language shared projection layer with a spatiotemporal description-guided consistency constraint.

Background & Motivation¶

Pre-trained video-language foundation models (such as CLIP and ViCLIP) require adaptation to downstream video tasks. Existing methods face two key challenges:

Conflict between parameters and generalization: Methods such as ActionCLIP and XCLIP introduce a large number of extra parameters to model temporal information. Although they improve downstream task performance, they trigger catastrophic forgetting, severely damaging the generalization performance on unseen categories.
Limitations of single-modality PEFT: Although parameter-efficient methods such as LoRA and AdaptFormer use fewer parameters, they are designed for single-modality models. When applied independently to the vision and text branches, they ignore cross-modal interactions, failing to effectively align video and text representations.

In addition, because ViCLIP is pre-trained specifically on video, existing CLIP-based methods cannot be transferred to it directly, and no efficient fine-tuning scheme has yet been tailored to ViCLIP.

Method¶

Overall Architecture¶

The method takes ViCLIP (a video version of CLIP that replaces CLIP's original attention with spatiotemporal attention) as the backbone and injects lightweight Multimodal Spatiotemporal Adapters (MSTA) into the higher-level Transformer blocks of the video and text encoders. During training, the pre-trained parameters are frozen and only the adapters are optimized. A spatiotemporal description-guided consistency constraint (\(\mathcal{L}_{CC}\)) is added to mitigate overfitting.

Key Designs¶

MSTA Adapter Architecture:
- Function: Establishes parameter-efficient cross-modal alignment between the video and text branches.
- Mechanism: Each adapter consists of three components: modality-specific down-projection layers \(\mathbf{W}_v^{kd}\)/\(\mathbf{W}_t^{kd}\), a cross-modality shared intermediate bottleneck projection layer \(\mathbf{W}^{ks}\), and modality-specific up-projection layers. The up-projection of the video branch is split into a spatial up-projection \(\mathbf{W}_v^{ku-s}\) (a linear layer) and a temporal up-projection \(\mathbf{W}_v^{ku-t}\) (a 3D convolutional layer), with their outputs added together. A scaling factor \(\lambda\) controls the intensity of the adapter output: \([c_j, x_j] = \mathcal{E}^j_v([c_{j-1}, x_{j-1}]) + \lambda \mathcal{A}^j_v([c_{j-1}, x_{j-1}])\).
- Design Motivation: The shared intermediate layer can simultaneously receive gradient updates from both the vision and text modalities during fine-tuning, thereby optimizing cross-modal alignment. The separate down- and up-projection layers preserve the specificities of each modality. The spatial and temporal joint up-projection design enhances the model's adaptation to spatial and temporal features, respectively.
Selective Layer Injection Strategy:
- Function: Injects adapters only into higher-level Transformer blocks to protect the general features learned in lower-level layers.
- Mechanism: MSTA adapters are added from the \(k\)-th block to the final block \(L\), while the lower-level layers from \(1\) to \(k-1\) remain frozen. Different values of \(k\) are selected based on task settings (blocks 1-12 for base-to-novel, blocks 8-12 for few-shot).
- Design Motivation: Lower Transformer layers learn generic features, while higher layers learn task-specific features. In scenarios requiring substantial generalization, such as few-shot learning, fine-tuning only the higher layers preserves the pre-trained knowledge more effectively.
Spatiotemporal Description-guided Consistency Constraint:
- Function: Prevents overfitting and enhances generalization via knowledge distillation.
- Mechanism: Large language models (e.g., DeepSeek) are leveraged to generate spatial descriptions \(\text{DES}_s\) and temporal descriptions \(\text{DES}_t\) for each action category. A standard prompt template ("a video of {cls}") is input into the trainable branch, while the LLM-generated descriptions are fed into the frozen pre-trained branch. The outputs of both branches are aligned using a consistency constraint based on cosine distance: \(\mathcal{L}_{CC} = 2 - \cos(w^c, D_s^c) - \cos(w^c, D_t^c)\).
- Design Motivation: Knowledge distillation forces the trainable encoders not to deviate too far from the pre-trained model. Spatiotemporal descriptions provide richer semantic information than simple templates, guiding the model to learn more discriminative representations in the spatiotemporal semantic space.
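
The adapter structure and the layer-selection rule described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the class `MSTASketch`, the helper names, the bottleneck width `r`, and the Gaussian initialization are hypothetical, the temporal up-projection is simplified to a 3D convolution with kernel size (3, 1, 1), and no nonlinearity is included because the summary does not specify one.

```python
import random

def rand_matrix(rows, cols, rng, scale=0.02):
    """Small Gaussian weights (an assumed stand-in for the real initialization)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def linear(tokens, weight):
    """Apply a (d_in x d_out) weight to every token in a list of tokens."""
    return [[sum(t * w for t, w in zip(tok, col)) for col in zip(*weight)]
            for tok in tokens]

def add(a, b, scale=1.0):
    """Elementwise a + scale * b for two equally shaped token lists."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

class MSTASketch:
    """Hypothetical adapter pair for one Transformer block."""

    def __init__(self, d_video, d_text, r, seed=0):
        rng = random.Random(seed)
        self.w_v_down = rand_matrix(d_video, r, rng)  # video-specific down-projection
        self.w_t_down = rand_matrix(d_text, r, rng)   # text-specific down-projection
        self.w_shared = rand_matrix(r, r, rng)        # shared by both modalities
        self.w_v_up_s = rand_matrix(r, d_video, rng)  # spatial up-projection (linear)
        # Temporal up-projection: one (r x d_video) weight per temporal offset,
        # i.e. a 3D convolution restricted to kernel size (3, 1, 1) (assumption).
        self.w_v_up_t = [rand_matrix(r, d_video, rng) for _ in range(3)]
        self.w_t_up = rand_matrix(r, d_text, rng)     # text-specific up-projection

    def video(self, frames):
        """frames: list of T frames, each a list of P tokens of size d_video."""
        z = [linear(linear(f, self.w_v_down), self.w_shared) for f in frames]
        out = []
        for t in range(len(z)):
            y = linear(z[t], self.w_v_up_s)              # spatial path
            for k, w in zip((-1, 0, 1), self.w_v_up_t):  # temporal path
                if 0 <= t + k < len(z):                  # zero padding at the ends
                    y = add(y, linear(z[t + k], w))      # the two paths are summed
            out.append(y)
        return out

    def text(self, tokens):
        """tokens: list of text tokens of size d_text."""
        z = linear(linear(tokens, self.w_t_down), self.w_shared)
        return linear(z, self.w_t_up)

def adapted_block(frozen_block, adapter_fn, x, lam):
    """Residual form from the text: E(x) + lambda * A(x)."""
    return add(frozen_block(x), adapter_fn(x), scale=lam)

def adapter_blocks(k, num_blocks):
    """Blocks k..L receive adapters; blocks 1..k-1 stay untouched."""
    return list(range(k, num_blocks + 1))

rng = random.Random(1)
adapter = MSTASketch(d_video=8, d_text=6, r=4)
frames = [rand_matrix(3, 8, rng, scale=1.0) for _ in range(2)]  # T=2 frames, P=3 tokens
text_tokens = rand_matrix(5, 6, rng, scale=1.0)
video_delta = adapter.video(frames)
text_out = adapted_block(lambda x: x, adapter.text, text_tokens, lam=0.1)
print(len(video_delta), len(video_delta[0]), len(video_delta[0][0]))
print(adapter_blocks(8, 12))
```

Because `w_shared` is used by both `video` and `text`, any gradient-based update of it would receive contributions from both branches, which is the property the design motivation relies on. The identity function passed as `frozen_block` is only a placeholder for a frozen pre-trained block.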

Loss & Training¶

The final objective function is a weighted sum of the cross-entropy loss and the consistency constraint:

\[\mathcal{L} = \mathcal{L}_{CE} + \alpha \mathcal{L}_{CC}\]

where \(\mathcal{L}_{CE}\) is the standard cross-entropy loss computed over video-text similarity scores and \(\alpha\) weights the consistency term; \(\alpha=1.0\) works best. The AdamW optimizer is used with a weight decay of 0.001, and using \(N=2\) descriptions is found to be optimal. All modules in MSTA are initialized using Kaiming initialization.
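
As a rough illustration of how the two terms combine, the sketch below computes \(\mathcal{L}_{CC}\) and the total objective for toy feature vectors. The function names, the temperature value of 0.07, and the averaging of \(\mathcal{L}_{CC}\) over classes are assumptions made for this example; the summary only fixes the per-class form \(2 - \cos(w^c, D_s^c) - \cos(w^c, D_t^c)\) and the weighted sum.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def consistency_loss(w, d_s, d_t):
    """L_CC = 2 - cos(w, D_s) - cos(w, D_t) for a single class."""
    return 2.0 - cosine(w, d_s) - cosine(w, d_t)

def cross_entropy(video_feat, class_feats, label, temperature=0.07):
    """Cross-entropy over scaled cosine similarities (temperature is assumed)."""
    logits = [cosine(video_feat, c) / temperature for c in class_feats]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[label]

def total_loss(video_feat, label, w, d_s, d_t, alpha=1.0):
    """L = L_CE + alpha * L_CC, with L_CC averaged over classes (assumption)."""
    l_ce = cross_entropy(video_feat, w, label)
    l_cc = sum(consistency_loss(w[c], d_s[c], d_t[c])
               for c in range(len(w))) / len(w)
    return l_ce + alpha * l_cc

# Toy example: two classes, 2-D features.
w = [[1.0, 0.0], [0.0, 1.0]]    # trainable text features from "a video of {cls}"
d_s = [[0.9, 0.1], [0.1, 0.9]]  # frozen spatial-description features
d_t = [[0.8, 0.2], [0.2, 0.8]]  # frozen temporal-description features
print(total_loss([0.9, 0.1], 0, w, d_s, d_t, alpha=1.0))
```

When the trainable text features coincide with both description features, the consistency term vanishes and the objective reduces to the cross-entropy loss alone.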

Key Experimental Results¶

Main Results (Base-to-Novel Generalization, Harmonic Mean (HM) on 4 Datasets)¶

Method	Trainable Params	K-400 HM	HMDB-51 HM	UCF-101 HM	SSv2 HM
ViFi-CLIP (Full Fine-Tuning)	All	68.2	62.5	78.7	14.2
ViCLIP (Full Fine-Tuning)	124.3M	71.3	62.7	81.6	17.0
+AdaptFormer	7.9M	71.1	64.3	82.3	17.0
+LoRA	9.4M	70.9	64.0	82.1	16.0
+MSTA+\(\mathcal{L}_{CC}\)	8.7M	72.0	66.3	82.9	18.9

Ablation Study¶

Configuration	Base	Novel	HM	Description
Language Adapter Only	66.1	51.5	57.9	Inadequate single modality
Vision Adapter Only	65.7	51.7	57.9	Inadequate single modality
No Shared Layer	68.0	52.9	59.5	Lacks cross-modality alignment
Full MSTA	68.6	53.5	60.1	Shared layer improves HM by 0.6
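
The HM column is the harmonic mean of the Base and Novel accuracies. A short check, using only the numbers from the ablation table above (the function name `harmonic_mean` is chosen for this example), reproduces the reported values:

```python
def harmonic_mean(base, novel):
    """Harmonic mean of base-class and novel-class accuracy."""
    return 2 * base * novel / (base + novel)

# (Base, Novel) pairs copied from the ablation table.
ablation = {
    "Language Adapter Only": (66.1, 51.5),
    "Vision Adapter Only": (65.7, 51.7),
    "No Shared Layer": (68.0, 52.9),
    "Full MSTA": (68.6, 53.5),
}

for name, (base, novel) in ablation.items():
    print(f"{name}: HM = {harmonic_mean(base, novel):.1f}")
```

The harmonic mean penalizes a gap between the two accuracies, so a method that gains on base classes by sacrificing novel classes does not improve HM as much as one that raises both.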

Key Findings¶

MSTA achieves state-of-the-art (SOTA) performance across all four evaluation settings (zero-shot, few-shot, base-to-novel, and fully-supervised), while using only 2-7% of the original model parameters.
On SSv2 (a dataset with strong temporal dependencies), novel-category accuracy rises from OST's 11.5 to 16.5 (a relative gain of about 43%), supporting the importance of spatiotemporal modeling.
The shared projection layer improves HM by 0.6 in the ablation (59.5 → 60.1), supporting the value of cross-modal gradient sharing.
The consistency constraint is particularly effective in few-shot learning, successfully mitigating overfitting in low-data regimes.
The number of descriptions \(N=2\) is optimal; larger values of \(N\) introduce more noise due to large language model hallucinations.
Injecting adapters only into higher layers (8→12) performs better in few-shot scenarios compared to full-layer injection.
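
Two of the figures quoted above can be checked with simple arithmetic. The sketch assumes that the 124.3M listed for full fine-tuning of ViCLIP is the total parameter count of the backbone; the helper names are chosen for this example.

```python
def percent(part, whole):
    """Share of `part` in `whole`, in percent."""
    return 100.0 * part / whole

def relative_gain(old, new):
    """Relative improvement from `old` to `new`, in percent."""
    return 100.0 * (new - old) / old

# 8.7M trainable parameters against the 124.3M of full fine-tuning.
print(f"MSTA trainable share: {percent(8.7, 124.3):.1f}%")               # 7.0%
# SSv2 novel-category accuracy: 11.5 (OST) to 16.5 (MSTA).
print(f"SSv2 novel relative gain: {relative_gain(11.5, 16.5):.1f}%")    # 43.5%
```

The 7.0% share matches the upper end of the 2-7% range quoted for MSTA, and the 43.5% relative gain corresponds to the "+43%" figure above.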

Highlights & Insights¶

The shared projection layer is the core innovation of the method: a simple idea (a cross-modal shared intermediate layer) improves performance while adding almost no parameters.
The dual spatial + temporal up-projection design is simple yet effective, utilizing linear layers to capture spatial information and 3D convolutions to capture temporal information.
The combination of LLM-generated descriptions + knowledge distillation successfully transfers knowledge from large language models to downstream tasks, serving as a solid example of the "LLM as teacher" paradigm.
The method is versatile and can be applied to both CLIP and ViCLIP.

Limitations & Future Work¶

The consistency constraint relies heavily on the generation quality of the LLM, and hallucination issues at larger values of \(N\) limit its scalability.
The method has only been verified on ViT-B/16, leaving its performance on larger-scale models uncertain.
The generation of spatiotemporal descriptions is offline and decoupled from training; online adaptive generation might yield better results.
The spatial and temporal up-projections are simply added together; more sophisticated fusion strategies (e.g., gating mechanisms) might yield further improvements.

Related Work & Insights¶

PEFT methods such as LoRA and AdaptFormer are efficient but neglect multimodal alignment; this work highlights the critical importance of cross-modal interactions.
The OST method also leverages LLM-generated descriptions but relies on full-parameter fine-tuning, which is prone to overfitting; this study utilizes descriptions for distillation constraints, which is a more elegant approach.
The MoTE method introduces a mixture of temporal experts, which requires a large parameter budget (88M vs. 8.7M in this work) yet yields slightly worse performance.

Rating¶

Novelty: ⭐⭐⭐⭐ The combination of cross-modal shared projection and description-guided consistency constraints is highly novel.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers four evaluation settings, six datasets, and comprehensive ablation studies.
Writing Quality: ⭐⭐⭐⭐ Structure is clear, though some parts have heavy mathematical formulations that require careful reading.
Value: ⭐⭐⭐⭐ Provides a practical solution for the efficient transfer of video-language models with a wide scope of applicability.