Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

Conference: CVPR 2026
arXiv: 2604.09088
Code: https://github.com/Zhang-VKk/MDPD
Area: Model Compression / Efficient Transfer Learning
Keywords: memory-efficient transfer learning, knowledge distillation, side networks, inference acceleration, dual-path distillation

TL;DR

MDPD proposes an efficient fine-tuning framework based on bidirectional knowledge distillation between a frozen backbone and a lightweight side network. Upon training completion, the side network is discarded, achieving both parameter- and memory-efficient training as well as inference-time acceleration.

Background & Motivation

Background: Memory-efficient transfer learning (METL) avoids gradient backpropagation through large backbones by constructing lightweight parallel side networks, substantially reducing training memory. However, side networks introduce additional memory and computational overhead at inference time.

Limitations of Prior Work: Existing METL methods achieve parameter and memory efficiency during training, but their inference-time overhead contradicts the ultimate goal of efficient transfer learning.

Key Challenge: Side networks are indispensable during training (to avoid storing gradients of the large backbone) yet become a burden at inference (increasing forward-pass overhead).

Goal: Design a method that leverages side networks for memory-efficient training while discarding them at inference without sacrificing accuracy.

Key Insight: Transfer the downstream task knowledge learned by the side network back to the backbone via bidirectional knowledge distillation.

Core Idea: During training, the backbone and side network serve as mutual teacher and student through distillation; at inference, only the optimized backbone is used and the side network is "faded out."

Method

Overall Architecture

MDPD consists of two parallel paths: a largely frozen backbone (only its LayerNorm parameters and final output layer are updated) and a learnable side network. During training, knowledge flows in both directions: feature-level distillation transfers pretrained knowledge from the backbone to the side network, and logits-level distillation transfers task knowledge from the side network back to the backbone. At inference, only the backbone and the task head are used.
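
The two distillation signals can be illustrated with a minimal sketch. It assumes a mean-squared-error feature loss and a temperature-scaled KL logits loss in the style of classic knowledge distillation; this summary does not specify either loss form or a temperature, and all function names are hypothetical. The feature loss also assumes the two feature vectors already share one dimension.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def feature_distillation_loss(backbone_feat, side_feat):
    """Backbone (teacher) -> side network (student).
    MSE is an assumed loss form; inputs are equal-length flat lists."""
    squared = [(t - s) ** 2 for t, s in zip(backbone_feat, side_feat)]
    return sum(squared) / len(squared)

def logits_distillation_loss(side_logits, backbone_logits, temperature=2.0, eps=1e-12):
    """Side network (teacher) -> backbone (student).
    KL(teacher || student) on temperature-softened distributions (assumed)."""
    p = softmax(side_logits, temperature)
    q = softmax(backbone_logits, temperature)
    kl = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

# Identical logits give zero logits loss; the feature loss is a plain MSE.
print(logits_distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
print(feature_distillation_loss([1.0, 2.0], [1.0, 4.0]))           # 2.0
```

Note that the teacher and student roles swap between the two functions, which is what makes the scheme bidirectional.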

Key Designs

Dual-Path Knowledge Distillation (DPKD):
Function: Establishes bidirectional knowledge flow between the backbone and the side network.
Mechanism: In feature distillation, the backbone acts as teacher and the side network as student (enhancing the side network with pretrained knowledge); in logits distillation, the side network acts as teacher and the backbone as student (transferring downstream task knowledge back to the backbone). Low-rank matrices \(M_{down} \in \mathbb{R}^{D_S \times d}\) and \(M_{up} \in \mathbb{R}^{d \times D_B}\) are used for dimensionality alignment, where \(D_S\) and \(D_B\) denote the side-network and backbone feature widths and \(d\) is the bottleneck dimension.
Design Motivation: Bidirectional distillation enables mutual enhancement: the backbone's pretrained knowledge helps the side network learn better, while the side network's task-specific knowledge helps the backbone adapt to downstream tasks.
Hierarchical Feature Distillation (HFD):
Function: Applies differentiated distillation strategies across encoder layers.
Mechanism: In shallow layers, teacher and student attention patterns are similar (both show diagonal self-attention), so the student imitates the teacher's features directly. In deep layers, the patterns diverge (teacher and student attend to different sparse key tokens), so a masked generation strategy is used: rather than imitating teacher features directly, the student learns to generate them.
Design Motivation: The attention discrepancy between shallow and deep layers renders a uniform distillation strategy suboptimal; a hierarchical strategy transfers knowledge more effectively.
Side Network Ablation for Inference:
Function: Completely removes the side network at inference time.
Mechanism: During training, the backbone only updates its LayerNorm scale/bias parameters and the final output layer (with most parameters frozen), yet acquires task adaptation capability through distillation. At inference, the backbone with the task head is used directly.
Design Motivation: Eliminates inference overhead from the side network, achieving dual efficiency in both training and inference.
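
The low-rank alignment and the shallow/deep split described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the MSE loss, the element-wise random mask, the mask ratio, the use of the low-rank projection as the "generator", sharing one projection across layers, and the `first_deep_layer` boundary are all hypothetical choices, as are the function names.

```python
import numpy as np

def align(side_feat, m_down, m_up):
    """Map side features (N, D_S) to backbone width (N, D_B) through a rank-d bottleneck."""
    return side_feat @ m_down @ m_up

def direct_imitation_loss(teacher_feat, side_feat, m_down, m_up):
    """Shallow layers: aligned student features imitate the teacher (MSE assumed)."""
    return float(np.mean((align(side_feat, m_down, m_up) - teacher_feat) ** 2))

def masked_generation_loss(teacher_feat, side_feat, m_down, m_up, mask_ratio, rng):
    """Deep layers: zero out random student entries, then ask the projection
    to generate the full teacher feature from what remains visible."""
    visible = rng.random(side_feat.shape) >= mask_ratio  # True = kept
    generated = align(side_feat * visible, m_down, m_up)
    return float(np.mean((generated - teacher_feat) ** 2))

def hierarchical_feature_loss(teacher_feats, side_feats, m_down, m_up,
                              first_deep_layer, mask_ratio=0.5, seed=0):
    """Sum per-layer losses, switching strategy at first_deep_layer (hypothetical)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for layer, (t, s) in enumerate(zip(teacher_feats, side_feats)):
        if layer < first_deep_layer:
            total += direct_imitation_loss(t, s, m_down, m_up)
        else:
            total += masked_generation_loss(t, s, m_down, m_up, mask_ratio, rng)
    return total

# Toy shapes: 5 tokens, D_S = 8, d = 2, D_B = 16, 4 encoder layers.
rng = np.random.default_rng(1)
m_down, m_up = rng.normal(size=(8, 2)), rng.normal(size=(2, 16))
side_feats = [rng.normal(size=(5, 8)) for _ in range(4)]
teacher_feats = [rng.normal(size=(5, 16)) for _ in range(4)]
loss = hierarchical_feature_loss(teacher_feats, side_feats, m_down, m_up, first_deep_layer=2)
```

In a real training loop the projection matrices would be learned jointly with the side network; here they are random placeholders so that only the shapes and the control flow are shown.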

Loss & Training

The backbone and side network are optimized alternately to minimize the divergence between their feature distributions. The total loss comprises a feature distillation loss and a logits distillation loss.
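
As a hedged illustration, the two-term objective can be written as below. The symbol names and the balancing weight \(\lambda\) are assumptions; this summary does not state how the two terms are weighted.

\[
\mathcal{L}_{total} = \mathcal{L}_{feat} + \lambda \, \mathcal{L}_{logit}
\]

Given the teacher and student roles described under Key Designs, \(\mathcal{L}_{feat}\) would presumably drive the side-network update and \(\mathcal{L}_{logit}\) the update of the backbone's trainable parameters during the alternating steps.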

Key Experimental Results

Main Results

Task	Metric	MDPD	SOTA METL	Gain
Vision	Inference speedup	≥25.2%	0%	+25.2%
Language	Inference speedup	≥22.5%	0%	+22.5%
Multimodal	Accuracy	Surpasses SOTA	—	Improved

Ablation Study

Configuration	Key Metric	Notes
w/o feature distillation	Accuracy drop	Missing pretrained knowledge transfer
w/o logits distillation	Accuracy drop	Missing task knowledge transfer
w/o hierarchical distillation	Accuracy drop	Inappropriate shallow/deep strategy
Full MDPD	Best	Bidirectional distillation + hierarchical strategy

Key Findings

Inference speedups of at least 25.2% on vision tasks and 22.5% on language tasks are achieved while maintaining or even improving accuracy, suggesting that the side network's role can be transferred to the backbone via distillation.
The hierarchical distillation strategy is critical for multi-layer encoders; the combination of direct imitation in shallow layers and masked generation in deep layers yields the best results.
The method is effective across vision, language, and vision-language modalities, validating its generality.

Highlights & Insights

Use during training, discard at inference: The concept of the side network as a "disposable coach" elegantly resolves the inference overhead problem in METL.
Hierarchical distillation finding: The observation of attention pattern differences between shallow and deep layers, and the corresponding tailored distillation strategies, offer valuable insights for future work.
Low-rank dimensionality alignment: A bottleneck structure is used to avoid introducing excessive parameters for dimensionality alignment, preserving parameter efficiency.
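
To make the parameter saving of the bottleneck concrete, the following small calculation compares the two low-rank matrices against a single full projection. The dimensions are hypothetical and chosen only for illustration; the summary does not report the actual values of \(D_S\), \(D_B\), or \(d\).

```python
def alignment_param_counts(d_side, d_backbone, rank):
    """Return (low_rank, full) parameter counts for mapping d_side -> d_backbone.

    low_rank: M_down (d_side x rank) plus M_up (rank x d_backbone)
    full:     one dense (d_side x d_backbone) projection
    """
    low_rank = d_side * rank + rank * d_backbone
    full = d_side * d_backbone
    return low_rank, full

# Hypothetical sizes: side width 96, backbone width 768, bottleneck 16.
low_rank, full = alignment_param_counts(d_side=96, d_backbone=768, rank=16)
print(low_rank, full, low_rank / full)  # 13824 73728 0.1875
```

The bottleneck only saves parameters when the rank is below \(D_S D_B / (D_S + D_B)\), which is about 85 for the sizes used here.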

Limitations & Future Work

Training time may increase due to dual-path forward passes and distillation loss computation.
Updating only the backbone's LayerNorm parameters may limit adaptation under more extreme domain shifts.
The relationship between side network scale and distillation effectiveness is not discussed.

vs. LoRA: LoRA trains low-rank updates added to otherwise frozen backbone weights but still requires backpropagation through the backbone; MDPD adapts the backbone indirectly via the side network, resulting in lower training memory usage.
vs. Side-Tuning: Conventional side network methods retain the side network at inference; MDPD achieves complete removal at inference through distillation.

Rating

Novelty: ⭐⭐⭐⭐ The combined design of bidirectional distillation and side network ablation is novel.
Experimental Thoroughness: ⭐⭐⭐⭐ Validated across three task types: vision, language, and multimodal.
Writing Quality: ⭐⭐⭐⭐ Method description is clear and well-organized.
Value: ⭐⭐⭐⭐ Addresses the core contradiction in the METL paradigm.