Fuse Before Transfer: Knowledge Fusion for Heterogeneous Distillation

- Conference: ICCV 2025
- arXiv: 2410.12342
- Code: https://github.com/liguopeng0923/FBT
- Area: Model Compression / Knowledge Distillation
- Keywords: Cross-architecture knowledge distillation, heterogeneous model fusion, CNN-ViT-MLP, InfoNCE loss, feature alignment
TL;DR
This paper proposes FBT (Fuse Before Transfer), which mitigates the feature gap in cross-architecture knowledge distillation (CAKD) by first fusing modules (CNN/MSA/MLP) from heterogeneous teachers and students to construct an adaptive intermediate fusion model before knowledge transfer, and replaces the conventional MSE loss with a spatial-agnostic InfoNCE loss. FBT achieves an average improvement of 8.38% on CIFAR-100 and 2.31% on ImageNet-1K.
Background & Motivation

Most knowledge distillation (KD) methods focus on homogeneous teacher–student pairs (e.g., CNN→CNN), which constrains both the potential and the flexibility of distillation:
- Limited potential: The pool of homogeneous teachers is narrow and may not provide optimal knowledge. OFA has demonstrated that distilling a heterogeneous ViT-Base into ResNet50 can outperform distilling a homogeneous ResNet152.
- Limited flexibility: As new architectures continually emerge, domain-specific tasks may lack suitable homogeneous teachers.
The central challenge of CAKD is the large representational gap between heterogeneous models, arising from:
Inductive bias discrepancy: CNNs exhibit locality and translation equivariance, whereas ViTs/MLPs rely on global dependencies.
Module functional discrepancy: Different modules read, encode, and process inputs differently, leading to significant distribution shifts in intermediate features at each stage.
Limitations of existing methods:
- Feature-based methods employ simple projectors that cannot bridge the heterogeneous feature gap.
- Pixel-wise MSE loss is ill-suited to heterogeneous features with large spatial distribution differences (e.g., FitNet achieves only 24.06% on ConvNeXt-T→Swin-P).
- OFA projects features into logit space, sacrificing structural feature information.
Method

Overall Architecture

FBT adopts a three-level distillation scheme (Teacher–Fusion–Student), built on the core principle of fuse before transfer:
1. An adaptive fusion model is constructed by concatenating modules from the teacher and the student.
2. Three sets of losses are applied simultaneously: \(\mathcal{L}_{\text{FBT}} = \mathcal{L}(K_t, K_s) + \mathcal{L}(K_t, K_f) + \mathcal{L}(K_f, K_s)\)
3. The fusion model acts as a bridge between the teacher and the student.
Key Designs

- Adaptive Knowledge Fusion:
- The fusion model comprises the first three CNN stages of the student, an L2G projector, and the final MSA/MLP stage of the teacher.
- Formulation: \(p_f(x) = fc_m \circ S_m^4 \circ (MSA \circ PE) \circ S_c^3 \circ S_c^2 \circ S_c^1(x)\)
- The L2G module consists of a Patch Embedding layer (for dimension transformation) and a Swin Block (for local-to-global feature transition).
- Design Motivation: CNNs and MSA are complementary (the former excels at local features, the latter at global dependencies); weight sharing minimizes additional parameters.
- The fusion model is adaptive: different teacher–student pairs yield different fusion architectures.
- Spatial-Agnostic Knowledge Supervision:
- Only the final features after Average Pooling and the logits are transferred, since weight sharing truly integrates different inductive biases only at the final feature level.
- Average Pooling smooths spatial discrepancies, and InfoNCE loss then aligns structural feature information.
- Knowledge is defined as \(K_i = \{f_i, p_i\}\), where \(f_i\) is the pooled feature embedding and \(p_i\) is the output logits.
- L2G (Local-to-Global) Projector:
- Serves as the bridge connecting CNN and MSA/MLP modules.
- Includes a Patch Embedding layer that transforms CNN features into the dimensions required by MSA/MLP.
- Appends a Swin Block to achieve local-to-global receptive field transition.
- Introduces minimal additional learnable parameters.
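The fusion composition \(p_f(x) = fc_m \circ S_m^4 \circ (MSA \circ PE) \circ S_c^3 \circ S_c^2 \circ S_c^1(x)\) can be sketched with toy NumPy stand-ins. Everything below is illustrative rather than the paper's implementation: stage stubs, channel widths, and dimensions are hypothetical, and both the L2G Swin Block and the teacher's MSA stage are reduced to a bare single-head attention for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_stage(x, out_ch):
    # Stand-in for a student CNN stage S_c: stride-2 downsample + 1x1 projection
    x = x[:, :, ::2, ::2]
    w = rng.standard_normal((x.shape[1], out_ch)) * 0.01
    return np.einsum('bchw,cd->bdhw', x, w)

def patch_embed(x, embed_dim):
    # L2G Patch Embedding: flatten the spatial grid into a token sequence
    # and project channels to the MSA embedding dimension
    b, c, h, w = x.shape
    tokens = x.reshape(b, c, h * w).transpose(0, 2, 1)      # (B, N, C)
    return tokens @ (rng.standard_normal((c, embed_dim)) * 0.01)

def attention(tokens):
    # Bare single-head self-attention, standing in for both the L2G
    # Swin Block and the teacher's final MSA stage S_m^4
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (scores / scores.sum(axis=-1, keepdims=True)) @ tokens

def fusion_forward(x, embed_dim=96, num_classes=10):
    # p_f(x) = fc_m ∘ S_m^4 ∘ (MSA ∘ PE) ∘ S_c^3 ∘ S_c^2 ∘ S_c^1(x)
    for ch in (32, 64, 128):                   # S_c^1..S_c^3: student CNN stages
        x = cnn_stage(x, ch)
    t = attention(patch_embed(x, embed_dim))   # L2G projector (PE + Swin Block)
    t = attention(t)                           # teacher stage S_m^4
    f = t.mean(axis=1)                         # average-pooled feature f_f
    p = f @ (rng.standard_normal((embed_dim, num_classes)) * 0.01)
    return f, p                                # knowledge K_f = {f_f, p_f}

feat, logits = fusion_forward(rng.standard_normal((2, 3, 32, 32)))
print(feat.shape, logits.shape)   # (2, 96) (2, 10)
```

In the real method the stages are the student's and teacher's own weight-shared modules; only the L2G projector introduces new parameters.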
Loss & Training

The overall loss comprises two components, applied to each knowledge pair \((K_i, K_j)\):
- OFA loss (logits): a modulated KL-divergence variant that upweights target-class information via a modulation parameter \(\gamma\) when the teacher is uncertain.
- InfoNCE loss (features): a spatial-agnostic contrastive loss in which teacher–student features from the same image form positive pairs, capturing complex inter-feature dependencies without relying on spatial positions; the temperature parameter \(\tau_2\) is learnable.
Total loss: \(\mathcal{L} = \mathcal{L}_{\text{InfoNCE}}(f_i, f_j) + \mathcal{L}_{\text{OFA}}(p_i, p_j)\), applied to all three pairs: T-S, T-F, and F-S.
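A minimal NumPy sketch of the spatial-agnostic InfoNCE term over pooled features (batch size, feature dimension, and the fixed temperature are illustrative; the paper learns \(\tau_2\) rather than fixing it):

```python
import numpy as np

def info_nce(f_a, f_b, tau=0.07):
    """Spatial-agnostic InfoNCE over pooled feature embeddings.

    Row i of f_a and f_b (features of the same image from two models)
    forms the positive pair; all other rows in the batch act as negatives.
    """
    f_a = f_a / np.linalg.norm(f_a, axis=1, keepdims=True)
    f_b = f_b / np.linalg.norm(f_b, axis=1, keepdims=True)
    logits = f_a @ f_b.T / tau                    # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives on the diagonal

rng = np.random.default_rng(0)
f_t, f_s = rng.standard_normal((4, 16)), rng.standard_normal((4, 16))

# Applied to all three knowledge pairs, mirroring
# L = L(K_t, K_s) + L(K_t, K_f) + L(K_f, K_s):
# feature_loss = info_nce(f_t, f_s) + info_nce(f_t, f_f) + info_nce(f_f, f_s)

# Perfectly aligned features give a near-zero loss
assert info_nce(f_t, f_t) < info_nce(f_t, f_s)
```

Because only average-pooled embeddings enter the loss, no pixel-wise spatial correspondence between CNN and MSA/MLP feature maps is required.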
Key Experimental Results

Main Results (Tables)
CIFAR-100 Cross-Architecture Distillation Results (Top-1 Accuracy %)
| Teacher | Student | KD | FitNet | CRD | OFA | FBT |
|---|---|---|---|---|---|---|
| Swin-T | ResNet18 | 78.74 | 78.87 | 77.63 | 80.54 | 81.61 |
| ViT-S | ResNet18 | 77.26 | 77.71 | 76.60 | 80.15 | 81.93 |
| ViT-S | MobileNetV2 | 72.77 | 73.54 | 78.14 | 78.45 | 82.10 |
| ConvNeXt-T | DeiT-T | 72.99 | 60.78 | 65.94 | 75.76 | 79.57 |
| ConvNeXt-T | ResMLP-S12 | 72.25 | 45.47 | 63.35 | 75.21 | 78.03 |
| Avg. Gain | | +3.12 | -5.21 | -0.02 | +6.19 | +8.38 |
ImageNet-1K Cross-Architecture Distillation Results (Top-1 Accuracy %)
| Teacher | Student | OFA | FBT |
|---|---|---|---|
| Swin-T | ResNet18 | 71.76 | 72.21 |
| Swin-T | MobileNetV2 | 72.32 | 72.54 |
| ConvNeXt-T | DeiT-T | 74.41 | 75.26 |
| ResNet50 | Swin-N | 77.76 | 77.79 |
| Avg. Gain | | +2.05 | +2.31 |
Ablation Study (Table)
| Ablation Setting | Swin-T→ResNet18 | ConvNeXt-T→Swin-P | Swin-T→ResNet18 (IN-1K) |
|---|---|---|---|
| KD Baseline | 78.74 (-2.87) | 76.44 (-4.29) | 71.14 (-1.07) |
| (A) w/o MSA and \(S_m^4\) | 75.95 (-5.66) | 77.65 (-3.08) | 70.86 (-1.35) |
| (B) w/o \(S_m^4\) | 77.21 (-4.40) | 77.84 (-2.89) | 71.78 (-0.43) |
| (C) w/o \(\mathcal{L}(K_t,K_f)\) | 25.57 (-56.04) | 50.46 (-30.27) | 71.34 (-0.87) |
| (F) w/o InfoNCE | 80.95 (-0.66) | 78.89 (-1.84) | 71.47 (-0.74) |
| (G) w/o OFA | 77.91 (-3.70) | 80.32 (-0.41) | 70.37 (-1.84) |
| Full FBT | 81.61 | 80.73 | 72.21 |
Key Findings
- The fusion model's performance consistently lies between that of the teacher and the student, validating its role as a knowledge bridge.
- Removing the teacher-to-fusion loss \(\mathcal{L}(K_t,K_f)\) causes catastrophic degradation (from 81.61% to 25.57% on CIFAR-100), confirming that the fusion model must learn from the teacher.
- InfoNCE and OFA losses exhibit varying importance across different teacher–student pairs; their complementary use yields the best results.
- FBT also achieves competitive performance in same-architecture knowledge distillation (SAKD), marginally surpassing FCFD and OFA on ImageNet-1K.
- The fusion strategy \(S_c^{1 \to 3} \to S_m^{4 \to fc}\) (3 CNN stages + 1 MSA/MLP stage) achieves the best balance between simplicity and adaptability.
Highlights & Insights
- Fusion over alignment: Rather than attempting to bridge the heterogeneous feature gap with projectors, FBT directly constructs a fusion model incorporating modules from both architectures, fundamentally reducing the feature gap.
- Adaptive design: The fusion model automatically adapts to each teacher–student pair without manual architecture engineering.
- Elegant weight sharing: By reusing module weights from the student and teacher, the fusion model introduces almost no extra parameters while implicitly aligning module functionality.
- Importance of spatial-agnostic loss: MSE fails in heterogeneous settings (FitNet achieves 24.06%), whereas InfoNCE circumvents the spatial alignment problem through contrastive learning.
Limitations & Future Work
- For certain well-established student models (e.g., ResNet18), heterogeneous teacher distillation may underperform homogeneous counterparts.
- Fusion may disrupt spatial alignment between heterogeneous features; distribution-level (rather than pixel-level) spatial alignment could be explored.
- Generalization to broader downstream tasks such as object detection and NLP remains unvalidated.
- The current fixed 3+1 fusion ratio could potentially be improved through automatic architecture search.
Related Work & Insights
- OFA (NeurIPS 2023): The first general-purpose heterogeneous distillation method, but it sacrifices feature information by projecting features into logit space.
- FCFD (ICLR 2023): Aligns functional similarity via module connectivity, directly inspiring FBT's module fusion design.
- Hybrid model designs (CoAtNet, ConvMLP, etc.): The complementarity of CNN and MSA directly motivates the fusion strategy.
- CRD: A pioneer in applying InfoNCE loss to knowledge distillation.
Rating
- Novelty: ⭐⭐⭐⭐ — The "fuse before transfer" paradigm is elegant and effective; drawing inspiration from hybrid model design to address distillation is a clever insight.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers 12 heterogeneous combinations on CIFAR-100 and 14 on ImageNet-1K, with comprehensive ablations over fusion strategies, loss functions, and module configurations.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured with clear motivation analysis; the taxonomy diagram (Fig. 2) is particularly informative.
- Value: ⭐⭐⭐⭐ — Provides a general and effective framework for cross-architecture distillation with strong practical applicability.