Cross-Architecture Distillation Made Simple with Redundancy Suppression¶
Conference: ICCV 2025 arXiv: 2507.21844 Code: N/A Area: Model Compression Keywords: knowledge distillation, cross-architecture, redundancy suppression, feature decorrelation, CNN-ViT-MLP
TL;DR¶
This paper proposes RSD (Redundancy Suppression Distillation), which extracts architecture-agnostic knowledge via cross-architecture invariance maximization and feature decorrelation. Using a single simple RSD loss and a lightweight MLP decoupling module, RSD substantially outperforms OFA—the pioneering cross-architecture distillation method—on both CIFAR-100 and ImageNet-1k, while incurring only a fraction of OFA's parameter overhead.
Background & Motivation¶
Knowledge distillation (KD) aims to transfer the capabilities of a pretrained teacher model to a lightweight student model. Conventional KD predominantly operates within homogeneous architectures (e.g., CNN→CNN); however, with the emergence of ViTs, MLP-Mixers, and other novel architectures, cross-architecture knowledge distillation (CAKD) has become increasingly important, as the best-performing models are often unsuitable for direct deployment.
- Core challenge: Heterogeneous features differ in dimensionality and exhibit distinct or even conflicting representational patterns; forcing a student to blindly absorb teacher features leads to performance degradation.
- The pioneering method OFA requires architecture-specific projection modules (e.g., depthwise separable convolutions for CNNs, attention blocks for ViTs) to project features into an "architecture-agnostic" logit space, resulting in complex designs and substantial parameter overhead (the projector parameters are 3× the student's when distilling ConvNeXt-T→Swin-N).
- Core insight of this paper: Complex projections are unnecessary; extracting common knowledge across heterogeneous representations can be achieved simply through redundancy suppression.
Method¶
Overall Architecture¶
RSD operates on the penultimate-layer embeddings of both teacher and student. After aligning dimensions via a lightweight AAD decoupling module, it computes a cross-architecture Pearson correlation matrix and applies the RSD loss. The total loss is \(\mathcal{L} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{RSD}\). The AAD module is discarded after training, introducing no additional inference overhead.
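As a rough sketch of how these pieces fit together during training (a reconstruction from the description above, not the authors' code), the student's penultimate embedding passes through the AAD projector before the RSD loss is computed against the frozen teacher's embedding; `aad` and `rsd_loss` below stand for the components sketched in the following subsections, and the assumption that each model returns both logits and its penultimate embedding is mine:

```python
import torch
import torch.nn.functional as F

def training_step(student, teacher, aad, rsd_loss, images, labels, lambda_rsd):
    """One training step: cross-entropy on student logits + λ · RSD between the
    AAD-projected student embedding and the frozen teacher's embedding.
    The AAD module is used only here and discarded after training."""
    logits_s, z_s = student(images)          # assumed to return (logits, penultimate embedding)
    with torch.no_grad():                    # teacher stays frozen
        _, z_t = teacher(images)
    loss = F.cross_entropy(logits_s, labels) + lambda_rsd * rsd_loss(aad(z_s), z_t)
    loss.backward()
    return loss.detach()
```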
Key Designs¶
- Redundancy Suppression Distillation (RSD) Loss:
- Function: Extracts architecture-agnostic common knowledge from teacher and student representations.
- Mechanism: Constructs a Pearson correlation matrix \(\mathbf{P} \in \mathbb{R}^{D \times D}\) between teacher features \(\mathbf{z}^t\) and student features \(\mathbf{z}^s\), with the optimization target being the identity matrix \(\mathbf{T} = I\). (1) Diagonal elements → 1: maximizes cross-architecture invariance along matched dimensions (extracting common knowledge); (2) Off-diagonal elements → 0: decorrelates different feature dimensions, reducing their mutual dependence (suppressing redundant architecture-specific information). The loss is \(\mathcal{L}_{RSD} = d(\mathbf{P}(h(\mathbf{z}^s), \mathbf{z}^t), \mathbf{T})\), where \(d\) is the MSE distance and \(h\) is the AAD projection. The off-diagonal loss can be weighted by a coefficient κ.
- Design Motivation: Inspired by redundancy-reduction principles from self-supervised feature learning (Barlow Twins: invariance maximization plus feature decorrelation); reducing redundancy between feature dimensions encourages statistically independent, architecture-agnostic features.
- Architecture-Agnostic Decoupling (AAD) Module:
- Function: Acts as a buffer to prevent student internal representations from being completely overridden by the RSD objective, preserving architecture-specific beneficial capabilities of the student.
- Mechanism: A two-layer FC network (expander + adaptor) with BatchNorm and a GELU activation in between. The expander maps student embeddings to a higher-dimensional space; the adaptor aligns them to the teacher embedding dimension (see the sketch after this list).
- Design Motivation: Different architectures possess unique strengths (e.g., CNNs' sensitivity to local textures, which ViTs lack); fully overwriting these with architecture-agnostic knowledge would discard such capabilities. AAD serves as a buffer layer, directing RSD optimization onto the projected representations rather than directly modifying student internal representations.
- Design Rationale for Using Penultimate-Layer Embeddings:
- Function: Avoids complex dimension alignment issues associated with intermediate features.
- Mechanism: Penultimate-layer embeddings are always 1D vectors (not feature maps or token sequences), requiring no architecture-specific operations (depthwise separable convolutions, token operations, etc.). Being closer to the network output than intermediate features, they exhibit weaker architecture-specificity and are more amenable to extracting architecture-agnostic information.
- Design Motivation: This is precisely the root of OFA's complexity—it necessitates designing distinct projection modules for intermediate features of different architectures.
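A minimal sketch of the AAD module as described above (expander → BatchNorm → GELU → adaptor); the expansion factor and dimension names are illustrative assumptions, not values from the paper:

```python
import torch.nn as nn

class AAD(nn.Module):
    """Architecture-Agnostic Decoupling: expander -> BatchNorm -> GELU -> adaptor.
    Used only during training; discarded at inference."""
    def __init__(self, student_dim: int, teacher_dim: int, expansion: int = 4):
        super().__init__()
        hidden_dim = student_dim * expansion                 # expansion factor is an assumption
        self.expander = nn.Linear(student_dim, hidden_dim)   # lift student embedding to a wider space
        self.bn = nn.BatchNorm1d(hidden_dim)
        self.act = nn.GELU()
        self.adaptor = nn.Linear(hidden_dim, teacher_dim)    # align to the teacher embedding dimension

    def forward(self, z_s):                                  # z_s: (B, student_dim) penultimate embedding
        return self.adaptor(self.act(self.bn(self.expander(z_s))))
```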
Loss & Training¶
The RSD loss can be implemented in approximately 8 lines of PyTorch code: normalize features → compute cross-correlation matrix → diagonal MSE + weighted off-diagonal MSE. The training configuration follows OFA. RSD can also be applied in the logit space as a logit distiller, where it likewise performs strongly.
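A reconstruction of that loss from the description above (the normalization details and the default κ are assumptions, not the released code):

```python
import torch

def rsd_loss(z_s: torch.Tensor, z_t: torch.Tensor, kappa: float = 1.0) -> torch.Tensor:
    """Redundancy Suppression Distillation loss (sketch).
    z_s: AAD-projected student embeddings (B, D); z_t: teacher embeddings (B, D)."""
    # Standardize each dimension over the batch so z_s.T @ z_t / B forms a Pearson correlation matrix.
    z_s = (z_s - z_s.mean(0)) / (z_s.std(0) + 1e-6)
    z_t = (z_t - z_t.mean(0)) / (z_t.std(0) + 1e-6)
    p = (z_s.T @ z_t) / z_s.size(0)                               # (D, D) cross-architecture correlation
    on_diag = (torch.diagonal(p) - 1).pow(2).sum()                # invariance: pull diagonal toward 1
    off_diag = (p - torch.diag(torch.diagonal(p))).pow(2).sum()   # decorrelation: push off-diagonal toward 0
    return on_diag + kappa * off_diag                             # kappa = off-diagonal weight κ (value not from the paper)
```

Applied to logits rather than penultimate embeddings, the same function serves as the logit-space variant mentioned above.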
Key Experimental Results¶
Main Results¶
CIFAR-100 (12 heterogeneous teacher–student pairs; partial results shown):
| Teacher→Student | From Scratch | KD | OFA | RSD | RSD vs OFA |
|---|---|---|---|---|---|
| Swin-T→ResNet18 | 74.01 | 78.74 | 80.54 | 83.92 | +3.38 |
| ViT-S→MobileNetV2 | 73.68 | 72.77 | 78.45 | 81.68 | +3.23 |
| ConvNeXt-T→DeiT-T | 68.00 | 72.99 | 75.76 | 82.46 | +6.70 |
| ConvNeXt-T→ResMLP-S12 | 66.56 | 72.25 | 81.22 | 84.21 | +2.99 |
| Average Gain | - | +3.17 | +7.47 | +10.69 | +3.22 |
ImageNet-1k (15 heterogeneous teacher–student pairs; partial results shown):
| Teacher→Student | From Scratch | OFA | RSD | RSD vs OFA |
|---|---|---|---|---|
| Swin-T→ResNet18 | 69.75 | 71.85 | 72.13 | +0.28 |
| ConvNeXt-T→Swin-N | 75.53 | 77.50 | 77.70 | +0.20 |
| ConvNeXt-T→ResMLP-S12 | 76.65 | 77.53 | 78.41 | +0.88 |
| Average Gain | - | +2.20 | +2.34 | +0.14 |
Ablation Study¶
| Configuration | Swin-T→ResNet18 | ConvNeXt-T→ResMLP-S12 | Note |
|---|---|---|---|
| Baseline (scratch) | 74.01 | 76.65 | No distillation |
| + RSD-corr only | 80.65 | 83.40 | Invariance maximization only |
| + RSD-decorr | 83.92 | 84.21 | Full RSD with decorrelation |
Effect of AAD (CIFAR-100 / ImageNet):
| Configuration | ViT-S→ResMLP | ConvNeXt-T→Mixer |
|---|---|---|
| Full RSD | 82.94 | 80.73 |
| w/o AAD | 82.26 (−0.68) | 79.93 (−0.80) |
Parameter overhead comparison (ConvNeXt-T→Swin-N @ ImageNet):
| Method | Student Params | Extra Params | Extra/Student Ratio |
|---|---|---|---|
| OFA | 9.6M | 28.2M | 2.94× |
| RSD | 9.6M | ~2.8M | 0.29× |
RSD as a logit distiller:
| Logit Loss | Swin-T→ResNet18 | ConvNeXt-T→ResMLP |
|---|---|---|
| KD | 78.74 | 72.25 |
| DKD | 80.26 | 73.22 |
| OFA (logit component only) | 80.60 | 78.87 |
| RSD on logits | 83.23 | 81.15 |
Key Findings¶
- RSD achieves an average gain of +10.69% on CIFAR-100, substantially surpassing OFA's +7.47%.
- On ConvNeXt-T→DeiT-T, RSD outperforms OFA by 6.70%—nearly equal to the gap between OFA and no distillation.
- RSD alone as a logit distiller surpasses OFA's full framework (including all complex projectors).
- The decorrelation objective further improves performance in most settings.
- RSD also combines well with OFA's framework: replacing all of OFA's losses with RSD losses yields additional gains.
- CKA visualization confirms that RSD substantially increases feature similarity between heterogeneous architectures at intermediate and deep layers.
Highlights & Insights¶
- A paradigmatic example of "simple yet effective": an 8-line RSD loss outperforms the considerably more complex OFA, with only 1/10 of its parameter overhead.
- The redundancy suppression perspective offers a precise reformulation of the CAKD problem: rather than learning how to align heterogeneous features, the goal is to eliminate architecture-specific redundant information.
- Choosing penultimate-layer embeddings over intermediate features is a clever design decision that entirely circumvents the heterogeneous feature alignment challenge.
- The AAD module's design philosophy of "preserving student-specific capabilities" reflects a deep understanding of the essence of knowledge distillation.
Limitations & Future Work¶
- The advantage on ImageNet is less pronounced than on CIFAR-100 (+2.34% vs. +10.69%); further gains on large-scale datasets remain to be explored.
- Relying solely on 1D embeddings precludes the exploitation of rich spatial information in 2D feature maps, limiting applicability to spatially sensitive tasks such as object detection.
- The hyperparameters λ and κ exhibit some sensitivity and require careful tuning.
- Cross-architecture distillation for additional downstream tasks (e.g., object detection, semantic segmentation) has not been explored.
Related Work & Insights¶
- The information maximization and decorrelation principles from Barlow Twins are cleverly repurposed as cross-architecture distillation objectives.
- A conceptual connection exists with "domain-invariant representation learning" in domain generalization, though the context and methodological essence differ substantially.
- The simplicity and generality of RSD position it as a potential baseline method for the CAKD community.
Rating¶
- Novelty: ⭐⭐⭐⭐ The redundancy suppression perspective is novel, though the core technique (correlation matrix + decorrelation) is borrowed from the SSL literature.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 12+15 heterogeneous model pairs, validation on both CIFAR-100 and ImageNet, with comprehensive ablation, compatibility, and visualization analyses.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, naturally motivated derivations, and thorough comparative analysis against OFA.
- Value: ⭐⭐⭐⭐⭐ A simple and effective method of great community value; strong potential to become the standard strong baseline for CAKD.