Asymmetric Cross-Modal Knowledge Distillation: Bridging Modalities with Weak Semantic Consistency

Conference: AAAI2026
arXiv: 2511.08901
Code: weirl-922/ACKD
Area: Remote Sensing
Keywords: cross-modal knowledge distillation, optimal transport, remote sensing, scene classification, weak semantic consistency

TL;DR

This paper proposes a new paradigm, Asymmetric Cross-modal Knowledge Distillation (ACKD), realized through the SemBridge framework. SemBridge comprises two plug-and-play modules, self-supervised semantic matching and optimal-transport alignment, that enable cross-modal knowledge distillation under weak semantic consistency, allowing multispectral (MS) images collected in different geographic regions to effectively guide RGB-based remote sensing scene classification.

Background & Motivation

Conventional cross-modal knowledge distillation (CMKD) assumes strict semantic alignment (i.e., paired data) between teacher and student modalities, a setting referred to as Symmetric Cross-modal Knowledge Distillation (SCKD). In remote sensing, multispectral images are commonly used as the teacher modality due to their high spectral resolution; however, their high acquisition cost and reliance on specialized equipment make large-scale deployment impractical. In practice, only a small fraction of RGB images have corresponding MS paired data, severely limiting the applicability of SCKD.

The core motivation is: can knowledge be effectively distilled across modalities without strong semantic correspondence — for example, between MS images collected in Europe and RGB images collected in Asia? This motivates the ACKD setting proposed in this work, which relaxes the pairing constraint and permits cross-modal knowledge transfer under weak semantic consistency.

Core Problem

  1. High transfer cost due to semantic gap: Through rigorous analysis using optimal transport theory (Wasserstein distance), the authors demonstrate that the knowledge transfer cost under ACKD is substantially higher than under SCKD. Directly applying SCKD methods to ACKD not only yields poor performance but may even fall below the distillation-free baseline.
  2. Reduced mutual information: Weak semantic consistency not only increases transfer cost but also reduces inter-modal mutual information, thereby shrinking the overlap of transferable knowledge.
  3. Lack of dedicated frameworks: Existing KD methods (Vanilla KD, DKD, RKD, etc.) all fail to achieve satisfactory performance in the ACKD setting.
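As a toy illustration of point 1, the sketch below compares the Wasserstein-1 cost of matching a student's features to a semantically consistent teacher versus a shifted, unpaired one. The function name and the feature values are hypothetical, not from the paper; this only shows why the transport cost grows under ACKD.

```python
# Minimal illustration (not the paper's derivation): for 1-D empirical
# distributions with equal sample counts, the Wasserstein-1 distance is
# the mean absolute difference between sorted samples.
def wasserstein_1d(xs, ys):
    xs, ys = sorted(xs), sorted(ys)
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# Toy student features vs. paired (SCKD-like) and unpaired (ACKD-like) teachers:
student = [0.1, 0.2, 0.3, 0.4]
teacher_paired = [0.12, 0.21, 0.33, 0.41]   # strong semantic consistency
teacher_unpaired = [0.6, 0.8, 0.9, 1.1]     # weak semantic consistency

cost_sckd = wasserstein_1d(student, teacher_paired)
cost_ackd = wasserstein_1d(student, teacher_unpaired)
assert cost_ackd > cost_sckd  # transfer cost is higher under ACKD
```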

Method

Overall Architecture: SemBridge

SemBridge consists of two plug-and-play modules that can be stacked atop existing SCKD methods.

1. Student-Friendly Matching (SFM) Module

This module aims to reduce the optimal transport cost by adaptively selecting appropriate teacher samples for each student sample.

Self-supervised Semantic Matching (SSM):

  • Requires no paired RGB data; utilizes only MS images: pseudo-RGB images \(\tilde{G}\) are constructed by extracting the R/G/B channels from MS images.
  • A CLIP-style InfoNCE contrastive loss is used to train a matcher \(\mathcal{M} = (\mathcal{M}_V, \mathcal{M}_G)\) to learn cross-modal semantic representations.
  • For each student RGB sample, the teacher MS sample with the highest cosine similarity within the same class is selected as the initial match.
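
A minimal sketch of the initial matching step, assuming matcher embeddings have already been computed; `match_teacher` and all values are hypothetical, and the InfoNCE training of the matcher \(\mathcal{M}\) is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def match_teacher(student_emb, student_label, teacher_embs, teacher_labels):
    """Return the index of the same-class teacher sample most similar
    to the student sample (the SSM initial-match rule)."""
    best_idx, best_sim = None, -2.0
    for i, (emb, lbl) in enumerate(zip(teacher_embs, teacher_labels)):
        if lbl != student_label:  # restrict the search to the same class
            continue
        sim = cosine(student_emb, emb)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return best_idx

# Toy usage: two class-0 teacher candidates, one class-1 candidate.
teacher_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
teacher_labels = [0, 0, 1]
idx = match_teacher([1.0, 0.1], 0, teacher_embs, teacher_labels)  # picks index 0
```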

Dynamic Matching (DynM):

  • Inspired by the pedagogical concept of "changing teachers at different learning stages," teacher–student pairings are periodically updated during training.
  • The current student model is used to compute KL divergence, and the teacher sample with the lowest divergence (i.e., most compatible with the student's current state) is selected.
  • Matching intervals are progressively increased following a curriculum learning schedule: \(e_t = e_0 + \sum_{i=1}^{t}(\Delta e + e_\mu(i-1))\)
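
The two DynM ingredients can be sketched as follows; `matching_epochs` and `rematch` are hypothetical names and the probability vectors are toy values. The schedule implements the formula above, so re-matching intervals grow linearly over training (a curriculum).

```python
import math

def matching_epochs(e0, delta_e, e_mu, num_updates):
    """Epochs at which teacher-student pairs are re-matched:
    e_t = e_0 + sum_{i=1}^{t} (delta_e + e_mu * (i - 1))."""
    epochs, e = [e0], e0
    for i in range(1, num_updates + 1):
        e += delta_e + e_mu * (i - 1)
        epochs.append(e)
    return epochs

def kl_div(p, q):
    """KL(p || q) for discrete probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rematch(student_probs, candidate_teacher_probs):
    """Pick the candidate teacher sample with the lowest KL divergence
    from the current student prediction (the DynM selection rule above)."""
    return min(range(len(candidate_teacher_probs)),
               key=lambda i: kl_div(student_probs, candidate_teacher_probs[i]))

# Toy usage: intervals between matches grow 2, 3, 4, 5 epochs.
schedule = matching_epochs(0, 2, 1, 4)  # [0, 2, 5, 9, 14]
```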

2. Semantic-aware Knowledge Alignment (SKA) Module

This module further optimizes the transport path between matched pairs, referred to as the Planner.

  • The unmerged features \(z_T\) and \(z_S\) of the teacher and student are flattened into patch sequences.
  • A learnable multi-head attention structure replaces hand-crafted cost functions and regularization coefficients to compute intra-modal transport plans: \(\pi = \text{softmax}(QK^\top / \sqrt{d})\)
  • Cross-modal transport plans are constructed by performing horizontal/vertical mean pooling on teacher and student representations separately, followed by cross-multiplication.
  • CORAL is applied to align refined features (\(\mathcal{L}_{ot1}\)) and fused features (\(\mathcal{L}_{ot2}\)).
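
A toy sketch of the two core operations, the attention-style transport plan and a CORAL-style covariance alignment loss; function names and shapes are illustrative assumptions, not the paper's implementation (the pooled cross-modal plan construction and feature fusion are omitted).

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def transport_plan(Q, K):
    """pi = softmax(Q K^T / sqrt(d)); each row is a transport plan
    over teacher patches for one student patch."""
    d = len(Q[0])
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
              for q in Q]
    return [softmax(row) for row in scores]

def coral_loss(Xs, Xt):
    """Squared Frobenius distance between feature covariances (CORAL),
    normalized by 4 d^2; rows are samples, columns are feature dims."""
    def cov(X):
        n, d = len(X), len(X[0])
        mu = [sum(col) / n for col in zip(*X)]
        return [[sum((X[k][i] - mu[i]) * (X[k][j] - mu[j]) for k in range(n)) / (n - 1)
                 for j in range(d)] for i in range(d)]
    Cs, Ct = cov(Xs), cov(Xt)
    d = len(Cs)
    return sum((Cs[i][j] - Ct[i][j]) ** 2 for i in range(d) for j in range(d)) / (4 * d * d)

# Toy usage: one student patch attends over two teacher patches.
plan = transport_plan([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
```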

Loss & Training

\[\mathcal{L}_{all} = \mathcal{L}_{task} + \lambda_1 \mathcal{L}_{kd} + \lambda_2 (\mathcal{L}_{ot1} + \mathcal{L}_{ot2})\]

where \(\lambda_2 = 1 - \lambda_1\), and \(\mathcal{L}_{kd}\) can be any existing SCKD loss.
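
The combination is a one-liner; a worked sketch with hypothetical loss values:

```python
def total_loss(task, kd, ot1, ot2, lam1=0.5):
    """L_all = L_task + lam1 * L_kd + (1 - lam1) * (L_ot1 + L_ot2),
    where lam2 = 1 - lam1 couples the two weights."""
    lam2 = 1.0 - lam1
    return task + lam1 * kd + lam2 * (ot1 + ot2)

# Toy values: 1 + 0.5*2 + 0.5*(3 + 4) = 5.5
loss = total_loss(1.0, 2.0, 3.0, 4.0, lam1=0.5)
```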

Dataset Construction

The authors construct a benchmark comprising three subsets:

| Subset | MS Source | RGB Source | MS Bands | Classes | Label Type |
|---|---|---|---|---|---|
| S2S-EU | Sentinel-2 (Europe) | Unpaired RGB | 10 | 10 | Single→Single |
| S2S-CN | Tiangong-2 (China) | Unpaired RGB | 14 | 10 | Single→Single |
| M2S-GL | Sentinel-2 (Global) | Unpaired RGB | 10 | 15 | Multi→Single |

In total, the benchmark contains 70,414 MS images and 63,549 unpaired RGB images.

Key Experimental Results

Comparison with distillation-free baseline (ResNet34 homogeneous model, OA):

| Dataset | Baseline | +SemBridge | Gain |
|---|---|---|---|
| S2S-EU | 91.7 | 93.7 | +2.0 |
| S2S-CN | 94.9 | 96.2 | +1.3 |
| M2S-GL | 94.9 | 96.6 | +1.7 |

Comparison with SOTA methods (R/R denotes ResNet34→ResNet34; OA reported as S2S-EU / S2S-CN / M2S-GL):

  • SemBridge (Vanilla KD): 93.7 / 96.2 / 96.6
  • Strongest competitors: CTKD reaches 92.5 on S2S-EU; LSKD reaches 95.4 on both S2S-CN and M2S-GL.

Generalization: SemBridge improves all 6 SCKD methods as a plug-in, with DKD achieving the largest gain of +14.9% OA on M2S-GL.

Ablation Study

(R/R, OA):

| SSM | DynM | \(\mathcal{L}_{ot1}\) | \(\mathcal{L}_{ot2}\) | S2S-EU | S2S-CN | M2S-GL |
|---|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | ✗ | 92.5 | 95.3 | 95.6 |
| ✓ | ✓ | ✓ | ✓ | 93.7 | 96.2 | 96.6 |

All four components are indispensable; the full configuration achieves optimal performance. Additional training overhead is approximately 8.7%–18.6%.

Highlights & Insights

  1. Clear problem formulation: The paper is the first to formally define ACKD, systematically contrasting it with SCKD and providing rigorous theoretical analysis via optimal transport theory.
  2. Plug-and-play design: The two SemBridge modules can be seamlessly integrated into any existing SCKD method, ensuring strong generalizability.
  3. Clever self-supervised matching: Pseudo-paired data are constructed by extracting RGB channels from MS images, eliminating the dependency on genuine paired data.
  4. Pedagogically motivated dynamic matching: The curriculum learning strategy — progressing from easy to hard — draws an intuitive analogy to human education, further boosting performance.
  5. Comprehensive benchmark construction: The benchmark spans multiple sensors (Sentinel-2, Tiangong-2), geographic regions (Europe/China/Global), and label types.

Limitations & Future Work

  1. Training overhead: Student-Friendly Matching, particularly DynM, introduces additional training time (quantified in Table 8); the authors acknowledge this as a direction for future improvement.
  2. Evaluation limited to remote sensing scene classification: While the ACKD concept is broadly applicable, experiments are confined to the remote sensing domain, lacking validation on natural image datasets or other domains.
  3. Intra-class global search for teacher samples: When intra-class sample sizes are large, the search cost may increase substantially.
  4. Conventional CORAL alignment: More advanced domain adaptation methods could be explored as alternatives to CORAL.
Relation to Prior Methods & Theory

  • Conventional CMKD (SCKD) methods: Vanilla KD, RKD, DKD, and others suffer significant performance degradation under the ACKD setting, with some even falling below the distillation-free baseline.
  • VPR: Designed for cross-modal settings but assumes semantic consistency; it performs extremely poorly under ACKD (e.g., 46.2% on S2S-EU), validating the necessity of the ACKD formulation.
  • Optimal transport: The paper employs Wasserstein distance to quantify transfer difficulty and Lagrangian optimization to derive transport plans, providing a solid theoretical foundation.

Takeaways

The ACKD concept is extensible to other cross-modal scenarios (e.g., LiDAR→RGB, SAR→RGB). The self-supervised matching design, which constructs pseudo-paired data via sub-channel extraction from multispectral images, is applicable to other modality-missing settings. The combination of dynamic teacher matching with curriculum learning offers insights for large-scale knowledge distillation. The benchmark construction methodology (collecting weakly paired data across geographic regions) is a valuable reference for other remote sensing tasks.

Rating

  • Novelty: ⭐⭐⭐⭐ — The ACKD problem formulation is novel and the theoretical analysis is rigorous.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Three datasets, six model combinations, seven baselines, and complete ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure, consistent notation, and complete theoretical derivations.
  • Value: ⭐⭐⭐⭐ — Opens a new direction for cross-modal distillation under weak semantic consistency with practical application potential.