Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation

Conference: CVPR 2026
arXiv: 2602.19863
Code: Project Page
Area: Image Segmentation
Keywords: Remote Sensing Foundation Model, Multispectral, Knowledge Distillation, Contrastive Learning, Dual-Teacher Training

TL;DR

This paper proposes DEO (Distillation for Earth Observation), a dual-teacher contrastive distillation framework that employs a multispectral self-distillation teacher to learn spectral representations and a frozen optical VFM teacher (DINOv3) to inject high-level semantic priors. The resulting single student network excels at both optical and multispectral remote sensing tasks, achieving state-of-the-art performance across semantic segmentation, change detection, and classification.

Background & Motivation

Background: Foundation models are transforming Earth Observation (EO). Large amounts of unlabeled data combined with flexible task adaptation make them especially valuable in annotation-scarce EO settings. However, given the diversity of EO sensors and modalities, training a single universal model is impractical; multiple specialized foundation models will coexist.

Limitations of Prior Work:

  • Most EO pre-training relies on Masked Image Modeling (MIM), which emphasizes local reconstruction but offers limited control over global semantic structure
  • General-purpose VFMs (e.g., DINOv2/DINOv3) possess strong optical semantic priors but lack multispectral (MS) capability
  • Training MS foundation models from scratch is computationally expensive

Key Challenge: How can the strong optical semantic priors of VFMs be efficiently transferred to a multispectral student without compromising the learning of MS-specific information? Existing approaches (e.g., Copernicus-FM) combine MIM with VFM distillation, but the MIM objective is incompatible with the contrastive self-distillation objective of VFMs, resulting in weaker global semantic structure.

Goal: Propose a pre-training strategy that enables strong performance when multispectral data is available, without sacrificing performance on optical-only tasks.

Key Insight: Align the pre-training objective of the student with that of the VFM teacher — if the VFM was trained with contrastive self-distillation, the student should be trained the same way, making latent feature space alignment more tractable.

Core Idea: Dual teacher = a multispectral contrastive self-distillation teacher (for structured MS feature space) + a frozen optical VFM teacher (for global semantic priors), unified under a contrastive distillation framework.

Method

Overall Architecture

As shown in Figure 2:

  • Input Augmentation: Multi-scale global/local views generated from Sentinel-2 multispectral images (a multi-crop sketch follows this list)
  • Multispectral Branch (red): MS teacher (EMA-updated) + student, contrastive self-distillation
  • Optical Branch (blue): Frozen DINOv3 teacher + student, feature distillation
  • Student Network (green): Swin Transformer backbone with dual patch embeddings (10-channel for MS, 3-channel for optical)
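To make the multi-crop step concrete, here is a minimal PyTorch sketch. It is not the paper's exact recipe: the crop counts, output sizes, and scale ranges are placeholder values, and `random_resized_crop`/`make_views` are hypothetical helpers.

```python
import torch
import torch.nn.functional as F

def random_resized_crop(x: torch.Tensor, out_size: int, scale=(0.4, 1.0)) -> torch.Tensor:
    """Crop a random sub-region of a (C, H, W) image and resize it to out_size."""
    _, h, w = x.shape
    frac = torch.empty(1).uniform_(*scale).sqrt().item()    # side-length fraction
    ch, cw = max(1, int(h * frac)), max(1, int(w * frac))
    top = torch.randint(0, h - ch + 1, (1,)).item()
    left = torch.randint(0, w - cw + 1, (1,)).item()
    crop = x[:, top:top + ch, left:left + cw].unsqueeze(0)   # add batch dim for interpolate
    return F.interpolate(crop, size=out_size, mode="bilinear", align_corners=False)[0]

def make_views(ms_img: torch.Tensor, n_global: int = 2, n_local: int = 6):
    """Global views keep most of the scene; local views are small multi-scale crops."""
    global_views = [random_resized_crop(ms_img, 224, scale=(0.4, 1.0)) for _ in range(n_global)]
    local_views = [random_resized_crop(ms_img, 96, scale=(0.05, 0.4)) for _ in range(n_local)]
    return global_views, local_views

# Example: a 10-band Sentinel-2 patch.
views_g, views_l = make_views(torch.randn(10, 264, 264))
```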

Key Designs

1. Multispectral Contrastive Self-Distillation

  • Function: Learn robust multispectral representations
  • Mechanism: Based on the DINO framework; MS teacher weights are updated via EMA. The loss combines cosine similarity (compression) and coding rate regularization (expansion): \(\mathcal{L}_{MS} = \mathcal{L}_\text{cos}(p_M(\mathbf{z}_g^M), p_s^{MS}(\mathbf{z}_{g \cup l}^M)) - \gamma \mathcal{L}_{CR}(\cdot)\), where \(\mathcal{L}_{CR} = -\log\det(\mathbf{I} + \text{Cov}[\mathbf{z}])\) prevents representation collapse (a loss sketch follows after this list)
  • Design Motivation: Contrastive learning yields strong semantic representations invariant to distributional shifts; coding rate regularization replaces conventional temperature scaling/negative sampling strategies to prevent collapse more elegantly
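As a minimal sketch of this compression/expansion trade-off (not the authors' implementation): `coding_rate` computes the \(\log\det(\mathbf{I} + \text{Cov}[\mathbf{z}])\) term, and the loss minimizes a cosine distance to the EMA teacher while rewarding a high coding rate on the student projections. Projection-head details, centering, and the paper's exact sign conventions are simplified here.

```python
import torch
import torch.nn.functional as F

def coding_rate(z: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """logdet(I + Cov[z]) over a batch of (N, D) projections; larger = less collapsed."""
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / max(z.shape[0] - 1, 1)
    eye = torch.eye(cov.shape[0], device=z.device, dtype=z.dtype)
    return torch.logdet(eye + cov + eps * eye)   # eps keeps the matrix well-conditioned

def ms_self_distillation_loss(p_teacher: torch.Tensor,
                              p_student: torch.Tensor,
                              gamma: float = 1.0) -> torch.Tensor:
    """Cosine alignment to the EMA teacher ('compression') minus a coding-rate
    bonus on the student projections ('expansion', discouraging collapse)."""
    cos_dist = 1.0 - F.cosine_similarity(p_teacher.detach(), p_student, dim=-1).mean()
    return cos_dist - gamma * coding_rate(p_student)
```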

2. Optical VFM Distillation

  • Function: Transfer DINOv3's global semantic and pixel-level features to the student
  • Mechanism: Three categories of features are distilled via independent projection heads (a sketch follows after this list): \(\mathcal{L}_O = \alpha_1 \mathcal{L}_\text{cos}(\text{[cls]}_F) + \alpha_2 \mathcal{L}_\text{cos}(\text{[p]}_F) + \alpha_3 \mathcal{L}_\text{cos}(\text{[p]}_\text{mid})\)
  • \(\text{[cls]}_F\): final-layer class token (global semantics)
  • \(\text{[p]}_F\): final-layer patch tokens (pixel-level features)
  • \(\text{[p]}_\text{mid}\): intermediate-layer patch tokens (mid-level features)
  • Design Motivation: Distilling only the class token is insufficient for dense prediction tasks; patch-level features are necessary. Intermediate-layer features provide complementary mid-level semantic information.
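A minimal sketch of the three-term distillation under assumed names (`OpticalDistillHeads`, with single linear heads standing in for whatever projection heads the paper actually uses): each student feature is projected to the teacher's dimension and matched to the corresponding frozen DINOv3 feature with a cosine loss, weighted by \(\alpha_1, \alpha_2, \alpha_3\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cos_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """1 - mean cosine similarity along the feature dimension."""
    return 1.0 - F.cosine_similarity(a, b, dim=-1).mean()

class OpticalDistillHeads(nn.Module):
    """Hypothetical container for the three independent projection heads."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.cls_head = nn.Linear(student_dim, teacher_dim)
        self.patch_head = nn.Linear(student_dim, teacher_dim)
        self.mid_head = nn.Linear(student_dim, teacher_dim)

    def forward(self, s_cls, s_patch, s_mid, t_cls, t_patch, t_mid,
                a1: float = 1.0, a2: float = 0.5, a3: float = 0.5):
        # Teacher features come from the frozen DINOv3 branch, hence detach().
        return (a1 * cos_loss(self.cls_head(s_cls), t_cls.detach())
                + a2 * cos_loss(self.patch_head(s_patch), t_patch.detach())
                + a3 * cos_loss(self.mid_head(s_mid), t_mid.detach()))
```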

3. Backbone Selection and Data Strategy

  • Backbone: Swin Transformer (patch size 4 vs. ViT's 16), yielding finer feature resolution
  • Data: fMoW-Sentinel (MS) + fMoW-RGB (optical), with low-resolution optical bands replaced by 150K high-resolution aerial images
  • Dual Patch Embedding: 10-channel for MS, 3-channel for optical; subsequent Transformer layers are shared
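A minimal sketch of the dual-patch-embedding idea, with assumed layer names: two modality-specific convolutional stems (10-channel MS, 3-channel optical) at the Swin-style patch size of 4 produce identically shaped features, so all subsequent backbone stages can be shared.

```python
import torch
import torch.nn as nn

class DualPatchEmbed(nn.Module):
    """Modality-specific stems feeding a shared Swin-style backbone (sketch)."""
    def __init__(self, embed_dim: int = 96, patch_size: int = 4):
        super().__init__()
        self.ms_embed = nn.Conv2d(10, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.rgb_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Route by channel count; downstream transformer blocks are shared.
        stem = self.ms_embed if x.shape[1] == 10 else self.rgb_embed
        return stem(x)                      # (B, embed_dim, H/4, W/4)

# Both modalities map to the same feature shape for the shared backbone.
embed = DualPatchEmbed()
assert embed(torch.randn(2, 10, 224, 224)).shape == embed(torch.randn(2, 3, 224, 224)).shape
```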

Loss & Training

\[\mathcal{L} = -\mathcal{L}_{MS} - \mathcal{L}_O\]

Multispectral and optical objectives are jointly optimized with coefficients \(\alpha_1=1, \alpha_2=0.5, \alpha_3=0.5, \gamma=1\).
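A minimal training-step sketch under assumed module names: both losses are written here as quantities to minimize (the summary above states the total objective as a negated sum of the two terms), the MS teacher is kept as an EMA copy of the student, and the frozen DINOv3 teacher receives no updates. The momentum value is illustrative.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, m: float = 0.996):
    """EMA update of the MS teacher from the student (identical architectures assumed)."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(m).add_(s_p, alpha=1.0 - m)

def training_step(loss_ms: torch.Tensor, loss_o: torch.Tensor,
                  optimizer: torch.optim.Optimizer,
                  ms_teacher: torch.nn.Module, student: torch.nn.Module) -> torch.Tensor:
    """Joint optimization of the multispectral and optical objectives."""
    loss = loss_ms + loss_o           # alpha_1, alpha_2, alpha_3, gamma applied inside each term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(ms_teacher, student)   # frozen DINOv3 teacher is never updated
    return loss.detach()
```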

Key Experimental Results

Main Results: Semantic Segmentation (mIoU)

Optical Segmentation:

| Method | SpaceNet | GB-cattle | GB-pv | GB-chesa. | Avg. |
| --- | --- | --- | --- | --- | --- |
| DINOv3-B (RGB) | 79.06 | 73.01 | 94.34 | 64.04 | 77.61 |
| Copernicus-FM (MS) | 75.45 | 68.88 | 93.56 | 55.81 | 73.43 |
| DEO | 82.22 | 76.22 | 95.36 | 75.08 | 82.22 |

Multispectral Segmentation:

| Method | GB-SA-crop | GB-cashew | S1F11 | PASTIS | Avg. |
| --- | --- | --- | --- | --- | --- |
| TerraFM (MS) | 30.95 | 59.49 | 92.72 | 19.65 | 50.70 |
| Copernicus-FM (MS) | - | 55.71 | 92.58 | 21.49 | 51.11 |
| DEO | 36.59 | 65.60 | 93.30 | 23.06 | 63.51 |
  • The MS segmentation average surpasses the previous SOTA (63.51 vs. 51.11)

Change Detection (F1)

| Method | LEVIR (Optical) | OSCD (MS) | Avg. |
| --- | --- | --- | --- |
| DINOv3-LS | 91.8 | 57.2 | 74.5 |
| TerraFM | 89.5 | 57.5 | 73.5 |
| DEO | 91.3 | 59.2 | 75.3 |

Classification (Linear Probing)

| Method | m-bigearthnet (F1) | m-so2sat (Top-1) | m-eurosat (Top-1) | Avg. |
| --- | --- | --- | --- | --- |
| DINOv3-B | 55.48 | - | 93.3 | - |
| TerraFM | - | 47.57 | 93.1 | 67.61 |
| DEO | 58.43 | 53.09 | 93.8 | 68.44 |

Ablation Study

| Component | Optical Avg. | MS Avg. | Overall Avg. |
| --- | --- | --- | --- |
| Base (MS self-distillation only) | 77.87 | 60.44 | 69.16 |
| +DINOv3 [cls] | 79.07 (+1.20) | 62.81 (+2.37) | 70.94 |
| +Separate optical path | 81.20 (+2.13) | 62.69 (-0.12) | 71.95 |
| +DINOv3 [p] | 81.74 (+0.53) | 62.46 | 72.10 |
| +Optical augmentation | 81.95 | 63.02 (+0.55) | 72.48 |
| +High-resolution optical | 82.22 (+0.27) | 63.51 (+0.50) | 72.87 |

Key Findings

  1. Optical VFM distillation improves not only optical but also MS performance: Adding DINOv3 [cls] distillation yields +2.37 pp on MS average.
  2. Objective compatibility is critical: The contrastive self-distillation objective aligns naturally with DINOv3's training objective, enabling inherent feature space alignment (confirmed by PCA visualization in Figure 3).
  3. All components contribute cumulatively: Overall average improves from 69.16 (base) to 72.87 (full model), with each component contributing positively.
  4. DEO ranks first overall: Achieves the highest average rank across 11 benchmarks (Table 4), with only 87M parameters and 500K pre-training images.

Highlights & Insights

  1. Deep insight into objective compatibility: The student's pre-training objective should match that of the teacher model — this explains why MIM + VFM distillation (e.g., Copernicus-FM) underperforms contrastive distillation + VFM distillation.
  2. Exceptional efficiency: Trained on only 500K images (vs. TerraFM's 18M), with 87M parameters (vs. DINOv3-LS's 303M) and 100 epochs on 16× A100s, yet achieves comprehensive state-of-the-art results.
  3. Non-destructive multimodality: Incorporating MS capability does not sacrifice optical performance — a rare quality in multimodal foundation models.
  4. Swin over ViT: The finer feature resolution from patch size 4 is critical for dense prediction tasks; cross-architecture distillation from a ViT teacher to a Swin student is shown to be effective.

Limitations & Future Work

  1. Limited to 10 Sentinel-2 bands: SAR, thermal infrared, and other modalities are not addressed.
  2. Spatial resolution constraint: Sentinel-2's native 10–60 m resolution limits applications; while high-resolution optical data partially replaces low-resolution bands, MS bands remain low-resolution.
  3. Geographic bias in fMoW: The dataset primarily covers certain regions; generalization to polar, oceanic, and other underrepresented areas remains unknown.
  4. Whether larger student models could further benefit from this framework is left unexplored.

Additional Notes

  • DINOv3: The latest vision foundation model, with particular attention to remote sensing; DEO demonstrates the merit of efficiently leveraging its knowledge rather than competing with it from scratch.
  • Coding Rate Regularization: Derived from MCR² (Ma et al.), it replaces negative sampling/temperature scaling in conventional contrastive learning, preventing representation collapse more elegantly.
  • Implications for the EO community: Rather than investing enormous compute to train MS foundation models from scratch, efficiently absorbing knowledge from existing VFMs via distillation points toward a sustainable EO foundation model ecosystem.

Rating

⭐⭐⭐⭐⭐ — Insightful (objective compatibility), highly efficient (SOTA with only 500K images), and experimentally comprehensive (11 datasets across 3 tasks). An outstanding contribution to the remote sensing foundation model literature.