Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation

Conference: CVPR 2026
arXiv: 2602.19863
Code: Project Page
Area: Image Segmentation
Keywords: Remote Sensing Foundation Model, Multispectral, Knowledge Distillation, Contrastive Learning, Dual-Teacher Training

TL;DR

This paper proposes DEO (Distillation for Earth Observation), a dual-teacher contrastive distillation framework that employs a multispectral self-distillation teacher to learn spectral representations and a frozen optical VFM teacher (DINOv3) to inject high-level semantic priors. The resulting single student network excels at both optical and multispectral remote sensing tasks, achieving state-of-the-art performance across semantic segmentation, change detection, and classification.

Background & Motivation

Background: Foundation models are transforming Earth Observation (EO). Large amounts of unlabeled data combined with flexible task adaptation make them especially valuable in annotation-scarce EO settings. However, given the diversity of EO sensors and modalities, training a single universal model is impractical; multiple specialized foundation models will coexist.

Limitations of Prior Work:

  • Most EO pre-training relies on Masked Image Modeling (MIM), which emphasizes local reconstruction but offers limited control over global semantic structure
  • General-purpose VFMs (e.g., DINOv2/DINOv3) possess strong optical semantic priors but lack multispectral (MS) capability
  • Training MS foundation models from scratch is computationally expensive

Key Challenge: How can the strong optical semantic priors of VFMs be efficiently transferred to a multispectral student without compromising the learning of MS-specific information? Existing approaches (e.g., Copernicus-FM) combine MIM with VFM distillation, but the MIM objective is incompatible with the contrastive self-distillation objective of VFMs, resulting in weaker global semantic structure.

Goal: Propose a pre-training strategy that enables strong performance when multispectral data is available, without sacrificing performance on optical-only tasks.

Key Insight: Align the pre-training objective of the student with that of the VFM teacher — if the VFM was trained with contrastive self-distillation, the student should be trained the same way, making latent feature space alignment more tractable.

Core Idea: Dual teacher = a multispectral contrastive self-distillation teacher (for structured MS feature space) + a frozen optical VFM teacher (for global semantic priors), unified under a contrastive distillation framework.

Method

Overall Architecture

As shown in Figure 2:

  • Input Augmentation: Multi-scale global/local views generated from Sentinel-2 multispectral images (a multi-crop sketch follows this list)
  • Multispectral Branch (red): MS teacher (EMA-updated) + student, contrastive self-distillation
  • Optical Branch (blue): Frozen DINOv3 teacher + student, feature distillation
  • Student Network (green): Swin Transformer backbone with dual patch embeddings (10-channel for MS, 3-channel for optical)
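To make the multi-crop step concrete, here is a minimal PyTorch sketch. It is not the paper's exact recipe: the crop counts, output sizes, and scale ranges are placeholder values, and `random_resized_crop`/`make_views` are hypothetical helpers.

```python
import torch
import torch.nn.functional as F

def random_resized_crop(x: torch.Tensor, out_size: int, scale=(0.4, 1.0)) -> torch.Tensor:
    """Crop a random sub-region of a (C, H, W) image and resize it to out_size."""
    _, h, w = x.shape
    frac = torch.empty(1).uniform_(*scale).sqrt().item()    # side-length fraction
    ch, cw = max(1, int(h * frac)), max(1, int(w * frac))
    top = torch.randint(0, h - ch + 1, (1,)).item()
    left = torch.randint(0, w - cw + 1, (1,)).item()
    crop = x[:, top:top + ch, left:left + cw].unsqueeze(0)   # add batch dim for interpolate
    return F.interpolate(crop, size=out_size, mode="bilinear", align_corners=False)[0]

def make_views(ms_img: torch.Tensor, n_global: int = 2, n_local: int = 6):
    """Global views keep most of the scene; local views are small multi-scale crops."""
    global_views = [random_resized_crop(ms_img, 224, scale=(0.4, 1.0)) for _ in range(n_global)]
    local_views = [random_resized_crop(ms_img, 96, scale=(0.05, 0.4)) for _ in range(n_local)]
    return global_views, local_views

# Example: a 10-band Sentinel-2 patch.
views_g, views_l = make_views(torch.randn(10, 264, 264))
```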

Key Designs

1. Multispectral Contrastive Self-Distillation

  • Function: Learn robust multispectral representations
  • Mechanism: Based on the DINO framework; MS teacher weights are updated via EMA. The loss combines cosine similarity (compression) and coding rate regularization (expansion): \(\mathcal{L}_{MS} = \mathcal{L}_\text{cos}(p_M(\mathbf{z}_g^M), p_s^{MS}(\mathbf{z}_{g \cup l}^M)) - \gamma \mathcal{L}_{CR}(\cdot)\), where \(\mathcal{L}_{CR} = -\log\det(\mathbf{I} + \text{Cov}[\mathbf{z}])\) prevents representation collapse (a loss sketch follows after this list)
  • Design Motivation: Contrastive learning yields strong semantic representations invariant to distributional shifts; coding rate regularization replaces conventional temperature scaling/negative sampling strategies to prevent collapse more elegantly
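As a minimal sketch of this compression/expansion trade-off (not the authors' implementation): `coding_rate` computes the \(\log\det(\mathbf{I} + \text{Cov}[\mathbf{z}])\) term, and the loss minimizes a cosine distance to the EMA teacher while rewarding a high coding rate on the student projections. Projection-head details, centering, and the paper's exact sign conventions are simplified here.

```python
import torch
import torch.nn.functional as F

def coding_rate(z: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """logdet(I + Cov[z]) over a batch of (N, D) projections; larger = less collapsed."""
    z = z - z.mean(dim=0, keepdim=True)
    cov = (z.T @ z) / max(z.shape[0] - 1, 1)
    eye = torch.eye(cov.shape[0], device=z.device, dtype=z.dtype)
    return torch.logdet(eye + cov + eps * eye)   # eps keeps the matrix well-conditioned

def ms_self_distillation_loss(p_teacher: torch.Tensor,
                              p_student: torch.Tensor,
                              gamma: float = 1.0) -> torch.Tensor:
    """Cosine alignment to the EMA teacher ('compression') minus a coding-rate
    bonus on the student projections ('expansion', discouraging collapse)."""
    cos_dist = 1.0 - F.cosine_similarity(p_teacher.detach(), p_student, dim=-1).mean()
    return cos_dist - gamma * coding_rate(p_student)
```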

2. Optical VFM Distillation

  • Function: Transfer DINOv3's global semantic and pixel-level features to the student
  • Mechanism: Three categories of features are distilled via independent projection heads (a sketch follows after this list): \(\mathcal{L}_O = \alpha_1 \mathcal{L}_\text{cos}(\text{[cls]}_F) + \alpha_2 \mathcal{L}_\text{cos}(\text{[p]}_F) + \alpha_3 \mathcal{L}_\text{cos}(\text{[p]}_\text{mid})\)
  • \(\text{[cls]}_F\): final-layer class token (global semantics)
  • \(\text{[p]}_F\): final-layer patch tokens (pixel-level features)
  • \(\text{[p]}_\text{mid}\): intermediate-layer patch tokens (mid-level features)
  • Design Motivation: Distilling only the class token is insufficient for dense prediction tasks; patch-level features are necessary. Intermediate-layer features provide complementary mid-level semantic information.
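A minimal sketch of the three-term distillation under assumed names (`OpticalDistillHeads`, with single linear heads standing in for whatever projection heads the paper actually uses): each student feature is projected to the teacher's dimension and matched to the corresponding frozen DINOv3 feature with a cosine loss, weighted by \(\alpha_1, \alpha_2, \alpha_3\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cos_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """1 - mean cosine similarity along the feature dimension."""
    return 1.0 - F.cosine_similarity(a, b, dim=-1).mean()

class OpticalDistillHeads(nn.Module):
    """Hypothetical container for the three independent projection heads."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.cls_head = nn.Linear(student_dim, teacher_dim)
        self.patch_head = nn.Linear(student_dim, teacher_dim)
        self.mid_head = nn.Linear(student_dim, teacher_dim)

    def forward(self, s_cls, s_patch, s_mid, t_cls, t_patch, t_mid,
                a1: float = 1.0, a2: float = 0.5, a3: float = 0.5):
        # Teacher features come from the frozen DINOv3 branch, hence detach().
        return (a1 * cos_loss(self.cls_head(s_cls), t_cls.detach())
                + a2 * cos_loss(self.patch_head(s_patch), t_patch.detach())
                + a3 * cos_loss(self.mid_head(s_mid), t_mid.detach()))
```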

3. Backbone Selection and Data Strategy

  • Backbone: Swin Transformer (patch size 4 vs. ViT's 16), yielding finer feature resolution
  • Data: fMoW-Sentinel (MS) + fMoW-RGB (optical), with low-resolution optical bands replaced by 150K high-resolution aerial images
  • Dual Patch Embedding: 10-channel for MS, 3-channel for optical; subsequent Transformer layers are shared
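A minimal sketch of the dual-patch-embedding idea, with assumed layer names: two modality-specific convolutional stems (10-channel MS, 3-channel optical) at the Swin-style patch size of 4 produce identically shaped features, so all subsequent backbone stages can be shared.

```python
import torch
import torch.nn as nn

class DualPatchEmbed(nn.Module):
    """Modality-specific stems feeding a shared Swin-style backbone (sketch)."""
    def __init__(self, embed_dim: int = 96, patch_size: int = 4):
        super().__init__()
        self.ms_embed = nn.Conv2d(10, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.rgb_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Route by channel count; downstream transformer blocks are shared.
        stem = self.ms_embed if x.shape[1] == 10 else self.rgb_embed
        return stem(x)                      # (B, embed_dim, H/4, W/4)

# Both modalities map to the same feature shape for the shared backbone.
embed = DualPatchEmbed()
assert embed(torch.randn(2, 10, 224, 224)).shape == embed(torch.randn(2, 3, 224, 224)).shape
```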

Loss & Training

\[\mathcal{L} = -\mathcal{L}_{MS} - \mathcal{L}_O\]

Multispectral and optical objectives are jointly optimized with coefficients \(\alpha_1=1, \alpha_2=0.5, \alpha_3=0.5, \gamma=1\).
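A minimal training-step sketch under assumed module names: both losses are written here as quantities to minimize (the summary above states the total objective as a negated sum of the two terms), the MS teacher is kept as an EMA copy of the student, and the frozen DINOv3 teacher receives no updates. The momentum value is illustrative.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, m: float = 0.996):
    """EMA update of the MS teacher from the student (identical architectures assumed)."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(m).add_(s_p, alpha=1.0 - m)

def training_step(loss_ms: torch.Tensor, loss_o: torch.Tensor,
                  optimizer: torch.optim.Optimizer,
                  ms_teacher: torch.nn.Module, student: torch.nn.Module) -> torch.Tensor:
    """Joint optimization of the multispectral and optical objectives."""
    loss = loss_ms + loss_o           # alpha_1, alpha_2, alpha_3, gamma applied inside each term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(ms_teacher, student)   # frozen DINOv3 teacher is never updated
    return loss.detach()
```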

Key Experimental Results

Main Results: Semantic Segmentation (mIoU)

Optical Segmentation:

| Method | SpaceNet | GB-cattle | GB-pv | GB-chesa. | Avg. |
| --- | --- | --- | --- | --- | --- |
| DINOv3-B (RGB) | 79.06 | 73.01 | 94.34 | 64.04 | 77.61 |
| Copernicus-FM (MS) | 75.45 | 68.88 | 93.56 | 55.81 | 73.43 |
| DEO | 82.22 | 76.22 | 95.36 | 75.08 | 82.22 |

Multispectral Segmentation:

| Method | GB-SA-crop | GB-cashew | S1F11 | PASTIS | Avg. |
| --- | --- | --- | --- | --- | --- |
| TerraFM (MS) | 30.95 | 59.49 | 92.72 | 19.65 | 50.70 |
| Copernicus-FM (MS) | - | 55.71 | 92.58 | 21.49 | 51.11 |
| DEO | 36.59 | 65.60 | 93.30 | 23.06 | 63.51 |
  • The MS segmentation average surpasses the previous SOTA (63.51 vs. 51.11)

Change Detection (F1)

| Method | LEVIR (Optical) | OSCD (MS) | Avg. |
| --- | --- | --- | --- |
| DINOv3-LS | 91.8 | 57.2 | 74.5 |
| TerraFM | 89.5 | 57.5 | 73.5 |
| DEO | 91.3 | 59.2 | 75.3 |

Classification (Linear Probing)

| Method | m-bigearthnet (F1) | m-so2sat (Top-1) | m-eurosat (Top-1) | Avg. |
| --- | --- | --- | --- | --- |
| DINOv3-B | 55.48 | - | 93.3 | - |
| TerraFM | - | 47.57 | 93.1 | 67.61 |
| DEO | 58.43 | 53.09 | 93.8 | 68.44 |

Ablation Study

| Component | Optical Avg. | MS Avg. | Overall Avg. |
| --- | --- | --- | --- |
| Base (MS self-distillation only) | 77.87 | 60.44 | 69.16 |
| +DINOv3 [cls] | 79.07 (+1.20) | 62.81 (+2.37) | 70.94 |
| +Separate optical path | 81.20 (+2.13) | 62.69 (-0.12) | 71.95 |
| +DINOv3 [p] | 81.74 (+0.53) | 62.46 | 72.10 |
| +Optical augmentation | 81.95 | 63.02 (+0.55) | 72.48 |
| +High-resolution optical | 82.22 (+0.27) | 63.51 (+0.50) | 72.87 |

Key Findings

  1. Optical VFM distillation improves not only optical but also MS performance: Adding DINOv3 [cls] distillation yields +2.37 pp on MS average.
  2. Objective compatibility is critical: The contrastive self-distillation objective aligns naturally with DINOv3's training objective, enabling inherent feature space alignment (confirmed by PCA visualization in Figure 3).
  3. All components contribute cumulatively: Overall average improves from 69.16 (base) to 72.87 (full model), with each component contributing positively.
  4. DEO ranks first overall: Achieves the highest average rank across 11 benchmarks (Table 4), with only 87M parameters and 500K pre-training images.

Highlights & Insights

  1. Deep insight into objective compatibility: The student's pre-training objective should match that of the teacher model — this explains why MIM + VFM distillation (e.g., Copernicus-FM) underperforms contrastive distillation + VFM distillation.
  2. Exceptional efficiency: Trained on only 500K images (vs. TerraFM's 18M), with 87M parameters (vs. DINOv3-LS's 303M) and 100 epochs on 16× A100s, yet achieves comprehensive state-of-the-art results.
  3. Non-destructive multimodality: Incorporating MS capability does not sacrifice optical performance — a rare quality in multimodal foundation models.
  4. Swin over ViT: The finer feature resolution from patch size 4 is critical for dense prediction tasks; cross-architecture distillation from a ViT teacher to a Swin student is shown to be effective.

Limitations & Future Work

  1. Limited to 10 Sentinel-2 bands: SAR, thermal infrared, and other modalities are not addressed.
  2. Spatial resolution constraint: Sentinel-2's native 10–60 m resolution limits applications; while high-resolution optical data partially replaces low-resolution bands, MS bands remain low-resolution.
  3. Geographic bias in fMoW: The dataset primarily covers certain regions; generalization to polar, oceanic, and other underrepresented areas remains unknown.
  4. Whether larger student models could further benefit from this framework is left unexplored.

Additional Notes

  • DINOv3: The latest vision foundation model, with particular attention to remote sensing; DEO demonstrates the merit of efficiently leveraging its knowledge rather than competing with it from scratch.
  • Coding Rate Regularization: Derived from MCR² (Ma et al.), it replaces negative sampling/temperature scaling in conventional contrastive learning, preventing representation collapse more elegantly.
  • Implications for the EO community: Rather than investing enormous compute to train MS foundation models from scratch, efficiently absorbing knowledge from existing VFMs via distillation points toward a sustainable EO foundation model ecosystem.

Rating

⭐⭐⭐⭐⭐ — Insightful (objective compatibility), highly efficient (SOTA with only 500K images), and experimentally comprehensive (11 datasets across 3 tasks). An outstanding contribution to the remote sensing foundation model literature.