Brewing Stronger Features: Dual-Teacher Distillation for Multispectral Earth Observation¶
Conference: CVPR 2026 arXiv: 2602.19863 Code: Project Page Area: Image Segmentation Keywords: Remote Sensing Foundation Model, Multispectral, Knowledge Distillation, Contrastive Learning, Dual-Teacher Training
TL;DR¶
This paper proposes DEO (Distillation for Earth Observation), a dual-teacher contrastive distillation framework that employs a multispectral self-distillation teacher to learn spectral representations and a frozen optical VFM teacher (DINOv3) to inject high-level semantic priors. The resulting single student network excels at both optical and multispectral remote sensing tasks, achieving state-of-the-art performance across semantic segmentation, change detection, and classification.
Background & Motivation¶
Background: Foundation models are transforming Earth Observation (EO). Large amounts of unlabeled data combined with flexible task adaptation make them especially valuable in annotation-scarce EO settings. However, given the diversity of EO sensors and modalities, training a single universal model is impractical; multiple specialized foundation models will coexist.
Limitations of Prior Work:

- Most EO pre-training relies on Masked Image Modeling (MIM), which emphasizes local reconstruction but offers limited control over global semantic structure
- General-purpose VFMs (e.g., DINOv2/DINOv3) possess strong optical semantic priors but lack multispectral (MS) capability
- Training MS foundation models from scratch is computationally expensive
Key Challenge: How can the strong optical semantic priors of VFMs be efficiently transferred to a multispectral student without compromising the learning of MS-specific information? Existing approaches (e.g., Copernicus-FM) combine MIM with VFM distillation, but the MIM objective is incompatible with the contrastive self-distillation objective of VFMs, resulting in weaker global semantic structure.
Goal: Propose a pre-training strategy that enables strong performance when multispectral data is available, without sacrificing performance on optical-only tasks.
Key Insight: Align the pre-training objective of the student with that of the VFM teacher — if the VFM was trained with contrastive self-distillation, the student should be trained the same way, making latent feature space alignment more tractable.
Core Idea: Dual teacher = a multispectral contrastive self-distillation teacher (for structured MS feature space) + a frozen optical VFM teacher (for global semantic priors), unified under a contrastive distillation framework.
Method¶
Overall Architecture¶
As shown in Figure 2:

- Input Augmentation: Multi-scale global/local views generated from Sentinel-2 multispectral images
- Multispectral Branch (red): MS teacher (EMA-updated) + student, contrastive self-distillation
- Optical Branch (blue): Frozen DINOv3 teacher + student, feature distillation
- Student Network (green): Swin Transformer backbone with dual patch embeddings (10-channel for MS, 3-channel for optical)
Key Designs¶
1. Multispectral Contrastive Self-Distillation¶
- Function: Learn robust multispectral representations
- Mechanism: Based on the DINO framework; MS teacher weights are updated via EMA. The loss combines a cosine-similarity compression term with a coding-rate expansion term: \(\mathcal{L}_{MS} = \mathcal{L}_\text{cos}(p_M(\mathbf{z}_g^M), p_s^{MS}(\mathbf{z}_{g \cup l}^M)) - \gamma \mathcal{L}_{CR}\), where \(\mathcal{L}_{CR} = \log\det(\mathbf{I} + \text{Cov}[\mathbf{z}])\) measures the volume spanned by the representations; subtracting it rewards feature diversity and prevents representation collapse
- Design Motivation: Contrastive learning yields strong semantic representations invariant to distributional shifts; coding rate regularization replaces conventional temperature scaling/negative sampling strategies to prevent collapse more elegantly
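The compression/expansion interplay above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' code: the feature shapes, `eps` jitter, and the plain mean-cosine compression term are assumptions for readability.

```python
import torch
import torch.nn.functional as F


def coding_rate(z: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """log det(I + Cov[z]): grows as features span more directions,
    so subtracting it from the loss rewards diversity (anti-collapse)."""
    z = z - z.mean(dim=0, keepdim=True)          # center features
    cov = z.T @ z / z.shape[0]                   # (d, d) covariance
    d = z.shape[1]
    return torch.logdet((1.0 + eps) * torch.eye(d) + cov)


def ms_self_distill_loss(teacher_feats: torch.Tensor,
                         student_feats: torch.Tensor,
                         gamma: float = 1.0) -> torch.Tensor:
    """Cosine compression term minus gamma * coding-rate expansion term."""
    cos = 1 - F.cosine_similarity(teacher_feats, student_feats, dim=-1).mean()
    return cos - gamma * coding_rate(student_feats)
```

If all student features collapse to a single point, `coding_rate` falls to roughly zero and the loss loses its reward for diversity, which is exactly the failure mode the regularizer guards against.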
2. Optical VFM Distillation¶
- Function: Transfer DINOv3's global semantic and pixel-level features to the student
- Mechanism: Three categories of features are distilled via independent projection heads: \(\mathcal{L}_O = \alpha_1 \mathcal{L}_\text{cos}(\text{[cls]}_F) + \alpha_2 \mathcal{L}_\text{cos}(\text{[p]}_F) + \alpha_3 \mathcal{L}_\text{cos}(\text{[p]}_\text{mid})\)
- \(\text{[cls]}_F\): final-layer class token (global semantics)
- \(\text{[p]}_F\): final-layer patch tokens (pixel-level features)
- \(\text{[p]}_\text{mid}\): intermediate-layer patch tokens (mid-level features)
- Design Motivation: Distilling only the class token is insufficient for dense prediction tasks; patch-level features are necessary. Intermediate-layer features provide complementary mid-level semantic information.
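A sketch of the three-term distillation objective, under stated assumptions: the linear projection heads, the 768/1024 student/teacher dimensions, and the dict-of-tensors interface are all illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OpticalDistillHeads(nn.Module):
    """One independent projection head per distilled feature category,
    mapping student features into the frozen teacher's dimensionality."""
    def __init__(self, student_dim: int = 768, teacher_dim: int = 1024):
        super().__init__()
        self.cls_head = nn.Linear(student_dim, teacher_dim)    # [cls]_F
        self.patch_head = nn.Linear(student_dim, teacher_dim)  # [p]_F
        self.mid_head = nn.Linear(student_dim, teacher_dim)    # [p]_mid


def cosine_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return 1 - F.cosine_similarity(a, b, dim=-1).mean()


def optical_distill_loss(heads, student, teacher,
                         a1=1.0, a2=0.5, a3=0.5) -> torch.Tensor:
    """L_O = a1*cos([cls]_F) + a2*cos([p]_F) + a3*cos([p]_mid)."""
    return (a1 * cosine_loss(heads.cls_head(student["cls"]), teacher["cls"])
            + a2 * cosine_loss(heads.patch_head(student["patch"]), teacher["patch"])
            + a3 * cosine_loss(heads.mid_head(student["mid"]), teacher["mid"]))
```

Separate heads let the student match each teacher feature category in its own subspace instead of forcing one projection to serve global, pixel-level, and mid-level targets at once.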
3. Backbone Selection and Data Strategy¶
- Backbone: Swin Transformer (patch size 4 vs. ViT's 16), yielding finer feature resolution
- Data: fMoW-Sentinel (MS) + fMoW-RGB (optical), with low-resolution optical bands replaced by 150K high-resolution aerial images
- Dual Patch Embedding: 10-channel for MS, 3-channel for optical; subsequent Transformer layers are shared
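The dual-embedding idea can be sketched as two modality-specific stems feeding one shared trunk. This is a simplified stand-in, not the actual Swin implementation: the embedding dim, the input-channel dispatch, and the MLP placeholder for the shared Transformer layers are assumptions.

```python
import torch
import torch.nn as nn


class DualPatchEmbed(nn.Module):
    """Two patch-embedding stems (10-band MS, 3-band RGB) in front of a
    shared body; here the shared Swin stages are mocked by a small MLP."""
    def __init__(self, dim: int = 96, patch: int = 4):
        super().__init__()
        self.ms_embed = nn.Conv2d(10, dim, kernel_size=patch, stride=patch)
        self.rgb_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.shared = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dispatch on channel count: 10 bands -> MS stem, else RGB stem.
        embed = self.ms_embed if x.shape[1] == 10 else self.rgb_embed
        tokens = embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.shared(tokens)
```

Because only the thin embedding stems differ, both modalities land in the same token space and every downstream weight is shared, which is what lets a single student serve optical and MS tasks.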
Loss & Training¶
Multispectral and optical objectives are jointly optimized with coefficients \(\alpha_1=1, \alpha_2=0.5, \alpha_3=0.5, \gamma=1\).
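One moving part of the joint training loop worth making concrete is the EMA update that maintains the MS teacher; a minimal sketch, with the momentum value `m=0.996` chosen as a typical DINO-style default rather than the paper's setting:

```python
import torch


@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               m: float = 0.996) -> None:
    """Move each teacher parameter a small step toward the student:
    p_t <- m * p_t + (1 - m) * p_s. The teacher is never backpropped."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)
```

In the full loop, the student is optimized on the weighted sum of the multispectral and optical objectives, then `ema_update` is called once per step so the MS teacher trails the student smoothly.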
Key Experimental Results¶
Main Results: Semantic Segmentation (mIoU)¶
Optical Segmentation:
| Method | SpaceNet | GB-cattle | GB-pv | GB-chesa. | Avg. |
|---|---|---|---|---|---|
| DINOv3-B (RGB) | 79.06 | 73.01 | 94.34 | 64.04 | 77.61 |
| Copernicus-FM (MS) | 75.45 | 68.88 | 93.56 | 55.81 | 73.43 |
| DEO | 82.22 | 76.22 | 95.36 | 75.08 | 82.22 |
Multispectral Segmentation:
| Method | GB-SA-crop | GB-cashew | S1F11 | PASTIS | Avg. |
|---|---|---|---|---|---|
| TerraFM (MS) | 30.95 | 59.49 | 92.72 | 19.65 | 50.70 |
| Copernicus-FM (MS) | - | 55.71 | 92.58 | 21.49 | 51.11 |
| DEO | 36.59 | 65.60 | 93.30 | 23.06 | 63.51 |
- MS segmentation average surpasses the previous best (Copernicus-FM, 51.11) by +12.40 pp (63.51 vs. 51.11)
Change Detection (F1)¶
| Method | LEVIR (Optical) | OSCD (MS) | Avg. |
|---|---|---|---|
| DINOv3-LS | 91.8 | 57.2 | 74.5 |
| TerraFM | 89.5 | 57.5 | 73.5 |
| DEO | 91.3 | 59.2 | 75.3 |
Classification (Linear Probing)¶
| Method | m-bigearthnet F1 | m-so2sat Top1 | m-eurosat Top1 | Avg. |
|---|---|---|---|---|
| DINOv3-B | 55.48 | - | 93.3 | - |
| TerraFM | - | 47.57 | 93.1 | 67.61 |
| DEO | 58.43 | 53.09 | 93.8 | 68.44 |
Ablation Study¶
| Component | Optical Avg. | MS Avg. | Overall Avg. |
|---|---|---|---|
| Base (MS self-distillation only) | 77.87 | 60.44 | 69.16 |
| +DINOv3 [cls] | 79.07 (+1.20) | 62.81 (+2.37) | 70.94 |
| +Separate optical path | 81.20 (+2.13) | 62.69 (-0.12) | 71.95 |
| +DINOv3 [p] | 81.74 (+0.53) | 62.46 (-0.23) | 72.10 |
| +Optical augmentation | 81.95 (+0.21) | 63.02 (+0.55) | 72.48 |
| +High-resolution optical | 82.22 (+0.27) | 63.51 (+0.50) | 72.87 |
Key Findings¶
- Optical VFM distillation improves not only optical but also MS performance: Adding DINOv3 [cls] distillation yields +2.37 pp on MS average.
- Objective compatibility is critical: The contrastive self-distillation objective aligns naturally with DINOv3's training objective, enabling inherent feature space alignment (confirmed by PCA visualization in Figure 3).
- All components contribute cumulatively: Overall average improves from 69.16 (base) to 72.87 (full model), with each component contributing positively.
- DEO ranks first overall: Achieves the highest average rank across 11 benchmarks (Table 4), with only 87M parameters and 500K pre-training images.
Highlights & Insights¶
- Deep insight into objective compatibility: The student's pre-training objective should match that of the teacher model — this explains why MIM + VFM distillation (e.g., Copernicus-FM) underperforms contrastive distillation + VFM distillation.
- Exceptional efficiency: Trained on only 500K images (vs. TerraFM's 18M), with 87M parameters (vs. DINOv3-LS's 303M) and 100 epochs on 16× A100s, yet achieves comprehensive state-of-the-art results.
- Non-destructive multimodality: Incorporating MS capability does not sacrifice optical performance — a rare quality in multimodal foundation models.
- Swin over ViT: The finer feature resolution from patch size 4 is critical for dense prediction tasks; cross-architecture distillation from a ViT teacher to a Swin student is shown to be effective.
Limitations & Future Work¶
- Limited to 10 Sentinel-2 bands: SAR, thermal infrared, and other modalities are not addressed.
- Spatial resolution constraint: Sentinel-2's native 10–60 m resolution limits applications; while high-resolution optical data partially replaces low-resolution bands, MS bands remain low-resolution.
- Geographic bias in fMoW: The dataset primarily covers certain regions; generalization to polar, oceanic, and other underrepresented areas remains unknown.
- Whether larger student models could further benefit from this framework is left unexplored.
Related Work & Insights¶
- DINOv3: The latest vision foundation model with particular attention to remote sensing — DEO demonstrates the merit of efficiently leveraging its knowledge rather than competing from scratch.
- Coding Rate Regularization: Derived from MCR² (Ma et al.), it replaces negative sampling/temperature scaling in conventional contrastive learning, preventing representation collapse more elegantly.
- Implications for the EO community: Rather than investing enormous compute to train MS foundation models from scratch, efficiently absorbing knowledge from existing VFMs via distillation points toward a sustainable EO foundation model ecosystem.
Rating¶
⭐⭐⭐⭐⭐ — Insightful (objective compatibility), highly efficient (SOTA with only 500K images), and experimentally comprehensive (11 datasets across 3 tasks). An outstanding contribution to the remote sensing foundation model literature.