OmniSat: Self-Supervised Modality Fusion for Earth Observation

Conference: ECCV 2024
arXiv: 2404.08351
Code: GitHub
Area: Self-Supervised
Keywords: multi-modal fusion, earth observation, self-supervised, Sentinel, PASTIS

TL;DR

This paper proposes OmniSat, a framework that fuses heterogeneous remote sensing data, including multi-spectral time series (Sentinel-2, S2), SAR time series (Sentinel-1, S1), and high-resolution single-temporal images (SPOT/aerial), into a single representation using modality-specific encoders and cross-modal contrastive self-supervised pre-training. It is reported to outperform the unimodal and multimodal baselines it is compared against on semantic segmentation and crop classification.

Background & Motivation

Background: Earth observation offers abundant multi-source data, such as S2 (multi-spectral time series), S1 (SAR time series), and high-resolution single-temporal images (SPOT/aerial), at spatial resolutions ranging from 10 m to 0.2 m. Most existing methods use only a single modality.

Limitations of Prior Work: Spatial and temporal resolution, number of bands, and acquisition frequency differ widely across modalities. Direct concatenation or simple fusion fails to exploit their complementary information.

Key Challenge: High-resolution data has excellent spatial details but lacks temporal information; time-series data is temporally rich but has low spatial resolution.

Goal: Design a unified multimodal remote sensing architecture that can flexibly ingest any subset of modalities.

Key Insight: Give each modality a dedicated encoder, then align the encoder outputs in a shared semantic space with cross-modal contrastive learning.

Core Idea: Modality-specific encoders + cross-modal CLIP-style alignment + modality dropout to achieve robust fusion of arbitrary modality combinations.

Method

Overall Architecture

Each modality has a dedicated encoder: time-series data uses a spatio-temporal attention Transformer (a U-TAE variant), while high-resolution data uses a CNN or ViT. Projection heads map each encoder's output into a shared space. During inference, cross-attention fuses the multimodal tokens.
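
As a rough illustration of the fusion step, the sketch below pools already-projected tokens from whichever modalities are present using single-head scaled dot-product attention with one query vector. The function names, the single fusion query, and the toy 2-dimensional tokens are assumptions made for illustration, not the paper's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_fuse(query, tokens_by_modality):
    """Fuse tokens from any subset of modalities with one attention query.

    query: list of d floats (hypothetical learned fusion query).
    tokens_by_modality: dict of modality name -> list of d-dim tokens,
        assumed to be already projected into the shared space.
    Returns one fused d-dim vector.
    """
    tokens = [t for toks in tokens_by_modality.values() for t in toks]
    if not tokens:
        raise ValueError("at least one modality must be present")
    d = len(query)
    # Scaled dot-product score between the query and every token.
    scores = [sum(q * k for q, k in zip(query, t)) / math.sqrt(d) for t in tokens]
    weights = softmax(scores)
    # Attention-weighted average of the tokens (tokens act as keys and values).
    return [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(d)]

fused_all = cross_attention_fuse(
    [1.0, 0.0],
    {"s2": [[1.0, 0.0]], "s1": [[0.0, 1.0]], "spot": [[1.0, 1.0]]},
)
fused_s2 = cross_attention_fuse([1.0, 0.0], {"s2": [[1.0, 0.0]]})
```

Because the attention weights are normalized over whatever tokens are passed in, the same function works for one, two, or three modalities without any change.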

Key Designs

Modality-Specific Encoders
- Function: Handle heterogeneous data separately (time-series multi-spectral, SAR, single-temporal high-resolution).
- Mechanism: S2 uses U-TAE (spatio-temporal attention), S1 is similar but adapted to 2-channel SAR, and high-resolution data uses ViT/ResNet.
- Design Motivation: The input formats are too disparate to use a shared encoder.

Cross-Modal Contrastive Alignment
- Function: Self-supervised pre-training maps different modalities of the same geographical plot to nearby embeddings.
- Mechanism: CLIP-style contrastive learning pulls the S1 and S2 embeddings of the same plot together while pushing embeddings of different plots apart.
- Design Motivation: Labels are scarce; spatial co-occurrence is leveraged to obtain alignment signals for free.
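
A minimal sketch of a symmetric CLIP-style InfoNCE loss is shown below, assuming one embedding per plot and per modality, cosine similarity, and a temperature of 0.1. These choices and all function names are illustrative assumptions; the summary does not specify the paper's exact formulation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(anchors, positives, temperature=0.1):
    """One-directional InfoNCE: anchors[i] should match positives[i].

    Every other entry of `positives` serves as a negative for anchors[i].
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]  # -log softmax at the true pair
    return loss / len(a)

def clip_style_loss(s1_emb, s2_emb, temperature=0.1):
    """Symmetric loss: average of the S1->S2 and S2->S1 directions."""
    return 0.5 * (info_nce(s1_emb, s2_emb, temperature)
                  + info_nce(s2_emb, s1_emb, temperature))

s1 = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_style_loss(s1, [[0.9, 0.1], [0.1, 0.9]])  # same-plot pairs agree
swapped = clip_style_loss(s1, [[0.1, 0.9], [0.9, 0.1]])  # pairs are mismatched
```

The loss is low when each plot's S1 embedding is closest to its own S2 embedding and high when the pairing is scrambled, which is the alignment signal that spatial co-occurrence provides without labels.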

Modality Dropout + Flexible Inference
- Function: Randomly drop modalities during training so that the model accepts any subset during inference.
- Mechanism: At each training step, mask the inputs of a random subset of modalities, forcing the model to extract information from whichever remain.
- Design Motivation: In practical deployment, not all plots have coverage from all modalities.
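
The mechanism above can be sketched as a per-sample filter. The drop probability of 0.3, the rule of always keeping at least one modality, and the names below are assumptions for illustration; the summary does not give the paper's exact sampling scheme.

```python
import random

MODALITIES = ("s2", "s1", "spot")

def drop_modalities(sample, drop_prob=0.3, rng=random):
    """Randomly remove modalities from one training sample.

    sample: dict of modality name -> input (any object).
    Each present modality is dropped independently with probability
    drop_prob; if all would be dropped, one is kept at random so the
    model always receives some input. The sample must contain at least
    one known modality.
    """
    present = [m for m in MODALITIES if m in sample]
    kept = [m for m in present if rng.random() >= drop_prob]
    if not kept:
        kept = [rng.choice(present)]
    return {m: sample[m] for m in kept}

rng = random.Random(0)  # fixed seed so the example is reproducible
sample = {"s2": "s2_tensor", "s1": "s1_tensor", "spot": "spot_tensor"}
subsets = [drop_modalities(sample, rng=rng) for _ in range(200)]
```

At inference time no dropout is applied; the model simply receives whichever modalities exist for a plot.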

Loss & Training

Pre-training: Cross-modal InfoNCE; Fine-tuning: Cross-entropy + Dice
Datasets: PASTIS (French S1+S2+SPOT), FLAIR (Aerial+S2)
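
For the fine-tuning objective, a minimal sketch of a combined cross-entropy and soft Dice loss over per-pixel class probabilities follows. The equal weighting, the smoothing constant, and the function names are assumptions; the summary only states that the two terms are combined.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-likelihood; probs[i] is the class distribution of pixel i."""
    eps = 1e-12  # avoids log(0)
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)

def soft_dice_loss(probs, labels, num_classes, smooth=1.0):
    """One minus the mean soft Dice coefficient over classes."""
    dice_sum = 0.0
    for c in range(num_classes):
        intersection = sum(p[c] for p, y in zip(probs, labels) if y == c)
        pred_mass = sum(p[c] for p in probs)
        true_mass = sum(1 for y in labels if y == c)
        dice_sum += (2.0 * intersection + smooth) / (pred_mass + true_mass + smooth)
    return 1.0 - dice_sum / num_classes

def segmentation_loss(probs, labels, num_classes, dice_weight=1.0):
    """Cross-entropy plus weighted Dice (dice_weight is an assumed hyperparameter)."""
    return (cross_entropy(probs, labels)
            + dice_weight * soft_dice_loss(probs, labels, num_classes))

labels = [0, 1, 1]
good = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
bad = [[0.1, 0.9], [0.9, 0.1], [0.8, 0.2]]
loss_good = segmentation_loss(good, labels, num_classes=2)
loss_bad = segmentation_loss(bad, labels, num_classes=2)
```

Cross-entropy scores each pixel independently, while the Dice term measures per-class overlap, which tends to help with class imbalance in segmentation.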

Key Experimental Results

Main Results (PASTIS Semantic Segmentation, mIoU %)

Modality Combination	Method	mIoU
S2 only	U-TAE	63.1
S1+S2	Dual-stream	65.2
S1+S2+SPOT	OmniSat	67.5
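
For reference, mIoU averages the per-class intersection-over-union. The sketch below computes it on flat lists of predicted and true labels; skipping classes that appear in neither list is an assumption, and benchmark protocols may differ (for example by accumulating a confusion matrix over the whole test set or ignoring a void label).

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(intersection / union)
    return sum(ious) / len(ious)

pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
# Per-class IoU: class 0 -> 1/3, class 1 -> 2/3, class 2 -> 1/2
score = mean_iou(pred, target, num_classes=3)  # 0.5
```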

Ablation Study

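Configuration	mIoU	Note
Full	67.5	Tri-modal (S1+S2+SPOT)
w/o Pre-training	64.1	Self-supervised pre-training contributes +3.4
w/o Dropout	65.8	Modality dropout (robustness) contributes +1.7
w/o S1	66.2	All-weather SAR contributes +1.3
w/o SPOT	65.9	High-res details contribute +1.6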

Key Findings

Tri-modal fusion improves on the S2-only U-TAE baseline by 4.4 mIoU (67.5 vs. 63.1).
Self-supervised pre-training contributes 3.4 mIoU, the largest gain among the ablated components.
Modality dropout keeps the degradation from a missing modality within 1-2 mIoU (1.3 without S1, 1.6 without SPOT).
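
The figures quoted in these findings follow directly from the two tables above. A short check, using only the reported mIoU values (the variable names are illustrative):

```python
# mIoU values copied from the main-results and ablation tables.
ablation_miou = {
    "full": 67.5,
    "w/o pre-training": 64.1,
    "w/o dropout": 65.8,
    "w/o S1": 66.2,
    "w/o SPOT": 65.9,
}
s2_only_utae = 63.1

# Drop in mIoU when each component is removed from the full model.
drops = {
    name: round(ablation_miou["full"] - value, 1)
    for name, value in ablation_miou.items()
    if name != "full"
}
gain_over_unimodal = round(ablation_miou["full"] - s2_only_utae, 1)
largest = max(drops, key=drops.get)
```

Here `drops` evaluates to 3.4, 1.7, 1.3, and 1.6 for the four ablations, `gain_over_unimodal` to 4.4, and `largest` to the pre-training ablation.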

Highlights & Insights

Unified Framework: Presented as the first to jointly handle optical time-series, SAR time-series, and high-resolution single-temporal images.
Geographical Co-occurrence Self-Supervision: Aligns multi-source data by exploiting their inherent spatial correspondence.
Flexible Inference: Ingests arbitrary modality subsets into a single model, making it highly practical for deployment.

Limitations & Future Work

Only validated on French agricultural regions; global generalization capability remains unverified.
Contrastive learning may lead to confusion in geographically close but semantically distinct areas.
Spatial coverage of the high-resolution modality is limited.

vs. U-TAE: Unimodal S2 baseline, which OmniSat extends to a multimodal setting.
vs. SatCLIP: Performs geographic coordinate alignment but does not handle temporal sequences.
vs. SkySense: Uses large-scale pre-training but does not address missing modalities.

Rating

Novelty: ⭐⭐⭐⭐ First unified framework to handle three types of heterogeneous remote sensing data.
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple datasets + modality ablations + pre-training ablations.
Writing Quality: ⭐⭐⭐⭐ Clear problem definition.
Value: ⭐⭐⭐⭐⭐ Modality dropout is extremely valuable for real-world deployment.