OmniSat: Self-Supervised Modality Fusion for Earth Observation

Conference: ECCV 2024
arXiv: 2404.08351
Code: GitHub
Area: Self-Supervised
Keywords: multi-modal fusion, earth observation, self-supervised, Sentinel, PASTIS

TL;DR

This paper proposes OmniSat, a framework that fuses heterogeneous remote sensing data, including multi-spectral time-series (S2), SAR time-series (S1), and high-resolution single-temporal images (SPOT/aerial), into a shared representation using modality-specific encoders and cross-modal contrastive self-supervised pre-training. It outperforms the reported unimodal and multimodal baselines on semantic segmentation and crop classification.

Background & Motivation

Background: Earth observation offers abundant multi-source data, such as Sentinel-2 (S2, multi-spectral time-series), Sentinel-1 (S1, SAR time-series), and high-resolution single-temporal images (SPOT/aerial), with spatial resolutions ranging from 10 m down to 0.2 m. Most existing methods use only a single modality.

Limitations of Prior Work: Spatial and temporal resolution, number of bands, and acquisition frequency differ widely across modalities. Direct concatenation or simple fusion fails to exploit their complementary information.

Key Challenge: High-resolution data has excellent spatial details but lacks temporal information; time-series data is temporally rich but has low spatial resolution.

Goal: Design a unified multimodal remote sensing architecture that can flexibly ingest any subset of modalities.

Key Insight: Give each modality a dedicated encoder, then align the encoder outputs in a shared semantic space with cross-modal contrastive learning.

Core Idea: Modality-specific encoders + CLIP-style cross-modal alignment + modality dropout, for robust fusion of arbitrary modality combinations.

Method

Overall Architecture

Each modality has a dedicated encoder: time-series data uses a spatio-temporal attention Transformer (a U-TAE variant), while high-resolution data uses a CNN or ViT backbone. Projection heads map each encoder's output into a shared space. At inference, cross-attention fuses the tokens of whichever modalities are available.
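
The fusion step can be illustrated with a minimal single-head attention sketch in plain Python. It is not the paper's implementation: the 4-dimensional token values, the `query` vector (a stand-in for a learned fusion query), and the function name `cross_attention_fuse` are all hypothetical, and the learned key/value projections of a real attention layer are omitted for brevity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(query, tokens):
    """Single-head attention: one query vector attends over modality tokens.

    Assumption: keys and values are the tokens themselves (no learned
    projections), which keeps the sketch short.
    """
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d)
              for tok in tokens]
    weights = softmax(scores)
    fused = [sum(w * tok[k] for w, tok in zip(weights, tokens))
             for k in range(d)]
    return fused, weights


# Hypothetical projected tokens, one per modality, in a shared 4-d space.
tokens = {
    "S2": [0.9, 0.1, 0.0, 0.2],
    "S1": [0.2, 0.8, 0.1, 0.0],
    "SPOT": [0.1, 0.0, 0.9, 0.3],
}
query = [1.0, 0.0, 0.0, 0.0]  # stand-in for a learned fusion query

fused, weights = cross_attention_fuse(query, list(tokens.values()))
# The same function accepts any non-empty subset of modalities.
fused_s1_only, weights_s1_only = cross_attention_fuse(query, [tokens["S1"]])
```

Because the attention weights are renormalized over whatever tokens are passed in, the same fusion code handles one, two, or three modalities without modification.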

Key Designs

  1. Modality-Specific Encoders

    • Function: Handle heterogeneous data separately (time-series multi-spectral, SAR, single-temporal high-resolution).
    • Mechanism: S2 uses U-TAE (spatio-temporal attention), S1 is similar but adapted to 2-channel SAR, and high-resolution data uses ViT/ResNet.
    • Design Motivation: The input formats are too disparate to use a shared encoder.
  2. Cross-Modal Contrastive Alignment

    • Function: Self-supervised pre-training maps different modalities of the same geographical plot to nearby embeddings.
    • Mechanism: CLIP-style contrastive learning pulls the S1 and S2 embeddings of the same plot together while pushing embeddings of different plots apart.
    • Design Motivation: Labels are scarce; spatial co-occurrence is leveraged to obtain alignment signals for free.
  3. Modality Dropout + Flexible Inference

    • Function: Randomly drop certain modalities during training, allowing the model to accept any subset during inference.
    • Mechanism: Randomly mask some modality inputs at each step to force the model to extract information from any subset.
    • Design Motivation: In practical deployment, not all plots have coverage from all modalities.
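
Modality dropout as described in item 3 can be sketched as follows. This is an illustrative sketch only: the drop probability `p_drop=0.3`, the rule of always keeping at least one modality, and the function name `drop_modalities` are assumptions, not details taken from the paper.

```python
import random


def drop_modalities(sample, p_drop=0.3, rng=random):
    """Return a copy of `sample` with each modality dropped with prob. p_drop.

    Assumption: at least one modality is always kept, so the model never
    receives an empty input.
    """
    kept = {name: x for name, x in sample.items() if rng.random() >= p_drop}
    if not kept:
        # Everything was dropped: fall back to one randomly chosen modality.
        name = rng.choice(sorted(sample))
        kept = {name: sample[name]}
    return kept


rng = random.Random(0)  # seeded for reproducibility
# Placeholder strings stand in for the actual input tensors.
sample = {"S1": "sar_series", "S2": "optical_series", "SPOT": "vhr_image"}
subsets = [tuple(sorted(drop_modalities(sample, 0.3, rng))) for _ in range(5)]
```

Applied at every training step, this exposes the model to many modality subsets, which is what lets a single set of weights serve plots with incomplete coverage at inference time.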

Loss & Training

  • Pre-training: Cross-modal InfoNCE; Fine-tuning: Cross-entropy + Dice
  • Datasets: PASTIS (French S1+S2+SPOT), FLAIR (Aerial+S2)
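
For reference, a symmetric CLIP-style InfoNCE loss between two modalities can be written in a few lines of plain Python. This is a generic sketch of the loss family named above, not the authors' code: the temperature of 0.07, the symmetric averaging over both directions, and the name `info_nce` are assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(za, zb, temperature=0.07):
    """Symmetric InfoNCE over a batch; za[i] and zb[i] come from the same plot."""
    za = [l2_normalize(v) for v in za]
    zb = [l2_normalize(v) for v in zb]
    n = len(za)
    # Cosine-similarity logits scaled by the (assumed) temperature.
    logits = [[sum(a * b for a, b in zip(za[i], zb[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct "class" of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))


# Toy batch of two plots with 2-d embeddings.
s1 = [[1.0, 0.0], [0.0, 1.0]]
s2_aligned = [[1.0, 0.0], [0.0, 1.0]]   # same plot order as s1
s2_swapped = [[0.0, 1.0], [1.0, 0.0]]   # plots mismatched
loss_aligned = info_nce(s1, s2_aligned)
loss_swapped = info_nce(s1, s2_swapped)
```

The loss is near zero when matching plots already have the most similar embeddings and grows when the pairing is wrong, which is the signal that drives the alignment.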

Key Experimental Results

Main Results (PASTIS Semantic Segmentation mIoU%)

| Modality Combination | Method | mIoU (%) |
|---|---|---|
| S2 only | U-TAE | 63.1 |
| S1+S2 | Dual-stream | 65.2 |
| S1+S2+SPOT | OmniSat | 67.5 |
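
As a reminder of how the reported metric is defined, mIoU averages the per-class intersection-over-union computed from a confusion matrix. The sketch below uses a made-up 2-class matrix; the function name `mean_iou` and the choice to skip classes that never occur are assumptions, and evaluation protocols differ in how they treat such classes.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] = number of pixels with true class i predicted as class j.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Made-up 2-class example: IoU is 8/11 for class 0 and 9/12 for class 1.
toy_confusion = [[8, 2],
                 [1, 9]]
score = mean_iou(toy_confusion)
```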

Ablation Study

| Configuration | mIoU (%) | Note |
|---|---|---|
| Full | 67.5 | Tri-modal |
| w/o Pre-training | 64.1 | Self-supervised +3.4 |
| w/o Dropout | 65.8 | Robustness +1.7 |
| w/o S1 | 66.2 | All-weather SAR |
| w/o SPOT | 65.9 | High-res details |

Key Findings

  • Tri-modal fusion improves mIoU by 4.4 points over the unimodal S2 baseline (67.5 vs. 63.1).
  • Self-supervised pre-training contributes 3.4 mIoU, the largest single gain among the ablated components.
  • With modality dropout in training, removing S1 or SPOT at inference costs only 1.3 to 1.6 mIoU (66.2 and 65.9 vs. 67.5).
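
The differences quoted in these findings follow directly from the two tables; a short check, using only the numbers reported above (variable names are illustrative):

```python
# mIoU values copied from the ablation table.
ablation = {
    "Full": 67.5,
    "w/o Pre-training": 64.1,
    "w/o Dropout": 65.8,
    "w/o S1": 66.2,
    "w/o SPOT": 65.9,
}
full = ablation["Full"]

# Drop in mIoU when each component or modality is removed.
drops = {name: round(full - miou, 1)
         for name, miou in ablation.items() if name != "Full"}
largest_contributor = max(drops, key=drops.get)

# Gain of tri-modal OmniSat over the S2-only U-TAE baseline (63.1).
tri_modal_gain = round(full - 63.1, 1)
```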

Highlights & Insights

  • Unified Framework: First unified framework to handle optical time-series, SAR time-series, and high-resolution single-temporal images.
  • Geographical Co-occurrence Self-Supervision: Aligns multi-source data by exploiting their inherent spatial correspondence.
  • Flexible Inference: Ingests arbitrary modality subsets into a single model, making it highly practical for deployment.

Limitations & Future Work

  • Only validated on French agricultural regions; global generalization capability remains unverified.
  • Contrastive learning may lead to confusion in geographically close but semantically distinct areas.
  • Spatial coverage of the high-resolution modality is limited.

Related Work

  • vs U-TAE: Unimodal S2 baseline, which OmniSat extends to a multimodal setting.
  • vs SatCLIP: Performs geographic coordinate alignment but does not handle temporal sequences.
  • vs SkySense: Large-scale pre-training but does not address missing modalities.

Rating

  • Novelty: ⭐⭐⭐⭐ First unified framework to handle three types of heterogeneous remote sensing data.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple datasets + modality ablations + pre-training ablations.
  • Writing Quality: ⭐⭐⭐⭐ Clearly defined problem statement.
  • Value: ⭐⭐⭐⭐⭐ Modality dropout is extremely valuable for real-world deployment.