CVPR 2025 Segmentation SAR foundation model domain generalizable semantic segmentation sparse mixture-of-experts (MoE) physics-guided routing billion parameters

CrossEarth-SAR: A SAR-Centric and Billion-Scale Geospatial Foundation Model for Domain Generalizable Semantic Segmentation

Conference: CVPR 2025
arXiv: 2603.12008
Code: VisionXLab/CrossEarth-SAR
Area: Image Segmentation / SAR Remote Sensing / Foundation Models
Keywords: SAR foundation model, domain generalizable semantic segmentation, sparse mixture-of-experts (MoE), physics-guided routing, billion parameters

TL;DR

This paper proposes CrossEarth-SAR, the first billion-parameter-scale SAR vision foundation model, built on a physics-guided sparse Mixture-of-Experts (MoE) architecture. It also constructs a 200K-image training set and an evaluation framework of 22 sub-benchmarks. CrossEarth-SAR achieves state-of-the-art (SOTA) performance on 20 of the 22 cross-domain semantic segmentation benchmarks.

Background & Motivation

1. Background

Synthetic Aperture Radar (SAR) possesses all-weather, day-and-night Earth observation capabilities, making it indispensable in disaster monitoring, environmental surveillance, and urban management. Semantic segmentation is a core task for converting complex SAR data into actionable information.

2. Limitations of Prior Work

Inherent SAR Challenges: Coherent imaging causes multiplicative speckle noise; side-looking geometry leads to layover, foreshortening, and shadow distortions; radar backscattering results in semantic ambiguity (different categories might share similar appearance).
Extreme Domain Shift: Diverse sensors (Sentinel-1/ALOS-2/Capella), frequency bands (C/L/X), polarization modes (HH/HV/VV/VH), and incident angles lead to highly fragmented data characteristics, so models fail catastrophically when transferred across domains.
Deficiencies of Existing Foundation Models: Optical foundation models (such as SatMAE and SkySense) fail to adapt to the physical properties of SAR. Existing SAR models primarily focus on object detection rather than dense segmentation, and they lack designs tailored for cross-domain generalization.

3. Key Challenge

The key challenge is to build a foundation model with enough capacity to absorb the extreme diversity of SAR data while remaining computationally feasible, so that it achieves robust semantic segmentation across different sensors, regions, and polarizations.

4. Mechanism

A sparse MoE architecture is adopted to scale the parameter count to the billion level (to capture extreme domain diversity) while keeping the inference cost per image manageable, alongside a physics-guided routing mechanism to stabilize expert selection.

5. Prior Attempts & Limitations

CrossEarth is the first vision foundation model (VFM) targeting domain generalizable segmentation, but it only supports optical remote sensing.
SAR foundation models like SARATR-X focus on object recognition rather than dense segmentation.
Existing methods often rely on paired optical images to assist SAR segmentation, which limits their generalization capabilities.

6. Goal

A three-pronged approach is proposed: (1) CrossEarth-SAR, a physics-guided sparse MoE foundation model; (2) CrossEarth-SAR-200K, a large-scale training set with 200K weakly and fully supervised images; (3) an evaluation framework with 22 sub-benchmarks covering 8 types of domain gaps.

Method

Overall Architecture

On top of the DINOv2 (ViT) backbone, the FFN in each block is replaced with a Heuristic MoE layer that holds $n$ experts and activates only the top-$k$ of them per token. The single-channel input SAR image is replicated into three channels for forward propagation. Three SAR physical descriptors are computed in parallel to guide the routing. Mask2Former is adopted as the decoder. Training jointly optimizes the segmentation loss and the load balancing loss.
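
The channel-replication step is simple enough to sketch directly. The snippet below is a minimal NumPy illustration, assuming a single-channel SAR image and a channel-first layout; the function name `to_three_channels` is hypothetical and is not taken from the released code.

```python
import numpy as np

def to_three_channels(sar: np.ndarray) -> np.ndarray:
    """Replicate a single-channel (H, W) SAR image into a (3, H, W) array.

    Assumption: channel-first layout, as commonly expected by ViT backbones.
    """
    if sar.ndim != 2:
        raise ValueError("expected a 2-D single-channel image")
    # Add a leading channel axis, then copy it three times.
    return np.repeat(sar[None, :, :], 3, axis=0)
```

All three channels carry identical values, so the replication adds no information; it only matches the input shape of a backbone pre-trained on RGB images.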

Key Designs

Key Design 1: SAR Physical Descriptors

Function: Compute three scalar physical features for each SAR image to provide robust domain-aware signals.
Mechanism: The three descriptors characterize imaging geometry, radar systems, and target scattering characteristics respectively:
- Directional Entropy $H_{DE}$: Measures the uniformity of gradient direction distribution to capture variations in imaging geometry.
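- Equivalent Number of Looks $ENL$: Calculated as $\mu^2/\sigma^2$, where $\mu$ and $\sigma^2$ are the mean and variance of the image intensity, to measure speckle strength (a lower $ENL$ means stronger speckle) and characterize radar system noise.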
- Local Roughness $R_{LR}$: Measured via the spatial variance of local patch means to depict the degree of texture variation.
Design Motivation: Standard routers select experts based on learned token embeddings, but heterogeneous SAR data causes extreme embedding fluctuations (Routing Instability). Physical descriptors provide stable domain prior signals.
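
To make the three descriptors concrete, here is a minimal NumPy sketch of one plausible way to compute them. Only the $ENL$ formula ($\mu^2/\sigma^2$) comes from the text above; the histogram bin count, the gradient-magnitude weighting, the patch size, and all function names are assumptions made for illustration and may differ from the paper's implementation.

```python
import numpy as np

def directional_entropy(img: np.ndarray, bins: int = 16) -> float:
    """Normalized Shannon entropy of the gradient-direction histogram (0 to 1).

    Assumption: directions are weighted by gradient magnitude.
    """
    gy, gx = np.gradient(img.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    theta = np.arctan2(gy, gx)  # direction in [-pi, pi]
    hist, _ = np.histogram(theta, bins=bins, range=(-np.pi, np.pi), weights=magnitude)
    total = hist.sum()
    if total == 0:  # flat image: no gradient directions at all
        return 0.0
    p = hist / total
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(bins))

def equivalent_number_of_looks(img: np.ndarray) -> float:
    """ENL = mean^2 / variance of the image intensity."""
    img = img.astype(np.float64)
    var = img.var()
    return float(img.mean() ** 2 / var) if var > 0 else float("inf")

def local_roughness(img: np.ndarray, patch: int = 8) -> float:
    """Variance of the means of non-overlapping patches (patch size is assumed)."""
    img = img.astype(np.float64)
    h, w = img.shape
    h, w = h - h % patch, w - w % patch  # crop to a multiple of the patch size
    blocks = img[:h, :w].reshape(h // patch, patch, w // patch, patch)
    patch_means = blocks.mean(axis=(1, 3))
    return float(patch_means.var())
```

Each function returns one scalar per image, matching the "three scalar physical features" described above. On simulated 4-look speckle (gamma-distributed intensity with shape 4), `equivalent_number_of_looks` should return a value close to 4.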

Key Design 2: Physics-Guided Sparse MoE

Function: Scale model capacity to billion-level parameters while preserving inference efficiency.
Mechanism: In each ViT block, token embeddings are concatenated with physical descriptors and fed into the router. The router selects the top-$k$ experts and aggregates their outputs through weighted summation.
Design Motivation: Densely scaling the network incurs prohibitive computational costs. Sparse MoE, with its multi-expert design, allows different experts to specialize in distinct SAR properties, while sparse activation maintains a controlled inference cost.
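Load Balancing Loss: $$L_{BC} = \lambda \cdot n \cdot \sum_{k=1}^{n} f_k \cdot p_k$$ is used to prevent expert collapse, where $\lambda=0.005$ and the sum runs over the $n$ experts (here $k$ indexes experts, not the top-$k$ setting). Following the usual convention for this auxiliary loss, $f_k$ is presumably the fraction of tokens routed to expert $k$ and $p_k$ the mean router probability assigned to it.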
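
The routing step described in the Mechanism above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the router is assumed to be a single linear layer followed by a softmax, the same three descriptors are appended to every token of an image, and the names `physics_guided_route`, `moe_forward`, and `w_router` are hypothetical.

```python
import numpy as np

def physics_guided_route(tokens, descriptors, w_router, k=1):
    """tokens: (T, D); descriptors: (3,); w_router: (D + 3, n_experts)."""
    num_tokens = tokens.shape[0]
    # Append the per-image physical descriptors to every token embedding.
    desc = np.broadcast_to(descriptors, (num_tokens, descriptors.shape[0]))
    router_in = np.concatenate([tokens, desc], axis=1)
    logits = router_in @ w_router
    logits = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    topk = np.argsort(-probs, axis=1)[:, :k]              # indices of the k best experts
    gates = np.take_along_axis(probs, topk, axis=1)
    gates = gates / gates.sum(axis=1, keepdims=True)      # renormalize over selected experts
    return topk, gates, probs

def moe_forward(tokens, descriptors, w_router, experts, k=1):
    """Weighted sum of the outputs of the top-k experts for each token."""
    topk, gates, probs = physics_guided_route(tokens, descriptors, w_router, k)
    out = np.zeros_like(tokens, dtype=np.float64)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = topk[:, slot] == e
            if mask.any():
                # Only the tokens routed to this expert are processed by it.
                out[mask] += gates[mask, slot][:, None] * expert(tokens[mask])
    return out, topk, probs
```

With `k=1` each token passes through exactly one expert, so the per-token compute stays close to that of a single FFN regardless of how many experts exist; this is the capacity-versus-cost trade-off that the design motivation refers to.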

Key Design 3: CrossEarth-SAR-200K Dataset

Function: Construct a large-scale training dataset for SAR semantic segmentation.
Mechanism: Combine 40K fully supervised annotations with 160K weakly supervised data (using the CrossEarth optical model to segment paired optical images to generate pseudo-labels).
Design Motivation: Annotated SAR data is scarce, yet pre-training a billion-parameter model requires large-scale data.
Pseudo-Label Quality: The mean agreement of 4 models on 1K samples reaches $75.88\%$, higher than OpenEarthMap-SAR's $63.20\%$.
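
How the agreement score is computed is not detailed here. One simple reading, shown purely as an assumption, is the mean pairwise pixel agreement between the label maps that several models predict for the same image, averaged over the sampled images. The function name `mean_pairwise_agreement` is hypothetical.

```python
from itertools import combinations

import numpy as np

def mean_pairwise_agreement(label_maps: np.ndarray) -> float:
    """label_maps: (M, H, W) integer class maps from M models for one image.

    Returns the fraction of pixels on which two models agree,
    averaged over all M * (M - 1) / 2 model pairs.
    """
    scores = [np.mean(a == b) for a, b in combinations(label_maps, 2)]
    return float(np.mean(scores))
```

Under this reading, 4 models give 6 pairs per image, and the dataset-level score would be the average of this value over the 1K sampled images.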

Key Design 4: Earth-Adapter (SAR RS-PEFT)

Function: Parameter-efficient fine-tuning (PEFT) on downstream tasks.
Mechanism: Freeze the backbone, training only the decoder and the adapters.

Loss & Training

Total Loss: $L = L_{seg} + L_{BC}$, where $L_{seg}$ is the segmentation loss and $L_{BC}$ is the load balancing loss, which already includes its weight $\lambda = 0.005$.
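
As a worked illustration of the load balancing term, the sketch below evaluates $\lambda \cdot n \cdot \sum_k f_k \cdot p_k$ in NumPy. It assumes the common convention that $f_k$ is the fraction of tokens dispatched to expert $k$ (top-1) and $p_k$ is the mean router probability for expert $k$; these definitions and the function names are assumptions, not confirmed details of the paper.

```python
import numpy as np

def load_balancing_loss(probs: np.ndarray, assignments: np.ndarray, lam: float = 0.005) -> float:
    """probs: (T, n) router softmax outputs; assignments: (T,) top-1 expert index per token."""
    n = probs.shape[1]
    # f_k: fraction of tokens dispatched to each expert (assumed definition).
    f = np.bincount(assignments, minlength=n) / assignments.shape[0]
    # p_k: mean router probability per expert (assumed definition).
    p = probs.mean(axis=0)
    return float(lam * n * np.sum(f * p))

def total_loss(l_seg: float, l_bc: float) -> float:
    """Total loss is the sum of the segmentation loss and the (already weighted) L_BC."""
    return l_seg + l_bc
```

Under these definitions, perfectly uniform routing gives $L_{BC} = \lambda$ (0.005), while routing every token to one expert with probability 1 gives $L_{BC} = \lambda \cdot n$ (0.03 for $n=6$), so minimizing the term pushes the router away from expert collapse.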

Key Experimental Results

Main Results: Single-Domain Gap Benchmarks (Table 2, average mIoU of 12 benchmarks)

| Method | Parameter Count | Single-Domain Gap Avg. (mIoU) |
|---|---|---|
| DINOv2 (Baseline) | 300M | 55.5 |
| DINOv3 | 300M | 53.0 |
| MTP | 300M | 44.7 |
| SARATR-X | 60M | 49.3 |
| CrossEarth-SAR-S | 90M (20M activated) | 59.7 |
| CrossEarth-SAR-B | 300M (80M activated) | 61.1 |
| CrossEarth-SAR-L | 1.3B (300M activated) | 61.9 |
| CrossEarth-SAR-L* | 1.3B (300M activated) | 62.7 |
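
The parameter counts in the table above imply how sparse each variant is. The short calculation below uses only the total and activated counts listed in the table; the helper name `activated_fraction` is illustrative.

```python
# (total parameters, activated parameters) as listed in the table above
variants = {
    "CrossEarth-SAR-S": (90e6, 20e6),
    "CrossEarth-SAR-B": (300e6, 80e6),
    "CrossEarth-SAR-L": (1.3e9, 300e6),
}

def activated_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that is actually used in a forward pass."""
    return active / total

for name, (total, active) in variants.items():
    print(f"{name}: {activated_fraction(total, active):.1%} of parameters activated")
```

The three variants activate roughly 22%, 27%, and 23% of their parameters respectively, so the 1.3B model runs with about the same activated size as the dense 300M DINOv2 baseline.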

Key Polarization/Complex-Valued Benchmarks (Table 2)

| Benchmark | DINOv2 (mIoU) | CrossEarth-SAR-L (mIoU) | Gain |
|---|---|---|---|
| VV2F | 65.7 | 73.8 | +8.1 |
| HH2F | 56.8 | 72.3 | +15.5 |
| F2VV | 63.2 | 69.8 | +6.6 |
| F2HH | 55.2 | 67.1 | +11.9 |
| C(r)2R | 71.3 | 76.4 | +5.1 |

Multi-Domain Gap Benchmarks (Table 3, average mIoU of 10 benchmarks)

| Method | 2+3 Domain Gaps Avg. (mIoU) |
|---|---|
| DINOv2 (Baseline) | 24.3 |
| CrossEarth-SAR-S | 24.8 |
| CrossEarth-SAR-L | 27.7 |
| CrossEarth-SAR-L* | 28.5 |
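
Every score in these tables is a mean Intersection-over-Union (mIoU). For readers less familiar with the metric, here is a generic NumPy sketch computed from a confusion matrix. The paper's exact evaluation protocol (for example ignored classes, or per-image versus dataset-level accumulation) is not described here, so treat this as the textbook definition rather than the benchmark's evaluation code.

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """pred, target: integer class maps of identical shape. Returns mIoU in [0, 1]."""
    # Confusion matrix: rows are ground-truth classes, columns are predicted classes.
    idx = target.ravel() * num_classes + pred.ravel()
    cm = np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    valid = union > 0  # skip classes absent from both prediction and ground truth
    return float((intersection[valid] / union[valid]).mean())
```

The tables report this value multiplied by 100, so an average of 28.5 on the multi-domain benchmarks means that predicted and true regions overlap by less than a third on average.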

Ablation Study (Tables 5-6)

| Ablation Setting | mIoU | Gain vs. 200K without MoE |
|---|---|---|
| 40K fully supervised only | 45.1 | -14.3 |
| CrossEarth-SAR-200K (without MoE) | 59.4 | +0.0 |
| + plain MoE | 61.1 | +1.7 |
| + $L_{BC}$ | 62.2 | +2.8 |
| + physical descriptors | 61.6 | +2.2 |
| + both | 62.4 | +3.0 |
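
The gains in the ablation are all measured against the 200K baseline without MoE, so each one includes the +1.7 contributed by plain MoE. The sketch below recomputes them and also reports the increment over plain MoE, assuming the "+ $L_{BC}$", "+ physical descriptors", and "+ both" rows are each applied on top of the plain MoE model; that reading is an assumption about the table.

```python
baseline = 59.4   # CrossEarth-SAR-200K without MoE
plain_moe = 61.1  # + plain MoE

settings = {
    "+ L_BC": 62.2,
    "+ physical descriptors": 61.6,
    "+ both": 62.4,
}

def gains(score: float) -> tuple[float, float]:
    """Return (gain over the no-MoE baseline, gain over plain MoE) in mIoU points."""
    return round(score - baseline, 1), round(score - plain_moe, 1)

for name, score in settings.items():
    over_baseline, over_moe = gains(score)
    print(f"{name}: +{over_baseline} vs. baseline, +{over_moe} vs. plain MoE")
```

Under that reading, load balancing adds 1.1 points and the physical descriptors 0.5 points over plain MoE, while using both adds 1.3 points.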

Key Findings

Data scale is crucial: The 200K dataset achieves a 14.3-point mIoU improvement over the 40K fully supervised subset (from 45.1 to 59.4).
Pseudo-labels are effective: The gain from 40K pseudo-labels is even larger than that from 40K ground-truth labels.
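Physical descriptors and load balancing are complementary: Used individually with the MoE, they reach $+2.2$ and $+2.8$ mIoU points over the non-MoE baseline respectively, while their combination reaches $+3.0$. These figures include the $+1.7$ contributed by plain MoE.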
Optimal expert configuration is $n=6, \text{top-}k=1$: Increasing top-$k$ degrades performance (single-expert specialization is superior under this data scale).
Sensitivity of physical descriptors: $H_{DE}$ is most sensitive to polarization ($73.47\%$), $ENL$ is most sensitive to complex values ($75.97\%$), and $R_{LR}$ is most sensitive to region shifts ($37.49\%$).
Hierarchical specialization of experts: Experts 3 and 4 dominate the early layers (low-level SAR cues); Experts 1, 2, 5, and 6 are active in the intermediate layers (geometry/texture); and Experts 1 and 5 concentrate on the deeper layers (high-level semantics).
Small models remain competitive: The 90M CrossEarth-SAR-S outperforms the 300M DINOv2 by 11.7 mIoU points on the HH2F benchmark.

Highlights & Insights

Physics-guided routing is the core innovation: Explicitly injecting SAR physical priors (directional entropy, equivalent number of looks, and local roughness) into the MoE routing addresses routing instability caused by heterogeneous SAR data.
The sparse MoE architecture elegantly resolves the trade-off between capacity and efficiency: It contains 1.3B parameters in total, but only 300M parameters are activated during inference.
Comprehensive ecosystem development: Beyond presenting a model, this work contributes a 200K dataset and 22 sub-benchmarks, promoting standardization in SAR domain generalization research.
Visual analysis of hierarchical expert specialization reveals meaningful patterns: early layers focus on speckles, middle layers process geometric textures, and deep layers capture semantics.

Limitations & Future Work

Performance in multi-domain gap (2-3 gaps) scenarios still has significant room for improvement (absolute mIoU is only around $28\%$).
The quality of pseudo-labels is constrained by the performance of the optical model (CrossEarth), introducing label noise.
Continual pre-training requires $16 \times \text{A100}$ GPUs, which is computationally expensive.
This work primarily validates semantic segmentation, and has not yet been extended to other downstream tasks such as change detection or object recognition.
The model fails to outperform the baseline on certain benchmarks like D2F, indicating that multi-domain gap generalization remains an open challenge.

Relation to CrossEarth: CrossEarth is the first optical domain-generalizable VFM. CrossEarth-SAR extends similar concepts to the SAR modality and handles SAR-specific multi-domain fragmentation issues using the MoE architecture.
Comparison with SARATR-X: SARATR-X focuses on object recognition with only 60M parameters, whereas CrossEarth-SAR targets dense segmentation with billion-scale parameters.
The concept of physics descriptors can be generalized to domain generalization problems in other physical imaging modalities (e.g., ultrasound, MRI).

Rating

Novelty: ⭐⭐⭐⭐⭐ (First billion-scale SAR VFM; novel physics-guided MoE routing mechanism; integration of domain knowledge via three physical descriptors.)
Experimental Thoroughness: ⭐⭐⭐⭐⭐ (22 benchmarks covering 8 types of domain gaps; comparison with 10+ methods; comprehensive ablations covering data scales, MoE configurations, learning rates, expert counts, and physical descriptors.)
Writing Quality: ⭐⭐⭐⭐ (Clear structure, precise problem definitions, and rich visualizations; the paper is long but highly informative.)
Value: ⭐⭐⭐⭐⭐ (Fills the gap in SAR domain-generalizable foundation models; delivers full-stack contributions including the dataset, benchmark, and model; open-sources both code and data.)