S5: Scalable Semi-Supervised Semantic Segmentation in Remote Sensing¶
Conference: AAAI 2026 | arXiv: 2508.12409 | Code: S5 | Area: Remote Sensing Image Segmentation | Keywords: semi-supervised learning, remote sensing, foundation model, semantic segmentation, mixture of experts
TL;DR¶
This paper proposes the S5 framework, which for the first time extends semi-supervised semantic segmentation into a pre-training paradigm for remote sensing foundation models (RSFMs). By constructing the million-scale RS4P-1M dataset and introducing a MoE-based multi-dataset fine-tuning strategy, S5 achieves state-of-the-art performance across multiple remote sensing segmentation and detection benchmarks.
Background & Motivation¶
Semantic segmentation in remote sensing relies heavily on pixel-level annotations, which are extremely costly to obtain. Semi-supervised semantic segmentation (S4) leverages unlabeled data through pseudo-labels and consistency regularization, yet existing S4 research remains confined to small-scale datasets and models (e.g., constructing labeled/unlabeled splits within a single dataset such as iSAID), leaving the vast amount of available Earth observation data largely unexploited.
Meanwhile, although RSFMs have progressed through both self-supervised (MAE, contrastive learning) and supervised pre-training, supervised pre-training is limited by annotation scale, while self-supervised pre-training suffers from a gap with downstream segmentation tasks. This motivates the central question: can S4 be scaled to millions of unlabeled remote sensing images and serve as a pre-training strategy for RSFMs?
The core idea of this paper is to elevate S4 from "an intra-dataset training strategy" to "a large-scale pre-training paradigm (S4P)," while ensuring pseudo-label quality through data filtering and enabling efficient multi-dataset adaptation through MoE fine-tuning.
Method¶
Overall Architecture¶
S5 consists of three stages: (1) Dataset Construction — filtering million-scale unlabeled images from datasets such as MillionAID via entropy filtering and diversity expansion to build RS4P-1M; (2) S4 Pre-training (S4P) — using iSAID as labeled data and RS4P-1M as unlabeled data, performing semi-supervised pre-training with FixMatch; (3) MoE Multi-Dataset Fine-tuning (MoE-MDF) — embedding FFN-MoE modules into a shared backbone, jointly fine-tuning on multiple downstream benchmarks using shared experts and dataset-specific experts.
Key Designs¶
1. Data Selection via Entropy Filtering + Diversity Expansion
Directly using all available unlabeled data degrades pseudo-label quality due to distribution mismatch. S5 therefore first trains an initial segmentation model (ViT-H + UperNet) on iSAID, runs inference on the unlabeled images, and scores each image by the mean entropy of its pixel-level predictions.
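As a sketch with assumed notation (not taken verbatim from the paper), the score for an unlabeled image would be

\[
\bar{H} = -\frac{1}{HW}\sum_{i=1}^{HW}\sum_{c=1}^{C} p_{i,c}\log p_{i,c},
\]

where \(p_{i,c}\) is the predicted probability of class \(c\) at pixel \(i\) of the \(H \times W\) prediction map.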
Low-entropy samples are prioritized, since their pseudo-labels are more reliable, but selecting only low-entropy samples leads to semantic redundancy. Therefore, K-Means is applied to cluster labeled image features into \(M\) clusters, each unlabeled image is assigned to its nearest cluster, and selection proceeds per cluster according to a fixed quota.
Selection stops once each cluster reaches its quota, ensuring semantic diversity. The final RS4P-1M dataset contains 1 million images.
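A minimal sketch of this two-step selection, assuming image-level features and precomputed mean-entropy scores are available (function and variable names are illustrative, not the released pipeline):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_images(unlabeled_feats, unlabeled_entropy, labeled_feats,
                  num_clusters, total_budget):
    # Fit K-Means on labeled-image features and assign each unlabeled image
    # to its nearest cluster centre.
    kmeans = KMeans(n_clusters=num_clusters).fit(labeled_feats)
    cluster_ids = kmeans.predict(unlabeled_feats)
    quota = total_budget // num_clusters          # equal per-cluster quota
    selected = []
    for c in range(num_clusters):
        members = np.where(cluster_ids == c)[0]
        # Rank by ascending mean entropy: low entropy = more reliable pseudo-labels.
        ranked = members[np.argsort(unlabeled_entropy[members])]
        selected.extend(ranked[:quota].tolist())  # stop once the cluster quota is filled
    return selected
```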
2. S4 Pre-training Based on FixMatch
Weak-strong augmentation consistency regularization is adopted. Two views of each unlabeled image are generated: a weakly augmented view (random scaling, cropping, flipping) and a strongly augmented view (CutMix, color jitter, Gaussian blur). The total loss combines a supervised term and an unsupervised consistency term: the supervised loss \(\mathcal{L}_s\) is standard cross-entropy on labeled data, while the unsupervised loss enforces agreement between the strong-view prediction and the weak-view pseudo-label and is computed only for high-confidence pixels (\(\max(p_j^{u_w}) \geq \tau\)).
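Since the framework follows FixMatch, the loss presumably takes the standard FixMatch form (a sketch with assumed notation and weighting):

\[
\mathcal{L} = \mathcal{L}_s + \lambda_u \mathcal{L}_u,\qquad
\mathcal{L}_u = \frac{1}{|\mathcal{B}_u|}\sum_{j}\mathbb{1}\!\left[\max\!\big(p_j^{u_w}\big)\ge\tau\right]\,
\mathrm{CE}\!\left(\arg\max\!\big(p_j^{u_w}\big),\; p_j^{u_s}\right),
\]

where \(p_j^{u_w}\) and \(p_j^{u_s}\) are the predictions on the weak and strong views and \(\lambda_u\) weights the unsupervised term.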
All RSFMs are initialized with MAE pre-trained weights, and S4P further enhances representation capability on top of this.
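A hypothetical sketch of the unlabeled-batch loss in this FixMatch-style setup (the augmentation callables and threshold value are placeholders, and the label remixing needed when CutMix serves as the strong augmentation is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def unsupervised_loss(model, image_u, weak_aug, strong_aug, tau=0.95):
    # Pseudo-labels come from the weakly augmented view (no gradient).
    with torch.no_grad():
        probs_w = torch.softmax(model(weak_aug(image_u)), dim=1)   # B x C x H x W
        conf, pseudo = probs_w.max(dim=1)                          # per-pixel confidence and label
    # The strongly augmented view is trained to match those pseudo-labels.
    logits_s = model(strong_aug(image_u))
    loss_map = F.cross_entropy(logits_s, pseudo, reduction="none") # per-pixel CE, B x H x W
    mask = (conf >= tau).float()                                   # keep confident pixels only
    return (loss_map * mask).sum() / mask.sum().clamp(min=1.0)
```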
3. MoE Multi-Dataset Fine-tuning (MoE-MDF)
The conventional "one model per dataset" approach leads to parameter redundancy. MoE-MDF instead introduces a branching structure in the FFN of ViT: shared experts learn general features while dataset-specific experts learn domain-specific features. Specifically, the intermediate FFN feature \(F_{\text{FFN}}\) is passed through two parallel linear layers, a shared-expert projection and a dataset-specific expert projection, producing \(F^{\text{shared}}\) and \(F^{\text{specific}}\) respectively.
The outputs are concatenated as \(F^{\text{out}} = \text{Concat}(F^{\text{shared}}, F^{\text{specific}})\). The hyperparameter \(\alpha\) controls the split between shared and dataset-specific capacity; since the concatenated width matches that of a standard FFN, the design incurs no additional inference latency.
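A minimal PyTorch sketch of such an FFN-MoE block, assuming \(\alpha\) denotes the fraction of hidden width given to the dataset-specific expert (module and parameter names are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn

class FFNMoE(nn.Module):
    def __init__(self, dim, hidden_dim, num_datasets, alpha=0.25):
        super().__init__()
        specific_dim = int(hidden_dim * alpha)   # capacity of each dataset-specific expert
        shared_dim = hidden_dim - specific_dim   # remaining capacity is shared
        self.fc_shared = nn.Linear(dim, shared_dim)
        self.fc_specific = nn.ModuleList(
            [nn.Linear(dim, specific_dim) for _ in range(num_datasets)]
        )
        self.act = nn.GELU()
        self.fc_out = nn.Linear(hidden_dim, dim)

    def forward(self, x, dataset_id):
        f_shared = self.fc_shared(x)                   # shared expert sees every dataset
        f_specific = self.fc_specific[dataset_id](x)   # expert selected by dataset identity
        f = torch.cat([f_shared, f_specific], dim=-1)  # F_out = Concat(F_shared, F_specific)
        return self.fc_out(self.act(f))
```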
Loss & Training¶
- Pre-training stage: Joint supervised and unsupervised FixMatch loss; low-quality pseudo-labels are filtered by confidence threshold \(\tau\)
- Fine-tuning stage: Dataset-specific decoders on top of a shared MoE backbone; shared experts are updated by all datasets, while each specific expert is updated only by its corresponding dataset (see the sketch below)
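A hypothetical sketch of one MoE-MDF fine-tuning step under these rules (the backbone's dataset_id argument and the decoder/batch objects are placeholders, not the released training code):

```python
import torch
import torch.nn.functional as F

def mdf_step(backbone, decoders, optimizer, batches):
    """batches: dict mapping dataset_id -> (images, labels)."""
    total_loss = 0.0
    for dataset_id, (images, labels) in batches.items():
        feats = backbone(images, dataset_id=dataset_id)  # routes to that dataset's specific expert
        logits = decoders[dataset_id](feats)             # dataset-specific decoder head
        total_loss = total_loss + F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    total_loss.backward()   # shared experts receive gradients from every dataset;
    optimizer.step()        # each specific expert only from its own dataset's batch
    return float(total_loss)
```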
Key Experimental Results¶
Main Results¶
| Method | Backbone | Det. Params (M) | DIOR-R mAP | DOTA-v2 mAP | Seg. Params (M) | Vaihingen mIoU | Potsdam mF1 | LoveDA mIoU | OpenEarthMap mIoU |
|---|---|---|---|---|---|---|---|---|---|
| RVSA | ViT-B+RVSA | 222.4 | 68.06 | 55.22 | 412.8 | 78.49 | 91.58 | 52.44 | 66.63 |
| GFM | Swin-B | 208.2 | 67.67 | 59.15 | 387.6 | 79.61 | 91.85 | 54.98 | 67.78 |
| MTP | ViT-L+RVSA | 669.2 | 74.54 | 58.41 | 1309.6 | 80.62 | 92.47 | 54.16 | 69.04 |
| SelectiveMAE | ViT-L | 669.2 | 71.75 | 57.84 | 1309.6 | 80.45 | 92.78 | 54.31 | 69.30 |
| BillionFM | ViT-G | 1993.9 | 73.62 | 58.69 | - | - | 92.58 | 54.40 | - |
| S5 | ViT-B | 138.3 | 72.95 | 57.20 | 160.4 | 79.85 | 92.40 | 54.02 | 68.65 |
| S5 | ViT-L | 377.8 | 75.21 | 59.71 | 435.0 | 80.72 | 92.78 | 55.67 | 69.66 |
| S5 | ViT-H | 730.0 | 75.30 | 59.89 | 824.5 | 80.85 | 92.97 | 55.65 | 70.02 |
In the multi-dataset setting, S5-ViT-L requires only 435M segmentation parameters (roughly one third of Scale-MAE/SelectiveMAE) while matching or outperforming all listed baselines on every benchmark.
Ablation Study¶
Comparison of Pre-training Datasets (ViT-B backbone):
| Unlabeled Data | # Images | iSAID Val | Vaihingen | LoveDA | DIOR-R |
|---|---|---|---|---|---|
| None (MAE baseline) | - | 65.93 | 78.27 | 52.47 | 68.02 |
| SAMRS | 100k | 67.59 | 79.61 | 53.66 | 69.13 |
| MillionAID-random | 100k | 66.32 | 79.49 | 53.20 | 69.02 |
| MillionAID* (filtered) | 100k | 67.66 | 79.77 | 53.81 | 69.65 |
The filtered MillionAID* outperforms both random sampling and SAMRS across all tasks.
Comparison of Fine-tuning Strategies (MAE+S4P, ViT-B):
| Fine-tuning Strategy | Params (M) | Vaihingen | Potsdam | OpenEarthMap | LoveDA | Avg. |
|---|---|---|---|---|---|---|
| SDF (single-dataset) | 412.8 | 79.93 | 92.24 | 67.35 | 54.51 | 73.51 |
| MDF (multi-dataset) | 132.1 | 79.82 | 92.25 | 68.41 | 54.53 | 73.75 |
| MoE-MDF (α=1/4) | 160.4 | 79.85 | 92.40 | 68.80 | 54.57 | 74.15 |
| MoE-MDF (α=1/2) | 188.7 | 79.84 | 92.39 | 68.66 | 54.64 | 73.88 |
MoE-MDF achieves an average accuracy gain of 0.64 points over SDF while using less than 40% of its parameters (160.4M vs. 412.8M); α=1/4 is the optimal ratio.
Key Findings¶
- S4P as a second-stage pre-training following MAE yields significant improvements across all downstream tasks
- The data filtering strategy is effective: filtered 100k images outperform random 100k and scale well to 1M
- Both model scale and data scale contribute: scaling from ViT-B to ViT-H and from 100k to 1M images each yields consistent gains
- MoE-MDF enables unified multi-dataset deployment with minimal parameter overhead
Highlights & Insights¶
- New Paradigm: Elevating semi-supervised learning from a "training trick" to a "foundation model pre-training strategy" represents a conceptual breakthrough
- High Practicality: RS4P-1M dataset construction requires no additional annotations (unlike SAMRS which depends on SAM-generated masks), offering strong scalability
- Parameter Efficiency: MoE-MDF reduces parameter count to 1/3–1/4 of conventional methods in multi-dataset scenarios
Limitations & Future Work¶
- Pre-training exclusively employs FixMatch as the S4 method; the potential gains from more recent semi-supervised methods (e.g., UniMatchV2) remain unexplored
- The MoE design is relatively simple (only shared and specific branches), without routing mechanisms or more sophisticated expert selection
- Data filtering requires an initial model trained on iSAID, introducing a dependency on the quality of this initial model
Related Work & Insights¶
- vs. MTP: MTP relies on supervised multi-task pre-training, whereas S5 uses semi-supervised pre-training without additional annotations yet achieves superior performance (ViT-L: 75.21 vs. 74.54 mAP on DIOR-R)
- vs. SAMRS: SAMRS depends on SAM to generate masks for annotation construction, limiting its scale; S5 directly leverages unlabeled data, making it more scalable
- vs. SelectiveMAE: Both focus on data selection, but SelectiveMAE applies it to self-supervised pre-training while S5 applies it to semi-supervised pre-training
Rating¶
- Novelty: ⭐⭐⭐⭐ Extending S4 into a pre-training paradigm is a novel direction; the RS4P-1M dataset construction method is practically valuable
- Experimental Thoroughness: ⭐⭐⭐⭐ Six benchmarks, multi-scale models, and detailed ablations constitute a comprehensive evaluation
- Writing Quality: ⭐⭐⭐⭐ Clear logic, rich figures and tables, intuitive architecture diagrams
- Value: ⭐⭐⭐⭐ Provides a new pre-training paradigm for remote sensing foundation models with high practical value