Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision¶
- Conference: CVPR 2026
- arXiv: 2603.13660
- Code: available (per the paper)
- Area: Medical Imaging
- Keywords: Self-supervised learning, 3D medical imaging, mask-guided pretraining, in-context segmentation, foundation model
TL;DR¶
This paper proposes MASS (MAsk-guided Self-Supervised learning), which leverages category-agnostic masks automatically generated by SAM2 as pseudo-annotations and adopts in-context segmentation as a pretext task for self-supervised pretraining. Without any manual annotation, MASS learns semantically rich and highly generalizable 3D medical image representations, achieving strong performance on both few-shot segmentation and frozen-encoder classification.
Background & Motivation¶
Absence of foundation models: Models such as GPT, CLIP, and DINO have learned universal representations from large-scale unlabeled data in natural image and language domains, yet no analogous foundation model paradigm exists for 3D medical imaging.
Limitations of existing self-supervised methods: Contrastive learning methods (SimCLR, MoCo) emphasize global features, while reconstruction-based methods such as MAE focus on low-level texture; neither captures the anatomical semantics and spatial precision required for medical imaging.
Constraints of supervised pretraining: Methods such as SuPreM and STU-Net rely on large volumes of expert annotations and are confined to predefined class taxonomies (e.g., 25 organs + 7 tumor types), making them difficult to scale to the thousands of anatomical variants and pathologies encountered in clinical practice.
Unique challenges of medical imaging: Unlike natural images, nearly every voxel in a medical scan carries clinical significance (bone density → fracture, soft-tissue texture → tumor, vascular pattern → ischemia), and spatial precision is critical.
Annotation cost barrier: Voxel-level annotation of 3D medical images requires specialized expertise and is extremely costly, limiting the scalability of segmentation-based pretraining approaches.
Core insight: Semantic segmentation is the pretext task most aligned with clinical reasoning (clinicians reason by identifying what a structure is and where it is). Although automatically generated category-agnostic masks carry no semantic labels and contain noise, they are sufficient to capture anatomically and pathologically meaningful regions.
Method¶
Overall Architecture¶
MASS consists of two stages:
Stage 1: Annotation-Free Mask Generation
- SAM2 (trained on natural images, with no medical knowledge) is applied to unlabeled 3D volumes for automatic segmentation.
- Workflow: (1) construct a 3-channel input (different window width/level settings for CT; quantile normalization for MRI/PET); (2) uniformly sample 2D slices along the optimal imaging axis; (3) apply SAM2's automatic mask generation (dense point prompts); (4) propagate the resulting masks across the full volume using SAM2's video prediction capability.
- Hundreds to thousands of 3D masks per volume are generated, covering diverse structures including organs, vessels, tumors, and lesions.
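The modality-specific 3-channel construction in step (1) can be sketched as follows. The window (center, width) pairs and quantile levels below are illustrative placeholders, not the paper's exact settings:

```python
import numpy as np

def ct_window(slice_hu, center, width):
    """Clip a CT slice (Hounsfield units) to one window and rescale to [0, 1]."""
    lo, hi = center - width / 2.0, center + width / 2.0
    return (np.clip(slice_hu, lo, hi) - lo) / (hi - lo)

def three_channel_ct(slice_hu):
    """Stack three window/level views of one CT slice into a 3-channel image.

    The (center, width) pairs are illustrative soft-tissue / lung / bone
    windows; the paper does not specify its settings in this summary.
    """
    windows = [(40, 400), (-600, 1500), (300, 1500)]
    return np.stack([ct_window(slice_hu, c, w) for c, w in windows], axis=-1)

def three_channel_quantile(slice_arr, quantiles=(0.5, 0.95, 0.99)):
    """Quantile normalization for MRI/PET: rescale by several upper quantiles,
    giving three channels with different dynamic-range emphasis."""
    qs = np.quantile(slice_arr, quantiles)
    return np.stack([np.clip(slice_arr / max(q, 1e-6), 0.0, 1.0) for q in qs],
                    axis=-1)
```

Either variant yields the 3-channel, [0, 1]-ranged 2D input that SAM2 expects from natural images.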
Stage 2: Mask-Guided Self-Supervised Learning
- An in-context segmentation (ICS) framework is adopted, following the Iris architecture.
- The model comprises three components: an image encoder \(E_\theta\), a task encoding module \(T_\phi\), and a mask decoder \(D_\psi\).
- At each iteration, an unlabeled 3D image \(x\) and its automatically generated mask \(m\) are sampled; two augmented views are created — a reference view \((x_s, y_s)\) and a query view \((x_q, y_q)\).
- The reference view provides positional ("where") information, while appearance transformations force the model to learn semantic consistency ("what") across varying visual appearances.
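The two-view construction can be sketched as below. The specific transforms (90-degree rotations and flips standing in for the paper's rotation/scaling/translation) are simplified placeholders; the key property is that spatial transforms are shared by image and mask while appearance transforms touch the image only:

```python
import numpy as np

def spatial_augment(image, mask, rng):
    """Apply the SAME spatial transform to image and mask so their
    correspondence is preserved (flips/90-degree rotations stand in for the
    paper's rotation, scaling, and translation)."""
    k = int(rng.integers(0, 4))
    image, mask = np.rot90(image, k, axes=(0, 1)), np.rot90(mask, k, axes=(0, 1))
    if rng.random() < 0.5:
        image, mask = image[::-1], mask[::-1]
    return image.copy(), mask.copy()

def appearance_augment(image, rng):
    """Image-only photometric transforms: brightness/contrast, gamma, noise."""
    image = image * rng.uniform(0.8, 1.2) + rng.uniform(-0.1, 0.1)
    image = np.clip(image, 0.0, None) ** rng.uniform(0.8, 1.2)
    return image + rng.normal(0.0, 0.01, image.shape)

def make_view(volume, mask, rng):
    """One augmented (image, mask) view; sampling this twice yields the
    reference pair (x_s, y_s) and the query pair (x_q, y_q)."""
    img, msk = spatial_augment(volume, mask, rng)
    return appearance_augment(img, rng), msk
```

Because the mask passes only through the spatial transform, intensity-matching shortcuts between reference and query are cut off, which is exactly what forces the "what" to be learned.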
Key Designs¶
- Task embedding mechanism: The reference image is encoded as \(F_s = E_\theta(x_s)\); a task embedding \(\mathcal{T} = T_\phi(F_s, y_s)\) is extracted to capture information about "which anatomical structure to segment," guiding the decoder to predict the query mask \(\hat{y}_q = D_\psi(E_\theta(x_q), \mathcal{T})\).
- Implicit semantic learning: Semantic understanding is acquired without semantic labels through invariance mechanisms. Appearance augmentations (brightness, contrast, gamma, Gaussian noise) eliminate shortcuts such as intensity matching and texture patterns; spatial augmentations (rotation, scaling, translation) remove positional and orientation cues. The model is forced to learn what remains invariant across all transformations — the essential semantic identity of anatomical structures.
- Open-set mask diversity: Thousands of category-agnostic masks spanning multiple granularities — from organ-level to sub-anatomical regions to pathologies — are used during training, compelling the model to learn broad medical concepts and composable visual primitives (texture patterns, boundary features, spatial configurations, intensity distributions).
- Multi-modal compatibility: The mask generation pipeline is applicable to CT, MRI, PET, and other modalities through modality-specific preprocessing strategies (window width/level for CT; quantile normalization for MRI/PET).
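A minimal sketch of the information flow through \(E_\theta\), \(T_\phi\), and \(D_\psi\): the real Iris modules are learned networks, so the masked-average-pooling task encoder and cosine-similarity decoder below are toy stand-ins chosen only to make the "encode reference, pool under its mask, match against query features" pipeline concrete:

```python
import numpy as np

def encode(x, weight):
    """Toy stand-in for E_theta: lift each voxel of a (D, H, W) volume to C
    feature channels via a per-channel weight and a nonlinearity."""
    return np.tanh(x[..., None] * weight)  # -> (D, H, W, C)

def task_embedding(feats, mask):
    """Stand-in for T_phi: masked average pooling of reference features over
    the reference mask, yielding one C-dim 'what to segment' vector."""
    m = mask[..., None]
    return (feats * m).sum(axis=(0, 1, 2)) / (m.sum() + 1e-6)

def decode(query_feats, task_emb):
    """Stand-in for D_psi: cosine similarity between every query-voxel
    feature and the task embedding, giving a soft segmentation map."""
    q = query_feats / (np.linalg.norm(query_feats, axis=-1, keepdims=True) + 1e-6)
    t = task_emb / (np.linalg.norm(task_emb) + 1e-6)
    return q @ t  # similarity map in [-1, 1], shape (D, H, W)
```

Voxels whose features resemble the pooled reference embedding score high, so the decoder localizes the same structure in the query view even after spatial and appearance transforms.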
Loss & Training¶
- Loss function: \(\mathcal{L}_{Seg} = \mathcal{L}_{Dice}(\hat{y}_q, y_q) + \mathcal{L}_{BCE}(\hat{y}_q, y_q)\), jointly optimizing Dice Loss and binary cross-entropy.
- Data augmentation: Spatial transformations (rotation, scaling, translation) are applied jointly to images and masks to preserve correspondence; appearance transformations (brightness, contrast, gamma, Gaussian noise) are applied to images only.
- Default backbone: 3D ResUNet.
- Pretraining scale: Ranges from small-scale (single dataset, 20–200 scans) to large-scale (5K multi-modal CT/MRI/PET volumes across 12 datasets).
- Three downstream usage modes: (1) training-free in-context segmentation (no parameter updates); (2) task-specific fine-tuning; (3) frozen-encoder classification.
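The training objective \(\mathcal{L}_{Dice} + \mathcal{L}_{BCE}\) is straightforward to write down; a minimal NumPy version operating on voxel probabilities:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2*|intersection| / (|pred| + |target|)."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy averaged over voxels; pred holds probabilities."""
    p = np.clip(pred, eps, 1.0 - eps)
    return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)).mean()

def seg_loss(pred, target):
    """L_Seg = L_Dice + L_BCE, the paper's combined segmentation objective."""
    return dice_loss(pred, target) + bce_loss(pred, target)
```

The Dice term handles the severe foreground/background imbalance typical of 3D medical volumes, while the BCE term supplies dense, well-behaved per-voxel gradients.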
Key Experimental Results¶
Main Results¶
Table 1: Single-dataset few-shot segmentation (Dice %)
| Method | BCV 1-shot | BCV 10-shot | AMOS MR 1-shot | AMOS MR 10-shot | SS H&N 1-shot | KiTS 30-shot |
|---|---|---|---|---|---|---|
| Scratch | 27.3 | 75.2 | 32.8 | 75.9 | 51.8 | 35.7 |
| SimCLR | 44.9 | 78.4 | 35.9 | 78.0 | 53.6 | 41.5 |
| MASS-IC | 65.5 | 73.6 | 62.1 | 71.6 | 59.3 | 3.8 |
| MASS-FT | 68.8 | 83.7 | 65.9 | 84.7 | 66.9 | 64.3 |
| Fully supervised | 83.6 | — | 85.5 | — | 78.2 | 81.7 |
Table 2: Large-scale multi-modal pretraining segmentation (Dice %, pretrained on 5K volumes)
| Method | BCV 1-shot | AMOS MR 1-shot | KiTS 30-shot | Pelvic 1-shot |
|---|---|---|---|---|
| SuPreM (supervised) | 63.9 | 55.1 | 64.1 | 85.4 |
| Iris-FT (supervised) | 83.4 | 83.6 | 78.3 | 86.9 |
| AnatoMix | 53.1 | 35.9 | 40.6 | 82.2 |
| Merlin | 50.1 | 37.9 | 51.1 | 79.3 |
| MASS-FT | 70.2 | 74.3 | 68.5 | 92.8 |
Table 3: Classification performance (AUC %, frozen encoder)
| Method | RSNA ICH 5% | RSNA ICH 100% | Liver Trauma 30% | Kidney Trauma 30% |
|---|---|---|---|---|
| Scratch (full training) | 72.8 | 89.5 | 74.4 | 75.0 |
| SuPreM | 73.5 | 78.3 | 68.3 | 54.9 |
| Merlin | 57.3 | 65.5 | 60.1 | 58.0 |
| MASS | 75.4 | 81.5 | 86.7 | 82.9 |
Ablation Study¶
Mask quality analysis: The average Dice between automatically generated masks and ground truth is only 15.2% (BCV) and 7.1% (SS H&N), and only 14%/13% of masks reach Dice > 40%; nevertheless, MASS achieves 1-shot Dice of 65.5% and 59.3% on the two datasets, demonstrating that such weak, noisy supervision is sufficient.
Comparison of mask generation methods:
| Mask Source | BCV 1-shot | SS H&N 1-shot |
|---|---|---|
| TotalSegmentator | 80.7 | 13.5 (category not covered) |
| SAM2 | 65.5 | 59.3 |
| SLIC superpixels | 54.3 | 43.8 |
Data diversity > quantity: Expanding the pretraining data from single-organ abdominal CT (BCV, 42.7% 1-shot Dice) to whole-body CT and then to multi-modal data raises performance to 73.9%. Anatomical and modality diversity drives the gains, whereas simply stacking more in-domain data saturates quickly.
Architecture generalization: ResUNet and I3DResNet152 achieve comparable performance under identical settings (segmentation: 73.87 vs. 72.56; classification: 75.42 vs. 75.98), confirming that the method is independent of the specific encoder design.
Key Findings¶
- Anatomy vs. pathology: MASS-IC demonstrates strong few-shot capability on anatomical structures (organs), but in-context performance is limited for high-variability tumors (KiTS: 2.7%); fine-tuning (MASS-FT) substantially outperforms baselines (64.3% vs. 42.2%).
- 25–40% of annotations match full supervision: On anatomical-structure datasets, MASS-FT trained with only 10-shot data (25–40% of the training set) matches fully supervised performance.
- Frozen encoder surpasses full training: With 5% of RSNA ICH data, MASS with a frozen encoder (75.4%) outperforms training from scratch (72.8%); gains are even more pronounced on Trauma 30% data (liver: 86.7 vs. 74.4; kidney: 82.9 vs. 75.0).
- OOD generalization: On entirely unseen datasets (BraTS, ACDC, Pelvic), MASS demonstrates competitive or superior performance compared to supervised pretraining (Pelvic: 92.8 vs. Iris 86.9).
Highlights & Insights¶
- Paradigm innovation: This work is the first to establish "category-agnostic mask-guided in-context segmentation" as a pretext task for self-supervised pretraining of 3D medical images, effectively bypassing the annotation bottleneck.
- Weak to strong: Although automated masks overlap with ground truth by only 7–15% on average, training on thousands of "approximately correct" segmentation tasks enables the model to learn semantic concepts that transcend the boundaries of individual masks.
- High data efficiency: Using only 5K volumes (far fewer than OpenMind's 114K), MASS outperforms all self-supervised baselines; single-dataset pretraining on BCV (23 scans) already approaches SuPreM on ICH classification.
- Semantics emerge from invariance: Without requiring semantic labels, appearance and spatial variations induced by augmentation force the model to learn the only invariant factor — the essential semantic identity of anatomical structures.
- Open-set advantage: Unconstrained by predefined categories, SAM2 masks naturally cover multiple granularities and structures; in taxonomy-mismatched scenarios (e.g., SS H&N), MASS substantially outperforms TotalSegmentator-based approaches.
Limitations & Future Work¶
- Weak in-context capability for pathological structures: Zero-shot in-context segmentation of high-variability tumors (e.g., KiTS) is ineffective (2.7%); fine-tuning is necessary for adequate performance.
- Unexplored synergy between weak masks and expert annotations: The paper deliberately excludes annotated data and does not investigate the potential of combining automatic masks with a small number of expert annotations.
- Dependence on SAM2's boundary detection capability: Mask quality is bounded by SAM2's domain transfer performance on medical images, and results may be suboptimal for structures with ambiguous boundaries.
- No vision-language alignment: The method is not aligned with textual modalities such as radiology reports, limiting its applicability to tasks such as report generation.
- Gap with supervised pretraining: When the evaluation targets align with the supervised annotations (e.g., BCV), supervised methods (Iris-FT: 83.4%) still lead MASS (70.2%) by roughly 13 points.
Related Work & Insights¶
- Self-supervised learning: Model Genesis (image restoration), MAE/SimMIM (masked reconstruction), DINO (self-distillation), SimCLR/MoCo (contrastive learning) — these focus on general visual features and lack the spatial precision required for semantic segmentation.
- Supervised pretraining: SuPreM (25M annotated voxels), STU-Net (TotalSegmentator whole-body CT) — constrained by annotation scale and predefined category taxonomies.
- Synthetic data: AnatoMix (synthetic CT generated from TotalSegmentator masks) — distribution shift limits transfer performance.
- Universal segmentation and interactive models: UniverSeg, Iris (in-context learning but requiring annotations), SAM medical adaptations — still require annotations or manual interaction at inference time.
- MASS's distinction: The only self-supervised approach that jointly offers annotation-free pretraining, open-set masks, emergent semantics, and multi-modal compatibility.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First work to use category-agnostic automatic masks for self-supervised pretraining of medical images; the pretext task design is elegant and intuitively motivated.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers 4 modalities, 12+ datasets, segmentation and classification task tracks, scale experiments from 20 scans to 5K volumes, and multi-dimensional ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ — The motivation–method–experiment logical chain is complete; the explanation of "invariance → emergent semantics" is elegant and compelling.
- Value: ⭐⭐⭐⭐⭐ — Provides a scalable, annotation-free pathway toward 3D medical image foundation models with strong practical utility.