Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision

Conference: CVPR 2026
arXiv: 2603.13660
Code: Available (mentioned in the paper)
Area: Medical Imaging
Keywords: Self-supervised learning, 3D medical imaging, mask-guided pretraining, in-context segmentation, foundation model

TL;DR

This paper proposes MASS (MAsk-guided Self-Supervised learning), which leverages category-agnostic masks automatically generated by SAM2 as pseudo-annotations and adopts in-context segmentation as a pretext task for self-supervised pretraining. Without any manual annotation, MASS learns semantically rich and highly generalizable 3D medical image representations, achieving strong performance on both few-shot segmentation and frozen-encoder classification.

Background & Motivation

Absence of foundation models: Models such as GPT, CLIP, and DINO have learned universal representations from large-scale unlabeled data in natural image and language domains, yet no analogous foundation model paradigm exists for 3D medical imaging.

Limitations of existing self-supervised methods: Contrastive learning methods (SimCLR, MoCo) emphasize global features, while reconstruction-based methods such as MAE focus on low-level texture; neither captures the anatomical semantics and spatial precision required for medical imaging.

Constraints of supervised pretraining: Methods such as SuPreM and STU-Net rely on large volumes of expert annotations and are confined to predefined class taxonomies (e.g., 25 organs + 7 tumor types), making them difficult to scale to the thousands of anatomical variants and pathologies encountered in clinical practice.

Unique challenges of medical imaging: Unlike natural images, nearly every voxel in a medical scan carries clinical significance (bone density → fracture, soft-tissue texture → tumor, vascular pattern → ischemia), and spatial precision is critical.

Annotation cost barrier: Voxel-level annotation of 3D medical images requires specialized expertise and is extremely costly, limiting the scalability of segmentation-based pretraining approaches.

Core insight: Semantic segmentation is the pretext task most aligned with clinical reasoning (clinicians reason by identifying what a structure is and where it is). Although automatically generated category-agnostic masks carry no semantic labels and contain noise, they are sufficient to capture anatomically and pathologically meaningful regions.

Method

Overall Architecture

MASS consists of two stages:

Stage 1: Annotation-Free Mask Generation

SAM2 (trained on natural images, with no medical knowledge) is applied to unlabeled 3D volumes for automatic segmentation.
Workflow: (1) construct a 3-channel input (different window width/level settings for CT; quantile normalization for MRI/PET); (2) uniformly sample 2D slices along the optimal imaging axis; (3) run SAM2's automatic mask generation (dense point prompts) on the sampled slices; (4) use SAM2's video prediction capability to propagate the masks across the full volume.
Hundreds to thousands of 3D masks per volume are generated, covering diverse structures including organs, vessels, tumors, and lesions.
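
The preprocessing part of Stage 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the three window settings, the quantile bounds, and all function names are hypothetical, since the summary does not give the actual values. The SAM2 calls themselves are omitted.

```python
import numpy as np

# Hypothetical (center, width) windows in Hounsfield units. The source does not
# state which windows MASS uses; these are common soft-tissue, lung, and bone
# settings chosen only for illustration.
CT_WINDOWS = ((40.0, 400.0), (-600.0, 1500.0), (400.0, 1800.0))


def window_ct(volume_hu, center, width):
    """Clip a CT volume to one intensity window and rescale it to [0, 1]."""
    low, high = center - width / 2.0, center + width / 2.0
    return (np.clip(volume_hu, low, high) - low) / (high - low)


def ct_to_three_channels(volume_hu, windows=CT_WINDOWS):
    """Stack three windowed copies of a CT volume along a trailing channel axis."""
    return np.stack([window_ct(volume_hu, c, w) for c, w in windows], axis=-1)


def quantile_normalize(volume, q_low=0.005, q_high=0.995):
    """Rescale an MRI/PET volume to [0, 1] between two intensity quantiles.

    The quantile bounds are assumptions, not values taken from the paper.
    """
    low, high = np.quantile(volume, [q_low, q_high])
    return np.clip((volume - low) / max(high - low, 1e-8), 0.0, 1.0)


def sample_slice_indices(depth, num_slices):
    """Pick evenly spaced slice indices along the chosen imaging axis."""
    return np.linspace(0, depth - 1, num_slices).round().astype(int)
```

Each sampled slice of the 3-channel volume would then be passed to SAM2's automatic mask generator, and the resulting 2D masks propagated through the volume with its video predictor.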

Stage 2: Mask-Guided Self-Supervised Learning

An in-context segmentation (ICS) framework is adopted, following the Iris architecture.
The model comprises three components: an image encoder \(E_\theta\), a task encoding module \(T_\phi\), and a mask decoder \(D_\psi\).
At each iteration, an unlabeled 3D image \(x\) and its automatically generated mask \(m\) are sampled; two augmented views are created — a reference view \((x_s, y_s)\) and a query view \((x_q, y_q)\).
The reference view provides positional ("where") information, while appearance transformations force the model to learn semantic consistency ("what") across varying visual appearances.
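
The construction of the reference and query views can be illustrated with a small NumPy sketch. It rests on several assumptions: random flips and 90-degree rotations stand in for the rotation, scaling, and translation named in the text, all parameter ranges are hypothetical, intensities are assumed to lie in [0, 1], and volumes are assumed cubic so that rotations preserve shape. The key property it shows is that spatial transforms act on image and mask together, while appearance transforms touch the image only.

```python
import numpy as np


def spatial_transform(image, mask, rng):
    """Apply the same random rotation and flip to an image and its mask.

    Simplification: right-angle rotations and flips replace the rotation,
    scaling, and translation listed in the source.
    """
    k = int(rng.integers(0, 4))
    image, mask = np.rot90(image, k, axes=(1, 2)), np.rot90(mask, k, axes=(1, 2))
    if rng.random() < 0.5:
        image, mask = np.flip(image, axis=0), np.flip(mask, axis=0)
    return image, mask


def appearance_transform(image, rng):
    """Perturb intensities only; the ranges below are hypothetical."""
    image = image * rng.uniform(0.8, 1.2) + rng.uniform(-0.1, 0.1)  # contrast, brightness
    image = np.clip(image, 0.0, 1.0) ** rng.uniform(0.7, 1.5)       # gamma
    image = image + rng.normal(0.0, 0.02, size=image.shape)         # Gaussian noise
    return np.clip(image, 0.0, 1.0)


def make_views(image, mask, rng):
    """Return (x_s, y_s), (x_q, y_q): two independently augmented views."""
    views = []
    for _ in range(2):
        x, y = spatial_transform(image, mask, rng)
        views.append((appearance_transform(x, rng), y.copy()))
    return views[0], views[1]
```

Because the two views are drawn independently, the query mask generally differs from the reference mask in orientation, and the query image differs in intensity statistics, which is what removes the positional and intensity shortcuts.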

Key Designs

Task embedding mechanism: The reference image is encoded as \(F_s = E_\theta(x_s)\); a task embedding \(\mathcal{T} = T_\phi(F_s, y_s)\) is extracted to capture information about "which anatomical structure to segment," guiding the decoder to predict the query mask \(\hat{y}_q = D_\psi(E_\theta(x_q), \mathcal{T})\).
Implicit semantic learning: Semantic understanding is acquired without semantic labels through invariance mechanisms. Appearance augmentations (brightness, contrast, gamma, Gaussian noise) eliminate shortcuts such as intensity matching and texture patterns; spatial augmentations (rotation, scaling, translation) remove positional and orientation cues. The model is forced to learn what remains invariant across all transformations — the essential semantic identity of anatomical structures.
Open-set mask diversity: Thousands of category-agnostic masks spanning multiple granularities — from organ-level to sub-anatomical regions to pathologies — are used during training, compelling the model to learn broad medical concepts and composable visual primitives (texture patterns, boundary features, spatial configurations, intensity distributions).
Multi-modal compatibility: The mask generation pipeline is applicable to CT, MRI, PET, and other modalities through modality-specific preprocessing strategies (window width/level for CT; quantile normalization for MRI/PET).
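
The data flow of the task embedding mechanism, \(\hat{y}_q = D_\psi(E_\theta(x_q), T_\phi(E_\theta(x_s), y_s))\), can be traced with toy stand-ins for the three components. Everything below is an assumption made for illustration: the hand-written features, the masked average pooling used as the task encoder, and the similarity-based decoder are not the Iris or MASS modules, which are learned networks. The sketch only shows which inputs each component consumes and what shapes flow between them.

```python
import numpy as np


def encode(x):
    """Toy stand-in for E_theta: per-voxel features of shape (C, D, H, W)."""
    return np.stack([x, x ** 2, np.ones_like(x)], axis=0)


def task_embedding(features, mask):
    """Toy stand-in for T_phi: average the features inside the reference mask."""
    weights = mask.astype(features.dtype)
    return (features * weights).sum(axis=(1, 2, 3)) / max(weights.sum(), 1.0)


def decode(features, task):
    """Toy stand-in for D_psi: sigmoid of scaled cosine similarity to the task vector."""
    norm = np.linalg.norm(features, axis=0) * np.linalg.norm(task) + 1e-8
    cosine = np.tensordot(task, features, axes=(0, 0)) / norm
    return 1.0 / (1.0 + np.exp(-10.0 * (cosine - 0.9)))  # hypothetical scale and offset


def in_context_segment(x_s, y_s, x_q):
    """Compose the three components in the order given in the text."""
    task = task_embedding(encode(x_s), y_s)
    return decode(encode(x_q), task)
```

Note that the reference mask enters only through the task embedding, so the same encoder and decoder weights can serve any structure that a reference pair specifies.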

Loss & Training

Loss function: \(\mathcal{L}_{Seg} = \mathcal{L}_{Dice}(\hat{y}_q, y_q) + \mathcal{L}_{BCE}(\hat{y}_q, y_q)\), jointly optimizing Dice Loss and binary cross-entropy.
Data augmentation: Spatial transformations (rotation, scaling, translation) are applied jointly to images and masks to preserve correspondence; appearance transformations (brightness, contrast, gamma, Gaussian noise) are applied to images only.
Default backbone: 3D ResUNet.
Pretraining scale: Ranges from small-scale (single dataset, 20–200 scans) to large-scale (5K multi-modal CT/MRI/PET volumes across 12 datasets).
Three downstream usage modes: (1) training-free in-context segmentation (no parameter updates); (2) task-specific fine-tuning; (3) frozen-encoder classification.
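
The segmentation loss \(\mathcal{L}_{Seg} = \mathcal{L}_{Dice} + \mathcal{L}_{BCE}\) can be written out as a short NumPy sketch. The equal weighting follows the formula above; the soft Dice formulation, the smoothing constants, and the function names are assumptions, since the summary does not specify them.

```python
import numpy as np


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2 * overlap / (sum(pred) + sum(target))."""
    intersection = (pred * target).sum()
    return float(1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)).mean())


def seg_loss(pred, target):
    """Sum of Dice and BCE terms, with equal weights as in the formula."""
    return dice_loss(pred, target) + bce_loss(pred, target)
```

Here `pred` holds per-voxel foreground probabilities for the query view and `target` is the binary query mask \(y_q\). Dice is insensitive to the large background, while BCE supplies a per-voxel gradient, which is a common reason for pairing the two.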

Key Experimental Results

Main Results

Table 1: Single-dataset few-shot segmentation (Dice %)

Method	BCV 1-shot	BCV 10-shot	AMOS MR 1-shot	AMOS MR 10-shot	SS H&N 1-shot	KiTS 30-shot
Scratch	27.3	75.2	32.8	75.9	51.8	35.7
SimCLR	44.9	78.4	35.9	78.0	53.6	41.5
MASS-IC	65.5	73.6	62.1	71.6	59.3	2.7
MASS-FT	68.8	83.7	65.9	84.7	66.9	64.3
Fully supervised	83.6	—	85.5	—	78.2	81.7

Table 2: Large-scale multi-modal pretraining segmentation (Dice %, pretrained on 5K volumes)

Method	BCV 1-shot	AMOS MR 1-shot	KiTS 30-shot	Pelvic 1-shot
SuPreM (supervised)	63.9	55.1	64.1	85.4
Iris-FT (supervised)	83.4	83.6	78.3	86.9
AnatoMix	53.1	35.9	40.6	82.2
Merlin	50.1	37.9	51.1	79.3
MASS-FT	70.2	74.3	68.5	92.8

Table 3: Classification performance (AUC %, frozen encoder)

Method	RSNA ICH 5%	RSNA ICH 100%	Liver Trauma 30%	Kidney Trauma 30%
Scratch (full training)	72.8	89.5	74.4	75.0
SuPreM	73.5	78.3	68.3	54.9
Merlin	57.3	65.5	60.1	58.0
MASS	75.4	81.5	86.7	82.9

Ablation Study

Mask quality analysis: The average Dice between automatically generated masks and ground truth is only 15.2% (BCV) and 7.1% (SS H&N), and only 14% and 13% of masks, respectively, reach Dice > 40%. Nevertheless, MASS-IC achieves 1-shot Dice of 65.5% and 59.3% on the two datasets, indicating that noisy, weak supervision is sufficient for pretraining.
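
A mask quality analysis of this kind could be computed roughly as follows. This is a sketch under an explicit assumption: each automatic mask is scored against its best-overlapping ground-truth structure. The summary does not say how masks were matched to ground truth, so the matching rule and the function names are hypothetical.

```python
import numpy as np


def dice(a, b):
    """Dice coefficient between two binary masks (1.0 if both are empty)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom > 0 else 1.0


def mask_quality(auto_masks, gt_masks, threshold=0.4):
    """Return (mean best-match Dice, fraction of masks with Dice above threshold)."""
    best = np.array([max(dice(m, g) for g in gt_masks) for m in auto_masks])
    return float(best.mean()), float((best > threshold).mean())
```

With this rule, a low mean Dice means most automatic masks do not coincide with any annotated structure, which is expected when the masks cover sub-anatomical regions and structures absent from the label set.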

Comparison of mask generation methods:

Mask Source	BCV 1-shot	SS H&N 1-shot
TotalSegmentator	80.7	13.5 (category not covered)
SAM2	65.5	59.3
SLIC superpixels	54.3	43.8

Data diversity > quantity: Expanding the pretraining data from abdominal CT alone (BCV, 42.7%) to whole-body CT and multi-modal data raises performance to 73.9%. Anatomical and modality diversity drives the gains, whereas adding more in-domain data saturates quickly.

Architecture generalization: ResUNet and I3DResNet152 achieve comparable performance under identical settings (segmentation: 73.87 vs. 72.56; classification: 75.42 vs. 75.98), suggesting that the method does not depend on a specific encoder design.

Key Findings

Anatomy vs. pathology: MASS-IC demonstrates strong few-shot capability on anatomical structures (organs), but in-context performance is limited for high-variability tumors (KiTS: 2.7%); fine-tuning (MASS-FT) substantially outperforms baselines (64.3% vs. 42.2%).
20–40% annotations match full supervision: On anatomical structure datasets, MASS-FT with only 10-shot data (20–40% of the training data) matches fully supervised performance (BCV: 83.7 vs. 83.6; AMOS MR: 84.7 vs. 85.5).
Frozen encoder surpasses full training in low-data regimes: With 5% of RSNA ICH data, MASS with a frozen encoder (75.4%) outperforms training from scratch (72.8%); gains are larger on the Trauma 30% data (liver: 86.7 vs. 74.4; kidney: 82.9 vs. 75.0). With 100% of RSNA ICH data, training from scratch remains ahead (89.5 vs. 81.5).
OOD generalization: On entirely unseen datasets (BraTS, ACDC, Pelvic), MASS demonstrates competitive or superior performance compared to supervised pretraining (Pelvic: 92.8 vs. Iris 86.9).

Highlights & Insights

Paradigm innovation: This work is the first to establish "category-agnostic mask-guided in-context segmentation" as a pretext task for self-supervised pretraining of 3D medical images, effectively bypassing the annotation bottleneck.
Weak to strong: Although automated masks overlap with ground truth by only 7–15% on average, training on thousands of "approximately correct" segmentation tasks enables the model to learn semantic concepts that transcend the boundaries of individual masks.
High data efficiency: Using only 5K volumes (far fewer than OpenMind's 114K), MASS outperforms all self-supervised baselines; single-dataset pretraining on BCV (23 scans) already approaches SuPreM on ICH classification.
Semantics emerge from invariance: Without requiring semantic labels, appearance and spatial variations induced by augmentation force the model to learn the only invariant factor — the essential semantic identity of anatomical structures.
Open-set advantage: Unconstrained by predefined categories, SAM2 masks naturally cover multiple granularities and structures; in taxonomy-mismatched scenarios (e.g., SS H&N), MASS substantially outperforms TotalSegmentator-based approaches.

Limitations & Future Work

Weak in-context capability for pathological structures: Training-free in-context segmentation of high-variability tumors (e.g., KiTS) is ineffective (2.7%); fine-tuning is necessary for adequate performance.
Unexplored synergy between weak masks and expert annotations: The paper deliberately excludes annotated data and does not investigate the potential of combining automatic masks with a small number of expert annotations.
Dependence on SAM2's boundary detection capability: Mask quality is bounded by SAM2's domain transfer performance on medical images, and results may be suboptimal for structures with ambiguous boundaries.
No vision-language alignment: The method is not aligned with textual modalities such as radiology reports, limiting its applicability to tasks such as report generation.
Gap with supervised pretraining: When evaluation targets align with the supervised annotations (e.g., BCV), supervised methods (Iris: 83.4%) still lead MASS (70.2%) by about 13 points.

Related Work

Self-supervised learning: Model Genesis (image restoration), MAE/SimMIM (masked reconstruction), DINO (self-distillation), and SimCLR/MoCo (contrastive learning) focus on general visual features and lack the spatial precision required for semantic segmentation.
Supervised pretraining: SuPreM (25M annotated voxels) and STU-Net (TotalSegmentator whole-body CT) are constrained by annotation scale and predefined category taxonomies.
Synthetic data: AnatoMix generates synthetic CT from TotalSegmentator masks; distribution shift limits its transfer performance.
Universal segmentation and interactive models: UniverSeg, Iris (in-context learning, but requiring annotations), and SAM medical adaptations still require annotations or manual interaction at inference time.
MASS's distinction: Among the methods listed, MASS is the only self-supervised approach that jointly offers annotation-free pretraining, open-set masks, emergent semantics, and multi-modal compatibility.

Rating

Novelty: ⭐⭐⭐⭐⭐. First work to use category-agnostic automatic masks for self-supervised pretraining of medical images; the pretext task design is elegant and intuitively motivated.
Experimental Thoroughness: ⭐⭐⭐⭐⭐. Covers 3 modalities (CT, MRI, PET), 12+ datasets, segmentation and classification task tracks, scale experiments from 20 scans to 5K volumes, and multi-dimensional ablations.
Writing Quality: ⭐⭐⭐⭐⭐. The motivation, method, and experiment chain is complete; the explanation of "invariance → emergent semantics" is elegant and compelling.
Value: ⭐⭐⭐⭐⭐. Provides a scalable, annotation-free pathway toward 3D medical image foundation models with strong practical utility.