Learning Generalizable 3D Medical Image Representations from Mask-Guided Self-Supervision

Conference: CVPR 2026 · arXiv: 2603.13660 · Code: available (per the paper) · Area: Medical Imaging · Keywords: self-supervised learning, 3D medical imaging, mask-guided pretraining, in-context segmentation, foundation model

TL;DR

This paper proposes MASS (MAsk-guided Self-Supervised learning), which leverages category-agnostic masks automatically generated by SAM2 as pseudo-annotations and adopts in-context segmentation as a pretext task for self-supervised pretraining. Without any manual annotation, MASS learns semantically rich and highly generalizable 3D medical image representations, achieving strong performance on both few-shot segmentation and frozen-encoder classification.

Background & Motivation

Absence of foundation models: Models such as GPT, CLIP, and DINO have learned universal representations from large-scale unlabeled data in natural image and language domains, yet no analogous foundation model paradigm exists for 3D medical imaging.

Limitations of existing self-supervised methods: Contrastive learning methods (SimCLR, MoCo) emphasize global features, while reconstruction-based methods such as MAE focus on low-level texture; neither captures the anatomical semantics and spatial precision required for medical imaging.

Constraints of supervised pretraining: Methods such as SuPreM and STU-Net rely on large volumes of expert annotations and are confined to predefined class taxonomies (e.g., 25 organs + 7 tumor types), making them difficult to scale to the thousands of anatomical variants and pathologies encountered in clinical practice.

Unique challenges of medical imaging: Unlike natural images, nearly every voxel in a medical scan carries clinical significance (bone density → fracture, soft-tissue texture → tumor, vascular pattern → ischemia), and spatial precision is critical.

Annotation cost barrier: Voxel-level annotation of 3D medical images requires specialized expertise and is extremely costly, limiting the scalability of segmentation-based pretraining approaches.

Core insight: Semantic segmentation is the pretext task most aligned with clinical reasoning (clinicians reason by identifying what a structure is and where it is). Although automatically generated category-agnostic masks carry no semantic labels and contain noise, they are sufficient to capture anatomically and pathologically meaningful regions.

Method

Overall Architecture

MASS consists of two stages:

Stage 1: Annotation-Free Mask Generation

  • SAM2 (trained on natural images, with no medical knowledge) is applied to unlabeled 3D volumes for automatic segmentation.
  • Workflow: construct a 3-channel input (different window width/level settings for CT; quantile normalization for MRI/PET); uniformly sample 2D slices along the optimal imaging axis; run SAM2's automatic mask generation (dense point prompts); and propagate masks across the full volume using SAM2's video prediction capability. A preprocessing sketch follows this list.
  • Hundreds to thousands of 3D masks per volume are generated, covering diverse structures including organs, vessels, tumors, and lesions.
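As a rough illustration of the modality-specific preprocessing step above, here is a minimal sketch; the specific window (width, level) pairs and quantile cutoffs are my assumptions, not values taken from the paper:

```python
import numpy as np

def ct_to_three_channels(volume_hu, windows=((400, 40), (80, 40), (2000, 500))):
    """Map a CT volume in Hounsfield units to a 3-channel input, one channel
    per (window width, window level) pair, each rescaled to [0, 1].
    The window settings here are illustrative assumptions."""
    channels = []
    for width, level in windows:
        lo, hi = level - width / 2, level + width / 2
        channels.append((np.clip(volume_hu, lo, hi) - lo) / (hi - lo))
    return np.stack(channels, axis=0)  # shape: (3, D, H, W)

def mr_quantile_normalize(volume, q_lo=0.01, q_hi=0.99):
    """Quantile normalization for MRI/PET: clip intensities to the
    [q_lo, q_hi] quantiles, then rescale to [0, 1]."""
    lo, hi = np.quantile(volume, [q_lo, q_hi])
    return (np.clip(volume, lo, hi) - lo) / max(hi - lo, 1e-8)
```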

Stage 2: Mask-Guided Self-Supervised Learning

  • An in-context segmentation (ICS) framework is adopted, following the Iris architecture.
  • The model comprises three components: an image encoder \(E_\theta\), a task encoding module \(T_\phi\), and a mask decoder \(D_\psi\).
  • At each iteration, an unlabeled 3D image \(x\) and its automatically generated mask \(m\) are sampled; two augmented views are created — a reference view \((x_s, y_s)\) and a query view \((x_q, y_q)\).
  • The reference view provides positional ("where") information, while appearance transformations force the model to learn semantic consistency ("what") across varying visual appearances.

Key Designs

  1. Task embedding mechanism: The reference image is encoded as \(F_s = E_\theta(x_s)\); a task embedding \(\mathcal{T} = T_\phi(F_s, y_s)\) is extracted to capture information about "which anatomical structure to segment," guiding the decoder to predict the query mask \(\hat{y}_q = D_\psi(E_\theta(x_q), \mathcal{T})\) (a forward-pass sketch follows this list).
  2. Implicit semantic learning: Semantic understanding is acquired without semantic labels through invariance mechanisms. Appearance augmentations (brightness, contrast, gamma, Gaussian noise) eliminate shortcuts such as intensity matching and texture patterns; spatial augmentations (rotation, scaling, translation) remove positional and orientation cues. The model is forced to learn what remains invariant across all transformations — the essential semantic identity of anatomical structures.
  3. Open-set mask diversity: Thousands of category-agnostic masks spanning multiple granularities — from organ-level to sub-anatomical regions to pathologies — are used during training, compelling the model to learn broad medical concepts and composable visual primitives (texture patterns, boundary features, spatial configurations, intensity distributions).
  4. Multi-modal compatibility: The mask generation pipeline is applicable to CT, MRI, PET, and other modalities through modality-specific preprocessing strategies (window width/level for CT; quantile normalization for MRI/PET).
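The forward pass in item 1 can be sketched as follows. This is a minimal sketch assuming PyTorch-style modules; the paper defines the three components \(E_\theta\), \(T_\phi\), \(D_\psi\) but not their exact interfaces, so the signatures here are hypothetical:

```python
import torch.nn as nn

class MASSModel(nn.Module):
    """Three-component in-context segmentation model: image encoder E_theta,
    task encoding module T_phi, mask decoder D_psi (interfaces assumed)."""

    def __init__(self, encoder: nn.Module, task_encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder            # E_theta: 3D volume -> feature map
        self.task_encoder = task_encoder  # T_phi: (features, reference mask) -> task embedding
        self.decoder = decoder            # D_psi: (query features, task embedding) -> mask logits

    def forward(self, x_s, y_s, x_q):
        f_s = self.encoder(x_s)             # F_s = E_theta(x_s)
        task = self.task_encoder(f_s, y_s)  # T = T_phi(F_s, y_s): "what to segment"
        f_q = self.encoder(x_q)             # shared encoder weights on the query view
        return self.decoder(f_q, task)      # y_hat_q = D_psi(E_theta(x_q), T)
```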

Loss & Training

  • Loss function: \(\mathcal{L}_{Seg} = \mathcal{L}_{Dice}(\hat{y}_q, y_q) + \mathcal{L}_{BCE}(\hat{y}_q, y_q)\), jointly optimizing Dice loss and binary cross-entropy (a training-step sketch follows this list).
  • Data augmentation: Spatial transformations (rotation, scaling, translation) are applied jointly to images and masks to preserve correspondence; appearance transformations (brightness, contrast, gamma, Gaussian noise) are applied to images only.
  • Default backbone: 3D ResUNet.
  • Pretraining scale: Ranges from small-scale (single dataset, 20–200 scans) to large-scale (5K multi-modal CT/MRI/PET volumes across 12 datasets).
  • Three downstream usage modes: (1) training-free in-context segmentation (no parameter updates); (2) task-specific fine-tuning; (3) frozen-encoder classification.
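Putting the loss and augmentation rules together, here is a minimal training-step sketch; the spatial_aug and appearance_aug helpers are assumptions (the paper specifies the transform types but not an API):

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss on sigmoid probabilities (binary mask, float target)."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)

def training_step(model, x, m, spatial_aug, appearance_aug, optimizer):
    """One MASS iteration: build reference and query views from a single
    (image, auto-mask) pair, predict the query mask from the reference,
    and optimize Dice + BCE. spatial_aug/appearance_aug are hypothetical."""
    # Spatial transforms move image and mask together to keep correspondence;
    # appearance transforms perturb the image only.
    x_s, y_s = spatial_aug(x, m)
    x_q, y_q = spatial_aug(x, m)
    x_s, x_q = appearance_aug(x_s), appearance_aug(x_q)

    logits = model(x_s, y_s, x_q)  # predicted query-mask logits
    loss = dice_loss(logits, y_q) + F.binary_cross_entropy_with_logits(logits, y_q)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```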

Key Experimental Results

Main Results

Table 1: Single-dataset few-shot segmentation (Dice %)

| Method | BCV 1-shot | BCV 10-shot | AMOS MR 1-shot | AMOS MR 10-shot | SS H&N 1-shot | KiTS 30-shot |
| --- | --- | --- | --- | --- | --- | --- |
| Scratch | 27.3 | 75.2 | 32.8 | 75.9 | 51.8 | 35.7 |
| SimCLR | 44.9 | 78.4 | 35.9 | 78.0 | 53.6 | 41.5 |
| MASS-IC | 65.5 | 73.6 | 62.1 | 71.6 | 59.3 | 3.8 |
| MASS-FT | 68.8 | 83.7 | 65.9 | 84.7 | 66.9 | 64.3 |

Fully supervised reference (full training set; one value per dataset): BCV 83.6, AMOS MR 85.5, SS H&N 78.2, KiTS 81.7.

Table 2: Large-scale multi-modal pretraining segmentation (Dice %, pretrained on 5K volumes)

| Method | BCV 1-shot | AMOS MR 1-shot | KiTS 30-shot | Pelvic 1-shot |
| --- | --- | --- | --- | --- |
| SuPreM (supervised) | 63.9 | 55.1 | 64.1 | 85.4 |
| Iris-FT (supervised) | 83.4 | 83.6 | 78.3 | 86.9 |
| AnatoMix | 53.1 | 35.9 | 40.6 | 82.2 |
| Merlin | 50.1 | 37.9 | 51.1 | 79.3 |
| MASS-FT | 70.2 | 74.3 | 68.5 | 92.8 |

Table 3: Classification performance (AUC %, frozen encoder)

| Method | RSNA ICH 5% | RSNA ICH 100% | Liver Trauma 30% | Kidney Trauma 30% |
| --- | --- | --- | --- | --- |
| Scratch (full training) | 72.8 | 89.5 | 74.4 | 75.0 |
| SuPreM | 73.5 | 78.3 | 68.3 | 54.9 |
| Merlin | 57.3 | 65.5 | 60.1 | 58.0 |
| MASS | 75.4 | 81.5 | 86.7 | 82.9 |

Ablation Study

Mask quality analysis: The average Dice between automatically generated masks and ground truth is only 15.2% (BCV) and 7.1% (SS H&N), and only 14%/13% of masks achieve Dice > 40%; nevertheless, MASS reaches 1-shot performance of 65.5% and 59.3% respectively, demonstrating that this weak supervision is sufficient.

Comparison of mask generation methods:

| Mask Source | BCV 1-shot | SS H&N 1-shot |
| --- | --- | --- |
| TotalSegmentator | 80.7 | 13.5 (category not covered) |
| SAM2 | 65.5 | 59.3 |
| SLIC superpixels | 54.3 | 43.8 |

Data diversity > quantity: Expanding from a single abdominal CT dataset (BCV, 42.7%) to whole-body CT and multi-modal data yields 73.9%. Anatomical and modality diversity drives the performance gains, while stacking more in-domain data saturates quickly.

Architecture generalization: ResUNet and I3DResNet152 achieve comparable performance under identical settings (segmentation: 73.87 vs. 72.56; classification: 75.42 vs. 75.98), confirming that the method is independent of the specific encoder design.

Key Findings

  1. Anatomy vs. pathology: MASS-IC demonstrates strong few-shot capability on anatomical structures (organs), but in-context performance is limited for high-variability tumors (KiTS: 2.7%); fine-tuning (MASS-FT) substantially outperforms baselines (64.3% vs. 42.2%).
  2. 25–40% of annotations match full supervision: On anatomical-structure datasets, MASS-FT with only 10-shot data (25–40% of the training set) matches fully supervised performance.
  3. Frozen encoder surpasses full training: With 5% of RSNA ICH data, MASS with a frozen encoder (75.4%) outperforms training from scratch (72.8%); gains are even more pronounced on Trauma 30% data (liver: 86.7 vs. 74.4; kidney: 82.9 vs. 75.0).
  4. OOD generalization: On entirely unseen datasets (BraTS, ACDC, Pelvic), MASS demonstrates competitive or superior performance compared to supervised pretraining (Pelvic: 92.8 vs. Iris 86.9).

Highlights & Insights

  • Paradigm innovation: This work is the first to establish "category-agnostic mask-guided in-context segmentation" as a pretext task for self-supervised pretraining of 3D medical images, effectively bypassing the annotation bottleneck.
  • Weak to strong: Although automated masks overlap with ground truth by only 7–15% on average, training on thousands of "approximately correct" segmentation tasks enables the model to learn semantic concepts that transcend the boundaries of individual masks.
  • High data efficiency: Using only 5K volumes (far fewer than OpenMind's 114K), MASS outperforms all self-supervised baselines; single-dataset pretraining on BCV (23 scans) already approaches SuPreM on ICH classification.
  • Semantics emerge from invariance: Without requiring semantic labels, appearance and spatial variations induced by augmentation force the model to learn the only invariant factor — the essential semantic identity of anatomical structures.
  • Open-set advantage: Unconstrained by predefined categories, SAM2 masks naturally cover multiple granularities and structures; in taxonomy-mismatched scenarios (e.g., SS H&N), MASS substantially outperforms TotalSegmentator-based approaches.

Limitations & Future Work

  1. Weak in-context capability for pathological structures: Training-free in-context segmentation of high-variability tumors (e.g., KiTS) is ineffective (2.7%); fine-tuning is necessary for adequate performance.
  2. Unexplored synergy between weak masks and expert annotations: The paper deliberately excludes annotated data and does not investigate the potential of combining automatic masks with a small number of expert annotations.
  3. Dependence on SAM2's boundary detection capability: Mask quality is bounded by SAM2's domain transfer performance on medical images, and results may be suboptimal for structures with ambiguous boundaries.
  4. No vision-language alignment: The method is not aligned with textual modalities such as radiology reports, limiting its applicability to tasks such as report generation.
  5. Gap with supervised pretraining: When evaluation targets align with supervised annotation (e.g., BCV), supervised methods (Iris: 83.2%) still lead MASS (70.2%) by approximately 10–15 points.

Related Work

  • Self-supervised learning: Model Genesis (image restoration), MAE/SimMIM (masked reconstruction), DINO (self-distillation), SimCLR/MoCo (contrastive learning) — these focus on general visual features and lack the spatial precision required for semantic segmentation.
  • Supervised pretraining: SuPreM (25M annotated voxels), STU-Net (TotalSegmentator whole-body CT) — constrained by annotation scale and predefined category taxonomies.
  • Synthetic data: AnatoMix (synthetic CT generated from TotalSegmentator masks) — distribution shift limits transfer performance.
  • Universal segmentation and interactive models: UniverSeg, Iris (in-context learning but requiring annotations), SAM medical adaptations — still require annotations or manual interaction at inference time.
  • MASS's distinction: The only self-supervised approach that jointly offers annotation-free pretraining, open-set masks, emergent semantics, and multi-modal compatibility.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First work to use category-agnostic automatic masks for self-supervised pretraining of medical images; the pretext task design is elegant and intuitively motivated.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers 4 modalities, 12+ datasets, segmentation and classification task tracks, scale experiments from 20 scans to 5K volumes, and multi-dimensional ablations.
  • Writing Quality: ⭐⭐⭐⭐⭐ — The motivation–method–experiment logical chain is complete; the explanation of "invariance → emergent semantics" is elegant and compelling.
  • Value: ⭐⭐⭐⭐⭐ — Provides a scalable, annotation-free pathway toward 3D medical image foundation models with strong practical utility.