# Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography
Conference: CVPR 2026 | arXiv: 2603.11627 | Code: None | Area: Medical Image Segmentation | Keywords: Foundation Models, PET Imaging, Universal Segmentation, 3D Segmentation, Promptable Segmentation
## TL;DR
This work constructs PETWB-Seg11K, the largest whole-body PET segmentation dataset to date (11,041 3D PET scans and 59,831 segmentation masks), and proposes SegAnyPET, a foundation model enabling prompt-driven universal volumetric segmentation of organs and lesions in PET imaging. The model demonstrates strong performance in zero-shot cross-center and cross-tracer settings.
## Background & Motivation
PET (Positron Emission Tomography) is a critical modality in nuclear medicine that visualizes in vivo metabolic processes via radioactive tracers, and is indispensable in oncology and neurology. However, PET image segmentation faces multiple challenges:
- Inherent difficulties: PET lacks high-contrast anatomical boundaries, suffers from a low signal-to-noise ratio, and has limited spatial resolution, making organ/lesion delineation significantly harder than in CT or MRI.
- Data scarcity: PET data acquisition and annotation are prohibitively costly, and public datasets are scarce and narrow in scope (restricted to specific tumor tasks).
- Failure of existing foundation models: General medical segmentation foundation models such as SAM-Med3D, SegVol, and SAT are primarily trained on CT/MRI; direct transfer to PET yields poor results (SAT achieves DSC near zero).
- Limitations of task-specific models: Conventional deep learning models can only segment fixed categories seen during training, requiring re-annotation and retraining for new organs or lesions.
## Method

### Overall Architecture
SegAnyPET adopts a SAM-like 3D promptable segmentation architecture consisting of three core components (a minimal sketch follows the list):
- Image Encoder: Extracts discrete 3D feature embeddings from the input PET volume.
- Prompt Encoder: Converts user-provided sparse prompts (e.g., points) or dense prompts (e.g., coarse masks) into compact prompt embeddings via fixed positional encodings and adaptive embedding layers.
- Mask Decoder: Fuses image features and prompt embeddings, generating the final segmentation output via upsampling and MLPs.
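No code is released for this paper (Code: None above), so the architecture can only be illustrated. Below is a minimal PyTorch sketch of a SAM-style 3D promptable pipeline; all module names, channel widths, patch sizes, and the single cross-attention fusion step are illustrative assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch of a SAM-style 3D promptable segmenter.
import torch
import torch.nn as nn

class ImageEncoder3D(nn.Module):
    """Patchify a PET volume and embed it (stand-in for a 3D ViT backbone)."""
    def __init__(self, embed_dim=96, patch=16):
        super().__init__()
        self.proj = nn.Conv3d(1, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, vol):                          # vol: (B, 1, D, H, W)
        return self.proj(vol)                        # (B, C, D/p, H/p, W/p)

class PromptEncoder3D(nn.Module):
    """Embed sparse point prompts: a positional projection of the click
    coordinates plus a learned label embedding (positive/negative click)."""
    def __init__(self, embed_dim=96):
        super().__init__()
        self.pos = nn.Linear(3, embed_dim)           # stand-in for a fixed positional encoding
        self.label = nn.Embedding(2, embed_dim)      # 0 = negative, 1 = positive

    def forward(self, pts, labels):                  # pts: (B, N, 3) in [0, 1]
        return self.pos(pts) + self.label(labels)    # (B, N, C)

class MaskDecoder3D(nn.Module):
    """Fuse image tokens with prompt tokens, then upsample to a mask."""
    def __init__(self, embed_dim=96, patch=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.up = nn.ConvTranspose3d(embed_dim, 1, kernel_size=patch, stride=patch)

    def forward(self, feats, prompts):
        B, C, d, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)        # (B, d*h*w, C)
        fused, _ = self.attn(tokens, prompts, prompts)   # cross-attend to prompts
        fused = (tokens + fused).transpose(1, 2).reshape(B, C, d, h, w)
        return self.up(fused)                            # (B, 1, D, H, W) logits

# Toy forward pass.
enc, pe, dec = ImageEncoder3D(), PromptEncoder3D(), MaskDecoder3D()
vol = torch.randn(1, 1, 64, 64, 64)                        # toy PET volume
pts, labels = torch.rand(1, 2, 3), torch.tensor([[1, 0]])  # one positive, one negative click
logits = dec(enc(vol), pe(pts, labels))                    # (1, 1, 64, 64, 64)
```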
### Key Designs
- PETWB-Seg11K Dataset Construction:
  - Integrates two public datasets (AutoPET, UDPET) and three private cohorts, totaling 11,041 whole-body 3D PET scans and 59,831 segmentation masks.
  - Covers real-world variations across multiple centers (global clinical sites), devices, tracers (FDG/PSMA), and disease types.
  - Carefully partitioned into internal validation sets (in-distribution) and external validation sets (different centers, cancer types, and tracers) for rigorous evaluation.
- 3D Volumetric Architecture:
  - Unlike 2D models that segment slice-by-slice and stack the results, SegAnyPET operates directly on 3D volumes, fully exploiting inter-slice contextual information.
  - Point prompts enable efficient 3D interaction; mask prompts support iterative refinement (a sketch of this loop follows the list).
- Dual-Variant Strategy:
  - SegAnyPET: A universal segmentation foundation model trained on the full dataset, providing broad organ and lesion coverage with strong generalizability.
  - SegAnyPET-Lesion: A specialized variant fine-tuned on lesion-centric data, improving sensitivity and boundary precision for small, heterogeneous lesions.
  - Clinicians can select the general model or the lesion-specialized variant based on clinical needs.
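The iterative refinement mentioned above can be sketched as a simple loop. Both `segment` (a stand-in for a promptable model such as SegAnyPET, fed the accumulated clicks plus the previous mask as a dense prompt) and `get_user_click` are hypothetical interfaces, not the published API.

```python
# Hedged sketch of prompt-driven iterative refinement on a 3D volume.
import numpy as np

def refine_interactively(segment, volume, get_user_click, rounds=3):
    """segment(volume, points, labels, prev_mask) -> binary mask (assumed signature)."""
    points, labels, mask = [], [], None
    for _ in range(rounds):
        click = get_user_click(mask)        # e.g. a radiologist clicks on an error
        if click is None:                   # user is satisfied: stop early
            break
        (z, y, x), lab = click
        points.append((z, y, x))
        labels.append(lab)                  # 1 = add region, 0 = remove region
        # The previous mask re-enters as a dense prompt for refinement.
        mask = segment(volume, np.array(points), np.array(labels), mask)
    return mask
```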
### Loss & Training
- Training is conducted on PETWB-Seg11K using prompt-engineering strategies for mask generation (a click-simulation sketch follows this list).
- Supports a human-in-the-loop interactive workflow: radiologists can iteratively refine segmentation results by appending positive/negative point prompts.
- SegAnyPET-Lesion is obtained by fine-tuning SegAnyPET on lesion-centric data.
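The exact prompt-simulation rules are not spelled out here, so the sketch below follows common interactive-segmentation practice and should be read as an assumption: the first positive click is sampled inside the ground-truth mask, and later corrective clicks land in the current error region (positive on missed foreground, negative on spurious foreground).

```python
# Hypothetical click simulation for training/refinement rounds.
import numpy as np

def sample_click(gt, pred=None, rng=None):
    """Return ((z, y, x), label) for the next simulated click, or None if done."""
    rng = rng or np.random.default_rng()
    gt = gt.astype(bool)
    if pred is None:                                 # first round: click inside the target
        region, label = gt, 1
    else:
        pred = pred.astype(bool)
        fn, fp = gt & ~pred, ~gt & pred              # missed vs. spurious foreground
        # Correct whichever error region is currently larger.
        region, label = (fn, 1) if fn.sum() >= fp.sum() else (fp, 0)
    voxels = np.argwhere(region)
    if len(voxels) == 0:                             # prediction already matches GT
        return None
    return tuple(voxels[rng.integers(len(voxels))]), label
```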
## Key Experimental Results

### Main Results: Comparison with Task-Specific Models
SegAnyPET, as a universal model, is compared against several task-specific models trained exclusively on organ annotations:
| Model | Type | Training | Organ Segmentation | New Target Capability |
|---|---|---|---|---|
| nnU-Net | Task-specific | Fully supervised | Strong | ❌ Requires retraining |
| STUNet | Task-specific | Fully supervised | Moderate | ❌ Requires retraining |
| SwinUNETR | Task-specific | Fully supervised | Moderate | ❌ Requires retraining |
| SegResNet | Task-specific | Fully supervised | Moderate | ❌ Requires retraining |
| SegAnyPET | Universal Foundation | Prompt-based | Comparable/Superior | ✅ Zero-shot |
Key finding: Despite being a general-purpose model not specifically trained for organ segmentation, SegAnyPET achieves performance comparable to or exceeding nnU-Net on five seen target organs, without requiring task-specific retraining.
### Comparison with Segmentation Foundation Models
| Model | Prompt Type | PET Organ Segmentation | PET Lesion Segmentation |
|---|---|---|---|
| SAM-Med3D | Point prompt | Poor | Poor |
| SegVol | Point prompt | Poor | Poor |
| SAT | Text prompt | DSC≈0 | DSC≈0 |
| nnInteractive | Point prompt | Poor | Poor |
| VISTA3D | Point prompt | Poor | Poor |
| SegAnyPET | Point/Mask prompt | Best | Best |
Key finding: Text-prompt models (e.g., SAT) completely fail on PET (DSC≈0), indicating that their text-visual alignment is severely overfitted to anatomical features of structural imaging.
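For reference, the DSC figures quoted throughout are the standard Dice similarity coefficient; a minimal implementation over binary 3D masks:

```python
# Dice similarity coefficient (DSC) between two binary masks.
import numpy as np

def dice(pred, gt, eps=1e-8):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)
```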
### Ablation & Generalization Experiments
| Validation Scenario | Distribution Shift | SegAnyPET Performance |
|---|---|---|
| Internal validation | In-distribution | Consistent and reliable |
| External – new cancer type | Unseen disease types | Robust generalization |
| External – PET/MRI | Scanner architecture change | Robust generalization |
| External – PSMA-PET | New tracer | Robust generalization |
Clinical utility validation: In lymphoma and lung cancer scenarios, the SegAnyPET-assisted annotation workflow reduced annotation time by 82.37% and 82.95% for two expert annotators, respectively.
## Key Findings
- Existing general medical segmentation foundation models (e.g., SAM-Med3D) perform poorly on PET, exposing a large domain gap between structural and functional imaging.
- Large-scale PET-specific training is essential — SegAnyPET learns domain-robust metabolic representations.
- A single SegAnyPET model can effectively replace multiple task-specific networks.
- Prompt-driven interaction enables the model to handle targets beyond the training label space, which is critical for whole-body PET analysis.
## Highlights & Insights
- Outstanding data contribution: PETWB-Seg11K is the largest and most comprehensive whole-body PET segmentation dataset to date, far surpassing existing PET datasets in scale.
- First PET foundation model: SegAnyPET is the first promptable segmentation foundation model specifically designed for PET imaging, filling a gap in functional imaging.
- Zero-shot generalization: The model demonstrates robustness under strict distribution shifts including cross-center, cross-tracer (FDG→PSMA), and cross-scanner (PET/CT→PET/MRI) settings.
- Demonstrated clinical value: Beyond segmentation accuracy, the work validates annotation efficiency gains (>82% time savings) and downstream applications in whole-body metabolic network analysis.
- Insightful experimental observations: The complete failure of text-prompt models on PET is attributed to cross-modal alignment being overfitted to structural imaging anatomy.
## Limitations & Future Work
- Interaction efficiency for diffuse lesions: For multifocal lesions distributed throughout the body (e.g., lymphoma), point prompts require per-lesion interaction, limiting efficiency.
- Underrepresentation of rare diseases/tracers: Certain rare diseases and tracers remain underrepresented in the dataset.
- Room for improvement in lesion segmentation: Quantitative metrics indicate significant room for improvement in lesion segmentation accuracy.
- Absence of text prompt support: Multimodal vision-language PET foundation models are an important future direction, enabling simultaneous identification of all diffuse lesions via semantic descriptions.
- Inference efficiency: The computational overhead of 3D volumetric processing is not sufficiently discussed.
## Related Work & Insights
- SAM (Segment Anything) inspired the promptable segmentation paradigm, but fundamental architectural and data adaptations are required to transfer from 2D natural images to 3D PET.
- nnU-Net remains competitive under sufficient task-specific supervision, suggesting that the advantage of foundation models lies in flexibility rather than absolute superiority on any single task.
- AutoPET and UDPET are existing PET segmentation datasets, but their scope is too narrow; PETWB-Seg11K substantially expands both scale and diversity.
- Insight: A fundamental domain gap exists between functional imaging (PET/SPECT) and structural imaging (CT/MRI); general foundation models cannot be naively transferred, and modality-specific large-scale training is necessary.
## Rating
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Theoretical Depth | ⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Practical Value | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Overall | ⭐⭐⭐⭐ |