# Developing Foundation Models for Universal Segmentation from 3D Whole-Body Positron Emission Tomography
Conference: CVPR 2026 | arXiv: 2603.11627 | Code: None | Area: Medical Image Segmentation | Keywords: Foundation Models, PET Imaging, Universal Segmentation, 3D Segmentation, Promptable Segmentation
## TL;DR
This work constructs PETWB-Seg11K, the largest whole-body PET segmentation dataset to date (11,041 3D PET scans and 59,831 segmentation masks), and proposes SegAnyPET, a foundation model enabling prompt-driven universal volumetric segmentation of organs and lesions in PET imaging. The model demonstrates strong performance in zero-shot cross-center and cross-tracer settings.
## Background & Motivation
PET (Positron Emission Tomography) is a critical modality in nuclear medicine that visualizes in vivo metabolic processes via radioactive tracers, and is indispensable in oncology and neurology. However, PET image segmentation faces multiple challenges:
- Inherent difficulties: PET lacks high-contrast anatomical boundaries, suffers from a low signal-to-noise ratio, and has limited spatial resolution, making organ/lesion delineation significantly harder than in CT or MRI.
- Data scarcity: PET data acquisition and annotation are prohibitively costly, and public datasets are scarce and narrow in scope (restricted to specific tumor tasks).
- Failure of existing foundation models: General medical segmentation foundation models such as SAM-Med3D, SegVol, and SAT are primarily trained on CT/MRI; direct transfer to PET yields poor results (SAT achieves DSC near zero).
- Limitations of task-specific models: Conventional deep learning models can only segment fixed categories seen during training, requiring re-annotation and retraining for new organs or lesions.
## Method

### Overall Architecture
SegAnyPET adopts a SAM-like 3D promptable segmentation architecture consisting of three core components (a minimal sketch follows the list):
- Image Encoder: Extracts discrete 3D feature embeddings from the input PET volume.
- Prompt Encoder: Converts user-provided sparse prompts (e.g., points) or dense prompts (e.g., coarse masks) into compact prompt embeddings via fixed positional encodings and adaptive embedding layers.
- Mask Decoder: Fuses image features and prompt embeddings, generating the final segmentation output via upsampling and MLPs.
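No code is released for this paper (Code: None above), so the architecture can only be illustrated. Below is a minimal PyTorch sketch of a SAM-style 3D promptable pipeline; all module names, channel widths, patch sizes, and the single cross-attention fusion step are illustrative assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch of a SAM-style 3D promptable segmenter.
import torch
import torch.nn as nn

class ImageEncoder3D(nn.Module):
    """Patchify a PET volume and embed it (stand-in for a 3D ViT backbone)."""
    def __init__(self, embed_dim=96, patch=16):
        super().__init__()
        self.proj = nn.Conv3d(1, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, vol):                          # vol: (B, 1, D, H, W)
        return self.proj(vol)                        # (B, C, D/p, H/p, W/p)

class PromptEncoder3D(nn.Module):
    """Embed sparse point prompts: a positional projection of the click
    coordinates plus a learned label embedding (positive/negative click)."""
    def __init__(self, embed_dim=96):
        super().__init__()
        self.pos = nn.Linear(3, embed_dim)           # stand-in for a fixed positional encoding
        self.label = nn.Embedding(2, embed_dim)      # 0 = negative, 1 = positive

    def forward(self, pts, labels):                  # pts: (B, N, 3) in [0, 1]
        return self.pos(pts) + self.label(labels)    # (B, N, C)

class MaskDecoder3D(nn.Module):
    """Fuse image tokens with prompt tokens, then upsample to a mask."""
    def __init__(self, embed_dim=96, patch=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.up = nn.ConvTranspose3d(embed_dim, 1, kernel_size=patch, stride=patch)

    def forward(self, feats, prompts):
        B, C, d, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)        # (B, d*h*w, C)
        fused, _ = self.attn(tokens, prompts, prompts)   # cross-attend to prompts
        fused = (tokens + fused).transpose(1, 2).reshape(B, C, d, h, w)
        return self.up(fused)                            # (B, 1, D, H, W) logits

# Toy forward pass.
enc, pe, dec = ImageEncoder3D(), PromptEncoder3D(), MaskDecoder3D()
vol = torch.randn(1, 1, 64, 64, 64)                        # toy PET volume
pts, labels = torch.rand(1, 2, 3), torch.tensor([[1, 0]])  # one positive, one negative click
logits = dec(enc(vol), pe(pts, labels))                    # (1, 1, 64, 64, 64)
```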
### Key Designs
- PETWB-Seg11K Dataset Construction:
  - Integrates two public datasets (AutoPET, UDPET) and three private cohorts, totaling 11,041 whole-body 3D PET scans and 59,831 segmentation masks.
  - Covers real-world variations across multiple centers (global clinical sites), devices, tracers (FDG/PSMA), and disease types.
  - Carefully partitioned into internal validation sets (in-distribution) and external validation sets (different centers, cancer types, and tracers) for rigorous evaluation.
- 3D Volumetric Architecture:
  - Unlike 2D models that segment slice-by-slice and stack the results, SegAnyPET operates directly on 3D volumes, fully exploiting inter-slice contextual information.
  - Point prompts enable efficient 3D interaction; mask prompts support iterative refinement (a sketch of this loop follows the list).
- Dual-Variant Strategy:
  - SegAnyPET: A universal segmentation foundation model trained on the full dataset, providing broad organ and lesion coverage with strong generalizability.
  - SegAnyPET-Lesion: A specialized variant fine-tuned on lesion-centric data, improving sensitivity and boundary precision for small, heterogeneous lesions.
  - Clinicians can select the general model or the lesion-specialized variant based on clinical needs.
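The iterative refinement mentioned above can be sketched as a simple loop. Both `segment` (a stand-in for a promptable model such as SegAnyPET, fed the accumulated clicks plus the previous mask as a dense prompt) and `get_user_click` are hypothetical interfaces, not the published API.

```python
# Hedged sketch of prompt-driven iterative refinement on a 3D volume.
import numpy as np

def refine_interactively(segment, volume, get_user_click, rounds=3):
    """segment(volume, points, labels, prev_mask) -> binary mask (assumed signature)."""
    points, labels, mask = [], [], None
    for _ in range(rounds):
        click = get_user_click(mask)        # e.g. a radiologist clicks on an error
        if click is None:                   # user is satisfied: stop early
            break
        (z, y, x), lab = click
        points.append((z, y, x))
        labels.append(lab)                  # 1 = add region, 0 = remove region
        # The previous mask re-enters as a dense prompt for refinement.
        mask = segment(volume, np.array(points), np.array(labels), mask)
    return mask
```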
### Loss & Training
- Training is conducted on PETWB-Seg11K using prompt-engineering strategies for mask generation (a click-simulation sketch follows this list).
- Supports a human-in-the-loop interactive workflow: radiologists can iteratively refine segmentation results by appending positive/negative point prompts.
- SegAnyPET-Lesion is obtained by fine-tuning SegAnyPET on lesion-centric data.
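The exact prompt-simulation rules are not spelled out here, so the sketch below follows common interactive-segmentation practice and should be read as an assumption: the first positive click is sampled inside the ground-truth mask, and later corrective clicks land in the current error region (positive on missed foreground, negative on spurious foreground).

```python
# Hypothetical click simulation for training/refinement rounds.
import numpy as np

def sample_click(gt, pred=None, rng=None):
    """Return ((z, y, x), label) for the next simulated click, or None if done."""
    rng = rng or np.random.default_rng()
    gt = gt.astype(bool)
    if pred is None:                                 # first round: click inside the target
        region, label = gt, 1
    else:
        pred = pred.astype(bool)
        fn, fp = gt & ~pred, ~gt & pred              # missed vs. spurious foreground
        # Correct whichever error region is currently larger.
        region, label = (fn, 1) if fn.sum() >= fp.sum() else (fp, 0)
    voxels = np.argwhere(region)
    if len(voxels) == 0:                             # prediction already matches GT
        return None
    return tuple(voxels[rng.integers(len(voxels))]), label
```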
## Key Experimental Results

### Main Results: Comparison with Task-Specific Models
SegAnyPET, as a universal model, is compared against several task-specific models trained exclusively on organ annotations:
| Model | Type | Training | Organ Segmentation | New Target Capability |
|---|---|---|---|---|
| nnU-Net | Task-specific | Fully supervised | Strong | ❌ Requires retraining |
| STUNet | Task-specific | Fully supervised | Moderate | ❌ Requires retraining |
| SwinUNETR | Task-specific | Fully supervised | Moderate | ❌ Requires retraining |
| SegResNet | Task-specific | Fully supervised | Moderate | ❌ Requires retraining |
| SegAnyPET | Universal Foundation | Prompt-based | Comparable/Superior | ✅ Zero-shot |
Key finding: Despite being a general-purpose model not specifically trained for organ segmentation, SegAnyPET achieves performance comparable to or exceeding nnU-Net on five seen target organs, without requiring task-specific retraining.
### Comparison with Segmentation Foundation Models
| Model | Prompt Type | PET Organ Segmentation | PET Lesion Segmentation |
|---|---|---|---|
| SAM-Med3D | Point prompt | Poor | Poor |
| SegVol | Point prompt | Poor | Poor |
| SAT | Text prompt | DSC≈0 | DSC≈0 |
| nnInteractive | Point prompt | Poor | Poor |
| VISTA3D | Point prompt | Poor | Poor |
| SegAnyPET | Point/Mask prompt | Best | Best |
Key finding: Text-prompt models (e.g., SAT) completely fail on PET (DSC≈0), indicating that their text-visual alignment is severely overfitted to anatomical features of structural imaging.
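For reference, the DSC figures quoted throughout are the standard Dice similarity coefficient; a minimal implementation over binary 3D masks:

```python
# Dice similarity coefficient (DSC) between two binary masks.
import numpy as np

def dice(pred, gt, eps=1e-8):
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)
```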
### Ablation & Generalization Experiments
| Validation Scenario | Distribution Shift | SegAnyPET Performance |
|---|---|---|
| Internal validation | In-distribution | Consistent and reliable |
| External – new cancer type | Unseen disease types | Robust generalization |
| External – PET/MRI | Scanner architecture change | Robust generalization |
| External – PSMA-PET | New tracer | Robust generalization |
Clinical utility validation: In lymphoma and lung cancer scenarios, the SegAnyPET-assisted annotation workflow reduced annotation time by 82.37% and 82.95% for two expert annotators, respectively.
## Key Findings
- Existing general medical segmentation foundation models (e.g., SAM-Med3D) perform poorly on PET, exposing a large domain gap between structural and functional imaging.
- Large-scale PET-specific training is essential — SegAnyPET learns domain-robust metabolic representations.
- A single SegAnyPET model can effectively replace multiple task-specific networks.
- Prompt-driven interaction enables the model to handle targets beyond the training label space, which is critical for whole-body PET analysis.
## Highlights & Insights
- Outstanding data contribution: PETWB-Seg11K is the largest and most comprehensive whole-body PET segmentation dataset to date, far surpassing existing PET datasets in scale.
- First PET foundation model: SegAnyPET is the first promptable segmentation foundation model specifically designed for PET imaging, filling a gap in functional imaging.
- Zero-shot generalization: The model demonstrates robustness under strict distribution shifts including cross-center, cross-tracer (FDG→PSMA), and cross-scanner (PET/CT→PET/MRI) settings.
- Demonstrated clinical value: Beyond segmentation accuracy, the work validates annotation efficiency gains (>82% time savings) and downstream applications in whole-body metabolic network analysis.
- Insightful experimental observations: The complete failure of text-prompt models on PET is attributed to cross-modal alignment being overfitted to structural imaging anatomy.
## Limitations & Future Work
- Interaction efficiency for diffuse lesions: For multifocal lesions distributed throughout the body (e.g., lymphoma), point prompts require per-lesion interaction, limiting efficiency.
- Underrepresentation of rare diseases/tracers: Certain rare diseases and tracers remain underrepresented in the dataset.
- Room for improvement in lesion segmentation: Quantitative metrics indicate significant room for improvement in lesion segmentation accuracy.
- Absence of text prompt support: Multimodal vision-language PET foundation models are an important future direction, enabling simultaneous identification of all diffuse lesions via semantic descriptions.
- Inference efficiency: The computational overhead of 3D volumetric processing is not sufficiently discussed.
## Related Work & Insights
- SAM (Segment Anything) inspired the promptable segmentation paradigm, but fundamental architectural and data adaptations are required to transfer from 2D natural images to 3D PET.
- nnU-Net remains competitive under sufficient task-specific supervision, suggesting that the advantage of foundation models lies in flexibility rather than absolute superiority on any single task.
- AutoPET and UDPET are existing PET segmentation datasets, but their scope is too narrow; PETWB-Seg11K substantially expands both scale and diversity.
- Insight: A fundamental domain gap exists between functional imaging (PET/SPECT) and structural imaging (CT/MRI); general foundation models cannot be naively transferred, and modality-specific large-scale training is necessary.
## Rating
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Theoretical Depth | ⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Practical Value | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Overall | ⭐⭐⭐⭐ |