# S2AM3D: Scale-controllable Part Segmentation of 3D Point Clouds

**Conference:** CVPR 2026 · **arXiv:** 2512.00995 · **Code:** Project Page · **Area:** 3D Vision · **Keywords:** Point cloud part segmentation, multi-granularity control, contrastive learning, 2D-3D joint supervision, SAM
## TL;DR
This paper presents S2AM3D, a point cloud part segmentation framework that integrates 2D pretrained priors with 3D contrastive supervision. A point-consistent encoder produces globally coherent per-point features, while a scale-aware prompt decoder enables continuously controllable segmentation granularity. The method substantially outperforms existing approaches across multiple benchmarks.
## Background & Motivation
Part-level segmentation of point clouds is a critical task bridging fine-grained geometric detail and high-level semantic understanding, with important applications in 3D content creation, robotic manipulation, and reverse engineering. Existing methods face three key challenges:
- Poor generalization of native 3D methods: High-quality 3D part annotations are costly, and existing datasets are limited in scale and category diversity (e.g., ShapeNet-Part, PartNet), severely constraining generalization to open-domain shapes.
- Cross-view inconsistency of 2D prior-based methods: Methods that apply 2D models such as SAM to rendered views and fuse results suffer from cross-view inconsistencies under occlusion, elongated structures, and complex topology, accumulating errors that degrade global 3D coherence.
- Inflexible granularity control: Feature-clustering methods such as PartField rely on post-processing clustering, yielding discontinuous and unintuitive control; prompt-based methods such as Point-SAM lack explicit granularity control mechanisms.
## Method

### Overall Architecture
S2AM3D adopts a decoupled training strategy with two stages:
- Stage 1: Train a Point-Consistent Part Encoder that fuses 2D segmentation priors with 3D contrastive supervision to produce globally consistent per-point features.
- Stage 2: Freeze the encoder and train a Scale-Aware Prompt Decoder conditioned on a point prompt index \(p\) and an optional scale prompt \(s \in [0,1]\) for flexible part segmentation.
Given input point cloud \(\mathbf{P} \in \mathbb{R}^{N \times 3}\), the encoder outputs per-point features \(\mathbf{F} \in \mathbb{R}^{N \times D}\), and the decoder produces a probability mask \(\hat{\mathbf{m}} \in [0,1]^N\).
### Point-Consistent Part Encoder
The core idea is to layer 3D contrastive supervision on top of 2D distillation, resolving cross-view inconsistencies that arise from multi-view 2D distillation alone.
Base architecture: A PVCNN (Point-Voxel CNN) backbone extracts point features, which are converted into a tri-plane representation \(\mathbf{T} \in \mathbb{R}^{3 \times D \times H \times W}\) (three orthogonal planes: \(xy\), \(yz\), \(zx\)), followed by Transformer blocks for feature aggregation. Tri-plane features are rendered from random viewpoints into 2D latents and supervised via SAM distillation.
Tri-plane feature extraction: Given 3D coordinates \((x,y,z)\), each point is back-projected onto the three feature planes and summed:
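\[
\mathbf{f}(x, y, z) = \mathbf{T}_{xy}\big[(x, y)\big] + \mathbf{T}_{yz}\big[(y, z)\big] + \mathbf{T}_{zx}\big[(z, x)\big],
\]

where \(\mathbf{T}_{xy}[\cdot]\), \(\mathbf{T}_{yz}[\cdot]\), \(\mathbf{T}_{zx}[\cdot]\) denote bilinear sampling from the corresponding feature planes; this is the standard tri-plane lookup implied by the description, written here in our notation.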
3D contrastive supervision: The key innovation is the introduction of contrastive learning using native 3D labeled data. Contrastive pairs are constructed intra-instance, with each mini-batch containing a single object to avoid cross-instance semantic mismatch. For anchor point \(i\) with label \(y_i\), the positive set is:
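\[
\mathcal{P}(i) = \{\, j \mid y_j = y_i,\ j \neq i \,\},
\]

i.e., all other points in the instance that belong to the same part.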
Using cosine similarity with temperature \(\tau\), \(s_{ij} = \mathbf{f}_i^\top \mathbf{f}_j / \tau\), the contrastive loss is:
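\[
\mathcal{L}_{\mathrm{con}} = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{|\mathcal{P}(i)|} \sum_{j \in \mathcal{P}(i)} \log \frac{\exp(s_{ij})}{\sum_{k \neq i} \exp(s_{ik})},
\]

a supervised-contrastive (InfoNCE-style) form consistent with the description above, assuming the features \(\mathbf{f}_i\) are L2-normalized so that \(s_{ij}\) is a temperature-scaled cosine similarity.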
This objective compactly clusters features from the same part and separates features from different parts, yielding globally consistent embeddings and sharp boundaries.
### Scale-Aware Prompt Decoder
The decoder receives the encoder's output \(\mathbf{F}\) and 3D coordinates \(\mathbf{P}\), augmented with 3D sinusoidal positional encodings to form the base representation:
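\[
\mathbf{X}^{(0)} = \mathbf{F} + \mathrm{PE}(\mathbf{P}),
\]

where \(\mathrm{PE}(\cdot)\) is the 3D sinusoidal positional encoding; additive fusion is one natural reading of the description (concatenation followed by a linear projection would fill the same role).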
#### Scale Modulator
The scale \(s\) is defined as the relative size of a part (the fraction of total points belonging to that part). For continuous scale \(s \in [0,1]\), learnable sinusoidal embeddings are constructed:
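\[
\mathbf{e}(s) = \big[\sin(\omega_1 s + \phi_1),\ \cos(\omega_1 s + \phi_1),\ \ldots,\ \sin(\omega_M s + \phi_M),\ \cos(\omega_M s + \phi_M)\big],
\]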
where \(\{\omega_k, \phi_k\}\) are learnable parameters and \(M\) is the number of frequency pairs. Global features are then modulated channel-wise via FiLM (Feature-wise Linear Modulation):
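\[
\mathrm{FiLM}\big(\mathbf{X}; \mathbf{e}(s)\big) = \big(\mathbf{1} + \alpha\,\boldsymbol{\gamma}\big) \odot \mathbf{X} + \alpha\,\boldsymbol{\beta}, \qquad [\boldsymbol{\gamma};\ \boldsymbol{\beta}] = \mathrm{MLP}\big(\mathbf{e}(s)\big),
\]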
where \(\alpha\) is a learnable scalar gate and \(\boldsymbol{\gamma}, \boldsymbol{\beta}\) are channel-wise scale and shift vectors produced from \(\mathbf{e}(s)\); the gated-residual form above is a sketch, chosen to match the identity-at-zero behavior described under Scale Dropout. FiLM layers and Transformer blocks are interleaved for \(L_m\) layers:
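\[
\mathbf{X}^{(\ell + 1)} = \mathrm{Transformer}\Big(\mathrm{FiLM}\big(\mathbf{X}^{(\ell)}; \mathbf{e}(s)\big)\Big), \qquad \ell = 0, 1, \ldots, L_m - 1.
\]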
This yields the scale-conditioned representation \(\tilde{\mathbf{F}} = \mathbf{X}^{(L_m)}\).
Scale Dropout: During training, \(\mathbf{e}(s)\) is randomly zeroed with probability 0.1, reducing FiLM to an identity mapping and ensuring the model remains functional at inference time without a scale input.
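A minimal PyTorch sketch of the modulator including scale dropout (the layer sizes, the bias-free projection, and the zero-initialized gate are our assumptions, chosen so that zeroing \(\mathbf{e}(s)\) reduces FiLM to the identity as described):

```python
import torch
import torch.nn as nn

class ScaleModulator(nn.Module):
    """FiLM-style scale conditioning with scale dropout (sketch)."""
    def __init__(self, dim, n_freq=16, p_drop=0.1):
        super().__init__()
        self.omega = nn.Parameter(torch.randn(n_freq))  # learnable frequencies
        self.phi = nn.Parameter(torch.zeros(n_freq))    # learnable phases
        # Bias-free so that e(s) = 0 maps to gamma = beta = 0 (identity FiLM)
        self.to_film = nn.Linear(2 * n_freq, 2 * dim, bias=False)
        self.alpha = nn.Parameter(torch.zeros(1))       # scalar gate, starts at 0
        self.p_drop = p_drop

    def forward(self, x, s):
        # x: (N, dim) point features; s: scalar scale in [0, 1]
        e = torch.cat([torch.sin(self.omega * s + self.phi),
                       torch.cos(self.omega * s + self.phi)])  # e(s)
        if self.training and torch.rand(()) < self.p_drop:
            e = torch.zeros_like(e)                     # scale dropout -> identity
        gamma, beta = self.to_film(e).chunk(2)
        # Gated-residual FiLM, applied channel-wise
        return (1 + self.alpha * gamma) * x + self.alpha * beta
```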
#### Bi-directional Cross-Attention
Unidirectional cross-attention struggles to simultaneously perform context aggregation and fine-grained refinement in a single forward pass. Bi-directional cross-attention enables mutual interaction between the prompt point feature \(\tilde{\mathbf{F}}_p \in \mathbb{R}^{1 \times D}\) and global features \(\tilde{\mathbf{F}} \in \mathbb{R}^{N \times D}\):
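\[
\tilde{\mathbf{F}}_p \leftarrow \tilde{\mathbf{F}}_p + \mathrm{Attn}\big(\tilde{\mathbf{F}}_p,\ \tilde{\mathbf{F}},\ \tilde{\mathbf{F}}\big), \qquad
\tilde{\mathbf{F}} \leftarrow \tilde{\mathbf{F}} + \mathrm{Attn}\big(\tilde{\mathbf{F}},\ \tilde{\mathbf{F}}_p,\ \tilde{\mathbf{F}}_p\big),
\]

where \(\mathrm{Attn}(Q, K, V)\) is standard attention; this is a residual sketch of the two attention directions, with normalization and feed-forward sublayers omitted.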
After stacking \(L_d\) layers, a per-point probability mask is output via MLP and Sigmoid:
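\[
\hat{\mathbf{m}} = \sigma\big(\mathrm{MLP}(\tilde{\mathbf{F}}^{(L_d)})\big) \in [0, 1]^{N},
\]

where \(\tilde{\mathbf{F}}^{(L_d)}\) denotes the refined global features after the \(L_d\) bi-directional layers.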
### Loss & Training
The segmentation loss uses a dynamically weighted BCE + Dice hybrid objective:
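\[
\mathcal{L}_{\mathrm{seg}} = -\frac{1}{N} \sum_{i=1}^{N} \Big( \beta\, m_i \log \hat{m}_i + (1 - m_i) \log (1 - \hat{m}_i) \Big) + \lambda \left( 1 - \frac{2 \sum_i \hat{m}_i m_i}{\sum_i \hat{m}_i + \sum_i m_i + \varepsilon} \right),
\]

where \(\mathbf{m}\) is the ground-truth binary mask and \(\lambda\) balances the two terms; this is a standard instantiation of the stated BCE + Dice combination, and the exact weighting may differ from the paper's.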
Dynamic BCE adaptively computes the weight \(\beta = (1-\pi)/(\pi + \varepsilon)\) based on each sample's positive ratio \(\pi\), alleviating class imbalance. The Dice term directly optimizes set-level overlap, yielding greater robustness for small parts and long-tail distributions.
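A runnable PyTorch sketch of this objective (the Dice weight, `eps`, and the mean reduction are assumptions):

```python
import torch

def seg_loss(pred, target, eps=1e-6, dice_weight=1.0):
    """Dynamically weighted BCE + Dice (sketch of the objective above).

    pred:   (N,) predicted mask probabilities in [0, 1]
    target: (N,) binary ground-truth mask
    """
    # Per-sample positive ratio pi and dynamic weight beta = (1 - pi) / (pi + eps)
    pi = target.mean()
    beta = (1.0 - pi) / (pi + eps)

    # Weighted BCE: up-weight the (typically rare) positive class by beta
    bce = -(beta * target * torch.log(pred + eps)
            + (1.0 - target) * torch.log(1.0 - pred + eps)).mean()

    # Soft Dice: directly optimizes set-level overlap
    dice = 1.0 - 2.0 * (pred * target).sum() / (pred.sum() + target.sum() + eps)

    return bce + dice_weight * dice
```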
## Dataset Construction
A large-scale dataset containing 100K+ point cloud instances and approximately 1.2M part annotations is constructed from Objaverse, covering 400+ categories. The automated pipeline consists of three steps:
- Part annotation: Parts are sampled and assigned labels based on surface area ratios.
- Quality filtering: A binary PointNet classifier is trained to automatically filter samples with unreliable annotations.
- Connectivity refinement: Spatially disconnected regions sharing the same label are split into independent labels using DBSCAN clustering.
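For instance, the refinement step might look like the following sketch (assuming normalized point coordinates; `eps` and `min_samples` are illustrative values, not the paper's settings):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def split_disconnected_parts(points, labels, eps=0.02, min_samples=10):
    """Split spatially disconnected regions that share a part label
    into independent labels via DBSCAN (pipeline step 3, sketched).

    points: (N, 3) point coordinates; labels: (N,) integer part labels
    """
    refined = labels.copy()
    next_label = labels.max() + 1
    for part in np.unique(labels):
        idx = np.where(labels == part)[0]
        # Cluster this part's points by spatial proximity
        comp = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points[idx])
        # Component 0 keeps the original label; each further connected
        # component gets a fresh label (-1 marks noise, left unchanged)
        for c in np.unique(comp):
            if c <= 0:
                continue
            refined[idx[comp == c]] = next_label
            next_label += 1
    return refined
```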
## Key Experimental Results

### Main Results
Interactive segmentation (point prompt → single part mask):
| Method | PartObjaverse-Tiny (IoU%) | PartNet-E (IoU%) | Mean |
|---|---|---|---|
| Point-SAM | 31.46 | 50.23 | 40.85 |
| P3-SAM | 35.05 | 39.98 | 37.52 |
| S2AM3D | 46.47 | 62.52 | 54.50 |
| S2AM3D (+scale) | 61.19 | 77.51 | 69.35 |
Full segmentation (predicting part labels for all points):
| Method | PartObjaverse-Tiny (IoU%) | PartNet-E (IoU%) | Mean |
|---|---|---|---|
| Find3D | 20.76 | 21.69 | 21.23 |
| SAMPart3D | 48.79 | 56.17 | 52.48 |
| SAMesh | - | 26.66 | - |
| PartField | 51.54 | 59.10 | 55.32 |
| P3-SAM | 58.10 | 65.39 | 61.75 |
| S2AM3D | 63.29 | 77.98 | 70.64 |
### Ablation Study

| Setting | PartObjaverse-Tiny (IoU%) | PartNet-E (IoU%) | Mean |
|---|---|---|---|
| +scale full model | 61.19 | 77.51 | 69.35 |
| +scale w/o 3D supervision | 53.94 | 64.11 | 59.03 |
| +scale w/o custom dataset | 53.12 | 66.12 | 59.62 |
| No scale full model | 46.47 | 62.52 | 54.50 |
| No scale w/o 3D supervision | 41.14 | 55.39 | 48.27 |
| No scale w/o custom dataset | 42.12 | 58.56 | 50.34 |
| No scale w/o scale embedding | 42.31 | 58.28 | 50.30 |
### Key Findings
- 3D contrastive supervision is the largest performance contributor: Removing it causes a 10.32-point drop in mean IoU under the +scale setting; feature visualizations reveal blurred boundaries and internal inconsistencies.
- Scale prompts yield substantial gains: Adding scale conditioning improves mean IoU from 54.50% to 69.35% (+14.85 points) in the interactive segmentation setting.
- Critical role of the custom dataset: Replacing it with PartNet training data leads to a notable performance drop, demonstrating the complementary distributional value of the large-scale, high-quality dataset.
- Scale embedding enhances decoding robustness: Even without a scale input at inference time, models trained with scale embeddings outperform those without (54.50 vs. 50.30).
- Only XYZ coordinates required: Unlike Point-SAM, which requires color, and P3-SAM, which requires normals, S2AM3D achieves state-of-the-art results using coordinates alone.
## Highlights & Insights
- The 2D-3D joint training paradigm is elegantly designed: 2D priors provide generalization capability while 3D contrastive supervision enforces global consistency, with strong complementarity between the two.
- The scale-aware decoder achieves continuous granularity control via FiLM combined with bi-directional cross-attention, supporting smooth coarse-to-fine transitions.
- The automated data pipeline (annotation → filtering → refinement) is scalable, producing one of the largest 3D part segmentation datasets to date.
- A single unified framework handles both interactive segmentation and full segmentation tasks.
## Limitations & Future Work
- Only point prompts and scale signals are supported for interaction; future work could incorporate text instructions for more intuitive semantic control.
- The encoder relies on PartField pretrained weights for initialization, limiting the novelty of the encoder architecture itself.
- Dataset annotation depends on an automated pipeline, and the generalizability of the filtering strategy (PointNet classifier) has not been thoroughly validated.
- Inference speed and memory consumption are not discussed; a \(448 \times 512 \times 512\) tri-plane representation for 10,000 sampled points is relatively heavy.
- Evaluation datasets are small in scale (PartObjaverse-Tiny contains only 200 samples); larger-scale evaluation would strengthen the conclusions.
## Rating
- Novelty: ⭐⭐⭐⭐ — The combination of 2D-3D joint supervision and continuous scale control is creative, though the base encoder architecture follows PartField.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Two task settings, ablations, and visualizations are provided, though the evaluation datasets are limited in scale.
- Writing Quality: ⭐⭐⭐⭐ — Logical structure is clear, mathematical derivations are rigorous, and figures are of high quality.
- Value: ⭐⭐⭐⭐ — Provides a practical granularity-controllable solution for 3D part segmentation; the dataset contribution adds additional value.