S2AM3D: Scale-controllable Part Segmentation of 3D Point Clouds

Conference: CVPR 2026 | arXiv: 2512.00995 | Code: Project Page | Area: 3D Vision | Keywords: point cloud part segmentation, multi-granularity control, contrastive learning, 2D-3D joint supervision, SAM

TL;DR

This paper presents S2AM3D, a point cloud part segmentation framework that integrates 2D pretrained priors with 3D contrastive supervision. A point-consistent encoder produces globally coherent per-point features, while a scale-aware prompt decoder enables continuously controllable segmentation granularity. The method substantially outperforms existing approaches across multiple benchmarks.

Background & Motivation

Part-level segmentation of point clouds is a critical task bridging fine-grained geometric detail and high-level semantic understanding, with important applications in 3D content creation, robotic manipulation, and reverse engineering. Existing methods face three key challenges:

  1. Poor generalization of native 3D methods: High-quality 3D part annotations are costly, and existing datasets are limited in scale and category diversity (e.g., ShapeNet-Part, PartNet), severely constraining generalization to open-domain shapes.
  2. Cross-view inconsistency of 2D prior-based methods: Methods that apply 2D models such as SAM to rendered views and fuse results suffer from cross-view inconsistencies under occlusion, elongated structures, and complex topology, accumulating errors that degrade global 3D coherence.
  3. Inflexible granularity control: Feature-clustering methods such as PartField rely on post-processing clustering, yielding discontinuous and unintuitive control; prompt-based methods such as Point-SAM lack explicit granularity control mechanisms.

Method

Overall Architecture

S2AM3D adopts a decoupled training strategy with two stages:

  • Stage 1: Train a Point-Consistent Part Encoder that fuses 2D segmentation priors with 3D contrastive supervision to produce globally consistent per-point features.
  • Stage 2: Freeze the encoder and train a Scale-Aware Prompt Decoder conditioned on a point prompt index \(p\) and an optional scale prompt \(s \in [0,1]\) for flexible part segmentation.

Given input point cloud \(\mathbf{P} \in \mathbb{R}^{N \times 3}\), the encoder outputs per-point features \(\mathbf{F} \in \mathbb{R}^{N \times D}\), and the decoder produces a probability mask \(\hat{\mathbf{m}} \in [0,1]^N\).

Point-Consistent Part Encoder

The core idea is to layer 3D contrastive supervision on top of 2D distillation, resolving cross-view inconsistencies that arise from multi-view 2D distillation alone.

Base architecture: A PVCNN voxel encoder extracts point features, which are converted into a tri-plane representation \(\mathbf{T} \in \mathbb{R}^{3 \times D \times H \times W}\) (three orthogonal planes: \(xy\), \(yz\), \(zx\)), followed by Transformer blocks for feature aggregation. Tri-plane features are rendered from random viewpoints into 2D latents and supervised via SAM distillation.

Tri-plane feature extraction: Given 3D coordinates \((x,y,z)\), each point is projected onto the three feature planes, and the sampled plane features are summed:

\[\mathbf{F} = \Big[\mathbf{T}_{xy}(x_n, y_n) + \mathbf{T}_{yz}(y_n, z_n) + \mathbf{T}_{zx}(z_n, x_n)\Big]_{n=1}^{N}\]
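A minimal PyTorch sketch of this lookup, assuming coordinates normalized to \([-1,1]\) and bilinear interpolation (the interpolation scheme is an assumption; the formula above writes the lookup as direct plane indexing):

```python
import torch
import torch.nn.functional as F

def sample_triplane(T: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
    """T: (3, D, H, W) planes ordered (xy, yz, zx); xyz: (N, 3) in [-1, 1]."""
    x, y, z = xyz.unbind(-1)
    # One 2D query grid per plane, stacked to (3, 1, N, 2) for grid_sample
    coords = torch.stack([
        torch.stack([x, y], dim=-1),   # T_xy queried at (x_n, y_n)
        torch.stack([y, z], dim=-1),   # T_yz queried at (y_n, z_n)
        torch.stack([z, x], dim=-1),   # T_zx queried at (z_n, x_n)
    ]).unsqueeze(1)
    # grid_sample returns (3, D, 1, N); sum the three planes per point
    feats = F.grid_sample(T, coords, mode="bilinear", align_corners=True)
    return feats.squeeze(2).sum(dim=0).t()   # (N, D)

# Toy sizes for illustration; the paper reports D=448 planes at 512x512
T = torch.randn(3, 448, 128, 128)
xyz = torch.rand(10_000, 3) * 2 - 1
feats = sample_triplane(T, xyz)   # (10000, 448)
```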

3D contrastive supervision: The key innovation is the introduction of contrastive learning on natively labeled 3D data. Contrastive pairs are constructed intra-instance, with each mini-batch containing a single object to avoid cross-instance semantic mismatch. For an anchor point \(i\) in the supervised point set \(\hat{P}\) with label \(y_i\), the positive set is:

\[\hat{P}(i) = \{j \in \hat{P} \setminus \{i\} \mid y_j = y_i\}\]

Using cosine similarity with temperature \(\tau\), \(s_{ij} = \mathbf{f}_i^\top \mathbf{f}_j / \tau\), the contrastive loss is:

\[\mathcal{L}_{\text{contr}} = \frac{1}{|\hat{P}|} \sum_{i \in \hat{P}} -\log \frac{\sum_{j \in \hat{P}(i)} e^{s_{ij}}}{\sum_{j \in \hat{P} \setminus \{i\}} e^{s_{ij}}}\]

This objective compactly clusters features from the same part and separates features from different parts, yielding globally consistent embeddings and sharp boundaries.
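The loss maps directly to a few lines of PyTorch. A minimal sketch, assuming L2-normalized features for the cosine similarity and a single object per batch as described in the text; the temperature default is illustrative:

```python
import torch
import torch.nn.functional as F

def part_contrastive_loss(feats: torch.Tensor, labels: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    """feats: (N, D) per-point features of one object; labels: (N,) part ids."""
    f = F.normalize(feats, dim=-1)                      # cosine similarity via dot product
    sim = f @ f.t() / tau                               # s_ij = f_i . f_j / tau
    self_mask = torch.eye(len(f), dtype=torch.bool, device=f.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    logits = sim.masked_fill(self_mask, float("-inf"))  # exclude j = i
    denom = torch.logsumexp(logits, dim=1)              # log-sum over all j != i
    num = torch.logsumexp(logits.masked_fill(~pos_mask, float("-inf")), dim=1)
    valid = pos_mask.any(dim=1)                         # skip anchors with no positives
    return (denom - num)[valid].mean()

# Example: 6 points from 2 parts
loss = part_contrastive_loss(torch.randn(6, 16), torch.tensor([0, 0, 0, 1, 1, 1]))
```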

Scale-Aware Prompt Decoder

The decoder receives the encoder's output \(\mathbf{F}\) and 3D coordinates \(\mathbf{P}\), augmented with 3D sinusoidal positional encodings to form the base representation:

\[\mathbf{X}^{(0)} = \mathbf{F} + \mathrm{PE}(\mathbf{P})\]
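A minimal sketch of one common 3D sinusoidal encoding, splitting channels evenly across the three axes; the paper does not specify the exact construction, so this is an assumption (the example dimension is chosen for divisibility by 6, and a linear projection would reconcile it with \(D\)):

```python
import torch

def pos_enc_3d(xyz: torch.Tensor, dim: int) -> torch.Tensor:
    """xyz: (N, 3); dim must be divisible by 6; returns (N, dim)."""
    d_axis = dim // 3                                   # channels per axis
    freqs = 10000.0 ** (-torch.arange(0, d_axis, 2, dtype=xyz.dtype) / d_axis)
    parts = []
    for a in range(3):                                  # x, y, z in turn
        ang = xyz[:, a:a + 1] * freqs                   # (N, d_axis/2)
        parts.append(torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1))
    return torch.cat(parts, dim=-1)                     # (N, dim)

pe = pos_enc_3d(torch.rand(1024, 3), dim=384)
```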

Scale Modulator

The scale \(s\) is defined as the relative size of a part (the fraction of total points belonging to that part). For continuous scale \(s \in [0,1]\), learnable sinusoidal embeddings are constructed:

\[\mathbf{e}(s) = \big[\sin(\omega_k s + \phi_k), \ \cos(\omega_k s + \phi_k)\big]_{k=1}^{M}\]

where \(\{\omega_k, \phi_k\}\) are learnable parameters and \(M\) is the number of frequency pairs. Global features are then modulated channel-wise via FiLM (Feature-wise Linear Modulation):

\[[\boldsymbol{\gamma}, \boldsymbol{\beta}] = \text{Linear}(\mathrm{LN}(\mathbf{e}(s)))\]
\[\mathrm{FiLM}(\mathbf{X}; s) = \mathbf{X} \odot (1 + \alpha \boldsymbol{\gamma}) + \alpha \boldsymbol{\beta}\]

where \(\alpha\) is a learnable scalar gate. FiLM layers and Transformer blocks are interleaved for \(L_m\) layers:

\[\mathbf{X}^{(\ell+1)} = T_\ell\big(\mathrm{FiLM}(\mathbf{X}^{(\ell)}; s)\big), \quad \ell = 0, \dots, L_m - 1\]

This yields the scale-conditioned representation \(\tilde{\mathbf{F}} = \mathbf{X}^{(L_m)}\).

Scale Dropout: During training, \(\mathbf{e}(s)\) is randomly zeroed with probability 0.1, reducing FiLM to an identity mapping and ensuring the model remains functional at inference time without a scale input.
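Putting the scale embedding, FiLM, and scale dropout together, a minimal sketch (layer sizes, zero initialization of \(\alpha\), and the affine-free normalization plus bias-free projection that make the dropout case collapse to an exact identity are assumptions):

```python
import torch
import torch.nn as nn

class ScaleFiLM(nn.Module):
    """Scale embedding e(s) + FiLM modulation, with scale dropout."""
    def __init__(self, dim: int, n_freq: int = 16, p_drop: float = 0.1):
        super().__init__()
        self.omega = nn.Parameter(torch.randn(n_freq))  # learnable frequencies omega_k
        self.phi = nn.Parameter(torch.zeros(n_freq))    # learnable phases phi_k
        self.norm = nn.LayerNorm(2 * n_freq, elementwise_affine=False)
        self.proj = nn.Linear(2 * n_freq, 2 * dim, bias=False)  # e = 0 => gamma = beta = 0
        self.alpha = nn.Parameter(torch.zeros(1))       # scalar gate alpha (zero init assumed)
        self.p_drop = p_drop

    def forward(self, x: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
        """x: (N, D) point features; s: scalar scale in [0, 1]."""
        e = torch.cat([torch.sin(self.omega * s + self.phi),
                       torch.cos(self.omega * s + self.phi)])  # e(s), shape (2M,)
        if self.training and torch.rand(()) < self.p_drop:
            e = torch.zeros_like(e)                     # scale dropout: FiLM becomes identity
        gamma, beta = self.proj(self.norm(e)).chunk(2)
        return x * (1 + self.alpha * gamma) + self.alpha * beta

# In the full model, a FiLM layer is interleaved with a Transformer block
# for L_m layers: X <- T_l(FiLM(X; s)).
film = ScaleFiLM(dim=448)
x = film(torch.randn(2048, 448), torch.tensor(0.3))
```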

Bi-directional Cross-Attention

Unidirectional cross-attention struggles to simultaneously perform context aggregation and fine-grained refinement in a single forward pass. Bi-directional cross-attention enables mutual interaction between the prompt point feature \(\tilde{\mathbf{F}}_p \in \mathbb{R}^{1 \times D}\) and global features \(\tilde{\mathbf{F}} \in \mathbb{R}^{N \times D}\):

\[\mathbf{q}^{(\ell+1)} = \mathbf{q}^{(\ell)} + \mathrm{CAttn}(\mathbf{q}^{(\ell)}; \mathbf{Y}^{(\ell)})\]
\[\mathbf{Y}^{(\ell+1)} = \mathrm{FFN}\Big(\mathbf{Y}^{(\ell)} + \mathrm{CAttn}(\mathbf{Y}^{(\ell)}; \mathbf{q}^{(\ell+1)})\Big)\]

After stacking \(L_d\) such layers, the refined point features \(\mathbf{H} \in \mathbb{R}^{N \times D}\) are mapped to a per-point probability mask via an MLP and a sigmoid:

\[\hat{\mathbf{m}} = \sigma(\mathrm{MLP}(\mathbf{H})) \in [0,1]^N\]
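A minimal sketch of one bi-directional layer following the two update equations, using standard multi-head attention; head count and FFN width are assumptions:

```python
import torch
import torch.nn as nn

class BiDirCrossAttn(nn.Module):
    """One bi-directional cross-attention layer (prompt <-> points)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.p2g = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.g2p = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, q: torch.Tensor, y: torch.Tensor):
        """q: (1, 1, D) prompt token; y: (1, N, D) point features."""
        q = q + self.p2g(q, y, y, need_weights=False)[0]            # q <- q + CAttn(q; Y)
        y = self.ffn(y + self.g2p(y, q, q, need_weights=False)[0])  # Y <- FFN(Y + CAttn(Y; q))
        return q, y

layer = BiDirCrossAttn(dim=448)
q = torch.randn(1, 1, 448)        # prompt point feature
y = torch.randn(1, 2048, 448)     # scale-conditioned point features
q, y = layer(q, y)                # stack L_d such layers in the full decoder
mask = torch.sigmoid(nn.Linear(448, 1)(y)).squeeze(-1)   # per-point mask in [0,1]^N
```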

Loss & Training

The segmentation loss uses a dynamically weighted BCE + Dice hybrid objective:

\[\mathcal{L}_{\text{seg}} = \lambda_{\text{bce}} \mathrm{BCE}_{\text{dyn}}(\hat{\mathbf{m}}, \mathbf{m}) + \lambda_{\text{dice}} \left(1 - \frac{2\hat{\mathbf{m}}^\top \mathbf{m}}{\|\hat{\mathbf{m}}\|_1 + \|\mathbf{m}\|_1}\right)\]

Dynamic BCE adaptively computes the weight \(\beta = (1-\pi)/(\pi + \varepsilon)\) based on each sample's positive ratio \(\pi\), alleviating class imbalance. The Dice term directly optimizes set-level overlap, yielding greater robustness for small parts and long-tail distributions.
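A minimal sketch of this hybrid objective, assuming \(\beta\) weights the positive term of the BCE (consistent with the imbalance rationale) and illustrative \(\lambda\) values:

```python
import torch

def seg_loss(pred: torch.Tensor, target: torch.Tensor,
             lam_bce: float = 1.0, lam_dice: float = 1.0,
             eps: float = 1e-6) -> torch.Tensor:
    """pred: (N,) probabilities; target: (N,) binary ground-truth mask."""
    pi = target.mean()                       # positive ratio of this sample
    beta = (1 - pi) / (pi + eps)             # dynamic weight for the positive class
    bce = -(beta * target * torch.log(pred + eps)
            + (1 - target) * torch.log(1 - pred + eps)).mean()
    dice = 1 - 2 * (pred * target).sum() / (pred.sum() + target.sum() + eps)
    return lam_bce * bce + lam_dice * dice

loss = seg_loss(torch.rand(4096), (torch.rand(4096) > 0.9).float())
```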

Dataset Construction

A large-scale dataset containing 100K+ point cloud instances and approximately 1.2M part annotations is constructed from Objaverse, covering 400+ categories. The automated pipeline consists of three steps:

  1. Part annotation: Parts are sampled and assigned labels based on surface area ratios.
  2. Quality filtering: A binary PointNet classifier is trained to automatically filter samples with unreliable annotations.
  3. Connectivity refinement: Spatially disconnected regions sharing the same label are split into independent labels using DBSCAN clustering (see the sketch below).
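A minimal sketch of step 3 with scikit-learn's DBSCAN; the eps and min_samples values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def split_disconnected_parts(points: np.ndarray, labels: np.ndarray,
                             eps: float = 0.02, min_samples: int = 5) -> np.ndarray:
    """Give spatially disconnected regions of each part their own label."""
    new_labels = labels.copy()
    next_id = labels.max() + 1
    for part in np.unique(labels):
        idx = np.where(labels == part)[0]
        comp = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points[idx])
        for c in np.unique(comp):
            if c <= 0:                       # keep noise (-1) and the first component
                continue
            new_labels[idx[comp == c]] = next_id
            next_id += 1
    return new_labels
```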

Key Experimental Results

Main Results

Interactive segmentation (point prompt → single part mask):

| Method | PartObjaverse-Tiny (IoU%) | PartNet-E (IoU%) | Mean |
| --- | --- | --- | --- |
| Point-SAM | 31.46 | 50.23 | 40.85 |
| P3-SAM | 35.05 | 39.98 | 37.52 |
| S2AM3D | 46.47 | 62.52 | 54.50 |
| S2AM3D (+scale) | 61.19 | 77.51 | 69.35 |

Full segmentation (predicting part labels for all points):

| Method | PartObjaverse-Tiny (IoU%) | PartNet-E (IoU%) | Mean |
| --- | --- | --- | --- |
| Find3D | 20.76 | 21.69 | 21.23 |
| SAMPart3D | 48.79 | 56.17 | 52.48 |
| SAMesh | - | 26.66 | - |
| PartField | 51.54 | 59.10 | 55.32 |
| P3-SAM | 58.10 | 65.39 | 61.75 |
| S2AM3D | 63.29 | 77.98 | 70.64 |

Ablation Study

| Setting | PartObjaverse-Tiny (IoU%) | PartNet-E (IoU%) | Mean |
| --- | --- | --- | --- |
| +scale, full model | 61.19 | 77.51 | 69.35 |
| +scale, w/o 3D supervision | 53.94 | 64.11 | 59.03 |
| +scale, w/o custom dataset | 53.12 | 66.12 | 59.62 |
| no scale, full model | 46.47 | 62.52 | 54.50 |
| no scale, w/o 3D supervision | 41.14 | 55.39 | 48.27 |
| no scale, w/o custom dataset | 42.12 | 58.56 | 50.34 |
| no scale, w/o scale embedding | 42.31 | 58.28 | 50.30 |

Key Findings

  • 3D contrastive supervision is the largest performance contributor: Removing it causes a 10.32-point drop in mean IoU under the +scale setting; feature visualizations reveal blurred boundaries and internal inconsistencies.
  • Scale prompts yield substantial gains: Adding scale conditioning improves mean IoU from 54.50% to 69.35% (+14.85 points) in the interactive segmentation setting.
  • Critical role of the custom dataset: Replacing it with PartNet training data leads to a notable performance drop, demonstrating the complementary distributional value of the large-scale, high-quality dataset.
  • Scale embedding enhances decoding robustness: Even without a scale input at inference time, models trained with scale embeddings outperform those without (54.50 vs. 50.30).
  • Only XYZ coordinates required: Unlike Point-SAM, which requires color, and P3-SAM, which requires normals, S2AM3D achieves state-of-the-art results using coordinates alone.

Highlights & Insights

  • The 2D-3D joint training paradigm is elegantly designed: 2D priors provide generalization capability while 3D contrastive supervision enforces global consistency, with strong complementarity between the two.
  • The scale-aware decoder achieves continuous granularity control via FiLM combined with bi-directional cross-attention, supporting smooth coarse-to-fine transitions.
  • The automated data pipeline (annotation → filtering → refinement) is scalable, producing one of the largest 3D part segmentation datasets to date.
  • A single unified framework handles both interactive segmentation and full segmentation tasks.

Limitations & Future Work

  • Only point prompts and scale signals are supported for interaction; future work could incorporate text instructions for more intuitive semantic control.
  • The encoder relies on PartField pretrained weights for initialization, limiting the novelty of the encoder architecture itself.
  • Dataset annotation depends on an automated pipeline, and the generalizability of the filtering strategy (PointNet classifier) has not been thoroughly validated.
  • Inference speed and memory consumption are not discussed; the \(448 \times 512 \times 512\) tri-plane representation used for 10,000 sampled points is relatively large.
  • Evaluation datasets are small in scale (PartObjaverse-Tiny contains only 200 samples); larger-scale evaluation would strengthen the conclusions.

Rating

  • Novelty: ⭐⭐⭐⭐ — The combination of 2D-3D joint supervision and continuous scale control is creative, though the base encoder architecture follows PartField.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Two task settings, ablations, and visualizations are provided, though the evaluation datasets are limited in scale.
  • Writing Quality: ⭐⭐⭐⭐ — Logical structure is clear, mathematical derivations are rigorous, and figures are of high quality.
  • Value: ⭐⭐⭐⭐ — Provides a practical granularity-controllable solution for 3D part segmentation; the dataset contribution adds additional value.