Part2Object: Hierarchical Unsupervised 3D Instance Segmentation¶

Conference: ECCV 2024
arXiv: 2407.10084
Area: Semantic Segmentation
Keywords: Unsupervised 3D Instance Segmentation, Hierarchical Clustering, Pseudo-labels, Self-training, 3D Objectness Prior

TL;DR¶

Part2Object is a hierarchical clustering framework that uses self-supervised features and 3D objectness priors to merge part-level over-segmentations into object-level instances layer by layer. The resulting pseudo-labels are used to self-train Hi-Mask3D, achieving 3D instance segmentation without any human annotations.

Background & Motivation¶

3D instance segmentation is a core task in scene understanding, but existing methods heavily rely on large amounts of human-annotated 3D point cloud masks.
Labeling 3D data is extremely expensive (annotating a single scene in ScanNet takes hours), limiting the scalability of these methods.
Prior unsupervised methods, such as Felzenszwalb segmentation and the 3D projection of CutLER, suffer from severe over-segmentation and under-segmentation, respectively.
Single-layer clustering methods struggle to adapt simultaneously to the granularity required for objects of different sizes and geometric structures.
Without an effective stopping criterion, merging continues past object boundaries and fuses adjacent objects (e.g., a desk and the laptop on it).
A hierarchical approach is required to progressively construct instance segmentation from parts to objects while maintaining adaptability to objects of various scales.

Method¶

Overall Architecture¶

Part2Object adopts a two-stage framework consisting of "bottom-up clustering + self-training":

First Stage (Part2Object Hierarchical Clustering): Point features are obtained from self-supervised models (e.g., DINOv2). Starting from an initial part-level over-segmentation, hierarchical clustering progressively merges parts into object-level instances, which serve as pseudo-labels.
Second Stage (Hi-Mask3D Self-Training): A hierarchical 3D instance segmentation network, Hi-Mask3D, is trained on the pseudo-labels to predict part-level and object-level segmentations simultaneously.
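To make the two levels of pseudo-labels concrete, the following is a minimal sketch of how per-point part ids could be lifted to per-point object ids once the clustering has decided which parts belong together. The function name `object_labels_from_parts` and the `part_to_object` dictionary are hypothetical, introduced only for illustration; the summary does not describe the actual data layout.

```python
import numpy as np

def object_labels_from_parts(part_labels, part_to_object):
    """Lift per-point part ids to per-point object ids.

    part_labels: (N,) array of part ids, one per point.
    part_to_object: dict mapping each part id to its object id
                    (assumed output of the hierarchical clustering).
    """
    part_labels = np.asarray(part_labels)
    return np.array([part_to_object[int(p)] for p in part_labels])

# Toy scene: parts 0, 1, 2 form one object (id 10), part 3 is another (id 11).
part_labels = np.array([0, 0, 1, 1, 2, 3])
part_to_object = {0: 10, 1: 10, 2: 10, 3: 11}
object_labels = object_labels_from_parts(part_labels, part_to_object)
```

Under this assumption, every point carries both a part label and an object label, which is the kind of paired supervision a network predicting both levels would need.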

Key Designs¶

Module 1: Feature-Guided Hierarchical Clustering¶

At each layer \(t\), every cluster pair \((c_i^t, c_j^t)\) is evaluated, and the pair is merged if it satisfies the following condition:

\[c_k^{t+1} \leftarrow c_i^t \cup c_j^t \quad \text{if} \quad \text{rank}(\text{sim}(\boldsymbol{f}_i^t, \boldsymbol{f}_j^t)) \leq K \;\land\; \text{dist}(c_i^t, c_j^t) \leq T\]

where \(\boldsymbol{f}_i^t\) is the feature vector of cluster \(c_i^t\), \(\text{sim}(\cdot,\cdot)\) is the feature similarity, \(\text{dist}(\cdot,\cdot)\) is the spatial distance between two clusters, \(K\) is the nearest-neighbor rank threshold on feature similarity, and \(T\) is the spatial distance threshold. After merging, the feature of the new cluster is computed by a feature update function \(\text{FU}(\cdot)\):

\[\boldsymbol{f}_k^{t+1} \leftarrow \text{FU}(c_k^{t+1})\]

This design simultaneously considers feature similarity and spatial proximity, preventing the erroneous merging of feature-similar but spatially separated regions.
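The merge condition can be sketched as a single clustering layer in NumPy. This is an illustrative sketch, not the authors' implementation, and several choices are assumptions: \(\text{FU}\) is taken to be the mean of member point features, \(\text{sim}\) is cosine similarity, \(\text{dist}\) is the minimum point-to-point distance, and the rank test is applied in both directions (each cluster must be among the other's top \(K\)).

```python
import numpy as np

def merge_layer(points, feats, labels, K=2, T=0.1):
    """One merging layer; returns new per-point cluster labels.

    points: (N, 3) coordinates, feats: (N, D) point features,
    labels: (N,) integer cluster ids at layer t.
    """
    ids = np.unique(labels)
    n = len(ids)
    # Assumed FU: mean of member features, then L2-normalised.
    cf = np.stack([feats[labels == i].mean(axis=0) for i in ids])
    cf = cf / np.linalg.norm(cf, axis=1, keepdims=True)
    sim = cf @ cf.T                      # cosine similarity (assumption)
    np.fill_diagonal(sim, -np.inf)       # a cluster is never its own neighbour
    # rank[i, j] == 1 means j is the most feature-similar cluster to i.
    order = np.argsort(-sim, axis=1)
    rank = np.empty_like(order)
    rank[np.arange(n)[:, None], order] = np.arange(1, n + 1)[None, :]

    parent = list(range(n))              # union-find over cluster indices

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            # Feature test: mutual top-K (assumed reading of rank(sim) <= K).
            if max(rank[i, j], rank[j, i]) > K:
                continue
            pi, pj = points[labels == ids[i]], points[labels == ids[j]]
            # Spatial test: minimum point-to-point distance (assumption).
            d = np.linalg.norm(pi[:, None, :] - pj[None, :, :], axis=-1).min()
            if d <= T:
                parent[find(i)] = find(j)

    roots = np.array([find(i) for i in range(n)])
    return ids[roots][np.searchsorted(ids, labels)]
```

In a toy scene with two feature-similar clusters that are far apart and two adjacent clusters with dissimilar features, neither pair is merged: the first fails the distance test and the second fails the rank test. The brute-force pairwise distance is quadratic in cluster size and is only meant for small examples.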

Module 2: 3D Objectness Prior Stopping Criteria¶

To prevent over-merging (such as merging a desk with items on top of it), a stopping criterion based on a 3D objectness prior \(B^{3D}\) is introduced:

\[\text{stopCriteria}(c_i^t, c_j^t, B^{3D}) = \begin{cases} \text{True} & \text{if } \exists b \in B^{3D}: \text{IoU}(c_i^t \cup c_j^t, b) > \tau_{iou} \\ \text{False} & \text{otherwise} \end{cases}\]

where the IoU threshold \(\tau_{iou} = 0.6\). The 3D objectness prior can be acquired from multi-view projections of 2D detectors (like CutLER), providing rough estimates of object boundaries to guide when clustering should cease merging.
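The predicate above can be written in a few lines. This is a hedged sketch: clusters and prior objects are represented as sets of point indices, which is an assumption (the summary does not say whether \(B^{3D}\) holds boxes or point masks), and `iou` and `stop_criteria` are illustrative names. How the returned flag is consumed by the clustering loop is not specified here, so the sketch only evaluates the condition.

```python
def iou(a: set, b: set) -> float:
    """Intersection over union of two point-index sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def stop_criteria(c_i: set, c_j: set, priors, tau_iou: float = 0.6) -> bool:
    """True if the union of the two clusters overlaps some prior object
    with IoU strictly above tau_iou, as in the formula above."""
    merged = c_i | c_j
    return any(iou(merged, b) > tau_iou for b in priors)

# Toy priors: a desk covering points 0..5 and a laptop covering points 6..11.
priors = [set(range(0, 6)), set(range(6, 12))]
two_desk_parts = stop_criteria({0, 1, 2}, {3, 4, 5}, priors)
desk_and_laptop = stop_criteria(set(range(0, 6)), set(range(6, 12)), priors)
```

In this toy case the union of the two desk parts coincides with the desk prior (IoU 1.0), so the predicate fires, while the union of desk and laptop reaches only IoU 0.5 with either prior and stays below the 0.6 threshold.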

Module 3: Hi-Mask3D Hierarchical Self-Training¶

Hi-Mask3D extends the Mask3D architecture to predict part-level and object-level segmentations simultaneously:

Part Queries: 300 queries, responsible for fine-grained part segmentation.
Object Queries: 150 queries, responsible for object-level instance segmentation.
The two levels of queries learn part-to-object semantic relationships through a hierarchical Transformer decoder.

Loss & Training¶

Optimizer: AdamW, learning rate \(1 \times 10^{-4}\)
Scheduler: OneCycleLR
Training epochs: 600
Batch size: 4
Voxel size: 0.02
In the unsupervised class-agnostic setting, no categories are filtered based on semantic labels.
In the data-efficient setting, the wall/floor categories of ScanNet are filtered following the convention of Mask3D.
Inference does not use DBSCAN post-processing.
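The settings listed above can be collected in a small configuration object. This is only a sketch: the class name `TrainConfig`, its field names, and the `total_steps` helper are hypothetical, and the unit of the voxel size is not stated in this summary (metres is a guess). In a PyTorch setup, the optimizer and scheduler names would presumably map to `torch.optim.AdamW` and `torch.optim.lr_scheduler.OneCycleLR`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    optimizer: str = "AdamW"
    learning_rate: float = 1e-4
    scheduler: str = "OneCycleLR"
    epochs: int = 600
    batch_size: int = 4
    voxel_size: float = 0.02            # unit not stated here; metres assumed
    use_dbscan_postprocess: bool = False  # inference skips DBSCAN

    def total_steps(self, num_scenes: int) -> int:
        """Optimizer steps over the whole run, e.g. for a one-cycle schedule.

        num_scenes is a hypothetical training-set size supplied by the caller.
        """
        steps_per_epoch = -(-num_scenes // self.batch_size)  # ceiling division
        return steps_per_epoch * self.epochs

cfg = TrainConfig()
```

A one-cycle schedule needs the total number of steps up front, which is why the helper derives it from the epoch count and batch size.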

Key Experimental Results¶

Main Results¶

Method	Supervision Type	ScanNet mAP	ScanNet mAP@50	Remarks
Felzenszwalb	Unsupervised	-	Severe over-segmentation	Traditional method
CutLER Projection	Unsupervised	-	Severe under-segmentation	2D→3D projection
Mask3D (Class-agnostic)	Fully-supervised	Baseline	Baseline	With annotations
Part2Object + Hi-Mask3D	Unsupervised	Significant improvement	Outperforms supervised baseline	Ours

Zero-Shot Cross-Dataset Generalization¶

Method	ScanNet200 Head mAP@50	ScanNet200 Common mAP@50	S3DIS mAP@50 (Min Gain)
Mask3D (Class-agnostic)	Baseline	Baseline	Baseline
Hi-Mask3D	+10.0%	+0.8% mAP	+2.9%

Key Findings¶

Hierarchical Clustering vs. Single-Layer Clustering: Single-layer clustering fails to accommodate objects of varying scales, leading to either over-segmentation of large objects or under-segmentation of small ones. Hierarchical clustering effectively resolves this issue.
Role of Objectness Prior: Without the objectness prior, objects tend to merge with adjacent items or background elements (such as walls and floors). Introducing the prior successfully separates spatially adjacent objects.
Self-Training Gain: Through self-training, Hi-Mask3D can correct under-segmentation issues present in the pseudo-labels, such as separating a computer from the desk.
Part-Level Learning: Hi-Mask3D can learn hierarchical semantic relationships between objects and parts, such as distinguishing the backrest, seat cushion, and armrests of a sofa.

Highlights & Insights¶

The hierarchical "part-to-object" workflow is highly intuitive, aligning naturally with the "part-whole" relation in human perception.
The design of using the 3D objectness prior as a stopping criterion is elegant, compensating for the lack of boundary awareness in pure feature clustering.
On the head and common categories of ScanNet200, the unsupervised method outperforms the fully-supervised class-agnostic Mask3D.
Hi-Mask3D simultaneously outputs both part-level and object-level segmentations, providing flexible granularity options for downstream tasks.

Limitations & Future Work¶

Performance drops slightly (mAP -0.9%) on tail categories of ScanNet200, as the model is never exposed to annotations of these categories.
The quality of the objectness prior relies heavily on the performance of the 2D detector; if the 2D detector fails in specific scenes, the clustering performance is affected.
Training for 600 epochs is time-consuming.
Hyperparameters for hierarchical clustering (\(K\), \(T\), \(\tau_{iou}\)) require dataset-specific tuning.

Related Work & Insights¶

Mask3D: A fully-supervised 3D instance segmentation baseline; Hi-Mask3D extends it with a hierarchical structure.
UnScene3D: Another unsupervised 3D scene understanding method, which uses a class-agnostic evaluation protocol.
CutLER: Unsupervised 2D object detection, used to provide the 3D objectness prior.
Insights: The "over-segmentation -> hierarchical merging -> self-training" paradigm can be transferred to other unsupervised segmentation tasks.

Rating¶

Dimension	Rating
Novelty	⭐⭐⭐⭐
Theoretical Depth	⭐⭐⭐
Experimental Thoroughness	⭐⭐⭐⭐
Value	⭐⭐⭐⭐
Overall Recommendation	⭐⭐⭐⭐