PartSTAD: 2D-to-3D Part Segmentation Task Adaptation
Conference: ECCV 2024
arXiv: 2401.05906
Code: https://github.com/KAIST-Visual-AI-Group/PartSTAD
Area: Segmentation
Keywords: 3D Part Segmentation, Task Adaptation, Few-shot, 2D-to-3D lifting, SAM
TL;DR
PartSTAD proposes a task adaptation method for 2D-to-3D part segmentation. It adds a small learnable network that predicts a weight for each of GLIP's 2D bounding boxes, trained to optimize 3D mRIoU, and integrates SAM to turn those boxes into precise foreground masks. Relative to PartSLIP on PartNet-Mobility, it achieves a 7.0%p improvement in semantic segmentation mIoU and a 5.2%p improvement in instance segmentation mAP50.
Background & Motivation
3D part segmentation is a fundamental task for understanding the structure, function, and semantics of 3D shapes. However, annotated 3D data is extremely scarce: the largest part-annotated dataset, PartNet, contains fewer than 30,000 models, whereas 2D image annotations number in the millions.
Prior Work: Methods like PartSLIP utilize 2D vision-language models (GLIP) for multi-view rendering \(\rightarrow\) 2D detection \(\rightarrow\) voting aggregation to 3D, and fine-tune GLIP on synthetic data to adapt to rendered images and unnatural text prompts (domain adaptation).
Limitations of Prior Work:
- The fine-tuning of PartSLIP only performs domain adaptation (adapting GLIP to synthetic images and part name lists) instead of task adaptation (optimizing towards the final 3D segmentation quality).
- 2D bounding boxes inevitably contain noise; the key is how to control the influence of noise on the final 3D segmentation during multi-view integration.
- GLIP only outputs bounding boxes instead of segmentation masks, leading to imprecise segmentation boundaries.
Key Insight: Treat 2D-to-3D part segmentation as a task adaptation problem. The pre-trained weights stay frozen, and a small weight prediction network is trained to optimize the aggregation of 2D boxes, using a relaxed 3D mIoU (mRIoU) as the objective function. In addition, SAM is introduced to obtain precise foreground masks that replace the bounding boxes.
Method
Overall Architecture
The pipeline of PartSTAD is as follows:
1. Render the 3D point cloud into multi-view (10 fixed views) 2D images.
2. Extract 2D bounding boxes for each view using the fine-tuned GLIP.
3. Convert each bounding box into a foreground mask using SAM (SAM Mask Integration).
4. Predict a weight for each box/mask (Weight Prediction Network).
5. Aggregate to the 3D point cloud via weighted voting (2D-to-3D Task Adaptation).
6. Obtain the final segmentation labels based on superpoints.
Both GLIP and SAM are frozen; only the weight prediction network and the null-label score are trained, separately for each category, using 8 annotated objects (8-shot).
Key Designs
- 3D mRIoU Loss Function:
  - Function: Directly use 3D segmentation quality as the adaptation objective.
  - Mechanism: Since the standard mIoU is non-differentiable, a relaxed IoU (mRIoU) is used, which relaxes the predicted labels from \(\{0,1\}\) to \([0,1]\).
  - Formula: \(\mathcal{L}_{\text{mRIoU}} = 1 - \frac{1}{M}\sum_{j=1}^{M} \frac{\mathbf{l}_j^\top \hat{\mathbf{l}}_j}{\|\mathbf{l}_j\|_1 + \|\hat{\mathbf{l}}_j\|_1 - \mathbf{l}_j^\top \hat{\mathbf{l}}_j}\), where \(M\) is the number of part labels, \(\mathbf{l}_j\) is the binary ground-truth indicator vector for label \(j\), and \(\hat{\mathbf{l}}_j\) is the relaxed prediction for label \(j\).
  - Design Motivation: Cross-entropy loss is less effective than mRIoU in 3D segmentation tasks (as verified by supplementary experiments); mRIoU directly optimizes a relaxation of the evaluation metric itself.
- Bounding Box Weight Prediction:
  - Function: Predict a weight for each 2D bounding box to control its contribution to the 3D voting.
  - Mechanism: Since mRIoU is non-differentiable with respect to the bounding box positions, the box coordinates are not adjusted directly; instead, a non-negative weight \(W(b)\) is predicted for each box and multiplied into its voting score.
  - Modified Voting Formula: \(\tilde{s}_{ij} = \frac{\sum_k \sum_{p \in P_i} V_k(p) \cdot \max_{b \in B_k^j} \bigl(I_b(p) \cdot W(b)\bigr)}{\sum_k \sum_{p \in P_i} V_k(p)}\), where \(P_i\) is the \(i\)-th superpoint, \(V_k(p)\) indicates whether point \(p\) is visible in view \(k\), \(B_k^j\) denotes the boxes predicted for part label \(j\) in view \(k\), and \(I_b(p)\) indicates whether \(p\) falls inside box \(b\).
  - The final scores are normalized via softmax; the score for the null label is set as a learnable parameter (initial value of 10).
  - Network Architecture: A two-layer shared MLP with context normalization in between to capture the global box context. The output is processed by a modified ReLU \(\phi(x) = \max(\tau + x, 0)\) (\(\tau=10\) ensures positive initial weights).
  - Design Motivation: Significant improvements are achieved over the original PartSLIP voting framework with minimal modification (only multiplying weights); weight prediction can suppress noisy boxes and reinforce high-quality ones.
- SAM Mask Integration:
  - Function: Convert 2D bounding boxes into precise foreground masks using SAM's box-prompted segmentation capability.
  - Mechanism: Use the bounding boxes as input prompts for SAM to obtain precise foreground segmentation regions, replacing the original rectangular boxes.
  - Implementation: The point-to-bounding-box membership \(I_b\) becomes point-to-mask membership, while weight prediction still utilizes GLIP's box features.
  - Design Motivation: GLIP-generated bounding boxes contain a large amount of background; using SAM to extract the foreground significantly improves segmentation boundaries.
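The weighted voting described under Bounding Box Weight Prediction can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the input layout (`points` as nested lists of `(visible, hits)` pairs), the function names, and the choice to append the null-label score as one extra softmax entry are all hypothetical. Only the modified ReLU, the max over boxes, the visibility normalization, and the initial values of 10 come from the description above.

```python
import math


def shifted_relu(x, tau=10.0):
    """Modified ReLU phi(x) = max(tau + x, 0), with tau = 10 as in the text."""
    return max(tau + x, 0.0)


def superpoint_scores(points, num_labels, null_score=10.0):
    """Weighted vote for a single superpoint, followed by a softmax.

    `points` (hypothetical layout): one entry per 3D point in the superpoint;
    each entry is a list with one (visible, hits) pair per view, where `hits`
    maps a label index j to the raw network outputs of the boxes/masks of
    label j that contain the point in that view.
    """
    numer = [0.0] * num_labels
    denom = 0.0
    for views in points:
        for visible, hits in views:
            if not visible:
                continue  # V_k(p) = 0: the point casts no vote in this view
            denom += 1.0
            for j, raw_outputs in hits.items():
                # max over boxes b of I_b(p) * W(b); only boxes containing p
                # are listed, so I_b(p) = 1 for each of them.
                numer[j] += max((shifted_relu(x) for x in raw_outputs), default=0.0)
    scores = [n / denom if denom > 0 else 0.0 for n in numer]
    # Assumption: the null-label score is appended as the last softmax entry.
    logits = scores + [null_score]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With this layout, a box whose raw output falls below `-tau` receives weight 0 and no longer contributes to the vote, which is how a noisy detection can be suppressed without touching its coordinates.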
Loss & Training
- Loss: 3D mRIoU loss.
- Training Settings: Per-category training, with 8 annotated 3D objects per category (few-shot).
- Trainable Parameters: Only the weight prediction MLP and the null-label score (an extremely small number of parameters).
- GLIP and SAM are completely frozen.
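To make the training objective concrete, the mRIoU loss can be written out in a few lines of plain Python. This is a sketch that follows the formula given under Key Designs; the function name, the list-of-lists layout, and the handling of a label that is absent from both ground truth and prediction are assumptions, and a real training setup would use an autodiff framework rather than Python lists.

```python
def mriou_loss(gt, pred):
    """Return 1 - mean relaxed IoU over labels.

    gt[j][n]   in {0, 1}: ground-truth indicator of label j at point n.
    pred[j][n] in [0, 1]: relaxed (soft) prediction of label j at point n.
    """
    ious = []
    for l, l_hat in zip(gt, pred):
        inter = sum(a * b for a, b in zip(l, l_hat))   # l^T l_hat
        union = sum(l) + sum(l_hat) - inter            # |l|_1 + |l_hat|_1 - l^T l_hat
        # Assumption: a label missing from both gt and pred counts as IoU 1.
        ious.append(inter / union if union > 0 else 1.0)
    return 1.0 - sum(ious) / len(ious)
```

For hard predictions that match the ground truth exactly the loss is 0, and for soft predictions it varies smoothly with each entry of `pred`, which is what allows gradients to reach the weight prediction network.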
Key Experimental Results
Main Results
Semantic segmentation mIoU (%) on PartNet-Mobility (10 representative categories; the Mean mIoU column matches the 45-category means reported in the ablation study):
| Method | Mean mIoU | Storage | Furniture | Table | Chair | Switch | Toilet | Laptop | USB | Remote | Scissors |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SATR | 29.3 | 20.6 | 23.3 | 33.1 | 21.4 | 17.6 | 11.2 | 30.2 | 17.2 | 36.8 | - |
| SATR+SP | 34.8 | 28.9 | 28.0 | 37.7 | 37.0 | 22.1 | 12.4 | 33.4 | 28.0 | 43.0 | - |
| PartSLIP | 58.0 | 52.3 | 44.6 | 82.8 | 52.1 | 50.4 | 31.2 | 52.1 | 36.6 | 61.4 | - |
| PartSTAD | 65.0 | 59.5 | 47.8 | 85.3 | 57.9 | 57.5 | 34.6 | 59.9 | 53.4 | 68.5 | - |
Instance segmentation mAP50 (%):
| Method | Mean mAP | Storage | Furniture | Table | Chair | Toilet | Laptop | USB | Remote | Scissors |
|---|---|---|---|---|---|---|---|---|---|---|
| PartSLIP | 41.6 | 29.1 | 32.6 | 82.2 | 21.2 | 36.2 | 17.8 | 20.9 | 19.9 | 23.6 |
| PartSTAD | 45.6 | 33.8 | 33.7 | 83.6 | 23.5 | 41.5 | 26.5 | 25.7 | 26.2 | 28.0 |
Ablation Study
Ablation of semantic segmentation components (mean mIoU across 45 categories):
| Configuration | Mean mIoU | Description |
|---|---|---|
| PartSLIP (Baseline) | 58.0 | No weight prediction, no SAM |
| w/o Weight Prediction | 61.9 | Only SAM added, +3.9 |
| w/o SAM Integration | 62.1 | Only weight prediction added, +4.1 |
| PartSTAD (Full) | 65.0 | Combination of both, +7.0 |
The contribution of each component is quantified through ablation:
- Removing weight prediction: mIoU decreases by 3.1%p (\(65.0 \rightarrow 61.9\)).
- Removing the SAM mask: mIoU decreases by 2.9%p (\(65.0 \rightarrow 62.1\)).
- The two components are complementary: each still adds about 3%p on top of the other. The joint improvement (+7.0%p) is slightly smaller than the sum of the individual improvements (3.9 + 4.1 = 8.0%p), so the gains overlap a little rather than compounding.
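The deltas in the ablation can be re-derived from the four mean mIoU values in the table. The short Python check below uses only those numbers; the variable names are illustrative.

```python
# Mean mIoU values copied from the ablation table.
baseline, sam_only, weight_only, full = 58.0, 61.9, 62.1, 65.0

gain_sam = round(sam_only - baseline, 1)        # SAM alone over PartSLIP
gain_weight = round(weight_only - baseline, 1)  # weight prediction alone
gain_full = round(full - baseline, 1)           # both components together

drop_without_weight = round(full - sam_only, 1)  # remove weight prediction
drop_without_sam = round(full - weight_only, 1)  # remove SAM masks

sum_of_individual = round(gain_sam + gain_weight, 1)

print(gain_sam, gain_weight, gain_full)       # 3.9 4.1 7.0
print(drop_without_weight, drop_without_sam)  # 3.1 2.9
print(sum_of_individual)                      # 8.0
```

The joint gain of 7.0 sits one point below the 8.0 obtained by adding the two individual gains.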
Key Findings
- Task Adaptation > Domain Adaptation: PartSLIP only performs domain adaptation (adapting to synthetic images), whereas PartSTAD performs task adaptation targeting 3D mIoU, yielding an additional 7.0%p improvement.
- Weight Prediction is the Core Contribution: Simply by predicting a scalar weight for each bounding box (a minimal modification), a 4.1%p improvement is achieved.
- SAM Mask Significantly Improves Boundaries: Especially effective for small parts (e.g., Camera, minor parts of Chair) and thin parts (e.g., Clock hands).
- Consistent Improvement across All Categories: Regardless of the object type, PartSTAD improves over PartSLIP, with the largest gains exceeding 15%p (e.g., USB, 36.6 → 53.4 in the semantic segmentation table).
- mRIoU Loss Outperforms Cross-Entropy: Directly optimizing the evaluation metric itself is more effective.
Highlights & Insights
- The Perspective of 'Task Adaptation' is Highly Insightful: It explicitly distinguishes between domain adaptation and task adaptation, the latter being critical in 2D-to-3D lifting scenarios.
- Minimal yet Effective Design: Formulates a simple MLP to predict bounding box weights without modifying any parameters of GLIP itself.
- Clever Handling of mRIoU as the Objective Function: It bypasses the non-differentiability of IoU with respect to discrete parameters, achieving end-to-end optimization by predicting weights instead of coordinates.
- Complementary Combination of SAM and GLIP: GLIP is responsible for semantic detection (identifying what is in each box), while SAM handles precise segmentation (tightening the boxes to foreground boundaries).
Limitations & Future Work
- The weight prediction network is trained per-category and is not shared across different categories, which limits scalability.
- Experiments are conducted only on PartNet-Mobility (synthetic CAD models); generalization to real-scanned 3D data is not fully validated.
- 10 fixed views might be insufficient; occlusion and self-occlusion issues are not fully addressed.
- SAM may still generate imprecise masks for extremely small parts.
- All experiments use 8 annotated objects per category, so the efficacy under more extreme few-shot settings (1-2 objects) remains unknown.
Related Work & Insights
- PartSLIP: The direct foundation of PartSTAD, which proposes a 2D-to-3D pipeline using GLIP fine-tuning + superpoint voting.
- SATR: A similar approach but uses geodesic propagation instead of voting, which yields inferior performance.
- SAM3D: Directly lifts SAM masks into 3D, but lacks semantic labels.
- LoRA / PEFT: The weight prediction network is closely aligned with the PEFT idea of freezing the pre-trained model and training only a minimal number of parameters.
- Insight: The task adaptation concept can be generalized to other 2D-to-3D lifting tasks (e.g., scene segmentation, semantic SLAM).
Rating
- Novelty: ⭐⭐⭐⭐ (The task adaptation perspective is insightful, and the weight prediction scheme is simple yet elegant)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Comprehensive experiments on 45 categories plus ablation studies, but evaluated on only one dataset)
- Writing Quality: ⭐⭐⭐⭐ (Clear logic, comprehensive mathematical derivations)
- Value: ⭐⭐⭐⭐ (Provides a proper optimization direction for 2D-to-3D segmentation)