SegPoint: Segment Any Point Cloud via Large Language Model

Conference: ECCV 2024
arXiv: 2407.13761
Code: https://heshuting555.github.io/SegPoint
Area: 3D Vision / Point Cloud Segmentation
Keywords: 3D point cloud segmentation, LLM, unified framework, instruction segmentation, geometric feature

TL;DR

SegPoint is presented as the first model to use the reasoning capabilities of a multimodal LLM within a unified framework that covers four tasks: 3D instruction segmentation, referring segmentation, semantic segmentation, and open-vocabulary segmentation. The paper also constructs the Instruct3D benchmark (2,565 instruction-point cloud pairs), on which SegPoint achieves an mIoU of 27.5%.

Background & Motivation

Background: Significant progress has been made in 3D point cloud segmentation (e.g., Mask3D, SPFormer), but each model typically addresses only one specific segmentation task, lacking unification.

Limitations of Prior Work: (a) Existing methods rely on predefined categories or explicit textual descriptions, failing to understand implicit human intent (e.g., "Where can I sit?"); (b) different tasks require different models, which is inefficient and impractical.

Key Challenge: Real-world scenarios require models to understand implicit instructions and perform reasoning, yet current methods lack reasoning capability; meanwhile, a unified framework is needed to handle multiple segmentation tasks.

Key Insight: Leveraging the reasoning and world knowledge of LLMs to understand complex/implicit instructions, combined with a geometric enhancer module to compensate for the limitations of point cloud encoders in dense prediction.

Core Idea: Injecting LLM reasoning capabilities into 3D point cloud segmentation, achieving high-quality point-wise segmentation through geometric enhancement and feature propagation.

Method

Overall Architecture

SegPoint consists of four core components: (1) a pretrained point cloud encoder \(\mathcal{E}\) (Uni3D) to extract point cloud features; (2) a large language model \(\mathcal{F}\) (LLaMA2-7B) to provide reasoning capabilities; (3) a Geometric Enhancer Module (GEM) \(\mathcal{G}\) to extract local geometric information and inject it into the encoder; and (4) a Geometric-guided Feature Propagation (GFP) block \(\mathcal{P}\) to generate high-quality point-wise embeddings for precise segmentation.

The inputs are the point cloud \(\vec{i}_{point} \in \mathbb{R}^{N \times (3+F)}\) and the text instruction \(\vec{i}_{txt}\). The segmentation mask is obtained by taking the dot product between the embedding of the `<SEG>` token in the LLM output and the point-wise features.
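As a minimal sketch of this mask-generation step, the NumPy snippet below scores every point against a `<SEG>` embedding with a dot product and binarizes the result. The function name `predict_mask`, the array shapes, the sigmoid, and the 0.5 threshold are assumptions made for illustration; they are not details taken from the paper.

```python
import numpy as np

def predict_mask(h_seg, point_embeddings, threshold=0.5):
    """Score each point against the <SEG> embedding and binarize.

    h_seg:            (D,) mask embedding derived from the <SEG> token
    point_embeddings: (N, D) point-wise embeddings
    threshold:        assumed cut-off on the sigmoid probability
    """
    logits = point_embeddings @ h_seg          # (N,) one score per point
    probs = 1.0 / (1.0 + np.exp(-logits))      # sigmoid
    return probs > threshold, logits

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 4))               # toy point-wise embeddings
h_seg = points[2].copy()                       # toy query aligned with point 2
mask, logits = predict_mask(h_seg, points)
```

Because the toy query is a copy of point 2's embedding, that point's logit is its squared norm and it always falls inside the mask.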

Key Designs

1. Vanilla Baseline and Its Limitations

- Function: Directly inputs point cloud encoder features into the LLM, detects the `<SEG>` token to generate the mask embedding $\vec{h}_{seg} = \gamma(\vec{y}_{[seg]})$, and performs a dot product with the upsampled point-wise embeddings to obtain the mask $\vec{m} = \vec{h}_{seg} \otimes \text{UpS.}(\vec{f}_{point})$.
- Limitations: (a) The point cloud encoder is trained for scene-level classification and is unsuitable for dense prediction; (b) FPS sampling reduces the points from $N$ to $N_1$, causing detail loss; (c) directly upsampling from $N_1$ to $N$ introduces significant noise.
- Design Motivation: Identifies two core bottlenecks: the lack of local geometric information and poor upsampling quality.
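The detail loss in limitations (b) and (c) can be illustrated with a small NumPy sketch: farthest point sampling (FPS) keeps only a few points, and a naive upsampling step then copies each kept feature back to every nearby point. The summary does not say how the vanilla baseline upsamples, so the nearest-neighbor copy used here is an assumption, as are the function names and the toy coordinates.

```python
import numpy as np

def farthest_point_sampling(xyz, n_samples, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    chosen = [start]
    dist = np.linalg.norm(xyz - xyz[start], axis=1)
    for _ in range(n_samples - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        # Track each point's distance to its closest chosen point.
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[nxt], axis=1))
    return np.array(chosen)

def nearest_upsample(xyz_dense, xyz_sparse, feat_sparse):
    """Assumed upsampling: copy the feature of the nearest sparse point."""
    d = np.linalg.norm(xyz_dense[:, None, :] - xyz_sparse[None, :, :], axis=2)
    return feat_sparse[np.argmin(d, axis=1)]

# Two well-separated clusters along the x-axis.
xyz = np.array([[0.0, 0, 0], [0.1, 0, 0], [0.2, 0, 0], [5.0, 0, 0], [5.1, 0, 0]])
idx = farthest_point_sampling(xyz, 2)
sparse_feats = np.eye(2)                       # one distinct feature per kept point
dense_feats = nearest_upsample(xyz, xyz[idx], sparse_feats)
```

All points in a cluster end up with identical features after upsampling, which is the kind of lost local detail that GEM and GFP are designed to recover.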

2. Geometric Enhancer Module (GEM)

- Function: Extracts the local geometric context of the entire scene and injects it into the intermediate features of the point cloud encoder via cross-attention.
- Mechanism:
  - GEM consists of 3 KPConv + BN + ReLU blocks, outputting geometric features $\vec{g}_f \in \mathbb{R}^{N \times D}$, which preserves the information of all $N$ points.
  - The geometric features are injected into each block of the encoder via cross-attention: $\hat{\vec{f}_i} = \vec{f}_i + g_i \cdot \text{softmax}\left(\frac{\vec{f}_i \vec{g}_f^T}{\sqrt{D}}\right) \vec{g}_f$
  - A learnable gating factor $g_i$ is initialized to zero, ensuring that the feature distribution of the pretrained weights is not abruptly altered.
- Design Motivation: KPConv is naturally suited for extracting local 3D geometric information (vs. ordinary linear layers); the gating factor protects pretrained weights; this is similar to the concept of using ConvStem to enhance ViT's ability to capture local information in 2D.
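The gated injection formula can be sketched directly in NumPy. The snippet follows the equation as written above, without learned query/key/value projections; a real implementation may well include them, so treat the function `gated_injection` and its shapes as illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_injection(f_i, g_f, gate):
    """f_hat = f_i + gate * softmax(f_i g_f^T / sqrt(D)) g_f

    f_i:  (N_i, D) encoder features at block i
    g_f:  (N, D)   geometric features from GEM
    gate: scalar gating factor g_i (learnable in the model, zero at init)
    """
    d = f_i.shape[-1]
    attn = softmax(f_i @ g_f.T / np.sqrt(d))   # (N_i, N), rows sum to 1
    return f_i + gate * (attn @ g_f)

rng = np.random.default_rng(0)
f_i = rng.normal(size=(4, 8))
g_f = rng.normal(size=(10, 8))
at_init = gated_injection(f_i, g_f, gate=0.0)  # identical to f_i
later = gated_injection(f_i, g_f, gate=1.0)    # geometry mixed in
```

With the gate at zero the block is an identity mapping, which is why zero initialization leaves the pretrained encoder's feature distribution untouched at the start of training.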

3. Geometric-guided Feature Propagation (GFP)

- Function: Performs high-quality upsampling from sparse point features to dense point-wise embeddings.
- Mechanism:
  - High-level features $\vec{f}_3, \vec{f}_4$ are upsampled to $N_3, N_2$ points via PointNet++ propagation.
  - Geometric features $\vec{g}_f$ are downsampled to the same number of points via FPS.
  - The upsampled and downsampled features are concatenated and fused through FC + ReLU layers.
  - The last-layer feature $\vec{f}_5$ is concatenated with the hidden-state embedding output by the LLM so that it carries multimodal information.
  - **Attentive Propagation**: Cross-attention is employed to exchange information across different point densities: $\hat{\tilde{\vec{f}}}_4 = \tilde{\vec{f}}_4 + \text{softmax}\left(\frac{\tilde{\vec{f}}_4 \vec{f}_{54}^T}{\sqrt{D}}\right)\vec{f}_{54}$
- Design Motivation: Avoids the information loss caused by direct upsampling; the geometric features act as a "golden reference" that guides the upsampling process.
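The first three GFP steps (propagate, align the geometric features, fuse) can be sketched as follows. The interpolation uses the inverse-squared-distance weighting over the 3 nearest neighbors that PointNet++ describes for feature propagation; the function names, the toy shapes, and the single FC layer are assumptions for illustration, and the FPS downsampling of $\vec{g}_f$ is replaced by random stand-in features.

```python
import numpy as np

def idw_interpolate(xyz_target, xyz_source, feat_source, k=3, eps=1e-8):
    """Interpolate source features onto target points (k-NN, inverse squared distance)."""
    d = np.linalg.norm(xyz_target[:, None, :] - xyz_source[None, :, :], axis=2)  # (T, S)
    nn = np.argsort(d, axis=1)[:, :k]                    # indices of k nearest sources
    nn_d = np.take_along_axis(d, nn, axis=1)             # (T, k) distances
    w = 1.0 / (nn_d ** 2 + eps)
    w = w / w.sum(axis=1, keepdims=True)                 # normalize weights per target
    return (w[..., None] * feat_source[nn]).sum(axis=1)  # (T, C)

def fuse(upsampled, geometric, weight, bias):
    """Concatenate upsampled and geometric features, then apply FC + ReLU."""
    x = np.concatenate([upsampled, geometric], axis=1)
    return np.maximum(x @ weight + bias, 0.0)

src_xyz = np.array([[0.0, 0, 0], [1.0, 0, 0], [0.0, 1, 0], [0.0, 0, 1]])
src_feat = np.arange(8, dtype=float).reshape(4, 2)       # sparse high-level features
tgt_xyz = np.array([[0.0, 0, 0], [0.9, 0, 0]])           # denser target positions
up = idw_interpolate(tgt_xyz, src_xyz, src_feat)

rng = np.random.default_rng(0)
geo = rng.normal(size=(2, 3))                            # stand-in for downsampled g_f
fused = fuse(up, geo, weight=rng.normal(size=(5, 4)), bias=np.zeros(4))
```

A target that coincides with a source point recovers that point's feature almost exactly, while other targets receive a convex combination of their neighbors' features.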

4. Task Unification and the Instruct3D Dataset

- Function: Handles four segmentation tasks within a unified model using task-specific prompts.
- Semantic segmentation template: "Can you segment the {category} in this point cloud?" $\rightarrow$ "{category} \<SEG\>"
- Referring segmentation template: "Can you segment the object {description}?" $\rightarrow$ "{category} \<SEG\>"
- Instruct3D contains 2,565 instruction-point cloud pairs from 280 scenes in ScanNet++, supporting multi-target and zero-target scenarios.
- Design Motivation: Implicit instructions require reasoning capabilities (e.g., "Where can I sit?" $\rightarrow$ segmenting a chair), which is unsupported by existing datasets.
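The two templates quoted above can be turned into question/answer training pairs with a few lines of Python. Only the semantic and referring templates appear in this summary, so the sketch covers just those two; the helper name `build_example` and the task keys are hypothetical.

```python
SEG_TOKEN = "<SEG>"

# Templates copied from the summary above; the dictionary keys are hypothetical.
TEMPLATES = {
    "semantic": "Can you segment the {category} in this point cloud?",
    "referring": "Can you segment the object {description}?",
}

def build_example(task, category, description=None):
    """Return a (question, answer) pair for one training sample."""
    if task == "semantic":
        question = TEMPLATES[task].format(category=category)
    elif task == "referring":
        if description is None:
            raise ValueError("referring segmentation needs a description")
        question = TEMPLATES[task].format(description=description)
    else:
        raise ValueError(f"no template known for task {task!r}")
    # The answer always ends with the <SEG> token whose embedding drives the mask.
    return question, f"{category} {SEG_TOKEN}"

sem_q, sem_a = build_example("semantic", "chair")
ref_q, ref_a = build_example("referring", "chair", "next to the window")
```

Since every task is reduced to a text prompt that elicits a `<SEG>` token, switching tasks only changes the prompt, not the model.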

Loss & Training

Total Loss: \(\mathcal{L} = \lambda_{txt}\mathcal{L}_{txt} + \lambda_{bce}\mathcal{L}_{bce} + \lambda_{dice}\mathcal{L}_{dice}\)

  • \(\mathcal{L}_{txt}\): Autoregressive cross-entropy loss (text generation)
  • \(\mathcal{L}_{bce}\): Binary cross-entropy loss (segmentation mask)
  • \(\mathcal{L}_{dice}\): DICE loss (segmentation mask)
  • Weights: \(\lambda_{txt}=1.0, \lambda_{bce}=2.0, \lambda_{dice}=2.0\)
  • The model is trained jointly on all datasets and then fine-tuned on each specific dataset for evaluation.
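A minimal NumPy sketch of the weighted objective is shown below, using the weights listed above. The text loss is passed in as a precomputed scalar, and the Dice smoothing constant and function names are assumptions rather than details from the paper.

```python
import numpy as np

def bce_loss(logits, target):
    """Numerically stable binary cross-entropy on logits, averaged over points."""
    return np.mean(np.maximum(logits, 0) - logits * target
                   + np.log1p(np.exp(-np.abs(logits))))

def dice_loss(logits, target, smooth=1.0):
    """Soft Dice loss; `smooth` is an assumed stabilizing constant."""
    p = 1.0 / (1.0 + np.exp(-logits))
    inter = (p * target).sum()
    return 1.0 - (2.0 * inter + smooth) / (p.sum() + target.sum() + smooth)

def total_loss(l_txt, logits, target, w_txt=1.0, w_bce=2.0, w_dice=2.0):
    """L = w_txt * L_txt + w_bce * L_bce + w_dice * L_dice."""
    return (w_txt * l_txt
            + w_bce * bce_loss(logits, target)
            + w_dice * dice_loss(logits, target))

target = np.array([1.0, 1.0, 0.0, 0.0])        # ground-truth mask for 4 points
good = np.array([10.0, 10.0, -10.0, -10.0])    # confident and correct logits
bad = -good                                    # confident and wrong logits
loss_good = total_loss(0.5, good, target)
loss_bad = total_loss(0.5, bad, target)
```

With a near-perfect mask both mask terms vanish and the total reduces to the text loss, while a confidently wrong mask is penalized by both the BCE and the Dice term.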

Key Experimental Results

Instruction Segmentation (Instruct3D)

| Stage | Method | Acc | mIoU |
|---|---|---|---|
| Two-stage | ScanRefer | 12.0 | 6.9 |
| Two-stage | M3DRef-CLIP | 18.1 | 12.8 |
| Single | BUTD-DETR* | 16.3 | 10.9 |
| Single | EDA* | 16.6 | 12.1 |
| Single | SegPoint† (vanilla) | 21.8 | 16.1 |
| Single | SegPoint | 31.6 | 27.5 |

SegPoint achieves a +14.7 mIoU gain compared to the best baseline (27.5 vs 12.8).

Semantic Segmentation

| Method | ScanNet | ScanNet200 | S3DIS |
|---|---|---|---|
| PTv2 | 75.4 | 30.2 | 71.6 |
| OctFormer | 75.7 | 32.6 | - |
| Swin3D | 75.5 | - | 72.5 |
| SegPoint | 74.1 | 35.3 | 72.4 |

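SegPoint surpasses the best listed baseline (OctFormer, 32.6) by +2.7 mIoU on the category-rich ScanNet200, while staying within 1.6 points of the best result on ScanNet (74.1 vs. 75.7) and roughly on par on S3DIS (72.4 vs. 72.5).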

Referring Segmentation

| Method | ScanRefer | Nr3D | Multi3DRefer |
|---|---|---|---|
| M3DRef-CLIP | 35.7 | 27.0 | 32.6 |
| 3D-STMN | 39.5 | - | - |
| RefMask3D | 44.8 | - | - |
| SegPoint | 41.7 | 32.2 | 36.1 |

SegPoint outperforms M3DRef-CLIP by +3.5 mIoU on Multi3DRefer and by +5.2 on Nr3D, while trailing RefMask3D on ScanRefer (41.7 vs. 44.8).

Ablation Study

| GEM | GFP | Instruct3D mIoU | ScanRefer mIoU |
|---|---|---|---|
| | | 16.1 | 30.3 |
| ✓ | | 21.4 | 35.8 |
| | ✓ | 23.2 | 38.1 |
| ✓ | ✓ | 27.5 | 41.7 |

Key Findings

  • GEM and GFP each contribute substantially on their own (+5.3 and +7.1 mIoU on Instruct3D, respectively), and combining them gives the best result (+11.4 over the baseline), indicating that the two modules are complementary.
  • GEM outperforms full fine-tuning, LoRA, and MLP adapters, showing that the improvement does not merely stem from increased parameter size.
  • Even the vanilla baseline (SegPoint†) outperforms all prior methods on Instruct3D (16.1 vs. 12.8 mIoU for the best of them), supporting the use of LLMs as segmentation engines.
  • In open-vocabulary segmentation, SegPoint is reported to outperform several supervised methods, although on ScanNet++ it remains below supervised KPConv (19.3 vs. 30.0); the result still points to strong generalization ability.
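The ablation gains quoted in the first finding can be checked with a few lines of arithmetic. The mapping of the two middle rows to "GEM only" and "GFP only" follows the order of the gains stated above and is otherwise an assumption.

```python
# Instruct3D mIoU from the ablation study; the row labels are an assumed reading.
ablation = {
    "baseline": 16.1,
    "gem_only": 21.4,
    "gfp_only": 23.2,
    "gem_and_gfp": 27.5,
}

# Gain of each configuration over the vanilla baseline, rounded to one decimal.
gains = {
    name: round(score - ablation["baseline"], 1)
    for name, score in ablation.items()
    if name != "baseline"
}

# Sum of the two individual gains, for comparison with the combined gain.
sum_of_individual = round(gains["gem_only"] + gains["gfp_only"], 1)
```

The combined gain (+11.4) is slightly below the sum of the individual gains (+12.4), so the two modules overlap a little but still add up to far more than either one alone.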

Highlights & Insights

  • Elegance of a Unified Framework: For the first time, four 3D segmentation tasks are unified within a single model. Switching between tasks is achieved via task-specific prompts, eliminating the need to design separate models for each task. This approach aligns with the trend of unified segmentation models in the 2D domain (e.g., SAM).
  • Necessity of Geometric Enhancement: Pretrained point cloud encoders (e.g., Uni3D) are designed for scene-level classification, leading to poor performance when directly applied to dense prediction. GEM, by extracting local geometry via KPConv paired with gated injection, adapts to dense prediction at minimal cost.
  • Instruct3D Pioneering a New Task: Extending 3D segmentation from explicit descriptions to implicit instruction understanding requires world knowledge and reasoning capabilities, driving the development of more intelligent 3D perception systems.

Limitations & Future Work

  • The current framework supports only text prompts; it does not accept non-textual prompts such as points or boxes, which SAM-style prompting allows.
  • Training requires 4×A100 GPUs for approximately 3 days, indicating a high computational cost.
  • The Instruct3D dataset is relatively small (only 2,565 pairs) and needs expansion in the future.
  • Open-vocabulary segmentation relies on GPT-4 for category name matching, introducing an extra dependency.

Comparison with Related Work

  • vs. LISA (2D): SegPoint borrows the paradigm of injecting a `<SEG>` token into the LLM from LISA, but performs end-to-end mask generation without relying on external pretrained segmentation models like SAM.
  • vs. Mask3D/SPFormer: While these Transformer-based methods excel at standard segmentation, they cannot handle language interaction and implicit instructions.
  • vs. 3D-LLM/PointLLM: These methods focus primarily on scene-level understanding, lacking the capability for fine-grained point-wise segmentation.

Rating

  • Novelty: ⭐⭐⭐⭐ First to unify four 3D segmentation tasks + new Instruct3D task
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks across four tasks + detailed ablation
  • Writing Quality: ⭐⭐⭐⭐ Clear problem definition, well-motivated module designs
  • Value: ⭐⭐⭐⭐ Driving the transition of 3D segmentation toward a unified, intelligent framework