SegPoint: Segment Any Point Cloud via Large Language Model

Conference: ECCV 2024
arXiv: 2407.13761
Code: https://heshuting555.github.io/SegPoint
Area: 3D Vision / Point Cloud Segmentation
Keywords: 3D point cloud segmentation, LLM, unified framework, instruction segmentation, geometric feature

TL;DR

SegPoint is the first model to use the reasoning capabilities of a multimodal LLM within a unified framework to handle four tasks: 3D instruction segmentation, referring segmentation, semantic segmentation, and open-vocabulary segmentation. The authors also construct the Instruct3D benchmark (2,565 instruction-point cloud pairs), on which SegPoint achieves an mIoU of 27.5%.

Background & Motivation

Background: Significant progress has been made in 3D point cloud segmentation (e.g., Mask3D, SPFormer), but each model typically addresses only one specific segmentation task, lacking unification.

Limitations of Prior Work: (a) Existing methods rely on predefined categories or explicit textual descriptions, failing to understand implicit human intent (e.g., "Where can I sit?"); (b) different tasks require different models, which is inefficient and impractical.

Key Challenge: Real-world scenarios require models to understand implicit instructions and perform reasoning, yet current methods lack reasoning capability; meanwhile, a unified framework is needed to handle multiple segmentation tasks.

Key Insight: Leveraging the reasoning and world knowledge of LLMs to understand complex/implicit instructions, combined with a geometric enhancer module to compensate for the limitations of point cloud encoders in dense prediction.

Core Idea: Injecting LLM reasoning capabilities into 3D point cloud segmentation, achieving high-quality point-wise segmentation through geometric enhancement and feature propagation.

Method

Overall Architecture

SegPoint consists of four core components: (1) a pretrained point cloud encoder \(\mathcal{E}\) (Uni3D) to extract point cloud features; (2) a large language model \(\mathcal{F}\) (LLaMA2-7B) to provide reasoning capabilities; (3) a Geometric Enhancer Module (GEM) \(\mathcal{G}\) to extract local geometric information and inject it into the encoder; and (4) a Geometric-guided Feature Propagation (GFP) block \(\mathcal{P}\) to generate high-quality point-wise embeddings for precise segmentation.

The inputs are the point cloud \(\vec{i}_{point} \in \mathbb{R}^{N \times (3+F)}\) and the text instruction \(\vec{i}_{txt}\). The segmentation mask is generated by taking the dot product between the embedding of the <SEG> token in the LLM output and the point-wise features.

Key Designs

1. Vanilla Baseline and Its Limitations

- Function: Directly inputs point cloud encoder features into the LLM, detects the `<SEG>` token to generate the mask embedding $\vec{h}_{seg} = \gamma(\vec{y}_{[seg]})$, and performs a dot product with the upsampled point-wise embeddings to obtain the mask $\vec{m} = \vec{h}_{seg} \otimes \text{UpS.}(\vec{f}_{point})$.
- Limitations: (a) The point cloud encoder is trained for scene-level classification and is unsuitable for dense prediction; (b) FPS sampling reduces the points from $N$ to $N_1$, causing detail loss; (c) directly upsampling from $N_1$ to $N$ introduces significant noise.
- Design Motivation: Identifies two core bottlenecks: the lack of local geometric information and poor upsampling quality.
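The vanilla mask computation can be illustrated with a small pure-Python sketch. It is not the authors' code: nearest-neighbor copying stands in for the unspecified upsampling operator $\text{UpS.}$, and the sigmoid plus 0.5 threshold used to binarize the logits is an assumption. The names `nearest_upsample` and `predict_mask` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest_upsample(sparse_xyz, sparse_feats, dense_xyz):
    """Assumed stand-in for UpS.: copy the feature of the nearest sparse point."""
    out = []
    for p in dense_xyz:
        j = min(range(len(sparse_xyz)), key=lambda k: math.dist(p, sparse_xyz[k]))
        out.append(sparse_feats[j])
    return out

def predict_mask(h_seg, dense_feats, threshold=0.5):
    """Per-point logit is <h_seg, f_point>; sigmoid + threshold is an assumption."""
    probs = [1.0 / (1.0 + math.exp(-dot(h_seg, f))) for f in dense_feats]
    return [p > threshold for p in probs]

# Toy scene: N_1 = 2 sparse points, N = 4 dense points, D = 2.
sparse_xyz = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
sparse_feats = [[2.0, 0.0], [-2.0, 0.0]]
dense_xyz = [(0.1, 0.0, 0.0), (0.2, 0.1, 0.0), (0.9, 0.0, 0.0), (1.1, 0.0, 0.0)]
h_seg = [1.0, 0.0]  # mask embedding derived from the <SEG> token

mask = predict_mask(h_seg, nearest_upsample(sparse_xyz, sparse_feats, dense_xyz))
print(mask)
```

Every dense point inherits the feature of one sparse point, so detail lost when sampling from $N$ to $N_1$ points cannot be recovered at this stage, which is the weakness listed under limitations (b) and (c).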

2. Geometric Enhancer Module (GEM)

- Function: Extracts the local geometric context of the entire scene and injects it into the intermediate features of the point cloud encoder via cross-attention.
- Mechanism:
  - GEM consists of 3 KPConv + BN + ReLU blocks, outputting geometric features $\vec{g}_f \in \mathbb{R}^{N \times D}$, which preserves the information of all $N$ points.
  - The geometric features are injected into each block of the encoder via cross-attention: $\hat{\vec{f}_i} = \vec{f}_i + g_i \cdot \text{softmax}\left(\frac{\vec{f}_i \vec{g}_f^T}{\sqrt{D}}\right) \vec{g}_f$
  - A learnable gating factor $g_i$ is initialized to zero, ensuring that the feature distribution of the pretrained weights is not abruptly altered.
- Design Motivation: KPConv is naturally suited for extracting local 3D geometric information (vs. ordinary linear layers); the gating factor protects pretrained weights; this is similar to the concept of using ConvStem to enhance ViT's ability to capture local information in 2D.
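A minimal sketch of the gated injection formula above, written in pure Python for self-containment. It follows the equation as stated, without learned query/key/value projections (whether the actual model uses such projections is not specified here), and the function name `gated_cross_attention` is hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def gated_cross_attention(f, g_f, gate):
    """f: N_1 x D encoder tokens, g_f: N x D geometric features, gate: scalar g_i.

    Returns f + gate * softmax(f g_f^T / sqrt(D)) g_f, row by row.
    """
    d = len(f[0])
    out = []
    for q in f:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in g_f]
        w = softmax(scores)
        attended = [sum(w[j] * g_f[j][c] for j in range(len(g_f))) for c in range(d)]
        out.append([q[c] + gate * attended[c] for c in range(d)])
    return out

f = [[1.0, 0.0]]
g_f = [[1.0, 0.0], [0.0, 1.0]]

at_init = gated_cross_attention(f, g_f, gate=0.0)   # gate initialized to zero
after = gated_cross_attention(f, g_f, gate=1.0)     # gate fully open
print(at_init, after)
```

With the gate at zero the output equals the input tokens, so the pretrained encoder behaves exactly as before at the start of training; the geometric signal only enters as the gate is learned.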

3. Geometric-guided Feature Propagation (GFP)

- Function: Performs high-quality upsampling from sparse point features to dense point-wise embeddings.
- Mechanism:
  - High-level features $\vec{f}_3, \vec{f}_4$ are upsampled to $N_3, N_2$ points via PointNet++ propagation.
  - Geometric features $\vec{g}_f$ are downsampled to the same number of points via FPS.
  - The upsampled and downsampled features are concatenated and fused through FC + ReLU layers.
  - The last-layer feature $\vec{f}_5$ is concatenated with the hidden state embedding output by the LLM so that it carries multimodal information.
  - **Attentive Propagation**: Cross-attention is employed to exchange information across different point densities: $\hat{\tilde{\vec{f}}}_4 = \tilde{\vec{f}}_4 + \text{softmax}\left(\frac{\tilde{\vec{f}}_4 \vec{f}_{54}^T}{\sqrt{D}}\right)\vec{f}_{54}$
- Design Motivation: Avoids the information loss caused by direct upsampling; the geometric features act as a "golden reference" that guides the upsampling process.
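The first three steps of the mechanism can be sketched in pure Python under stated assumptions: greedy farthest point sampling, inverse-squared-distance interpolation over the $k$ nearest neighbors (the interpolation commonly described for PointNet++ feature propagation), and plain concatenation with the FC + ReLU fusion layer omitted. The function names and toy data are hypothetical, and attentive propagation is not covered.

```python
import math

def farthest_point_sampling(xyz, m, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = [start]
    min_d = [math.dist(p, xyz[start]) for p in xyz]
    while len(chosen) < m:
        nxt = max(range(len(xyz)), key=lambda i: min_d[i])
        chosen.append(nxt)
        min_d = [min(d, math.dist(p, xyz[nxt])) for d, p in zip(min_d, xyz)]
    return chosen

def propagate(sparse_xyz, sparse_feats, dense_xyz, k=3, eps=1e-8):
    """Interpolate sparse features onto denser points with 1/d^2 weights."""
    dim = len(sparse_feats[0])
    out = []
    for p in dense_xyz:
        nn = sorted(range(len(sparse_xyz)),
                    key=lambda j: math.dist(p, sparse_xyz[j]))[:k]
        w = [1.0 / (math.dist(p, sparse_xyz[j]) ** 2 + eps) for j in nn]
        total = sum(w)
        out.append([sum(wi * sparse_feats[j][c] for wi, j in zip(w, nn)) / total
                    for c in range(dim)])
    return out

def fuse(upsampled, geometric):
    """Concatenate per point; the FC + ReLU layers are left out of this sketch."""
    return [u + g for u, g in zip(upsampled, geometric)]

# Toy scene: 5 points on a line, geometric features g_f for all N points.
xyz = [(float(i), 0.0, 0.0) for i in range(5)]
g_f = [[10.0 * i] for i in range(5)]

target_idx = farthest_point_sampling(xyz, 3)   # denser level
coarse_idx = target_idx[:2]                    # sparser level (prefix of FPS order)
coarse_feats = [[xyz[i][0]] for i in coarse_idx]

upsampled = propagate([xyz[i] for i in coarse_idx], coarse_feats,
                      [xyz[i] for i in target_idx], k=2)
geometric = [g_f[i] for i in target_idx]       # g_f downsampled to the same points
fused = fuse(upsampled, geometric)
print(target_idx, fused)
```

The point of the sketch is that the interpolated features and the downsampled geometric features live on the same set of points, so they can be concatenated point by point before fusion.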

4. Task Unification and the Instruct3D Dataset

- Function: Handles four segmentation tasks within a unified model using task-specific prompts.
- Semantic segmentation template: "Can you segment the {category} in this point cloud?" $\rightarrow$ "{category} \<SEG\>"
- Referring segmentation template: "Can you segment the object {description}?" $\rightarrow$ "{category} \<SEG\>"
- Instruct3D contains 2,565 instruction-point cloud pairs from 280 scenes in ScanNet++, supporting multi-target and zero-target scenarios.
- Design Motivation: Implicit instructions require reasoning capabilities (e.g., "Where can I sit?" $\rightarrow$ segmenting a chair), which is unsupported by existing datasets.
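As an illustration of how task-specific prompts could be assembled, the sketch below fills the two templates quoted above. Only those two templates come from the text; the function name `build_sample` and its signature are hypothetical, and templates for instruction and open-vocabulary segmentation are deliberately not guessed.

```python
SEG_TOKEN = "<SEG>"

def build_sample(task, category, description=None):
    """Return a (question, answer) pair for the templates quoted in the notes."""
    if task == "semantic":
        question = f"Can you segment the {category} in this point cloud?"
    elif task == "referring":
        if description is None:
            raise ValueError("referring segmentation needs a description")
        question = f"Can you segment the object {description}?"
    else:
        raise ValueError(f"no template given in the notes for task: {task}")
    answer = f"{category} {SEG_TOKEN}"
    return question, answer

print(build_sample("semantic", "chair"))
print(build_sample("referring", "chair", description="next to the window"))
```

Because every answer ends with the same special token, the mask decoder can be shared across tasks; only the question text changes.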

Loss & Training

Total Loss: \(\mathcal{L} = \lambda_{txt}\mathcal{L}_{txt} + \lambda_{bce}\mathcal{L}_{bce} + \lambda_{dice}\mathcal{L}_{dice}\)

\(\mathcal{L}_{txt}\): Autoregressive cross-entropy loss (text generation)
\(\mathcal{L}_{bce}\): Binary cross-entropy loss (segmentation mask)
\(\mathcal{L}_{dice}\): DICE loss (segmentation mask)
Weights: \(\lambda_{txt}=1.0, \lambda_{bce}=2.0, \lambda_{dice}=2.0\)
The model is trained jointly on all datasets and then fine-tuned on the specific dataset used for each evaluation.
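The weighted sum can be sketched as follows. The weights are those listed above; the exact forms of the per-point BCE and Dice terms (including the smoothing constant) are common formulations assumed for illustration, and the text loss is passed in as a precomputed number.

```python
import math

LAMBDA_TXT, LAMBDA_BCE, LAMBDA_DICE = 1.0, 2.0, 2.0  # weights from the notes

def bce_loss(probs, target, eps=1e-7):
    """Mean binary cross-entropy over points (probabilities clamped for stability)."""
    total = 0.0
    for p, t in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)

def dice_loss(probs, target, smooth=1.0):
    """1 - Dice coefficient; the smoothing term is an assumed convention."""
    inter = sum(p * t for p, t in zip(probs, target))
    return 1.0 - (2.0 * inter + smooth) / (sum(probs) + sum(target) + smooth)

def total_loss(l_txt, probs, target):
    return (LAMBDA_TXT * l_txt
            + LAMBDA_BCE * bce_loss(probs, target)
            + LAMBDA_DICE * dice_loss(probs, target))

target = [1.0, 0.0, 1.0]
perfect = total_loss(0.5, [1.0, 0.0, 1.0], target)
poor = total_loss(0.5, [0.2, 0.8, 0.3], target)
print(perfect, poor)
```

With a perfect mask both mask terms vanish (up to clamping), so the total reduces to the text loss; the two mask terms carry twice the weight of the text term whenever the mask is wrong.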

Key Experimental Results

Instruction Segmentation (Instruct3D)

Stage	Method	Acc	mIoU
Two-stage	ScanRefer	12.0	6.9
Two-stage	M3DRef-CLIP	18.1	12.8
Single	BUTD-DETR*	16.3	10.9
Single	EDA*	16.6	12.1
Single	SegPoint† (vanilla)	21.8	16.1
Single	SegPoint	31.6	27.5

SegPoint achieves a +14.7 mIoU gain compared to the best baseline (27.5 vs 12.8).

Semantic Segmentation

Method	ScanNet	ScanNet200	S3DIS
PTv2	75.4	30.2	71.6
OctFormer	75.7	32.6	-
Swin3D	75.5	-	72.5
SegPoint	74.1	35.3	72.4

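SegPoint exceeds the strongest listed baseline by +2.7 mIoU on the category-rich ScanNet200 (35.3 vs. 32.6 for OctFormer), while remaining slightly below the best listed results on ScanNet (74.1 vs. 75.7) and S3DIS (72.4 vs. 72.5).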

Referring Segmentation

Method	ScanRefer	Nr3D	Multi3DRefer
M3DRef-CLIP	35.7	27.0	32.6
3D-STMN	39.5	-	-
RefMask3D	44.8	-	-
SegPoint	41.7	32.2	36.1

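SegPoint outperforms M3DRef-CLIP by +3.5 mIoU on Multi3DRefer (36.1 vs. 32.6) and by +5.2 on Nr3D (32.2 vs. 27.0); on ScanRefer it trails RefMask3D (41.7 vs. 44.8).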

Ablation Study

GEM	GFP	Instruct3D mIoU	ScanRefer mIoU
✗	✗	16.1	30.3
✓	✗	21.4	35.8
✗	✓	23.2	38.1
✓	✓	27.5	41.7
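The gains implied by the ablation table can be recomputed directly from its four rows. The snippet below only restates the table values; the dictionary layout is an illustrative choice.

```python
# (GEM, GFP) -> (Instruct3D mIoU, ScanRefer mIoU), copied from the table above
ablation = {
    (False, False): (16.1, 30.3),
    (True, False): (21.4, 35.8),
    (False, True): (23.2, 38.1),
    (True, True): (27.5, 41.7),
}

base = ablation[(False, False)]
gains = {
    cfg: tuple(round(v - b, 1) for v, b in zip(vals, base))
    for cfg, vals in ablation.items()
}

for cfg, (d_instruct, d_scanrefer) in gains.items():
    print(cfg, d_instruct, d_scanrefer)
```

On Instruct3D this gives +5.3 for GEM alone, +7.1 for GFP alone, and +11.4 for both; on ScanRefer the corresponding gains are +5.5, +7.8, and +11.4.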

Key Findings

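GEM and GFP each contribute substantially on their own (+5.3 and +7.1 mIoU on Instruct3D, respectively), and combining them raises the gain over the vanilla baseline to +11.4 mIoU (16.1 to 27.5), indicating that the two modules are complementary.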
GEM outperforms full fine-tuning, LoRA, and MLP adapters, showing that the improvement does not merely stem from increased parameter size.
Even the vanilla baseline (SegPoint†) outperforms all compared methods on Instruct3D (16.1 vs. 12.8 mIoU for the best prior method), supporting the effectiveness of LLMs as segmentation engines.
In open-vocabulary segmentation, SegPoint reaches 19.3 mIoU on ScanNet++, compared with 30.0 for a supervised KPConv. The figures quoted here place it below that supervised reference, so they point to generalization ability rather than a win over supervised training.

Highlights & Insights

Elegance of a Unified Framework: For the first time, four 3D segmentation tasks are unified within a single model. Switching between tasks is achieved via task-specific prompts, eliminating the need to design separate models for each task. This approach aligns with the trend of unified segmentation models in the 2D domain (e.g., SAM).
Necessity of Geometric Enhancement: Pretrained point cloud encoders (e.g., Uni3D) are designed for scene-level classification, leading to poor performance when directly applied to dense prediction. GEM, by extracting local geometry via KPConv paired with gated injection, adapts to dense prediction at minimal cost.
Instruct3D Pioneering a New Task: Extending 3D segmentation from explicit descriptions to implicit instruction understanding requires world knowledge and reasoning capabilities, driving the development of more intelligent 3D perception systems.

Limitations & Future Work

The current framework supports only text prompts; it does not accept non-textual prompts such as points or boxes, which SAM-style prompting allows.
Training requires 4×A100 GPUs for approximately 3 days, indicating a high computational cost.
The Instruct3D dataset is relatively small (only 2,565 pairs) and needs expansion in the future.
Open-vocabulary segmentation relies on GPT-4 for category name matching, introducing an extra dependency.

vs. LISA (2D): SegPoint borrows the paradigm of injecting a <SEG> token into the LLM from LISA, but performs end-to-end mask generation without relying on external pretrained segmentation models like SAM.
vs. Mask3D/SPFormer: While these Transformer-based methods excel at standard segmentation, they cannot handle language interaction and implicit instructions.
vs. 3D-LLM/PointLLM: These methods focus primarily on scene-level understanding, lacking the capability for fine-grained point-wise segmentation.

Rating

Novelty: ⭐⭐⭐⭐ First to unify four 3D segmentation tasks + new Instruct3D task
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks across four tasks + detailed ablation
Writing Quality: ⭐⭐⭐⭐ Clear problem definition, well-motivated module designs
Value: ⭐⭐⭐⭐ Driving the transition of 3D segmentation toward a unified, intelligent framework