Click-Gaussian: Interactive Segmentation to Any 3D Gaussians¶

Conference: ECCV 2024
arXiv: 2407.11793
Code: https://seokhunchoi.github.io/Click-Gaussian (Project Page)
Area: 3D Vision
Keywords: 3D Gaussian Splatting, Interactive Segmentation, Feature Fields, Contrastive Learning, View Consistency

TL;DR¶

This work proposes Click-Gaussian, which learns a discriminative 3D feature field with two-level granularity (coarse/fine) and combines it with Global Feature-guided Learning (GFL) to address cross-view mask inconsistency. It achieves real-time interactive 3D Gaussian segmentation at only 10ms per click, which is 15-130 times faster than existing methods while significantly improving segmentation accuracy.

Background & Motivation¶

3D Gaussian Splatting (3DGS) has gained significant attention in 3D scene manipulation because of its real-time rendering capability. Accurate segmentation of objects in a scene is a fundamental requirement for applications such as editing and virtual reality. Existing 3DGS-based segmentation methods have several limitations:

Time-consuming post-processing: Methods like SAGA require extensive post-processing on noisy segmentation outputs to obtain clean results, which severely constrains the efficiency benefits of 3DGS real-time rendering.
Difficulty in fine-grained segmentation: Existing methods struggle to provide precise segmentation results, failing to meet the demands of fine-grained 3D scene manipulation.
Cross-view mask inconsistency: 2D segmentation results obtained independently from different views conflict with each other, hindering 3D feature learning.

Key Challenge: How to learn a feature field that is both discriminative in 3D space and view-consistent, without relying on time-consuming post-processing?

Key Insight: Design a two-level granularity feature field representation combined with a global feature-guided learning strategy to fundamentally eliminate the reliance on post-processing.

Method¶

Overall Architecture¶

Starting from a pre-trained 3DGS scene, Click-Gaussian augments each 3D Gaussian with a $D$-dimensional feature vector $\mathbf{f}_i \in \mathbb{R}^D$, split into coarse-level and fine-level features. It uses SAM to automatically generate 2D masks for all training views, builds two-level (coarse/fine) masks from them, trains the feature field via contrastive learning, and introduces GFL to enforce cross-view consistency. Once trained, a user click yields a segmentation in about 10ms.
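
The text does not spell out how a click is turned into a selection, so the following is only a minimal sketch under an assumption: the feature at the clicked location is compared with every Gaussian's feature by cosine similarity, and Gaussians above a threshold are selected. The function name `segment_from_click` and the threshold value `0.9` are hypothetical and not taken from the paper.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)


def segment_from_click(click_feat: List[float],
                       gauss_feats: List[List[float]],
                       threshold: float = 0.9) -> List[int]:
    """Return indices of Gaussians whose feature is similar to the clicked one.

    Assumption: selection is a plain similarity threshold; `threshold` is a
    hypothetical value chosen for illustration.
    """
    return [i for i, f in enumerate(gauss_feats)
            if cosine(click_feat, f) > threshold]


# Toy scene with three Gaussians and 2-D features.
gauss = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
selected = segment_from_click([1.0, 0.0], gauss)
print(selected)  # [0, 1]
```

A single pass of similarity tests over precomputed features needs no post-processing, which is consistent with the short per-click latency reported above.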

Key Designs¶

Two-level Granularity Feature Fields:
- Each Gaussian's feature $\mathbf{f}_i$ is decomposed into a coarse-level feature $\mathbf{f}_i^c \in \mathbb{R}^{D^c}$ and a complementary part $\bar{\mathbf{f}}_i^c \in \mathbb{R}^{D-D^c}$.
- The coarse-level feature is directly $\mathbf{f}_i^c$, and the fine-level feature is acquired via concatenation as $\mathbf{f}_i^f = \mathbf{f}_i^c \oplus \bar{\mathbf{f}}_i^c$.
- Granularity Prior: The fine-level feature contains the coarse-level feature, which encodes hierarchical dependencies found in the real world: if A and B differ at the coarse level, then any parts $a \subset A$ and $b \subset B$ also differ at the fine level.
- In experiments, $D^c=12$ and $D=24$ are set, and both level features are computed in a single forward pass through the 3DGS rasterizer.
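
The decomposition above can be sketched in a few lines. This is an illustration only, using plain Python lists in place of per-Gaussian tensors; the helper name `split_levels` is hypothetical.

```python
from typing import List, Tuple

D_COARSE = 12  # D^c in the text
D_TOTAL = 24   # D in the text


def split_levels(f: List[float],
                 d_coarse: int = D_COARSE) -> Tuple[List[float], List[float]]:
    """Return (coarse, fine) features for one Gaussian.

    coarse = the first d_coarse channels.
    fine   = coarse concatenated with the complementary channels,
             i.e. the full D-dimensional vector.
    """
    coarse = f[:d_coarse]
    complement = f[d_coarse:]
    fine = coarse + complement  # list concatenation plays the role of ⊕
    return coarse, fine


# Two Gaussians that share the coarse part but differ in the complement.
f_a = [1.0] * 12 + [0.5] * 12
f_b = [1.0] * 12 + [-0.5] * 12
coarse_a, fine_a = split_levels(f_a)
coarse_b, fine_b = split_levels(f_b)

print(len(coarse_a), len(fine_a))              # 12 24
print(coarse_a == coarse_b, fine_a == fine_b)  # True False
```

Because the fine feature contains the coarse channels, two Gaussians that differ at the coarse level necessarily differ at the fine level, while the reverse does not hold (as in the example).
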
Contrastive Learning:
- Positive contrastive loss: For pixel pairs within the same mask region, maximize the cosine similarity of their rendered features. $\mathcal{L}_{\text{pos}}^{\text{cont}} = -\frac{1}{|P_1||P_2|}\sum_l \sum_{p_1}\sum_{p_2} \mathbb{1}[M_{p_1}^l = M_{p_2}^l] \mathbf{S}^l(p_1, p_2)$
- Negative contrastive loss: For pixel pairs from different mask regions, constrain the cosine similarity not to exceed a threshold $\tau^l$ ($\tau^f=0.75$, $\tau^c=0.5$).
- A stop-gradient operation is applied to the fine-level negative contrastive loss to prevent coarse-level features from being mistakenly updated during fine-level differentiation.
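
A minimal sketch of the two contrastive terms for a single granularity level follows. The positive term mirrors the formula above; the negative term is written as a hinge, `max(0, S - tau)`, which is an assumption about how "not to exceed a threshold" is enforced. The stop-gradient is not shown, since plain Python has no autodiff; in an autodiff framework the coarse channels would be detached when computing the fine-level negative term.

```python
import math
from typing import List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two rendered pixel features."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)


def contrastive_losses(feats1: List[List[float]], ids1: List[int],
                       feats2: List[List[float]], ids2: List[int],
                       tau: float) -> Tuple[float, float]:
    """Positive and negative contrastive losses for one level.

    feats1/feats2: rendered features at two sets of sampled pixels (P1, P2).
    ids1/ids2:     mask id of each sampled pixel at this level.
    tau:           similarity ceiling for pixels from different masks.
    """
    pos, neg = 0.0, 0.0
    for f1, m1 in zip(feats1, ids1):
        for f2, m2 in zip(feats2, ids2):
            s = cosine(f1, f2)
            if m1 == m2:
                pos -= s                  # pull same-mask pixels together
            else:
                neg += max(0.0, s - tau)  # assumed hinge form
    n_pairs = len(feats1) * len(feats2)
    return pos / n_pairs, neg / n_pairs


# Toy example: two pixels per set, 2-D features, coarse threshold 0.5.
pos_loss, neg_loss = contrastive_losses(
    [[1.0, 0.0], [0.0, 1.0]], [0, 1],
    [[1.0, 0.0], [1.0, 1.0]], [0, 1],
    tau=0.5,
)
print(round(pos_loss, 4), round(neg_loss, 4))
```
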
Global Feature-guided Learning (GFL):
- Global Feature Candidate Generation: After a specified number of iterations, 2D feature maps are rendered for all training views, and average pooling is performed on each mask region to obtain the average feature set $\mathcal{F}^l$.
- HDBSCAN Clustering: HDBSCAN clustering is performed on $\mathcal{F}^l$ to yield $C^l$ global feature candidates $\tilde{\mathcal{F}}^l$. These cluster centers represent the most typical features in the entire scene, effectively smoothing out cross-view noise.
- GFL Loss: Guides each Gaussian feature close to its most probable global cluster while keeping it away from others. $\mathcal{L}_{\text{pos}}^{\text{GFL}} = -\frac{1}{N}\sum_l\sum_i \mathbb{1}[\tilde{\mathbf{S}}^l(i,c_i^l) > \tau^g] \tilde{\mathbf{S}}^l(i,c_i^l)$
- Clusters are updated periodically, with $\tau^g=0.9$.
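
The pooling step and the positive GFL term can be sketched as follows. This is a simplified illustration: the HDBSCAN step is not run here, and the cluster centers are assumed to be given; the term that pushes features away from other clusters is omitted because its formula is not stated above. Function names are hypothetical.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)


def mask_average(pixel_feats: List[List[float]]) -> List[float]:
    """Average-pool rendered features over the pixels of one mask region."""
    n = len(pixel_feats)
    return [sum(channel) / n for channel in zip(*pixel_feats)]


def gfl_positive_loss(gauss_feats: List[List[float]],
                      centers: List[List[float]],
                      tau_g: float = 0.9) -> float:
    """Positive GFL term for one level.

    For each Gaussian, take the similarity to its most similar global
    candidate (c_i); it contributes only if that similarity exceeds tau_g.
    """
    total = 0.0
    for f in gauss_feats:
        best = max(cosine(f, c) for c in centers)
        if best > tau_g:
            total -= best
    return total / len(gauss_feats)


pooled = mask_average([[1.0, 2.0], [3.0, 4.0]])
print(pooled)  # [2.0, 3.0]

# Assumed cluster centers (stand-ins for the HDBSCAN output).
centers = [[1.0, 0.0], [0.0, 1.0]]
gauss = [[0.99, 0.05], [0.6, 0.6], [0.0, 2.0]]
loss = gfl_positive_loss(gauss, centers)
print(round(loss, 4))
```

The second Gaussian sits halfway between both centers, so its best similarity is below $\tau^g$ and it is left unsupervised by this term; only confident assignments are reinforced.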

Loss & Training¶

Total loss function: $$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{cont}} + \lambda_1 \mathcal{L}_{\text{GFL}} + \lambda_2 \mathcal{L}_{\text{3D-norm}} + \lambda_3 \mathcal{L}_{\text{2D-norm}} + \lambda_4 \mathcal{L}_{\text{spatial}}$$

The regularization terms include:
- Hyperspherical Regularization $\mathcal{L}_{\text{3D-norm}}$: Constrains the $L_2$-norm of each Gaussian feature to 1 to prevent overly large feature norms from dominating the rendering.
- Rendered Feature Regularization $\mathcal{L}_{\text{2D-norm}}$: Ensures that multiple Gaussian features corresponding to a single pixel align in direction.
- Spatial Consistency Regularization $\mathcal{L}_{\text{spatial}}$: Utilizes KD-tree queries for spatial nearest neighbors to enforce similarity between adjacent Gaussian features.
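
The exact formulas of the three regularizers are not given here, so the sketch below uses assumed forms that match the descriptions: a squared deviation of the feature norm from 1, a squared deviation of the blended pixel feature's norm from 1 (a weighted average of unit vectors has norm 1 only when they point the same way), and `1 - cosine` between spatial neighbors. A brute-force neighbor search stands in for the KD-tree.

```python
import math
from typing import List


def norm(v: List[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def loss_3d_norm(gauss_feats: List[List[float]]) -> float:
    """Assumed form: mean squared deviation of each feature norm from 1."""
    return sum((norm(f) - 1.0) ** 2 for f in gauss_feats) / len(gauss_feats)


def loss_2d_norm(weights: List[float], feats: List[List[float]]) -> float:
    """Assumed form for one pixel: blend features with weights that sum to 1
    and penalize the blended norm for falling away from 1."""
    dim = len(feats[0])
    rendered = [sum(w * f[d] for w, f in zip(weights, feats)) for d in range(dim)]
    return (norm(rendered) - 1.0) ** 2


def loss_spatial(positions: List[List[float]],
                 gauss_feats: List[List[float]], k: int = 1) -> float:
    """Assumed form: mean (1 - cosine) to the k nearest Gaussians in 3D.
    Brute-force search is used here instead of a KD-tree."""
    total, count = 0.0, 0
    for i, p in enumerate(positions):
        others = [j for j in range(len(positions)) if j != i]
        others.sort(key=lambda j: math.dist(p, positions[j]))
        for j in others[:k]:
            fi, fj = gauss_feats[i], gauss_feats[j]
            dot = sum(a * b for a, b in zip(fi, fj))
            total += 1.0 - dot / (norm(fi) * norm(fj) + 1e-8)
            count += 1
    return total / count


print(loss_3d_norm([[1.0, 0.0], [0.0, 2.0]]))                  # 0.5
print(loss_2d_norm([0.5, 0.5], [[1.0, 0.0], [1.0, 0.0]]))      # 0.0
print(loss_2d_norm([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]]) > 0)  # True
```
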

Hyperparameters: $\lambda_1=10.0$, $\lambda_2=0.2$, $\lambda_3=0.2$, $\lambda_4=0.5$. Training runs for 3000 iterations, with GFL enabled from the 2000th iteration, and takes approximately 13 minutes in total on an NVIDIA RTX A5000.
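
As a worked illustration of the weighting and schedule, the sketch below combines already-computed loss values with the stated weights. Gating the GFL term by setting it to zero before iteration 2000 is an assumption about how "starting from the 2000th iteration" is realized; the function name `total_loss` is hypothetical.

```python
# Weights and schedule as stated in the text.
LAMBDA_GFL = 10.0      # lambda_1
LAMBDA_3D_NORM = 0.2   # lambda_2
LAMBDA_2D_NORM = 0.2   # lambda_3
LAMBDA_SPATIAL = 0.5   # lambda_4
TOTAL_ITERS = 3000
GFL_START = 2000


def total_loss(iteration: int, cont: float, gfl: float,
               norm3d: float, norm2d: float, spatial: float) -> float:
    """Weighted sum of the loss terms at a given training iteration.

    Assumption: the GFL term contributes nothing before GFL_START.
    """
    gfl_term = gfl if iteration >= GFL_START else 0.0
    return (cont
            + LAMBDA_GFL * gfl_term
            + LAMBDA_3D_NORM * norm3d
            + LAMBDA_2D_NORM * norm2d
            + LAMBDA_SPATIAL * spatial)


# With every raw term equal to 1.0, the GFL weight dominates once it is active.
early = total_loss(0, 1.0, 1.0, 1.0, 1.0, 1.0)
late = total_loss(2000, 1.0, 1.0, 1.0, 1.0, 1.0)
print(round(early, 2), round(late, 2))  # 1.9 11.9
```
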

Key Experimental Results¶

Main Results¶

LERF-Mask Dataset (mIoU %):

Method	Coarse Avg	Fine Avg	All Avg
Gau-Group	72.8	31.5	49.9
OmniSeg3D	79.4	46.9	60.9
Feature3DGS	65.6	63.5	63.4
GARField	80.9	71.4	75.2
Click-Gaussian	89.1	84.3	85.4

SPIn-NeRF Dataset (mIoU %):

Method	mIoU
MVSeg	90.9
SA3D	92.4
SAGA	88.0
Click-Gaussian	94.0

Ablation Study¶

Configuration	All mIoU	Relative Change
w/o $\mathcal{L}_{\text{2D-norm}}$	83.2	-2.6%
w/o $\mathcal{L}_{\text{3D-norm}}$	80.3	-6.0%
w/o $\mathcal{L}_{\text{spatial}}$	82.0	-4.0%
w/o $\mathcal{L}_{\text{GFL}}$	58.6	-31.3%
w/o prior	82.1	-3.9%
Full Model	85.4	-
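
The "Relative Change" column appears to be the change in All mIoU relative to the full model, i.e. $(\text{mIoU}_{\text{ablated}} - 85.4) / 85.4$. The short check below recomputes it from the rounded values in the table; it matches the reported column to within about 0.1 percentage points, which is what rounding of the inputs would explain.

```python
FULL_MODEL_MIOU = 85.4

# All mIoU per ablation, and the relative change reported in the table.
ablations = {
    "w/o L_2D-norm": (83.2, -2.6),
    "w/o L_3D-norm": (80.3, -6.0),
    "w/o L_spatial": (82.0, -4.0),
    "w/o L_GFL": (58.6, -31.3),
    "w/o prior": (82.1, -3.9),
}


def relative_change(miou: float, full: float = FULL_MODEL_MIOU) -> float:
    """Percentage change of an ablated score relative to the full model."""
    return 100.0 * (miou - full) / full


recomputed = {name: relative_change(miou) for name, (miou, _) in ablations.items()}
for name, value in recomputed.items():
    reported = ablations[name][1]
    print(f"{name}: recomputed {value:+.2f}%, reported {reported:+.1f}%")
```
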

Key Findings¶

GFL is the most critical component: removing it lowers overall mIoU by 31.3% in relative terms (85.4 to 58.6), and fine-level mIoU falls from 84.3 to 42.3.
The granularity prior boosts fine-level segmentation by 6 percentage points (78.3 $\to$ 84.3).
Hyperspherical regularization is particularly crucial for fine-level segmentation (74.1 vs 84.3).
Inference speed: ~10ms per click, which is 130 times faster than GARField and 15 times faster than SAGA.

Highlights & Insights¶

Elegant granularity prior design: Designing the fine-level feature as a superset of the coarse-level feature gracefully encodes real-world hierarchical dependencies.
GFL fundamentally resolves the cross-view consistency issue: It provides stable supervision signals through global clustering rather than relying on noisy single-view masks.
High practicality: The 10ms-level interactive response time makes true real-time 3D editing possible.
Clever use of stop gradient: Applying stop-gradient to the coarse-level components during fine-level negative contrastive learning avoids gradient conflicts between coarse and fine features.

Limitations & Future Work¶

Dependence on the quality of pre-trained 3DGS; if a single Gaussian represents multiple objects with different semantics but similar colors, feature learning is hindered.
With only two levels of granularity there are no intermediate stages, so complex scenes that need multi-level segmentation may require multiple interactions.
Lack of semantic understanding; features are learned purely from appearance and spatial cues, which prevents open-vocabulary segmentation.
Future work can consider extending the approach to dynamic scenes or 4D Gaussian segmentation.

Comparison with SAGA demonstrates that post-processing is a bottleneck, and learning discriminative features end-to-end is a superior direction.
The GFL concept can be generalized to other tasks requiring cross-view consistency (such as 3D editing and semantic understanding).
Inspiration from two-level granularity prior: Encoding explicit hierarchical dependencies in multi-scale learning is more effective than independent learning.

Rating¶

Novelty: ⭐⭐⭐⭐ (Novel designs in GFL and granularity prior, while the overall framework is standard)
Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Evaluated on multiple datasets against multiple baselines, with comprehensive ablation studies and application demonstrations)
Writing Quality: ⭐⭐⭐⭐ (Clear logic with well-formulated equations)
Value: ⭐⭐⭐⭐ (10ms interactive segmentation holds high practical value)
