ReferSplat: Referring Segmentation in 3D Gaussian Splatting¶
Conference: ICML 2025 Oral
arXiv: 2508.08252
Code: https://github.com/heshuting555/ReferSplat
Area: 3D Vision
Keywords: 3D Gaussian Splatting, Referring Segmentation, Natural Language, Spatial Reasoning, Contrastive Learning
TL;DR¶
ReferSplat proposes the new task of Referring 3D Gaussian Splatting Segmentation (R3DGS). By combining 3D Gaussian Referring Fields, a Position-Aware Cross-Modal Interaction (PCMI) module, and Gaussian-Text Contrastive Learning (GTCL), it segments target objects (including occluded or invisible ones) in 3DGS scenes based on natural language descriptions. It achieves SOTA performance on the newly created Ref-LERF dataset and on open-vocabulary segmentation benchmarks.
Background & Motivation¶
With its fast training, real-time rendering, and explicit point representation, 3DGS has rapidly become an important method for 3D scene representation. While open-vocabulary 3DGS segmentation has made preliminary progress, it relies solely on fixed-pattern class name inputs. Interaction between free-form natural language and 3D scenes remains largely unexplored, yet it is crucial for embodied AI, autonomous driving, and VR/AR.
Prior open-vocabulary methods have two core limitations. First, text queries never interact with the Gaussian representation during training: matching happens on 2D rendered features rather than by localizing directly in 3D space. Second, they ignore positional information, so the rendered features cannot capture spatial relationships. Key Challenge: the task requires 3D spatial reasoning and fine-grained linguistic understanding at the same time.
Core Idea: Model language interaction directly in the 3D Gaussian space by assigning each Gaussian a referring feature that links it directly to the text, and strengthen spatial reasoning with position-aware cross-attention.
Method¶
Overall Architecture¶
ReferSplat consists of three core components: (1) 3D Gaussian Referring Fields; (2) Position-Aware Cross-Modal Interaction (PCMI) to enhance spatial reasoning; (3) Gaussian-Text Contrastive Learning (GTCL) to distinguish semantically similar expressions. The training utilizes pseudo-labels generated by a confidence-weighted IoU strategy.
Key Designs¶
- 3D Gaussian Referring Fields:
  - Introduces a referring feature \(f_{r,i}\) for each Gaussian, computes its similarity \(m_i = \sum_j f_{r,i} \cdot f_{w,j}\) with the word features \(f_{w,j}\), and then rasterizes these scores to render 2D masks.
  - Design Motivation: Unlike retrieval on 2D rendered features, direct modeling in 3D space lets the model identify occluded objects by leveraging multi-view consistency.
- Position-Aware Cross-Modal Interaction (PCMI):
  - Gaussian positional features: maps center coordinates into positional embeddings using an MLP.
  - Text position inference: indirectly obtains textual positions by correlating word features and Gaussian referring features.
  - Position-guided attention refines referring features, fusing positional and semantic information.
  - Design Motivation: Understanding expressions with spatial relationships requires parallel semantic recognition and spatial reasoning.
- Gaussian-Text Contrastive Learning (GTCL):
  - Selects the top-\(\tau\) responding Gaussian features, averages them as positive Gaussian embeddings, and uses contrastive learning to pull corresponding text close while pushing irrelevant text away.
  - Design Motivation: Distinguish between semantically similar descriptions that refer to different objects.
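The first two designs in the list above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the single tanh layer standing in for the positional MLP, the softmax weighting used to infer text positions, the residual attention update, and every function name are hypothetical choices made for the example. Only the response formula \(m_i = \sum_j f_{r,i} \cdot f_{w,j}\) comes from the notes.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def referring_response(f_r, word_feats):
    """m_i = sum_j f_{r,i} . f_{w,j} for a single Gaussian."""
    return sum(dot(f_r, f_w) for f_w in word_feats)

def position_embedding(center, layer):
    """Assumed stand-in for the positional MLP: one linear layer plus tanh."""
    return [math.tanh(dot(row, center)) for row in layer]

def infer_text_positions(word_feats, ref_feats, pos_embs):
    """Assumed form: each word gets a similarity-weighted average of Gaussian positions."""
    positions = []
    for f_w in word_feats:
        attn = softmax([dot(f_w, f_r) for f_r in ref_feats])
        positions.append(weighted_sum(attn, pos_embs))
    return positions

def refine(ref_feats, pos_embs, word_feats, text_pos):
    """Assumed position-guided attention: queries and keys both carry position."""
    keys = [add(f_w, p_w) for f_w, p_w in zip(word_feats, text_pos)]
    refined = []
    for f_r, p in zip(ref_feats, pos_embs):
        attn = softmax([dot(add(f_r, p), k) for k in keys])
        refined.append(add(f_r, weighted_sum(attn, keys)))  # residual update
    return refined

# Toy scene: two Gaussians, two words, 2-D features (all values illustrative).
centers = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
ref_feats = [[1.0, 0.0], [0.0, 1.0]]
word_feats = [[2.0, 0.0], [0.0, 1.0]]
layer = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

pos_embs = [position_embedding(c, layer) for c in centers]
responses = [referring_response(f, word_feats) for f in ref_feats]
text_pos = infer_text_positions(word_feats, ref_feats, pos_embs)
refined = refine(ref_feats, pos_embs, word_feats, text_pos)
print(responses)  # [2.0, 1.0]
```

In the toy scene the second word is most similar to the second Gaussian, so its inferred position lands closer to that Gaussian's positional embedding. In the full method the per-Gaussian responses would be rasterized into a 2D mask rather than printed.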
Loss & Training¶
- Total loss: BCE loss + \(\lambda \times\) contrastive loss, where \(\lambda = 0.02\)
- Pseudo-labels: Grounded SAM + confidence-weighted IoU strategy to select the optimal mask
- Other settings: two-stage optimization for refinement, BERT text embeddings, 45,000 iterations
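The objective above can be written out as a short sketch that builds the GTCL positive embedding from the top-\(\tau\) responding Gaussians and adds the weighted contrastive term to a mask BCE term. The notes only state that a contrastive loss is used, so the InfoNCE-style form, the `temperature` parameter, and all function names below are assumptions for illustration. Only \(\lambda = 0.02\) and the top-\(\tau\) averaging are taken from the text.

```python
import math

LAMBDA = 0.02  # contrastive weight reported in the notes

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def positive_embedding(ref_feats, responses, tau):
    """Average the referring features of the tau highest-responding Gaussians."""
    order = sorted(range(len(responses)), key=lambda i: responses[i], reverse=True)
    top = order[:tau]
    dim = len(ref_feats[0])
    return [sum(ref_feats[i][d] for i in top) / len(top) for d in range(dim)]

def contrastive_loss(gauss_emb, text_embs, positive_index, temperature=0.1):
    """Assumed InfoNCE-style loss: cross-entropy over Gaussian-text similarities."""
    logits = [dot(gauss_emb, t) / temperature for t in text_embs]
    peak = max(logits)
    log_partition = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_partition - logits[positive_index]

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a rendered mask and its pseudo-label."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)

def total_loss(pred_mask, pseudo_mask, gauss_emb, text_embs, positive_index):
    return (bce_loss(pred_mask, pseudo_mask)
            + LAMBDA * contrastive_loss(gauss_emb, text_embs, positive_index))

# Toy example: three Gaussians, two candidate expressions (values illustrative).
ref_feats = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
responses = [2.0, 1.5, 0.1]
text_embs = [[1.0, 0.0], [0.0, 1.0]]

emb = positive_embedding(ref_feats, responses, tau=2)
matched = contrastive_loss(emb, text_embs, positive_index=0)
mismatched = contrastive_loss(emb, text_embs, positive_index=1)
loss = total_loss([0.9, 0.1], [1.0, 0.0], emb, text_embs, positive_index=0)
print(matched < mismatched)  # True
```

With \(\lambda = 0.02\) the contrastive term acts as a light regularizer on top of the mask loss, which is consistent with its role of separating similar expressions rather than driving segmentation by itself.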
Key Experimental Results¶
Main Results (Ref-LERF Dataset)¶
| Method | ramen | figurines | teatime | kitchen | avg mIoU |
|---|---|---|---|---|---|
| Grounded SAM | 14.1 | 16.0 | 16.9 | 16.2 | 15.8 |
| LangSplat | 12.0 | 17.9 | 7.6 | 17.9 | 13.9 |
| GOI | 27.1 | 16.5 | 22.9 | 15.7 | 20.5 |
| ReferSplat | 35.2 | 25.7 | 31.3 | 24.4 | 29.2 |
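As a quick sanity check on the table, the average column can be recomputed from the four per-scene scores. The snippet below is only an arithmetic check on the numbers printed above; the dictionary layout is an arbitrary choice for the example.

```python
# Per-scene mIoU (ramen, figurines, teatime, kitchen) and the reported average.
rows = {
    "Grounded SAM": ([14.1, 16.0, 16.9, 16.2], 15.8),
    "LangSplat": ([12.0, 17.9, 7.6, 17.9], 13.9),
    "GOI": ([27.1, 16.5, 22.9, 15.7], 20.5),
    "ReferSplat": ([35.2, 25.7, 31.3, 24.4], 29.2),
}

means = {}
for name, (scores, reported) in rows.items():
    means[name] = sum(scores) / len(scores)
    print(f"{name}: recomputed {means[name]:.2f}, reported {reported}")

# Margin of ReferSplat over the strongest baseline, using the reported averages.
gain = rows["ReferSplat"][1] - rows["GOI"][1]
print(f"gain over GOI: {gain:.1f} mIoU")
```

The recomputed means agree with the reported column to within rounding, and the margin over GOI, the strongest baseline listed, is about 8.7 mIoU.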
Ablation Study¶
| Configuration | ramen | kitchen | Description |
|---|---|---|---|
| Baseline | 28.4 | 18.5 | Referring Fields only |
| + PCMI | 33.5 | 22.8 | +5.1/+4.3, enhanced spatial reasoning |
| + GTCL | 32.8 | 21.9 | +4.4/+3.4, fine-grained distinction |
| + PCMI + GTCL | 35.2 | 24.4 | Full ReferSplat |
| + Two-stage | 36.9 | 25.2 | Further refinement |
Key Findings¶
- Directly modeling 3D-text relationships via Referring Fields significantly outperforms matching in VLM feature space.
- Naive cross-attention (without positional information) performs worse than the baseline, demonstrating that position-awareness is critical.
- Pseudo-label quality: the confidence-weighted IoU strategy outperforms Top-1 and SAM2 methods.
- SOTA performance is also achieved in open-vocabulary segmentation tasks, demonstrating the transferability of the referring capability.
Highlights & Insights¶
- Defines the new R3DGS task and constructs the Ref-LERF dataset.
- Shifting from 2D rendering matching to direct modeling in 3D space is a conceptual breakthrough.
- The bidirectional design of indirectly inferring text position from Gaussians is highly elegant.
- The confidence-weighted IoU pseudo-label strategy is simple yet effective.
Limitations & Future Work¶
- The Ref-LERF dataset is small in scale (4 scenes, 295 descriptions), requiring further validation of generalization performance.
- The performance ceiling of pseudo-labels is around 50% mIoU, which limits final performance.
- Complex natural language expressions (e.g., negation, conditionals) have not yet been evaluated.
Related Work & Insights¶
- A natural evolution from 2D RES (Referring Expression Segmentation) to 3D point cloud RES, and finally to 3DGS RES.
- The positional inference idea can be generalized to cross-modal interactions in other 3D representations.
- Direct inspiration for language-guided navigation and manipulation in embodied AI.
Supplementary Analysis¶
Characteristics of the Ref-LERF Dataset¶
- The average sentence length exceeds 13.6 words, which is approximately 8 times that of LERF-OVS, emphasizing spatial reasoning and detailed descriptions.
- The word cloud exhibits a high frequency of relative position terms (e.g., placed, near, next) and fine-grained attribute words (e.g., round, surface).
- Each object has approximately 5 descriptions, with 236 for training and 59 for testing.
Transfer Validation on the 3DOVS Task¶
- ReferSplat also achieves SOTA performance on the LERF-OVS open-vocabulary segmentation benchmark.
- On this benchmark LangSplat averages 51.4 mIoU, and ReferSplat scores higher on average.
- This demonstrates that the combination of Referring Fields, PCMI, and GTCL is also effective for general 3D language understanding.
Pseudo-Label Quality Analysis¶
- Ground Truth was manually annotated to evaluate the quality of the pseudo-labels.
- The confidence-weighted IoU strategy achieves approximately 50% mIoU against the GT.
- Two-stage optimization utilizes the first-stage rendered masks as better pseudo-labels to achieve further improvements.
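One way to picture the confidence-weighted IoU selection is sketched below. The notes do not spell out the exact rule, so the scoring used here is an assumption: each candidate mask is scored by the confidence-weighted sum of its IoU with every candidate (itself included), and the highest-scoring mask becomes the pseudo-label. The function names and the representation of masks as sets of pixel indices are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def select_pseudo_label(masks, confidences):
    """Assumed rule: pick the mask with the largest confidence-weighted IoU score."""
    best_index, best_score = -1, float("-inf")
    for i, mask in enumerate(masks):
        score = sum(c * iou(mask, other) for other, c in zip(masks, confidences))
        if score > best_score:
            best_index, best_score = i, score
    return best_index

def select_top1(confidences):
    """Baseline: simply take the most confident candidate."""
    return max(range(len(confidences)), key=lambda i: confidences[i])

# Toy candidates: two overlapping masks agree, one confident outlier does not.
masks = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {9, 10}]
confidences = [0.5, 0.45, 0.6]

chosen = select_pseudo_label(masks, confidences)
top1 = select_top1(confidences)
print(chosen, top1)  # 0 2
```

In this toy case the Top-1 rule picks the isolated but confident mask, while the weighted rule prefers a mask that other candidates agree with. That is the kind of behavior that would explain the reported advantage over Top-1 selection, though the actual scoring in the paper may differ.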
Rating¶
- Novelty: ⭐⭐⭐⭐ Paradigm innovation of modeling language interaction in 3D Gaussian space along with a new task definition.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive ablation studies, but small dataset scale.
- Writing Quality: ⭐⭐⭐⭐ Clear structure and well-motivated.
- Value: ⭐⭐⭐⭐ Promotes the advancement of 3DGS understanding toward natural language interaction.