ReferSplat: Referring Segmentation in 3D Gaussian Splatting

Conference: ICML 2025 Oral
arXiv: 2508.08252
Code: https://github.com/heshuting555/ReferSplat
Area: 3D Vision
Keywords: 3D Gaussian Splatting, Referring Segmentation, Natural Language, Spatial Reasoning, Contrastive Learning

TL;DR

ReferSplat proposes the new task of Referring 3D Gaussian Splatting Segmentation (R3DGS). By constructing 3D Gaussian Referring Fields, a Position-Aware Cross-Modal Interaction (PCMI) module, and Gaussian-Text Contrastive Learning (GTCL), it segments target objects (including occluded or invisible ones) in 3DGS scenes based on natural language descriptions. It achieves SOTA performance on the newly created Ref-LERF dataset and on open-vocabulary segmentation benchmarks.

Background & Motivation

With its fast training, real-time rendering, and explicit point representation, 3DGS has rapidly become an important method for 3D scene representation. While open-vocabulary 3DGS segmentation has made preliminary progress, it relies solely on fixed-pattern class name inputs. Interaction between free-form natural language and 3D scenes remains largely unexplored, yet it is crucial for embodied AI, autonomous driving, and VR/AR.

Prior open-vocabulary methods exhibit two core limitations. First, text queries and Gaussian representations do not interact during training: matching is performed on 2D rendered features rather than by localizing directly in 3D space. Second, they ignore positional information, so rendered features cannot capture spatial relationships. Key Challenge: the task requires 3D spatial reasoning and fine-grained linguistic understanding at the same time.

Core Idea: Model linguistic interaction directly in the 3D Gaussian space by assigning each Gaussian a referring feature that links it to the text, and strengthen spatial reasoning with position-aware cross-attention.

Method

Overall Architecture

ReferSplat consists of three core components: (1) 3D Gaussian Referring Fields; (2) Position-Aware Cross-Modal Interaction (PCMI) to enhance spatial reasoning; (3) Gaussian-Text Contrastive Learning (GTCL) to distinguish semantically similar expressions. The training utilizes pseudo-labels generated by a confidence-weighted IoU strategy.

Key Designs

  1. 3D Gaussian Referring Fields:

    • Introduces a referring feature \(f_{r,i}\) for each Gaussian, computes the similarity \(m_i = \sum_j f_{r,i} \cdot f_{w,j}\) with word features, and then rasterizes to render 2D masks.
    • Design Motivation: Unlike retrieval on 2D rendered features, direct modeling in 3D space allows the model to identify occluded objects leveraging multi-view consistency.
  2. Position-Aware Cross-Modal Interaction (PCMI):

    • Gaussian positional features: maps center coordinates into positional embeddings using an MLP.
    • Text position inference: indirectly obtains textual positions by correlating word features and Gaussian referring features.
    • Position-guided attention refines referring features, fusing positional and semantic information.
    • Design Motivation: Understanding expressions with spatial relationships requires parallel semantic recognition and spatial reasoning.
  3. Gaussian-Text Contrastive Learning (GTCL):

    • Selects the top-\(\tau\) responding Gaussian features, averages them as positive Gaussian embeddings, and uses contrastive learning to pull corresponding text close while pushing irrelevant text away.
    • Design Motivation: Distinguish between semantically similar descriptions that refer to different objects.
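As a rough illustration of designs 1 and 3, the sketch below computes the per-Gaussian response \(m_i = \sum_j f_{r,i} \cdot f_{w,j}\) and then averages the top-\(\tau\) responding referring features into a positive embedding. It is a toy under stated assumptions, not the authors' implementation: the function names, the 2-D feature size, and the toy values are hypothetical, and rasterization of the scores into a 2D mask is omitted.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def referring_scores(gaussian_feats, word_feats):
    """m_i = sum_j f_{r,i} . f_{w,j}, one score per Gaussian."""
    return [sum(dot(f_r, f_w) for f_w in word_feats) for f_r in gaussian_feats]


def positive_embedding(gaussian_feats, scores, tau):
    """Average the referring features of the tau highest-scoring Gaussians."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:tau]
    dim = len(gaussian_feats[0])
    return [sum(gaussian_feats[i][d] for i in top) / len(top) for d in range(dim)]


# Toy scene (hypothetical values): 4 Gaussians, 2 word features, 2-D features.
gaussian_feats = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [-1.0, 0.0]]
word_feats = [[1.0, 0.0], [0.5, 0.5]]

scores = referring_scores(gaussian_feats, word_feats)
pos = positive_embedding(gaussian_feats, scores, tau=2)
print("scores:", scores)
print("positive embedding:", pos)
```

In this toy, the first two Gaussians respond most strongly to the words, so only their features enter the positive embedding that the contrastive objective would pull toward the matching text.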

Loss & Training

  • Total loss: BCE loss + \(\lambda \times\) contrastive loss, where \(\lambda = 0.02\)
  • Pseudo-labels: Grounded SAM + confidence-weighted IoU strategy to select the optimal mask
  • Training setup: two-stage optimization for refinement, BERT text embeddings, and 45,000 training iterations
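The total loss above can be sketched as follows. Only the overall form (BCE plus \(\lambda = 0.02\) times a contrastive term) comes from the text; the InfoNCE-style contrastive loss, the cosine similarity, the temperature of 0.1, and all function names are assumptions made for illustration.

```python
import math

LAMBDA = 0.02  # contrastive-loss weight stated in the text


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def contrastive_loss(gauss_emb, text_embs, pos_idx, temperature=0.1):
    """InfoNCE-style loss (assumed form): text_embs[pos_idx] is the positive,
    every other text embedding acts as a negative."""
    logits = [cosine(gauss_emb, t) / temperature for t in text_embs]
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[pos_idx]


def total_loss(pred_mask, pseudo_mask, gauss_emb, text_embs, pos_idx):
    """BCE on the rendered mask plus LAMBDA times the contrastive term."""
    bce = bce_loss(pred_mask, pseudo_mask)
    con = contrastive_loss(gauss_emb, text_embs, pos_idx)
    return bce + LAMBDA * con


# Toy inputs (hypothetical): a 4-pixel mask and two candidate text embeddings.
pred_mask = [0.9, 0.2, 0.8, 0.1]
pseudo_mask = [1, 0, 1, 0]
gauss_emb = [0.9, 0.1]
text_embs = [[1.0, 0.0], [0.0, 1.0]]

loss = total_loss(pred_mask, pseudo_mask, gauss_emb, text_embs, pos_idx=0)
print(f"total loss: {loss:.4f}")
```

With such a small \(\lambda\), the mask supervision dominates and the contrastive term acts as a light regularizer on the Gaussian-text embedding space.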

Key Experimental Results

Main Results (Ref-LERF Dataset)

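| Method | ramen | figurines | teatime | kitchen | avg mIoU |
| --- | --- | --- | --- | --- | --- |
| Grounded SAM | 14.1 | 16.0 | 16.9 | 16.2 | 15.8 |
| LangSplat | 12.0 | 17.9 | 7.6 | 17.9 | 13.9 |
| GOI | 27.1 | 16.5 | 22.9 | 15.7 | 20.5 |
| ReferSplat | 35.2 | 25.7 | 31.3 | 24.4 | 29.2 |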

Ablation Study

| Configuration | ramen | kitchen | Description |
| --- | --- | --- | --- |
| Baseline | 28.4 | 18.5 | Referring Fields only |
| + PCMI | 33.5 | 22.8 | +5.1/+4.3, enhanced spatial reasoning |
| + GTCL | 32.8 | 21.9 | +4.4/+3.4, fine-grained distinction |
| + PCMI + GTCL | 35.2 | 24.4 | Full ReferSplat |
| + Two-stage | 36.9 | 25.2 | Further refinement |
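The gains quoted in the Description column can be re-derived from the ablation numbers. The short check below uses only values from the ablation rows; the dictionary layout and variable names are illustrative choices.

```python
# mIoU values copied from the ablation rows above.
baseline = {"ramen": 28.4, "kitchen": 18.5}
configs = {
    "+ PCMI": {"ramen": 33.5, "kitchen": 22.8},
    "+ GTCL": {"ramen": 32.8, "kitchen": 21.9},
    "+ PCMI + GTCL": {"ramen": 35.2, "kitchen": 24.4},
    "+ Two-stage": {"ramen": 36.9, "kitchen": 25.2},
}

# Gain of every configuration over the Referring-Fields-only baseline.
gains = {
    name: {scene: round(value - baseline[scene], 1) for scene, value in row.items()}
    for name, row in configs.items()
}

for name, gain in gains.items():
    print(f"{name:<14} ramen +{gain['ramen']:.1f}  kitchen +{gain['kitchen']:.1f}")
```

On ramen, the combined gain of PCMI and GTCL (+6.8) is smaller than the sum of their individual gains (+5.1 and +4.4), which suggests the two modules partly overlap in what they fix rather than contributing independently.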

Key Findings

  • Directly modeling 3D-text relationships via Referring Fields significantly outperforms matching in VLM feature space.
  • Naive cross-attention (without positional information) performs worse than the baseline, demonstrating that position-awareness is critical.
  • Pseudo-label quality: the confidence-weighted IoU strategy outperforms Top-1 and SAM2 methods.
  • SOTA performance is also achieved in open-vocabulary segmentation tasks, demonstrating the transferability of the referring capability.
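To make the role of position in the cross-attention finding concrete, here is a minimal sketch of position-guided attention under assumed formulas. The source only states that Gaussian centers are embedded by an MLP, that text positions are inferred by correlating word and referring features, and that position-guided attention refines the referring features. The single linear layer with ReLU, the additive fusion of position into queries and keys, the residual update, and every name below are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


def position_embedding(center, weight):
    """One linear layer + ReLU standing in for the positional MLP (assumption)."""
    return [max(0.0, dot(row, center)) for row in weight]


def pcmi(centers, gauss_feats, word_feats, weight):
    """Refine referring features with position-aware cross-attention (assumed form)."""
    dim = len(gauss_feats[0])
    pos_g = [position_embedding(c, weight) for c in centers]
    # Text positions: each word attends over Gaussians and pools their positions.
    pos_t = [
        weighted_sum(softmax([dot(w, g) for g in gauss_feats]), pos_g)
        for w in word_feats
    ]
    keys = [add(w, p) for w, p in zip(word_feats, pos_t)]
    refined = []
    for g, p in zip(gauss_feats, pos_g):
        query = add(g, p)  # semantic feature fused with position
        attn = softmax([dot(query, k) / math.sqrt(dim) for k in keys])
        refined.append(add(g, weighted_sum(attn, word_feats)))  # residual update
    return refined, pos_t


# Toy inputs (hypothetical): 3 Gaussians, 2 words, 2-D features.
centers = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
gauss_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
word_feats = [[1.0, 0.0], [0.0, 1.0]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # fixed 2x3 projection of xyz

refined, pos_t = pcmi(centers, gauss_feats, word_feats, weight)
print("refined referring features:", refined)
print("inferred text positions:", pos_t)
```

Dropping `pos_g` and `pos_t` from the queries and keys reduces this to plain cross-attention, the naive variant that the findings report as performing worse than the baseline.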

Highlights & Insights

  • Defines the new R3DGS task and constructs the Ref-LERF dataset.
  • Shifting from 2D rendering matching to direct modeling in 3D space is a conceptual breakthrough.
  • The bidirectional design of indirectly inferring text position from Gaussians is highly elegant.
  • The confidence-weighted IoU pseudo-label strategy is simple yet effective.

Limitations & Future Work

  • The Ref-LERF dataset is small in scale (4 scenes, 295 descriptions), requiring further validation of generalization performance.
  • The performance ceiling of pseudo-labels is around 50% mIoU, which limits final performance.
  • Complex natural language expressions (e.g., negation, conditionals) have not yet been evaluated.
  • The task is a natural evolution from 2D RES (Referring Expression Segmentation) to 3D point cloud RES, and finally to 3DGS RES.
  • The positional inference idea can be generalized to cross-modal interactions in other 3D representations.
  • Direct inspiration for language-guided navigation and manipulation in embodied AI.

Supplementary Analysis

Characteristics of the Ref-LERF Dataset

  • The average sentence length exceeds 13.6 words, which is approximately 8 times that of LERF-OVS, emphasizing spatial reasoning and detailed descriptions.
  • The word cloud exhibits a high frequency of relative position terms (e.g., placed, near, next) and fine-grained attribute words (e.g., round, surface).
  • Each object has approximately 5 descriptions, with 236 for training and 59 for testing.

Transfer Validation on the 3DOVS Task

  • ReferSplat also achieves SOTA performance on the LERF-OVS open-vocabulary segmentation benchmark.
  • ReferSplat's average score exceeds the 51.4 mIoU average obtained by LangSplat.
  • This demonstrates that the combination of Referring Fields, PCMI, and GTCL is also effective for general 3D language understanding.

Pseudo-Label Quality Analysis

  • Ground Truth was manually annotated to evaluate the quality of the pseudo-labels.
  • The confidence-weighted IoU strategy achieves approximately 50% mIoU against the GT.
  • Two-stage optimization utilizes the first-stage rendered masks as better pseudo-labels to achieve further improvements.
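The summary does not spell out the confidence-weighted IoU formula, so the sketch below shows one plausible reading: each candidate mask is scored by its IoU with every candidate, weighted by that candidate's confidence, and the highest-scoring mask becomes the pseudo-label. The scoring rule, the set-of-pixels mask representation, and all names are assumptions rather than the paper's definition.

```python
def iou(a, b):
    """Intersection over union of two masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def weighted_iou_score(k, masks, confidences):
    """Assumed rule: sum over all candidates of confidence * IoU with mask k."""
    return sum(c * iou(masks[k], m) for m, c in zip(masks, confidences))


def select_pseudo_label(masks, confidences):
    """Index of the candidate with the highest confidence-weighted IoU score."""
    return max(range(len(masks)), key=lambda k: weighted_iou_score(k, masks, confidences))


# Toy candidates (hypothetical): two overlapping masks and one confident outlier.
masks = [{0, 1, 2, 3}, {1, 2, 3, 4}, {10, 11}]
confidences = [0.6, 0.5, 0.7]

chosen = select_pseudo_label(masks, confidences)
top1 = max(range(len(masks)), key=lambda k: confidences[k])
print("confidence-weighted IoU picks:", chosen)
print("Top-1 confidence picks:", top1)
```

In this toy case the two overlapping candidates reinforce each other, so the weighted rule selects one of them, while a Top-1 confidence rule would pick the isolated outlier.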

Rating

  • Novelty: ⭐⭐⭐⭐ Paradigm innovation of modeling language interaction in 3D Gaussian space along with a new task definition.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive ablation studies, but small dataset scale.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure and well-motivated.
  • Value: ⭐⭐⭐⭐ Promotes the advancement of 3DGS understanding toward natural language interaction.