4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models

Conference: CVPR 2025
arXiv: 2503.10437
Code: https://github.com/4d-langsplat/4d-langsplat
Area: Multimodal VLM
Keywords: 4D Language Field, Gaussian Splatting, Multimodal Large Language Model, Open-Vocabulary Query, Dynamic Scene Understanding

TL;DR

This paper proposes 4D LangSplat, which constructs a 4D language field by using multimodal large language models (MLLMs) to generate object-wise video captions. Combined with a status deformable network that models the temporally continuous evolution of semantics, it is presented as the first method to support both time-sensitive and time-agnostic open-vocabulary queries in dynamic scenes.

Background & Motivation

Establishing a language field in 3D scenes to support open-vocabulary queries has become a foundational capability for applications such as robotic navigation and scene editing. By embedding CLIP features into 3D Gaussian representations, LangSplat achieves precise and efficient 3D language fields in static scenes. However, the real world is inherently dynamic.

Limitations of Prior Work: Extending LangSplat to 4D scenes faces two major challenges:

Inability of CLIP to capture temporal dynamics: CLIP is designed for static image-text alignment and cannot comprehend state changes, actions, or temporal context. For instance, "a running dog" versus "a sitting dog" are difficult to distinguish using CLIP features.

Lack of pixel-aligned object-wise video features: Existing vision models (e.g., VideoCLIP) primarily extract global video-level features. Cropping objects often introduces background clutter or fails to distinguish between object motion and camera movement.

Key Challenge: Constructing a precise 4D language field requires pixel-aligned, object-wise, and time-varying feature supervision, which existing vision models cannot provide.

Key Insight: Instead of relying on limited visual features, this work uses MLLMs to convert videos into object-wise textual descriptions, which an LLM then encodes into sentence embeddings that serve as supervision signals. The core idea is to supervise the time-varying 4D language field with textual semantics rather than visual features.

Method

Overall Architecture

First, reconstruct the dynamic RGB scene using 4D-GS (deformable Gaussian field).
Add two language fields to each Gaussian point:
- Time-agnostic semantic field: Learns time-invariant semantics (e.g., "person", "cup") using CLIP features.
- Time-varying semantic field: Learns dynamic semantics (e.g., "pouring coffee") using caption features generated by MLLMs.
Combine the two semantic fields as needed during querying.

Key Designs

Multimodal Object-Wise Video Prompting:
- Function: Guides the MLLM to generate temporally consistent, high-quality frame-by-frame captions for each object in the video.
- Mechanism: Consists of two steps—① Visual Prompting: Track objects using SAM+DEVA to obtain temporally consistent object masks, and construct visual prompts for the target object as $\mathcal{P}_{i,t} = \text{Contour}(M_{i,t}) \cup \text{Gray}(M_{i,t}) \cup \text{Blur}(M_{i,t})$ (highlighting with red contours, background graying, and background blurring) to retain scene context while focusing on the target object; ② Textual Prompting: First generate a video-level motion description $\mathcal{D}_i$ using the entire video, and then use it as context to generate frame-specific captions $C_{i,t}$. Finally, use an LLM (e5-mistral-7b) to encode the captions into sentence embeddings $\mathbf{e}_{i,t}$ as pixel-wise supervision.
- Design Motivation: Directly cropping objects to extract visual features introduces background noise or loses the motion reference. MLLMs, by contrast, can comprehend dynamic semantics in videos, and textual embeddings naturally share an embedding space with natural language queries. The three-part visual prompt (contour + graying + blurring) keeps the MLLM focused on the target object without discarding the scene context.
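
The visual prompt $\mathcal{P}_{i,t}$ described above can be sketched on a tiny image. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the box blur, the luminance weights, and the choice to draw the contour on the mask's own boundary pixels are all hypothetical. Plain Python lists are used so the example has no dependencies.

```python
def luminance(rgb):
    """Grayscale value of an (r, g, b) pixel (standard luma weights, an assumption)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def box_blur(gray, radius=1):
    """Mean filter over a (2*radius+1)^2 window, clipped at the image border."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [gray[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out


def is_contour(mask, y, x):
    """True for a mask pixel with a 4-neighbour outside the mask or the image."""
    if not mask[y][x]:
        return False
    h, w = len(mask), len(mask[0])
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ny, nx = y + dy, x + dx
        if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
            return True
    return False


def build_visual_prompt(image, mask, radius=1):
    """Red contour on the object, original colours inside, gray + blurred background."""
    h, w = len(image), len(image[0])
    gray = [[luminance(image[y][x]) for x in range(w)] for y in range(h)]
    blurred = box_blur(gray, radius)
    prompt = []
    for y in range(h):
        row = []
        for x in range(w):
            if is_contour(mask, y, x):
                row.append((255, 0, 0))          # Contour(M): highlight in red
            elif mask[y][x]:
                row.append(image[y][x])          # object interior stays untouched
            else:
                v = int(round(blurred[y][x]))    # Gray(M) + Blur(M) on the background
                row.append((v, v, v))
        prompt.append(row)
    return prompt


# Toy 5x5 frame with a 3x3 object mask in the centre.
image = [[(10 * x, 20 * y, 100) for x in range(5)] for y in range(5)]
mask = [[1 <= y <= 3 and 1 <= x <= 3 for x in range(5)] for y in range(5)]
prompt = build_visual_prompt(image, mask)
```

In a real pipeline the mask would come from the SAM+DEVA tracker and the prompted frames would be passed to the MLLM; both steps are outside this sketch.
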
Status Deformable Network:
- Function: Models the temporal evolution of the semantic features of each Gaussian point, constraining them to transition smoothly among a finite set of status prototypes.
- Mechanism: Defines $K$ status prototype features $\{\mathbf{S}_{i,1}, ..., \mathbf{S}_{i,K}\}$ for each Gaussian point $i$. The semantic feature at time $t$ is a linear combination of status prototypes: $$\mathbf{f}_{i,t} = \sum_{k=1}^K w_{i,t,k} \mathbf{S}_{i,k}, \quad \sum_{k=1}^K w_{i,t,k} = 1$$ The weights $w_{i,t,k}$ are predicted by an MLP decoder $\phi$ from the HexPlane spatiotemporal features. The status prototypes and the MLP are optimized jointly.
- Design Motivation: Directly learning arbitrary semantic deformations $\Delta \mathbf{f}$ allows features to change into any semantic state, increasing learning difficulty and disrupting temporal consistency. In reality, objects typically exhibit a finite number of semantic states (e.g., standing $\rightarrow$ walking $\rightarrow$ running), which makes a weighted combination constraint more reasonable. Experiments show that $K=3$ is optimal.
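
The weighted combination $\mathbf{f}_{i,t} = \sum_k w_{i,t,k} \mathbf{S}_{i,k}$ can be sketched for a single Gaussian point as follows. This is an illustrative sketch only: `status_feature` and `logits` are hypothetical names, the `logits` stand in for the output of the MLP decoder $\phi$, and using a softmax to make the weights sum to 1 is an assumption, since the text above only states the sum-to-one constraint.

```python
import math


def softmax(logits):
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def status_feature(prototypes, logits):
    """Blend K status prototypes into one semantic feature.

    prototypes: K lists of equal length (the S_{i,k} of one Gaussian point).
    logits:     K raw scores for the current time step (stand-in for the MLP output).
    """
    weights = softmax(logits)
    dim = len(prototypes[0])
    feature = [sum(w * proto[d] for w, proto in zip(weights, prototypes))
               for d in range(dim)]
    return feature, weights


# K = 3 prototypes in a toy 2-D feature space.
prototypes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores give the mean of the prototypes.
mid_feature, mid_weights = status_feature(prototypes, [0.0, 0.0, 0.0])

# A dominant score pulls the feature toward a single prototype (here the second one).
late_feature, late_weights = status_feature(prototypes, [-10.0, 10.0, -10.0])
```

Because every feature is a convex combination of the prototypes under this assumption, it cannot drift to an arbitrary point in feature space, which is the structural constraint the design motivation describes.
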
Joint Querying of Dual Semantic Fields:
- Function: Supports both time-agnostic and time-sensitive open-vocabulary queries.
- Mechanism: Time-agnostic queries only utilize the CLIP semantic field to calculate correlations and obtain segmentation masks. Time-sensitive queries first use the time-agnostic field to locate the spatial region, and then use the time-varying field to compute cosine similarities within that region, filtering time intervals using the mean similarity as a threshold.
- Design Motivation: The division of labor between the two fields is distinct—CLIP is suited for capturing stable entity semantics, while MLLM captions excel at capturing dynamic status semantics.
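
The time-sensitive step (cosine similarity per frame, then a mean-similarity threshold) can be sketched as below. The sketch assumes that the spatial region has already been located with the time-agnostic field and that each frame is summarized by one feature vector for that region; `time_sensitive_frames` and the strict `>` comparison are hypothetical choices, not details taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def time_sensitive_frames(frame_features, query_embedding):
    """Return the frame indices whose similarity exceeds the mean similarity.

    frame_features:  one time-varying feature per frame for the located region.
    query_embedding: sentence embedding of the text query, in the same space.
    """
    sims = [cosine(f, query_embedding) for f in frame_features]
    threshold = sum(sims) / len(sims)
    selected = [t for t, s in enumerate(sims) if s > threshold]
    return selected, sims


# Toy example: the queried state only appears in the last two frames.
frame_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query_embedding = [0.0, 1.0]
selected, sims = time_sensitive_frames(frame_features, query_embedding)
print(selected)  # [2, 3]
```
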

Loss & Training

Time-agnostic field: Supervised using CLIP features through feature splatting across three SAM levels.
Time-varying field: Supervised via pixel-wise regression using LLM sentence embeddings.
CLIP and text features are compressed to 3 and 6 dimensions respectively using autoencoders.
Qwen2-VL-7B is used as the MLLM, and e5-mistral-7b is employed for generating sentence embeddings.

Key Experimental Results

Main Results: Time-Sensitive Queries (HyperNeRF Dataset)

Method	Avg. Acc (%)	Avg. vIoU (%)
LangSplat	54.01	22.65
Deformable CLIP	61.80	44.72
Non-Status Field	87.58	68.57
Ours	90.83	72.26
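
The gap between each baseline and the full method can be recomputed directly from the table above. The snippet below is only an arithmetic check on the reported numbers; the dictionary layout and the helper name `gap_to_ours` are illustrative.

```python
# (Avg. Acc %, Avg. vIoU %) copied from the time-sensitive results table.
results = {
    "LangSplat": (54.01, 22.65),
    "Deformable CLIP": (61.80, 44.72),
    "Non-Status Field": (87.58, 68.57),
    "Ours": (90.83, 72.26),
}


def gap_to_ours(results, ours="Ours"):
    """Difference in percentage points between the full method and each baseline."""
    acc_ours, viou_ours = results[ours]
    return {
        name: (round(acc_ours - acc, 2), round(viou_ours - viou, 2))
        for name, (acc, viou) in results.items()
        if name != ours
    }


gaps = gap_to_ours(results)
for name, (d_acc, d_viou) in gaps.items():
    print(f"{name}: +{d_acc:.2f} Acc, +{d_viou:.2f} vIoU")
```
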

Time-Agnostic Queries

Method	HyperNeRF mIoU	HyperNeRF mAcc	Neu3D mIoU	Neu3D mAcc
Feature-3DGS	36.63	74.02	34.96	87.12
Gaussian Grouping	50.49	80.92	49.93	95.05
LangSplat	74.92	97.72	61.49	91.89
Ours	82.48	98.01	85.11	98.32

Ablation Study

Configuration	Key Metric	Description
Blur-only visual prompting	$\Delta_{\text{sim}}$=0.33	A single prompt is insufficient to focus on the object
Blur + Gray	$\Delta_{\text{sim}}$=2.15	Graying the background is helpful
Blur + Gray + Contour	$\Delta_{\text{sim}}$=3.32	Triple visual prompting yields the best performance
Video-only textual prompting	$\Delta_{\text{sim}}$=0.14	Poor performance without image-level prompts
Video + image prompting	$\Delta_{\text{sim}}$=3.32	The combination of both is optimal
K=2 status count	Acc=94.56	Insufficient status count
K=3	Acc=97.82	Optimal status count
K=6	Acc=94.56	Too many statuses degrade performance instead

Key Findings

CLIP cannot comprehend dynamic semantics: on time-sensitive queries, Deformable CLIP scores 29.03 percentage points lower in Acc than this method (61.80 vs. 90.83).
Textual supervision from MLLMs is far superior to visual feature supervision, especially at state transition boundaries (e.g., when a cookie just cracks, or when coffee just starts dripping).
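The status deformable network improves on directly learning the deformation field (the Non-Status Field baseline) by 3.25 percentage points in Acc (90.83 vs. 87.58) and 3.69 percentage points in vIoU (72.26 vs. 68.57), per the time-sensitive results table.
4D LangSplat also outperforms LangSplat on time-agnostic queries (e.g., 85.11 vs. 61.49 mIoU on Neu3D), which is attributed to more accurate modeling of dynamic scenes.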

Highlights & Insights

Paradigm Shift of Text Replacing Vision: Bypassing visual features and directly using MLLM textual descriptions for supervision elegantly solves the difficulty of obtaining dynamic semantic features.
The visual prompting design (contour + graying + blurring) steers the MLLM toward the target object while retaining the scene context.
The Status Deformable Network models semantic evolution using a finite set of status prototypes, offering a simple yet effective structural constraint.
The division of labor between the time-agnostic and time-varying dual semantic fields elegantly leverages the respective strengths of CLIP and MLLMs.

Limitations & Future Work

Depends heavily on the quality of captions generated by the MLLM; the MLLM might produce inaccurate descriptions for complex dynamic scenes.
Tracking via SAM+DEVA may fail to keep track of objects under severe occlusions.
The status count $K$ needs to be set manually; adaptive determination would be more ideal.
High computational overhead: Requires invoking the MLLM frame-by-frame and object-by-object to generate captions.
Evaluation is constrained by annotation: existing dynamic scene datasets lack language labels, so queries and ground truth have to be annotated manually.

Related Work & Insights

LangSplat demonstrated the effectiveness of pairing SAM masks with CLIP features for 3D static language fields; this work extends it to 4D.
The deformable Gaussian field of 4D-GS provides a foundation for dynamic scene reconstruction.
The video understanding capability of MLLMs (such as Qwen2-VL) serves as a key enabling technology for this method.
Key Insight: When visual features do not meet requirements, the problem can be projected into the text domain to leverage the powerful semantic comprehension of LLMs.

Rating

Novelty: ⭐⭐⭐⭐⭐ First to build a 4D language field utilizing MLLM textual supervision, introducing a paradigm innovation.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on two datasets with multiple ablations, though annotated data for dynamic scenes remains limited.
Writing Quality: ⭐⭐⭐⭐ Clear motivation and comprehensive method description.
Value: ⭐⭐⭐⭐ Opens up a new direction for 4D dynamic language fields, holding practical significance for robotics and scene understanding.
