LangSplatV2: High-dimensional 3D Language Gaussian Splatting with 450+ FPS¶
Conference: NeurIPS 2025 · arXiv: 2507.07136 · Code: Project Page · Area: 3D Vision · Keywords: 3D language field, Gaussian splatting, sparse coding, real-time inference, open-vocabulary query
TL;DR¶
By treating each 3D Gaussian as a sparse code over a global dictionary, LangSplatV2 replaces the heavyweight decoder with a sparse coefficient field, achieving 476.2 FPS high-dimensional feature splatting and 384.6 FPS 3D open-vocabulary querying — a 47× speedup over LangSplat.
Background & Motivation¶
Background: 3D language fields sit at the intersection of vision-language models and 3D environment modeling. LangSplat made significant progress by embedding CLIP features into 3D Gaussian Splatting, achieving a 199× speedup over LERF.
Limitations of Prior Work: LangSplat still falls short of real-time inference (8.2 FPS). Its heavyweight MLP decoder accounts for 97.1% of total inference time, severely limiting applications in AR, robotics, and related domains.
Key Challenge: CLIP features are high-dimensional (512-D); splatting them directly is prohibitively expensive (rendering 1536-dimensional features is roughly 15× slower than 3-dimensional RGB). Compressing features to low dimensions via an encoder-decoder, however, introduces the heavy MLP decoding bottleneck.
Goal: Eliminate the decoder bottleneck and achieve real-time splatting of high-dimensional features.
Key Insight: The observation that millions of Gaussians in a scene encode only a limited number of distinct semantic concepts, making sparse coding a natural and efficient representation.
Core Idea: The language feature of each Gaussian is a sparse linear combination of K basis vectors from a global codebook; sparse coefficients are rendered in place of high-dimensional features.
Method¶
Overall Architecture¶
An \(L\)-dimensional sparse coefficient vector (with only \(K\) non-zero entries) is learned per 3D Gaussian, alongside a shared global codebook of \(L\) basis vectors, each of dimension \(D\). At inference: (1) splat the sparse coefficients (only \(K\) non-zero per Gaussian) → (2) recover the \(D\)-dimensional CLIP feature via a single matrix multiplication with the codebook → (3) compute relevancy scores against the text query. The MLP decoder is entirely removed.
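The three inference steps above can be sketched in a few lines of numpy. This is an illustrative mock-up, not the authors' implementation: the sizes follow the paper's configuration (L=64, D=512), but the tiny image, random data, and variable names are assumptions for demonstration.

```python
import numpy as np

# Illustrative sizes: L = 64 codebook atoms, D = 512 (CLIP dim), tiny 4x4 image.
L, D, H, W = 64, 512, 4, 4

rng = np.random.default_rng(0)
codebook = rng.standard_normal((L, D))  # global codebook S, shape (L, D)
coeff_map = rng.random((H, W, L))       # (1) splatted L-dim coefficients per pixel

# (2) recover D-dim CLIP features with one matrix multiply -- no MLP decoder
feat_map = coeff_map @ codebook         # shape (H, W, D)

# (3) relevancy: cosine similarity against a text-query embedding
query = rng.standard_normal(D)
feats = feat_map / np.linalg.norm(feat_map, axis=-1, keepdims=True)
relevancy = feats @ (query / np.linalg.norm(query))  # shape (H, W), values in [-1, 1]
```

The key point is that step (2) is a single dense matmul per image, which is why decoding drops from ~83 ms to ~0.1 ms in the paper's timing breakdown.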
Key Designs¶
- 3D Sparse Coefficient Field:
- Function: Replace per-Gaussian high-dimensional language features with sparse coefficients and a global codebook.
- Design Motivation: Millions of Gaussians correspond to only a limited set of semantic concepts, making them naturally amenable to sparse representation.
- Mechanism: Each Gaussian's feature is \(\mathbf{f}_i = \mathbf{w}_i \mathcal{S} = \sum_{l=1}^{L} w_{i,l} \mathbf{s}_l\), where \(\mathbf{w}_i \in \mathbb{R}^{L}\) has only \(K\) non-zero entries.
- Key Derivation: \(\mathbf{F} = \sum_{i \in \mathcal{N}} e_i \mathbf{f}_i = \sum_{i \in \mathcal{N}} e_i \mathbf{w}_i \mathcal{S} = \left( \sum_{i \in \mathcal{N}} e_i \mathbf{w}_i \right) \mathcal{S}\), where \(e_i\) is the alpha-blending weight of Gaussian \(i\).
- Novelty: Rendering \(D\)-dimensional features is equivalent to first rendering \(L\)-dimensional coefficients and then multiplying by the codebook, fully decoupling rendering dimensionality from feature dimensionality.
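The linearity identity above is easy to verify numerically. The sketch below (illustrative sizes and random data are assumptions) checks that blending \(D\)-dimensional features \(\mathbf{f}_i = \mathbf{w}_i \mathcal{S}\) gives the same pixel feature as blending the \(L\)-dimensional coefficients first and multiplying by the codebook once:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, D = 100, 64, 512            # N Gaussians along a ray (illustrative)
S = rng.standard_normal((L, D))   # global codebook
W = rng.random((N, L))            # per-Gaussian coefficient vectors w_i
e = rng.random(N)                 # alpha-blending weights e_i

# Blend D-dim features directly: sum_i e_i * (w_i S) -- O(N * D) per pixel
direct = (e[:, None] * (W @ S)).sum(axis=0)

# Blend L-dim coefficients, then one matmul: (sum_i e_i w_i) S -- O(N * L) blend
factored = (e @ W) @ S

assert np.allclose(direct, factored)  # identical up to floating-point error
```

Since alpha-blending is linear in the blended quantity, the expensive per-Gaussian \(D\)-dimensional work collapses into one codebook multiplication per pixel.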
- Efficient Sparse Coefficient Splatting:
- Function: Exploit sparsity to accelerate CUDA alpha-blending.
- Design Motivation: Standard splatting complexity is \(O(|\mathcal{N}| \cdot L)\), which becomes a bottleneck for large \(L\).
- Mechanism: Each Gaussian stores only top-\(K\) indices and coefficient values; alpha-blending operates exclusively on the \(K\) non-zero elements.
- Complexity is reduced from \(O(|\mathcal{N}| \cdot L)\) to \(O(|\mathcal{N}| \cdot K)\).
- In practice, \(K=4\); rendering three semantic scales in parallel yields an effective dimensionality of only 12.
- Novelty: Rendering speed is completely decoupled from feature dimensionality.
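A minimal sketch of the top-\(K\) blending idea, assuming a simple scatter-add formulation (the actual method uses a fused CUDA kernel; the numpy loop here only demonstrates that touching \(K\) entries per Gaussian reproduces the dense \(O(|\mathcal{N}| \cdot L)\) result):

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, K = 100, 64, 4
raw = rng.random((N, L))

# Each Gaussian stores only its top-K codebook indices and coefficient values.
idx = np.argsort(raw, axis=1)[:, -K:]         # (N, K) indices of largest entries
val = np.take_along_axis(raw, idx, axis=1)    # (N, K) retained values
e = rng.random(N)                             # alpha-blending weights

# Dense reference blend over all L entries: O(N * L)
dense_w = np.zeros_like(raw)
np.put_along_axis(dense_w, idx, val, axis=1)  # zero out everything but top-K
dense_blend = e @ dense_w                     # (L,)

# Sparse blend: scatter-add only K non-zeros per Gaussian: O(N * K)
sparse_blend = np.zeros(L)
for i in range(N):
    sparse_blend[idx[i]] += e[i] * val[i]

assert np.allclose(dense_blend, sparse_blend)
```

With \(K = 4\) and \(L = 64\), the inner loop does 16× less work per Gaussian, which is the source of the reported speedup.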
- Global Codebook Learning:
- Function: Learn \(L\) basis vectors of dimension \(D\) for the entire scene.
- Design Motivation: Capture a compact representation of all distinct semantic concepts present in the scene.
- Mechanism: The \(L\)-dimensional sparse coefficients are softmax-normalized, top-\(K\) entries are retained and re-normalized, and the codebook is learned end-to-end jointly with the coefficients.
- Parameters: \(L=64\), \(K=4\), \(D=512\) (CLIP feature dimension).
- Novelty: No loss from dimensionality compression — modeling is performed directly in CLIP space.
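The softmax-then-top-\(K\) sparsification step can be sketched as follows. The function name and exact renormalization are assumptions based on the description above, not the released code:

```python
import numpy as np

def sparsify(logits, k=4):
    """Softmax-normalize an L-dim coefficient vector, keep the top-k entries,
    and renormalize the retained mass to sum to 1 (assumed form)."""
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    topk = np.argsort(p)[-k:]          # indices of the k largest coefficients
    w = np.zeros_like(p)
    w[topk] = p[topk] / p[topk].sum()  # renormalize over the kept entries
    return w

w = sparsify(np.random.default_rng(3).standard_normal(64), k=4)
# w has exactly 4 non-zero entries and sums to 1
```

Because the coefficients stay non-negative and sum to one, each Gaussian's feature is a convex combination of codebook atoms, keeping the learned features inside the span of CLIP space.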
Loss & Training¶
- 3D Gaussians are first trained with RGB supervision for 30,000 iterations.
- Gaussian parameters are then frozen and the sparse coefficient field is trained for 10,000 iterations.
- OpenCLIP ViT-B/16 is used for CLIP feature extraction; SAM ViT-H is used for semantic segmentation.
- Three SAM-level semantic granularities are modeled simultaneously.
Key Experimental Results¶
Main Results¶
LERF Dataset — 3D Open-Vocabulary Localization and Segmentation:
| Method | Localization Acc (%) | Segmentation IoU (%) | Speed (FPS) |
|---|---|---|---|
| LERF | — | — | ~0.04 |
| LangSplat | 84.3 | 51.4 | 8.2 |
| GAGS | 81.7 | 54.1 | — |
| LangSplatV2 | 84.1 | 59.9 | 384.6 |
Inference Time Breakdown (ms, A100 GPU):
| Method | Rendering | Decoding | Post-processing | Total | FPS |
|---|---|---|---|---|---|
| LangSplat | 6.0 | 83.1 | 33.0 | 122.1 | 8.2 |
| LangSplat* | 2.0 | 83.1 | 0.5 | 85.6 | 11.7 |
| LangSplatV2 | 2.0 | 0.1 | 0.5 | 2.6 | 384.6 |
Feature Rendering Time on Different GPUs (ms):
- RTX 3090 and RTX 4090 run out of memory at feature dimensionality ≥ 1024.
- LangSplatV2 rendering time does not grow with feature dimensionality (consistently ≈ 2 ms).
Ablation Study¶
Effect of Codebook Size \(L\) and Sparsity \(K\) (LERF dataset, Overall IoU %):
| L / K | K=2 | K=4 | K=8 |
|---|---|---|---|
| L=32 | 56.8 | 58.1 | 58.5 |
| L=64 | 57.5 | 59.9 | 59.7 |
| L=128 | 57.2 | 59.5 | 59.8 |
(\(L=64\), \(K=4\) achieves the best efficiency–accuracy trade-off.)
Key Findings¶
- Removing the decoder yields a 47× speedup (8.2 → 384.6 FPS) while simultaneously improving segmentation accuracy by 8.5 IoU points (51.4 → 59.9).
- The accuracy gain stems from eliminating information loss caused by dimensionality compression, as modeling is performed directly in CLIP space.
- \(K=4\) is sufficient for high-quality semantic representation; further increasing \(K\) yields marginal returns.
- LangSplatV2 outperforms LangSplat on the 3D-OVS and Mip-NeRF360 datasets as well.
- Complete decoupling of feature rendering speed from dimensionality is the core technical contribution.
Highlights & Insights¶
- Precise diagnosis of the decoding bottleneck: Quantitative analysis pinpoints that 97.1% of inference time is spent in the decoding stage, motivating a targeted solution.
- Elegant application of sparse coding: The observation that millions of Gaussians map to a finite set of semantic concepts is both intuitive and powerful.
- Concise and compelling mathematical derivation: The linearity of rendering enables a clean decoupling of feature dimensionality from rendering dimensionality.
- Engineering and theory in concert: The CUDA sparse optimization delivers the theoretically predicted speedup in practice.
Limitations & Future Work¶
- Codebook size \(L\) and sparsity \(K\) require manual configuration.
- The sparsity assumption may not hold in large-scale scenes with extremely rich semantics.
- Training proceeds in two stages (RGB → language); end-to-end joint training may yield further improvements.
- Adaptive \(K\) values could be explored, as different regions vary in semantic complexity.
Related Work & Insights¶
- LangSplat's autoencoder compression paradigm stands in sharp contrast to the sparse coding approach proposed here.
- 3D Gaussian compression methods (CompGS, LightGaussian) focus on RGB compression; this work addresses high-dimensional semantic features.
- Sparse coding has a well-established theoretical foundation in classical signal processing; its application to 3D scene understanding represents a meaningful innovation.
- The proposed approach has general utility for any task requiring high-dimensional feature splatting (CLIP / DINOv2 / SAM features).
Rating¶
- Novelty: ⭐⭐⭐⭐ — The sparse coefficient field concept is clear, elegant, and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-dataset evaluation, detailed timing analysis, and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ — The logical flow from problem analysis to method derivation to experimental validation is exceptionally coherent.
- Value: ⭐⭐⭐⭐⭐ — A 47× speedup with improved accuracy directly advances the practical deployment of 3D language fields.