Online Language Splatting¶
Conference: ICCV 2025 · arXiv: 2503.09447 · Project Page: https://saimouli.github.io/onlineLang · Area: 3D Vision · Keywords: 3D Gaussian Splatting, SLAM, Open-Vocabulary, Language Feature Embedding, CLIP, Real-Time Semantic Mapping
TL;DR¶
The first framework to achieve online, near-real-time, open-vocabulary language mapping within a 3DGS-SLAM system. Through three innovations—high-resolution CLIP embedding, two-stage online autoencoder compression, and decoupled color-language optimization—the method surpasses offline state-of-the-art in accuracy while achieving 40×–200× efficiency gains.
Background & Motivation¶
Embedding language features into 3D scene representations is a critical capability for human-robot interaction. Existing Lang-GS methods (e.g., LangSplat, Feature3DGS, LEGaussian) rely on SAM+CLIP for per-frame offline preprocessing, requiring several minutes per frame to generate pixel-level language features, severely limiting practical applicability.
Many real-world tasks—such as a service robot entering a new environment or an AR system enabling instant interaction—demand immediate scene understanding. While SLAM-GS methods (MonoGS, SplaTAM, etc.) can build geometric and appearance maps in near-real-time, they do not incorporate language features. Methods that use pre-annotated semantic maps are constrained to closed vocabularies, lacking open-vocabulary flexibility.
The core difficulty is how to efficiently integrate high-dimensional language features into 3D Gaussian representations while preserving open-vocabulary capability. This challenge decomposes into three sub-problems:
Real-time high-resolution CLIP embedding: Offline SAM+CLIP is the runtime bottleneck.
Open-vocabulary compression in online settings: Online methods cannot pretrain a compressor on the test scene, introducing a domain gap.
Conflict in joint color-language optimization: The two modalities favor different Gaussian parameters, and joint optimization degrades performance in both.
Method¶
Overall Architecture¶
The system is built on the MonoGS SLAM framework, using 3D Gaussians as the sole mapping primitive. The training pipeline comprises three core modules:
- High-Resolution CLIP Embedding Module: Generates high-resolution language feature maps from RGB images in real time.
- Two-Stage CLIP Compression Module: Compresses 768-dim CLIP features to 15 dimensions.
- Decoupled Color-Language Optimization: Separates gradient pathways for RGB and language.
At inference, the rendered low-dimensional language maps are decoded back to full CLIP features via the two-stage decoder, enabling open-vocabulary query-based object localization.
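To make the query path concrete, here is a minimal sketch of open-vocabulary localization at inference time. The decoder modules `olae_decoder` (15 → 32) and `general_decoder` (32 → 768) are hypothetical stand-ins for the paper's two-stage decoder; names and shapes are illustrative, not the authors' API.

```python
import torch
import torch.nn.functional as F
import open_clip

# ConvNeXt-L CLIP backbone (768-dim embeddings), as used by the paper.
model, _, _ = open_clip.create_model_and_transforms(
    "convnext_large_d_320", pretrained="laion2b_s29b_b131k_ft_soup")
tokenizer = open_clip.get_tokenizer("convnext_large_d_320")

def query_language_map(lang_map_15d, text, olae_decoder, general_decoder):
    """lang_map_15d: (H, W, 15) language map rendered from the Gaussians."""
    h, w, _ = lang_map_15d.shape
    feats = lang_map_15d.reshape(-1, 15)
    feats = general_decoder(olae_decoder(feats))   # 15 -> 32 -> 768
    feats = F.normalize(feats, dim=-1)

    with torch.no_grad():
        text_emb = model.encode_text(tokenizer([text]))  # (1, 768)
    text_emb = F.normalize(text_emb, dim=-1)

    sim = feats @ text_emb.T          # per-pixel cosine similarity
    return sim.reshape(h, w)          # relevancy map for the query
```

Thresholding or arg-maxing this relevancy map over a set of candidate labels yields the open-vocabulary object localization used in the evaluations below.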
Key Design 1: High-Resolution CLIP Embedding¶
Instead of offline multi-pass SAM+CLIP, a ConvNeXt-L pixel-level CLIP encoder generates coarse embedding maps (24×24×768), which are then upsampled to 192×192×768 via a lightweight Super-Resolution Decoder (SRD):
- The SRD leverages intermediate features from encoder stages 1 and 2 and progressively enhances resolution through two convolutional upsampling blocks.
- The SRD is trained in a supervised manner on COCO/Omnidata datasets, with labels generated by offline SAM+CLIP.
- Training loss: \(\mathcal{L} = 0.8 \cdot \mathcal{L}_{\text{cosine}} + \mathcal{L}_{\text{L1}} + 0.01 \cdot \mathcal{L}_{\text{TV}}\)
- The entire module (CLIP encoder + SRD) requires only 18ms/frame on an RTX-3090 and consumes 1.6GB of VRAM, with the SRD alone taking 2ms.
- High-resolution feature maps improve localization accuracy for small and distant objects, reducing feature bleeding.
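A minimal PyTorch sketch of such an SRD follows, assuming the CLIP encoder exposes its stage-1/stage-2 intermediate features. The skip channel widths match ConvNeXt-L's actual stage dims (192/384), but the block layout is a guess, not the published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """Upsample to the skip feature's resolution, fuse it, then convolve."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.GELU(),
        )

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear",
                          align_corners=False)
        return self.conv(torch.cat([x, skip], dim=1))

class SRD(nn.Module):
    """Lifts the coarse 24x24x768 CLIP map toward 192x192x768 using two
    upsampling blocks fed by encoder stage-2 and stage-1 features."""
    def __init__(self, feat_dim=768, skip1_ch=192, skip2_ch=384):
        super().__init__()
        self.up1 = UpBlock(feat_dim, skip2_ch, feat_dim)
        self.up2 = UpBlock(feat_dim, skip1_ch, feat_dim)
        self.head = nn.Conv2d(feat_dim, feat_dim, 1)

    def forward(self, coarse, stage1_feat, stage2_feat):
        x = self.up1(coarse, stage2_feat)
        x = self.up2(x, stage1_feat)
        x = F.interpolate(x, size=(192, 192), mode="bilinear",
                          align_corners=False)
        return self.head(x)

def srd_loss(pred, target):
    """Weighted cosine + L1 + total-variation loss from the paper."""
    cos = 1 - F.cosine_similarity(pred, target, dim=1).mean()
    l1 = F.l1_loss(pred, target)
    tv = pred.diff(dim=-1).abs().mean() + pred.diff(dim=-2).abs().mean()
    return 0.8 * cos + l1 + 0.01 * tv
```

The loss mirrors the weighting above: cosine similarity aligns semantics, L1 matches magnitudes, and the small TV term encourages spatial smoothness of the feature map.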
Key Design 2: Two-Stage Online CLIP Compression¶
Directly using 768-dim CLIP vectors is prohibitively expensive; effective compression is essential:
Stage 1 — General-Purpose Compressor:
- An 8-layer MLP autoencoder pretrained on a diverse dataset (COCO).
- Compresses 768-dim → 32-dim, exploiting intrinsic redundancy in language embeddings.
- The dimensionality must not be too low: excessive compression degrades open-vocabulary generalization across domains.
Stage 2 — Online-Learning Autoencoder (OLAE):
- A 2-layer MLP, 32-dim → 15-dim.
- Motivated by the observation that data variance within a single scene can be captured with fewer dimensions.
- Initialization: 200 iterations (6ms/iter); subsequently updated once per frame.
- Each iteration additionally samples 2 random keyframes to prevent catastrophic forgetting.
- Key advantage: outperforms even in-domain fine-tuned single autoencoders by adapting to the dominant data distribution of the current scene.
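The sketch below illustrates the two-stage chain under simple assumptions: the stage-1 autoencoder is read as eight linear layers split across encoder and decoder, and the per-frame update replays two random keyframes. Layer widths and the replay logic are illustrative, not the authors' exact configuration.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(dims):
    """Plain MLP: Linear layers with ReLU between hidden layers."""
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

class GeneralAE(nn.Module):
    """Stage 1: pretrained offline on diverse data (COCO), frozen at runtime.
    Encoder + decoder together give eight linear layers (one reading of
    the '8-layer MLP autoencoder'); hidden widths are guesses."""
    def __init__(self):
        super().__init__()
        self.enc = mlp([768, 384, 192, 64, 32])
        self.dec = mlp([32, 64, 192, 384, 768])

class OLAE(nn.Module):
    """Stage 2: tiny scene-adaptive autoencoder (32 -> 15), trained online."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(32, 15)
        self.dec = nn.Linear(15, 32)

def online_step(olae, opt, feats_32, keyframe_bank):
    """One per-frame OLAE update; replays 2 random keyframes to
    mitigate catastrophic forgetting."""
    replay = random.sample(keyframe_bank, k=min(2, len(keyframe_bank)))
    x = torch.cat([feats_32] + replay, dim=0)
    loss = F.l1_loss(olae.dec(olae.enc(x)), x)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

Because only the 2-layer OLAE is updated online, each step stays in the millisecond range while the frozen stage-1 compressor preserves cross-domain generalization.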
Key Design 3: Decoupled Color-Language Optimization¶
Experiments show that jointly optimizing RGB and language channels degrades both, as they share Gaussian parameters (\(\alpha, \mu, \Sigma\)). The root cause is that language features favor larger scales and different rotations (uniform semantic regions), while color requires fine-grained texture.
The proposed solution maintains independent rotation \(R\), scale \(S\), and opacity \(\alpha\) for the color and language branches, while sharing position \(\mu\) to avoid duplicating Gaussians.
Gradient pathways are fully separated: color gradients update only \(\alpha^c, R^c, S^c\), while language gradients update only \(\alpha^f, R^f, S^f\). Position \(\mu\) is updated only through the color branch, and camera pose estimation likewise relies solely on the color branch. An auxiliary constraint \(|S_i^f - \text{sg}(S_i^c)|\), where \(\text{sg}\) denotes stop-gradient, prevents the language branch from learning excessively skewed scales.
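A minimal sketch of the decoupled parameterization, with hypothetical attribute names (`rot_c`, `scale_f`, ...); the real system feeds these into a 3DGS rasterizer, which is omitted here.

```python
import torch
import torch.nn as nn

class DecoupledGaussians(nn.Module):
    """Per-Gaussian parameters with separate color/language branches."""
    def __init__(self, n):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(n, 3))       # shared position
        # Color branch: drives rendering, pose tracking, and mu updates.
        self.rot_c = nn.Parameter(torch.randn(n, 4))    # quaternion (normalize before use)
        self.scale_c = nn.Parameter(torch.zeros(n, 3))  # log-scale
        self.opacity_c = nn.Parameter(torch.zeros(n, 1))
        # Language branch: independent copies of the same attributes.
        self.rot_f = nn.Parameter(torch.randn(n, 4))
        self.scale_f = nn.Parameter(torch.zeros(n, 3))
        self.opacity_f = nn.Parameter(torch.zeros(n, 1))

    def language_params(self):
        # mu is detached so language gradients never move positions;
        # only the color branch updates geometry.
        return self.mu.detach(), self.rot_f, self.scale_f, self.opacity_f

def scale_constraint(g):
    """Keeps language scales near the stop-gradient color scales,
    preventing excessively skewed language Gaussians."""
    return (g.scale_f - g.scale_c.detach()).abs().mean()
```

The `detach()` calls implement the separated gradient pathways directly: rendering the language map through `language_params()` can never perturb geometry or pose.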
Loss & Training¶
- SLAM Tracking/Mapping: \(\mathcal{L} = \lambda|C^r - C^{\text{gt}}| + (1-\lambda)|D^r - D^{\text{gt}}|\), plus an isotropic scale regularization term.
- Language Features: L1 loss to align compressed language maps.
- SRD Training: Cosine + L1 + TV loss.
- Decoupling Constraint: Language scale regularization \(|S^f - \text{sg}(S^c)|\).
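A sketch of how these terms might be combined in one mapping iteration, reusing `scale_constraint` from the previous sketch; the weights, dictionary keys, and the isotropy proxy are placeholders, not values from the paper.

```python
import torch.nn.functional as F

def mapping_loss(render, gt, g, lam=0.9, w_lang=1.0, w_scale=0.1):
    """render/gt: dicts with 'color', 'depth', 'lang' maps (hypothetical keys);
    g: a DecoupledGaussians instance from the sketch above."""
    photometric = lam * F.l1_loss(render["color"], gt["color"])
    depth = (1 - lam) * F.l1_loss(render["depth"], gt["depth"])
    # Stand-in for the isotropic scale regularizer: penalize anisotropy
    # of the color-branch scales.
    iso = g.scale_c.exp().std(dim=-1).mean()
    # L1 alignment of the rendered 15-dim language map to the compressed target.
    lang = F.l1_loss(render["lang"], gt["lang"])
    return photometric + depth + iso + w_lang * lang + w_scale * scale_constraint(g)
```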
Key Experimental Results¶
Main Results: Comparison with Lang-GS SOTA (Replica Dataset)¶
| Method | Online | SRD | OLAE | mIoU ↑ | Loc ↑ | Time/Frame |
|---|---|---|---|---|---|---|
| LangSplat | ✗ | - | - | 0.417 | 0.720 | 2.8 min |
| Feature3DGS | ✗ | - | - | 0.359 | 0.755 | 2.3 min |
| LEGaussian | ✗ | - | - | 0.245 | 0.682 | 32 s |
| Ours (Omni, full) | ✓ | ✓ | ✓ | 0.487 | 0.826 | 0.8 s |
TUM RGB-D Dataset¶
| Method | Scene1 mIoU | Scene1 Loc | Scene2 mIoU | Scene2 Loc | Time/Frame |
|---|---|---|---|---|---|
| LangSplat | 0.646 | 0.850 | 0.538 | 0.783 | 2.1 min |
| Ours | 0.599 | 0.917 | 0.535 | 0.791 | 0.6 s |
SLAM-GS Evaluation (Replica)¶
| Method | Language | PSNR ↑ | SSIM ↑ | LPIPS ↓ | ATE(cm) ↓ |
|---|---|---|---|---|---|
| SplaTAM | ✗ | 33.39 | 0.968 | 0.101 | 0.392 |
| MonoGS | ✗ | 35.72 | 0.950 | 0.075 | 0.420 |
| Ours | ✓ | 35.81 | 0.950 | 0.072 | 0.397 |
Incorporating language mapping does not degrade rendering quality or tracking accuracy relative to the MonoGS baseline—in fact, marginal improvements are observed.
Ablation Study¶
| Configuration | mIoU ↑ | Loc ↑ | PSNR ↑ | ATE(cm) ↓ |
|---|---|---|---|---|
| Joint Optimization | 0.323 | 0.633 | 31.23 | 0.796 |
| Decoupled Optimization | 0.402 | 0.622 | 35.89 | 0.325 |
Decoupled optimization substantially improves rendering quality (+4.66 dB PSNR) and tracking accuracy (59% reduction in ATE), with notable gains in mIoU as well.
Key Findings¶
- Role of SRD: High-resolution feature maps significantly improve mIoU and Loc, especially for small and distant objects.
- Generalization of OLAE: Outperforms in-domain fine-tuned single autoencoders, yielding more stable performance on unseen scenes.
- Efficiency: The entire network pipeline requires only 21ms/frame (CLIP encoding 15ms + SRD 2ms + compression 6ms); the bottleneck is the MonoGS baseline. When integrated with Hi-SLAM, the system achieves 7.05 FPS.
- 3D Localization: Average CD/EMD metrics outperform LangSplat (0.38/0.97 vs. 0.43/5.63).
Highlights & Insights¶
- Precise problem formulation: The paper is the first to explicitly define the online open-vocabulary 3D language mapping problem, identifying three core sub-challenges and addressing each systematically.
- Elegant two-stage compression design: The general-purpose compressor maintains cross-domain generalization, while the online compressor adapts to the current scene—analogous to a "global features + local adaptation" paradigm that balances efficiency and expressiveness across the 768→32→15 compression chain.
- Deep insight into decoupled optimization: Visualization reveals that language and color modalities favor different Gaussian parameters, motivating a design that shares position while separating all other parameters—a simple yet effective engineering contribution.
- Online surpassing offline: Counter-intuitively, the online method achieves comprehensively superior accuracy over offline methods, benefiting from high-resolution CLIP embeddings and scene-adaptive compression.
- Plug-and-play compatibility: The framework can be integrated with different SLAM-GS backends (MonoGS, Hi-SLAM), demonstrating strong generality.
Limitations & Future Work¶
- Limited advantage on TUM RGB-D: Motion blur and low image quality hinder online tracking; offline methods retain an edge via 30k global optimization iterations.
- Overall speed bottlenecked by SLAM baseline: The network modules require only 21ms/frame, but the overall 0.6–0.8s/frame bottleneck lies in MonoGS.
- Limited SRD resolution: Upsampling from 24×24 to 192×192 (8× factor) may be insufficient for higher-resolution inputs.
- Unstable 3D localization on certain categories: Notably for rug and lamp, where EMD gaps are substantial.
- OLAE initialization overhead: The 200-iteration initial training phase may be insufficient for fast-moving scenes.
Related Work & Insights¶
- SLAM-GS: MonoGS, SplaTAM, and RTG-SLAM provide efficient online 3D mapping infrastructure.
- Offline Lang-GS: LangSplat pioneered embedding CLIP features into 3DGS but relies on offline SAM+CLIP preprocessing (~168s/frame).
- Efficient CLIP Encoding: The ConvNeXt pixel-level encoder combined with FeatUp-style upsampling is generalizable to other vision-language tasks.
- Inspiration: The two-stage compression scheme (general + scene-adaptive) has potential for other tasks requiring online embedding of high-dimensional features; the decoupled multi-modal optimization paradigm is equally applicable to integrating other modalities (e.g., audio, touch) into 3DGS.
Rating¶
| Dimension | Score (1–10) | Comment |
|---|---|---|
| Novelty | 8 | First online open-vocabulary 3D language mapping; pioneering problem formulation. |
| Technical Depth | 7 | Each module is well-motivated; decoupled optimization and two-stage compression demonstrate genuine insight. |
| Experimental Thoroughness | 8 | Replica + TUM dual-dataset evaluation, 2D/3D localization + SLAM metrics + ablation. |
| Writing Quality | 8 | Clear structure, rich figures, well-articulated motivation. |
| Value | 8 | Directly targets robotics and AR scenarios; the online + open-vocabulary combination has strong practical utility. |
| Overall | 8 | A highly complete system-level contribution that provides a comprehensive solution to an important problem. |