SGS-SLAM: Semantic Gaussian Splatting for Neural Dense SLAM¶
Conference: ECCV 2024
arXiv: 2402.03246
Code: GitHub
Area: 3D Vision
Keywords: SLAM, 3D Gaussian Splatting, Semantic Segmentation, Dense Reconstruction, Real-time Rendering
TL;DR¶
The paper proposes SGS-SLAM, the first semantic visual SLAM system based on Gaussian Splatting. By jointly optimizing multiple channels that encode appearance, geometry, and semantic features, it achieves state-of-the-art (SOTA) performance in camera pose estimation, map reconstruction, and semantic segmentation.
Background & Motivation¶
Key Challenge¶
Existing NeRF-based SLAM methods (such as NICE-SLAM and Co-SLAM) use MLPs as implicit representations, which suffer from three core issues: (1) over-smoothed object edges and a lack of fine detail; (2) difficulty in decoupling object representations, which hinders scene editing; (3) catastrophic forgetting, where learning newly observed regions degrades what the model has already learned. Meanwhile, existing Gaussian SLAM methods (such as SplaTAM) lack semantic understanding.
Method¶
Overall Architecture¶
SGS-SLAM represents scenes using isotropic 3D Gaussians, where each Gaussian carries a position, radius, opacity, RGB color, and a three-channel semantic color. The system consists of two core processes: tracking and mapping.
Key Designs¶
Multi-Channel Gaussian Representation: On top of the standard Gaussian parameters, a semantic color channel \(s_i = [r_i, g_i, b_i]^T\) is added to each Gaussian. The 2D semantic map is rendered using the same volume rendering formula as color and depth.
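To make the shared rendering formula concrete, the following is a minimal per-pixel sketch. It assumes the rendering is the front-to-back alpha compositing commonly used in 3D Gaussian Splatting, and that each Gaussian's effective opacity at the pixel has already been computed. The function name `composite_pixel` and the tuple layout are hypothetical and are not taken from the paper's code.

```python
from typing import List, Sequence, Tuple

# One splat as seen from a single pixel (hypothetical layout):
# (depth, alpha, rgb, semantic_rgb), where alpha is the effective
# opacity of the Gaussian at this pixel.
Splat = Tuple[float, float, Sequence[float], Sequence[float]]


def composite_pixel(splats: Sequence[Splat]) -> Tuple[List[float], float, List[float]]:
    """Front-to-back alpha compositing of color, depth, and semantics."""
    color = [0.0, 0.0, 0.0]
    semantic = [0.0, 0.0, 0.0]
    depth = 0.0
    transmittance = 1.0  # fraction of light not yet absorbed

    # Nearest splats are composited first.
    for d, alpha, rgb, sem in sorted(splats, key=lambda s: s[0]):
        weight = alpha * transmittance
        for k in range(3):
            color[k] += weight * rgb[k]
            semantic[k] += weight * sem[k]  # same weights as the color channel
        depth += weight * d
        transmittance *= 1.0 - alpha
    return color, depth, semantic


if __name__ == "__main__":
    pixel = [
        (2.0, 0.5, (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)),
        (1.0, 0.5, (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
    ]
    print(composite_pixel(pixel))
```

The point of the sketch is that the semantic channel reuses exactly the weights computed for color and depth, so adding it does not require a separate rendering pass.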
Semantic-Guided Keyframe Selection: A two-level screening strategy is used. (1) Geometric overlap filtering: sampled Gaussians are projected onto each keyframe view to compute the overlap ratio \(\eta\), and keyframes whose ratio falls below the threshold are filtered out. (2) Semantic filtering: keyframes whose semantic maps have an excessively high mIoU with the current frame are filtered out, which prioritizes keyframes from different perspectives. In addition, a timestamp-based uncertainty weight \(\mathcal{U}(t) = e^{-\tau t}\) is introduced.
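The two-level screening can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the overlap ratio \(\eta\) is assumed to be precomputed per keyframe, semantic maps are flattened lists of integer labels compared against the current frame, and the names `select_keyframes`, `eta_min`, and `miou_max` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def mean_iou(a: Sequence[int], b: Sequence[int]) -> float:
    """mIoU of two non-empty, equally sized label maps, over classes present in either."""
    ious = []
    for c in set(a) | set(b):
        inter = sum(1 for x, y in zip(a, b) if x == c and y == c)
        union = sum(1 for x, y in zip(a, b) if x == c or y == c)
        ious.append(inter / union)
    return sum(ious) / len(ious)


def uncertainty_weight(t: float, tau: float) -> float:
    """U(t) = exp(-tau * t); the weight decays as t grows."""
    return math.exp(-tau * t)


def select_keyframes(
    candidates: List[Dict],
    current_labels: Sequence[int],
    eta_min: float,   # hypothetical overlap threshold
    miou_max: float,  # hypothetical semantic-similarity ceiling
) -> List[str]:
    selected = []
    for kf in candidates:
        # Level 1: drop keyframes with too little geometric overlap.
        if kf["overlap"] < eta_min:
            continue
        # Level 2: drop keyframes that are semantically near-identical
        # to the current frame, favoring different viewpoints.
        if mean_iou(kf["labels"], current_labels) > miou_max:
            continue
        selected.append(kf["id"])
    return selected


if __name__ == "__main__":
    current = [0, 0, 1, 1]
    keyframes = [
        {"id": "A", "overlap": 0.1, "labels": [0, 1, 1, 0]},
        {"id": "B", "overlap": 0.8, "labels": [0, 0, 1, 1]},
        {"id": "C", "overlap": 0.6, "labels": [0, 1, 1, 1]},
    ]
    print(select_keyframes(keyframes, current, eta_min=0.3, miou_max=0.9))
```

In this toy input, keyframe A fails the overlap test, B is rejected as semantically redundant, and only C survives both levels.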
Multi-Channel Joint Optimization: The tracking loss combines three channels: depth L1, color L1, and semantic L1. The mapping loss additionally applies a weighted SSIM term to the color and semantic images.
Loss & Training¶
- Tracking Loss: \(\mathcal{L}_{tracking} = \lambda_D|D^{GT} - D| + \lambda_C|C^{GT} - C| + \lambda_S|S^{GT} - S|\)
- Mapping Loss: \(\mathcal{L}_{mapping} = \mathcal{U}(t)\left(\lambda_D|D^{GT} - D| + \lambda_C\mathcal{L}_C + \lambda_S\mathcal{L}_S\right)\)
- where both \(\mathcal{L}_C\) and \(\mathcal{L}_S\) adopt a hybrid loss of L1+SSIM.
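The two losses above can be sketched on flattened images as follows. Several parts are assumptions made for illustration: the SSIM here is a simplified single-window version computed from global statistics (real implementations use local sliding windows), the L1/SSIM mixing weight `ssim_weight=0.2` follows a common 3D Gaussian Splatting convention rather than a value stated in this summary, and all function names are hypothetical.

```python
from typing import Dict, Sequence

Image = Sequence[float]  # flattened image with values in [0, 1]


def l1(a: Image, b: Image) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def global_ssim(a: Image, b: Image, c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> float:
    """Simplified SSIM from whole-image statistics (assumption: no sliding window)."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den


def hybrid_loss(a: Image, b: Image, ssim_weight: float = 0.2) -> float:
    """L1 + SSIM mix; the 0.2 weight is an assumed convention."""
    return (1 - ssim_weight) * l1(a, b) + ssim_weight * (1 - global_ssim(a, b))


def tracking_loss(pred: Dict[str, Image], gt: Dict[str, Image],
                  lam_d: float, lam_c: float, lam_s: float) -> float:
    return (lam_d * l1(gt["depth"], pred["depth"])
            + lam_c * l1(gt["color"], pred["color"])
            + lam_s * l1(gt["semantic"], pred["semantic"]))


def mapping_loss(pred: Dict[str, Image], gt: Dict[str, Image], u_t: float,
                 lam_d: float, lam_c: float, lam_s: float) -> float:
    # u_t is the keyframe's uncertainty weight U(t).
    return u_t * (lam_d * l1(gt["depth"], pred["depth"])
                  + lam_c * hybrid_loss(gt["color"], pred["color"])
                  + lam_s * hybrid_loss(gt["semantic"], pred["semantic"]))


if __name__ == "__main__":
    gt = {"depth": [1.5, 2.0], "color": [0.2, 0.8], "semantic": [0.0, 1.0]}
    pred = {"depth": [1.0, 2.0], "color": [0.2, 0.8], "semantic": [0.0, 1.0]}
    print(tracking_loss(pred, gt, 1.0, 0.5, 0.5))
    print(mapping_loss(pred, gt, 0.9, 1.0, 0.5, 0.5))
```

The only structural difference between the two functions is that mapping swaps the plain L1 on color and semantics for the hybrid term and scales the sum by \(\mathcal{U}(t)\).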
Key Experimental Results¶
Main Results¶
Comparison of rendering quality (PSNR, SSIM, LPIPS), depth reconstruction error, and tracking error (ATE RMSE) on the Replica dataset (average of 8 scenes):
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | Depth L1 (cm)↓ | ATE RMSE (cm)↓ |
|---|---|---|---|---|---|
| NICE-SLAM | 24.42 | 0.809 | 0.233 | 1.903 | 2.503 |
| Co-SLAM | 30.24 | 0.939 | 0.252 | 1.513 | 1.059 |
| ESLAM | 29.08 | 0.929 | 0.336 | 1.180 | 0.630 |
| SplaTAM | 33.98 | 0.969 | 0.099 | 0.525 | 0.454 |
| SGS-SLAM | 34.66 | 0.973 | 0.096 | 0.356 | 0.412 |
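As a quick check on how the last two rows compare, the relative changes of SGS-SLAM versus SplaTAM can be computed directly from the table. The numbers below are copied from the table; the helper name `relative_reduction` is only illustrative.

```python
# Values copied from the SplaTAM and SGS-SLAM rows of the table above.
splatam = {"psnr": 33.98, "depth_l1_cm": 0.525, "ate_rmse_cm": 0.454}
sgs_slam = {"psnr": 34.66, "depth_l1_cm": 0.356, "ate_rmse_cm": 0.412}


def relative_reduction(baseline: float, value: float) -> float:
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


depth_reduction = relative_reduction(splatam["depth_l1_cm"], sgs_slam["depth_l1_cm"])
ate_reduction = relative_reduction(splatam["ate_rmse_cm"], sgs_slam["ate_rmse_cm"])
psnr_gain_db = sgs_slam["psnr"] - splatam["psnr"]

print(f"Depth L1 reduction: {depth_reduction:.1%}")  # 32.2%
print(f"ATE RMSE reduction: {ate_reduction:.1%}")    # 9.3%
print(f"PSNR gain: {psnr_gain_db:.2f} dB")           # 0.68 dB
```

The largest relative change is in depth error, while the rendering metrics improve only slightly.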
Semantic Segmentation Results¶
Semantic segmentation accuracy on the Replica dataset (mIoU%):
| Method | Average mIoU↑ | Room0 | Room1 | Room2 | Office0 |
|---|---|---|---|---|---|
| NIDS-SLAM | 82.37 | 82.45 | 84.08 | 76.99 | 85.94 |
| DNS-SLAM | 84.77 | 88.32 | 84.90 | 81.20 | 84.66 |
| SNI-SLAM | 87.41 | 88.42 | 87.43 | 86.16 | 87.63 |
| SGS-SLAM | >90 | - | - | - | - |
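SGS-SLAM reports an average mIoU above 90%, exceeding every NeRF-based semantic SLAM baseline in the table. Because only a lower bound is listed for SGS-SLAM, the exact margin cannot be read from the table; it is at least 2.59 points over the strongest baseline, SNI-SLAM (87.41%).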
Key Findings¶
- The depth L1 error is reduced by approximately 32% compared to SplaTAM (0.356 cm vs. 0.525 cm), which suggests that semantic information helps geometric reconstruction.
- The multi-channel optimization strategy allows tracking and mapping to mutually benefit each other.
- The explicit Gaussian representation naturally supports object-level scene editing operations.
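To illustrate why an explicit representation makes object-level editing straightforward, here is a minimal sketch that selects Gaussians by their semantic color and then removes or translates them. The `Gaussian` dataclass mirrors the parameters listed in the architecture description, but the class, the function names, and the color tolerance `tol` are hypothetical and not taken from the paper's code.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Gaussian:
    position: Vec3
    radius: float
    opacity: float
    rgb: Vec3
    semantic: Vec3  # semantic color identifying the object class


def belongs_to(g: Gaussian, semantic_color: Vec3, tol: float = 0.05) -> bool:
    """True if the Gaussian's semantic color matches within a tolerance."""
    return all(abs(a - b) <= tol for a, b in zip(g.semantic, semantic_color))


def remove_object(gaussians: List[Gaussian], semantic_color: Vec3,
                  tol: float = 0.05) -> List[Gaussian]:
    """Delete every Gaussian labeled with the given semantic color."""
    return [g for g in gaussians if not belongs_to(g, semantic_color, tol)]


def translate_object(gaussians: List[Gaussian], semantic_color: Vec3,
                     offset: Vec3, tol: float = 0.05) -> List[Gaussian]:
    """Shift only the Gaussians labeled with the given semantic color."""
    moved = []
    for g in gaussians:
        if belongs_to(g, semantic_color, tol):
            new_pos = tuple(p + o for p, o in zip(g.position, offset))
            g = replace(g, position=new_pos)
        moved.append(g)
    return moved


if __name__ == "__main__":
    chair, floor = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
    scene = [
        Gaussian((0.0, 0.0, 0.0), 0.1, 0.9, (0.5, 0.5, 0.5), chair),
        Gaussian((1.0, 0.0, 0.0), 0.1, 0.9, (0.2, 0.2, 0.2), floor),
    ]
    print(len(remove_object(scene, chair)))
    print(translate_object(scene, chair, (0.0, 0.0, 1.0))[0].position)
```

Because each Gaussian stores its own semantic color, editing reduces to filtering a list; an implicit MLP representation offers no comparable handle on individual objects.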
Highlights & Insights¶
- Semantic feature loss effectively addresses the deficiencies of traditional depth and color losses in object optimization.
- Appending semantic channels directly to Gaussians is highly elegant, avoiding the complex multi-stage model designs in NeRF methods.
- The semantic-guided keyframe selection strategy helps limit reconstruction errors caused by accumulated tracking drift.
Limitations & Future Work¶
- Reliance on 2D semantic priors (provided by datasets or off-the-shelf models).
- The assumption of isotropic Gaussians may limit the representational capacity in complex scenes.
- Scalability has not been validated in large-scale scenes.
Related Work & Insights¶
Integrating semantic information into Gaussian SLAM in this manner is simple yet effective, laying the foundation for future work (such as dynamic-scene semantic SLAM). The two-level keyframe selection strategy is highly instructive.
Rating¶
- Novelty: ⭐⭐⭐⭐
- Practicality: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐