Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps¶
Conference: ICCV 2025 arXiv: 2507.03737 Code: Project Page Area: 3D Vision Keywords: 3DGS SLAM, Monocular Vision, Scale Consistency, Pointmap, Outdoor Scenes
TL;DR¶
This paper proposes S3PO-GS, a monocular RGB-only outdoor SLAM system that anchors pose estimation to 3DGS-rendered pointmaps for scale self-consistency, and employs a patch-based dynamic mapping mechanism, achieving high-accuracy localization without cumulative scale drift and high-fidelity novel view synthesis.
Background & Motivation¶
3D Gaussian Splatting (3DGS) has become a popular choice for SLAM due to its high-fidelity real-time rendering, but faces two core challenges in RGB-only outdoor scenes:
Lack of geometric priors: Tracking methods based on differentiable rendering pipelines (e.g., MonoGS) tend to fall into local optima in complex outdoor environments and suffer from convergence difficulties.
Scale drift: Introducing independent tracking modules (e.g., Photo-SLAM, OpenGS-SLAM) supplements geometric constraints but requires maintaining scale alignment between the external module and the 3DGS map. Under large rotations and translations, cumulative errors lead to severe scale drift.
Core Insight: Pre-trained pointmap models (e.g., MASt3R) can provide geometric priors, but their scale is inconsistent with the 3DGS scene. The authors propose not involving the pre-trained model in pose estimation itself, but using it solely to establish pixel-level 2D–3D correspondences — with 3D coordinates derived entirely from 3DGS-rendered pointmaps — fundamentally eliminating the scale alignment problem.
Method¶
Overall Architecture¶
S3PO-GS comprises three core modules: 3DGS scene representation, self-consistent 3DGS Pointmap Tracking, and Patch-based Dynamic Mapping. The system initializes the 3DGS map using MASt3R (optimized for 1000 steps), then performs tracking and mapping for each new frame.
Key Designs¶
- Pointmap-Anchored Pose Estimation (PAPE):
- A 3DGS depth map \(D_{ak}\) is rendered from the viewpoint of an adjacent keyframe; the rendered pointmap \(X_{ak}^r\) is obtained by back-projecting through camera intrinsics.
- A pre-trained model (MASt3R/DUSt3R) generates pointmaps \(X_{ak}^p, X_n^p\) for the adjacent keyframe and the current frame; inter-frame pixel correspondences are established via nearest-neighbor matching.
- The correspondence chain \(X_{ak}^r \leftrightarrow I_{ak} \leftrightarrow I_n\) is propagated to yield 2D–3D correspondences.
- RANSAC + PnP is applied to solve the relative pose \(\mathbf{T}_n^{rel}\).
- Key advantage: 3D coordinates originate from 3DGS rendering and are inherently scale-consistent with the scene; the pre-trained model serves only as a "bridge" without participating in pose computation.
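The correspondence chain above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: `backproject`, `collect_2d3d`, and the toy intrinsics are hypothetical names/values, and the final `RANSAC + PnP` solve (e.g. OpenCV's `cv2.solvePnPRansac`) is omitted to keep the sketch dependency-free.

```python
import numpy as np

def backproject(depth, K):
    """Back-project a rendered 3DGS depth map into a per-pixel pointmap
    (camera coordinates) using the intrinsics K."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)  # (H, W, 3)

def collect_2d3d(pointmap_ak, matches):
    """Chain X_ak^r <-> I_ak <-> I_n: look up each matched keyframe pixel
    in the rendered pointmap and pair it with its current-frame pixel.
    matches: list of ((u_ak, v_ak), (u_n, v_n)) pixel correspondences."""
    pts3d = np.array([pointmap_ak[v, u] for (u, v), _ in matches])
    pts2d = np.array([uv_n for _, uv_n in matches])
    return pts3d, pts2d  # feed these to RANSAC + PnP

# toy intrinsics and a constant 5 m depth map
K = np.array([[500., 0., 32.], [0., 500., 24.], [0., 0., 1.]])
pm = backproject(np.full((48, 64), 5.0), K)
pts3d, pts2d = collect_2d3d(pm, [((32, 24), (33, 24)), ((10, 10), (11, 10))])
```

Because every 3D point in `pts3d` comes from the rendered pointmap, the PnP solution is expressed directly in the 3DGS map's scale.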
- Pose Refinement:
- Starting from the PnP-initialized pose, the photometric loss \(L_{pho} = \|I(\mathcal{G}, T) - \bar{I}\|_1\) is minimized through the 3DGS differentiable rendering pipeline.
- The pose \(T \in SE(3)\) is linearized in the Lie algebra \(\mathfrak{se}(3)\), with gradients computed explicitly within the CUDA pipeline.
- Only 5 iterations are required to achieve accuracy comparable to 100 iterations in conventional methods.
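A minimal sketch of this refinement loop, assuming a left-multiplicative \(\mathfrak{se}(3)\) perturbation and substituting finite-difference gradients for the paper's analytic CUDA gradients; `se3_exp`, `refine_pose`, and the toy loss are hypothetical stand-ins.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0., -w[2], w[1]],
                     [w[2], 0., -w[0]],
                     [-w[1], w[0], 0.]])

def se3_exp(xi):
    """Exponential map se(3) -> SE(3); xi = (rho, omega) as a 6-vector."""
    rho, w = xi[:3], xi[3:]
    th = np.linalg.norm(w)
    W = hat(w)
    if th < 1e-8:
        R, V = np.eye(3) + W, np.eye(3)
    else:
        R = (np.eye(3) + np.sin(th) / th * W
             + (1 - np.cos(th)) / th**2 * W @ W)
        V = (np.eye(3) + (1 - np.cos(th)) / th**2 * W
             + (th - np.sin(th)) / th**3 * W @ W)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ rho
    return T

def refine_pose(T, loss, iters=5, lr=1e-2, eps=1e-5):
    """Gradient descent on a left-multiplicative se(3) perturbation;
    finite differences stand in for analytic rendering gradients."""
    for _ in range(iters):
        g = np.zeros(6)
        for i in range(6):
            d = np.zeros(6)
            d[i] = eps
            g[i] = (loss(se3_exp(d) @ T) - loss(se3_exp(-d) @ T)) / (2 * eps)
        T = se3_exp(-lr * g) @ T
    return T

# toy "photometric" loss: L1 distance of the translation to a target
target = np.array([0.03, 0.0, 0.0])
loss = lambda T: np.abs(T[:3, 3] - target).sum()
T0 = np.eye(4)
T5 = refine_pose(T0, loss)
```

The key point mirrored here is that the pose update lives in the Lie algebra, so no explicit parameterization of rotations is ever optimized directly.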
- Patch-Level Scale Alignment:
- The rendered pointmap \(X^r\) and pre-trained pointmap \(X^p\) are divided into \(P \times P\) patches.
- Patches with similar statistical distributions (mean/standard deviation differences below thresholds) are selected for normalization.
- A set of "correct points" \(CP\) is identified in the normalized space, and the scaling factor is computed as: \(\sigma' = \frac{\mu(X^r[CP])}{\mu(X^p[CP])}\)
- Iterative alignment continues until the scaling factor stabilizes, yielding \(\hat{X}^p = \sigma' \times X^p\).
- If insufficient correct points are found, a remedial scaling factor is computed using the aligned pointmap of an adjacent keyframe.
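The iteration above can be sketched as follows. This is an illustrative simplification under assumed names and thresholds (`patch_tol`, `pt_tol`, the median-ratio initialization): patches whose means disagree after scaling are rejected, and the surviving "correct points" drive the scale update until it stabilizes.

```python
import numpy as np

def patch_scale_align(Xr, Xp, P=8, patch_tol=0.1, pt_tol=0.1, iters=5):
    """Estimate a single scale factor sigma' mapping pre-trained depths Xp
    onto rendered depths Xr, using only patches whose statistics agree."""
    sigma = np.median(Xr) / np.median(Xp)  # robust initial guess
    H, W = Xr.shape
    for _ in range(iters):
        cp = np.zeros((H, W), dtype=bool)  # mask of "correct points"
        for i in range(0, H - P + 1, P):
            for j in range(0, W - P + 1, P):
                r = Xr[i:i+P, j:j+P]
                p = sigma * Xp[i:i+P, j:j+P]
                if abs(r.mean() - p.mean()) < patch_tol * r.mean():
                    cp[i:i+P, j:j+P] = np.abs(r - p) < pt_tol * r
        if not cp.any():
            break  # the paper falls back to an adjacent keyframe's scale here
        new_sigma = Xr[cp].mean() / Xp[cp].mean()
        if abs(new_sigma - sigma) < 1e-4 * sigma:
            return new_sigma  # scaling factor has stabilized
        sigma = new_sigma
    return sigma

# synthetic check: Xp is Xr at half scale, with one badly-scaled patch
rng = np.random.default_rng(0)
Xr = rng.uniform(5.0, 10.0, (32, 32))
Xp = Xr / 2.0
Xp[:8, :8] *= 10.0  # outlier patch (e.g. an unreliable prior region)
sigma = patch_scale_align(Xr, Xp)
```

The outlier patch never enters the "correct point" set, so the recovered factor matches the true 2x scale rather than a contaminated global average.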
- Pointmap Replacement Mechanism:
- When inserting new Gaussians at keyframes, the aligned pre-trained pointmap \(\hat{X}^p\) is used to detect "erroneous points" in the rendered pointmap \(X^r\).
- Points with deviation exceeding the threshold are replaced: \(\hat{X}^r(x) = \hat{X}^p(x)\) if \(|X^r(x) - \hat{X}^p(x)| > \epsilon_m \times \hat{X}^p(x)\)
- Random sparse downsampling controls the number of Gaussians.
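A minimal sketch of this replacement rule, assuming hypothetical names (`replace_points`, `eps_m`, `keep`): points whose relative deviation from the aligned prior exceeds \(\epsilon_m\) are overwritten, and the replaced pixels are randomly downsampled before new Gaussians are inserted.

```python
import numpy as np

def replace_points(Xr, Xp_hat, eps_m=0.1, keep=0.25, seed=0):
    """Overwrite rendered points whose relative deviation from the aligned
    prior exceeds eps_m, then randomly downsample the replaced pixels to
    limit how many new Gaussians get inserted."""
    out = Xr.copy()
    bad = np.abs(Xr - Xp_hat) > eps_m * np.abs(Xp_hat)
    out[bad] = Xp_hat[bad]
    idx = np.flatnonzero(bad)  # flat indices of replaced pixels
    if idx.size == 0:
        return out, idx
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(keep * idx.size))
    return out, rng.choice(idx, size=n_keep, replace=False)

# one erroneous rendered depth gets pulled back to the prior
Xr = np.full((16, 16), 5.0)
Xr[0, 0] = 10.0
fixed, chosen = replace_points(Xr, np.full((16, 16), 5.0))
```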
Loss & Training¶
The total loss jointly optimizes poses and the Gaussian map over a keyframe window with three terms:
- Photometric loss \(L_{pho}\): L1 rendering reconstruction loss
- Geometric loss \(L_{geo} = \|X^r - \hat{X}^p\|_1\): pointmap alignment supervision
- Isotropy regularization \(L_{iso}\): prevents excessive elongation of Gaussian ellipsoids
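The three terms combine as sketched below; the loss weights `lam_geo` and `lam_iso` are hypothetical placeholders, not the paper's values, and the isotropy term is one plausible form (per-axis Gaussian scales pulled toward their mean).

```python
import numpy as np

def total_loss(I_render, I_gt, Xr, Xp_hat, scales, lam_geo=0.05, lam_iso=10.0):
    """L = L_pho + lam_geo * L_geo + lam_iso * L_iso (weights hypothetical)."""
    L_pho = np.abs(I_render - I_gt).mean()          # L1 photometric term
    L_geo = np.abs(Xr - Xp_hat).mean()              # pointmap alignment term
    # isotropy: penalize per-axis Gaussian scales deviating from their mean
    L_iso = np.abs(scales - scales.mean(axis=1, keepdims=True)).mean()
    return L_pho + lam_geo * L_geo + lam_iso * L_iso

# a perfect reconstruction with isotropic Gaussians gives zero loss
zero = total_loss(np.zeros((8, 8, 3)), np.zeros((8, 8, 3)),
                  np.ones((8, 8)), np.ones((8, 8)), np.ones((16, 3)))
```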
Key Experimental Results¶
Main Results¶
Comprehensive evaluation on three outdoor datasets — Waymo, KITTI, and DL3DV:
| Dataset | Metric | S3PO-GS (Ours) | OpenGS-SLAM | MonoGS | GlORIE-SLAM |
|---|---|---|---|---|---|
| Waymo | ATE↓ | 0.622 | 0.839 | 8.529 | 0.589 |
| Waymo | PSNR↑ | 26.73 | 23.99 | 21.80 | 18.83 |
| KITTI | ATE↓ | 1.048 | 3.224 | 9.493 | 1.134 |
| KITTI | PSNR↑ | 20.03 | 15.61 | 14.78 | 15.49 |
| DL3DV | ATE↓ | 0.032 | 0.141 | 0.274 | 0.492 |
| DL3DV | PSNR↑ | 29.97 | 24.75 | 24.99 | 16.20 |
NVS PSNR improvements over the best baseline are +2.73/+4.42/+4.98 dB on Waymo/KITTI/DL3DV respectively; tracking achieves the best ATE on KITTI and DL3DV, and is comparable to GlORIE-SLAM on Waymo (0.622 vs. 0.589).
Ablation Study¶
| Configuration | ATE↓ | PSNR↑ | Note |
|---|---|---|---|
| w/o pose refinement | 1.79 | 24.45 | Tracking degrades without refinement |
| w/o scale alignment | 3.50 | 23.49 | Unaligned pointmaps introduce erroneous supervision |
| w/o point replacement | 1.35 | 25.59 | Incorrect Gaussian insertion degrades map quality |
| w/o \(L_{geo}\) | 3.73 | 25.70 | Lack of geometric supervision reduces translation awareness |
| Full model | 0.62 | 26.73 | — |
Convergence iteration ablation (Waymo 405841 scene, ATE):
| Iterations | MonoGS | OpenGS-SLAM | S3PO-GS |
|---|---|---|---|
| 5 | 12.6 | 4.17 | 0.55 |
| 50 | 2.98 | 0.85 | 0.49 |
| 100 | 1.70 | 0.80 | 0.46 |
Key Findings¶
- The PAPE module enables pose estimation to converge in only 5 iterations (MonoGS requires 50+).
- Directly incorporating pre-trained pointmaps into MonoGS without the proposed processing leads to severe geometric blurring due to scale drift.
- Patch-level local alignment is more robust than global alignment, avoiding contamination from outliers.
Highlights & Insights¶
- Elegant design: The pre-trained model is used solely as a "matchmaker" (establishing correspondences) without directly participating in pose computation, eliminating scale alignment issues at the architectural level.
- Plug-and-play: The approach generalizes to other 3DGS SLAM systems that need to integrate external geometric priors.
- Rapid convergence: Stable accuracy is achieved in just 5 iterations, making it highly compatible with real-time systems.
Limitations & Future Work¶
- Loop closure detection is not incorporated; drift may still accumulate over long sequences.
- Dependence on the inference speed of pre-trained pointmap models may constrain real-time applicability.
- Evaluation is limited to static scenes; dynamic objects are not addressed.
Related Work & Insights¶
- MASt3R/DUSt3R: Provide powerful pre-trained pointmap priors, but direct application to SLAM introduces scale inconsistency.
- MonoGS: A pioneering 3DGS SLAM system, but suffers from convergence difficulties in outdoor scenes.
- GlORIE-SLAM: Achieves high tracking accuracy via inter-frame relationships, but offers only weak novel view synthesis (lowest PSNR in the results above).
Rating¶
- Novelty: ⭐⭐⭐⭐ (Scale self-consistent design is conceptually novel)
- Technical Depth: ⭐⭐⭐⭐ (Complete tracking-mapping pipeline)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Three datasets + thorough ablations)
- Practical Value: ⭐⭐⭐⭐ (Directly targets outdoor scenarios such as autonomous driving)