
6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting

Conference: ICCV 2025 arXiv: 2412.01543 Code: No public code Area: Autonomous Driving Keywords: 6D Object Pose Estimation, Gaussian Splatting, Online Tracking, Model-Free, RGB-D

TL;DR

Leveraging the efficient differentiable rendering capability of 2D Gaussian Splatting, this paper proposes a CAD-model-free online 6D object pose estimation and tracking method. By jointly optimizing a Gaussian object field and keyframe poses, it achieves approximately 5× speedup over BundleSDF while maintaining comparable accuracy.

Background & Motivation

  • Model-based methods (e.g., feature matching with CAD models) achieve high accuracy but require pre-prepared CAD models or annotated reference images, exhibiting poor scalability when faced with a large number of unknown objects.
  • Model-free methods such as BundleSDF achieve state-of-the-art results by jointly optimizing a Neural Object Field and a pose graph, but neural field training is extremely slow (~6.7 seconds per iteration), resulting in an effective tracking frequency of only ~0.4 Hz, which is insufficient for real-time applications.
  • Core Motivation: Gaussian Splatting's rasterization-based rendering is substantially faster than volume rendering. The question is whether its efficiency can replace neural radiance fields to dramatically improve the speed of 6D pose estimation while preserving accuracy.

Core Problem

How can 6D object pose estimation and 3D reconstruction be jointly optimized in real time from a monocular RGB-D video stream, without CAD models? Specific challenges include:

  1. Coarse pose initialization errors can cause the Gaussian field optimization to diverge.
  2. Unstable control of the Gaussian particle count degrades training efficiency.
  3. The trade-off between speed and accuracy must be carefully balanced.

Method

Overall Architecture

6DOPE-GS consists of four core modules that process an RGB-D video stream in a pipeline fashion:

  1. Coarse Pose Initialization:

    • SAM2 is used for target object segmentation and tracking.
    • LoFTR (a Transformer-based dense feature matcher) extracts feature point correspondences between adjacent frames.
    • Coarse inter-frame poses are obtained via RANSAC followed by nonlinear least-squares refinement (a minimal 3D-3D alignment sketch follows this list).
    • A keyframe memory pool is maintained, with new frames added based on spatial diversity criteria.
  2. Gaussian Object Field:

    • A visual-geometric model of the object is constructed based on 2D Gaussian Splatting (2DGS).
    • 2D Gaussian parameters and keyframe poses are jointly optimized.
    • Gradients from differentiable rendering are back-propagated to update both Gaussian parameters and camera poses simultaneously.
  3. Dynamic Keyframe Selection: Keyframe filtering based on spatial coverage and reconstruction confidence.

  4. Online Pose Graph Optimization: Real-time pose updates for subsequent incoming frames.
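
The coarse initialization in step 1 reduces to a classic 3D-3D alignment: LoFTR matches are lifted to 3D with the depth map, RANSAC rejects outliers, and the surviving correspondences yield the relative pose. Below is a minimal NumPy sketch, using a closed-form Kabsch fit as a stand-in for the paper's nonlinear least-squares step (function names and thresholds are illustrative, not from the paper):

```python
import numpy as np

def fit_rigid(P, Q):
    """Closed-form rigid fit (Kabsch): finds R, t with Q ~ P @ R.T + t.
    P, Q: (N, 3) matched 3D points lifted from LoFTR correspondences + depth."""
    cP, cQ = P.mean(0), Q.mean(0)
    H = (P - cP).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

def ransac_pose(P, Q, iters=200, thresh=0.01):
    """RANSAC over minimal 3-point samples; refits on all inliers at the end
    (assumes enough inliers survive; a robust system would check that)."""
    rng = np.random.default_rng(0)
    best = np.zeros(len(P), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(P), size=3, replace=False)
        R, t = fit_rigid(P[idx], Q[idx])
        inliers = np.linalg.norm(P @ R.T + t - Q, axis=1) < thresh
        if inliers.sum() > best.sum():
            best = inliers
    return fit_rigid(P[best], Q[best])
```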

Key Designs

① Choosing 2DGS over 3DGS:

  • 2DGS collapses each Gaussian particle into a planar disk surfel (scale along the z-axis set to zero), defined by two principal tangent axes and a normal direction.
  • Additional depth-distortion and normal-consistency constraints provide more accurate surface geometry alignment.
  • Experiments confirm that 2DGS outperforms 3DGS in both pose accuracy and reconstruction quality.
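
To make the surfel representation concrete, here is a minimal NumPy sketch of a single 2D Gaussian disk (names and values are illustrative, not from the paper): two scaled tangent axes span the disk, the z-scale is zero by construction, and the normal falls out of the cross product.

```python
import numpy as np

# One 2DGS-style surfel: center p, scaled tangent axes t_u, t_v, normal n.
p = np.array([0.0, 0.0, 1.0])                 # disk center
s_u, s_v = 0.02, 0.01                         # per-axis scales; no z-scale
t_u = s_u * np.array([1.0, 0.0, 0.0])
t_v = s_v * np.array([0.0, 1.0, 0.0])
n = np.cross(t_u, t_v)
n /= np.linalg.norm(n)                        # unit normal of the planar disk

def surfel_point(u, v):
    """Map local disk coordinates (u, v) to a 3D point on the surfel plane."""
    return p + u * t_u + v * t_v

def surfel_weight(u, v):
    """2DGS Gaussian falloff in local coordinates: exp(-(u^2 + v^2) / 2)."""
    return np.exp(-0.5 * (u * u + v * v))
```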

② Dynamic Keyframe Selection — a two-stage filtering strategy:

  • Spatial coverage maximization: Icosahedral anchor points are placed around the object; keyframes are clustered to the nearest anchor based on pose, and the keyframe with the largest object mask is selected per cluster, maximizing information coverage under sparse viewpoints.
  • Outlier rejection via reconstruction confidence: The median absolute deviation (MAD) of the reconstruction loss is computed at each iteration; viewpoints whose deviation exceeds 3× MAD are flagged as outliers and removed. MAD is more robust to extreme values than the mean/standard deviation.
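
A minimal PyTorch sketch of both stages, assuming keyframe camera centers and per-keyframe reconstruction losses are available (anchor construction and the per-cluster largest-mask selection are omitted; names are illustrative):

```python
import torch

def assign_anchors(kf_positions, anchors):
    """Cluster keyframe camera centers to the nearest icosahedral anchor.
    kf_positions: (K, 3) camera centers; anchors: (A, 3) unit directions."""
    d = torch.cdist(torch.nn.functional.normalize(kf_positions, dim=1), anchors)
    return d.argmin(dim=1)   # anchor index per keyframe

def mad_outliers(losses, k=3.0):
    """Flag keyframes whose reconstruction loss deviates from the median
    by more than k times the median absolute deviation (MAD)."""
    med = losses.median()
    mad = (losses - med).abs().median()
    return (losses - med).abs() > k * mad

# Example: the 0.90 loss is flagged as an outlier; the rest are kept.
losses = torch.tensor([0.12, 0.10, 0.11, 0.90, 0.13])
keep = ~mad_outliers(losses)
```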

③ Adaptive Density Control via Opacity Percentile Pruning:

  • Absolute-threshold pruning in vanilla 3DGS causes drastic fluctuations in the Gaussian count, destabilizing training.
  • At fixed optimization intervals, Gaussian particles whose opacity falls below the 5th percentile are pruned, repeated until the 95th-percentile opacity exceeds a given threshold.
  • This retains high-quality Gaussians while eliminating low-importance particles, balancing stability and efficiency.
  • Gaussian splitting and cloning are triggered when positional gradients exceed a threshold.
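
A one-step sketch of the percentile rule in PyTorch (the paper repeats pruning until the 95th-percentile opacity clears a threshold; only a single step is shown, and names are illustrative):

```python
import torch

def percentile_prune(opacity: torch.Tensor, q: float = 0.05) -> torch.Tensor:
    """Keep Gaussians whose opacity is at or above the q-th percentile.
    Unlike a fixed absolute threshold, this removes a bounded fraction of
    particles per step, so the total count changes smoothly."""
    cutoff = torch.quantile(opacity, q)
    return opacity >= cutoff

# Example: at most ~5% of the Gaussians are dropped per pruning step.
opacity = torch.rand(10_000)
keep = percentile_prune(opacity)
num_kept = keep.sum().item()   # ~9_500
```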

④ Joint Optimization Pipeline:

  • Once the Gaussian object field converges, the 2D Gaussian parameters are frozen and all keyframe poses are individually refined.
  • The refined poses are then used to guide the online pose graph optimization.
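
A minimal PyTorch sketch of the freeze-then-refine pattern. Here `render` is a hypothetical stand-in for the 2DGS differentiable rasterizer and the first-order pose update is a simplification; neither is the paper's actual implementation:

```python
import torch

def hat(w):
    """Skew-symmetric matrix of a 3-vector, built differentiably."""
    z = torch.zeros((), dtype=w.dtype)
    return torch.stack([torch.stack([z, -w[2], w[1]]),
                        torch.stack([w[2], z, -w[0]]),
                        torch.stack([-w[1], w[0], z])])

def small_pose(xi):
    """First-order pose update from a 6-vector (rotation, translation);
    adequate for the small per-keyframe corrections refined here."""
    return torch.eye(3, dtype=xi.dtype) + hat(xi[:3]), xi[3:]

# Stand-in for the rasterizer: any differentiable function of the Gaussian
# means and the pose illustrates the freeze-then-refine pattern equally well.
def render(means, R, t):
    return (means @ R.T + t).sum(0)

means = torch.randn(1000, 3)
means.requires_grad_(False)                    # freeze the converged object field

R_gt, t_gt = small_pose(torch.tensor([0.02, 0.0, 0.0, 0.05, 0.0, 0.0]))
target = render(means, R_gt, t_gt).detach()    # observation for one keyframe

xi = torch.zeros(6, requires_grad=True)        # per-keyframe pose correction
opt = torch.optim.Adam([xi], lr=1e-2)
for _ in range(200):                           # gradients flow only into the pose
    R, t = small_pose(xi)
    loss = (render(means, R, t) - target).abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
```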

Loss & Training

  • Rendering loss: RGB color reconstruction loss + depth reconstruction loss + normal consistency loss (intrinsic to 2DGS) + depth distortion loss.
  • Pose optimization loss: Gradients are back-propagated through 2DGS differentiable rendering to keyframe pose parameters using PyTorch automatic differentiation.
  • Online pose graph: Pairwise geometric consistency optimization based on per-pixel dense reprojection error.
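
Combining the rendering terms, the total training objective plausibly takes the standard weighted form below; the λ weights are placeholders, as the paper's exact weighting is not reproduced in this summary:

$$\mathcal{L} = \mathcal{L}_{\mathrm{rgb}} + \lambda_{d}\,\mathcal{L}_{\mathrm{depth}} + \lambda_{n}\,\mathcal{L}_{\mathrm{normal}} + \lambda_{\mathrm{dist}}\,\mathcal{L}_{\mathrm{dist}}$$

where $\mathcal{L}_{\mathrm{normal}}$ and $\mathcal{L}_{\mathrm{dist}}$ are the normal-consistency and depth-distortion regularizers inherited from 2DGS.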

Key Experimental Results

YCBInEOAT dataset (robotic grasping scenarios, 5 YCB objects, 9 video sequences):

Method           ADD-S (%)   ADD (%)   CD (cm)   Time/frame (s)
BundleSDF        92.82       84.28     0.53      0.82
BundleTrack      92.54       84.91     -         0.21
6DOPE-GS         93.79       87.82     0.15      0.22
MonoGS (RGB-D)   20.16       15.32     2.43      0.29

HO3D dataset (hand-object interaction scenarios, more challenging):

Method           ADD-S (%)   ADD (%)   CD (cm)   Time/frame (s)
BundleSDF        94.86       89.56     0.58      2.10
BundleTrack      93.96       77.75     -         0.29
6DOPE-GS         95.07       84.33     0.41      0.24

  • Real-time capability: 3–4 Hz with visualization, 4–5 Hz without GUI.
  • The Gaussian model is updated approximately every 8 seconds.
  • Hardware: RTX 4090 GPU + i9-12900KF CPU.

Ablation Study (on HO3D)

Variant                     ADD-S (%)   ADD (%)   CD (cm)
Ours (basic)                93.52       80.25     0.44
w/o KF Selection            94.44       82.40     0.42
w/o Pruning (vanilla ADC)   92.48       80.87     0.44
Ours (3DGS)                 92.51       79.49     0.47
Ours (final)                95.07       84.33     0.41

  • Both dynamic keyframe selection and percentile pruning yield significant improvements.
  • 2DGS clearly outperforms 3DGS, attributed to the absence of normal/depth regularization in 3DGS.
  • The combination of both components achieves the best overall performance.

Highlights & Insights

  1. Substantial speed advantage: Roughly 4–9× faster per frame than BundleSDF (0.22 s vs. 0.82 s on YCBInEOAT; 0.24 s vs. 2.10 s on HO3D; ~5× on average), marking the first real-time Gaussian-field-based 6D object tracking and reconstruction system (~3.5 Hz).
  2. Elegant dynamic keyframe selection design: Icosahedral anchors combined with MAD-based outlier detection achieve both spatial coverage and robustness.
  3. Percentile pruning is simple yet effective: Replacing absolute thresholds with statistical percentiles avoids drastic fluctuations in Gaussian count.
  4. Well-motivated choice of 2DGS over 3DGS: Surfel representation provides superior surface geometry constraints.
  5. Full real-time system validation: Includes a live demonstration with a ZED2 camera in real-world scenarios.

Limitations & Future Work

  1. ADD on HO3D remains lower than BundleSDF (84.33 vs. 89.56): Hand occlusion reduces the available supervision signal, constraining Gaussian optimization.
  2. Gradient computation via Gaussian rasterization is less precise than differentiable ray casting in neural fields: The authors plan to explore ray casting approaches for Gaussian representations.
  3. The Gaussian Object Field is not directly integrated into the online pose graph optimization: Only refined poses are passed downstream, leaving room for tighter coupling.
  4. Dependence on SAM2 segmentation quality: Segmentation failures propagate directly to all subsequent stages.
  5. Limited Gaussian model update frequency (~every 8 seconds) may be insufficient for fast-moving objects.

Comparison with related methods:

Method        Model Requirement   Reconstruction   Speed     Key Difference
BundleTrack   Model-free          None             ~10 Hz    Pose tracking only, no 3D reconstruction
BundleSDF     Model-free          Neural SDF       ~0.4 Hz   Extremely slow neural field training
MonoGS        Model-free          3DGS             ~3 Hz     Scene-level SLAM, poor object-level performance
6DOPE-GS      Model-free          2DGS             ~4 Hz     Object-level Gaussian field with the best speed-accuracy balance

Broader implications:

  • 2DGS vs. 3DGS: For tasks requiring precise surface geometry (e.g., pose estimation, robotic grasping), the surfel representation of 2DGS may be more suitable than 3DGS thanks to its inherent normal constraints.
  • Percentile pruning generalizes to other Gaussian-field applications: statistical distribution characteristics, rather than absolute thresholds, control the Gaussian density.
  • The icosahedral-anchor + MAD keyframe selection strategy transfers to other viewpoint selection problems (e.g., active perception, next-best-view planning).
  • This work is complementary to scene-level GS-based SLAM methods (SplaTAM, GS-SLAM, etc.): those target scene-level reconstruction, whereas this paper addresses the object level.

Rating

  • Novelty: ⭐⭐⭐⭐ Applying 2DGS to joint optimization of object-level 6D pose estimation and reconstruction represents a well-motivated contribution; dynamic keyframe selection and percentile pruning are effective engineering contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparisons and ablation studies are conducted on two standard datasets with a real-time system demonstration, though generalization across additional datasets remains unverified.
  • Writing Quality: ⭐⭐⭐⭐ The paper is well-structured with detailed method descriptions, intuitive pipeline diagrams, and clearly articulated motivation.
  • Value: ⭐⭐⭐⭐ A 5× speedup is practically significant, providing a practical model-free pose estimation solution for real-time applications such as robotic manipulation.