6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting

Conference: ICCV 2025
arXiv: 2412.01543
Code: None
Area: Autonomous Driving
Keywords: 6D pose estimation, Gaussian splatting, real-time tracking, 3D reconstruction, RGB-D

TL;DR

This paper proposes 6DOPE-GS, a model-free online tracking method that jointly optimizes 6D object pose and 3D reconstruction using 2D Gaussian Splatting (2DGS). With dynamic keyframe selection and opacity-percentile-based density control, it reports a roughly 5× speedup over BundleSDF while remaining competitive with state-of-the-art accuracy.

Background & Motivation

6D object pose estimation is a fundamental task in AR, autonomous driving, and robotic manipulation. Existing methods fall into two main categories:

Model-based methods: Rely on CAD models or reference images, with poor generalization to unseen objects.

Model-free methods: For example, BundleSDF jointly optimizes pose and reconstruction via a Neural Object Field but incurs a prohibitive computational cost (~2.1 s per frame on HO3D), making real-time deployment infeasible.

Although BundleSDF reports near real-time pose optimization (~10 Hz), its neural field training is far from real-time (~6.7 s per round), yielding an overall tracking frequency of only ~0.4 Hz. This severely limits applicability in dynamic scenarios such as hand-held object manipulation.

Core Motivation: Leverage the efficient differentiable rendering capability of Gaussian Splatting to replace slow neural implicit field training, enabling truly online joint pose estimation and 3D reconstruction.

Method

Overall Architecture

The 6DOPE-GS pipeline consists of the following stages:

Object Segmentation: SAM2 is applied to continuously segment the target object in the video stream.
Feature Matching: LoFTR establishes inter-frame point correspondences.
Coarse Pose Initialization: RANSAC computes an initial pose and constructs a keyframe pool.
Gaussian Object Field Joint Optimization: 2DGS jointly optimizes keyframe poses and object reconstruction.
Online Pose Graph Optimization: Poses of incoming frames are continuously updated using the optimized keyframe poses.

Key Designs

1. Gaussian Object Field

2D Gaussian Splatting (2DGS) is adopted over 3DGS for object modeling:
2DGS compresses each Gaussian into a 2D planar disk (surfel) by setting the z-axis scale to zero.
It provides more accurate surface normals and depth estimates.
Gradients are back-propagated to keyframe pose parameters through differentiable rendering, enabling joint optimization.
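The surfel idea can be illustrated with a few lines of plain Python. This is a minimal sketch, not the paper's implementation: the function names `surfel_covariance` and `surfel_normal` are hypothetical, and it assumes the usual Gaussian parameterization in which the covariance is built from a rotation matrix and per-axis scales. Setting the third scale to zero leaves a flat disk whose normal is the collapsed axis.

```python
import math

def surfel_covariance(R, sx, sy):
    """Covariance R * diag(sx^2, sy^2, 0) * R^T of a planar Gaussian disk."""
    s2 = (sx * sx, sy * sy, 0.0)  # zero variance along the local z-axis
    return [[sum(R[i][k] * s2[k] * R[j][k] for k in range(3))
             for j in range(3)] for i in range(3)]

def surfel_normal(R):
    """The disk normal is the third column of R (the collapsed axis)."""
    return [R[i][2] for i in range(3)]

# Example: a disk tilted 30 degrees about the x-axis (illustrative values).
c, s = math.cos(math.radians(30)), math.sin(math.radians(30))
R = [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]
cov = surfel_covariance(R, sx=0.02, sy=0.01)
n = surfel_normal(R)

# The covariance has no extent along the normal, so cov @ n is numerically zero.
cov_n = [sum(cov[i][j] * n[j] for j in range(3)) for i in range(3)]
```

Because the disk has a well-defined normal, depth and normal maps can be rendered directly from the primitives, which is the property the notes credit for the more accurate surface estimates.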

2. Dynamic Keyframe Selection

The vertices and face centers of an icosahedron serve as anchor points, approximating a uniform spherical distribution of viewing directions.
Initial keyframes are assigned by pose to their nearest anchor, and within each cluster the frame with the largest object mask is selected.
During joint optimization, outlier keyframes are filtered using the Median Absolute Deviation (MAD) of reconstruction error.
Viewpoints whose reconstruction error exceeds 3× MAD are treated as outliers and discarded.
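The keyframe selection steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names are hypothetical, each keyframe pose is reduced to a viewing direction for clustering, and the MAD test is taken to mean "error above the median by more than 3× MAD" (the notes only say "exceeding 3× MAD").

```python
import math
import statistics
from itertools import combinations

def _unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def icosahedron_anchors():
    """Return 32 unit directions: 12 icosahedron vertices + 20 face centers."""
    phi = (1 + math.sqrt(5)) / 2
    verts = []
    for a in (-1.0, 1.0):
        for b in (-phi, phi):
            verts += [(0.0, a, b), (a, b, 0.0), (b, 0.0, a)]
    verts = [_unit(v) for v in verts]
    # Adjacent unit vertices are ~1.05 apart; the next-nearest are ~1.70 apart,
    # so every triple of mutually adjacent vertices is a face.
    faces = [t for t in combinations(verts, 3)
             if all(math.dist(p, q) < 1.2 for p, q in combinations(t, 2))]
    centers = [_unit(tuple(sum(c) for c in zip(*f))) for f in faces]
    return verts + centers

def select_keyframes(view_dirs, mask_areas, anchors):
    """Map anchor index -> index of the largest-mask frame nearest that anchor."""
    best = {}
    for frame, (d, area) in enumerate(zip(view_dirs, mask_areas)):
        d = _unit(d)
        k = max(range(len(anchors)),
                key=lambda i: sum(x * y for x, y in zip(d, anchors[i])))
        if k not in best or area > mask_areas[best[k]]:
            best[k] = frame
    return best

def mad_inliers(errors, k=3.0):
    """Indices of keyframes whose error is not more than k * MAD above the median."""
    med = statistics.median(errors)
    mad = statistics.median(abs(e - med) for e in errors)
    return [i for i, e in enumerate(errors) if e - med <= k * mad]
```

For example, `mad_inliers([1.0, 1.1, 0.9, 1.05, 5.0])` keeps the first four keyframes and drops the one with error 5.0. Because the median and MAD are themselves robust to that outlier, the cutoff adapts to the error scale of each object instead of relying on a hand-tuned constant.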

3. Adaptive Density Control via Opacity Percentile

At fixed optimization intervals, Gaussians whose opacity lies below the 5th percentile are pruned.
This interval pruning is repeated until the 95th-percentile opacity exceeds a threshold.
Compared to the absolute-threshold pruning in vanilla 3DGS, this strategy is more stable and avoids drastic fluctuations in the number of Gaussians.
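One pruning step of this scheme might look as follows. This is a hedged sketch: `percentile`, `prune_step`, and the default `stop_threshold` value are illustrative assumptions, since the notes do not give the threshold or the percentile interpolation rule.

```python
import math

def percentile(values, q):
    """Percentile with linear interpolation between closest ranks (q in [0, 100])."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def prune_step(opacities, low_q=5.0, high_q=95.0, stop_threshold=0.9):
    """Return indices of Gaussians kept after one pruning interval.

    Nothing is pruned once the high_q-th percentile opacity exceeds
    stop_threshold (a hypothetical value chosen for illustration).
    """
    if percentile(opacities, high_q) > stop_threshold:
        return list(range(len(opacities)))
    cut = percentile(opacities, low_q)
    return [i for i, o in enumerate(opacities) if o > cut]

# Example: 100 Gaussians with opacities 0.01 ... 1.00.
opacities = [i / 100 for i in range(1, 101)]
kept = prune_step(opacities, stop_threshold=0.99)
```

Each step removes a fixed fraction of the current population rather than everything below an absolute opacity, which is why the Gaussian count changes gradually.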

Loss & Training

Rendering loss: RGB reconstruction loss + depth reconstruction loss + normal consistency loss.
Keyframe poses receive gradients from the rendering loss via automatic differentiation (PyTorch) and are updated accordingly.
Once the Gaussian Object Field converges, Gaussian parameters are frozen and all keyframe poses are jointly refined.
Online pose graph optimization uses dense pixel-wise re-projection error.
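The three-term rendering loss can be sketched as a weighted sum. Everything specific here is an assumption made for illustration: the L1 form of the RGB and depth terms, the `1 - n·N` form of the normal consistency term, the weights `w_depth` and `w_normal`, and the restriction to masked object pixels are not stated in these notes.

```python
def rendering_loss(rgb_pred, rgb_gt, depth_pred, depth_gt,
                   normal_render, normal_depth, mask,
                   w_depth=1.0, w_normal=0.1):
    """Weighted RGB + depth + normal-consistency loss over masked pixels.

    Inputs are flat per-pixel lists; normals are unit 3-vectors.
    The weights are hypothetical, not values from the paper.
    """
    px = [i for i, m in enumerate(mask) if m]
    if not px:
        return 0.0
    l_rgb = sum(abs(a - b) for i in px
                for a, b in zip(rgb_pred[i], rgb_gt[i])) / (3 * len(px))
    l_depth = sum(abs(depth_pred[i] - depth_gt[i]) for i in px) / len(px)
    # Penalize disagreement between rendered normals and depth-derived normals.
    l_normal = sum(1.0 - sum(a * b for a, b in zip(normal_render[i], normal_depth[i]))
                   for i in px) / len(px)
    return l_rgb + w_depth * l_depth + w_normal * l_normal

# Example with a single masked pixel (illustrative values).
loss = rendering_loss(
    rgb_pred=[(0.5, 0.5, 0.5)], rgb_gt=[(0.2, 0.2, 0.2)],
    depth_pred=[1.5], depth_gt=[1.0],
    normal_render=[(0.0, 0.0, 1.0)], normal_depth=[(1.0, 0.0, 0.0)],
    mask=[True],
)
```

In an autodiff framework the same scalar would be computed on tensors, so that its gradient reaches both the Gaussian parameters and the keyframe poses.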

Key Experimental Results

Main Results

YCBInEOAT dataset:

Method	ADD-S (%)	ADD (%)	CD (cm)	Time/frame (s)
BundleTrack	92.54	84.91	-	0.21
BundleSDF	92.82	84.28	0.53	0.82
MonoGS (RGB-D)	20.16	15.32	2.43	0.29
6DOPE-GS	93.79	87.82	0.15	0.22

HO3D dataset:

Method	ADD-S (%)	ADD (%)	CD (cm)	Time/frame (s)
BundleSDF	94.86	89.56	0.58	2.10
BundleTrack	93.96	77.75	-	0.29
6DOPE-GS	95.07	84.33	0.41	0.24
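For readers unfamiliar with the pose metrics in these tables, the per-frame ADD and ADD-S distances can be computed as below. This is a minimal sketch with hypothetical function names; how the per-frame distances are aggregated into the percentages shown above is not spelled out in these notes, so that step is omitted.

```python
import math

def transform(R, t, p):
    """Apply the rigid transform R @ p + t to a 3D point."""
    return tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))

def add_metric(points, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between corresponding model points under both poses."""
    gt = [transform(R_gt, t_gt, p) for p in points]
    est = [transform(R_est, t_est, p) for p in points]
    return sum(math.dist(a, b) for a, b in zip(gt, est)) / len(points)

def add_s_metric(points, R_gt, t_gt, R_est, t_est):
    """ADD-S: mean distance from each ground-truth point to its nearest estimated point."""
    gt = [transform(R_gt, t_gt, p) for p in points]
    est = [transform(R_est, t_est, p) for p in points]
    return sum(min(math.dist(a, b) for b in est) for a in gt) / len(points)

# Example: a two-point symmetric "object" rotated 180 degrees about the z-axis.
I = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Rz180 = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
zero_t = (0.0, 0.0, 0.0)
model = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
add = add_metric(model, I, zero_t, Rz180, zero_t)
add_s = add_s_metric(model, I, zero_t, Rz180, zero_t)
```

In this example ADD is 2.0 while ADD-S is 0.0, because ADD-S does not penalize pose errors that map a symmetric shape onto itself. This is why the ADD-S column is always at least as high as the ADD column in the tables.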

Ablation Study

HO3D ablation:

Configuration	ADD-S (%)	ADD (%)	CD (cm)
Ours (basic)	93.52	80.25	0.44
w/o KF Selection	94.44	82.40	0.42
w/o Pruning	92.48	80.87	0.44
Ours (3DGS)	92.51	79.49	0.47
Ours (final, 2DGS)	95.07	84.33	0.41

Key Findings

Compared to BundleSDF, 6DOPE-GS is markedly faster while achieving higher ADD-S: per-frame time drops from 0.82 s to 0.22 s on YCBInEOAT (about 3.7×) and from 2.10 s to 0.24 s on HO3D (about 8.8×), which brackets the roughly 5× headline speedup.
On HO3D, 2DGS substantially outperforms 3DGS (ADD-S 95.07 vs. 92.51), attributed to the normal and depth regularization in 2DGS that encourages Gaussians to conform to object surfaces.
Both dynamic keyframe selection and percentile-based pruning contribute: on HO3D, removing keyframe selection lowers ADD-S from 95.07 to 94.44, and removing pruning lowers it to 92.48.
MonoGS performs poorly on object-level tracking, demonstrating that scene-level SLAM methods cannot be directly applied to object tracking.
In real-time demonstrations, the tracking frequency reaches 3.5–5 Hz, with the Gaussian field updated every 8 seconds.

Highlights & Insights

First method to apply Gaussian Splatting to model-free 6D object tracking and reconstruction, demonstrating the significant potential of GS in object-level SLAM.
The icosahedron anchor strategy in dynamic keyframe selection is elegant, ensuring spatial coverage of viewing directions.
MAD-based outlier keyframe filtering is more robust than simple thresholding and represents good engineering practice.
The overall approach transfers the paradigm of scene-level GS-SLAM to the object level, opening a new research direction.

Limitations & Future Work

The ADD score on HO3D remains lower than that of BundleSDF (84.33 vs. 89.56); this is attributed mainly to insufficient supervision under hand occlusion.
Gaussian rasterization is less accurate than differentiable ray casting for gradient computation under large or out-of-plane rotations.
The optimized 2D Gaussians are not directly used in online pose graph optimization (only the optimized poses are), resulting in insufficient coupling.
The method depends on the segmentation quality of SAM2 and may fail under severe occlusion.
Only single-object tracking is currently supported; extension to multi-object scenarios requires further work.

Related Work & Insights

The keyframe pose graph optimization framework is inherited from BundleTrack/BundleSDF, with GS replacing the slow Neural Field.
The distinction from GS-SLAM methods such as MonoGS lies in the object-centric rather than scene-level formulation of 6DOPE-GS.
The advantage of the 2DGS surfel representation for accurate depth and normal modeling is validated in this work.

Rating

Novelty: ⭐⭐⭐⭐ — First application of 2DGS to model-free 6D object tracking, with well-designed keyframe selection and density control.
Experimental Thoroughness: ⭐⭐⭐⭐ — Two datasets, detailed ablations, and real-time demonstrations, though evaluation on larger-scale benchmarks is lacking.
Writing Quality: ⭐⭐⭐⭐ — Clear structure with complete method descriptions.
Value: ⭐⭐⭐⭐ — Real-time model-free tracking offers high practical value and advances the application of GS in robotics.