# OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects

- Conference: NeurIPS 2025
- arXiv: 2510.20605
- Code: https://markhh.com/OnlineSplatter
- Area: 3D Vision
- Keywords: Online 3D Reconstruction, 3D Gaussian Splatting, Pose-Free Reconstruction, Free-Moving Objects, Memory Module
## TL;DR
This paper proposes OnlineSplatter, a feed-forward online 3D reconstruction framework that requires no camera poses, depth priors, or global optimization. It achieves constant-time incremental reconstruction of free-moving objects via a dual-key memory module combining appearance-geometry latent keys and orientation keys.
## Background & Motivation
Real-time monocular reconstruction of free-moving objects is a fundamental challenge in computer vision, with applications in robotics, AR, and beyond. Existing methods suffer from the following limitations:
Optimization-based methods (BARF, BundleSDF, Fmov): Require global bundle adjustment and cannot run online in real time; BundleSDF additionally requires ground-truth depth input.
Diffusion-based generative methods (LRM, InstantMesh): Rely on learned priors to "hallucinate" unobserved regions, making them unsuitable for perception tasks.
Feed-forward point-map methods (DUSt3R, NoPoSplat, Spann3R): Assume static scenes and treat moving objects as outliers; implicitly rely on large background surfaces.
Key Challenge: Online reconstruction requires causal processing (updating upon each arriving frame), yet existing methods either demand global optimization, assume static scenes, or require additional sensors.
Key Insight: The paper designs an object-centric feed-forward framework that defines a canonical coordinate system from the first frame and incrementally fuses temporal information in constant time via a dual-key memory module, without requiring poses, depth, or background information.
## Method

### Overall Architecture
At each timestep \(t\): an RGB frame \(V_t\) is input → online video segmentation yields an object mask → a dual encoder extracts patch features → the OnlineSplatter Transformer jointly processes the reference frame, current frame, and memory tokens → outputs pixel-aligned 3D Gaussians → updates the object memory. The entire pipeline is feed-forward with constant time complexity.
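The per-frame data flow above can be sketched as a constant-time loop. Every component below (segmentation, encoders, transformer, decoder, memory ops) is a placeholder stub of my own naming, not the authors' actual modules; only the wiring mirrors the pipeline.

```python
import numpy as np

def process_frame(frame, ref_feats, memory, stubs):
    """One constant-time update step; `stubs` holds placeholder callables."""
    mask = stubs["segment"](frame)                     # online video segmentation
    feats = stubs["encode"](frame * mask[..., None])   # dual-encoder patch features
    mem_tokens = stubs["retrieve"](memory, feats)      # dual-key memory read
    tokens = stubs["transform"](ref_feats, feats, mem_tokens)  # joint Transformer pass
    gaussians = stubs["decode"](tokens)                # pixel-aligned 3D Gaussians
    stubs["update"](memory, feats, tokens)             # memory write (with pruning)
    return gaussians
```

Because each step touches only the current frame, the reference frame, and a fixed-capacity memory, the cost per frame does not grow with sequence length.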
### Key Designs
- Dual-Encoder Image Feature Extraction:
- A frozen DINO encoder provides strong self-supervised appearance cues.
- A trainable encoder of the same architecture captures complementary geometric cues.
- Features are concatenated: \(f_{vt} = \text{Concat}(\text{Encoder}_1^I(V_t'), \text{Encoder}_2^I(V_t'))\)
- Design Motivation: DINO offers strong visual priors but lacks 3D awareness.
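The concatenation \(f_{vt}\) amounts to stacking the two encoders' patch features along the channel dimension. A minimal sketch, with the encoders passed in as hypothetical callables:

```python
import numpy as np

def extract_features(frame, frozen_enc, trainable_enc):
    """frame: (H, W, 3); each encoder maps a frame to (P, d) patch features."""
    f_app = frozen_enc(frame)     # frozen DINO: self-supervised appearance cues
    f_geo = trainable_enc(frame)  # trainable twin: complementary geometric cues
    return np.concatenate([f_app, f_geo], axis=-1)  # f_vt with shape (P, 2d)
```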
- Dual-Key 3D Object Memory (the core contribution):
- Latent key \(\mathbf{k}_t^{(L)}\): Learned by a lightweight encoder from patch features, capturing visual-geometric cues.
- Orientation key \(\mathbf{k}_t^{(D)}\): Derived from a pretrained zero-shot 3D orientation estimator, converted to a unit direction vector.
- Value \(\mathbf{v}_t^{(L)}\): Encoded from Transformer output tokens.
- Dual-purpose retrieval:
- Orientation-aligned retrieval: Retrieves memory entries with similar latent and orientation keys (current-viewpoint information).
- Orientation-complementary retrieval: Retrieves entries with similar latent keys but opposite orientation keys (complementary-viewpoint information).
- Similarity formula: \(s_{i,t}^{(\text{align})} = (\mathbf{q}_t^{(L)\top}\mathbf{k}_i^{(L)}) \cdot \mathbf{q}_t^{(D)\top}\mathbf{k}_i^{(D)} \cdot \frac{1}{\tau_t}\)
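A toy sketch of the two retrieval scores. It assumes the complementary score simply negates the orientation term (one plausible reading of "opposite orientation keys", not confirmed by the paper) and uses a fixed temperature:

```python
import numpy as np

def retrieval_scores(q_lat, q_dir, k_lat, k_dir, tau=1.0):
    """q_lat: (d,), q_dir: (3,) unit query keys; k_lat: (S, d), k_dir: (S, 3)."""
    lat_sim = k_lat @ q_lat            # latent similarity  q^(L)T k^(L)
    dir_sim = k_dir @ q_dir            # orientation similarity  q^(D)T k^(D)
    s_align = lat_sim * dir_sim / tau  # similar latent AND similar orientation
    s_comp = lat_sim * -dir_sim / tau  # similar latent, opposite orientation (assumed form)
    return s_align, s_comp
```

Under this reading, a memory entry facing the same way as the query scores high on the aligned branch, while an entry seen from the opposite side scores high on the complementary branch.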
- Memory Sparsification Mechanism:
- When memory reaches capacity \(S\), the 20% least useful entries are pruned.
- Two criteria are jointly considered: utilization (accumulated cross-attention weights) and spatial coverage (mean angular distance of orientation keys).
- Low-utilization entries are removed from a high-coverage subset, balancing the retention of unique viewpoints and the removal of redundant ones.
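One plausible reading of the pruning rule, sketched below: entries whose orientation keys have below-median mean angular distance to the rest are treated as redundant viewpoints, and the lowest-utilization entries among them are dropped. The exact interplay of the two criteria is my assumption, not the paper's formula.

```python
import numpy as np

def prune_memory(util, dirs, frac=0.2):
    """util: (S,) accumulated cross-attention weight per entry;
    dirs: (S, 3) unit orientation keys. Returns indices of entries to KEEP."""
    S = len(util)
    n_drop = int(np.ceil(frac * S))
    # Spatial coverage: mean angular distance to all other orientation keys.
    cos = np.clip(dirs @ dirs.T, -1.0, 1.0)
    coverage = np.arccos(cos).mean(axis=1)
    # Candidates: redundant viewpoints (below-median coverage); assumes
    # there are at least n_drop such candidates.
    redundant = np.where(coverage <= np.median(coverage))[0]
    # Among the candidates, drop the lowest-utilization entries.
    drop = redundant[np.argsort(util[redundant])[:n_drop]]
    return np.setdiff1d(np.arange(S), drop)
```

Unique viewpoints (high angular spread) never enter the candidate set, so they survive pruning even when rarely attended to.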
- Gaussian Decoding and Rendering:
- Transformer outputs are decoded into \(4N\) Gaussians: \(\mathbf{G}_{obj,t}^{4N} = \{\mathbf{G}_{mem,t}^{2N}, \mathbf{G}_{ref,t}^{N}, \mathbf{G}_{src,t}^{N}\}\)
- Non-accumulative: each step directly outputs a complete object representation, avoiding global aggregation.
- Frame-level subsets are rendered from their respective viewpoints, encouraging each Gaussian group to specialize on the corresponding visible region.
## Loss & Training
- Two-stage training: Warm-up (no memory module, 250K steps) → Main (with memory module, 500K steps).
- Photometric loss \(\mathcal{L}_\text{photo}\): MSE between ground-truth and rendered images, plus a background penalty term.
- Geometric loss \(\mathcal{L}_\text{geo}\): Ray alignment \(\mathcal{L}_\text{ray}\) + relative depth \(\mathcal{L}_\text{depth}\).
- Training data: 100K objects from Objaverse; diverse trajectories generated via custom scripts.
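Schematically, the objective combines the terms above. The background-penalty form (penalizing rendered opacity outside the object mask) and the loss weights are my assumptions, not the paper's exact formulation:

```python
import numpy as np

def total_loss(render, gt, alpha, bg_mask, ray_err, depth_err,
               w_bg=1.0, w_ray=1.0, w_depth=1.0):
    """render/gt: (H, W, 3); alpha: rendered opacity (H, W);
    bg_mask: 1 where a pixel is background, 0 on the object."""
    l_photo = np.mean((render - gt) ** 2)          # MSE photometric term
    l_bg = np.mean(alpha * bg_mask)                # assumed background penalty
    l_geo = w_ray * ray_err + w_depth * depth_err  # ray alignment + relative depth
    return l_photo + w_bg * l_bg + l_geo
```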
## Key Experimental Results

### Main Results (GSO Dataset)
| Method | Phase | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|
| FreeSplatter-dist4 | Late | 23.751 | 0.873 | 0.120 |
| NoPoSplat-dist3 | Late | 24.141 | 0.863 | 0.125 |
| OnlineSplatter | Late | 31.737 | 0.969 | 0.075 |
| FreeSplatter-dist4 | Early | 22.365 | 0.874 | 0.119 |
| OnlineSplatter | Early | 26.329 | 0.921 | 0.084 |
Late-phase results on HO3D: PSNR 27.928 vs. best baseline 22.947 (+4.981).
### Ablation Study
| Configuration | Early \(\mathcal{M}_{avg}\)↑ | Mid \(\mathcal{M}_{avg}\)↑ | Late \(\mathcal{M}_{avg}\)↑ |
|---|---|---|---|
| Full model | 0.699 | 0.734 | 0.810 |
| w/o latent key | 0.545 | 0.582 | 0.596 |
| w/o orientation key | 0.699 | 0.701 | 0.723 |
| w/o staged training | 0.545 | 0.582 | 0.588 |
| w/o ray loss | 0.562 | 0.599 | 0.682 |
| random pruning | 0.697 | 0.728 | 0.764 |
### Key Findings
- OnlineSplatter achieves PSNR gains of +7.596 (GSO) and +4.981 (HO3D) in the Late phase, substantially outperforming all baselines.
- Performance improves consistently as more observations accumulate, whereas baselines frequently plateau or fluctuate.
- Removing the latent key causes the largest performance drop (−0.214); the orientation key primarily affects later phases (−0.087), confirming their complementarity.
- Staged training is critical: single-stage training yields only 0.588 Late-phase performance vs. 0.810 with staged training.
## Highlights & Insights
- Constant-time online reconstruction: Each frame incurs \(O(1)\) update cost independent of sequence length, making the method genuinely suitable for real-time applications.
- Complementarity of the dual-key design: The latent key captures "what is relevant" while the orientation key encodes "from where it is observed"—together they enable comprehensive spatial coverage.
- Non-accumulative paradigm: Unlike conventional methods that accumulate predictions and then globally optimize, the proposed approach directly outputs a complete representation at each step, fundamentally eliminating redundancy and optimization overhead.
## Limitations & Future Work
- Only rigid objects are supported; non-rigid deformable objects are not handled.
- The quality of the initial frame affects subsequent reconstruction (severe occlusion or blur in the first frame degrades overall performance).
- Converting the output 3DGS representation to explicit meshes remains challenging.
- Resolution is limited to 256×256; scaling to higher resolutions requires further investigation.
## Related Work & Insights
- vs. BundleSDF: BundleSDF requires ground-truth depth and keyframe-matching optimization; OnlineSplatter operates as a purely RGB feed-forward method.
- vs. DUSt3R/NoPoSplat: These methods assume static scenes, whereas OnlineSplatter is specifically designed for free-moving objects.
- vs. FreeSplatter: FreeSplatter processes four frames at a time and requires a frame selection strategy; OnlineSplatter naturally accumulates information through its memory module.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ The dual-key memory design is novel; integrating orientation estimation into memory retrieval is particularly elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on both synthetic and real datasets with phased assessment, comprehensive ablations, and mesh comparisons.
- Writing Quality: ⭐⭐⭐⭐ The method description is clear, though the dense notation takes some effort on a first read.
- Value: ⭐⭐⭐⭐⭐ The first truly pose-free online object reconstruction feed-forward framework, with direct applicability to robotic perception.