LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
Paper Information
- Conference: ICCV 2025
- arXiv: 2508.14041
- Code: Project Page
- Area: 3D Vision
- Keywords: 3D Gaussian Splatting, unposed long video reconstruction, octree anchor, incremental joint optimization, pose estimation
TL;DR
LongSplat reconstructs casually captured long videos that have no known camera poses. It combines three components: an incremental joint optimization framework that optimizes camera poses and the 3DGS scene together, a robust pose estimation module built on MASt3R priors, and an adaptive octree anchor formation mechanism. Together these address pose drift, inaccurate geometry initialization, and memory constraints.
Background & Motivation
Casually captured long videos are an important source of 3D content, yet they pose unique challenges for novel view synthesis:
- Pose estimation difficulty: COLMAP frequently fails on casual videos; foundation models such as MASt3R accumulate errors and drift over long sequences.
- Memory constraints: COLMAP-free methods such as CF-3DGS encounter out-of-memory (OOM) issues at large scales.
- Complex trajectories: LocalRF produces fragmented reconstructions under irregular camera motion.
- Lack of global consistency: Incremental methods are prone to local optima.
Method
Overall Architecture
LongSplat adopts a fully incremental pipeline:
1. Initialization: MASt3R global alignment point cloud → octree anchor 3DGS
2. Global optimization: joint optimization of all poses and Gaussians
3. Per-frame insertion: PnP pose estimation + photometric refinement + anchor expansion
4. Alternating local–global optimization
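The control flow of these four steps, including the fallback described later in the notes, can be sketched roughly as follows. This is a minimal illustration rather than the authors' code. The callables `estimate_pose`, `local_optimize`, and `global_optimize` are hypothetical placeholders, and the fixed `global_every` schedule is an assumption, since the notes do not say how local and global phases are interleaved.

```python
from typing import Callable, List, Optional


def run_incremental(
    num_frames: int,
    estimate_pose: Callable[[int], Optional[str]],  # hypothetical: returns None on failure
    local_optimize: Callable[[int], None],          # hypothetical local-window optimizer
    global_optimize: Callable[[], None],            # hypothetical global optimizer
    global_every: int = 10,                         # assumed schedule, not from the paper
) -> List[str]:
    """Run the incremental loop and return a log of the steps taken."""
    log = ["init", "global"]  # steps 1 and 2: initialization, then global optimization
    global_optimize()
    for t in range(1, num_frames):  # frame 0 is covered by initialization
        pose = estimate_pose(t)
        if pose is None:
            # Fallback: re-optimize globally, then retry once.
            global_optimize()
            log.append("global")
            pose = estimate_pose(t)
            if pose is None:
                raise RuntimeError(f"pose estimation failed for frame {t}")
        log.append(f"insert:{t}")  # step 3: per-frame insertion
        local_optimize(t)          # step 4: local phase
        log.append(f"local:{t}")
        if t % global_every == 0:  # step 4: periodic global phase
            global_optimize()
            log.append("global")
    return log
```

Returning a log keeps the sketch easy to inspect: a failed pose estimate shows up as an extra `"global"` entry immediately before the corresponding `insert` entry.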
Key Design 1: Octree Anchor Formation
Unlike the fixed-resolution voxels in Scaffold-GS, LongSplat adaptively subdivides voxels based on point cloud density.
Voxels with density below \(\tau_{\text{prune}}\) are pruned, while those exceeding \(\tau_{\text{split}}\) are recursively subdivided up to a maximum level \(L\). The spatial scale of each anchor is proportional to its voxel size: \(s_v \propto \epsilon_v\). An overlap check against existing anchors prevents duplication.
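The prune/split rule can be illustrated with a small recursive sketch. It assumes that "density" means the number of points falling inside a voxel and that the same two thresholds apply at every level; the notes specify neither. The function name `build_anchors` is hypothetical, and the overlap check against existing anchors is omitted for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Anchor = Tuple[Point, float]  # (voxel center, voxel size); anchor scale is proportional to size


def build_anchors(
    points: List[Point],
    voxel_size: float,
    tau_prune: int,
    tau_split: int,
    max_level: int,
    level: int = 0,
) -> List[Anchor]:
    """Voxelize points, prune sparse voxels, and recursively split dense ones."""
    cells: Dict[Tuple[int, ...], List[Point]] = defaultdict(list)
    for p in points:
        key = tuple(int(c // voxel_size) for c in p)  # integer voxel index
        cells[key].append(p)

    anchors: List[Anchor] = []
    for key, pts in cells.items():
        density = len(pts)  # assumption: density is the point count per voxel
        if density < tau_prune:
            continue  # too sparse: prune
        if density > tau_split and level < max_level:
            # Dense voxel: subdivide at half the voxel size.
            anchors.extend(
                build_anchors(pts, voxel_size / 2, tau_prune, tau_split, max_level, level + 1)
            )
        else:
            center = tuple((k + 0.5) * voxel_size for k in key)
            anchors.append((center, voxel_size))
    return anchors
```

Because child voxels are half the parent size and share the same origin, the finer grid nests exactly inside the coarser one, so dense regions receive smaller anchors while sparse regions keep coarse ones.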
Key Design 2: Robust Pose Estimation Module
For each incoming frame \(t\):
- PnP Initialization: 2D–3D correspondences are established using MASt3R 2D matches and back-projected depth from the previous rendered frame; pose is estimated via PnP+RANSAC.
- Photometric Refinement: Minimizes the discrepancy between the rendered image and the actual frame.
- Depth Scale Correction: Aligns the scale between MASt3R depth and rendered depth.
- Occlusion-Aware Expansion: Newly exposed regions are detected via forward warping, back-projected into 3D points, and converted into octree anchors.
- Fallback Mechanism: When PnP fails, a global re-optimization is triggered before retrying.
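Two of the building blocks above lend themselves to a short sketch: lifting a matched pixel to 3D with the rendered depth (the 3D side of a 2D–3D correspondence) and aligning depth scales. The back-projection uses the standard pinhole model. The scale estimate uses a median of per-pixel depth ratios, which is an assumption chosen for robustness to outliers; the notes do not give the paper's actual formula. Both function names are hypothetical.

```python
from statistics import median
from typing import Sequence, Tuple


def backproject(
    u: float, v: float, depth: float,
    fx: float, fy: float, cx: float, cy: float,
) -> Tuple[float, float, float]:
    """Lift pixel (u, v) with the given depth to a camera-frame 3D point (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def depth_scale(
    prior_depth: Sequence[float],
    rendered_depth: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """Estimate s such that s * prior_depth roughly matches rendered_depth.

    Assumption: median of per-pixel ratios over pixels where both depths are valid.
    """
    ratios = [r / p for p, r in zip(prior_depth, rendered_depth) if p > eps and r > eps]
    if not ratios:
        raise ValueError("no valid depth pairs")
    return median(ratios)
```

In a full system, the 3D points from `backproject` would be paired with matched pixels in the new frame and passed to a PnP+RANSAC solver, and the estimated scale would be applied to the prior depth before new anchors are created from it.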
Key Design 3: Visibility-Adaptive Local Window
Co-visibility between frames is measured as the IoU of their anchor visibility sets.
Frames whose co-visibility falls below the threshold \(\tau\) are excluded from the local optimization window, ensuring Gaussians receive balanced multi-view supervision.
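A minimal sketch of this window selection follows, assuming each frame's visibility is stored as a set of anchor IDs and that co-visibility is measured against the current frame. The names `iou` and `local_window` are illustrative and do not come from the paper.

```python
from typing import Dict, List, Set


def iou(a: Set[int], b: Set[int]) -> float:
    """Intersection-over-union of two anchor ID sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def local_window(current: int, visible: Dict[int, Set[int]], tau: float) -> List[int]:
    """Return frames whose anchor visibility overlaps the current frame by at least tau."""
    ref = visible[current]
    return [
        f for f in sorted(visible)
        if f == current or iou(ref, visible[f]) >= tau
    ]
```

For example, with `visible = {0: {1, 2, 3}, 1: {2, 3, 4}, 2: {3, 4, 5}}` and `tau = 0.4`, frame 1 stays in the window of frame 2 (IoU 0.5), while frame 0 is dropped (IoU 0.2).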
Total Loss
Key Experimental Results
Main Results: Free Dataset
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | Notes |
|---|---|---|---|---|
| COLMAP + Scaffold-GS | Failed | - | - | Pose estimation failure |
| CF-3DGS | - | - | - | OOM |
| LocalRF | Low | Low | High | Fragmented reconstruction |
| MASt3R + Scaffold-GS | Medium | Medium | Medium | Inaccurate poses |
| LongSplat | Best | Best | Best | Robust |
LongSplat consistently outperforms all baselines across all scenes.
Ablation Study: Key Components
| Octree Anchor | Pose Refinement | Incremental Opt. | PSNR↑ |
|---|---|---|---|
| ✗ (fixed voxel) | ✗ | ✗ | Baseline |
| ✓ | ✗ | ✗ | Gain + memory saving |
| ✓ | ✓ | ✗ | Significant gain |
| ✓ | ✓ | ✓ | Best |
Octree anchors substantially reduce memory usage while maintaining quality; pose refinement is the largest contributor to performance improvement.
Key Findings
- COLMAP completely fails on 14 out of 19 Free scenes, whereas LongSplat succeeds on all.
- Octree anchors reduce memory by 30–50% compared to fixed voxels.
- The PnP fallback mechanism is triggered on approximately 5–10% of frames, significantly improving robustness.
- LongSplat also achieves state-of-the-art results on Tanks and Temples and the Hike dataset.
Highlights & Insights
- End-to-end unposed reconstruction: No dependency on COLMAP or accurate camera calibration.
- Incremental design: Frame-by-frame processing with controllable memory footprint, suitable for long sequences.
- MASt3R as a soft prior: Used as a flexible initialization rather than a hard constraint; errors are progressively corrected via joint optimization.
- Comprehensive robustness mechanisms: The combination of PnP fallback, photometric refinement, and scale correction ensures stable reconstruction.
Limitations & Future Work
- The quality of MASt3R estimates for the initial frames has a substantial impact on the overall reconstruction.
- PnP may be unstable under extremely fast motion or pure rotation.
- Global optimization becomes slower as the number of frames grows.
- When camera intrinsics are unknown, the method relies on focal length estimates from MASt3R.
Related Work & Insights
- CF-3DGS: Progressive unposed 3DGS optimization
- LocalRF: Localized radiance field construction
- MASt3R / DUSt3R: 3D foundation models
- Scaffold-GS / Octree-GS: Anchor-based / octree-based 3DGS
Rating
- Novelty: ⭐⭐⭐⭐ (a complete system integrating octree anchors, incremental joint optimization, and PnP fallback)
- Practicality: ⭐⭐⭐⭐⭐ (directly handles casually captured smartphone videos, with high practical value)
- Experimental Thoroughness: ⭐⭐⭐⭐ (multi-dataset validation with comprehensive ablation studies)
- Writing Quality: ⭐⭐⭐⭐ (the system description is clear, with intuitive pipeline diagrams)