Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction¶
Conference: ICCV 2025 arXiv: 2504.07961 Code: https://geo4d.github.io Area: 3D Vision Keywords: 4D reconstruction, video diffusion models, multi-modal geometry, point map, disparity map, ray map, dynamic scenes
TL;DR¶
This paper adapts a pretrained video diffusion model (DynamiCrafter) into a monocular 4D dynamic scene reconstructor that simultaneously predicts three complementary geometric modalities — point maps, disparity maps, and ray maps. Through a multi-modal alignment and fusion algorithm combined with sliding-window inference, the model generalizes zero-shot to real videos despite being trained exclusively on synthetic data, substantially outperforming current state-of-the-art video depth estimation methods.
Background & Motivation¶
Core Problem¶
Feed-forward 4D reconstruction from monocular video — directly recovering the 3D geometry of dynamic scenes (including camera motion and moving objects) from a single-camera video — is an extremely challenging yet consequential problem in computer vision, with broad applications in video understanding, computer graphics, and robotics.
Limitations of Prior Work¶
Iterative optimization methods (NeRF/3DGS-based): require per-video optimization, incurring large computational overhead and depending on accurate monocular depth priors (e.g., MegaSaM, Uni4D).
Feed-forward methods (e.g., MonST3R): extend DUSt3R to dynamic scenes but rely on highly customized architectures and large quantities of 3D-annotated real training data. Such annotations are extremely difficult to obtain for dynamic scenes; falling back to synthetic data introduces a domain gap.
Depth diffusion models (e.g., DepthCrafter): estimate depth only, without recovering complete 4D geometry (camera motion + 3D structure).
Key Insight¶
Video generation models (as world simulator proxies) implicitly encode an understanding of camera motion, perspective effects, and object dynamics, yet produce only pixels rather than actionable 3D information. The core idea of Geo4D is to make this implicit 3D understanding explicit — by fine-tuning a video diffusion model to directly output geometric modalities.
Why Multiple Modalities?¶
A single viewpoint-invariant point map encodes complete 4D geometry but has limited dynamic range: distant objects and the sky (infinite depth) cannot be represented. This motivates the inclusion of:
- Disparity maps: zero naturally represents infinite distance, offering better dynamic range
- Ray maps: encode camera parameters, are defined for every pixel, and are independent of scene geometry
The three modalities are theoretically redundant but practically complementary — their fusion substantially improves robustness.
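The dynamic-range argument for disparity can be made concrete with a small sketch (a generic disparity-to-depth inversion, not code from the paper): zero disparity maps exactly to infinite depth, which a raw depth or point map cannot represent with finite values.

```python
import numpy as np

def disparity_to_depth(disparity: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Invert a (scale-ambiguous) disparity map into depth.

    Zero disparity maps to infinite depth, which is why disparity has
    better dynamic range than a raw depth/point map: sky and far
    background are representable as exact zeros.
    """
    depth = np.full_like(disparity, np.inf)   # default: infinitely far
    valid = disparity > eps                   # avoid dividing by ~0
    depth[valid] = 1.0 / disparity[valid]
    return depth

disp = np.array([[0.0, 0.5], [1.0, 2.0]])
depth = disparity_to_depth(disp)   # [[inf, 2.0], [1.0, 0.5]]
```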
Method¶
Overall Architecture¶
Given a monocular video \(\mathcal{I}=\{I^i\}_{i=1}^N\), the network \(f_\theta\) simultaneously predicts three geometric modalities per frame: $\(f_\theta: \{I^i\}_{i=1}^N \mapsto \{(D^i, X^i, r^i)\}_{i=1}^N\)$ where \(D^i \in \mathbb{R}^{H \times W \times 1}\) is the disparity map, \(X^i \in \mathbb{R}^{H \times W \times 3}\) is the viewpoint-invariant point map (in the coordinate frame of the first frame), and \(r^i \in \mathbb{R}^{H \times W \times 6}\) is the Plücker-coordinate ray map. No camera parameters are required as input.
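The ray map \(r^i\) stores one Plücker ray per pixel. The following is a generic construction of such a 6-channel map (not the paper's code); it assumes \(K\) is the 3×3 intrinsics, \(R\) the camera-to-world rotation, and \(o\) the camera center in the first-frame coordinates.

```python
import numpy as np

def plucker_ray_map(K, R, o, H, W):
    """Build an (H, W, 6) Plücker ray map: unit direction d and moment m = o x d.

    Generic construction of the 6-channel per-pixel ray representation;
    defined for every pixel and independent of scene geometry.
    """
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)         # homogeneous pixels
    dirs = pix @ np.linalg.inv(K).T @ R.T                    # back-project, rotate to world
    d = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)  # unit ray directions
    m = np.cross(np.broadcast_to(o, d.shape), d)             # ray moments
    return np.concatenate([d, m], axis=-1)
```

For an identity camera at the origin the moments vanish (all rays pass through the origin), a quick sanity check on the construction.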
Key Designs¶
1. Multi-Modal Latent Encoding¶
Built upon DynamiCrafter's VAE encoder-decoder: - Disparity maps and ray maps: directly reuse the pretrained image encoder-decoder without modification - Point maps: the VAE decoder is fine-tuned using an uncertainty-weighted reconstruction loss: $\(\mathcal{L} = -\sum_{uv} \ln \frac{1}{\sqrt{2}\sigma_{uv}} \exp \frac{-\sqrt{2}\ell_1(\mathcal{D}(\mathcal{E}(X))_{uv}, X_{uv})}{\sigma_{uv}}\)$ where \(\sigma\) is the uncertainty predicted by an auxiliary branch of the decoder. The encoder is kept frozen to minimize changes to the latent space. Point maps are normalized to \([-1,1]\) to be compatible with the pretrained encoder.
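Expanding the negative log of that Laplace-style density gives two familiar terms: a \(\sigma\)-weighted \(\ell_1\) residual plus a \(\ln \sigma\) penalty that keeps the network from inflating its uncertainty everywhere. A minimal numpy sketch of the loss (shapes and reduction are assumptions for illustration):

```python
import numpy as np

def laplace_nll(pred, target, sigma):
    """Uncertainty-weighted reconstruction loss: -log of a Laplace density
    with per-pixel scale sigma / sqrt(2).

    -ln[(1 / (sqrt(2) sigma)) * exp(-sqrt(2) * l1 / sigma)]
      = sqrt(2) * l1 / sigma + ln(sqrt(2) * sigma)
    """
    l1 = np.abs(pred - target).sum(axis=-1)   # per-pixel l1 over xyz channels
    return float((np.sqrt(2.0) * l1 / sigma + np.log(np.sqrt(2.0) * sigma)).sum())
```

With a perfect reconstruction and \(\sigma = 1/\sqrt{2}\) everywhere, both terms vanish; larger residuals or miscalibrated \(\sigma\) increase the loss.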
2. Video Conditioning Injection (Dual Stream)¶
- Global stream: each frame \(I^i\) is encoded via CLIP and injected into each U-Net block via cross-attention through a lightweight query transformer
- Local stream: spatial features extracted by the VAE encoder are concatenated channel-wise with the noisy latents of the three geometric modalities
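A shape-level sketch of the local stream's channel-wise concatenation. The exact channel layout (a single denoising stream with the three modality latents stacked) is an assumption for illustration; `C`, `h`, `w` are arbitrary latent sizes.

```python
import numpy as np

# Hypothetical latent sizes: frames, latent channels, latent height/width
N, C, h, w = 16, 4, 32, 32
vae_feats     = np.zeros((N, C, h, w))      # spatial features E(I^i), one per frame
noisy_latents = np.zeros((N, 3 * C, h, w))  # point + disparity + ray latents
# local stream: clean video features concatenated channel-wise with noisy latents
unet_input = np.concatenate([noisy_latents, vae_feats], axis=1)
assert unet_input.shape == (N, 4 * C, h, w)
```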
3. Multi-Modal Alignment Fusion (Core Inference Algorithm)¶
During inference, long videos are divided into overlapping clips using a sliding window (\(V=16\) frames, stride \(s=4\)). Global consistency is achieved by jointly optimizing the following four losses:
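The clip windowing can be sketched as follows (the tail-handling policy when the stride does not evenly divide the video length is an assumption; the paper specifies only \(V=16\), \(s=4\)):

```python
def sliding_windows(num_frames: int, window: int = 16, stride: int = 4):
    """Overlapping clip indices for sliding-window inference (V=16, s=4).

    Consecutive clips overlap by window - stride frames; the alignment
    step exploits this overlap to stitch globally consistent geometry.
    """
    assert num_frames >= window
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:   # make sure the tail is covered
        starts.append(num_frames - window)
    return [list(range(s, s + window)) for s in starts]

# e.g. a 24-frame video yields clips starting at frames 0, 4, 8
```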
Point map alignment (group-wise extension of DUSt3R): $\(\mathcal{L}_p = \sum_{g \in \mathcal{G}} \sum_{i \in g} \sum_{uv} \left\| \frac{X^i_{uv} - \lambda_p^g P_p^g X^{i,g}_{uv}}{\sigma^{i,g}_{uv}} \right\|_1\)$ recovering per-frame camera intrinsics \(K_p^i\), rotation \(R_p^i\), center \(o_p^i\), and disparity \(D_p^i\).
Disparity map alignment: $\(\mathcal{L}_d = \sum_{g} \sum_{i \in g} \|D_p^i - \lambda_d^g D_d^{i,g} - \beta_d^g\|_1\)$
Ray map alignment (camera trajectory alignment): $\(\mathcal{L}_c = \sum_{g} \sum_{i \in g} (\|R_p^{i\top} R_c^g R_c^{i,g} - I\|_F + \|\lambda_c^g o_c^{i,g} + \beta_c^g - o_p^i\|_2)\)$
Trajectory smoothness regularization: $\(\mathcal{L}_s = \sum_{i=1}^{N-1} (\|R_p^{i\top} R_p^{i+1} - I\|_F + \|o_p^{i+1} - o_p^i\|_2)\)$
Final objective: \(\mathcal{L}_{all} = \alpha_1 \mathcal{L}_p + \alpha_2 \mathcal{L}_d + \alpha_3 \mathcal{L}_c + \alpha_4 \mathcal{L}_s\)
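What the per-clip parameters \((\lambda_d^g, \beta_d^g)\) in \(\mathcal{L}_d\) do can be illustrated in isolation: they fit a scale and shift mapping each clip's disparity onto the globally aligned reference. The sketch below uses a closed-form \(\ell_2\) fit as a stand-in for the paper's \(\ell_1\) objective, which is solved inside the joint optimization rather than per-clip in closed form.

```python
import numpy as np

def align_scale_shift(ref: np.ndarray, src: np.ndarray):
    """Least-squares scale/shift aligning src disparity to a reference:
    argmin over (lam, beta) of ||ref - (lam * src + beta)||^2.

    Illustrative l2 stand-in for the per-clip (lambda_d^g, beta_d^g)
    parameters of the l1 disparity-alignment loss L_d.
    """
    A = np.stack([src.ravel(), np.ones(src.size)], axis=1)  # design matrix [src, 1]
    (lam, beta), *_ = np.linalg.lstsq(A, ref.ravel(), rcond=None)
    return lam, beta

src = np.linspace(0.0, 1.0, 100)
ref = 2.0 * src + 0.3
lam, beta = align_scale_shift(ref, src)   # recovers lam ≈ 2.0, beta ≈ 0.3
```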
Loss & Training¶
- Trained exclusively on 5 synthetic datasets (Spring, BEDLAM, PointOdyssey, TartanAir, Virtual KITTI)
- Progressive training schedule: single-modality point map training → multi-resolution training → gradual incorporation of ray maps and disparity maps
- 4× H100 GPUs, approximately one week
- DDIM 5-step sampling at inference
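Few-step DDIM sampling subsamples the full noise schedule and, at each step, re-estimates the clean latent from the predicted noise before jumping ahead. A generic deterministic DDIM sketch (not Geo4D's exact scheduler):

```python
import numpy as np

def ddim_sample(eps_model, shape, alpha_bar, steps=5, seed=0):
    """Deterministic DDIM sampling on a coarse schedule (e.g. 5 steps).

    alpha_bar is the full cumulative-product noise schedule, indexed by
    timestep; eps_model(x, t) predicts the noise component of x at t.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                         # start from pure noise
    T = len(alpha_bar) - 1
    ts = np.linspace(T, 0, steps + 1).round().astype(int)  # subsampled timesteps
    for t, t_prev in zip(ts[:-1], ts[1:]):
        a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
        eps = eps_model(x, t)
        x0 = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)   # re-estimate clean latent
        x = np.sqrt(a_prev) * x0 + np.sqrt(1 - a_prev) * eps  # deterministic jump
    return x
```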
Key Experimental Results¶
Main Results: Video Depth Estimation¶
| Method | Sintel AbsRel↓ | Sintel δ<1.25↑ | Bonn AbsRel↓ | Bonn δ<1.25↑ | KITTI AbsRel↓ | KITTI δ<1.25↑ |
|---|---|---|---|---|---|---|
| Depth-Anything-V2 | 0.367 | 55.4 | 0.106 | 92.1 | 0.140 | 80.4 |
| DepthCrafter | 0.270 | 69.7 | 0.071 | 97.2 | 0.104 | 89.6 |
| MonST3R | 0.335 | 58.5 | 0.063 | 96.4 | 0.104 | 89.5 |
| Geo4D | 0.205 | 73.5 | 0.059 | 97.2 | 0.086 | 93.7 |
Geo4D achieves consistent improvements across all three benchmarks. Compared to DepthCrafter, which is likewise built on a pretrained video diffusion backbone, AbsRel is reduced by 24.1% on Sintel and 17.3% on KITTI.
Camera Pose Estimation¶
| Method | Sintel ATE↓ | Sintel RPE-R↓ | TUM ATE↓ | TUM RPE-R↓ |
|---|---|---|---|---|
| MonST3R | 0.108 | 0.732 | 0.063 | 1.217 |
| Geo4D | 0.185 | 0.547 | 0.073 | 0.635 |
Geo4D is the first method to estimate camera parameters for dynamic scenes using a generative model. Rotation estimation (RPE-R) substantially outperforms discriminative methods, while translation estimation is competitive.
Ablation Study: Multi-Modal Training and Inference¶
| Training Modalities | Inference Modalities | Sintel AbsRel↓ | ATE↓ | RPE-R↓ |
|---|---|---|---|---|
| Point map only | Point map only | 0.232 | 0.335 | 0.731 |
| All three | Point map only | 0.223 | 0.237 | 0.566 |
| All three | Disparity map only | 0.211 | — | — |
| All three | Full fusion | 0.205 | — | — |
Key findings:
- Multi-modal training improves point-map-only inference, demonstrating an auxiliary-task effect
- Disparity maps achieve the best pure depth metrics due to superior dynamic range
- Full three-modality fusion yields the best overall performance
Highlights & Insights¶
- Paradigm innovation: the first work to demonstrate that a general-purpose video diffusion model can be effectively repurposed as a 4D geometry reconstructor, without requiring a customized 3D architecture
- Elegant complementary multi-modal design: point maps encode complete structure but suffer from limited dynamic range; disparity maps handle distant regions; ray maps handle camera parameters — each modality covers the weaknesses of the others
- Zero-shot generalization from synthetic data: the strong priors embedded in the video generation model enable reliable generalization to real videos despite purely synthetic training data
- Uncertainty-driven alignment: the uncertainty \(\sigma\) predicted by the VAE decoder participates directly in the multi-modal fusion optimization, automatically down-weighting unreliable predictions
Limitations & Future Work¶
- Scale ambiguity of point maps — monocular video cannot determine absolute scale, recovering only up-to-scale geometry
- Inference speed is constrained by diffusion sampling (DDIM 5-step accelerates inference but is still not real-time)
- Robustness to extreme dynamic scenes (rapid occlusion or appearance of objects) requires further evaluation
- Error accumulation in the sliding-window strategy over very long videos
Related Work & Insights¶
- Dynamic scene reconstruction: DUSt3R → MonST3R → Easi3R (progressive extension from static to dynamic)
- Geometric diffusion models: Marigold (image depth), DepthCrafter (video depth), Aether (depth + ray map), GeometryCrafter (point map VAE)
- Video foundation models: DynamiCrafter, SVD, etc., serving as backbones for 3D/4D understanding
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Complete pipeline from video diffusion to 4D geometry with a strongly original multi-modal design
- Technical Depth: ⭐⭐⭐⭐⭐ — Comprehensive mathematical framework for multi-modal encoding, decoding, and alignment
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-benchmark comparisons and complete ablations; efficiency analysis is lacking
- Value: ⭐⭐⭐⭐ — Strong zero-shot generalization capability, though inference speed requires improvement