Geometry-as-Context: Modulating Explicit 3D into Geometry Context for Scene-Consistent Video Generation¶
Conference: CVPR 2026 · arXiv: 2602.21929 · Code: None · Area: Video Generation · Keywords: scene-consistent video generation, geometry context, autoregressive generation, camera control, 3D reconstruction
TL;DR¶
This paper proposes the Geometry-as-Context (GaC) framework, which replaces the non-differentiable operators (3D reconstruction + rendering) in reconstruction-based scene video generation with a unified autoregressive video generation model. By embedding geometric information (depth maps) as interleaved context into the generation sequence, GaC enables end-to-end training and mitigates accumulated errors.
Background & Motivation¶
Scene-consistent video generation aims to explore 3D scenes along camera trajectories while maintaining high 3D consistency. Existing methods fall into two categories:

- Video-based methods (CameraCtrl, VMem, etc.): rely solely on video models to maintain consistency; memory retrieval struggles with complex scenes and large camera motions.
- Reconstruction-based methods (SceneScape, ViewCrafter, GEN3C, etc.): iteratively execute "geometry estimation → 3D reconstruction → rendering → inpainting," but suffer from two fundamental issues:
  1. Non-differentiable operators: the back-projection and rendering operations in inverse rendering are non-differentiable, blocking gradient propagation.
  2. Non-end-to-end training: geometry prediction and image inpainting rely on separate models, so accumulated errors cannot be mitigated through learning.
Unlike accumulated errors in long-range video generation that can be alleviated via autoregressive training, accumulated errors in reconstruction-based methods are difficult to eliminate due to non-differentiable operations and model separation. This constitutes the core problem addressed in this paper.
Method¶
Overall Architecture¶
GaC "flattens" the iterative pipeline of reconstruction-based methods into a single autoregressive video generation framework: a unified DiT model handles geometry estimation, viewpoint transformation simulation, and image inpainting simultaneously. The input sequence interleaves RGB frames and geometry frames: \(\{I_i, \text{<Geometry>}, G_i, \text{<Image>}, I_{i+1}, \cdots\}\), where text tokens instruct the model whether to generate geometry or RGB next.
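The interleaved sequence above can be sketched as a simple list-building step. This is a minimal illustration (not the authors' code): the frame placeholders and special-token strings are hypothetical, and geometry entries are kept sparse so that frames whose geometry is dropped (see Geometry Dropout below) fall back to pure image-to-image generation.

```python
# Hypothetical sketch of building the interleaved context sequence
# {I_i, <Geometry>, G_i, <Image>, I_{i+1}, ...} described in the paper.

def build_interleaved_sequence(rgb_frames, geo_frames):
    """Interleave RGB and geometry frames with text control tokens.

    rgb_frames: list of per-frame RGB latents [I_0, I_1, ...]
    geo_frames: dict mapping frame index -> geometry latent G_i
                (sparse: Geometry Dropout may omit some frames)
    """
    seq = []
    for i, rgb in enumerate(rgb_frames):
        seq.append(("image", rgb))
        if i in geo_frames:                    # geometry may be dropped
            seq.append(("token", "<Geometry>"))
            seq.append(("geometry", geo_frames[i]))
        if i + 1 < len(rgb_frames):
            seq.append(("token", "<Image>"))   # instructs: next output is RGB
    return seq


seq = build_interleaved_sequence(["I0", "I1", "I2"], {0: "G0", 1: "G1"})
print([x[1] for x in seq])
# ['I0', '<Geometry>', 'G0', '<Image>', 'I1', '<Geometry>', 'G1', '<Image>', 'I2']
```

The text tokens `<Geometry>` and `<Image>` act as task switches, telling the unified DiT whether the next frame to generate is geometry or RGB.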
Key Designs¶
- Geometry as Context (Variant #1): Simplifies the original four-step iteration (geometry estimation → back-projection → rendering → inpainting) to \(\{G_i, I_{i+1}\} = \varrho(\{I_i, G_i\}, P_{i+1})\). The model first estimates the geometry \(G_i\) of the current frame, then generates the next RGB frame from \(G_i\) and the target pose \(P_{i+1}\). Incorporating geometry context (a) shortens the sequence for improved efficiency; (b) endows the model with 3D awareness to enhance scene consistency; and (c) exploits the large modality gap between RGB and geometry to help the model distinguish the two tasks.
- Camera Gated Attention (CGA): Enhances the model's utilization of camera pose. The Plücker-ray-encoded camera pose \(r_i\) is added to the self-attention query, and a gating matrix modulates the attention output:
  \(\{Q_{res}, Gate\} = \text{Linear}_2(Q + r_i)\)
  \(O = \text{SDPA}(Q + Q_{res}, K, V)\)
  \(O = \text{Linear}_3(O \cdot \sigma(Gate))\)
  This design enables the model to distinguish the different roles of camera pose in geometry prediction vs. novel view synthesis.
- Geometry Dropout: During training, geometry context in the interleaved sequence is randomly dropped at rate \(r\); dropped frames degrade to pure image-to-image generation (Variant #3). Benefits: (a) reduced sequence length improves training efficiency; (b) inference can produce RGB outputs without geometry prediction; (c) the model maintains scene consistency with or without geometry context. Training time drops from 24 s/step to 11 s/step and inference time from 4.6 s/img to 2.2 s/img, with negligible performance degradation.
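The CGA equations above can be sketched as a small PyTorch module. This is a hedged sketch, not the paper's implementation: it assumes single-head attention, an implied first linear layer producing Q/K/V, and a Plücker-ray camera embedding already projected to the query dimension; all layer names and shapes are illustrative.

```python
# Sketch of Camera Gated Attention (CGA), following the three equations:
#   {Q_res, Gate} = Linear_2(Q + r_i)
#   O = SDPA(Q + Q_res, K, V)
#   O = Linear_3(O * sigmoid(Gate))
# Assumptions: single-head attention; cam_emb is the Plücker-ray pose
# embedding r_i, pre-projected to the token dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CameraGatedAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim)   # implied QKV projection
        self.linear2 = nn.Linear(dim, 2 * dim)  # produces {Q_res, Gate}
        self.linear3 = nn.Linear(dim, dim)      # gated output projection

    def forward(self, x, cam_emb):
        # x: (B, N, dim) token sequence; cam_emb: (B, N, dim) pose embedding
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        # {Q_res, Gate} = Linear_2(Q + r_i)
        q_res, gate = self.linear2(q + cam_emb).chunk(2, dim=-1)
        # O = SDPA(Q + Q_res, K, V)
        out = F.scaled_dot_product_attention(q + q_res, k, v)
        # O = Linear_3(O * sigma(Gate))
        return self.linear3(out * torch.sigmoid(gate))


attn = CameraGatedAttention(dim=64)
x = torch.randn(2, 10, 64)
r = torch.randn(2, 10, 64)
print(attn(x, r).shape)  # torch.Size([2, 10, 64])
```

The gate lets the pose signal amplify or suppress attention output per channel, which is what allows one set of weights to treat camera pose differently for geometry prediction versus novel view synthesis.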
Loss & Training¶
- Base model: Bagel-7B (supporting text-image interleaved modeling)
- Training data: RealEstate10K (66,033 video clips)
- 8-frame sequence training; the first 1–4 frames serve as context views, the remaining as target views
- Every 4 consecutive views are tiled into a grid frame to enhance consistency (resolution \(640 \times 352\))
- Images encoded with FLUX-VAE
- Trained on 8× H100 GPUs for 40,000 steps (~2 days)
- Context-as-memory strategy used at inference to select context views; no classifier-free guidance
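The 4-view grid tiling in the setup above amounts to a simple tensor rearrangement. A minimal sketch, assuming a 2×2 layout (so four 320×176 views tile into one 640×352 grid frame; the 2×2 arrangement is my assumption, not stated in the notes):

```python
# Hypothetical sketch of tiling 4 consecutive views into one grid frame.
# The 2x2 layout is an assumption; the notes only state that every 4
# views form a 640x352 grid frame.
import torch

def tile_views_to_grid(views):
    """views: (B, 4, C, H, W) -> (B, C, 2H, 2W) grid frame."""
    b, n, c, h, w = views.shape
    assert n == 4, "expects exactly 4 consecutive views"
    top = torch.cat([views[:, 0], views[:, 1]], dim=-1)     # (B, C, H, 2W)
    bottom = torch.cat([views[:, 2], views[:, 3]], dim=-1)  # (B, C, H, 2W)
    return torch.cat([top, bottom], dim=-2)                 # (B, C, 2H, 2W)


grid = tile_views_to_grid(torch.randn(1, 4, 3, 176, 320))
print(grid.shape)  # torch.Size([1, 3, 352, 640])
```

Tiling views into one frame lets the DiT attend jointly across the 4 views, which is the stated motivation for enhancing cross-view consistency.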
Key Experimental Results¶
Main Results¶
| Dataset | Metric | GaC (Ours) | Voyager | GEN3C | ViewCrafter |
|---|---|---|---|---|---|
| RE10K | PSNR↑ | 19.01 | 18.70 | 18.12 | 16.72 |
| RE10K | SSIM↑ | 0.656 | 0.616 | 0.624 | 0.585 |
| RE10K | LPIPS↓ | 0.354 | 0.395 | 0.402 | 0.417 |
| RE10K | FID↓ | 55.76 | 65.12 | 66.20 | 80.47 |
| RE10K | \(R_{err}\)↓ | 0.024 | 0.035 | 0.027 | 0.022 |
| RE10K | \(T_{err}\)↓ | 0.270 | 0.596 | 0.344 | 0.327 |
| T&T | PSNR↑ | 15.77 | 15.24 | 15.32 | 12.59 |
| RE10K (round-trip) | PSNR↑ | 16.34 | 15.80 | 15.28 | 15.77 |
| RE10K (round-trip) | FID↓ | 64.31 | 79.81 | 80.03 | 72.14 |
Ablation Study¶
| Configuration | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | \(T_{err}\)↓ | Note |
|---|---|---|---|---|---|---|
| None (Variant #3) | 16.34 | 0.551 | 0.412 | 89.03 | 0.351 | No geometry context |
| Warped img (V#2) | 18.33 | 0.671 | 0.383 | 59.12 | 0.299 | Rendered image as context |
| Geometry (V#1) | 19.01 | 0.656 | 0.354 | 55.76 | 0.270 | Geometry as context |
| w/o CGA | 18.57 | 0.581 | 0.461 | 68.42 | 0.469 | CGA removed |
| w/ CGA | 19.01 | 0.656 | 0.354 | 55.76 | 0.270 | Full method |
| w/o Geo Dropout | 19.23 | 0.660 | 0.342 | 57.18 | 0.248 | No dropout (marginally better but 2× slower) |
| w/ Geo Dropout | 19.01 | 0.656 | 0.354 | 55.76 | 0.270 | ~2× speedup |
Key Findings¶
- Geometry as context vs. no context: PSNR improves by 2.67 and FID decreases by 33.27, demonstrating the critical role of explicit 3D information.
- CGA reduces translation error \(T_{err}\) from 0.469 to 0.270 (a 42% reduction), substantially improving camera control precision.
- Geometry Dropout achieves ~2× speedup in both training and inference with negligible performance loss.
- Depth maps vs. point maps as geometry: performance is comparable, but depth maps are slightly superior (smaller modality gap to natural images, easier for the VAE to encode).
- In round-trip trajectory evaluation, GaC faithfully recovers objects upon return (e.g., a disappeared monitor), demonstrating long-range 3D memory capability.
Highlights & Insights¶
- Elegant unified framework: Flattening the iterative reconstruction pipeline into a single autoregressive DiT model fundamentally resolves the issues of non-differentiable operations and non-end-to-end training.
- Geometry Dropout achieves dual benefits: It reduces computational cost while enabling the model to flexibly choose whether to output geometry information at inference time.
- CGA is an elegant design: Query modulation combined with gated output allows a single model to distinguish the role of camera pose across different sub-tasks.
- Round-trip trajectory robustness: GaC demonstrates strong scene memory and consistency on forward-and-return trajectories.
Limitations & Future Work¶
- All methods exhibit significant performance degradation on round-trip trajectories; long-range context memory strategies require further improvement.
- Training exclusively on RealEstate10K limits generalization to more diverse scenes (outdoor, in-the-wild), necessitating more varied data.
- The resolution of \(640 \times 352\) is relatively low; high-resolution scene generation remains unexplored.
- FID on Tanks-and-Temples under round-trip trajectories is inferior to Voyager, indicating room for improvement in large-motion scenarios.
- The base model Bagel-7B is large, and inference cost remains non-trivial (2.2 s/img).
Related Work & Insights¶
- ViewCrafter: An iterative method combining point clouds and video diffusion; the unified framework proposed in this paper is more elegant and incurs smaller accumulated errors.
- GEN3C/Voyager: Introduce point clouds/3DGS as 3D representations but remain constrained by non-differentiable rendering.
- ReCamMaster: A camera control method based on frame-dimension concatenation; GaC inherits this idea while incorporating geometry context.
- Insights: The paradigm of "internalizing non-differentiable operations as capabilities of the generative model" is generalizable to a broader range of 3D vision tasks; text-guided multi-task scheduling (geometry vs. RGB generation) constitutes an effective design paradigm for interleaved multimodal models.
Rating¶
- Novelty: ⭐⭐⭐⭐ Flattening the iterative reconstruction pipeline into autoregressive generation is an elegant contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks, round-trip trajectories, and thorough ablations, though training data is limited in diversity.
- Writing Quality: ⭐⭐⭐⭐ Motivation is thoroughly analyzed, variant analysis is clear, and algorithmic descriptions are complete.
- Value: ⭐⭐⭐⭐ Provides a new paradigm for scene video generation; the end-to-end philosophy has broad applicability.