GenFusion: Closing the Loop between Reconstruction and Generation via Videos
Conference: CVPR 2025
arXiv: 2503.21219
Code: https://genfusion.sibowu.com
Area: 3D Vision
Keywords: 3D Reconstruction, Video Diffusion Models, Sparse-view Reconstruction, 3D Gaussian Splatting, Cyclic Fusion
TL;DR
Proposes GenFusion, which uses a reconstruction-driven video diffusion model to fix 3D reconstruction artifacts and generate content in unobserved regions. It designs a cyclic fusion pipeline to iteratively incorporate generation results into the training set, achieving high-quality 3D scene reconstruction and content expansion under sparse-view settings.
Background & Motivation
Background: 3D reconstruction (NeRF/3DGS) and 3D generation are both evolving rapidly, but a significant "conditioning gap" separates them. Scalable 3D scene reconstruction typically requires dense input views, whereas 3D generation usually takes a single image or no image at all. Reconstruction produces floaters and background collapse in under-observed regions; generation can create content from scratch, but its scene-level quality and view coverage fall far short of densely captured reconstructions.
Limitations of Prior Work: Sparse-view 3D reconstruction is inherently under-constrained: infinitely many photo-consistent interpretations can match the input images. Existing regularizers (sparsity, smoothness, monocular depth guidance) help but still struggle on trajectories far from the training views. Feed-forward reconstruction methods (such as pixelSplat and MVSplat) typically saturate at only 4-8 input images. Methods like ReconFusion perform well on view interpolation but degrade on trajectories that deviate significantly from the input views.
Key Challenge: 3D constraints and generative priors are misaligned: reconstruction needs dense view coverage to resolve ambiguity, while generative models carry rich priors but lack 3D consistency constraints. How can the two complement each other?
Goal: To explore how 3D reconstruction and generation can complement each other in a scalable manner, relaxing the constraints on the number of input views, and achieving high-quality novel-view synthesis and scene expansion from sparse-view or even occluded inputs.
Key Insight: Connect reconstruction and generation in video space. A video diffusion model is trained to restore clean video from artifact-rich RGB-D renderings, and its restored outputs are then fed back iteratively into the training set of the 3D reconstruction. The training data can be produced systematically by masking 75% of the pixels in each input frame before running 3D reconstruction.
Core Idea: Simulating out-of-view artifacts using masked reconstruction to train the video diffusion model's inpainting capabilities, and then iteratively incorporating the generated videos into the reconstruction training set via cyclic fusion, achieving a positive feedback loop between reconstruction and generation.
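The 75% masking can be illustrated with a small sketch. It assumes the quadrant split described under Key Designs (four non-overlapping patches, one kept per scene); the function names `quadrant_mask` and `visible_fraction` are hypothetical and not taken from the GenFusion code.

```python
def quadrant_mask(height, width, keep="top_left"):
    """Return a height x width grid of booleans; True marks a pixel kept for reconstruction.

    `keep` is one of "top_left", "top_right", "bottom_left", "bottom_right".
    """
    rows = {"top": range(0, height // 2), "bottom": range(height // 2, height)}
    cols = {"left": range(0, width // 2), "right": range(width // 2, width)}
    vertical, horizontal = keep.split("_")
    keep_rows, keep_cols = rows[vertical], cols[horizontal]
    return [
        [(r in keep_rows) and (c in keep_cols) for c in range(width)]
        for r in range(height)
    ]


def visible_fraction(mask):
    """Fraction of pixels that remain visible under the mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total
```

For even image dimensions, keeping one quadrant leaves a quarter of the pixels visible, which matches the 75% masking ratio quoted above.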
Method
Overall Architecture
The method has two phases. The pre-training phase fine-tunes a video diffusion model (based on DynamiCrafter) on a large-scale dataset (DL3DV-10K) so that it learns to restore clean videos from artifact-rich RGB-D renderings. The zero-shot generalization phase applies cyclic fusion to novel scenes, iterating: reconstruction \(\rightarrow\) rendering artifact-rich video \(\rightarrow\) diffusion model inpainting \(\rightarrow\) incorporating inpainted results into the training set \(\rightarrow\) continuing reconstruction.
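The control flow of the second phase can be sketched as a loop. This is only a structural outline under stated assumptions: the callables `optimize_step`, `sample_trajectory`, `render_rgbd`, and `inpaint_video` are hypothetical stand-ins for the reconstruction optimizer, trajectory sampler, renderer, and video diffusion model, and `fuse_every` is an assumed name for the interval between fusion rounds.

```python
def cyclic_fusion(train_views, num_iters, fuse_every,
                  optimize_step, sample_trajectory, render_rgbd, inpaint_video):
    """Alternate reconstruction steps with periodic generation rounds.

    All four callables are placeholders supplied by the caller; this function
    only fixes the order in which they run.
    """
    generated_views = []  # (pose, restored_frame) pairs added as extra supervision
    for k in range(1, num_iters + 1):
        # Reconstruction step, supervised by real views plus views generated so far.
        optimize_step(train_views, generated_views, k)
        if k % fuse_every == 0:
            trajectory = sample_trajectory(k)    # list of novel camera poses
            degraded = render_rgbd(trajectory)   # artifact-rich RGB-D frames
            restored = inpaint_video(degraded)   # frames restored by the diffusion model
            generated_views.extend(zip(trajectory, restored))
    return generated_views
```

Passing the components in as callables keeps the sketch independent of any particular renderer or diffusion library.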
Key Designs
- Masked 3D Reconstruction Training Data Generation:
  - Function: Generate "artifact input, ground-truth output" training pairs for the video diffusion model.
  - Mechanism: Divide each input video frame into 4 non-overlapping patches (top-left, top-right, bottom-left, bottom-right). Only one patch per scene (e.g., top-left or bottom-right) is kept for 3D reconstruction (2DGS), while the remaining 75% of pixels are masked. Full-frame RGB-D videos are then rendered along the original camera trajectory as the artifact inputs, and the original videos serve as the ground-truth outputs.
  - Design Motivation: Uniformly subsampling frames only simulates view interpolation, while central splitting leaves most of the content unobserved. Masked reconstruction simulates a narrow field-of-view camera, producing artifact patterns (floaters, holes, black regions) that closely match actual out-of-view renderings while retaining enough context to support extrapolation.
- Reconstruction-Driven Video Diffusion Model:
  - Function: Generate realistic, artifact-free RGB-D videos from artifact-rich RGB-D videos.
  - Mechanism: Fine-tune DynamiCrafter, replacing its RGB VAE with a pre-trained RGB-D VAE (LDM3D) to introduce geometric information without altering the diffusion architecture. The artifact RGB-D video is encoded and concatenated frame by frame with the noise as sequential conditioning input, while CLIP features from the nearest input views provide global scene information. Training has two stages: a coarse stage at \(512 \times 320\) resolution for 30K steps and a fine stage at \(960 \times 512\) resolution for 34K steps.
  - Design Motivation: Although replacing the RGB VAE with an RGB-D VAE alters the pre-trained latent space, the ablation shows that FID actually improves (25.40 vs 26.16); the explanation offered is that depth provides geometric constraints that help inter-frame consistency. Sequential conditioning lets the rich visual detail of the rendered video guide generation directly.
- Cyclic Fusion Optimization Pipeline:
  - Function: Iteratively incorporate inpainted videos into the 3D reconstruction training set to achieve progressive scene expansion and artifact removal.
  - Mechanism: Using the 2DGS representation, run one fusion round every K optimization iterations: sample new trajectories \(\rightarrow\) render RGB-D videos \(\rightarrow\) diffusion model inpainting \(\rightarrow\) incorporate inpainted results into the supervision set. Two types of trajectories are sampled: interpolation between adjacent input views, and spiral/spherical paths across all camera poses. For large unobserved regions, new Gaussian points are added automatically wherever the depth is detected as unreliable (accumulated opacity \(T < \tau_T\) or depth difference \(|D-\hat{D}| > \tau_D\)). The generation loss weight follows a sinusoidal annealing schedule \(\lambda(k) = \sin(\frac{k-K_{start}}{K_{end}-K_{start}} \cdot \pi)\), which is zero at \(K_{start}\) and \(K_{end}\) and peaks midway between them.
  - Design Motivation: In a one-shot global optimization, the generative prior can easily overwhelm the reconstruction constraints, or vice versa. Cyclic iteration lets the two reinforce each other progressively, forming a positive feedback loop. Sinusoidal annealing prevents the generation loss from dominating too early or too late, and sampling new trajectories each round ensures broad coverage of viewpoints.
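The annealing schedule and the unreliable-depth test from the cyclic fusion design can be written out directly. This is a minimal sketch: returning zero outside \([K_{start}, K_{end}]\) is an assumption (the formula alone does not say what happens there), and the function and argument names are hypothetical.

```python
import math


def gen_loss_weight(k, k_start, k_end):
    """Sinusoidal weight lambda(k) = sin(pi * (k - k_start) / (k_end - k_start)).

    Assumption: the weight is clamped to 0 outside [k_start, k_end].
    """
    if k < k_start or k > k_end:
        return 0.0
    progress = (k - k_start) / (k_end - k_start)
    return math.sin(progress * math.pi)


def unreliable_pixels(accumulated_opacity, rendered_depth, reference_depth, tau_t, tau_d):
    """Flag pixels where T < tau_T or |D - D_hat| > tau_D.

    Inputs are flat per-pixel lists; which depth map plays D and which D_hat
    does not matter here because only the absolute difference is used.
    """
    return [
        (t < tau_t) or (abs(d - d_hat) > tau_d)
        for t, d, d_hat in zip(accumulated_opacity, rendered_depth, reference_depth)
    ]
```

The weight rises from zero, peaks halfway through the window, and returns to (numerically almost) zero at the end, so generated views influence the optimization mainly in the middle of training.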
Loss & Training
The total loss is \(\mathcal{L} = \mathcal{L}_{recon} + \lambda \mathcal{L}_{gen}\), where \(\mathcal{L}_{recon} = \lambda_{l_1} \mathcal{L}_{l_1} + \lambda_{SSIM} \mathcal{L}_{SSIM} + \lambda_{mono} \mathcal{L}_{mono}\). \(\mathcal{L}_{mono}\) is a scale-invariant depth loss ensuring that the rendered depth is consistent with the depth predicted by the diffusion model. Inference uses DDIM with 25 sampling steps and a CFG scale of 3.2.
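The loss composition can be sketched as follows. The exact form of \(\mathcal{L}_{mono}\) is not given here, so the sketch assumes the log-space scale-invariant error of Eigen et al. (variance of the log-depth differences) as one possible instance; the function names and weight arguments are hypothetical.

```python
import math


def scale_invariant_depth_loss(rendered, target):
    """Log-space scale-invariant error: mean(d^2) - mean(d)^2, d = log(rendered) - log(target).

    Assumed form, not necessarily the one used by GenFusion. Depths must be positive.
    Multiplying either depth map by a constant leaves the value unchanged.
    """
    diffs = [math.log(r) - math.log(t) for r, t in zip(rendered, target)]
    n = len(diffs)
    mean = sum(diffs) / n
    return sum(d * d for d in diffs) / n - mean * mean


def total_loss(l1, ssim, mono, gen, w_l1, w_ssim, w_mono, w_gen):
    """L = (w_l1 * L_l1 + w_ssim * L_SSIM + w_mono * L_mono) + w_gen * L_gen."""
    recon = w_l1 * l1 + w_ssim * ssim + w_mono * mono
    return recon + w_gen * gen
```

Here `w_gen` plays the role of \(\lambda\), which varies over training according to the annealing schedule.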
Key Experimental Results
Main Results (Mip-NeRF360 Sparse-view)
| Method | 3-view PSNR | 6-view PSNR | 9-view PSNR | Average PSNR |
|---|---|---|---|---|
| GenFusion | 15.29 | 17.16 | 18.36 | 16.93 |
| ReconFusion | 15.50 | 16.93 | 18.19 | 16.87 |
| 3DGS | 13.06 | 14.96 | 16.79 | 14.94 |
| 2DGS | 13.07 | 15.02 | 16.67 | 14.92 |
| FSGS | 14.17 | 16.12 | 17.94 | 16.08 |
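To read the table more easily, the per-setting PSNR gaps can be computed from the reported numbers. The snippet below only restates values from the table above; the helper name `psnr_gain` is hypothetical.

```python
# PSNR (dB) at 3, 6, and 9 input views, copied from the table above.
PSNR = {
    "GenFusion": (15.29, 17.16, 18.36),
    "ReconFusion": (15.50, 16.93, 18.19),
    "3DGS": (13.06, 14.96, 16.79),
    "2DGS": (13.07, 15.02, 16.67),
    "FSGS": (14.17, 16.12, 17.94),
}


def psnr_gain(method, baseline):
    """Per-setting PSNR difference (method minus baseline), rounded to 2 decimals."""
    return tuple(round(m - b, 2) for m, b in zip(PSNR[method], PSNR[baseline]))
```

Calling `psnr_gain("GenFusion", "2DGS")` gives the improvement over the underlying 2DGS representation at each view count, and `psnr_gain("GenFusion", "ReconFusion")` shows that the comparison with ReconFusion changes sign between the 3-view and 6-view settings.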
Ablation Study (Video Diffusion Model Design)
| Configuration | FID ↓ | Description |
|---|---|---|
| RGB VAE, \(512 \times 320\) | 26.16 | Baseline |
| RGB-D VAE, \(512 \times 320\) | 25.40 | Replaced with RGB-D VAE, FID actually improves |
| RGB-D VAE, 48 frames | 29.35 | Increasing frame count degrades quality |
| RGB-D VAE, \(960 \times 512\) | 22.55 | Increasing resolution improves results significantly |
Key Findings
- GenFusion demonstrates for the first time that Gaussian Splatting can perform on par with state-of-the-art NeRF methods (ReconFusion) in sparse-view settings: its average PSNR is 16.93 vs 16.87, ahead at 6 and 9 views but behind at 3 views (15.29 vs 15.50). Previously, GS lagged far behind NeRF under such conditions.
- Replacing the RGB VAE with the RGB-D VAE improves FID instead of reducing quality, indicating that depth information has a positive impact on video consistency.
- Increasing the resolution from \(512 \times 320\) to \(960 \times 512\) leads to significant FID improvement (\(25.40 \rightarrow 22.55\)).
- GenFusion shows a clear advantage over 3DGS/2DGS/FSGS in out-of-view rendering on the DL3DV and TnT datasets, validating the effectiveness of the cyclic fusion pipeline.
Highlights & Insights
- Generating training data via masked reconstruction is a clever idea: simply masking 75% of the pixels simulates out-of-view artifacts, so massive numbers of training pairs can be produced from large-scale video datasets without any extra annotation. The concept generalizes to any scenario that needs paired degraded and ground-truth training data.
- Cyclic fusion creates a positive feedback loop: reconstruction and generation are not mutually exclusive but can reinforce each other, an inspiring perspective for the 3D vision field.
- Using video as a bridge between reconstruction and generation: Instead of directly connecting the two in 3D space, information is passed through video rendering as a natural intermediate representation, preserving the modularity and simplicity of the method.
Limitations & Future Work
- The inference speed of the video diffusion model is slow, and cyclic fusion requires multiple rounds of diffusion sampling, resulting in a high overall time cost.
- The video length limit of 16 frames constrains the view coverage of a single generation.
- When scenes are completely unobserved (e.g., the backside of indoor scenes), the 3D consistency of the generated content cannot be guaranteed.
- The method has only been validated on objects and indoor/outdoor scenes, leaving dynamic scenes unexplored.
- The choice of \(K_{start}\) and \(K_{end}\) in the sinusoidal annealing strategy may need to be adjusted for different scenes.
Related Work & Insights
- vs ReconFusion: Both use generative priors to guide 3D reconstruction, but ReconFusion regularizes a NeRF with an image diffusion prior, whereas GenFusion supervises a GS representation with a direct photometric loss on video diffusion outputs, which is more efficient and stable. GenFusion also supports scene expansion rather than just view interpolation.
- vs ViewCrafter: ViewCrafter also utilizes 3D information (point clouds) for video generation but does not involve cyclic fusion and progressive scene expansion.
- vs Feed-Forward Reconstruction Methods (pixelSplat/MVSplat/DepthSplat): Feed-forward methods are limited to a small number of input views (<10), while GenFusion's cyclic fusion can theoretically scale to an arbitrary number of views.
Rating
- Novelty: ⭐⭐⭐⭐ The ideas of masked reconstruction and cyclic fusion are novel and practical, but the overall method is assembled from several existing components.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated across three datasets and multiple settings (3/6/9 views, masked inputs), but lacking some common baselines and execution time comparisons.
- Writing Quality: ⭐⭐⭐⭐ Clear narrative logic and well-described methods, though some details are slightly redundant.
- Value: ⭐⭐⭐⭐ Demonstrates for the first time that GS can rival state-of-the-art NeRF under sparse views; the cyclic fusion approach has broad applicability.