Diff4Splat: Repurposing Video Diffusion Models for Dynamic Scene Generation¶
Conference: CVPR 2026 arXiv: 2511.00503 Code: Project Page Area: Video Generation Keywords: 4D Generation, 3D Gaussian Splatting, Video Diffusion Models, Deformable Gaussian Fields, Feed-Forward Generation
TL;DR¶
This paper proposes Diff4Splat, a feed-forward framework that unifies video diffusion models with deformable 3D Gaussian fields in an end-to-end trainable model, enabling direct generation of dynamic 4D scene representations from a single image in approximately 30 seconds, roughly 60× faster than optimization-based methods.
Background & Motivation¶
Dynamic 3D scene generation (4D generation) is a core challenge in computer vision, with broad applications in immersive content creation, robotics, and simulation. Existing approaches face fundamental trade-offs:
Multi-stage pipeline methods: These first generate video via video diffusion models and then perform 3D reconstruction. They are slow, error-prone, and lack end-to-end control. For instance, DimensionX requires several GPU hours, and Mosca takes around half an hour.
Feed-forward generation methods: While efficient, most are limited to generating 2D video frames or static 3D scenes, and fail to capture explicit dynamic 3D geometry.
Core gap: No existing framework can directly and efficiently synthesize an explicit, controllable dynamic 3D scene representation in a single feed-forward pass.
Diff4Splat is motivated by bridging this gap: it unifies generation and representation within a single forward pass, achieving both feed-forward efficiency and an explicit 3D representation.
Method¶
Overall Architecture¶
Given a single input image \(\mathbf{I}_0\), a camera trajectory \(\mathcal{P}\) (in Plücker coordinates), and an optional text prompt \(\mathbf{C}_{ctx}\), Diff4Splat predicts a deformable 3D Gaussian field in a single forward pass. The framework consists of four core components: (1) a video diffusion model that generates 3D-aware latent tensors; (2) a Latent Dynamic Reconstruction Model (LDRM) that transforms latent features into Gaussian parameters; (3) a deformable Gaussian field to represent dynamics; and (4) unified multi-task supervision.
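The camera trajectory \(\mathcal{P}\) is supplied as per-pixel Plücker coordinates. Below is a minimal sketch of how such an embedding is typically computed for a pinhole camera; the function name and the exact conditioning scheme are assumptions of this sketch, not the paper's implementation.

```python
import torch

def plucker_rays(K, c2w, H, W):
    """Per-pixel Plücker embedding (d, o x d) for one camera pose.

    K   : (3, 3) pinhole intrinsics
    c2w : (4, 4) camera-to-world pose
    Returns a (H, W, 6) tensor, one 6-D ray per pixel.
    """
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32) + 0.5,
        torch.arange(W, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    # Back-project pixel centers to camera-space ray directions.
    dirs_cam = torch.stack(
        [(xs - K[0, 2]) / K[0, 0], (ys - K[1, 2]) / K[1, 1], torch.ones_like(xs)],
        dim=-1,
    )
    # Rotate into world space and normalize.
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)
    moment = torch.cross(origin, dirs, dim=-1)  # o x d
    return torch.cat([dirs, moment], dim=-1)    # (H, W, 6)
```

Stacking these maps over the frames of the trajectory yields the camera pose tokens that are processed alongside the video latents.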
Key Designs¶
- Large-scale 4D data pipeline: Seven synthetic datasets (TartanAir, MatrixCity, PointOdyssey, etc.) and two real-world datasets (RealEstate10K, Stereo4D) are integrated, yielding approximately 130,000 high-quality 4D training scenes. For the real-world datasets, VideoDepthAnything and MegaSaM are used to recover metric-scale depth, with least-squares alignment applied to map relative depth to metric depth (a small alignment sketch follows this list). This addresses the lack of metric-scale annotations in real-world data.
- Latent Dynamic Reconstruction Model (LDRM): Built upon a pretrained video diffusion model (CogVideoX) that produces 3D-aware latent tensors \(\mathbf{z} \in \mathbb{R}^{n \times h \times w \times c}\), the LDRM consists of 16 standard Transformer blocks that process concatenated latent features and camera pose tokens, followed by a lightweight decoder that regresses 3D Gaussian attributes (a module sketch follows this list). The core design motivation is to avoid per-scene optimization by leveraging the generative prior of the diffusion model to directly predict 3D structure.
- Deformable Gaussian fields: Building on static 3DGS, an inter-frame deformation model is introduced. For each Gaussian at timestep \(t\), the model predicts a displacement \(\Delta\boldsymbol{\mu}_p^t\), a rotation adjustment \(\Delta\mathbf{q}_p^t\), and a scale modification \(\Delta\mathbf{s}_p^t\), giving a deformation dimensionality of \(K_d = 10\). The LDRM jointly outputs a Gaussian feature map and a deformation map, and Gaussians are pruned by an opacity threshold (\(\tau = 0.005\)) during both training and inference (a deformation sketch follows this list).
- Progressive training strategy:
- Stage 1 (40K iterations): Pretrains the LDRM on static scenes at low resolution (256×256) with the deformation module frozen, using only photometric and geometric losses.
- Stage 2 (40K iterations): Still with the deformation module frozen, refines reconstruction fidelity at high resolution (512×512).
- Stage 3 (20K iterations): Unfreezes the full model and fine-tunes on dynamic datasets using the complete loss, including the motion loss.
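For the data-pipeline item above, here is a minimal sketch of a least-squares scale-and-shift alignment that maps relative depth to metric depth. The function name and the exact formulation are assumptions; the paper may solve the alignment differently.

```python
import torch

def align_depth_least_squares(rel_depth, metric_depth, valid_mask):
    """Fit scale s and shift b minimizing ||s * rel + b - metric||^2 over valid pixels,
    then apply them to the full relative-depth map."""
    x = rel_depth[valid_mask].flatten()
    y = metric_depth[valid_mask].flatten()
    A = torch.stack([x, torch.ones_like(x)], dim=-1)       # (N, 2) design matrix
    sol = torch.linalg.lstsq(A, y.unsqueeze(-1)).solution  # (2, 1) -> [s, b]
    s, b = sol[0, 0], sol[1, 0]
    return s * rel_depth + b
```

In this sketch, `rel_depth` would come from a monocular estimator such as VideoDepthAnything and `metric_depth` from the recovered metric-scale geometry; the exact pairing of sources is an assumption.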
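For the LDRM item, the toy module below mirrors the stated structure: 16 Transformer blocks over latent and camera-pose tokens, followed by a lightweight regression head. The token dimension, the 14 Gaussian channels (position, rotation, scale, opacity, color), and the channel-concatenation scheme are assumptions of this sketch, not the paper's exact design.

```python
import torch
import torch.nn as nn

class LDRMSketch(nn.Module):
    """Illustrative stand-in for the Latent Dynamic Reconstruction Model."""

    def __init__(self, dim=1024, depth=16, heads=16,
                 gauss_channels=14, deform_channels=10):
        super().__init__()
        # Fuse video-latent tokens with 6-D Plücker pose rays by channel concatenation.
        self.in_proj = nn.Linear(dim + 6, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # Lightweight decoder: per-token Gaussian attributes + K_d = 10 deformation channels.
        self.head = nn.Linear(dim, gauss_channels + deform_channels)

    def forward(self, latent_tokens, plucker_tokens):
        # latent_tokens : (B, n*h*w, dim)  flattened 3D-aware latents z
        # plucker_tokens: (B, n*h*w, 6)    per-pixel camera rays
        x = self.in_proj(torch.cat([latent_tokens, plucker_tokens], dim=-1))
        x = self.blocks(x)
        return self.head(x)  # (B, n*h*w, gauss_channels + deform_channels)
```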
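For the deformable-Gaussian item, the sketch below shows how the predicted per-timestep deltas (3 + 4 + 3 = \(K_d = 10\) channels) could be applied and how opacity pruning at \(\tau = 0.005\) works. Composing the rotation delta multiplicatively and adding the scale delta in log space are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def quat_multiply(q, r):
    """Hamilton product of two (..., 4) quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = q.unbind(-1)
    w2, x2, y2, z2 = r.unbind(-1)
    return torch.stack([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ], dim=-1)

def deform_and_prune(mu, quat, log_scale, opacity, deform, tau=0.005):
    """Apply one timestep of the K_d = 10 deformation and drop near-transparent Gaussians.

    mu        : (N, 3) positions        quat   : (N, 4) unit quaternions
    log_scale : (N, 3) log-scales       opacity: (N, 1) values in [0, 1]
    deform    : (N, 10) per-timestep deltas (delta mu, delta q, delta s)
    """
    d_mu, d_q, d_s = deform.split([3, 4, 3], dim=-1)
    mu_t = mu + d_mu                                                      # displacement
    quat_t = F.normalize(quat_multiply(F.normalize(d_q, dim=-1), quat), dim=-1)
    scale_t = log_scale + d_s                                             # scale modification
    keep = opacity.squeeze(-1) > tau                                      # opacity-threshold pruning
    return mu_t[keep], quat_t[keep], scale_t[keep], opacity[keep]
```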
Loss & Training¶
The total loss is a weighted sum of four terms (sketched in code after the training details below):
- Flow Matching loss \(\mathcal{L}_{FM}\): Applied only to the video diffusion model parameters, fine-tuning them on the 4D-annotated data to align the latent space with the reconstruction task.
- Photometric loss \(\mathcal{L}_{photo}\): MSE + LPIPS (\(\lambda_p=0.5\)), optimizing appearance consistency between rendered and ground-truth images.
- Geometric loss \(\mathcal{L}_{geo}\): Pearson correlation loss on depth + total variation smoothness loss, with weight \(\lambda_{geo}=0.5\).
- Motion loss \(\mathcal{L}_{motion}\): Based on 3D point tracking data (directly available for synthetic data; obtained via CoTracker for real data), combining L2 loss with L1 regularization, weight \(\lambda_{motion}=2.0\).
Training uses AdamW with a learning rate of \(10^{-5}\) and runs on 32 A100 GPUs for approximately 7 days.
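A rough sketch of how the supervision terms above might be combined is given below. The flow-matching loss on the diffusion backbone is omitted, `lpips_fn` stands in for an LPIPS network (e.g. the `lpips` package), and exactly which quantity the L1 term regularizes is an assumption here.

```python
import torch
import torch.nn.functional as F

LAMBDA_P, LAMBDA_GEO, LAMBDA_MOTION = 0.5, 0.5, 2.0  # weights reported in the summary

def pearson_depth_loss(pred, gt):
    """1 - Pearson correlation between predicted and reference depth, per batch element."""
    p = pred.flatten(1)
    g = gt.flatten(1)
    p = p - p.mean(dim=1, keepdim=True)
    g = g - g.mean(dim=1, keepdim=True)
    corr = (p * g).sum(dim=1) / (p.norm(dim=1) * g.norm(dim=1) + 1e-8)
    return (1.0 - corr).mean()

def total_variation(depth):
    """Smoothness regularizer on the rendered depth."""
    dx = (depth[..., :, 1:] - depth[..., :, :-1]).abs().mean()
    dy = (depth[..., 1:, :] - depth[..., :-1, :]).abs().mean()
    return dx + dy

def reconstruction_loss(render, gt_img, depth, gt_depth,
                        tracks, gt_tracks, deform_offsets, lpips_fn):
    photo = F.mse_loss(render, gt_img) + LAMBDA_P * lpips_fn(render, gt_img).mean()
    geo = pearson_depth_loss(depth, gt_depth) + total_variation(depth)
    motion = F.mse_loss(tracks, gt_tracks) + deform_offsets.abs().mean()  # L2 + L1 reg.
    return photo + LAMBDA_GEO * geo + LAMBDA_MOTION * motion
```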
Key Experimental Results¶
Main Results¶
Video quality and runtime:

| Method | FVD↓ | KVD↓ | CLIP-Score↑ | Reconstruction Time↓ |
|---|---|---|---|---|
| CameraCtrl | 478.2 | 8.11 | 19.37 | 20s |
| AC3D | 339.4 | 6.34 | 20.67 | 28s |
| AC3D + Mosca† | 236.0 | 2.01 | 20.21 | 45min |
| Diff4Splat | 210.2 | 2.32 | 23.12 | 30s |
Geometric and temporal consistency:

| Method | Avg Matches↑ | Subj. Consist.↑ | Bg. Consist.↑ | Time↓ |
|---|---|---|---|---|
| AC3D + Mosca† | 4500.7 | 86.23 | 90.43 | 45min |
| Diff4Splat | 5114.2 | 88.32 | 89.89 | 30s |
Camera controllability and supported capabilities:

| Method | RPE (Translation)↓ | RPE (Rotation)↓ | NVS | Depth | Real-time Interaction |
|---|---|---|---|---|---|
| AC3D | 3.001 | 0.810 | ✓ | ✗ | ✗ |
| Diff4Splat | 0.012 | 0.008 | ✓ | ✓ | ✓ |
Ablation Study¶
| Configuration | FVD↓ | KVD↓ | Avg Matches↑ | Note |
|---|---|---|---|---|
| w/o motion loss | 351.4 | 3.35 | 4821.6 | Significant performance drop without motion loss |
| Full model | 210.2 | 2.32 | 5114.2 | Full model achieves best performance |
Key Findings¶
- Feed-forward generation requires only 30 seconds, approximately 90× faster than optimization-based methods (AC3D + Mosca: 45 minutes).
- The proposed method outperforms optimization-based approaches on both video quality (FVD) and geometric consistency (Avg Matches).
- The explicit 3DGS representation reduces camera pose error by a factor of roughly 250 (RPE translation: 3.001 → 0.012).
- The deformable Gaussian field is critical for eliminating ghosting artifacts in dynamic scenes.
- The progressive training strategy saves 3× training time compared to direct dynamic training, while yielding superior results.
Highlights & Insights¶
- Paradigm innovation: The first work to unify video diffusion models with deformable 3DGS in a feed-forward framework, entirely eliminating per-scene optimization.
- Efficiency leap: 30 seconds vs. 45 minutes, bringing dynamic 3D scene generation to practical usability for the first time.
- Data pipeline: A large-scale 4D dataset of 130,000 scenes with metric-scale annotations is constructed, with plans for open-sourcing.
- Versatility: A single model simultaneously supports video generation, novel view synthesis, depth extraction, and real-time interaction.
Limitations & Future Work¶
- Training costs remain high (32× A100, 7 days), making rapid iteration difficult.
- The design relies on CogVideoX's latent space; scalability to higher resolutions or longer sequences remains to be validated.
- Metric depth for real-world data depends on the accuracy of VideoDepthAnything and MegaSaM, introducing the risk of error propagation.
- The current framework only supports generation from a single image; extension to multi-view input conditioning is worth exploring.
- Motion is represented as simple displacement + rotation + scale deformation, which may be insufficient for topological changes (e.g., object appearance or disappearance).
Related Work & Insights¶
- CogVideoX serves as the video diffusion backbone, demonstrating the potential of video generation priors for 3D understanding.
- The combination of 3DGS and deformation fields provides high-quality, real-time rendering for dynamic scenes.
- The progressive training strategy (static → high-resolution → dynamic) represents effective engineering practice for complex tasks.
- The framework is extensible to downstream applications such as robotics simulation and VR/AR content creation.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First feed-forward 4D generation framework unifying diffusion models with deformable 3DGS.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-dimensional evaluation is comprehensive, though comparisons with more feed-forward 4D methods are lacking.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured with detailed method descriptions.
- Value: ⭐⭐⭐⭐⭐ — Substantial efficiency gains with strong practical utility.