Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis¶
Conference: ICCV 2025 arXiv: 2507.23785 Code: GVFDiffusion.github.io Area: 3D Vision Keywords: 4D generation, video-to-4D, Gaussian variation field, diffusion model, 3D Gaussian Splatting
TL;DR¶
This paper proposes a video-to-4D generation framework that encodes animation data directly into a compact Gaussian variation field latent space via a Direct 4DMesh-to-GS Variation Field VAE, and trains a temporally-aware diffusion model to generate dynamic 3D content. The framework achieves high-fidelity 4D synthesis in 4.5 seconds and demonstrates strong generalization to real-world video inputs.
Background & Motivation¶
4D generation—creating dynamic 3D content—represents the next frontier following image, video, and 3D generation. Real-world phenomena inherently combine spatial and temporal dynamics, yet training robust 4D diffusion models faces two major technical challenges:
High cost of large-scale 4D dataset construction: Direct approaches require fitting an independent dynamic Gaussian splatting (4DGS) representation to each 3D animation sequence, typically taking minutes to tens of minutes per instance (≈6 minutes for 4DGaussians, 30+ minutes for K-planes), making them computationally expensive and difficult to scale.
Difficulty of directly modeling high-dimensional representations: Simultaneously representing 3D shape, appearance, and motion typically requires over 100K tokens, making direct diffusion modeling extremely challenging.
Limitations of prior work:
- Optimization-based methods (Consistent4D, STAG4D, etc.) rely on SDS distillation, requiring over an hour per instance and suffering from spatiotemporal inconsistencies.
- Feed-forward methods (L4GM) reconstruct 4DGS from 2D-generated multi-view images, but multi-view inconsistencies in the 2D generation degrade quality.
- Native 4D diffusion models are absent; existing approaches only indirectly leverage 2D/3D priors.
Mechanism: The paper decomposes 4D generation into canonical 3DGS generation (leveraging existing 3D models) and Gaussian Variation Field modeling. By directly encoding 3D animation data, per-instance fitting is bypassed, and high-dimensional motion information is compressed into a compact latent space.
Method¶
Overall Architecture¶
Given an input video \(\mathcal{I} = \{I_t\}_{t=1}^T\), the goal is to generate a 3DGS sequence \(\mathcal{G} = \{G_t\}_{t=1}^T\). This is decomposed into:
- Canonical GS \(G_1\): a static 3DGS generated from the first frame using a pretrained 3D model.
- Gaussian Variation Field \(\mathcal{V} = \{\Delta G_t\}_{t=1}^T\): temporal variations of each Gaussian attribute relative to \(G_1\).
The framework comprises two main components: (1) the Direct 4DMesh-to-GS Variation Field VAE, and (2) the Gaussian Variation Field diffusion model.
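The decomposition amounts to applying per-frame attribute offsets to the canonical Gaussians. A minimal NumPy sketch with toy dict-of-arrays Gaussians (all names and shapes here are illustrative, not the paper's actual data structures):

```python
import numpy as np

def compose_sequence(G1, deltas):
    """Apply per-frame variation fields to the canonical Gaussians.

    G1:     dict of canonical attributes, e.g. {"pos": (N,3), "opacity": (N,1)}.
    deltas: list of T dicts with the same keys holding per-frame offsets ΔG_t.
    Returns the dynamic sequence {G_t = G_1 + ΔG_t}.
    """
    return [{k: G1[k] + dG[k] for k in G1} for dG in deltas]

# Toy example: 4 Gaussians, 3 frames of pure translation.
G1 = {"pos": np.zeros((4, 3)), "opacity": np.ones((4, 1))}
deltas = [{"pos": np.full((4, 3), t * 0.1), "opacity": np.zeros((4, 1))}
          for t in range(3)]
seq = compose_sequence(G1, deltas)  # frame t positions shifted by 0.1 * t
```

Because only the offsets are modeled, the diffusion stage never has to regenerate shape and appearance that the canonical GS already carries.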
Key Designs¶
- Direct 4DMesh-to-GS Variation Field VAE:
Encoding:
- Converts mesh animation sequences into point clouds \(\mathcal{P} = \{P_t \in \mathbb{R}^{N \times 3}\}_{t=1}^T\) (\(N = 8192\)).
- Computes displacement fields \(\Delta P_t = P_t - P_1\).
- Obtains the canonical GS via a pretrained Mesh-to-GS encoder: \(G_1 = \mathcal{D}_{GS}(\mathcal{E}_{GS}(M_1))\).
Mesh-guided Interpolation (key innovation): For each canonical Gaussian position \(\bm{p}_1^i\), K nearest neighbors are identified and the displacement field is interpolated using adaptive radius-based weighting:
\(w_{i,k} = \exp\left(-\frac{\beta\, d_{i,k}}{r_i^2}\right), \quad r_i = \sqrt{\frac{1}{K}\sum_{k=1}^K d_{i,k}}\)
\(\Delta \bm{p}_{t,i}^{interp} = \sum_{k=1}^K \frac{\bm{w}_{i,k}}{\sum_k \bm{w}_{i,k}} \Delta P_{t,n(i,k)}\)
Farthest point sampling (FPS) is then applied to the interpolated displacements to obtain motion-aware queries \(\Delta \bm{p}_t^{fps} \in \mathbb{R}^{L \times 3}\), which are encoded into a latent representation \(\bm{z} \in \mathbb{R}^{T \times L \times C}\) (\(L = 512\), \(C = 16\)) via cross-attention, compressing the sequence length from \(N = 8192\) to \(L = 512\).
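The mesh-guided interpolation above can be sketched in NumPy as follows. This is a brute-force KNN version for illustration; it assumes \(d_{i,k}\) denotes the squared neighbor distance (the paper's exact distance convention, \(K\), and \(\beta\) values may differ), and adds a small epsilon as a numerical guard:

```python
import numpy as np

def mesh_guided_interp(p1, P1, dP_t, K=8, beta=1.0):
    """Interpolate mesh displacements onto canonical Gaussian centers.

    p1:   (M,3) canonical Gaussian positions.
    P1:   (N,3) first-frame mesh point cloud.
    dP_t: (N,3) displacement field P_t - P_1.
    Uses adaptive-radius weighting w = exp(-beta * d / r^2) with
    r^2 = mean neighbor distance, then normalizes the weights.
    """
    # Pairwise squared distances from each Gaussian to every mesh point.
    d2 = ((p1[:, None, :] - P1[None, :, :]) ** 2).sum(-1)   # (M, N)
    nn = np.argsort(d2, axis=1)[:, :K]                      # K nearest indices
    d_nn = np.take_along_axis(d2, nn, axis=1)               # (M, K)
    r2 = d_nn.mean(axis=1, keepdims=True) + 1e-8            # adaptive radius^2
    w = np.exp(-beta * d_nn / r2)                           # per-neighbor weights
    w /= w.sum(axis=1, keepdims=True)                       # normalize
    return (w[..., None] * dP_t[nn]).sum(axis=1)            # (M, 3)

# Demo: a constant mesh displacement is reproduced exactly (weights sum to 1).
rng = np.random.default_rng(0)
delta = mesh_guided_interp(rng.standard_normal((32, 3)),
                           rng.standard_normal((256, 3)),
                           np.full((256, 3), 0.5))
```

The normalized weighting makes the scheme exact for locally constant motion while the adaptive radius keeps the kernel scale-invariant across sparse and dense surface regions.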
Decoding: The latent representation is processed via self-attention, then decoded into variation fields \(\Delta G_t = \{\Delta \bm{p}_t, \Delta \bm{s}_t, \Delta \bm{q}_t, \Delta \bm{c}_t, \Delta \alpha_t\}\) using all canonical GS parameters as queries through cross-attention.
- Gaussian Variation Field Diffusion Model:
Built on a Diffusion Transformer (DiT) architecture, with the following core innovations:
- Temporal self-attention layers: added alongside standard spatial self-attention to capture inter-frame motion coherence.
- Dual conditioning injection: visual features \(\mathcal{C}^v\) (extracted via DINOv2) and canonical GS geometric features \(\mathcal{C}^{GS}\) are injected via cross-attention.
- Positional prior embedding: positional encodings based on canonical GS positions \(\bm{p}_1^{fps}\) strengthen the model's awareness of spatial position-to-variation-field correspondences.
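The spatial/temporal factorization boils down to which axis of the \((T, L, C)\) token grid each attention layer mixes over. A minimal single-head NumPy sketch (identity Q/K/V projections for brevity; a real DiT block adds learned projections, AdaLN, and MLP sublayers):

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention over axis 1 of (B, S, C)."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])  # (B, S, S)
    scores -= scores.max(axis=-1, keepdims=True)              # softmax stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x

T, L, C = 4, 6, 8                      # toy sizes; the paper uses T=24, L=512, C=16
z = np.random.default_rng(0).standard_normal((T, L, C))

# Spatial self-attention: the L tokens within each frame attend to each other.
z = z + self_attention(z)                           # batch axis = frames
# Temporal self-attention: each token index attends across the T frames.
zt = z.transpose(1, 0, 2)                           # (L, T, C)
z = (zt + self_attention(zt)).transpose(1, 0, 2)    # back to (T, L, C)
```

Factorizing attention this way keeps the cost at \(O(TL^2 + LT^2)\) instead of \(O((TL)^2)\) for full spatiotemporal attention.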
Velocity prediction parameterization is adopted, with training objective:
\(\mathcal{L}_{simple} = \mathbb{E}_{s, \bm{z}^0, \bm{\epsilon}} \left[ \|\hat{\bm{v}}_\theta(\alpha_s \bm{z}^0 + \sigma_s \bm{\epsilon}, s, \mathcal{C}) - \bm{v}^s\|_2^2 \right]\)
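The objective above can be sketched as follows, using the standard velocity target \(\bm{v}^s = \alpha_s \bm{\epsilon} - \sigma_s \bm{z}^0\) and one common continuous-time cosine schedule convention (the paper's exact 1,000-step discretization may differ):

```python
import numpy as np

def cosine_alpha_sigma(s):
    """Cosine schedule for s in [0, 1]; satisfies alpha^2 + sigma^2 = 1."""
    return np.cos(0.5 * np.pi * s), np.sin(0.5 * np.pi * s)

def v_prediction_loss(model, z0, s, cond, rng):
    """L_simple: MSE between the predicted and true velocity."""
    alpha, sigma = cosine_alpha_sigma(s)
    eps = rng.standard_normal(z0.shape)
    z_s = alpha * z0 + sigma * eps       # noised latent alpha_s z0 + sigma_s eps
    v_target = alpha * eps - sigma * z0  # velocity target v^s
    return ((model(z_s, s, cond) - v_target) ** 2).mean()

# Toy run with a dummy zero-output model on a (T, L, C)-shaped latent.
rng = np.random.default_rng(0)
loss = v_prediction_loss(lambda z, s, c: np.zeros_like(z),
                         rng.standard_normal((24, 512, 16)), 0.3, None, rng)
```

Velocity prediction keeps the target well-conditioned at both ends of the noise schedule, which is why it is often preferred over plain epsilon prediction.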
- Mesh-guided Loss:
Aligns predicted Gaussian displacements with pseudo-GT displacements from mesh interpolation, serving as a critical supervision signal for motion reconstruction:
\(\mathcal{L}_{mg} = \sum_{t=1}^T \|\Delta \mathbf{p}_t - \Delta \bm{p}_t^{interp}\|_2^2\)
Loss & Training¶
VAE training: Two-stage pipeline—first fine-tuning the canonical GS decoder for 150K iterations, then jointly training all modules for 200K iterations. Total loss: \(\mathcal{L}_{total} = \mathcal{L}_{img} + \lambda_{mg}\mathcal{L}_{mg} + \lambda_{kl}\mathcal{L}_{kl}\), where \(\mathcal{L}_{img}\) combines L1 + LPIPS + SSIM.
Diffusion model training: Trained on 24-frame sequences for 1,300K iterations using a cosine noise schedule with 1,000 timesteps. Inference supports 32-frame generation.
Key Experimental Results¶
Main Results (Video-to-4D Generation)¶
| Method | PSNR↑ | LPIPS↓ | SSIM↑ | CLIP↑ | FVD↓ | Time↓ |
|---|---|---|---|---|---|---|
| Consistent4D | 16.20 | 0.146 | 0.880 | 0.910 | 935.19 | ~1.5hr |
| SC4D | 15.93 | 0.164 | 0.872 | 0.870 | 833.15 | ~20min |
| STAG4D | 16.85 | 0.144 | 0.887 | 0.893 | 1008.40 | ~1hr |
| DreamGaussian4D | 15.24 | 0.162 | 0.868 | 0.904 | 799.56 | ~15min |
| L4GM | 17.03 | 0.128 | 0.891 | 0.930 | 529.10 | 3.5s |
| Ours | 18.47 | 0.114 | 0.901 | 0.935 | 476.83 | 4.5s |
The proposed method achieves the best performance on all quality metrics: PSNR improves by 1.44 dB over L4GM, FVD decreases by 9.9% (476.83 vs. 529.10), and the total generation time is only 4.5 seconds (3.0s for canonical GS + 1.5s for variation field diffusion).
Ablation Study¶
VAE component ablation:
| Config | Encoder Query | Mesh-guided Loss | Variation Attrs | PSNR↑ | LPIPS↓ | SSIM↑ |
|---|---|---|---|---|---|---|
| A. Baseline | \(\bm{p}_t^{fps}\) | ✗ | \(\Delta\bm{p},\Delta\bm{s},\Delta\bm{q}\) | 23.25 | 0.0678 | 0.936 |
| B. +mesh loss | \(\bm{p}_t^{fps}\) | ✓ | \(\Delta\bm{p},\Delta\bm{s},\Delta\bm{q}\) | 26.17 | 0.0544 | 0.950 |
| C. +motion query | \(\Delta\bm{p}_t^{fps}\) | ✓ | \(\Delta\bm{p},\Delta\bm{s},\Delta\bm{q}\) | 28.58 | 0.0478 | 0.958 |
| D. Full (Ours) | \(\Delta\bm{p}_t^{fps}\) | ✓ | All 5 attrs | 29.28 | 0.0439 | 0.964 |
Diffusion model ablation:
| Method | PSNR↑ | LPIPS↓ | SSIM↑ | CLIP↑ | FVD↓ |
|---|---|---|---|---|---|
| w/o positional embedding | 17.86 | 0.121 | 0.897 | 0.931 | 547.20 |
| Full model | 18.47 | 0.114 | 0.901 | 0.935 | 476.83 |
Key Findings¶
- Mesh-guided loss is critical for motion learning: Config A→B yields a 2.92 dB PSNR gain, addressing the core challenge of lacking GT Gaussian motion supervision.
- Motion-aware queries substantially outperform static positional queries: Config B→C yields an additional 2.41 dB PSNR gain.
- Color and opacity variations also contribute meaningfully: Adding \(\Delta\bm{c}_t\) and \(\Delta\alpha_t\) provides a further 0.70 dB improvement.
- Positional prior embedding is essential for the diffusion model: It strengthens the model's awareness of spatial position-to-variation-field correspondences.
Highlights & Insights¶
- Bypassing per-instance fitting: By encoding Gaussian variation fields directly from mesh animations in a single forward pass, the high cost of 4DGS reconstruction is entirely avoided.
- Efficient 4D decomposition: Decomposing the 4D problem into canonical 3DGS generation and variation field modeling reduces the dimensionality burden on the diffusion model.
- Elegant motion-aware encoding design: Mesh-guided interpolation bridges mesh motion and Gaussian motion, and motion-aware queries substantially improve encoding quality.
- Synthesis-to-real generalization: The model is trained exclusively on synthetic data yet generalizes to in-the-wild videos, demonstrating the transferability of the learned motion priors.
Limitations & Future Work¶
- The training set contains only 34K animated objects, with scale limited by the scarcity of high-quality animation assets.
- Inference requires canonical GS generation from a pretrained 3D model, introducing a dependency on third-party model quality.
- Current support is limited to 32-frame animations; longer-sequence generation may require autoregressive or sliding-window strategies.
- The capability to model complex topological changes (e.g., object splitting or merging) remains to be validated.
Related Work & Insights¶
- The perceiver-based encoding in 3DShape2VecSet inspired the cross-attention encoding architecture adopted in this work.
- Trellis's structured latent representation provides a strong foundation for high-quality canonical GS generation.
- The decomposition strategy for 4D problems (canonical + variation) is generalizable to other dynamic 3D tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — A pioneering contribution toward native 4D diffusion models; the VAE-based direct encoding strategy circumvents per-instance fitting.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive quantitative comparisons and clear ablations, though the test set covers only 100 objects.
- Writing Quality: ⭐⭐⭐⭐ — Framework is clearly described with complete technical details and well-designed ablation experiments.
- Value: ⭐⭐⭐⭐⭐ — A practical solution for 4D generation with real application potential given the 4.5-second generation time.