
Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

Conference: ICCV 2025 | arXiv: 2507.23785 | Code: GVFDiffusion.github.io | Area: 3D Vision | Keywords: 4D generation, video-to-4D, Gaussian variation field, diffusion model, 3D Gaussian Splatting

TL;DR

This paper proposes a video-to-4D generation framework that encodes animation data directly into a compact Gaussian variation field latent space via a Direct 4DMesh-to-GS Variation Field VAE, and trains a temporally-aware diffusion model to generate dynamic 3D content. The framework achieves high-fidelity 4D synthesis in 4.5 seconds and demonstrates strong generalization to real-world video inputs.

Background & Motivation

4D generation—creating dynamic 3D content—represents the next frontier following image, video, and 3D generation. Real-world phenomena inherently combine spatial and temporal dynamics, yet training robust 4D diffusion models faces two major technical challenges:

High cost of large-scale 4D dataset construction: Direct approaches require fitting an independent dynamic Gaussian splatting (4DGS) representation to each 3D animation sequence, typically taking minutes to tens of minutes per instance (about 6 minutes for 4DGaussians, 30+ minutes for K-Planes), which makes them computationally expensive and difficult to scale.

Difficulty of directly modeling high-dimensional representations: Simultaneously representing 3D shape, appearance, and motion typically requires over 100K tokens, making direct diffusion modeling extremely challenging.

Limitations of prior work:

  • Optimization-based methods (Consistent4D, STAG4D, etc.) rely on SDS distillation, require over an hour per instance, and suffer from spatiotemporal inconsistencies.
  • Feed-forward methods (L4GM) reconstruct 4DGS from 2D-generated multi-view images, but multi-view inconsistencies in the 2D generation degrade quality.
  • Native 4D diffusion models are absent; existing approaches only indirectly leverage 2D/3D priors.

Mechanism: The paper decomposes 4D generation into canonical 3DGS generation (leveraging existing 3D models) and Gaussian Variation Field modeling. By directly encoding 3D animation data, per-instance fitting is bypassed, and high-dimensional motion information is compressed into a compact latent space.

Method

Overall Architecture

Given an input video \(\mathcal{I} = \{I_t\}_{t=1}^T\), the goal is to generate a 3DGS sequence \(\mathcal{G} = \{G_t\}_{t=1}^T\). This is decomposed into:

  • Canonical GS \(G_1\): a static 3DGS generated from the first frame using a pretrained 3D model.
  • Gaussian Variation Field \(\mathcal{V} = \{\Delta G_t\}_{t=1}^T\): temporal variations of each Gaussian attribute relative to \(G_1\).
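
As a concrete illustration of this decomposition, the sketch below models a Gaussian set and applies one frame's variation field to the canonical Gaussians. The field names and the simple additive composition (including for rotations and opacity) are assumptions for clarity, not the paper's exact parameterization.

```python
from dataclasses import dataclass
import torch

@dataclass
class GaussianSet:
    pos: torch.Tensor      # (M, 3) centers
    scale: torch.Tensor    # (M, 3) per-axis scales
    rot: torch.Tensor      # (M, 4) quaternions
    color: torch.Tensor    # (M, 3) colors / SH DC coefficients
    opacity: torch.Tensor  # (M, 1)

def compose_frame(G1: GaussianSet, dG: GaussianSet) -> GaussianSet:
    """Apply one frame's variation field dG to the canonical Gaussians G1.
    Additive composition is assumed here purely for illustration."""
    return GaussianSet(
        pos=G1.pos + dG.pos,
        scale=G1.scale + dG.scale,
        rot=G1.rot + dG.rot,
        color=G1.color + dG.color,
        opacity=G1.opacity + dG.opacity,
    )

# G_t = compose_frame(G1, dG_t) for t = 1..T reconstructs the full 3DGS sequence.
```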

The framework comprises two main components: (1) the Direct 4DMesh-to-GS Variation Field VAE, and (2) the Gaussian Variation Field diffusion model.

Key Designs

  1. Direct 4DMesh-to-GS Variation Field VAE:

Encoding:

  • Converts mesh animation sequences into point clouds \(\mathcal{P} = \{P_t \in \mathbb{R}^{N \times 3}\}_{t=1}^T\) (\(N = 8192\)).
  • Computes displacement fields \(\Delta P_t = P_t - P_1\).
  • Obtains the canonical GS via a pretrained Mesh-to-GS encoder: \(G_1 = \mathcal{D}_{GS}(\mathcal{E}_{GS}(M_1))\).

Mesh-guided Interpolation (key innovation): For each canonical Gaussian position \(\bm{p}_1^i\), its K nearest mesh points are identified (with distances \(d_{i,k}\)), and the displacement field is interpolated using adaptive radius-based weighting:

\(w_{i,k} = \exp\!\left(-\frac{\beta\, d_{i,k}}{r_i^2}\right), \quad r_i = \sqrt{\frac{1}{K}\sum_{k=1}^K d_{i,k}}\)

\(\Delta \bm{p}_{t,i}^{interp} = \sum_{k=1}^K \frac{w_{i,k}}{\sum_{k'=1}^K w_{i,k'}} \Delta P_{t,n(i,k)}\)

FPS sampling is then applied to the interpolated displacements to obtain motion-aware queries \(\Delta \bm{p}_t^{fps} \in \mathbb{R}^{L \times 3}\), which are encoded into a latent representation \(\bm{z} \in \mathbb{R}^{T \times L \times C}\) (\(L = 512\), \(C = 16\)) via cross-attention, compressing the sequence length from \(N = 8192\) to \(L = 512\).
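
A minimal PyTorch sketch of the mesh-guided interpolation and the FPS-based query compression described above. The tensor shapes, the FPS helper, and the choice to run FPS over the canonical Gaussian centers are illustrative assumptions rather than the authors' implementation.

```python
import torch

def farthest_point_sample(x, L):
    """Greedy farthest-point sampling over (M, 3) points; returns L indices."""
    M = x.shape[0]
    idx = torch.zeros(L, dtype=torch.long)
    dist = torch.full((M,), float("inf"))
    for i in range(1, L):
        dist = torch.minimum(dist, (x - x[idx[i - 1]]).pow(2).sum(-1))
        idx[i] = dist.argmax()
    return idx

def mesh_guided_interpolation(p1_gs, P1, dP, K=8, beta=1.0):
    """Interpolate mesh displacement fields onto canonical Gaussian centers.

    p1_gs: (M, 3)    canonical Gaussian positions
    P1:    (N, 3)    mesh point cloud at frame 1
    dP:    (T, N, 3) mesh displacement field, dP[t] = P_t - P_1
    returns (T, M, 3) interpolated Gaussian displacements
    """
    d = torch.cdist(p1_gs, P1)                        # (M, N) distances d_{i,k}
    d_knn, idx = d.topk(K, dim=1, largest=False)      # K nearest mesh points
    r_sq = d_knn.mean(dim=1, keepdim=True)            # r_i^2 = (1/K) sum_k d_{i,k}
    w = torch.exp(-beta * d_knn / r_sq.clamp(min=1e-8))
    w = w / w.sum(dim=1, keepdim=True)                # normalized adaptive weights
    neigh = dP[:, idx, :]                             # (T, M, K, 3) neighbor displacements
    return (w[None, :, :, None] * neigh).sum(dim=2)   # (T, M, 3)

# Usage: compress N = 8192 mesh points to L = 512 motion-aware queries.
T, N, M, L = 24, 8192, 4096, 512
P = torch.randn(T, N, 3)                              # animated point clouds
p1_gs = torch.randn(M, 3)                             # canonical GS centers
dP = P - P[:1]                                        # displacement fields
dp_interp = mesh_guided_interpolation(p1_gs, P[0], dP)
fps_idx = farthest_point_sample(p1_gs, L)             # assumed: FPS over GS centers
dp_fps = dp_interp[:, fps_idx]                        # (T, L, 3) motion-aware queries
```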

Decoding: The latent representation is processed via self-attention, then decoded into variation fields \(\Delta G_t = \{\Delta \bm{p}_t, \Delta \bm{s}_t, \Delta \bm{q}_t, \Delta \bm{c}_t, \Delta \alpha_t\}\) using all canonical GS parameters as queries through cross-attention.
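
The decoding step can be pictured as a small cross-attention module in which embedded canonical GS parameters act as queries over the per-frame latents. The layer sizes, the 14-dimensional attribute packing, and the single-block depth below are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class VariationFieldDecoder(nn.Module):
    """Toy decoder: canonical GS parameters query the latent via cross-attention
    and are mapped to per-Gaussian variation attributes (14 = 3+3+4+3+1 dims)."""
    def __init__(self, latent_dim=16, d_model=256, heads=8):
        super().__init__()
        self.latent_proj = nn.Linear(latent_dim, d_model)
        self.query_proj = nn.Linear(14, d_model)        # embed canonical GS params
        self.self_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.head = nn.Linear(d_model, 14)              # Δp, Δs, Δq, Δc, Δα

    def forward(self, z, g1_params):
        # z: (T, L, latent_dim) per-frame latents; g1_params: (M, 14) canonical GS
        h = self.latent_proj(z)
        h, _ = self.self_attn(h, h, h)                  # refine latent tokens
        q = self.query_proj(g1_params).unsqueeze(0).expand(z.shape[0], -1, -1)
        out, _ = self.cross_attn(q, h, h)               # GS params attend to latents
        return self.head(out)                           # (T, M, 14) variation fields
```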

  2. Gaussian Variation Field Diffusion Model:

Built on a Diffusion Transformer (DiT) architecture, with the following core innovations:

  • Temporal self-attention layers: added alongside the standard spatial self-attention to capture inter-frame motion coherence.
  • Dual conditioning injection: visual features \(\mathcal{C}^v\) (extracted via DINOv2) and canonical GS geometric features \(\mathcal{C}^{GS}\) are injected via cross-attention.
  • Positional prior embedding: positional encodings based on the canonical GS positions \(\bm{p}_1^{fps}\) strengthen the model's awareness of the correspondence between spatial positions and the variation field.
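
The sketch below shows how these three ingredients might fit into a single spatio-temporal DiT block in PyTorch: frame-wise spatial self-attention, cross-frame temporal self-attention, and cross-attention to concatenated visual and canonical-GS condition tokens; positional prior embeddings would be added to the input tokens before such blocks. Dimensions and block layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpatioTemporalDiTBlock(nn.Module):
    """Toy spatio-temporal DiT block: spatial self-attention within each frame,
    temporal self-attention across frames, and cross-attention to condition
    tokens (visual DINOv2 features + canonical GS features)."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(d, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norms = nn.ModuleList(nn.LayerNorm(d) for _ in range(4))

    def forward(self, x, cond):
        # x: (B, T, L, d) noisy latent tokens; cond: (B, S, d) condition tokens
        B, T, L, d = x.shape
        h = x.reshape(B * T, L, d)                      # spatial: attend within a frame
        n = self.norms[0](h)
        h = h + self.spatial(n, n, n)[0]
        h = h.reshape(B, T, L, d).permute(0, 2, 1, 3).reshape(B * L, T, d)
        n = self.norms[1](h)                            # temporal: attend across frames
        h = h + self.temporal(n, n, n)[0]
        h = h.reshape(B, L, T, d).permute(0, 2, 1, 3).reshape(B * T, L, d)
        c = cond.repeat_interleave(T, dim=0)            # share conditions over frames
        h = h + self.cross(self.norms[2](h), c, c)[0]   # dual-condition injection
        h = h + self.mlp(self.norms[3](h))
        return h.reshape(B, T, L, d)
```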

Velocity prediction parameterization is adopted, with training objective:

\(\mathcal{L}_{simple} = \mathbb{E}_{s, \bm{z}^0, \bm{\epsilon}} \left[ \|\hat{\bm{v}}_\theta(\alpha_s \bm{z}^0 + \sigma_s \bm{\epsilon}, s, \mathcal{C}) - \bm{v}^s\|_2^2 \right]\)
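
A compact sketch of one training step under this objective, assuming the standard velocity target \(\bm{v}^s = \alpha_s \bm{\epsilon} - \sigma_s \bm{z}^0\) and a variance-preserving schedule with \(\alpha_s^2 + \sigma_s^2 = 1\); the `model` interface is hypothetical.

```python
import torch

def v_prediction_loss(model, z0, cond, alphas, sigmas):
    """One training step of the velocity-parameterized objective above.
    `model` is a hypothetical callable taking (noised latent, timestep, conditions)."""
    B = z0.shape[0]
    s = torch.randint(0, alphas.shape[0], (B,))                  # random timesteps
    shape = (B,) + (1,) * (z0.dim() - 1)
    a, sig = alphas[s].view(shape), sigmas[s].view(shape)
    eps = torch.randn_like(z0)
    z_s = a * z0 + sig * eps                                     # forward diffusion
    v_target = a * eps - sig * z0                                # velocity target
    return (model(z_s, s, cond) - v_target).pow(2).mean()
```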

  3. Mesh-guided Loss:

Aligns predicted Gaussian displacements with pseudo-GT displacements from mesh interpolation, serving as a critical supervision signal for motion reconstruction:

\(\mathcal{L}_{mg} = \sum_{t=1}^T \|\Delta \bm{p}_t - \Delta \bm{p}_t^{interp}\|_2^2\)

Loss & Training

VAE training: Two-stage pipeline—first fine-tuning the canonical GS decoder for 150K iterations, then jointly training all modules for 200K iterations. Total loss: \(\mathcal{L}_{total} = \mathcal{L}_{img} + \lambda_{mg}\mathcal{L}_{mg} + \lambda_{kl}\mathcal{L}_{kl}\), where \(\mathcal{L}_{img}\) combines L1 + LPIPS + SSIM.
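
A rough sketch of how the three terms could be combined in code; the loss weights and the `lpips_fn` / `ssim_fn` callables (standing in for standard LPIPS and SSIM implementations) are placeholders, and the reduction over Gaussians and frames is simplified relative to the paper.

```python
import torch

def vae_total_loss(render_pred, render_gt, dp_pred, dp_interp, mu, logvar,
                   lpips_fn, ssim_fn, lam_mg=1.0, lam_kl=1e-6):
    """Sketch of the joint objective: image reconstruction (L1 + LPIPS + SSIM),
    mesh-guided displacement supervision, and a KL term."""
    l_img = ((render_pred - render_gt).abs().mean()
             + lpips_fn(render_pred, render_gt).mean()
             + (1.0 - ssim_fn(render_pred, render_gt)))
    l_mg = (dp_pred - dp_interp).pow(2).sum(dim=-1).mean()        # mesh-guided loss
    l_kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()  # KL to N(0, I)
    return l_img + lam_mg * l_mg + lam_kl * l_kl
```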

Diffusion model training: Trained on 24-frame sequences for 1,300K iterations using a cosine noise schedule with 1,000 timesteps. Inference supports 32-frame generation.
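
For reference, a standard cosine schedule in the style of Nichol & Dhariwal, which produces the per-timestep \(\alpha_s, \sigma_s\) pairs consumed by the training step above; the paper's exact schedule constants are not specified here, so this is indicative only.

```python
import torch

def cosine_schedule(num_steps=1000, s=0.008):
    """Cosine noise schedule; returns per-step (alpha, sigma) pairs with
    alpha^2 + sigma^2 = 1."""
    t = torch.linspace(0, 1, num_steps)
    f = torch.cos((t + s) / (1 + s) * torch.pi / 2) ** 2
    alpha_bar = f / f[0]
    return alpha_bar.sqrt(), (1 - alpha_bar).sqrt()

# alphas, sigmas = cosine_schedule(1000)  # feeds the v-prediction step above
```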

Key Experimental Results

Main Results (Video-to-4D Generation)

| Method | PSNR↑ | LPIPS↓ | SSIM↑ | CLIP↑ | FVD↓ | Time↓ |
|---|---|---|---|---|---|---|
| Consistent4D | 16.20 | 0.146 | 0.880 | 0.910 | 935.19 | ~1.5 hr |
| SC4D | 15.93 | 0.164 | 0.872 | 0.870 | 833.15 | ~20 min |
| STAG4D | 16.85 | 0.144 | 0.887 | 0.893 | 1008.40 | ~1 hr |
| DreamGaussian4D | 15.24 | 0.162 | 0.868 | 0.904 | 799.56 | ~15 min |
| L4GM | 17.03 | 0.128 | 0.891 | 0.930 | 529.10 | 3.5 s |
| Ours | 18.47 | 0.114 | 0.901 | 0.935 | 476.83 | 4.5 s |

The proposed method achieves the best performance on all quality metrics: PSNR improves by 1.44 dB over L4GM, FVD decreases by 9.9% (476.83 vs. 529.10), and the total generation time is only 4.5 seconds (3.0s for canonical GS + 1.5s for variation field diffusion).

Ablation Study

VAE component ablation:

| Config | Encoder Query | Mesh-guided Loss | Variation Attrs | PSNR↑ | LPIPS↓ | SSIM↑ |
|---|---|---|---|---|---|---|
| A. Baseline | \(\bm{p}_t^{fps}\) | ✗ | \(\Delta\bm{p},\Delta\bm{s},\Delta\bm{q}\) | 23.25 | 0.0678 | 0.936 |
| B. +mesh loss | \(\bm{p}_t^{fps}\) | ✓ | \(\Delta\bm{p},\Delta\bm{s},\Delta\bm{q}\) | 26.17 | 0.0544 | 0.950 |
| C. +motion query | \(\Delta\bm{p}_t^{fps}\) | ✓ | \(\Delta\bm{p},\Delta\bm{s},\Delta\bm{q}\) | 28.58 | 0.0478 | 0.958 |
| D. Full (Ours) | \(\Delta\bm{p}_t^{fps}\) | ✓ | All 5 attrs | 29.28 | 0.0439 | 0.964 |

Diffusion model ablation:

| Method | PSNR↑ | LPIPS↓ | SSIM↑ | CLIP↑ | FVD↓ |
|---|---|---|---|---|---|
| w/o positional embedding | 17.86 | 0.121 | 0.897 | 0.931 | 547.20 |
| Full model | 18.47 | 0.114 | 0.901 | 0.935 | 476.83 |

Key Findings

  1. Mesh-guided loss is critical for motion learning: Config A→B yields a 2.92 dB PSNR gain, addressing the core challenge of lacking GT Gaussian motion supervision.
  2. Motion-aware queries substantially outperform static positional queries: Config B→C yields an additional 2.41 dB PSNR gain.
  3. Color and opacity variations also contribute meaningfully: Adding \(\Delta\bm{c}_t\) and \(\Delta\alpha_t\) provides a further 0.70 dB improvement.
  4. Positional prior embedding matters for the diffusion model: removing it reduces PSNR by 0.61 dB and worsens FVD from 476.83 to 547.20, confirming that it strengthens the model's awareness of spatial position-to-variation-field correspondences.

Highlights & Insights

  • Bypassing per-instance fitting: By encoding Gaussian variation fields directly from mesh animations in a single forward pass, the high cost of 4DGS reconstruction is entirely avoided.
  • Efficient 4D decomposition: Decomposing the 4D problem into canonical 3DGS generation and variation field modeling reduces the dimensionality burden on the diffusion model.
  • Elegant motion-aware encoding design: Mesh-guided interpolation bridges mesh motion and Gaussian motion, and motion-aware queries substantially improve encoding quality.
  • Synthesis-to-real generalization: The model is trained exclusively on synthetic data yet generalizes to in-the-wild videos, demonstrating the transferability of the learned motion priors.

Limitations & Future Work

  • The training set contains only 34K animated objects, with scale limited by the scarcity of high-quality animation assets.
  • Inference requires canonical GS generation from a pretrained 3D model, introducing a dependency on third-party model quality.
  • Current support is limited to 32-frame animation; long-sequence generation may require autoregressive or sliding-window strategies.
  • The capability to model complex topological changes (e.g., object splitting or merging) remains to be validated.

Related Notes

  • The perceiver-based encoding in 3DShape2VecSet inspired the cross-attention encoding architecture adopted in this work.
  • Trellis's structured latent representation provides a strong foundation for high-quality canonical GS generation.
  • The decomposition strategy for 4D problems (canonical + variation) is generalizable to other dynamic 3D tasks.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — A pioneering contribution toward native 4D diffusion models; the VAE-based direct encoding strategy circumvents per-instance fitting.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive quantitative comparisons and clear ablations, though the test set covers only 100 objects.
  • Writing Quality: ⭐⭐⭐⭐ — Framework is clearly described with complete technical details and well-designed ablation experiments.
  • Value: ⭐⭐⭐⭐⭐ — A practical solution for 4D generation with real application potential given the 4.5-second generation time.