PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation¶

Conference: ECCV 2024
arXiv: 2404.13026
Code: physdreamer.github.io
Area: Video Generation
Keywords: Physical Simulation, 3D Gaussians, Video Generation Priors, Material Property Estimation, Interactive Dynamics

TL;DR¶

Estimates spatially-varying Young's modulus material fields for static 3D Gaussian objects by leveraging physical dynamics priors implicit in video generation models, enabling physically plausible interactive 3D dynamics synthesis.

Background & Motivation¶

Background: In recent years, 3D vision has made remarkable progress in high-quality static 3D asset reconstruction (e.g., 3D Gaussian Splatting, NeRF), with some methods even extending to 4D assets to generate unconditional dynamics. However, existing methods cannot synthesize action-conditioned dynamics that respond to novel physical interactions such as external forces.
Limitations of Prior Work: The core challenge of synthesizing action-conditioned dynamics lies in understanding the physical material properties (e.g., stiffness) of objects. However, measuring the material properties of real-world objects is extremely difficult and lacks ground-truth data. Real-world objects often possess complex, spatially-varying material properties, making estimation even harder.
Key Challenge: Physical simulation requires known material parameters (such as Young's modulus \(E\)) to correctly simulate dynamics, but these parameters cannot be directly obtained from appearance. Manually setting parameters cannot guarantee physical plausibility.
Goal: Estimate the physical material properties of static 3D objects without ground-truth material data, so that the objects respond to arbitrary interactions in a physically plausible manner.
Key Insight: Humans can easily imagine how an object reacts to forces (e.g., a rose swaying in a gentle breeze), an ability originating from prior knowledge gained from massive physical world observations. Video generation models, trained on large-scale video datasets, implicitly capture the relationship between appearance and dynamics.
Core Idea: Distill physical dynamics priors from pretrained video generation models, and backpropagate through differentiable physical simulation and rendering to optimize material fields so that the simulated videos match reference videos generated by the video model.

Method¶

Overall Architecture¶

The PhysDreamer pipeline consists of three stages:

1. Reference Video Generation: Render static images of the 3D Gaussians from a specific viewpoint, and use Stable Video Diffusion to generate motion reference videos.
2. Physical Parameter Optimization: Optimize the material field \(E(\bm{x})\) and initial velocity field \(\bm{v}_0(\bm{x})\) via differentiable MPM simulation and differentiable rendering, aligning the rendered video with the reference video.
3. Interactive Motion Synthesis: Use the estimated material field to generate physically plausible 3D dynamics under arbitrary external forces through MPM simulation.

Key Designs¶

Module 1: Continuum Mechanics and Elastic Material Model

The Fixed Corotated hyperelastic material model is adopted, and the strain energy density function is defined as:

\[\psi(\bm{F}) = \mu \sum_{i=1}^{d}(\sigma_i - 1)^2 + \frac{\lambda}{2}(\det(\bm{F}) - 1)^2\]

where \(\sigma_i\) are the singular values of the deformation gradient \(\bm{F}\) and \(d\) is the spatial dimension. The Lamé parameters \(\mu\) and \(\lambda\) are related to Young's modulus \(E\) and Poisson's ratio \(\nu\) by:

\[\mu = \frac{E}{2(1+\nu)}, \quad \lambda = \frac{E\nu}{(1+\nu)(1-2\nu)}\]

Young's modulus \(E\) determines the material stiffness: high \(E\) leads to small-amplitude, high-frequency motion (rigid), while low \(E\) leads to large-amplitude, low-frequency motion (soft).
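
The two formulas above can be checked numerically with a minimal sketch. The function names, the example values of \(E\) and \(\nu\), and the choice to pass singular values directly (instead of running an SVD on \(\bm{F}\)) are assumptions made for illustration, not part of PhysDreamer's code. The sketch also assumes an SVD convention in which \(\det(\bm{F})\) equals the product of the singular values.

```python
import math

def lame_parameters(E, nu):
    """Convert Young's modulus E and Poisson's ratio nu to Lame parameters (mu, lam)."""
    if not -1.0 < nu < 0.5:
        raise ValueError("Poisson's ratio must lie in (-1, 0.5)")
    mu = E / (2.0 * (1.0 + nu))
    lam = E * nu / ((1.0 + nu) * (1.0 - 2.0 * nu))
    return mu, lam

def fixed_corotated_energy(singular_values, E, nu):
    """Fixed corotated strain energy density for the given singular values of F.

    Assumes det(F) equals the product of the singular values.
    """
    mu, lam = lame_parameters(E, nu)
    det_F = math.prod(singular_values)
    distortion = sum((s - 1.0) ** 2 for s in singular_values)
    return mu * distortion + 0.5 * lam * (det_F - 1.0) ** 2

# Hypothetical values, chosen only for illustration.
rest = fixed_corotated_energy([1.0, 1.0, 1.0], E=1e5, nu=0.3)
stretched_soft = fixed_corotated_energy([1.1, 1.0, 1.0], E=1e4, nu=0.3)
stretched_stiff = fixed_corotated_energy([1.1, 1.0, 1.0], E=1e6, nu=0.3)
```

Because both Lamé parameters scale linearly with \(E\), the same 10% stretch stores 100 times more energy in the stiffer material, which is why a high \(E\) resists deformation more strongly.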

Module 2: Differentiable MPM Simulation and Parameter Optimization

3D Gaussian particles are used as the spatial discretization of the MPM, simulating dynamics through P2G (particle-to-grid) and G2P (grid-to-particle) transfer cycles. A single-step simulation can be expressed as:

\[\bm{x}^{t+1}, \bm{v}^{t+1}, \bm{F}^{t+1}, \bm{C}^{t+1} = \mathcal{S}(\bm{x}^t, \bm{v}^t, \bm{F}^t, \bm{C}^t, \bm{\theta}, \Delta t)\]

Physical parameters \(\bm{\theta}\) include mass, Young's modulus, Poisson's ratio, and volume. The material and velocity fields are each parameterized by a triplane followed by a three-layer MLP.
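
A triplane-plus-MLP field can be sketched in a few lines of plain Python. Everything below is an illustrative assumption rather than the paper's implementation: scalar features per plane, concatenation of the three plane features, hand-set weights, and an exponential on the output so that the predicted Young's modulus stays positive.

```python
import math

def bilinear(plane, u, v):
    """Bilinearly interpolate a 2D grid (at least 2x2) at coordinates (u, v) in [0, 1]."""
    rows, cols = len(plane), len(plane[0])
    x = min(max(u, 0.0), 1.0) * (rows - 1)
    y = min(max(v, 0.0), 1.0) * (cols - 1)
    i0, j0 = min(int(x), rows - 2), min(int(y), cols - 2)
    fx, fy = x - i0, y - j0
    top = plane[i0][j0] * (1 - fy) + plane[i0][j0 + 1] * fy
    bottom = plane[i0 + 1][j0] * (1 - fy) + plane[i0 + 1][j0 + 1] * fy
    return top * (1 - fx) + bottom * fx

def query_triplane(planes, point):
    """Project a 3D point onto the xy, xz and yz planes and collect one feature per plane."""
    x, y, z = point
    return [bilinear(planes["xy"], x, y),
            bilinear(planes["xz"], x, z),
            bilinear(planes["yz"], y, z)]

def dense(weights, bias, x):
    """One fully connected layer: weights is a list of rows, bias a list of offsets."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def material_field(planes, mlp, point):
    """Three-layer MLP on triplane features; the output is treated as log(E) (assumption)."""
    h = relu(dense(*mlp[0], query_triplane(planes, point)))
    h = relu(dense(*mlp[1], h))
    (log_E,) = dense(*mlp[2], h)
    return math.exp(log_E)

# Hypothetical 2x2 planes and hand-set weights, for illustration only.
ones = [[1.0, 1.0], [1.0, 1.0]]
planes = {"xy": ones, "xz": ones, "yz": ones}
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mlp = [(identity, [0.0, 0.0, 0.0]), ([[1.0, 1.0, 1.0]], [0.0]), ([[1.0]], [0.0])]
E_at_center = material_field(planes, mlp, (0.5, 0.5, 0.5))
```

In an actual optimization the plane entries and MLP weights would be the trainable variables, and the query would be evaluated at every simulated particle position.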

Module 3: K-Means Subsampling for Accelerated Simulation

High-fidelity rendering requires millions of particles, making the simulation of all particles computationally prohibitive. A set of driving particles \(\{Q_q\}_{q=1}^Q\) (\(Q \ll P\)) is created via K-Means clustering, and simulations are run only on these driving particles. During rendering, each 3D Gaussian interpolates its position and rotation by fitting a rigid body transformation to the 8 nearest driving particles.
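
The subsampling idea can be illustrated with a small, self-contained K-Means in plain Python. This is a sketch under stated assumptions: the function names, the use of cluster means as driving particles, the point counts, and the fixed iteration budget are all hypothetical, and the rigid-transform fitting step is omitted.

```python
import random

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd iterations; returns k cluster centers used here as driving particles."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the previous center if a cluster became empty
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers

def nearest_driving_particles(gaussian_center, driving, count=8):
    """Indices of the `count` driving particles closest to one Gaussian, nearest first."""
    order = sorted(range(len(driving)),
                   key=lambda q: squared_distance(gaussian_center, driving[q]))
    return order[:count]

# Hypothetical point cloud standing in for 3D Gaussian centers.
rng = random.Random(1)
gaussians = [(rng.random(), rng.random(), rng.random()) for _ in range(400)]
driving = kmeans(gaussians, k=16)
neighbors = nearest_driving_particles(gaussians[0], driving, count=8)
```

Only the `driving` particles would be advanced by the simulator; each Gaussian would then be moved according to a transform estimated from the driving particles listed in `neighbors`.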

Loss & Training¶

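Pixel-level Matching Loss (the weight \(\lambda\) here is a scalar loss weight, distinct from the Lamé parameter \(\lambda\) in the material model):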

\[L^t = \lambda L_1(\hat{I}^t, I^t) + (1-\lambda) L_{\text{D-SSIM}}(\hat{I}^t, I^t), \quad \lambda = 0.1\]
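
A minimal sketch of this per-frame loss follows. It makes several simplifying assumptions that are not taken from the paper: images are flat lists of intensities in [0, 1], SSIM is computed once over the whole image rather than over local windows, and D-SSIM is taken to be \(1 - \text{SSIM}\).

```python
def l1_loss(pred, target):
    """Mean absolute difference between two flattened images."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def global_ssim(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM over the whole image (a simplification of windowed SSIM)."""
    n = len(pred)
    mu_p = sum(pred) / n
    mu_t = sum(target) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_t = sum((t - mu_t) ** 2 for t in target) / n
    cov = sum((p - mu_p) * (t - mu_t) for p, t in zip(pred, target)) / n
    numerator = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    denominator = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return numerator / denominator

def frame_loss(pred, target, lam=0.1):
    """lam * L1 + (1 - lam) * D-SSIM, with D-SSIM assumed to be 1 - SSIM."""
    d_ssim = 1.0 - global_ssim(pred, target)
    return lam * l1_loss(pred, target) + (1.0 - lam) * d_ssim

# Hypothetical 4-pixel "images" for illustration.
reference = [0.1, 0.4, 0.6, 0.9]
rendered = [0.2, 0.3, 0.7, 0.8]
loss_same = frame_loss(reference, reference)
loss_diff = frame_loss(rendered, reference)
```

With \(\lambda = 0.1\), the structural term carries 90% of the weight, so the loss is driven mainly by structural agreement between rendered and reference frames rather than by raw per-pixel differences.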

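Total Variation Regularization (encouraging spatial smoothness; \(\bm{u}_{i,j}\) denotes the feature vector at grid index \((i,j)\) of one 2D plane of the triplane):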

\[L_{\text{tv}} = \sum_{i,j} \|\bm{u}_{i+1,j} - \bm{u}_{i,j}\|_2^2 + \|\bm{u}_{i,j+1} - \bm{u}_{i,j}\|_2^2\]
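
The regularizer can be written directly from the formula. In this sketch, the nested-list layout `plane[i][j]` and the choice to skip differences that would reach past the grid boundary are assumptions.

```python
def total_variation(plane):
    """Sum of squared differences between neighboring feature vectors.

    plane[i][j] is a list of floats (the feature vector u_{i,j}).
    Differences that would index past the last row or column are skipped.
    """
    rows, cols = len(plane), len(plane[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:
                total += sum((a - b) ** 2 for a, b in zip(plane[i + 1][j], plane[i][j]))
            if j + 1 < cols:
                total += sum((a - b) ** 2 for a, b in zip(plane[i][j + 1], plane[i][j]))
    return total

# Hypothetical 2x2 planes with one feature channel each.
flat_plane = [[[0.5], [0.5]], [[0.5], [0.5]]]
step_plane = [[[0.0], [1.0]], [[0.0], [1.0]]]
tv_flat = total_variation(flat_plane)
tv_step = total_variation(step_plane)
```

A constant plane has zero penalty, while the step plane is penalized once per row for the jump along the second index.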

Two-Stage Optimization Strategy:

- Phase 1: Randomly initialize and freeze Young's modulus, optimizing only the initial velocity using the first three frames.
- Phase 2: Freeze the initial velocity and optimize the spatially-varying Young's modulus; gradients are backpropagated only to the previous frame to prevent gradient explosion.
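
The freeze-then-fit structure of the two phases can be illustrated on a toy problem. The sketch below is not the paper's procedure: it replaces the MPM simulator with a one-dimensional undamped oscillator, replaces gradient-based optimization with a closed-form least-squares fit and a grid search, and does not model the truncated backpropagation. All names and values are hypothetical.

```python
import math

def displacement(t, v0, k, mass=1.0):
    """Undamped oscillator starting at rest position with initial velocity v0."""
    omega = math.sqrt(k / mass)
    return (v0 / omega) * math.sin(omega * t)

dt = 0.05
times = [dt * (n + 1) for n in range(60)]            # 3 seconds of "frames"
true_v0, true_k = 1.0, 4.0                            # hypothetical ground truth
observed = [displacement(t, true_v0, true_k) for t in times]

# Phase 1: freeze an arbitrary stiffness guess and fit v0 on the first three frames.
# The model is linear in v0, so least squares has a closed form.
k_frozen = 1.0
basis = [displacement(t, 1.0, k_frozen) for t in times[:3]]
v0_hat = sum(o * b for o, b in zip(observed[:3], basis)) / sum(b * b for b in basis)

# Phase 2: freeze v0_hat and search for the stiffness that best explains all frames.
def mse(k):
    return sum((displacement(t, v0_hat, k) - o) ** 2
               for t, o in zip(times, observed)) / len(times)

candidates = [0.5 + 0.25 * i for i in range(39)]      # 0.5, 0.75, ..., 10.0
k_hat = min(candidates, key=mse)
```

In this toy, the displacement at small \(t\) is approximately \(v_0 t\) regardless of stiffness, so the first few frames constrain the initial velocity even when the frozen stiffness guess is wrong; the later frames then determine the stiffness.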

Key Experimental Results¶

Main Results¶

User study (2AFC protocol, 100 participants, 800 judgment samples):

Comparison	Motion Realism Preference	Visual Quality Preference
Ours vs PhysGaussian	80.8%	65.0%
Ours vs DreamGaussian4D	63.5%	70.0%
Ours vs Real Footage	53.7%	37.3%

Ablation Study¶

Setting	User Preference for Multi-view Supervision (Visual Quality)	User Preference for Multi-view Supervision (Motion Realism)
Single-view Reference Video	Baseline	Baseline
Dual-view Reference Video	81.0% prefer dual-view	86.0% prefer dual-view

Key Findings¶

PhysGaussian, lacking a material property estimation mechanism, produces large-amplitude, unphysically slow motion.
DreamGaussian4D generates periodic, constant small-amplitude motion, failing to simulate realistic damping effects.
In the Alocasia scene, 86% of users considered PhysDreamer more realistic than the real footage—possibly because MPM generates lower frequency, smoother motion for thin geometries, and humans tend to prefer smoother motion.
Multi-view reference videos are extremely important for objects with severe self-occlusions (e.g., Alocasia).

Highlights & Insights¶

Clever Prior Distillation Concept: Utilizing video generation models as a proxy for "physical intuition", bypassing the difficult problem of material property measurement.
Complete Differentiable Pipeline: End-to-end differentiable from simulation to rendering and loss computation, supporting integrated optimization.
Physical Consistency: The estimated material field can be reused under arbitrary external forces and is not limited to specific motions.
Subsampling Strategy: Effectively reduces computational complexity, making million-particle scenes feasible.

Limitations & Future Work¶

Requires manual specification of foreground objects, background segmentation, and boundary condition setups.
High computational cost: Even with subsampling, generating one second of video takes approximately 1 minute (on an NVIDIA V100).
Limited to elastic objects; does not support collisional interactions.
The quality of the video generation model directly affects the accuracy of material estimation.
3D object discovery could be introduced to automate foreground extraction.

Related Work¶

PhysGaussian: Integrates MPM simulation into 3D Gaussians, but relies on manually specified material parameters.
DreamGaussian4D: Synthesizes 4D content from video generation models by distilling deformation fields, but does not support physical interaction.
Generative Image Dynamics: Learns modal bases in the image space using diffusion models to achieve 2D image interaction.
PAC-NeRF / DANO: Combines physical simulation with implicit representations, but lacks the ability to distill material parameters from generative models.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ — For the first time distilling physical material properties from video generation models.
Experimental Thoroughness: ⭐⭐⭐⭐ — Rigorous user study design, but lacks comparisons with quantitative metrics.
Writing Quality: ⭐⭐⭐⭐⭐ — Clear motivation and detailed methodology explanations.
Value: ⭐⭐⭐⭐ — Direct prospects for application in virtual reality and gaming fields.