# PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation
Conference: NeurIPS 2025 | arXiv: 2509.20358 | Code: see project page | Area: Image/Video Generation | Keywords: physics-driven video generation, diffusion models, 3D point trajectories, material simulation, force control
## TL;DR
PhysCtrl learns the distribution of physical dynamics for four material types (elastic, sand, plasticine, and rigid bodies) with a diffusion model, representing dynamics as 3D point trajectories. The model combines spatiotemporal attention with physics constraints and is trained on 550K synthetic animations; the generated trajectories then drive a pretrained video model, yielding high-fidelity, physics-grounded videos controllable by force and material parameters.
## Background & Motivation
Background: Modern video generation models can produce photorealistic video but lack physical plausibility and 3D controllability.
Limitations of Prior Work: Traditional physics simulators (e.g., MPM) are computationally expensive, sensitive to hyperparameters, and numerically unstable; directly coupling simulators with video models requires manual parameter tuning and may necessitate switching between different simulators.
Key Challenge: How can physical plausibility be maintained while avoiding the limitations of traditional simulators?
Goal: Embed physical priors into a diffusion model to support fast forward/inverse inference, using physical parameters and external forces as control signals.
Key Insight: Answer two fundamental questions. Which representation is suitable for controlling video models? 3D point trajectories. How can multi-material physical priors be embedded? A spatiotemporal-attention diffusion model trained with physics constraints.
Core Idea: Use a diffusion model to learn the latent distribution of physical dynamics, with 3D point trajectories as a bridge between the physical world and video generation.
## Method
### Overall Architecture
Single image → SAM segmentation + SV3D multi-view synthesis + LGM 3D point-cloud reconstruction → conditional diffusion model generates 3D point trajectories → trajectories projected to 2D as a tracking video → DaS video model renders the final video.
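A high-level Python sketch of this data flow. All component callables (`segmenter`, `lifter`, `traj_model`, `projector`, `video_model`) are hypothetical stand-ins for SAM, SV3D+LGM, the trajectory diffusion model, camera projection, and DaS; only the pipeline structure is taken from the paper.

```python
# Hypothetical end-to-end sketch of the pipeline above; every component is a
# placeholder callable, not the authors' actual API.
def generate_physics_video(image, force, material, E, nu,
                           segmenter, lifter, traj_model, projector, video_model):
    mask = segmenter(image)                # SAM: isolate the object of interest
    points_0 = lifter(image, mask)         # SV3D multi-view -> LGM point cloud, (2048, 3)

    # Condition the trajectory diffusion model on geometry + physics controls.
    cond = dict(points=points_0, force=force, material=material, E=E, nu=nu)
    trajectory = traj_model.sample(cond)   # (24, 2048, 3) 3D point trajectory

    # Project to 2D and let the pretrained video model render the result.
    tracking_video = projector(trajectory)
    return video_model(image, tracking_video)
```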
### Key Designs
- Spatiotemporal Attention Diffusion Model (a minimal sketch follows this list):
    - Function: Learns the physical dynamics distribution \(p(\mathcal{P}|c)\) over four material types.
    - Mechanism: Trajectories of 2048 points × 24 frames; conditions include the initial point cloud, force, application point, Young's modulus, Poisson's ratio, ground height, and material type. Each Spatial-Temporal Attention Block applies spatial attention (intra-frame self-attention among points, with physical-condition tokens injected via AdaLN) followed by temporal attention (cross-frame self-attention for each point).
    - Design Motivation: Mirrors particle dynamics: neighboring-particle information is aggregated before temporal propagation, reflecting the P2G/G2P cycle of MPM.
- Physics-Constrained Training Loss:
    - Function: Incorporates the deformation-gradient update formula from MPM as an explicit physics constraint.
    - Mechanism: \(\mathcal{L}_{phys}\) enforces \(\mathbf{F}_p^{f+1} \approx g(\hat{\mathbf{x}}_p^f)\mathbf{F}_p^f\) (deformation-gradient consistency), complemented by \(\mathcal{L}_{floor}\) to prevent ground penetration.
    - Design Motivation: The physics loss acts as a regularizer that keeps generated trajectories physically plausible.
- Large-Scale Synthetic Dataset:
    - Function: 550K animations covering four material types (150K elastic, plus 100K each for sand, plasticine, rigid-body, and gravity scenarios).
    - Mechanism: High-quality 3D objects from ObjaverseXL, animated with MPM and rigid-body simulators; 2048 points × 24 frames per animation.
    - Design Motivation: Diverse data is the foundation for learning the physical distribution.
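The block below is a minimal PyTorch sketch of the spatial-then-temporal attention factorization described above. The ordering (intra-frame spatial attention with AdaLN-injected conditions, then per-point temporal attention) follows the paper; the layer sizes, head count, and exact AdaLN parameterization are assumptions.

```python
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    def __init__(self, dim=256, num_heads=8, cond_dim=256):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim, elementwise_affine=False)  # modulated by AdaLN
        self.norm_t = nn.LayerNorm(dim)
        # AdaLN: the physical-condition embedding predicts a per-block scale/shift.
        self.ada = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):
        # x: (B, F, N, D) tokens for N points over F frames; cond: (B, cond_dim)
        B, F, N, D = x.shape
        scale, shift = self.ada(cond).chunk(2, dim=-1)          # (B, D) each
        scale = scale.repeat_interleave(F, 0).unsqueeze(1)      # (B*F, 1, D)
        shift = shift.repeat_interleave(F, 0).unsqueeze(1)

        # Spatial attention: points within each frame attend to one another,
        # with physical conditions injected via AdaLN modulation.
        xs = x.reshape(B * F, N, D)
        h = self.norm_s(xs) * (1 + scale) + shift
        xs = xs + self.spatial_attn(h, h, h)[0]

        # Temporal attention: each point attends across all frames.
        xt = xs.reshape(B, F, N, D).permute(0, 2, 1, 3).reshape(B * N, F, D)
        h = self.norm_t(xt)
        xt = xt + self.temporal_attn(h, h, h)[0]
        return xt.reshape(B, N, F, D).permute(0, 2, 1, 3)       # back to (B, F, N, D)
```

Factorizing attention spatially then temporally keeps per-block cost at roughly \(O(FN^2 + NF^2)\) instead of \(O((NF)^2)\) for full joint attention, and matches the simulator analogy: gather neighborhood information first, then propagate it through time.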
### Loss & Training
\(\mathcal{L} = \mathcal{L}_{diff} + \lambda_{vel}\mathcal{L}_{vel} + \lambda_{phys}\mathcal{L}_{phys} + \lambda_{floor}\mathcal{L}_{floor}\). Two model sizes: Base (6 layers, 256-dim) and Large (12 layers, 512-dim). Trained with AdamW at lr = 1e-4. Sampling with DDIM: 25 steps take ≈1–3 s; 4 steps take ≈0.13–0.48 s.
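A minimal sketch of the physics-constrained terms, assuming the standard MPM update \(\mathbf{F}_p^{f+1} = (\mathbf{I} + \Delta t\,\nabla\mathbf{v})\mathbf{F}_p^f\), i.e. \(g(\hat{\mathbf{x}}) = \mathbf{I} + \Delta t\,\nabla\mathbf{v}(\hat{\mathbf{x}})\). The `velocity_gradient` helper (estimating each point's local velocity gradient from its neighbors) is hypothetical; the paper states the constraint but not this implementation.

```python
import torch

def physics_loss(pred_traj, F_gt, dt=1.0 / 24):
    # pred_traj: (B, T, N, 3) predicted point positions
    # F_gt: (B, T, N, 3, 3) per-point deformation gradients from the simulator
    eye = torch.eye(3, device=F_gt.device)
    loss = 0.0
    for f in range(pred_traj.shape[1] - 1):
        # g(x) = I + dt * grad_v; `velocity_gradient` is a hypothetical helper.
        grad_v = velocity_gradient(pred_traj[:, f], pred_traj[:, f + 1], dt)  # (B, N, 3, 3)
        F_next = (eye + dt * grad_v) @ F_gt[:, f]
        loss = loss + (F_next - F_gt[:, f + 1]).pow(2).mean()
    return loss / (pred_traj.shape[1] - 1)

def floor_loss(pred_traj, ground_height=0.0):
    # Penalize penetration below the ground plane (assuming the y-axis is up).
    return torch.relu(ground_height - pred_traj[..., 1]).mean()
```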
## Key Experimental Results
### Main Results
Video generation comparison (GPT-4o evaluation; higher is better):

| Method | SA↑ | PC↑ | VQ↑ |
|---|---|---|---|
| DragAnything | 2.9 | 2.8 | 2.8 |
| ObjCtrl-2.5D | 1.5 | 1.3 | 1.4 |
| Wan2.1 | 3.8 | 3.7 | 3.6 |
| CogVideoX | 3.2 | 3.2 | 3.1 |
| PhysCtrl | 4.5 | 4.5 | 4.3 |

3D dynamics generation vs. trajectory baselines:

| Method | vIoU↑ | CD↓ | Corr↓ |
|---|---|---|---|
| Motion2VecSets | 24.92% | 0.2160 | 0.1064 |
| MDM | 53.78% | 0.0159 | 0.0240 |
| PhysCtrl (Ours) | 77.59% | 0.0028 | 0.0015 |
### Ablation Study
| Configuration | vIoU↑ | CD↓ | Note |
|---|---|---|---|
| w/o spatial attention | 33.76% | 0.2348 | Spatial interaction is critical |
| w/o temporal attention | 53.63% | 0.0480 | Temporal consistency is essential |
| w/o physics loss | 76.30% | 0.0030 | Physics constraints further improve results |
| Full model | 77.59% | 0.0028 | — |
### Key Findings
- User study: Physical plausibility preference rate of 81%, far exceeding all baselines.
- Removing spatial attention causes a sharp drop in vIoU (77.59% → 33.76%).
- Poisson's ratio \(\nu\) has negligible influence on generated trajectories (consistent with PhysDreamer).
- High-quality trajectories can be obtained with as few as 4 diffusion steps (see the generic DDIM sketch below).
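Below is a generic deterministic DDIM sampling loop (η = 0) with a configurable step count, illustrating how a 4-step trajectory sample works. This is a textbook DDIM sketch with an assumed linear beta schedule, not the authors' exact sampler; `model` is assumed to predict the noise given (x, t, cond).

```python
import torch

@torch.no_grad()
def ddim_sample(model, cond, shape, num_steps=4, num_train_steps=1000):
    # Linear beta schedule (an assumption) and its cumulative alpha products.
    betas = torch.linspace(1e-4, 0.02, num_train_steps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    # Evenly spaced timesteps from T-1 down to 0; num_steps denoising updates.
    ts = torch.linspace(num_train_steps - 1, 0, num_steps + 1).long()

    x = torch.randn(shape)                      # e.g. (B, 24, 2048, 3) noisy trajectories
    for t, t_prev in zip(ts[:-1], ts[1:]):
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        eps = model(x, t, cond)                 # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean sample
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic DDIM step
    return x
```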
## Highlights & Insights
- Parameterizes physics simulation as a conditional generation problem, avoiding the limitations of traditional simulators.
- 3D point trajectories serve as an intermediate representation that is flexible and general while remaining directly usable to control video models.
- The spatiotemporal attention design elegantly mirrors the computational structure of physics simulation.
- Supports inverse problems: Young's modulus estimation takes only ~2 minutes (a hedged sketch follows).
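One plausible realization of such inverse estimation: fit Young's modulus by gradient descent through a differentiable few-step sampler so that generated trajectories match an observed one. The paper reports the capability and runtime; this optimization loop and its hyperparameters are assumptions.

```python
import torch

def estimate_youngs_modulus(observed_traj, cond, sample_fn, iters=100, lr=0.05):
    # Optimize log E for stability; `sample_fn(cond)` is assumed to be a
    # differentiable (few-step) sampler returning a trajectory shaped like observed_traj.
    log_E = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([log_E], lr=lr)
    for _ in range(iters):
        cond["E"] = log_E.exp()
        loss = (sample_fn(cond) - observed_traj).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return log_E.exp().item()
```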
## Limitations & Future Work
- Primarily handles single objects; multi-object interaction is only preliminarily explored.
- Limited to four material types; fluids are not included.
- Video model priors may conflict with physics-based trajectories.
- Handling of thin structures remains insufficient.
## Related Work & Insights
- vs PhysGaussian: PhysGaussian is scene-specific and requires high-quality 3D reconstruction; PhysCtrl instead learns a general prior.
- vs PhysGen/PhysMotion: these rely on simulators to generate dynamics; PhysCtrl embeds the prior directly in a diffusion model.
## Rating
- Novelty: ⭐⭐⭐⭐ First to learn multi-material physical dynamics distributions via a diffusion model.
- Experimental Thoroughness: ⭐⭐⭐⭐ GPT-4o evaluation + user study + complete ablations.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear and well-defined.
- Value: ⭐⭐⭐⭐ An important step toward physically controllable video generation.