# Seeing the Wind from a Falling Leaf
Conference: NeurIPS 2025 · arXiv: 2512.00762 · Code: Project Page · Area: Video Generation · Keywords: Invisible Force Field Recovery, Differentiable Physics Simulation, Inverse Graphics, 3D Gaussian, Causal Triplane
## TL;DR
The paper proposes an end-to-end differentiable inverse-graphics framework that jointly models object geometry and physical properties, a force-field representation, and the physical process itself. Invisible force fields (e.g., wind) are recovered from video via backpropagation, and the same pipeline supports physics-based video generation and editing.
## Background & Motivation
- Background: Computer vision has long pursued motion modeling from video, yet the invisible physical interactions (forces) underlying motion remain largely unexplored. System identification methods estimate only a small number of physical parameters (e.g., mass, friction coefficients).
- Limitations of Prior Work: Force estimation is far harder than physical parameter estimation: forces are full vectors that can exist throughout 3D space in dense and complex configurations, and time integration in physical simulators leads to gradient explosion/vanishing, making backpropagation unstable.
- Key Challenge: Inferring invisible forces from visible motion is an ill-posed inverse problem with incomplete information. The conventional photometric loss provides insufficient gradients, and dense 3D scene flow is highly noisy.
- Goal: Recover the force fields driving object motion from video input alone, without manually specifying forces or environmental conditions.
- Key Insight: Construct a fully differentiable "perception → physics → optimization" pipeline, replacing dense scene flow with sparse keypoint tracking to substantially reduce the dimensionality of the optimization space and stabilize gradients.
- Core Idea: 3D Gaussians (Lagrangian particles) + Causal Triplane (Eulerian force field) + MPM simulator + sparse tracking objective = an end-to-end differentiable pipeline from video to force fields.
## Method
### Overall Architecture
Four modules: (1) Object modeling (3D Gaussians + VLM physical properties) → (2) Force field representation (Causal Triplane) → (3) Physical process (differentiable MPM simulator) → (4) Sparse tracking optimization (backpropagation to recover force fields).
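To make the data flow concrete, here is a minimal PyTorch-style sketch of the outer optimization loop. All interfaces (`gaussians`, `force_field`, `mpm_step`, `keypoint_targets`) are hypothetical stand-ins for the four modules, not the authors' API.

```python
import torch

# Minimal sketch of the video -> force-field loop; all interfaces here
# (gaussians, force_field, mpm_step, keypoint_targets) are illustrative
# stand-ins for the paper's four modules, not its actual API.
def recover_force_field(gaussians, force_field, mpm_step, keypoint_targets,
                        n_frames, dt=1e-3, iters=200, lr=1e-3):
    opt = torch.optim.Adam(force_field.parameters(), lr=lr)
    for _ in range(iters):
        x = gaussians.positions.clone()      # (1) object state from frame 0
        v = gaussians.velocities.clone()
        loss = x.new_zeros(())
        for t in range(n_frames - 1):
            f = force_field(x, t)            # (2) query the Causal Triplane
            x, v = mpm_step(x, v, f, dt)     # (3) one differentiable MPM step
            pred = gaussians.keypoints(x)    # keypoint positions from Gaussians
            loss = loss + (pred - keypoint_targets[t + 1]).abs().mean()  # (4)
        opt.zero_grad()
        loss.backward()                      # gradients flow through the simulator
        opt.step()
    return force_field
```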
### Key Designs
1. 3D Gaussian-based Object Modeling + VLM Physical Properties
    - Function: Unified representation of object geometry, appearance, and physical properties.
    - Mechanism: Each 3D Gaussian \(G = \{\mathbf{x}, \mathbf{v}, \Sigma, \sigma, SH, \mathbf{D}, m, E, \nu\}\) encodes position, velocity, shape, appearance, and physical attributes (mass \(m\), Young's modulus \(E\), Poisson's ratio \(\nu\)). Gaussians are initialized from the first frame only (via a metric depth model + Gaussian splatting). Physical properties are inferred by GPT-4V, which recognizes the object type in the image and assigns commonsense values; Grounded SAM provides segmentation masks that assign each Gaussian to an object.
    - Design Motivation: 3D Gaussians as Lagrangian particles are naturally compatible with MPM, and VLM commonsense physical knowledge is sufficiently robust for common object types (see the particle-state sketch after this list).
2. Causal Triplane Force Field Representation
    - Function: High-fidelity modeling of the spatiotemporal continuity and causal dependencies of force fields.
    - Mechanism: Force is defined as \(\mathbf{f}(\mathbf{x}, t) = \mathcal{D}(\gamma(\mathbf{x}) + \varphi(t; \varphi(t-1)))\), where \(\gamma(\cdot)\) denotes triplane spatial features, \(\varphi(\cdot)\) is a recurrent temporal encoder (initializing the current-timestep MLP with the previous timestep's weights), and \(\mathcal{D}\) is a feature decoder. The recurrent dependency of the temporal encoder achieves causal evolution of forces.
    - Design Motivation: Compared to other 4D representations (K-Planes, HexPlane), the causal triplane decouples space and time, is computationally efficient, and naturally models the temporal causality of forces (see the triplane sketch after this list).
3. 4D Sparse Tracking Objective
    - Function: Stabilize differentiable physics optimization and reduce the dimensionality of the prediction space.
    - Mechanism: CoTracker provides sparse pixel keypoint motions \(\mathbf{p}^t \to \mathbf{p}^{t+1}\), which are back-projected to 3D as \(\mathbf{P}^t\). Robust 3D keypoint motions \(\mathbf{P}^t \to \mathbf{P}^{t+1}\) are obtained by minimizing reprojection error under an ARAP constraint (\(\mathcal{L}_{arap}\)). Keypoints then drive all Gaussian motions via barycentric interpolation: \(\hat{\mathbf{x}} = \alpha_i \mathbf{P}_i + \alpha_j \mathbf{P}_j + \alpha_k \mathbf{P}_k\).
    - Design Motivation: Photometric loss suffers from gradient vanishing, and dense 3D scene flow is highly noisy. Sparse keypoints substantially shrink the prediction space (from \(N\) Gaussians to \(N_{key}\) keypoints), and pixel-level tracking by CoTracker is more reliable than inter-frame depth estimation (see the keypoint-binding sketch after this list).
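A minimal sketch of the per-Gaussian state from design (1); the field names and tensor shapes are assumptions, mapping \(G = \{\mathbf{x}, \mathbf{v}, \Sigma, \sigma, SH, \mathbf{D}, m, E, \nu\}\) onto arrays.

```python
import torch
from dataclasses import dataclass

# Illustrative container for the per-Gaussian state
# G = {x, v, Σ, σ, SH, D, m, E, ν}; names and shapes are assumptions.
@dataclass
class GaussianParticles:
    x: torch.Tensor        # (N, 3) positions
    v: torch.Tensor        # (N, 3) velocities
    cov: torch.Tensor      # (N, 3, 3) covariances Σ (shape/orientation)
    opacity: torch.Tensor  # (N,) opacities σ
    sh: torch.Tensor       # (N, K, 3) spherical-harmonic appearance coefficients
    deform: torch.Tensor   # (N, 3, 3) deformation gradients D
    mass: torch.Tensor     # (N,) masses m
    E: torch.Tensor        # (N,) Young's moduli, VLM-assigned commonsense values
    nu: torch.Tensor       # (N,) Poisson's ratios, VLM-assigned commonsense values
```

Presumably the VLM assigns one \((m, E, \nu)\) set per Grounded-SAM segment, broadcast to every Gaussian in that segment.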
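A sketch of the Causal Triplane query from design (2), implementing \(\mathbf{f}(\mathbf{x}, t) = \mathcal{D}(\gamma(\mathbf{x}) + \varphi(t))\). Plane resolution, feature width, and the treatment of the causal recurrence are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of f(x, t) = D(γ(x) + φ(t)); sizes and the handling of the causal
# recurrence are assumptions, not the paper's code.
class CausalTriplane(nn.Module):
    def __init__(self, res=64, feat=32):
        super().__init__()
        # γ: three axis-aligned feature planes (XY, XZ, YZ).
        self.planes = nn.ParameterList(
            nn.Parameter(0.01 * torch.randn(1, feat, res, res)) for _ in range(3)
        )
        # φ: per-timestep temporal encoder. The paper initializes step t's MLP
        # from step t-1's weights; this sketch shares one MLP across steps.
        self.temporal = nn.Linear(1, feat)
        # D: decode summed spatial + temporal features into a 3D force vector.
        self.decoder = nn.Sequential(nn.Linear(feat, 64), nn.ReLU(), nn.Linear(64, 3))

    def spatial(self, x):
        # γ(x): bilinear lookup on each plane, summed. Assumes x in [-1, 1]^3.
        feats = 0.0
        for plane, uv in zip(self.planes, (x[:, :2], x[:, ::2], x[:, 1:])):
            grid = uv.reshape(1, -1, 1, 2)                          # (1, N, 1, 2)
            sampled = F.grid_sample(plane, grid, align_corners=True)
            feats = feats + sampled[0, :, :, 0].t()                 # (N, feat)
        return feats

    def forward(self, x, t):
        phi = self.temporal(x.new_full((x.shape[0], 1), float(t)))
        return self.decoder(self.spatial(x) + phi)                  # (N, 3)
```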
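A sketch of the keypoint binding from design (3), realizing \(\hat{\mathbf{x}} = \alpha_i \mathbf{P}_i + \alpha_j \mathbf{P}_j + \alpha_k \mathbf{P}_k\). The rule for choosing the three controlling keypoints and their weights \(\alpha\) (nearest neighbors on frame 0, inverse-distance weighting) is an assumption.

```python
import torch

# Bind each Gaussian to its three nearest keypoints on frame 0; the
# inverse-distance weighting is an assumed stand-in for the paper's α.
def bind_to_keypoints(x0, P0, eps=1e-8):
    dist, idx = torch.cdist(x0, P0).topk(3, largest=False)  # (N, 3) each
    w = 1.0 / (dist + eps)
    alpha = w / w.sum(dim=1, keepdim=True)                  # normalized weights
    return idx, alpha

def interpolate(P_t, idx, alpha):
    # Propagate tracked keypoint positions P^t to every Gaussian: x̂ = Σ α_k P_k.
    return (alpha.unsqueeze(-1) * P_t[idx]).sum(dim=1)      # (N, 3)
```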
### Loss & Training
- \(\mathcal{L}_{motion} = \|\hat{\mathbf{x}}^{t+1} - \mathbf{x}^{t+1}\|\): matches MPM-simulated Gaussian positions \(\mathbf{x}^{t+1}\) to tracking-derived targets \(\hat{\mathbf{x}}^{t+1}\)
- \(\mathcal{L}_{space}\): spatial total variation regularization
- \(\mathcal{L}_{time} = |\varphi_\theta^{t+1} - \varphi_\theta^t|\): temporal smoothness regularization
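Putting the three terms together, a minimal sketch of the combined objective; the \(\lambda\) weights are illustrative, not from the paper. `planes` are the triplane feature maps, and `phi_t` / `phi_prev` are the temporal encoder's parameters at consecutive timesteps.

```python
import torch

# Hedged sketch of L = L_motion + λ_space * L_space + λ_time * L_time.
def total_loss(x_sim, x_target, planes, phi_t, phi_prev,
               lam_space=1e-3, lam_time=1e-2):
    # L_motion: simulated positions vs. tracking-derived targets.
    l_motion = (x_sim - x_target).abs().mean()
    # L_space: total variation over each triplane's two spatial axes.
    l_space = sum(
        (p[..., 1:, :] - p[..., :-1, :]).abs().mean()
        + (p[..., :, 1:] - p[..., :, :-1]).abs().mean()
        for p in planes
    )
    # L_time: smoothness of the temporal encoder's weights across steps.
    l_time = sum((a - b).abs().mean() for a, b in zip(phi_t, phi_prev))
    return l_motion + lam_space * l_space + lam_time * l_time
```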
## Key Experimental Results
### Main Results
**Force Recovery on Synthetic Scenes**
| Material Type | Object | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Magnitude Error (%) ↓ | Direction Error (°) ↓ |
|---|---|---|---|---|---|---|
| Elastic | Lego | 33.70 | 0.98 | 0.01 | 19.53 | 7.02 |
| Elastic | Ficus | 25.92 | 0.94 | 0.03 | 23.97 | 11.55 |
| Elastic | Sunflower | 34.08 | 0.99 | 0.01 | 14.38 | 7.85 |
| Elastoplastic | Toy | 41.35 | 0.99 | 0.00 | 29.19 | 8.11 |
| Elastoplastic | Chair | 40.10 | 0.99 | 0.00 | 33.31 | 23.40 |
| Viscoplastic | Hotdog | 30.63 | 0.96 | 0.02 | 15.09 | 11.63 |
### Ablation Study
| Method | PSNR ↑ | Magnitude Error (%) ↓ | Direction Error (°) ↓ |
|---|---|---|---|
| Point force representation | 20.57 | 95.91 | 76.48 |
| Dense scene flow objective | — | Poor | Poor |
| Photometric objective | — | Gradient vanishing | Failed |
| Sparse tracking + Causal Triplane | 33.70 | 19.53 | 7.02 |
### Key Findings
- Elastic materials yield the best force recovery (magnitude error 14–24%, direction error 7–12°).
- Elastoplastic materials exhibit larger direction errors (23.4° on Chair) due to additional uncertainty introduced by plastic deformation.
- VLM-estimated physical properties are sufficiently robust for force recovery — even imprecise estimates yield reasonable force fields.
- Force visualizations on real-world videos are physically plausible, and recovered force fields can be applied to new objects for physics-based simulation.
## Highlights & Insights
- This work is the first to recover distributed force fields from video (rather than contact forces or a small number of parameters), making the problem formulation itself a significant contribution.
- The pairing of 3D Gaussians (Lagrangian particles) and the triplane (Eulerian field) mirrors MPM's own hybrid particle-grid formalism.
- The sparse tracking objective is the key enabler of practical differentiable physics optimization — it simultaneously reduces optimization dimensionality and improves tracking robustness.
- The application scenario is novel: applying recovered force fields to new objects enables physics-driven video editing.
## Limitations & Future Work
- The sparse tracking objective is primarily suited to objects undergoing bending deformation or small deformations.
- 3D Gaussians initialized from a single frame are incomplete (occluded surface information is missing).
- Physical property estimation relies on VLM commonsense knowledge, which may be inaccurate for unconventional objects.
- Multi-object collision interactions are not currently handled.
- Computational cost is high due to per-frame optimization of force field parameters.
## Related Work & Insights
- This work extends the tradition of differentiable physics inverse problems (e.g., GradSim, PAC-NeRF), advancing from parameter estimation to force field recovery.
- Methods such as PhysDreamer and Physics3D use generative models to drive physical animation but require manually specified forces — the proposed method recovers forces automatically.
- Insight: Combining the rendering capability of 3DGS with the physical simulation capability of MPM may catalyze a new paradigm of physics-aware video generation.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ The problem formulation of recovering force fields from video is highly pioneering.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers synthetic and real-world scenes with diverse materials and complete ablations, though quantitative evaluation is limited to synthetic data.
- Writing Quality: ⭐⭐⭐⭐⭐ The opening with a Rossetti verse is elegant, and the technical exposition is clear.
- Value: ⭐⭐⭐⭐⭐ Bridges vision and physics, opening a new research direction from perception to force field recovery.