PhysInOne: Visual Physics Learning and Reasoning in One Suite¶
Conference: CVPR 2026 | arXiv: 2604.09415 | Project Page: https://vlar-group.github.io/PhysInOne.html | Area: Multimodal VLM / Physical Reasoning | Keywords: physics learning, synthetic dataset, world model, video generation, physical reasoning
TL;DR¶
PhysInOne is a large-scale synthetic dataset of 153,810 dynamic 3D scenes and 2 million annotated videos, covering 71 fundamental physical phenomena across mechanics, optics, fluid dynamics, and magnetism. It establishes a new benchmark for physically aware world models.
Background & Motivation¶
Background: Current AI models exhibit a severe deficit in understanding the physical world — AI-generated videos frequently violate basic physical laws (e.g., objects falling upward, sudden velocity changes). Existing physics datasets are extremely small (hundreds to thousands of samples), impeding progress in physical learning.
Limitations of Prior Work: Large-scale, high-quality training data covering diverse physical objects, scenes, and phenomena are lacking. Existing datasets either focus on a single physical phenomenon (e.g., collisions) or employ simple geometric primitives, failing to reflect the complexity of the real world.
Key Challenge: Physically-aware AI must learn the joint effects of multiple physical phenomena across diverse scenes, yet existing datasets are far too small to support this requirement.
Goal: To construct a synthetic physics dataset that is orders of magnitude larger than existing ones, covering the vast majority of physical phenomena encountered in everyday life.
Key Insight: Systematically identifying 71 key physical phenomena from undergraduate physics textbooks, then using physics engines to generate dynamic 3D scenes that rigorously conform to physical laws.
Core Idea: Large-scale synthetic physics data combined with multi-object complex interactions and comprehensive ground-truth annotations, providing data infrastructure for physically-aware world models.
Method¶
Overall Architecture¶
The PhysInOne construction pipeline consists of: (1) identifying 71 phenomena across 4 domains from physics textbooks; (2) designing 153,810 3D scenes with multi-object interactions; (3) recording 13 videos per scene (12 fixed cameras + 1 moving camera); (4) manually annotating textual descriptions; and (5) automatically generating geometric, semantic, motion, and physical property annotations.
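The camera setup implies a simple consistency check between the scene and video counts. The sketch below is illustrative only (the constant names are not from the released code); it verifies that 12 fixed cameras plus 1 moving camera per scene yields the reported total of roughly 2 million videos.

```python
# Hypothetical sketch of the per-scene recording layout described in the
# paper; names are illustrative, not from the official release.
NUM_SCENES = 153_810
FIXED_CAMERAS = 12
MOVING_CAMERAS = 1

videos_per_scene = FIXED_CAMERAS + MOVING_CAMERAS   # 13 videos per scene
total_videos = NUM_SCENES * videos_per_scene        # 153,810 * 13

print(total_videos)  # 1,999,530 -- the "2 million videos" figure
```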
Key Designs¶
- Systematic Physical Phenomenon Coverage:
- Function: Ensure the dataset covers all visually relevant physical phenomena encountered in everyday life.
- Mechanism: Based on Fundamentals of Physics and related research, the work focuses on four domains — mechanics, optics, fluid dynamics, and magnetism. Thermodynamics and acoustics are excluded (non-visual or requiring additional sensor data). A total of 71 key phenomena are identified, including gravity, reflection, buoyancy, and magnetic attraction.
- Design Motivation: Prior datasets typically cover only 1–9 physical phenomena; PhysInOne aims for near-complete coverage.
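The four-domain taxonomy above can be pictured as a simple mapping. This is an illustrative sketch, not the dataset's actual schema: only the phenomena named in the paper's summary (gravity, reflection, buoyancy, magnetic attraction) are grounded in the text, and the remaining entries are hypothetical placeholders standing in for the full set of 71.

```python
# Illustrative taxonomy sketch (NOT the official phenomenon list).
# Entries other than gravity, reflection, buoyancy, and magnetic_attraction
# are hypothetical placeholders; the real dataset covers 71 phenomena.
PHENOMENON_TAXONOMY = {
    "mechanics":      ["gravity", "collision", "friction"],
    "optics":         ["reflection", "refraction"],
    "fluid_dynamics": ["buoyancy", "viscous_flow"],
    "magnetism":      ["magnetic_attraction", "magnetic_repulsion"],
}
# Thermodynamics and acoustics are deliberately absent: the paper excludes
# them as non-visual or requiring extra sensor modalities.
```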
- Multi-Object Complex Scene Design:
- Function: Reflect the real-world characteristic that multiple physical phenomena occur simultaneously or sequentially.
- Mechanism: Each scene contains multiple objects interacting under multiple physical phenomena against complex backgrounds. All dynamics strictly follow Newton's laws, conservation of mass, conservation of angular momentum, Hooke's law, and other fundamental physical principles. Complex geometric objects are used rather than simple primitives.
- Design Motivation: Physical phenomena in the real world are often coupled; datasets covering a single phenomenon are insufficient to train models with generalizable physical understanding.
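To make "dynamics strictly follow physical laws" concrete, here is a minimal toy integrator, not the paper's physics engine, showing how an engine advances state under Newton's second law and Hooke's law: a unit mass hanging on a vertical spring under gravity, stepped with semi-implicit Euler.

```python
# Minimal illustration (NOT the paper's simulator) of law-governed dynamics:
# a mass on a vertical spring under gravity, F = -k*x - m*g, a = F/m.
def step(x, v, dt, m=1.0, k=10.0, g=9.81):
    """Advance position x and velocity v by one timestep dt."""
    force = -k * x - m * g   # Hooke's law restoring force + gravity
    a = force / m            # Newton's second law
    v = v + a * dt           # semi-implicit Euler: velocity first...
    x = x + v * dt           # ...then position with the updated velocity
    return x, v

x, v = 0.0, 0.0
for _ in range(1000):        # 10 s of simulated time at dt = 0.01
    x, v = step(x, v, dt=0.01)
# The mass oscillates around the static equilibrium x = -m*g/k ≈ -0.981,
# so the trajectory stays bounded roughly within [-1.96, 0].
```

Semi-implicit Euler is chosen here because it keeps the oscillation's energy bounded, which is why the position never drifts outside the physically expected range.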
- Comprehensive Annotation System:
- Function: Support diverse downstream tasks and evaluations.
- Mechanism: Each scene provides complete annotations including 3D meshes, motion trajectories, 2D masks, material properties, depth maps, camera poses, and textual descriptions. The 2-million-video annotation scale is orders of magnitude larger than all existing physics datasets combined.
- Design Motivation: Comprehensive annotations make PhysInOne not only a training corpus but also a thorough benchmark for evaluating physical understanding capabilities.
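A per-scene record bundling the listed modalities might look like the following. The field names are hypothetical, mirroring the annotation types the paper enumerates rather than the dataset's actual on-disk schema.

```python
# Hypothetical per-scene annotation record (field names are illustrative;
# the dataset's real schema may differ).
from dataclasses import dataclass

@dataclass
class SceneAnnotation:
    scene_id: str
    mesh_paths: list      # per-object 3D meshes
    trajectories: dict    # object id -> sequence of 3D poses over time
    masks_2d: list        # per-frame 2D segmentation masks
    materials: dict       # object id -> material / physical properties
    depth_maps: list      # per-frame depth maps
    camera_poses: list    # extrinsics for the 13 cameras
    description: str      # manually written textual description
```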
Loss & Training¶
PhysInOne is a dataset rather than a model. The paper demonstrates fine-tuning results on four downstream applications, each using the standard training strategy corresponding to its respective task.
Key Experimental Results¶
Main Results¶
| Application | Model | After PhysInOne Fine-tuning | Effect |
|---|---|---|---|
| Physically-aware video generation | SVD / CogVideoX / WAN | Significant improvement in physical plausibility | Motion better conforms to physical laws |
| Future frame prediction | TiNeuVox / DefGS, etc. | Improved prediction quality | Enhanced spatiotemporal consistency |
| Physical property estimation | Various models | Key gaps exposed | Intrinsic property estimation remains difficult |
| Motion transfer | Various models | Improved performance | Physically plausible motion transfer |
Ablation Study¶
| Configuration | Observation | Interpretation |
|---|---|---|
| Without PhysInOne fine-tuning | Frequent physics violations | Base models lack physical knowledge |
| Fine-tuning on small subset | Partial improvement | Data volume positively correlates with physical understanding |
| Full fine-tuning | Best performance | Large-scale data yields significant gains |
Key Findings¶
- Fine-tuning on PhysInOne significantly improves the physical plausibility of video generation, validating the value of large-scale synthetic physics data.
- Base models still exhibit fundamental gaps in estimating intrinsic physical properties (mass, friction, etc.).
- Complex multi-phenomenon scenes remain the most challenging for current models; training on single phenomena is insufficient for generalization.
Highlights & Insights¶
- Order-of-magnitude scale advantage: 153K scenes and 2 million videos, surpassing the largest prior datasets by orders of magnitude.
- Systematic physical coverage: 71 phenomena cover the vast majority of everyday physics, positioning PhysInOne as standardized training data for physical AI.
- Exposing critical gaps: Experiments simultaneously reveal both the progress and the fundamental limitations of base models in physical reasoning, charting directions for future research.
Limitations & Future Work¶
- A domain gap between synthetic data and real-world physics remains.
- Thermodynamics and acoustics are excluded; non-visual physical phenomena are not covered.
- Textual descriptions rely on manual annotation, which may become a bottleneck for further scaling.
Related Work & Insights¶
- vs. CLEVRER: CLEVRER covers only collision phenomena with 10K scenes. PhysInOne covers 71 phenomena across 150K scenes.
- vs. Physion++: Physion++ covers 9 phenomena but uses simple objects; PhysInOne employs complex geometry and multi-object interactions.
Rating¶
- Novelty: ⭐⭐⭐⭐ Breakthrough improvement in scale and coverage
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive validation across four application tasks
- Writing Quality: ⭐⭐⭐⭐ Clear organization and well-motivated contributions
- Value: ⭐⭐⭐⭐⭐ Infrastructure-level contribution to physical AI