PhysInOne: Visual Physics Learning and Reasoning in One Suite

Conference: CVPR 2026
arXiv: 2604.09415
Code: https://vlar-group.github.io/PhysInOne.html
Area: Multimodal VLM / Physical Reasoning
Keywords: physics learning, synthetic dataset, world model, video generation, physical reasoning

TL;DR

PhysInOne is a large-scale synthetic dataset of 153,810 dynamic 3D scenes and 2 million annotated videos, covering 71 fundamental physical phenomena across mechanics, optics, fluid dynamics, and magnetism. It establishes a new benchmark for physically aware world models.

Background & Motivation

Background: Current AI models exhibit a severe deficit in understanding the physical world — AI-generated videos frequently violate basic physical laws (e.g., objects falling upward, sudden velocity changes). Existing physics datasets are extremely small (hundreds to thousands of samples), impeding progress in physical learning.

Limitations of Prior Work: Large-scale, high-quality training data covering diverse physical objects, scenes, and phenomena are lacking. Existing datasets either focus on a single physical phenomenon (e.g., collisions) or employ simple geometric primitives, failing to reflect the complexity of the real world.

Key Challenge: Physically-aware AI must learn the joint effects of multiple physical phenomena across diverse scenes, yet existing datasets are far too small to support this requirement.

Goal: To construct a synthetic physics dataset that is orders of magnitude larger than existing ones, covering the vast majority of physical phenomena encountered in everyday life.

Key Insight: Systematically identifying 71 key physical phenomena from undergraduate physics textbooks, then using physics engines to generate dynamic 3D scenes that rigorously conform to physical laws.

Core Idea: Large-scale synthetic physics data combined with multi-object complex interactions and comprehensive ground-truth annotations, providing data infrastructure for physically-aware world models.

Method

Overall Architecture

The PhysInOne construction pipeline consists of five stages: (1) identifying 71 phenomena across four domains from physics textbooks; (2) designing 153,810 3D scenes with multi-object interactions; (3) recording 13 videos per scene (12 fixed cameras plus 1 moving camera); (4) manually annotating textual descriptions; and (5) automatically generating geometric, semantic, motion, and physical-property annotations.
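The per-scene record implied by this pipeline can be sketched as a small Python structure. All names and paths here are illustrative assumptions of mine, not the paper's actual schema:

```python
from dataclasses import dataclass, field


@dataclass
class SceneRecord:
    """Hypothetical per-scene record mirroring the five pipeline stages."""
    scene_id: int
    phenomena: list            # subset of the 71 catalogued phenomena
    caption: str               # stage 4: manually annotated textual description
    # Stage 3: 13 videos per scene -- 12 fixed cameras plus 1 moving camera.
    fixed_camera_videos: list = field(default_factory=list)
    moving_camera_video: str = ""
    # Stage 5: automatically generated annotations (placeholders here).
    annotations: dict = field(default_factory=dict)

    @property
    def num_videos(self) -> int:
        return len(self.fixed_camera_videos) + 1  # fixed cameras + moving camera


def make_scene(scene_id: int, phenomena: list, caption: str) -> SceneRecord:
    """Build one record with hypothetical file paths for the 13 recordings."""
    return SceneRecord(
        scene_id=scene_id,
        phenomena=phenomena,
        caption=caption,
        fixed_camera_videos=[
            f"scene_{scene_id:06d}/cam_{i:02d}.mp4" for i in range(12)
        ],
        moving_camera_video=f"scene_{scene_id:06d}/cam_moving.mp4",
        annotations={
            key: None  # placeholders for the annotation modalities listed below
            for key in ("mesh", "trajectory", "mask", "material", "depth", "camera_pose")
        },
    )


scene = make_scene(0, ["gravity", "collision"], "A ball drops onto a ramp.")
assert scene.num_videos == 13
```

The 12-fixed-plus-1-moving camera split comes straight from the pipeline description; everything else (field names, path layout) is a sketch.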

Key Designs

  1. Systematic Physical Phenomenon Coverage:

    • Function: Ensure the dataset covers all visually relevant physical phenomena encountered in everyday life.
    • Mechanism: Based on Fundamentals of Physics and related research, the work focuses on four domains — mechanics, optics, fluid dynamics, and magnetism. Thermodynamics and acoustics are excluded (non-visual or requiring additional sensor data). A total of 71 key phenomena are identified, including gravity, reflection, buoyancy, and magnetic attraction.
    • Design Motivation: Prior datasets typically cover only 1–9 physical phenomena; PhysInOne aims for near-complete coverage.
  2. Multi-Object Complex Scene Design:

    • Function: Reflect the real-world characteristic that multiple physical phenomena occur simultaneously or sequentially.
    • Mechanism: Each scene contains multiple objects interacting under multiple physical phenomena against complex backgrounds. All dynamics strictly follow Newton's laws, conservation of mass, conservation of angular momentum, Hooke's law, and other fundamental physical principles. Complex geometric objects are used rather than simple primitives.
    • Design Motivation: Physical phenomena in the real world are often coupled; datasets covering a single phenomenon are insufficient to train models with generalizable physical understanding.
  3. Comprehensive Annotation System:

    • Function: Support diverse downstream tasks and evaluations.
    • Mechanism: Each scene provides complete annotations including 3D meshes, motion trajectories, 2D masks, material properties, depth maps, camera poses, and textual descriptions. The 2-million-video annotation scale is orders of magnitude larger than all existing physics datasets combined.
    • Design Motivation: Comprehensive annotations make PhysInOne not only a training corpus but also a thorough benchmark for evaluating physical understanding capabilities.

Loss & Training

PhysInOne is a dataset rather than a model. The paper demonstrates fine-tuning results on four downstream applications, each using the standard training strategy corresponding to its respective task.

Key Experimental Results

Main Results

| Application | Model | After PhysInOne Fine-tuning | Effect |
| --- | --- | --- | --- |
| Physically-aware video generation | SVD / CogVideoX / WAN | Significant improvement in physical plausibility | Motion better conforms to physical laws |
| Future frame prediction | TiNeuVox / DefGS, etc. | Improved prediction quality | Enhanced spatiotemporal consistency |
| Physical property estimation | Various models | Key gaps exposed | Intrinsic property estimation remains difficult |
| Motion transfer | Various models | Improved performance | Physically plausible motion transfer |

Ablation Study

| Configuration | Result | Description |
| --- | --- | --- |
| Without PhysInOne fine-tuning | Frequent physics violations | Base models lack physical knowledge |
| Fine-tuning on a small subset | Partial improvement | Data volume positively correlates with physical understanding |
| Full fine-tuning | Best performance | Large-scale data yields significant gains |

Key Findings

  • Fine-tuning on PhysInOne significantly improves the physical plausibility of video generation, validating the value of large-scale synthetic physics data.
  • Base models still exhibit fundamental gaps in estimating intrinsic physical properties (mass, friction, etc.).
  • Complex multi-phenomenon scenes remain the most challenging for current models; training on single phenomena is insufficient for generalization.

Highlights & Insights

  • Order-of-magnitude scale advantage: 153K scenes and 2 million videos, surpassing the largest prior datasets by orders of magnitude.
  • Systematic physical coverage: 71 phenomena cover the vast majority of everyday physics, positioning PhysInOne as standardized training data for physical AI.
  • Exposing critical gaps: Experiments simultaneously reveal both the progress and the fundamental limitations of base models in physical reasoning, charting directions for future research.

Limitations & Future Work

  • A domain gap between synthetic data and real-world physics remains.
  • Thermodynamics and acoustics are excluded; non-visual physical phenomena are not covered.
  • Textual descriptions rely on manual annotation, which may become a bottleneck for further scaling.
Comparison with Prior Datasets

  • vs. CLEVRER: CLEVRER covers only collision phenomena with 10K scenes; PhysInOne covers 71 phenomena across 150K scenes.
  • vs. Physion++: Physion++ covers 9 phenomena but uses simple objects; PhysInOne employs complex geometry and multi-object interactions.

Rating

  • Novelty: ⭐⭐⭐⭐ Breakthrough improvement in scale and coverage
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive validation across four application tasks
  • Writing Quality: ⭐⭐⭐⭐ Clear organization and well-motivated contributions
  • Value: ⭐⭐⭐⭐⭐ Infrastructure-level contribution to physical AI