PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models
Conference: ICLR2026
OpenReview: https://openreview.net/forum?id=rlZeILv3fm
Code: To be confirmed
Area: Video Generation / Physical Realism Evaluation
Keywords: Text-to-Video Generation, Physical Realism, Benchmark, Multimodal Evaluation, Anti-Physics Scenes
TL;DR
PhyWorldBench constructs a large-scale benchmark covering 50 physical sub-phenomena, 1,050 prompts, and 12 mainstream text-to-video models. Using human evaluation and a context-aware MLLM evaluator system, it reveals significant shortcomings of current video generation models in realistic physics, complex interactions, and anti-physics instruction following.
Background & Motivation
Background: Text-to-video generation models have progressed rapidly in visual quality, subject consistency, and cinematic language over the past two years. Models such as Sora, Kling, Pika, Gen-3, Hunyuan, Wanx, and Open-Sora can generate visually appealing videos. However, visual realism does not equate to physical realism: an apple may look sharp with beautiful lighting, but its falling trajectory, collision, velocity changes, and fragmentation process may still completely defy real-world laws.
Limitations of Prior Work: Existing T2V benchmarks often focus on image quality, text alignment, temporal consistency, or composition. Physics-related benchmarks usually cover only a few physical categories or concentrate on a small area of motion/dynamic common sense. This leads to two blind spots: first, models might perform well on simple motions but fail in complex physics like fluids, rigid bodies, energy conservation, scale effects, and human/animal motion; second, it is difficult to distinguish whether a model truly understands physical laws or is just replicating common visual patterns from training data.
Key Challenge: The training objectives of video generation models lean towards pixel distribution, semantic alignment, and visual continuity, whereas physical laws are a set of constraints across time, objects, and scales. Models can "look like a video" through smooth motion, cinematic composition, or rationalized prompts, but they do not necessarily handle sudden fragmentation, multi-force interactions, anti-gravity, or energy transformation scenarios that require physical causality.
Goal: The authors aim to build a sufficiently broad, detailed, and reproducible physical realism evaluation benchmark. It must cover basic and composite physics while including "Anti-Physics" scenarios that intentionally violate reality. It should provide not only prompts but also clear Yes/No standards, utilizing both large-scale human evaluation and low-cost MLLM-based approximations.
Key Insight: PhyWorldBench decomposes "physical realism" into checkable standards: whether the correct objects and events exist, and whether physical phenomena occur according to realistic laws. This decomposition is more direct than subjective quality scoring and separates semantic failures from physical failures.
Core Idea: Establish a systematic benchmark for interrogating the physical realism of text-to-video models using structured physical categories, three prompt variants, human annotation standards, and a context-aware MLLM evaluator.
Method
Overall Architecture
PhyWorldBench is not a new video generation model but an evaluation system that combines a dataset, an evaluation protocol, an automated evaluator, and model diagnosis. The inputs are text prompts designed around physical phenomena, and the T2V models under test turn them into videos. Evaluation first checks semantic alignment, then verifies whether the video meets the key physical standards tied to the prompt.
The workflow consists of four steps: defining physical categories based on literature and expert consensus; constructing multiple prompts for each sub-category; writing Basic Standards and Key Standards for each prompt; and performing human evaluation on 12,600 videos from 12 models while validating a zero-shot MLLM evaluation scheme (CAP) to reduce future costs.
Key Designs
1. Hierarchical Physical Taxonomy: Decomposing physical realism into 50 measurable sub-categories
The most significant engineering effort is the systematic cataloging of physical reality. PhyWorldBench defines 10 main physical categories, each further divided into 5 sub-categories, totaling 50 sub-classes. Categories include Object Motion and Kinematics, Interaction Dynamics, Energy Conservation, Fluid and Particle Dynamics, Rigid Body Dynamics, Lighting and Shadows, Deformations and Elasticity, Scale and Proportions, Human and Animal Motion, and Anti-Physics.
This design ensures the benchmark goes beyond low-dimensional scenarios like "rolling balls." For instance, rigid body dynamics involves rotation, torque, balance, center of mass, impact, and deformation; fluid dynamics involves water, smoke, buoyancy, viscosity, and particle behavior. This prevents models from hiding weaknesses by simply memorizing common video snippets.
2. Three Prompt Variants: Distinguishing between lack of physical knowledge and lack of prompt information
Each sub-category features 7 scenarios, each with three prompt types: Event Prompt, Physics-enhanced Prompt, and Detailed Narrative Prompt. The Event Prompt is a minimal description (e.g., "A rocket launching"); the Physics-enhanced Prompt adds natural physical consequences (e.g., "moving vertically upward in a straight line"); the Detailed Narrative Prompt adds environmental, lighting, and narrative details.
This separates "physical understanding" from "prompt clarity." If Physics-enhanced Prompts significantly improve results, it suggests the model has prompt-following capabilities but requires explicit physical descriptions. If Detailed Narrative Prompts do not improve physical correctness, it suggests that cinematic details primarily improve surface appearance rather than underlying physics.
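The benchmark size follows directly from the taxonomy and the prompt variants. The sketch below only reproduces that counting; the category names come from the list above, while the integer sub-category and scenario indices and the prompt-type identifiers are hypothetical placeholders, not the benchmark's actual labels.

```python
from itertools import product

# Main categories as listed in this note.
CATEGORIES = [
    "Object Motion and Kinematics",
    "Interaction Dynamics",
    "Energy Conservation",
    "Fluid and Particle Dynamics",
    "Rigid Body Dynamics",
    "Lighting and Shadows",
    "Deformations and Elasticity",
    "Scale and Proportions",
    "Human and Animal Motion",
    "Anti-Physics",
]
SUBCATEGORIES_PER_CATEGORY = 5   # 10 x 5 = 50 sub-categories
SCENARIOS_PER_SUBCATEGORY = 7
# Hypothetical identifiers for the three prompt variants.
PROMPT_TYPES = ["event", "physics_enhanced", "detailed_narrative"]
NUM_MODELS = 12


def enumerate_prompt_slots():
    """Return one (category, sub_idx, scenario_idx, prompt_type) tuple per prompt."""
    return list(product(
        CATEGORIES,
        range(SUBCATEGORIES_PER_CATEGORY),
        range(SCENARIOS_PER_SUBCATEGORY),
        PROMPT_TYPES,
    ))


slots = enumerate_prompt_slots()
num_subcategories = len({(cat, sub) for cat, sub, _, _ in slots})
print(num_subcategories)         # 50 sub-categories
print(len(slots))                # 1050 prompts
print(len(slots) * NUM_MODELS)   # 12600 videos when every model renders every prompt
```

Because every scenario appears under all three prompt types, results can be compared per scenario across variants rather than only in aggregate.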
3. Basic Standards and Key Standards: Transforming subjective perception into votable Yes/No criteria
Instead of vague scores, PhyWorldBench uses two levels of standards for each prompt. Basic Standards check for the presence of objects and events (e.g., are there "two football players" and did they "collide"). Key Standards check for key physical phenomena (e.g., natural contact points, smooth momentum, and directional changes after collision).
The paper uses two core metrics: Semantic Adherence (SA) and Physical Commonsense (PC). \(SA=1\) means objects/events match the text; \(PC=1\) means the physical phenomena align with reality. The final success rate, denoted as "Both," requires \(SA=1\) and \(PC=1\).
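A minimal sketch of how SA, PC, and "Both" rates can be computed from per-video binary labels. The `VideoLabel` type and the majority-vote helper are illustrative assumptions: this note does not state how the annotators' votes are aggregated.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class VideoLabel:
    sa: int  # 1 if objects/events match the prompt (Basic Standards)
    pc: int  # 1 if the physics matches reality (Key Standards)


def majority(votes: Sequence[int]) -> int:
    """Assumed aggregation rule: strict majority of binary votes."""
    return int(sum(votes) * 2 > len(votes))


def summarize(labels: List[VideoLabel]) -> Dict[str, float]:
    """Return SA, PC, and Both rates over a set of labeled videos."""
    n = len(labels)
    return {
        "SA": sum(l.sa for l in labels) / n,
        "PC": sum(l.pc for l in labels) / n,
        # "Both" counts a video only when SA = 1 and PC = 1.
        "Both": sum(l.sa * l.pc for l in labels) / n,
    }


labels = [
    VideoLabel(sa=majority([1, 1, 0]), pc=majority([1, 1, 1])),  # both pass
    VideoLabel(sa=majority([1, 1, 1]), pc=majority([0, 0, 1])),  # semantic only
    VideoLabel(sa=majority([0, 0, 1]), pc=majority([1, 1, 0])),  # physics only
    VideoLabel(sa=majority([0, 0, 0]), pc=majority([0, 1, 0])),  # neither
]
rates = summarize(labels)
print(rates)  # {'SA': 0.5, 'PC': 0.5, 'Both': 0.25}
```

Since "Both" requires the two conditions jointly, it can never exceed the smaller of SA and PC, which is why it is the strictest of the three columns.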
4. Anti-Physics Category: Testing physical understanding vs. training distribution rationalization
The Anti-Physics category is a highly insightful component. The authors design prompts for "anti-gravity," "energy from nothing," "object clipping," "time reversal," and "infinite replication." While seemingly encouraging unreality, the purpose is clear: if a prompt asks for "anti-gravity wine in a glass" and the model generates a static glass or normal liquid, the model is not following the instruction but rationalizing the input into a common training distribution.
5. CAP Automated Evaluator: Informing the MLLM that "this is an AI-generated video" to reduce excuses
To reduce the cost of human evaluation, the paper proposes the Context-Aware Prompt (CAP) as a zero-shot evaluator. MLLMs tend to treat input frames as real-world footage and rationalize physical errors, assuming there must be a reason for what they see. Explicitly telling the MLLM that the frames come from an AI-generated video and may have quality issues makes it more sensitive to physical errors.
CAP uses a two-step process: first, the MLLM describes objects, events, and observed phenomena; second, it answers Yes/No based on standards. CAP achieved 80.3 ROC-AUC on SA and 75.1 on PC, significantly outperforming GPT-o1 (75.4 / 61.6).
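The two-step flow can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the prompt wording, the `AskFn` signature, and `stub_ask` are all hypothetical, and a real setup would replace `stub_ask` with a call to an actual MLLM.

```python
from typing import Callable, Dict, Sequence

# Hypothetical wording; the paper's exact CAP prompt is not reproduced here.
AI_VIDEO_CONTEXT = (
    "These frames are sampled from an AI-generated video. "
    "The video may contain artifacts or physically implausible events."
)

# Assumed interface: (prompt text, frames) -> model reply as text.
AskFn = Callable[[str, Sequence[bytes]], str]


def cap_evaluate(frames: Sequence[bytes], standards: Sequence[str], ask: AskFn) -> Dict[str, bool]:
    """Two-step evaluation: describe first, then judge each standard Yes/No."""
    # Step 1: ask for a description of objects, events, and phenomena.
    describe_prompt = (
        f"{AI_VIDEO_CONTEXT}\n"
        "Describe the objects, events, and physical phenomena you observe."
    )
    description = ask(describe_prompt, frames)

    # Step 2: judge every standard against the frames and the description.
    verdicts: Dict[str, bool] = {}
    for standard in standards:
        judge_prompt = (
            f"{AI_VIDEO_CONTEXT}\n"
            f"Description: {description}\n"
            f"Standard: {standard}\n"
            "Answer Yes or No."
        )
        answer = ask(judge_prompt, frames)
        verdicts[standard] = answer.strip().lower().startswith("yes")
    return verdicts


def stub_ask(prompt: str, frames: Sequence[bytes]) -> str:
    """Stand-in for a real MLLM call, used only to make the sketch runnable."""
    if "Standard:" not in prompt:
        return "A glass of wine sits still on a table."
    standard = prompt.split("Standard:", 1)[1]
    return "Yes" if "glass" in standard else "No"


verdicts = cap_evaluate([], ["A glass is visible", "The liquid floats upward"], stub_ask)
print(verdicts)
```

Passing the model call in as a function keeps the sketch independent of any particular MLLM client and makes the context string easy to remove for a "without context" comparison.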
Key Experimental Results
Main Results
The paper evaluates 12 models (5 closed-source, 7 open-source). Each generated 1,050 videos (12,600 total), evaluated by 3 annotators via Amazon Mechanical Turk.
| Model | Type | Overall SA | Overall PC | Overall Both | Key Conclusion |
|---|---|---|---|---|---|
| Pika 2.0 | Closed | 0.521 | 0.314 | 0.262 | Best overall in human eval; Anti-Physics Both is only 0.011 |
| Sora-Turbo | Closed | 0.384 | 0.261 | 0.208 | Strong physical correctness; 2nd among closed-source |
| Kling-1.6 | Closed | 0.357 | 0.241 | 0.188 | Strong visuals; cinematic style may mask physical errors |
| Wanx-2.1 | Open | 0.339 | 0.235 | 0.189 | Best "Both" among open-source; comparable to closed-source |
| Hunyuan 720p | Open | 0.344 | 0.250 | 0.185 | Best PC among open-source; stable physical commonsense |
| LTX-Video | Open | 0.194 | 0.085 | 0.062 | Weakest overall; frequent low-fidelity and deformation |
Absolute scores remain low. Even for Pika 2.0, "Both" is only 0.262, meaning only about a quarter of videos satisfy both semantic and physical correctness. Most models approach 0 for "Both" in Anti-Physics.
| Benchmark | Physics Categories | Prompts | Conclusion |
|---|---|---|---|
| VideoPhy | 5 | 688 | Narrower coverage |
| PhyGenBench | 27 | 160 | More categories, fewer prompts |
| Physics-IQ | 5 | 396 | Focus on basic understanding |
| T2VPhysBench | 3 | 84 | Small-scale consistency test |
| PhyWorldBench | 50 | 1050 | Widest coverage, highest difficulty |
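Ablation Study
Ablating the CAP evaluator shows that both components help, with context mattering most for the PC metric: removing the "AI-generated" context drops PC ROC-AUC from 75.1 to 65.6, whereas removing CoT drops it only to 73.6.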
| Auto-Eval Method | SA ROC-AUC | PC ROC-AUC | Description |
|---|---|---|---|
| Qwen-VL-2.0 | 72.4 | 59.8 | Weak physical judgment |
| Gemini-2.0-Flash | 74.6 | 60.9 | Struggles with fine-grained physics |
| GPT-4o | 72.1 | 60.1 | Suboptimal PC without CAP |
| GPT-o1 | 75.4 | 61.6 | Strong baseline, low PC commonsense |
| CAP w/o CoT | 76.3 | 73.6 | PC improves significantly with "AI video" context |
| CAP w/o Context | 77.3 | 65.6 | Reasoning steps help, but limited without context |
| CAP | 80.3 | 75.1 | Context + Two-step reasoning is best |
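For readers unfamiliar with the metric, ROC-AUC measures how often an evaluator ranks a human-approved video above a human-rejected one. The sketch below is a plain pairwise implementation plus the ablation arithmetic from the table. How the paper derives evaluator scores is not stated in this note, so the toy labels and scores are assumptions; the table reports ROC-AUC on a 0 to 100 scale, while `roc_auc` returns a value in [0, 1].

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise ROC-AUC: P(score_pos > score_neg), counting ties as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: human labels vs. hypothetical evaluator scores.
human = [1, 1, 0, 0]
evaluator = [0.9, 0.4, 0.6, 0.1]
auc = roc_auc(human, evaluator)
print(auc)  # 0.75

# Ablation numbers from the table above: (SA ROC-AUC, PC ROC-AUC).
ABLATION = {
    "CAP": (80.3, 75.1),
    "CAP w/o CoT": (76.3, 73.6),
    "CAP w/o Context": (77.3, 65.6),
}
full_sa, full_pc = ABLATION["CAP"]
pc_drop = {
    name: round(full_pc - pc, 1)
    for name, (_, pc) in ABLATION.items()
    if name != "CAP"
}
print(pc_drop)  # {'CAP w/o CoT': 1.5, 'CAP w/o Context': 9.5}
```

The drop in PC is roughly six times larger when the context is removed than when the reasoning step is removed, which is the quantitative basis for treating the "AI-generated" context as the main contributor.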
Prompt type experiments indicate that explicit physical descriptions are more helpful than stylistic ones.
| Model | Event Prompt | Physics-enhanced Prompt | Detailed Prompt | Observation |
|---|---|---|---|---|
| CogVideoX-1.5 | 0.123 | 0.177 | 0.168 | Physics-enhancement significantly improves success |
| Hunyuan 720p | 0.159 | 0.198 | 0.155 | Detailed narrative is less effective than physics-enhancement |
| Open-Sora 2.0 | 0.167 | 0.177 | 0.173 | Consistent but smaller improvement |
| Wanx-2.1 | 0.175 | 0.202 | 0.190 | Physics-enhancement is best |
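The gains over the Event Prompt baseline can be read off the table with simple arithmetic. The snippet below uses only the numbers reported above; the dictionary layout and key names are illustrative.

```python
# (event, physics_enhanced, detailed) success rates from the table above.
RESULTS = {
    "CogVideoX-1.5": (0.123, 0.177, 0.168),
    "Hunyuan 720p": (0.159, 0.198, 0.155),
    "Open-Sora 2.0": (0.167, 0.177, 0.173),
    "Wanx-2.1": (0.175, 0.202, 0.190),
}


def gains(event: float, physics: float, detailed: float) -> dict:
    """Absolute gain of each richer prompt type over the Event Prompt."""
    return {
        "physics_gain": round(physics - event, 3),
        "detailed_gain": round(detailed - event, 3),
    }


table = {model: gains(*rates) for model, rates in RESULTS.items()}
for model, g in table.items():
    print(f"{model}: physics {g['physics_gain']:+.3f}, detailed {g['detailed_gain']:+.3f}")

# Physics-enhanced prompts beat detailed narrative prompts for every listed model.
physics_always_better = all(g["physics_gain"] > g["detailed_gain"] for g in table.values())
print(physics_always_better)  # True
```

For Hunyuan 720p the detailed narrative prompt ends up slightly below the Event Prompt baseline, so added narrative detail is not guaranteed to help.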
Key Findings
- Visual quality and physical correctness in T2V models are decoupled. High-fidelity visuals do not imply physically accurate dynamics.
- Closed-source models lead overall, but open-source models (e.g., Wanx-2.1, Hunyuan) are competitive in physical correctness.
- Anti-Physics is the hardest category; models rationalize counter-physical prompts back into realistic scenes.
- Prompt engineering mitigates but does not solve the problem; narrative details cannot fill the gap in deep physical capability.
- Models exhibit a "smoothing evasion" tendency, favoring simple motions over complex, non-smooth dynamics like shattering or collision.
Highlights & Insights
- Anti-Physics distinguishes between physical adherence and instruction following: This logic can be transferred to other tasks like counter-perspectives in images or impossible events in world models.
- Decoupling SA and PC is practical: It prevents semantic alignment errors from being misread as failures of physical understanding.
- CAP addresses distribution bias: Calibrating the MLLM evaluation context to "AI-generated" prevents it from finding excuses for anomalies.
- Physical descriptions in prompts work: Users should describe physical consequences (acceleration, bounce, shadow changes) rather than just atmosphere.
Limitations & Future Work
- Binary Yes/No standards are coarse: Future work could introduce graded scoring to distinguish "slightly unnatural" from "completely broken."
- Subjectivity in human annotation: Agreement on complex physical visibility may vary.
- Cinematic bias in CAP: High-quality styles (e.g., Kling) might still trick MLLMs into higher scores.
- Black-box evaluation: Without access to internal model details, it is difficult to localize the exact causes of failure.
- Future Direction: Integrate the benchmark with physics engines and 3D reconstruction to obtain measurable dynamical consistency.
Related Work & Insights
- vs VBench / EvalCrafter: PhyWorldBench focuses far more deeply on physical categories and standards than these general quality benchmarks.
- vs VideoPhy / PhyGenBench: Larger scale, plus the inclusion of Anti-Physics for counterfactual control.
- Inspiration for R&D: Resolution and aesthetics are insufficient; models need better dynamic representation, object persistence, and contact modeling.
- Evaluation Insight: When evaluating AI videos, MLLM prompts should explicitly state the source to avoid "reality rationalization" by the evaluator.
Rating
- Novelty: ⭐⭐⭐⭐☆ Built on existing ideas but distinguished by Anti-Physics and the CAP evaluator.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Solid scale with 12 models and 12,600 videos via dual evaluation.
- Writing Quality: ⭐⭐⭐⭐☆ Clear logic and informative figures.
- Value: ⭐⭐⭐⭐⭐ Directly useful for diagnosing world models and designing prompts for physical consistency.