PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=rlZeILv3fm
Code: To be confirmed
Area: Video Generation / Physical Realism Evaluation
Keywords: Text-to-Video Generation, Physical Realism, Benchmark, Multimodal Evaluation, Anti-Physics Scenes

TL;DR

PhyWorldBench constructs a large-scale benchmark covering 50 physical sub-phenomena, 1,050 prompts, and 12 mainstream text-to-video models. Using human evaluation and a context-aware MLLM evaluator system, it reveals significant shortcomings of current video generation models in realistic physics, complex interactions, and anti-physics instruction following.

Background & Motivation

Background: Text-to-video generation models have progressed rapidly in visual quality, subject consistency, and cinematic language over the past two years. Models such as Sora, Kling, Pika, Gen-3, Hunyuan, Wanx, and Open-Sora can generate visually appealing videos. However, visual realism does not equate to physical realism: an apple may look sharp with beautiful lighting, but its falling trajectory, collision, velocity changes, and fragmentation process may still completely defy real-world laws.

Limitations of Prior Work: Existing T2V benchmarks often focus on image quality, text alignment, temporal consistency, or composition. Physics-related benchmarks usually cover only a few physical categories or concentrate on a small area of motion/dynamic common sense. This leads to two blind spots: first, models might perform well on simple motions but fail in complex physics like fluids, rigid bodies, energy conservation, scale effects, and human/animal motion; second, it is difficult to distinguish whether a model truly understands physical laws or is just replicating common visual patterns from training data.

Key Challenge: The training objectives of video generation models lean towards pixel distribution, semantic alignment, and visual continuity, whereas physical laws are a set of constraints across time, objects, and scales. Models can "look like a video" through smooth motion, cinematic composition, or rationalized prompts, but they do not necessarily handle sudden fragmentation, multi-force interactions, anti-gravity, or energy transformation scenarios that require physical causality.

Goal: The authors aim to build a sufficiently broad, detailed, and reproducible physical realism evaluation benchmark. It must cover basic and composite physics while including "Anti-Physics" scenarios that intentionally violate reality. It should provide not only prompts but also clear Yes/No standards, utilizing both large-scale human evaluation and low-cost MLLM-based approximations.

Key Insight: PhyWorldBench decomposes "physical realism" into checkable standards: whether the correct objects and events exist, and whether physical phenomena occur according to realistic laws. This decomposition is more direct than subjective quality scoring and separates semantic failures from physical failures.

Core Idea: Establish a systematic benchmark for interrogating the physical realism of text-to-video models using structured physical categories, three prompt variants, human annotation standards, and a context-aware MLLM evaluator.

Method

Overall Architecture

PhyWorldBench is not a new video generation model but an evaluation system that combines a dataset, an evaluation protocol, an automated evaluator, and model diagnosis. The inputs are text prompts designed around physical phenomena; the outputs are videos generated by various T2V models. The evaluation phase first checks semantic alignment and then verifies whether the video meets the key physical standards attached to the prompt.

The workflow consists of four steps: defining physical categories based on literature and expert consensus; constructing multiple prompts for each sub-category; writing Basic Standards and Key Standards for each prompt; and performing human evaluation on 12,600 videos from 12 models while validating a zero-shot MLLM evaluation scheme (CAP) to reduce future costs.

Key Designs

1. Hierarchical Physical Taxonomy: Decomposing physical realism into 50 measurable sub-categories

The most significant engineering effort is the systematic cataloging of physical reality. PhyWorldBench defines 10 main physical categories, each further divided into 5 sub-categories, totaling 50 sub-classes. Categories include Object Motion and Kinematics, Interaction Dynamics, Energy Conservation, Fluid and Particle Dynamics, Rigid Body Dynamics, Lighting and Shadows, Deformations and Elasticity, Scale and Proportions, Human and Animal Motion, and Anti-Physics.

This design ensures the benchmark goes beyond low-dimensional scenarios like "rolling balls." For instance, rigid body dynamics involves rotation, torque, balance, center of mass, impact, and deformation; fluid dynamics involves water, smoke, buoyancy, viscosity, and particle behavior. This prevents models from hiding weaknesses by simply memorizing common video snippets.

2. Three Prompt Variants: Distinguishing between lack of physical knowledge and lack of prompt information

Each sub-category features 7 scenarios, each with three prompt types: Event Prompt, Physics-enhanced Prompt, and Detailed Narrative Prompt. The Event Prompt is a minimal description (e.g., "A rocket launching"); the Physics-enhanced Prompt adds natural physical consequences (e.g., "moving vertically upward in a straight line"); the Detailed Narrative Prompt adds environmental, lighting, and narrative details.

This separates "physical understanding" from "prompt clarity." If Physics-enhanced Prompts significantly improve results, it suggests the model has prompt-following capabilities but requires explicit physical descriptions. If Detailed Narrative Prompts do not improve physical correctness, it suggests that cinematic details primarily improve surface appearance rather than underlying physics.
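The counts quoted so far (10 categories, 5 sub-categories each, 7 scenarios per sub-category, 3 prompt variants, 12 models) can be cross-checked with a short enumeration. This is a minimal sketch: the category names come from the text above, while the constant names and the tuple layout are illustrative assumptions, not the benchmark's actual data format.

```python
from itertools import product

# Category names as listed in the taxonomy description.
CATEGORIES = [
    "Object Motion and Kinematics",
    "Interaction Dynamics",
    "Energy Conservation",
    "Fluid and Particle Dynamics",
    "Rigid Body Dynamics",
    "Lighting and Shadows",
    "Deformations and Elasticity",
    "Scale and Proportions",
    "Human and Animal Motion",
    "Anti-Physics",
]
SUBCATEGORIES_PER_CATEGORY = 5   # stated in the text
SCENARIOS_PER_SUBCATEGORY = 7    # stated in the text
PROMPT_TYPES = ["event", "physics_enhanced", "detailed_narrative"]
NUM_MODELS = 12                  # stated in the text


def enumerate_prompt_slots():
    """Return one (category, subcategory_idx, scenario_idx, prompt_type) tuple per prompt."""
    return list(
        product(
            CATEGORIES,
            range(SUBCATEGORIES_PER_CATEGORY),
            range(SCENARIOS_PER_SUBCATEGORY),
            PROMPT_TYPES,
        )
    )


slots = enumerate_prompt_slots()
num_subcategories = len(CATEGORIES) * SUBCATEGORIES_PER_CATEGORY
num_prompts = len(slots)
num_videos = num_prompts * NUM_MODELS

print(num_subcategories, num_prompts, num_videos)  # 50 1050 12600
```

The product 50 x 7 x 3 reproduces the 1,050 prompts, and 1,050 x 12 reproduces the 12,600 evaluated videos.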

3. Basic Standards and Key Standards: Transforming subjective perception into votable Yes/No criteria

Instead of vague scores, PhyWorldBench uses two levels of standards for each prompt. Basic Standards check for the presence of objects and events (e.g., are there "two football players" and did they "collide"). Key Standards check for key physical phenomena (e.g., natural contact points, smooth momentum, and directional changes after collision).

The paper uses two core metrics: Semantic Adherence (SA) and Physical Commonsense (PC). \(SA=1\) means objects/events match the text; \(PC=1\) means the physical phenomena align with reality. The final success rate, denoted as "Both," requires \(SA=1\) and \(PC=1\).
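As a minimal sketch of how the three rates relate, the snippet below computes SA, PC, and Both from per-video binary labels. The `VideoLabel` class, the function name, and the toy labels are hypothetical and only illustrate the definitions given above.

```python
from dataclasses import dataclass


@dataclass
class VideoLabel:
    sa: int  # 1 if objects/events match the text, else 0
    pc: int  # 1 if the physical phenomena align with reality, else 0


def success_rates(labels):
    """Return the fraction of videos with SA=1, PC=1, and both at once."""
    if not labels:
        raise ValueError("labels must not be empty")
    n = len(labels)
    sa_rate = sum(v.sa for v in labels) / n
    pc_rate = sum(v.pc for v in labels) / n
    # "Both" counts only videos that pass SA and PC simultaneously.
    both_rate = sum(1 for v in labels if v.sa == 1 and v.pc == 1) / n
    return {"SA": sa_rate, "PC": pc_rate, "Both": both_rate}


# Toy labels, not benchmark data.
toy_labels = [VideoLabel(1, 1), VideoLabel(1, 0), VideoLabel(0, 1), VideoLabel(0, 0)]
rates = success_rates(toy_labels)
print(rates)  # {'SA': 0.5, 'PC': 0.5, 'Both': 0.25}
```

Because Both is a conjunction, it can never exceed the smaller of SA and PC, which is consistent with every row of the main results table.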

4. Anti-Physics Category: Testing physical understanding vs. training distribution rationalization

The Anti-Physics category probes whether a model follows instructions or merely reproduces its training distribution. The authors design prompts for "anti-gravity," "energy from nothing," "object clipping," "time reversal," and "infinite replication." Although these prompts seem to encourage unreality, the purpose is diagnostic: if a prompt asks for "anti-gravity wine in a glass" and the model generates a static glass or normal liquid, the model is not following the instruction but rationalizing the input into a common training distribution.

5. CAP Automated Evaluator: Informing the MLLM that "this is an AI-generated video" to reduce excuses

To reduce the cost of human evaluation, the paper proposes the Context-Aware Prompt (CAP) as a zero-shot evaluator. MLLMs tend to treat the frames they see as real-world footage and therefore rationalize physical errors, assuming there must be a reason for what they observe. Explicitly telling the MLLM that the frames come from an AI-generated video and may have quality issues makes it more sensitive to physical errors.

CAP uses a two-step process: first, the MLLM describes objects, events, and observed phenomena; second, it answers Yes/No based on standards. CAP achieved 80.3 ROC-AUC on SA and 75.1 on PC, significantly outperforming GPT-o1 (75.4 / 61.6).
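The two-step flow can be sketched as plain prompt construction. Everything below is an assumption for illustration: the wording of `CONTEXT_NOTE`, the function names, and the Yes/No parser are hypothetical and are not the paper's actual prompts; no MLLM API is called. The `use_context` flag mirrors the "w/o Context" ablation setting.

```python
# Hypothetical wording; the paper's exact context sentence is not given here.
CONTEXT_NOTE = (
    "The frames below come from an AI-generated video and may contain "
    "quality issues or physically implausible content."
)


def build_describe_prompt(use_context=True):
    """Step 1: ask the MLLM to describe what it observes."""
    parts = [CONTEXT_NOTE] if use_context else []
    parts.append(
        "Describe the objects, events, and physical phenomena you observe in the frames."
    )
    return "\n".join(parts)


def build_judge_prompt(description, standards, use_context=True):
    """Step 2: ask for a Yes/No answer per standard, given the step-1 description."""
    parts = [CONTEXT_NOTE] if use_context else []
    parts.append("Observed description:\n" + description)
    parts.append("Answer Yes or No for each standard:")
    parts.extend(f"{i}. {s}" for i, s in enumerate(standards, start=1))
    return "\n".join(parts)


def parse_yes_no(answer):
    """Map a free-text answer to 1 (Yes) or 0 (No); reject anything else."""
    token = answer.strip().lower()
    if token.startswith("yes"):
        return 1
    if token.startswith("no"):
        return 0
    raise ValueError(f"cannot parse answer: {answer!r}")


# Standards paraphrased from the football example in the text.
judge_prompt = build_judge_prompt(
    description="Two players run toward each other and make contact.",
    standards=["Two football players are present.", "The players collide."],
)
print(judge_prompt)
```

Separating description from judgment keeps the second call grounded in what the model said it saw, rather than in the original text prompt alone.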

Key Experimental Results

Main Results

The paper evaluates 12 models (5 closed-source, 7 open-source). Each generated 1,050 videos (12,600 total), evaluated by 3 annotators via Amazon Mechanical Turk.
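This summary does not state how the three annotators' answers are combined into a single label. One common choice is a majority vote, sketched below purely as an assumption; the function names are hypothetical.

```python
def majority_vote(votes):
    """Return 1 if more than half of the binary votes are 1, else 0."""
    if not votes or len(votes) % 2 == 0:
        raise ValueError("expected a non-empty, odd number of votes")
    if any(v not in (0, 1) for v in votes):
        raise ValueError("votes must be 0 or 1")
    return int(sum(votes) * 2 > len(votes))


def aggregate_video(sa_votes, pc_votes):
    """Combine per-annotator SA and PC votes into per-video labels (assumed rule)."""
    sa = majority_vote(sa_votes)
    pc = majority_vote(pc_votes)
    return {"SA": sa, "PC": pc, "Both": int(sa == 1 and pc == 1)}


# Toy votes from three annotators, not benchmark data.
example = aggregate_video(sa_votes=[1, 1, 0], pc_votes=[0, 1, 0])
print(example)  # {'SA': 1, 'PC': 0, 'Both': 0}
```

With an odd number of annotators and binary criteria, a majority vote never ties, which is one reason three annotators is a convenient setup.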

Model	Type	Overall SA	Overall PC	Overall Both	Key Conclusion
Pika 2.0	Closed	0.521	0.314	0.262	Best overall in human eval; Anti-Physics Both is only 0.011
Sora-Turbo	Closed	0.384	0.261	0.208	Strong physical correctness; 2nd among closed-source
Kling-1.6	Closed	0.357	0.241	0.188	Strong visuals; cinematic style may mask physical errors
Wanx-2.1	Open	0.339	0.235	0.189	Best "Both" among open-source; comparable to closed-source
Hunyuan 720p	Open	0.344	0.250	0.185	Best PC among open-source; stable physical commonsense
LTX-Video	Open	0.194	0.085	0.062	Weakest overall; frequent low-fidelity and deformation

Absolute scores remain low. Even for Pika 2.0, "Both" is only 0.262, meaning only about a quarter of videos satisfy both semantic and physical correctness. Most models approach 0 for "Both" in Anti-Physics.

Benchmark	Physics Categories	Prompts	Conclusion
VideoPhy	5	688	Narrower coverage
PhyGenBench	27	160	More categories, far fewer prompts
Physics-IQ	5	396	Focus on basic understanding
T2VPhysBench	3	84	Small-scale consistency test
PhyWorldBench	50	1050	Widest coverage, highest difficulty

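Ablation Study

Ablating the CAP evaluator shows that both components contribute, with context mattering most for the PC metric: removing the context drops PC ROC-AUC from 75.1 to 65.6, while removing CoT drops it only to 73.6. For SA the pattern is milder and reversed: removing CoT gives 76.3 and removing context gives 77.3, against 80.3 for full CAP.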

Auto-Eval Method	SA ROC-AUC	PC ROC-AUC	Description
Qwen-VL-2.0	72.4	59.8	Weak physical judgment
Gemini-2.0-Flash	74.6	60.9	Struggles with fine-grained physics
GPT-4o	72.1	60.1	Suboptimal PC without CAP
GPT-o1	75.4	61.6	Strong baseline, low PC commonsense
CAP w/o CoT	76.3	73.6	PC improves significantly with "AI video" context
CAP w/o Context	77.3	65.6	Reasoning steps help, but limited without context
CAP	80.3	75.1	Context + Two-step reasoning is best
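To make the contribution of each component explicit, the snippet below computes how far each ablated variant falls below full CAP. The numbers are copied from the table above; the dictionary and function names are illustrative.

```python
# (SA ROC-AUC, PC ROC-AUC) taken from the ablation table.
ROC_AUC = {
    "GPT-o1": (75.4, 61.6),
    "CAP w/o CoT": (76.3, 73.6),
    "CAP w/o Context": (77.3, 65.6),
    "CAP": (80.3, 75.1),
}


def drop_from_full(variant):
    """Return (SA drop, PC drop) of a variant relative to full CAP, in ROC-AUC points."""
    full_sa, full_pc = ROC_AUC["CAP"]
    sa, pc = ROC_AUC[variant]
    return round(full_sa - sa, 1), round(full_pc - pc, 1)


for variant in ("CAP w/o CoT", "CAP w/o Context", "GPT-o1"):
    print(variant, drop_from_full(variant))
# CAP w/o CoT (4.0, 1.5)
# CAP w/o Context (3.0, 9.5)
# GPT-o1 (4.9, 13.5)
```

Removing the "AI-generated video" context costs 9.5 PC points versus 1.5 for removing the two-step reasoning, so the context accounts for most of the PC gain.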

Prompt type experiments indicate that explicit physical descriptions are more helpful than stylistic ones.

Model	Event Prompt	Physics-enhanced Prompt	Detailed Prompt	Observation
CogVideoX-1.5	0.123	0.177	0.168	Physics-enhancement significantly improves success
Hunyuan 720p	0.159	0.198	0.155	Detailed narrative is less effective than physics-enhancement
Open-Sora 2.0	0.167	0.177	0.173	Consistent but smaller improvement
Wanx-2.1	0.175	0.202	0.190	Physics-enhancement is best
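The gains over the Event Prompt can be read off the table directly. The sketch below uses the table's values as given; the variable and function names are illustrative.

```python
# (Event, Physics-enhanced, Detailed) scores copied from the table.
SCORES = {
    "CogVideoX-1.5": (0.123, 0.177, 0.168),
    "Hunyuan 720p": (0.159, 0.198, 0.155),
    "Open-Sora 2.0": (0.167, 0.177, 0.173),
    "Wanx-2.1": (0.175, 0.202, 0.190),
}


def gains_over_event(model):
    """Return (physics-enhanced gain, detailed-narrative gain) over the Event Prompt."""
    event, physics, detailed = SCORES[model]
    return round(physics - event, 3), round(detailed - event, 3)


for model in SCORES:
    print(model, gains_over_event(model))
# CogVideoX-1.5 (0.054, 0.045)
# Hunyuan 720p (0.039, -0.004)
# Open-Sora 2.0 (0.01, 0.006)
# Wanx-2.1 (0.027, 0.015)
```

For all four models the physics-enhanced gain exceeds the detailed-narrative gain, and for Hunyuan 720p the Detailed Prompt falls slightly below the Event Prompt.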

Key Findings

Visual quality and physical correctness in T2V models are decoupled. High-fidelity visuals do not imply physically faithful dynamics.
Closed-source models lead overall, but open-source models (e.g., Wanx-2.1, Hunyuan) are competitive in physical correctness.
Anti-Physics is the hardest category; models rationalize counter-physical prompts back into realistic scenes.
Prompt engineering mitigates but does not solve the problem; narrative details cannot fill the gap in deep physical capability.
Models exhibit a "smoothing evasion" tendency, favoring simple motions over complex, non-smooth dynamics like shattering or collision.

Highlights & Insights

Anti-Physics distinguishes between physical adherence and instruction following: This logic can be transferred to other tasks like counter-perspectives in images or impossible events in world models.
Decoupling SA and PC is practical: It prevents semantic alignment errors from being misread as failures of physical understanding.
CAP addresses distribution bias: Calibrating the MLLM evaluation context to "AI-generated" prevents it from finding excuses for anomalies.
Physical descriptions in prompts work: Users should describe physical consequences (acceleration, bounce, shadow changes) rather than just atmosphere.

Limitations & Future Work

Binary Yes/No standards are coarse: Future work could introduce graded scoring to distinguish "slightly unnatural" from "completely broken."
Subjectivity in human annotation: Agreement on complex physical visibility may vary.
Cinematic bias in CAP: High-quality styles (e.g., Kling) might still trick MLLMs into higher scores.
Black-box evaluation: Without access to internal model details, it is difficult to localize the exact causes of failure.
Future Direction: Integrate the benchmark with physics engines and 3D reconstruction to measure dynamical consistency quantitatively.

vs VBench / EvalCrafter: PhyWorldBench provides much deeper focus on physical categories and standards compared to general quality benchmarks.
vs VideoPhy / PhyGenBench: Larger scale (50 sub-categories, 1,050 prompts) and the inclusion of Anti-Physics for counterfactual control.
Inspiration for R&D: Resolution and aesthetics are insufficient; models need better dynamic representation, object persistence, and contact modeling.
Evaluation Insight: When evaluating AI videos, MLLM prompts should explicitly state the source to avoid "reality rationalization" by the evaluator.

Rating

Novelty: ⭐⭐⭐⭐☆ Built on existing ideas but distinguished by Anti-Physics and the CAP evaluator.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Solid scale with 12 models and 12,600 videos via dual evaluation.
Writing Quality: ⭐⭐⭐⭐☆ Clear logic and informative figures.
Value: ⭐⭐⭐⭐⭐ Directly useful for diagnosing world models and designing prompts for physical consistency.