OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Conference: ACL 2026
arXiv: 2603.11698
Code: Project Page
Area: Video Generation
Keywords: Text-to-Video, Object State Change, Evaluation Benchmark, Cooking Scenarios, Multimodal Evaluation

TL;DR

This paper proposes OSCBench, the first benchmark dedicated to evaluating how well text-to-video (T2V) models render object state changes (OSC). Built on cooking scenarios, its 1,120 prompts cover conventional, novel, and compositional cases. Even the strongest model evaluated, Veo-3.1-Fast, reaches only 0.786 OSC accuracy (human evaluation, normalized to 0-1).

Background & Motivation

Background: T2V models have made significant progress in visual quality and temporal consistency. Existing benchmarks primarily evaluate perceptual quality, text-video alignment, or physical plausibility.

Limitations of Prior Work: Existing benchmarks overlook a critical dimension of action understanding: the object state changes that a text prompt explicitly specifies (e.g., peeling potatoes, slicing lemons). A T2V model may match the prompt at a high semantic level yet generate incorrect, incomplete, or inconsistent state changes.

Key Challenge: High-quality visual appearance masks deficiencies in action consequence modeling — videos look realistic but objects do not correctly change state.

Goal: Construct a systematic OSC evaluation benchmark to diagnose specific deficiencies in T2V models' state change modeling.

Key Insight: Select cooking scenarios as the evaluation domain (frequent, diverse, well-defined state changes), and design conventional/novel/compositional scenarios to test different capability levels.

Core Idea: Decompose OSC evaluation into state change accuracy and state change consistency sub-dimensions, paired with CoT-guided MLLM automated evaluation.

Method

Overall Architecture

OSCBench starts from the HowToChange dataset. Through human-machine collaboration, its 20 actions and 134 objects are abstracted into 9 action categories and 8 object categories (28 subcategories). From these, three types of OSC scenarios are constructed (108 conventional, 20 novel, 12 compositional), each with 8 action-object combinations, for a total of (108 + 20 + 12) × 8 = 1,120 prompts. Evaluation covers four dimensions: semantic adherence, OSC performance, scene alignment, and perceptual quality.
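
The prompt count follows directly from the scenario counts. The short sketch below checks that arithmetic. The names `SCENARIO_COUNTS` and `PROMPTS_PER_SCENARIO` are illustrative, and reading "8 action-object combinations" as 8 prompts per scenario is an assumption that the reported total happens to support.

```python
# Scenario counts as reported in the summary above.
SCENARIO_COUNTS = {"conventional": 108, "novel": 20, "compositional": 12}

# Assumption: each scenario contributes 8 prompts (one per action-object combination).
PROMPTS_PER_SCENARIO = 8


def total_prompts(counts, per_scenario):
    """Total number of prompts across all scenario types."""
    return sum(counts.values()) * per_scenario


def prompts_by_type(counts, per_scenario):
    """Number of prompts contributed by each scenario type."""
    return {name: n * per_scenario for name, n in counts.items()}


if __name__ == "__main__":
    print(total_prompts(SCENARIO_COUNTS, PROMPTS_PER_SCENARIO))
    print(prompts_by_type(SCENARIO_COUNTS, PROMPTS_PER_SCENARIO))
```

Under this reading, conventional scenarios account for 864 of the 1,120 prompts, so aggregate scores are dominated by the conventional split unless results are reported per scenario type.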

Key Designs

Three Types of OSC Scenario Design:
- Function: Probe model OSC capabilities from different dimensions
- Mechanism: Conventional scenarios cover common action-object combinations (e.g., slicing lemons) testing basic capabilities; novel scenarios use uncommon but physically plausible combinations (e.g., mashing grapefruit) testing generalization; compositional scenarios involve consecutive multi-action sequences (e.g., first peel then slice) testing temporal consistency
- Design Motivation: Distinguish memorization from reasoning — conventional scenarios can be solved through memorization, while novel scenarios require inferring state changes from action semantics

CoT-Guided MLLM Evaluation:
- Function: Automated, scalable fine-grained OSC evaluation
- Mechanism: Rather than treating the MLLM as a black-box scorer, a CoT strategy guides it through three reasoning stages (standard grounding → evidence extraction → score argumentation), which yields more reliable state change judgments
- Design Motivation: OSC evaluation requires multi-step reasoning (determining whether objects reach the correct target state and whether the change process is smooth); simple scoring is insufficient
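
The three-stage reasoning flow can be sketched as a prompt template plus a score parser. This is a minimal illustration, not the authors' prompt: the wording of each step, the `Score: N` output convention, and the function names `build_cot_prompt` and `parse_score` are all assumptions. No MLLM API is called here.

```python
import re


def build_cot_prompt(prompt_text, dimension):
    """Compose a three-stage CoT instruction for an MLLM judge (hypothetical wording)."""
    return "\n".join([
        f"Video prompt: {prompt_text}",
        f"Dimension to score: {dimension}",
        "Step 1 (standard grounding): state what a correct result looks like for this prompt.",
        "Step 2 (evidence extraction): describe what the sampled frames actually show.",
        "Step 3 (score argumentation): compare the evidence with the standard and justify a score.",
        "Finish with a line of the form 'Score: N', where N is an integer from 1 to 5.",
    ])


def parse_score(response):
    """Return the last 'Score: N' value (1-5) in the response, or None if there is none."""
    matches = re.findall(r"Score:\s*([1-5])\b", response)
    return int(matches[-1]) if matches else None


if __name__ == "__main__":
    print(build_cot_prompt("a person slicing a lemon", "OSC accuracy"))
    print(parse_score("The lemon ends up in slices. Score: 4"))
```

Taking the last `Score:` line lets the judge revise an earlier tentative score during its argumentation, and returning `None` makes malformed responses easy to filter or retry.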

Multi-Dimensional Evaluation System:
- Function: Comprehensively diagnose various T2V model capabilities
- Mechanism: Semantic adherence (subject/object/action three-way alignment), OSC performance (state change accuracy + consistency), scene alignment, and perceptual quality (realism + aesthetics). Each item uses a 1-5 Likert scale; human evaluation takes the average of three annotators
- Design Motivation: OSC failure may originate from multiple stages, requiring dimension-by-dimension diagnosis
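
To make the scoring pipeline concrete, the sketch below averages three annotators' 1-5 Likert ratings for one item and maps the result onto a 0-1 range. The summary does not say how the paper normalizes scores, so the min-max mapping `(score - 1) / 4` is an assumption, as are the function names.

```python
from statistics import mean


def normalize_likert(score, low=1, high=5):
    """Map a Likert score onto [0, 1] by min-max scaling (assumed, not confirmed by the source)."""
    return (score - low) / (high - low)


def aggregate_item(annotator_scores):
    """Average the annotators' ratings for one item, then normalize the mean."""
    return normalize_likert(mean(annotator_scores))


if __name__ == "__main__":
    # Three annotators rate the same video on one dimension.
    print(round(aggregate_item([5, 4, 4]), 3))
```

If the paper instead divides by the scale maximum, the lowest possible normalized score would be 0.2 rather than 0, so the mapping matters when comparing against the reported numbers.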

Loss & Training

This is an evaluation work; no model training is involved.

Key Experimental Results

Main Results (Human Evaluation, Normalized 0-1)

| Model | Subject Align. | Object Align. | Action Align. | OSC Accuracy | OSC Consistency | Realism |
|---|---|---|---|---|---|---|
| Veo-3.1-Fast | 0.936 | 0.916 | 0.908 | 0.786 | 0.748 | Highest |
| Kling-2.5-Turbo | 0.938 | 0.900 | 0.826 | 0.726 | 0.726 | 0.732 |
| Wan-2.2 | 0.904 | 0.842 | 0.616 | 0.560 | 0.668 | 0.702 |
| HunyuanVideo-1.5 | 0.914 | 0.902 | 0.656 | 0.524 | 0.608 | 0.618 |
| Open-Sora-2.0 | 0.860 | 0.734 | 0.518 | 0.380 | 0.428 | 0.416 |

Key Findings

- All models score above 0.73 on subject and object alignment, but OSC accuracy and consistency are markedly lower (0.380 to 0.786 and 0.428 to 0.748, respectively)
- The strongest model, Veo-3.1-Fast, achieves only 0.786 OSC accuracy, indicating that state change modeling is a key bottleneck for T2V
- Models perform worse on novel and compositional scenarios than on conventional ones, which points to limited generalization
- Closed-source models (Veo/Kling) clearly outperform open-source models, and the gap is especially pronounced in the OSC dimension
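
The closed-versus-open comparison can be checked against the main results table. The sketch below copies the table's values and compares group means. The closed/open labels follow the findings above (Veo and Kling closed, the rest open), and the helper name `group_mean` is illustrative.

```python
# (model, source type, subject align., object align., action align., OSC accuracy, OSC consistency)
ROWS = [
    ("Veo-3.1-Fast", "closed", 0.936, 0.916, 0.908, 0.786, 0.748),
    ("Kling-2.5-Turbo", "closed", 0.938, 0.900, 0.826, 0.726, 0.726),
    ("Wan-2.2", "open", 0.904, 0.842, 0.616, 0.560, 0.668),
    ("HunyuanVideo-1.5", "open", 0.914, 0.902, 0.656, 0.524, 0.608),
    ("Open-Sora-2.0", "open", 0.860, 0.734, 0.518, 0.380, 0.428),
]

SUBJECT, OSC_ACC = 2, 5  # column indices into each row


def group_mean(rows, source, col):
    """Mean of one column over all models of the given source type."""
    values = [row[col] for row in rows if row[1] == source]
    return sum(values) / len(values)


def gap(rows, col):
    """Closed-source mean minus open-source mean for one column."""
    return group_mean(rows, "closed", col) - group_mean(rows, "open", col)


if __name__ == "__main__":
    print("subject alignment gap:", round(gap(ROWS, SUBJECT), 3))
    print("OSC accuracy gap:", round(gap(ROWS, OSC_ACC), 3))
```

On these numbers the closed-source and open-source means for OSC accuracy are about 0.756 and 0.488, a gap several times larger than the one for subject alignment, which is consistent with the last finding.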

Highlights & Insights

- The OSC perspective fills an important gap in T2V evaluation: a generated action should not only involve motion but also produce the correct object state change
- The three-scenario design separates memorization from reasoning capabilities
- CoT-guided MLLM evaluation shows high correlation with human evaluation, providing a feasible path for large-scale automated OSC evaluation
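
Agreement between automated and human scores is typically quantified with a rank correlation. The sketch below implements Spearman's coefficient in plain Python. The summary does not state which coefficient the authors report, so Spearman is an assumption chosen for illustration, and the example scores are made-up toy values, not results from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group spanning positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    human = [0.8, 0.4, 0.6, 0.9, 0.3]  # toy values, not from the paper
    mllm = [0.7, 0.5, 0.4, 0.95, 0.2]  # toy values, not from the paper
    print(round(spearman(human, mllm), 3))
```

A rank-based coefficient is a natural fit here because Likert-derived scores are ordinal, so what matters is whether the MLLM orders videos the same way human annotators do.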

Limitations & Future Work

- The benchmark focuses solely on the cooking domain; OSC evaluation in other domains (crafts, chemical experiments) remains future work
- It currently evaluates only single actions or two-step compositions; longer compositional action sequences pose greater challenges
- MLLM evaluation, while correlated with human judgment, is not a perfect substitute and may misjudge extreme failure cases

vs VBench: VBench focuses on overall video quality and lacks dedicated evaluation of object state changes
vs PhyWorldBench: PhyWorldBench focuses on physical plausibility (gravity, collisions), whereas OSCBench focuses on action consequence modeling
vs T2V-CompBench: T2V-CompBench evaluates compositional generation capability but does not address state change accuracy or temporal consistency

Rating

Novelty: ⭐⭐⭐⭐ First dedicated OSC benchmark, filling an important gap
Experimental Thoroughness: ⭐⭐⭐⭐ 6 models, dual human + automated evaluation
Writing Quality: ⭐⭐⭐⭐⭐ Problem definition is clear, evaluation system is complete
Value: ⭐⭐⭐⭐ Points to a key improvement direction for T2V research