OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Conference: ACL 2026
arXiv: 2603.11698
Code: Project Page
Area: Video Generation
Keywords: Text-to-Video, Object State Change, Evaluation Benchmark, Cooking Scenarios, Multimodal Evaluation

TL;DR

This paper proposes OSCBench — the first benchmark dedicated to evaluating object state change (OSC) capabilities in text-to-video (T2V) models. Built on cooking scenarios with 1,120 prompts covering conventional/novel/compositional scenarios, it reveals that even the strongest T2V model achieves only 0.786 OSC accuracy.

Background & Motivation

Background: T2V models have made significant progress in visual quality and temporal consistency. Existing benchmarks primarily evaluate perceptual quality, text-video alignment, or physical plausibility.

Limitations of Prior Work: Existing benchmarks overlook a critical dimension of action understanding — object state changes explicitly specified by text prompts (e.g., peeling potatoes, slicing lemons). T2V models may align well at the high-level semantic level but generate incorrect, incomplete, or inconsistent object state changes.

Key Challenge: High-quality visual appearance masks deficiencies in action consequence modeling — videos look realistic but objects do not correctly change state.

Goal: Construct a systematic OSC evaluation benchmark to diagnose specific deficiencies in T2V models' state change modeling.

Key Insight: Select cooking scenarios as the evaluation domain (frequent, diverse, well-defined state changes), and design conventional/novel/compositional scenarios to test different capability levels.

Core Idea: Decompose OSC evaluation into state change accuracy and state change consistency sub-dimensions, paired with CoT-guided MLLM automated evaluation.

Method

Overall Architecture

OSCBench starts from the HowToChange dataset. Through human-machine collaboration, its 20 actions and 134 objects are abstracted into 9 action categories and 8 object categories (28 subcategories), which are organized into three types of OSC scenario (conventional: 108, novel: 20, compositional: 12). Each scenario is instantiated with 8 action-object combinations, for 1,120 prompts in total. Evaluation covers four dimensions: semantic adherence, OSC performance, scene alignment, and perceptual quality.
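The scenario counts above multiply out to the stated prompt total; a quick sanity check (the counts are taken from this summary, not from the authors' code):

```python
# Scenario counts and the 8-combinations-per-scenario factor, as described above.
SCENARIOS = {"conventional": 108, "novel": 20, "compositional": 12}
COMBINATIONS_PER_SCENARIO = 8

total_scenarios = sum(SCENARIOS.values())                  # 140 OSC scenarios
total_prompts = total_scenarios * COMBINATIONS_PER_SCENARIO  # 1,120 prompts
print(total_scenarios, total_prompts)  # 140 1120
```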

Key Designs

  1. Three Types of OSC Scenario Design:

    • Function: Probe model OSC capabilities from different dimensions
    • Mechanism: Conventional scenarios cover common action-object combinations (e.g., slicing lemons) testing basic capabilities; novel scenarios use uncommon but physically plausible combinations (e.g., mashing grapefruit) testing generalization; compositional scenarios involve consecutive multi-action sequences (e.g., first peel then slice) testing temporal consistency
    • Design Motivation: Distinguish memorization from reasoning — conventional scenarios can be solved through memorization, while novel scenarios require inferring state changes from action semantics
  2. CoT-Guided MLLM Evaluation:

    • Function: Automated, scalable fine-grained OSC evaluation
    • Mechanism: Rather than using MLLMs as black-box scorers, a CoT strategy guides MLLMs through standard grounding → evidence extraction → score argumentation reasoning processes, providing more reliable state change judgments
    • Design Motivation: OSC evaluation requires multi-step reasoning (determining whether objects reach the correct target state and whether the change process is smooth); simple scoring is insufficient
  3. Multi-Dimensional Evaluation System:

    • Function: Comprehensively diagnose various T2V model capabilities
    • Mechanism: Semantic adherence (subject/object/action three-way alignment), OSC performance (state change accuracy + consistency), scene alignment, and perceptual quality (realism + aesthetics). Each item uses a 1-5 Likert scale; human evaluation takes the average of three annotators
    • Design Motivation: OSC failure may originate from multiple stages, requiring dimension-by-dimension diagnosis
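The CoT-guided evaluation in design 2 can be sketched as a structured prompt plus a score parser. This is a hypothetical illustration: the stage names (grounding, evidence extraction, score argumentation) follow the summary, but the exact prompt wording, score tags, and parsing logic are assumptions.

```python
import re

def build_osc_eval_prompt(prompt_text: str) -> str:
    """Hypothetical CoT prompt guiding an MLLM through the three evaluation stages."""
    return (
        "You will rate a generated video on object state change (OSC).\n"
        f"Text prompt: {prompt_text}\n"
        "Reason step by step:\n"
        "1. Grounding: identify the target object and the action in the video.\n"
        "2. Evidence extraction: describe the object's initial state, "
        "intermediate frames, and final state.\n"
        "3. Score argumentation: judge whether the final state matches the "
        "prompt (accuracy) and whether the transition is smooth (consistency).\n"
        "Finally output two scores from 1 (poor) to 5 (perfect):\n"
        "OSC_ACCURACY: <1-5>\n"
        "OSC_CONSISTENCY: <1-5>"
    )

def parse_scores(reply: str) -> dict:
    """Extract the two 1-5 Likert scores from the MLLM's reply."""
    scores = {}
    for key in ("OSC_ACCURACY", "OSC_CONSISTENCY"):
        m = re.search(rf"{key}:\s*([1-5])", reply)
        scores[key] = int(m.group(1)) if m else None
    return scores
```

Forcing the model to write out grounding and evidence before committing to a number is what distinguishes this from black-box scoring; the parser then only has to trust the final tagged lines.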

Loss & Training

This is an evaluation work; no model training is involved.

Key Experimental Results

Main Results (Human Evaluation, Normalized 0-1)

| Model | Subject Align. | Object Align. | Action Align. | OSC Accuracy | OSC Consistency | Realism |
| --- | --- | --- | --- | --- | --- | --- |
| Veo-3.1-Fast | 0.936 | 0.916 | 0.908 | 0.786 | 0.748 | Highest |
| Kling-2.5-Turbo | 0.938 | 0.900 | 0.826 | 0.726 | 0.726 | 0.732 |
| Wan-2.2 | 0.904 | 0.842 | 0.616 | 0.560 | 0.668 | 0.702 |
| HunyuanVideo-1.5 | 0.914 | 0.902 | 0.656 | 0.524 | 0.608 | 0.618 |
| Open-Sora-2.0 | 0.860 | 0.734 | 0.518 | 0.380 | 0.428 | 0.416 |
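The table reports 1-5 Likert ratings averaged over three annotators and normalized to 0-1. A minimal sketch of that normalization, assuming the standard linear mapping (x - 1) / 4 (the summary states only that scores are normalized, not the exact formula):

```python
def normalize_likert(ratings):
    """Average 1-5 Likert ratings from several annotators and map to [0, 1]."""
    mean = sum(ratings) / len(ratings)
    return (mean - 1) / 4  # 1 -> 0.0, 5 -> 1.0

# Three annotators rating OSC accuracy for one generated video:
score = normalize_likert([4, 4, 5])  # ~0.833
```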

Key Findings

  • All models perform well on subject/object alignment (>0.73) but show significantly lower OSC accuracy and consistency
  • Even the strongest model, Veo-3.1-Fast, achieves only 0.786 OSC accuracy, indicating that state change modeling is a key bottleneck for T2V
  • Novel and compositional scenarios perform worse than conventional scenarios, revealing insufficient generalization capability
  • Closed-source models (Veo/Kling) significantly outperform open-source models, with the gap being especially pronounced in the OSC dimension

Highlights & Insights

  • The OSC perspective fills an important gap in T2V evaluation — actions should not only involve motion but should also produce correct object state changes
  • The three-scenario design cleverly distinguishes memorization from reasoning capabilities
  • CoT-guided MLLM evaluation shows high correlation with human evaluation, providing a feasible path for large-scale automated OSC evaluation

Limitations & Future Work

  • Focuses solely on the cooking domain; OSC evaluation in other domains (crafts, chemical experiments) awaits expansion
  • Currently evaluates only single actions or two-step compositions; longer-sequence compositional actions pose greater challenges
  • MLLM evaluation, while correlated with human judgment, is not a perfect substitute and may misjudge extreme failure cases
Comparison with Related Benchmarks

  • vs VBench: evaluates overall video quality but lacks dedicated evaluation of object state changes
  • vs PhyWorldBench: focuses on physical plausibility (gravity, collisions), whereas OSCBench targets action consequence modeling
  • vs T2V-CompBench: evaluates compositional generation capability but does not address state change accuracy or temporal consistency

Rating

  • Novelty: ⭐⭐⭐⭐ First dedicated OSC benchmark, filling an important gap
  • Experimental Thoroughness: ⭐⭐⭐⭐ 6 models, dual human + automated evaluation
  • Writing Quality: ⭐⭐⭐⭐⭐ Problem definition is clear, evaluation system is complete
  • Value: ⭐⭐⭐⭐ Points to a key improvement direction for T2V research