Skip to content

NeuS-V: Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification

Conference: CVPR 2025
arXiv: 2411.16718
Code: https://utaustin-swarmlab.github.io/NeuS-V
Area: Autonomous Driving / Video Generation Evaluation
Keywords: Text-to-Video Evaluation, Temporal Logic, Model Checking, Formal Verification, VLM

TL;DR

Proposes NeuS-V, the first framework evaluating the temporal consistency of text-to-video (T2V) models using formal verification (temporal logic + probabilistic model checking). It translates text prompts into temporal logic specifications, scores atomic propositions using VLMs, constructs video automata, and formally verifies the satisfaction probability. It achieves a Pearson correlation of 0.71 with human annotations on Gen-3 (compared to only 0.47 for VBench).

Background & Motivation

Background

Background: The evaluation of T2V models (such as Gen-3, Pika, CogVideoX) primarily relies on CLIP similarity-based metrics like VBench. These metrics average frame-level similarities and fail to capture temporal dynamics—"A then B" and "B then A" would receive the same score.

Limitations of Prior Work: VBench's evaluation of temporal alignment correlates poorly with human judgment (Pearson of only 0.47) because it does not understand temporal logic (e.g., ALWAYS, EVENTUALLY, UNTIL). Single-frame CLIP cannot distinguish between "always present" and "occasionally appearing".

Key Challenge: The semantics of a video are not only frame-level (each frame is correct) but also sequence-level (temporal relations are correct), yet existing evaluations only focus on the former.

Key Insight: Utilizing formal verification techniques from computer science—translating prompts into Temporal Logic (TL), modeling video as a Discrete-Time Markov Chain (DTMC), and using the probabilistic model checker STORM to compute satisfaction probability.

Core Idea: Prompt \(\rightarrow\) Temporal Logic, Video \(\rightarrow\) Automata \(\rightarrow\) Formal verification of satisfaction probability = Rigorous temporal consistency evaluation.

Method

Key Designs

  1. PULS (Prompt→TL Conversion): Uses GPT-4o to decompose natural language prompts into atomic propositions + temporal logic specifications. For example, "A cat walks then sits" \(\rightarrow\) EVENTUALLY(walk) AND EVENTUALLY(sit) AND walk UNTIL sit.

  2. VLM Atomic Proposition Scoring: Uses LLaMA-3.2 to score each frame (0 or 1), determining whether each atomic proposition is satisfied.

  3. Video Automata + STORM Verification: Constructs the frame-level scores into a DTMC transition matrix \(\delta(q,q') = \prod_i (C_i^*)^{1_{q'_i=1}}(1-C_i^*)^{1_{q'_i=0}}\), and uses the STORM model checker to compute the satisfaction probability of the temporal logic specification.

Loss & Training

No training is involved—this is a pure evaluation framework. The satisfaction probability \(\mathbb{P}[\mathcal{A}_\mathcal{V} \models \Phi]\) is exactly computed by STORM.

Key Experimental Results

Model NeuS-V Pearson VBench Pearson
Gen-3 0.71 0.47
Pika 0.62 0.70
T2V-Turbo 0.55 Lower
Mismatched Caption Detection 0.52 difference 0.38 difference

Ablation Study

  • No TL/No verification (pure VLM): Pearson is only 0.30-0.40.
  • Temporal operators (ALWAYS/EVENTUALLY/UNTIL) are crucial for distinguishing between aligned and unaligned videos.
  • LLaVA-Video-7B is overconfident, whereas frame-by-frame LLaMA-3.2 is more reliable.
  • The accuracy of PULS conversion directly impacts the final score: incorrect TL specifications lead to misjudgments, but GPT-4o's conversion quality reaches a practical level on test prompts.
  • The computational overhead of the STORM model checker is acceptable for simple specifications (\(\le 3\) operators), but requires optimization for complex specifications.

Key Findings

  • NeuS-V is 5\(\times\) better than VBench in temporal alignment—formal verification truly understands sequential order ("before and after" relations).
  • Stronger mismatch detection capability: The score difference between aligned and unaligned is 0.52 vs 0.38 for VBench.

Highlights & Insights

  • An elegant combination of formal methods and AI—introducing model checking from theoretical computer science into video generation evaluation.
  • Temporal logic is the right abstraction for temporal alignment—ALWAYS/EVENTUALLY/UNTIL precisely describe human judgments of temporal relationships.

Limitations & Future Work

  • Reliance on VLM accuracy—errors will propagate.
  • TL specifications depend on GPT-4o conversion (which may have semantic bias).
  • Limited to English prompts.
  • Non-trivial computational overhead for complex specifications (\(>3\) TL operators).
  • The temporal granularity of VLM frame-level judgment is limited by the sampling frame rate; rapid action changes may be missed.
  • Although temporal logic has strong expressiveness, it does not cover all dimensions of human judgment (such as aesthetics, physical plausibility) and only evaluates temporal correlation.
  • The benchmark dataset size (360 prompts) is relatively small, and the coverage of temporal pattern types may not be comprehensive.
  • Future work can extend NeuS-V to evaluate the physical plausibility and causal consistency of videos.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First to introduce formal verification to T2V evaluation.
  • Experimental Thoroughness: ⭐⭐⭐⭐ 4 models + 360 prompts + human annotation correlation.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear explanation of cross-disciplinary concepts.
  • Value: ⭐⭐⭐⭐⭐ Potential to shift the T2V evaluation paradigm.