NeuS-V: Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification¶

Conference: CVPR 2025
arXiv: 2411.16718
Code: https://utaustin-swarmlab.github.io/NeuS-V
Area: Autonomous Driving / Video Generation Evaluation
Keywords: Text-to-Video Evaluation, Temporal Logic, Model Checking, Formal Verification, VLM

TL;DR¶

Proposes NeuS-V, the first framework evaluating the temporal consistency of text-to-video (T2V) models using formal verification (temporal logic + probabilistic model checking). It translates text prompts into temporal logic specifications, scores atomic propositions using VLMs, constructs video automata, and formally verifies the satisfaction probability. It achieves a Pearson correlation of 0.71 with human annotations on Gen-3 (compared to only 0.47 for VBench).

Background & Motivation¶

Background¶

Background: The evaluation of T2V models (such as Gen-3, Pika, CogVideoX) primarily relies on CLIP similarity-based metrics like VBench. These metrics average frame-level similarities and fail to capture temporal dynamics—"A then B" and "B then A" would receive the same score.

Limitations of Prior Work: VBench's evaluation of temporal alignment correlates poorly with human judgment (Pearson of only 0.47) because it does not understand temporal logic (e.g., ALWAYS, EVENTUALLY, UNTIL). Single-frame CLIP cannot distinguish between "always present" and "occasionally appearing".

Key Challenge: The semantics of a video are not only frame-level (each frame is correct) but also sequence-level (temporal relations are correct), yet existing evaluations only focus on the former.

Key Insight: Utilizing formal verification techniques from computer science—translating prompts into Temporal Logic (TL), modeling video as a Discrete-Time Markov Chain (DTMC), and using the probabilistic model checker STORM to compute satisfaction probability.

Core Idea: Prompt \(\rightarrow\) Temporal Logic, Video \(\rightarrow\) Automata \(\rightarrow\) Formal verification of satisfaction probability = Rigorous temporal consistency evaluation.

Method¶

Key Designs¶

PULS (Prompt→TL Conversion): Uses GPT-4o to decompose natural language prompts into atomic propositions + temporal logic specifications. For example, "A cat walks then sits" \(\rightarrow\) EVENTUALLY(walk) AND EVENTUALLY(sit) AND walk UNTIL sit.
VLM Atomic Proposition Scoring: Uses LLaMA-3.2 to score each frame (0 or 1), determining whether each atomic proposition is satisfied.
Video Automata + STORM Verification: Constructs the frame-level scores into a DTMC transition matrix \(\delta(q,q') = \prod_i (C_i^*)^{1_{q'_i=1}}(1-C_i^*)^{1_{q'_i=0}}\), and uses the STORM model checker to compute the satisfaction probability of the temporal logic specification.

Loss & Training¶

No training is involved—this is a pure evaluation framework. The satisfaction probability \(\mathbb{P}[\mathcal{A}_\mathcal{V} \models \Phi]\) is exactly computed by STORM.

Key Experimental Results¶

Model	NeuS-V Pearson	VBench Pearson
Gen-3	0.71	0.47
Pika	0.62	0.70
T2V-Turbo	0.55	Lower
Mismatched Caption Detection	0.52 difference	0.38 difference

Ablation Study¶

No TL/No verification (pure VLM): Pearson is only 0.30-0.40.
Temporal operators (ALWAYS/EVENTUALLY/UNTIL) are crucial for distinguishing between aligned and unaligned videos.
LLaVA-Video-7B is overconfident, whereas frame-by-frame LLaMA-3.2 is more reliable.
The accuracy of PULS conversion directly impacts the final score: incorrect TL specifications lead to misjudgments, but GPT-4o's conversion quality reaches a practical level on test prompts.
The computational overhead of the STORM model checker is acceptable for simple specifications (\(\le 3\) operators), but requires optimization for complex specifications.

Key Findings¶

NeuS-V is 5\(\times\) better than VBench in temporal alignment—formal verification truly understands sequential order ("before and after" relations).
Stronger mismatch detection capability: The score difference between aligned and unaligned is 0.52 vs 0.38 for VBench.

Highlights & Insights¶

An elegant combination of formal methods and AI—introducing model checking from theoretical computer science into video generation evaluation.
Temporal logic is the right abstraction for temporal alignment—ALWAYS/EVENTUALLY/UNTIL precisely describe human judgments of temporal relationships.

Limitations & Future Work¶

Reliance on VLM accuracy—errors will propagate.
TL specifications depend on GPT-4o conversion (which may have semantic bias).
Limited to English prompts.
Non-trivial computational overhead for complex specifications (\(>3\) TL operators).
The temporal granularity of VLM frame-level judgment is limited by the sampling frame rate; rapid action changes may be missed.
Although temporal logic has strong expressiveness, it does not cover all dimensions of human judgment (such as aesthetics, physical plausibility) and only evaluates temporal correlation.
The benchmark dataset size (360 prompts) is relatively small, and the coverage of temporal pattern types may not be comprehensive.
Future work can extend NeuS-V to evaluate the physical plausibility and causal consistency of videos.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ First to introduce formal verification to T2V evaluation.
Experimental Thoroughness: ⭐⭐⭐⭐ 4 models + 360 prompts + human annotation correlation.
Writing Quality: ⭐⭐⭐⭐⭐ Clear explanation of cross-disciplinary concepts.
Value: ⭐⭐⭐⭐⭐ Potential to shift the T2V evaluation paradigm.