VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
Conference: ACL 2025
arXiv: 2505.23693
Code: https://github.com/SighingSnow/VF-Eval
Area: Multimodal VLM
Keywords: AIGC video evaluation, video generation feedback, error detection, reasoning evaluation, benchmark
TL;DR
This paper proposes the VF-Eval benchmark to systematically evaluate how well 13 MLLMs provide feedback on AIGC videos across four tasks: consistency verification, error awareness, error type detection, and reasoning evaluation. The evaluation shows that even GPT-4.1 fails to perform consistently across all tasks, highlighting the difficulty of AIGC video understanding.
Background & Motivation
Background: MLLMs are increasingly used to assess video generation quality (e.g., by providing quality scores or natural language feedback). However, existing research focuses on natural video understanding (e.g., MVBench, Video-MME), so how well MLLMs understand AI-Generated Content (AIGC) videos remains under-evaluated.
Limitations of Prior Work: (a) AIGC videos possess unique characteristics (e.g., synthetic textures, dynamic lighting effects, and algorithmically generated characters) that differ significantly from natural videos, and MLLMs' capabilities on such videos have not been systematically evaluated. (b) Existing AIGC video evaluation methods usually offer only implicit quality scores, which lack precision and fail to diagnose specific types of errors. (c) There is a lack of multi-dimensional benchmarks for AIGC video understanding.
Key Challenge: The video generation field increasingly relies on MLLMs for quality feedback, yet there is no systematic benchmark to measure whether MLLMs' feedback capabilities on AIGC videos are reliable.
Goal: Answer three questions: (a) Can MLLMs accurately detect errors in AIGC videos? (b) Can MLLMs distinguish between different types of errors (quality issues, commonsense violations, and ethical concerns)? (c) Can MLLM feedback practically help improve video generation?
Key Insight: Designing four evaluation tasks covering the entire processing pipeline from consistency checks to fine-grained reasoning, and conducting comprehensive testing using three question types: Yes/No, multiple-choice, and open-ended.
Core Idea: Constructing the first benchmark to systematically evaluate the feedback capabilities of MLLMs on AIGC videos, revealing significant deficiencies of current MLLMs in synthetic video understanding.
Method
Overall Architecture
Input: AIGC videos (generated by commercial models such as Pika, Kling, and Gen-3, plus open-source models such as T2V-turbo-v2 and OpenSora) and corresponding questions. Output: Answers from MLLMs. The four tasks cover different levels of feedback capability. Together they comprise 9,740 QA pairs over videos lasting 4 to 12 seconds.
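The input/output contract above can be pictured as a small record type. This is only a sketch: the `QAItem` class, its field names, and the `QUESTION_FORMAT` mapping are hypothetical and are not taken from the VF-Eval code release. The pairing of tasks with question formats follows the task descriptions under Key Designs.

```python
from dataclasses import dataclass
from typing import List, Optional

# Task abbreviation -> question format, following the task descriptions in this note.
QUESTION_FORMAT = {
    "CV": "open-ended",       # Consistency Verification
    "EA": "yes/no",           # Error Awareness
    "ED": "multiple-choice",  # Error Type Detection
    "RE": "open-ended",       # Reasoning Evaluation
}


@dataclass
class QAItem:
    """One benchmark question about one AIGC video (hypothetical schema)."""

    video_path: str
    task: str
    question: str
    gold_answer: str
    options: Optional[List[str]] = None  # only used by multiple-choice items

    def __post_init__(self) -> None:
        if self.task not in QUESTION_FORMAT:
            raise ValueError(f"unknown task: {self.task}")
        if QUESTION_FORMAT[self.task] == "multiple-choice" and not self.options:
            raise ValueError("multiple-choice items need a list of options")

    @property
    def question_format(self) -> str:
        return QUESTION_FORMAT[self.task]


# Example item for the Error Awareness task (file name is made up).
example = QAItem("video_0001.mp4", "EA", "Does this video contain errors?", "Yes")
```

Validating the task and options at construction time keeps malformed items from reaching the scoring step.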
Key Designs
- Four Evaluation Tasks:
    - Function: Systematically evaluating the AIGC video feedback capabilities of MLLMs from shallow to deep levels.
    - Mechanism:
        - Consistency Verification (CV): Detects inconsistencies between the AIGC video and the generation prompt, and generates refined prompts (evaluated via open-ended questions scored by an LLM).
        - Error Awareness (EA): Determines whether a video contains errors. These are Yes/No questions in which every correct answer is "Yes", which exposes models that are biased toward assuming the video is normal.
        - Error Type Detection (ED): Identifies the specific error type in the AIGC video through four-option multiple-choice questions. The types are video quality (spatial-temporal consistency, visual appeal, camera movement), commonsense/physical violations (logic, mechanics, lighting), and ethical issues.
        - Reasoning Evaluation (RE): Fine-grained reasoning through open-ended questions covering spatial, temporal, action, object, and counting questions, plus narrative summarization.
    - Design Motivation: A single quality score is insufficient to evaluate feedback capability; a full-pipeline evaluation is required, moving from error discovery to error classification, and finally to deep reasoning.
- Multi-Source AIGC Video Collection:
    - Function: Building an AIGC video dataset that covers a wide range of scenarios and video generation models.
    - Mechanism: Utilizing 1,000 GPT-4o-generated prompts (manually verified) to generate videos through commercial models (Pika, Kling, Pixeldance, Gen-3) and open-source models (T2V-turbo-v2), with additional videos collected from Lavie and OpenSora.
    - Design Motivation: Covering the feature variations across different video generation models to ensure the breadth and representativeness of the evaluation.
- RePrompt Experimental Design:
    - Function: Verifying whether MLLM feedback can help improve video generation.
    - Mechanism: Comparing refined prompts provided by MLLMs versus those from humans, regenerating videos using these prompts, and asking human evaluators to compare video quality before and after the refinement. The experiments demonstrate that MLLM feedback aligned with human preferences can improve both the quality and consistency of AIGC videos.
    - Design Motivation: Evaluation is not the ultimate goal; it is crucial to validate the practical value of the feedback loop.
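One simple way to summarize the human comparison step in RePrompt is a win rate over pairwise judgments. The sketch below assumes each judgment is recorded as "refined", "original", or "tie"; this label set, the function name, and the sample data are illustrative assumptions and do not reproduce the paper's protocol or results.

```python
from collections import Counter

ALLOWED_LABELS = {"refined", "original", "tie"}


def win_rate(judgments):
    """Share of non-tie comparisons in which the refined-prompt video was preferred.

    Returns None when every comparison is a tie (or the list is empty).
    """
    counts = Counter(judgments)
    unknown = set(counts) - ALLOWED_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    decided = counts["refined"] + counts["original"]
    if decided == 0:
        return None
    return counts["refined"] / decided


# Made-up judgments from four hypothetical comparisons.
sample = ["refined", "refined", "original", "tie"]
rate = win_rate(sample)  # two wins out of three decided comparisons
```

Ties are excluded from the denominator here; counting them as half a win is an equally reasonable convention.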
Loss & Training
VF-Eval serves as an evaluation benchmark and does not involve model training. Evaluation metrics: CV and RE are scored by an LLM (GPT-4.1-mini / GPT-4o-mini), while EA and ED are evaluated using accuracy.
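For the two accuracy-based tasks (EA and ED), scoring reduces to exact-match counting. The following is a minimal sketch, assuming answers are compared after light normalization; the `normalize` rule, the function names, and the toy predictions are illustrative assumptions, not the benchmark's actual scoring script.

```python
def normalize(answer: str) -> str:
    """Strip whitespace and trailing '.' or '!' characters, then lowercase."""
    return answer.strip().strip(".!").strip().lower()


def accuracy(predictions, gold):
    """Exact-match accuracy in percent after normalization."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty set")
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Error Awareness: every gold answer is "Yes" (toy data).
ea_gold = ["Yes", "Yes", "Yes", "Yes"]
ea_pred = ["No", "yes.", "No", "No"]
ea_acc = accuracy(ea_pred, ea_gold)  # one of four predictions matches

# Error Type Detection: gold answers are option letters (toy data).
ed_gold = ["B", "A", "D"]
ed_pred = ["B", "C", "D"]
ed_acc = accuracy(ed_pred, ed_gold)
```

CV and RE are open-ended and scored by an LLM, so they would need a judge prompt rather than exact matching.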
Key Experimental Results
Main Results
Performance of selected models among the 13 MLLMs evaluated on VF-Eval, with the human baseline for reference:
| Model | Consistency Verification | Error Awareness (Quality) | Error Awareness (Commonsense) | Error Type Detection (Quality) | Reasoning Evaluation | Overall |
|---|---|---|---|---|---|---|
| Human | 81.9 | 84.3 | 84.2 | 86.9 | 70.1 | 84.4 |
| GPT-4.1 | 66.3 | 39.7 | 24.0 | 56.0 | 42.1 | 51.6 |
| InternVL3-38B | 52.9 | 34.7 | 5.0 | 49.4 | 36.2 | 43.6 |
| Qwen2.5-VL-72B | 59.8 | 22.9 | 8.6 | 31.0 | 35.6 | 35.8 |
| Qwen2.5-VL-7B | 51.5 | 23.4 | 6.1 | 23.8 | 35.3 | 30.4 |
Ablation Study
Gap between the human baseline and GPT-4.1 on each task (GPT-4.1 minus human):
| Task | Human | GPT-4.1 | Gap |
|---|---|---|---|
| Consistency Verification | 81.9 | 66.3 | -15.6 |
| Error Awareness (Commonsense) | 84.2 | 24.0 | -60.2 |
| Error Type Detection (Quality) | 86.9 | 56.0 | -30.9 |
| Reasoning Evaluation | 70.1 | 42.1 | -28.0 |
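The Gap column is just the GPT-4.1 score minus the human score. A short sketch that recomputes it from the numbers in the table above (variable names are illustrative):

```python
# (human, GPT-4.1) scores copied from the table above.
scores = {
    "Consistency Verification": (81.9, 66.3),
    "Error Awareness (Commonsense)": (84.2, 24.0),
    "Error Type Detection (Quality)": (86.9, 56.0),
    "Reasoning Evaluation": (70.1, 42.1),
}

# Gap = model score minus human score, rounded to one decimal place.
gaps = {task: round(model - human, 1) for task, (human, model) in scores.items()}

# The task with the most negative gap is where GPT-4.1 trails humans the most.
largest_gap_task = min(gaps, key=gaps.get)

for task, gap in gaps.items():
    print(f"{task}: {gap:+.1f}")
```

The most negative entry is Error Awareness (Commonsense), which matches the bottleneck discussed in the findings.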
Key Findings
- MLLMs perform significantly worse than humans on AIGC videos: The best-performing model, GPT-4.1, achieves an Overall score of only 51.6 compared to 84.4 for humans, showing a substantial gap of 32.8 percentage points (pp).
- Detecting commonsense and physical violations is the biggest bottleneck: GPT-4.1 achieves only 24.0% on Error Awareness (Commonsense), well below the 50.0% expected from random guessing on a Yes/No question. The other models in the table above score even lower (5.0 to 8.6), so current MLLMs rarely flag violations of physical commonsense in AIGC videos.
- Models suffer from a "normality bias": Because every correct answer in the Error Awareness task is "Yes" (the video has an error), GPT-4.1's 24.0% on the commonsense dimension means it judged most of these flawed videos to be normal.
- Open-source models are competitive: InternVL3-38B (43.6) is close to GPT-4.1-mini (44.3), indicating that the performance gap of open-source models in AIGC video understanding is narrowing.
- Inconsistent performance across different tasks: GPT-4.1 reaches 75.2 on the Object dimension but only 24.0 on Error Awareness (Commonsense), highlighting a highly uneven capability distribution.
- RePrompt experiments validate the practical value of feedback: Aligning MLLM feedback with human preferences effectively improves the quality of video generation.
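The normality bias can be read directly off the Error Awareness scores. Since every gold answer is "Yes", accuracy equals the share of "Yes" responses, and the remainder is the share of videos judged normal. The sketch below applies this to the Error Awareness (Commonsense) column of the main results table, assuming every model response parses as either "Yes" or "No" (an assumption; refusals or unparseable outputs would also lower accuracy).

```python
# Error Awareness (Commonsense) accuracy in percent, copied from the main results table.
ea_commonsense_acc = {
    "GPT-4.1": 24.0,
    "InternVL3-38B": 5.0,
    "Qwen2.5-VL-72B": 8.6,
    "Qwen2.5-VL-7B": 6.1,
}

# With all-"Yes" gold labels and binary responses (assumed), the implied
# "No" rate is simply 100 minus the accuracy.
implied_no_rate = {
    model: round(100.0 - acc, 1) for model, acc in ea_commonsense_acc.items()
}

for model, rate in implied_no_rate.items():
    print(f"{model}: judged about {rate}% of flawed videos as normal")
```

Under this assumption, GPT-4.1 labels roughly three quarters of the flawed videos as normal, and the open-source models in the table do so for more than nine in ten.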
Highlights & Insights
- First benchmark to systematically evaluate MLLM feedback capabilities on AIGC videos: Fills a gap in synthetic video evaluation, utilizing 4 tasks and 3 question types to provide thorough coverage.
- Valuable discovery of "normality bias": The tendency of MLLMs to judge physical violations in AIGC videos as normal serves as an important warning for practical applications of AIGC video quality assessment.
- Hierarchical design of error types: Categorizes errors into video quality, commonsense/physical violations, and ethical issues, yielding much higher diagnostic value than a single quality score.
Limitations & Future Work
- Short video duration: With an average duration of 8.98 seconds, it does not cover long-video generation scenarios (such as Sora-level 60-second-plus videos).
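- Design bias in the Error Awareness task: All correct answers are "Yes". This exposes models that default to "the video is normal", but a model that always answers "Yes" would score 100%, so accuracy on this task alone cannot separate genuine error detection from a bias toward answering "Yes".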
- Limited dataset scale: The 9,740 QA pairs are still relatively small for a comprehensive evaluation.
- Timeliness issues: As video generation models iterate rapidly, current profiles of error types may evolve alongside model advancements.
- Future directions: Incorporating comparative evaluations between AIGC and natural videos; extending to longer videos and a broader range of generative models.
Related Work & Insights
- vs EditVid-QA: EditVid-QA also evaluates synthetic videos, but is limited to edited videos and only features open-ended questions; VF-Eval covers more diverse AIGC types and question formats.
- vs QBench: QBench addresses quality assessment of AI-generated content but focuses primarily on scoring, whereas VF-Eval emphasizes error diagnostics and reasoning.
- vs Video-MME: Video-MME is the most comprehensive benchmark for natural video understanding; VF-Eval serves as its counterpart in the AIGC video domain.
Rating
- Novelty: ⭐⭐⭐⭐ First systematic AIGC video feedback evaluation benchmark, featuring hierarchically designed tasks.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated 13 models alongside human baselines and validated via RePrompt.
- Writing Quality: ⭐⭐⭐ Well-structured, although some details could be more concise.
- Value: ⭐⭐⭐⭐ Offers significant practical guidance for both AIGC video quality assessment and the improvement of video generation.