Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs¶

Conference: CVPR 2025
arXiv: 2409.20063
Code: https://github.com/Q-Future/Q-Bench-Video
Area: Video Understanding / Quality Assessment
Keywords: Video Quality Assessment, LMM Benchmark, LMM, AIGC Distortion, Temporal Consistency

TL;DR¶

Q-Bench-Video is the first benchmark to systematically evaluate the video quality understanding of Large Multimodal Models (LMMs). It covers natural, AIGC, and CG videos, four quality dimensions, and multiple question types.

Background & Motivation¶

Key Challenge¶

Large Multimodal Models (LMMs) have made remarkable progress on high-level semantic video understanding, but their video quality understanding has not been systematically evaluated. Video quality matters for compression optimization, user experience, and the establishment of standards for video generation. The low-level information involved (blur, noise, compression artifacts, etc.) is fundamentally different from high-level semantics, yet existing LMM video benchmarks (e.g., MVBench, Video-MME) focus on semantic understanding and leave out quality perception. Meanwhile, the rapid development of AIGC video generation has introduced new distortion types (unnatural textures, illumination inconsistency, etc.) that call for a specialized evaluation framework. This paper aims to fill this gap.

Method¶

Overall Architecture¶

The construction of Q-Bench-Video follows three principles: (1) broad video content coverage, with 1,000 natural, 600 AIGC, and 200 CG videos for a total of 1,800 videos; (2) uniform sampling based on quality annotations to ensure a balanced quality distribution; (3) a focus on four quality dimensions that affect the viewing experience. Each data entry is a meta-structure (V, Q, A, C), totaling 2,378 question-answer pairs. Twelve open-source and five closed-source LMMs are evaluated.
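The second principle, uniform sampling over quality annotations, can be illustrated with a small sketch. This is not the authors' sampling code: the function name `sample_uniform_by_quality`, the five equal-width bins, and the 0-100 MOS range are all assumptions made for illustration. The idea is simply to bin videos by MOS and draw the same number from each bin, so that the typically over-represented middle of the quality range does not dominate.

```python
import random
from collections import defaultdict


def sample_uniform_by_quality(videos, n_total, n_bins=5,
                              mos_range=(0.0, 100.0), seed=0):
    """Draw roughly n_total videos with an equal share from each MOS bin.

    videos: list of (video_id, mos) pairs.
    n_bins and mos_range are illustrative assumptions, not values from the paper.
    """
    lo, hi = mos_range
    width = (hi - lo) / n_bins

    # Group video ids by equal-width MOS bin.
    bins = defaultdict(list)
    for video_id, mos in videos:
        idx = int((mos - lo) / width)
        idx = max(0, min(idx, n_bins - 1))  # clamp so mos == hi lands in the last bin
        bins[idx].append(video_id)

    rng = random.Random(seed)
    per_bin = n_total // n_bins
    selected = []
    for idx in range(n_bins):
        pool = bins[idx]
        # A sparse bin contributes everything it has rather than failing.
        k = min(per_bin, len(pool))
        selected.extend(rng.sample(pool, k))
    return selected
```

If a bin holds fewer videos than its quota, the result is smaller than `n_total`; a real pipeline would need to decide how to redistribute the shortfall.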

Key Designs¶

Three Question Types Design: (a) Yes-or-No questions: binary judgment of video quality, with annotations adjusted so that correct answers are balanced 50:50 between Yes and No, which prevents an LMM biased toward one answer from gaining an advantage; (b) What-How questions: "What" identifies specific distortion types, while "How" distinguishes fine-grained differences in distortion severity; (c) Open-ended questions: without limiting the answer set, evaluating LMMs' ability to perceive video quality in real-world scenarios, such as "Please list and explain the possible factors causing the low clarity of this video." Additionally, a video pair comparison task is added to evaluate relative quality judgment capabilities.
Four Dimensions of Quality Focus: (a) Technical distortion: low-level degradations like blur, noise, and compression artifacts; (b) Aesthetic distortion: subjective aesthetic deviations in composition, color, illumination, etc.; (c) Temporal distortion: temporal issues such as camera shake, flickering, inconsistent motion, and stuttering; (d) AIGC distortion: unnatural textures, eerie faces, unrealistic object behaviors, and other artifacts unique to AI-generated content. A single question can cover multiple dimensions simultaneously.
Diversity of Video Sources: Natural videos are from LSVQ (600 sampled from 39K), MaxWell (350 sampled from 4.5K), and the WaterlooSQoE series; AIGC videos are from T2VQA-DB (200 sampled from 10K) and VideoFeedback (400 sampled from 37.6K); CG videos are sourced from LIVE-YT-Gaming (200 sampled from 600). Most datasets contain ITU-standard MOS annotations, ensuring the scientific rigour of quality sampling.
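To see why the 50:50 balance on Yes-or-No questions matters, the sketch below measures the Yes fraction of a question set and the accuracy of a degenerate model that always gives the same answer. The record layout (`type` and `correct` keys) and the toy questions are hypothetical, not the benchmark's actual schema or data.

```python
def yes_fraction(questions):
    """Fraction of Yes-or-No questions whose correct answer is 'yes'."""
    answers = [q["correct"].strip().lower()
               for q in questions if q["type"] == "yes-or-no"]
    if not answers:
        raise ValueError("no Yes-or-No questions found")
    return answers.count("yes") / len(answers)


def constant_answer_accuracy(questions, constant="yes"):
    """Accuracy of a model that replies `constant` to every Yes-or-No question."""
    yn = [q for q in questions if q["type"] == "yes-or-no"]
    if not yn:
        raise ValueError("no Yes-or-No questions found")
    hits = sum(q["correct"].strip().lower() == constant for q in yn)
    return hits / len(yn)


# Toy, made-up questions used only to exercise the functions.
toy_questions = [
    {"type": "yes-or-no", "correct": "Yes"},
    {"type": "yes-or-no", "correct": "No"},
    {"type": "yes-or-no", "correct": "Yes"},
    {"type": "yes-or-no", "correct": "No"},
    {"type": "what-how", "correct": "B"},
]
```

On a balanced set, both constant strategies score exactly chance, so any accuracy above 0.5 has to come from actually perceiving the video.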

Loss & Training¶

Pure evaluation benchmark with no training component
Open-ended questions use GPT-4 as an auxiliary scorer
Multiple-choice questions use accuracy
Video pair comparison uses consistency rate
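For the multiple-choice question types, accuracy can be computed per type roughly as follows. This is a minimal sketch with a hypothetical record layout (`type`, `prediction`, `correct`) and exact string matching after normalization; the GPT-assisted scoring of open-ended answers and the pair-comparison metric are not covered here, since their details are not given in this summary.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Compute accuracy separately for each question type.

    records: iterable of dicts with keys 'type', 'prediction', 'correct'
    (a hypothetical layout chosen for this example).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["type"]] += 1
        # Compare answers case-insensitively, ignoring surrounding whitespace.
        if r["prediction"].strip().lower() == r["correct"].strip().lower():
            hits[r["type"]] += 1
    return {t: hits[t] / totals[t] for t in totals}
```

Keeping the types separate avoids letting the easier binary questions mask weak performance on What-How questions.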

Key Experimental Results¶

Main Results¶

Model	Yes-or-No↑	What-How↑	Open-ended↑	Average↑
GPT-4o	Highest	Highest	Highest	Highest
InternVL2	Second Highest	Second Highest	-	Second Highest
VideoLLaMA2	Medium	Medium	-	Medium
Human Performance	Far higher than all LMMs	Far higher than all LMMs	Far higher than all LMMs	Significant Lead

Ablation Study¶

Dimension	LMM Performance
Technical Distortion	Relatively Good (LMMs have basic perception of blur/noise)
Aesthetic Distortion	Medium
Temporal Distortion	Poor (LMMs struggle to capture temporal issues)
AIGC Distortion	Poor (LMMs are insensitive to AI-generated artifacts)
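Because a single question can be tagged with several quality dimensions, a per-dimension breakdown like the one above has to count such a question once in every dimension it touches. The sketch below shows one way to do that; the dimension labels, the record layout (`dimensions`, `is_correct`), and the function name are assumptions for illustration, not the benchmark's actual format.

```python
from collections import defaultdict

# Assumed short labels for the four quality dimensions.
DIMENSIONS = ("technical", "aesthetic", "temporal", "aigc")


def accuracy_by_dimension(records):
    """Per-dimension accuracy where each record may carry several dimension tags.

    records: iterable of dicts with 'dimensions' (list of labels) and
    'is_correct' (bool). A multi-tagged question counts in each of its dimensions.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        for dim in r["dimensions"]:
            totals[dim] += 1
            hits[dim] += int(r["is_correct"])
    # Report only dimensions that actually occur, in a fixed order.
    return {d: hits[d] / totals[d] for d in DIMENSIONS if totals[d]}
```

Note that the per-dimension totals can sum to more than the number of questions, so these figures should not be averaged as if the dimensions were disjoint.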

Key Findings¶

LMMs have a basic but incomplete and imprecise understanding of video quality, with a significant gap compared to human performance.
Closed-source models (e.g., GPT-4o) significantly outperform open-source models.
LMMs perform worst on the temporal distortion and AIGC distortion dimensions, which are precisely the two aspects most distinctive to video quality.
The video pair comparison task is more challenging than single-video evaluation.
Open-ended questions expose the limitations of LMMs in explaining the causes of quality issues.

Highlights & Insights¶

The first work to propose LMM video quality understanding as an independent research direction, filling an important gap.
The introduction of the AIGC distortion dimension is timely: as video generation models gain popularity, the demand for such evaluation is surging.
The balanced design of Yes-or-No questions and the introduction of open-ended questions enhance the comprehensiveness and authenticity of the evaluation.
The benchmark reveals the fundamental limitations of LMMs in low-level information perception.

Limitations & Future Work¶

The scale of 2,378 QA pairs can be further expanded.
Open-ended evaluation relying on GPT-4 may introduce bias.
The video quality scoring capability of LMMs (quantitative scoring vs. qualitative description) was not evaluated.
The framework can be extended to evaluate outputs from more diverse video generation models.

vs Video-MME/MVBench: These benchmarks focus on semantic understanding, whereas Q-Bench-Video focuses on low-level quality understanding; the two are complementary.
vs Traditional VQA Methods: Traditional video quality assessment (VQA) methods output quality scores, whereas Q-Bench-Video evaluates the quality understanding and explanation capabilities of LMMs.
vs Q-Bench (Image Version): Extends the image quality benchmark paradigm to videos, adding temporal and AIGC dimensions.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ — The first LMM benchmark for video quality, a pioneering direction.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — A comprehensive evaluation with 17 models, 4 dimensions, and 3 question types.
Writing Quality: ⭐⭐⭐⭐ — Clear benchmark design principles and a complete taxonomic system.
Value: ⭐⭐⭐⭐⭐ — Provides a standardized evaluation platform for video quality understanding research.