Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs

Conference: CVPR 2025
arXiv: 2409.20063
Code: https://github.com/Q-Future/Q-Bench-Video
Area: Video Understanding / Quality Assessment
Keywords: Video Quality Assessment, LMM Benchmark, LMM, AIGC Distortion, Temporal Consistency

TL;DR

Q-Bench-Video is the first benchmark to systematically evaluate how well Large Multimodal Models (LMMs) understand video quality. It covers natural, AIGC, and CG videos, four quality dimensions, and multiple question types.

Background & Motivation

Key Challenge

Large Multimodal Models (LMMs) have made remarkable progress on high-level semantic video understanding, but their understanding of video quality has not been systematically evaluated. Video quality matters for compression optimization, user experience, and the establishment of standards for video generation. The low-level information involved (blur, noise, compression artifacts, etc.) is fundamentally different from high-level semantics. Existing LMM video benchmarks (e.g., MVBench, Video-MME) focus on semantic understanding and leave out quality perception. Meanwhile, the rapid growth of AIGC video generation has introduced new distortion types (unnatural textures, illumination inconsistency, etc.) that call for a dedicated evaluation framework. This paper aims to fill that gap.

Method

Overall Architecture

The construction of Q-Bench-Video follows three principles: (1) broad video content coverage: 1,000 natural, 600 AIGC, and 200 CG videos, for a total of 1,800 videos; (2) uniform sampling based on quality annotations to ensure a balanced quality distribution; (3) a focus on the four quality dimensions that affect the viewing experience. Each data entry is a meta-structure (V, Q, A, C), and the benchmark contains 2,378 question-answer pairs in total. Twelve open-source and five closed-source LMMs are evaluated.
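As a rough illustration of the (V, Q, A, C) meta-structure, the sketch below models one entry as a Python dataclass. The summary does not spell out the four fields, so the reading used here (V = video or video pair, Q = question, A = candidate answers, C = correct answer) is an assumption, as are all class, field, and file names; this is not the repository's actual data format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Tuple


class QuestionType(Enum):
    YES_OR_NO = "yes-or-no"
    WHAT_HOW = "what-how"
    OPEN_ENDED = "open-ended"


class Dimension(Enum):
    TECHNICAL = "technical"
    AESTHETIC = "aesthetic"
    TEMPORAL = "temporal"
    AIGC = "aigc"


@dataclass(frozen=True)
class BenchmarkEntry:
    """One hypothetical (V, Q, A, C) entry; field meanings are assumed."""
    videos: Tuple[str, ...]        # V: one path, or two for a pair comparison
    question: str                  # Q: the question text
    candidates: Tuple[str, ...]    # A (assumed): answer options, empty if open-ended
    correct: str                   # C (assumed): reference answer
    question_type: QuestionType
    dimensions: FrozenSet[Dimension]  # a question may cover several dimensions

    def is_pair(self) -> bool:
        """True when the entry compares two videos."""
        return len(self.videos) == 2


# Hypothetical example entry (the path and wording are made up).
entry = BenchmarkEntry(
    videos=("clip_0001.mp4",),
    question="Is the video affected by noticeable camera shake?",
    candidates=("Yes", "No"),
    correct="Yes",
    question_type=QuestionType.YES_OR_NO,
    dimensions=frozenset({Dimension.TEMPORAL}),
)
```

Storing the dimensions as a set reflects the point that a single question can be tagged with more than one quality dimension.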

Key Designs

  1. Three Question Types: (a) Yes-or-No questions: binary judgments of video quality, with annotations adjusted so that "yes" and "no" are each the correct answer about half of the time, which keeps a model biased toward one answer from gaining an advantage; (b) What-How questions: "What" questions identify specific distortion types, while "How" questions distinguish fine-grained differences in distortion severity; (c) Open-ended questions: the answer set is unrestricted, which tests how LMMs perceive video quality in real-world scenarios, e.g., "Please list and explain the possible factors causing the low clarity of this video." In addition, a video pair comparison task evaluates relative quality judgment.

  2. Four Dimensions of Quality Focus: (a) Technical distortion: low-level degradations like blur, noise, and compression artifacts; (b) Aesthetic distortion: subjective aesthetic deviations in composition, color, illumination, etc.; (c) Temporal distortion: temporal issues such as camera shake, flickering, inconsistent motion, and stuttering; (d) AIGC distortion: unnatural textures, eerie faces, unrealistic object behaviors, and other artifacts unique to AI-generated content. A single question can cover multiple dimensions simultaneously.

  3. Diversity of Video Sources: Natural videos are from LSVQ (600 sampled from 39K), MaxWell (350 sampled from 4.5K), and the WaterlooSQoE series; AIGC videos are from T2VQA-DB (200 sampled from 10K) and VideoFeedback (400 sampled from 37.6K); CG videos are sourced from LIVE-YT-Gaming (200 sampled from 600). Most datasets contain ITU-standard MOS annotations, ensuring the scientific rigour of quality sampling.
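The "uniform sampling based on quality annotations" principle can be sketched as stratified sampling over MOS bins. This is a minimal sketch under stated assumptions: the function name, the equal-width binning, the bin count, and the round-robin draw are all illustrative choices, not the authors' documented procedure.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def sample_uniform_by_quality(
    videos: Sequence[Tuple[str, float]],  # (video id, MOS) pairs
    n_samples: int,
    n_bins: int = 5,
    seed: int = 0,
) -> List[str]:
    """Pick videos so that each MOS bin contributes roughly equally."""
    rng = random.Random(seed)
    scores = [mos for _, mos in videos]
    lo, hi = min(scores), max(scores)
    # Fall back to width 1.0 when every video has the same MOS.
    width = (hi - lo) / n_bins or 1.0

    bins: Dict[int, List[str]] = defaultdict(list)
    for vid, mos in videos:
        idx = min(int((mos - lo) / width), n_bins - 1)
        bins[idx].append(vid)
    for members in bins.values():
        rng.shuffle(members)

    # Round-robin over the bins until enough videos are drawn
    # or every bin is exhausted.
    picked: List[str] = []
    while len(picked) < n_samples and any(bins.values()):
        for idx in sorted(bins):
            if bins[idx] and len(picked) < n_samples:
                picked.append(bins[idx].pop())
    return picked
```

Compared with drawing uniformly at random from the pool, this keeps low-quality and high-quality videos equally represented even when the source dataset is skewed toward one end of the MOS range.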

Evaluation Protocol

  • Pure evaluation benchmark with no training component
  • Open-ended questions use GPT-4 as an auxiliary scorer
  • Multiple-choice questions use accuracy
  • Video pair comparison uses consistency rate
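The scoring rules above can be sketched as follows. The answer normalization, the per-dimension breakdown, and the 0 to 2 judge scale used for open-ended answers are assumptions made for illustration; the summary only states that multiple-choice questions use accuracy and that GPT-4 assists in scoring open-ended answers.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of answers that match the reference (case-insensitive)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


def per_dimension_accuracy(
    records: Iterable[Tuple[bool, Iterable[str]]],
) -> Dict[str, float]:
    """Accuracy per quality dimension.

    Each record is (is_correct, dimensions). A question tagged with
    several dimensions counts toward every one of them.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for is_correct, dims in records:
        for dim in dims:
            total[dim] += 1
            correct[dim] += int(is_correct)
    return {dim: correct[dim] / total[dim] for dim in total}


def open_ended_score(judge_scores: List[float], max_score: float = 2.0) -> float:
    """Mean judge rating rescaled to [0, 1]; the 0-2 scale is assumed."""
    return sum(judge_scores) / (len(judge_scores) * max_score)
```

Averaging several judge ratings per answer is one common way to reduce the variance of an LLM-based scorer, though it does not remove any systematic bias the judge model may have.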

Key Experimental Results

Main Results

| Model | Yes-or-No↑ | What-How↑ | Open-ended↑ | Average↑ |
|---|---|---|---|---|
| GPT-4o | Highest | Highest | Highest | Highest |
| InternVL2 | Second Highest | Second Highest | - | Second Highest |
| VideoLLaMA2 | Medium | Medium | - | Medium |
| Human Performance | Far higher than all LMMs | Far higher than all LMMs | Far higher than all LMMs | Significant Lead |

Ablation Study

| Dimension | LMM Performance |
|---|---|
| Technical Distortion | Relatively Good (LMMs have basic perception of blur/noise) |
| Aesthetic Distortion | Medium |
| Temporal Distortion | Poor (LMMs struggle to capture temporal issues) |
| AIGC Distortion | Poor (LMMs are insensitive to AI-generated artifacts) |

Key Findings

  • LMMs have a basic but incomplete and imprecise understanding of video quality, with a significant gap compared to human performance.
  • Closed-source models (e.g., GPT-4o) significantly outperform open-source models.
  • LMMs perform worst on the temporal distortion and AIGC distortion dimensions, which are precisely the two aspects most distinctive to video quality.
  • The video pair comparison task is more challenging than single-video evaluation.
  • Open-ended questions expose the limitations of LMMs in explaining the causes of quality issues.

Highlights & Insights

  • The first work to propose LMM video quality understanding as an independent research direction, filling an important gap.
  • The AIGC distortion dimension is timely: as video generation models become popular, demand for this kind of evaluation is growing quickly.
  • The balanced design of Yes-or-No questions and the introduction of open-ended questions enhance the comprehensiveness and authenticity of the evaluation.
  • The benchmark reveals the fundamental limitations of LMMs in low-level information perception.

Limitations & Future Work

  • The scale of 2,378 QA pairs can be further expanded.
  • Open-ended evaluation relying on GPT-4 may introduce bias.
  • The video quality scoring capability of LMMs (quantitative scoring vs. qualitative description) was not evaluated.
  • The framework can be extended to evaluate outputs from more diverse video generation models.

Related Work & Insights

  • vs Video-MME/MVBench: These benchmarks focus on semantic understanding, whereas Q-Bench-Video focuses on low-level quality understanding; the two are complementary.
  • vs Traditional VQA Methods: Traditional methods output quality scores, whereas Q-Bench-Video evaluates how well LMMs understand and explain quality.
  • vs Q-Bench (Image Version): Q-Bench-Video extends the image quality benchmark paradigm to videos, adding temporal and AIGC dimensions.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The first LMM benchmark for video quality, a pioneering direction.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — A comprehensive evaluation with 17 models, 4 dimensions, and 3 question types.
  • Writing Quality: ⭐⭐⭐⭐ — Clear benchmark design principles and a complete taxonomic system.
  • Value: ⭐⭐⭐⭐⭐ — Provides a standardized evaluation platform for video quality understanding research.