FineVQ: Fine-Grained User Generated Content Video Quality Assessment¶
Conference: CVPR 2025
arXiv: 2412.19238
Code: https://github.com/IntMeGroup/FineVQ
Area: Recommender Systems
Keywords: Video Quality Assessment, User Generated Content, Fine-Grained Assessment, Large Multimodal Models, Instruction Tuning
TL;DR¶
This work constructs the first large-scale, fine-grained UGC video quality assessment database, FineVD (6,104 videos, 800k+ ratings, 6 dimensions), and proposes FineVQ, an LMM-based approach. FineVQ enables a single model to simultaneously perform quality rating, scoring, and attribution, achieving state-of-the-art performance on FineVD and other UGC-VQA datasets.
Background & Motivation¶
Background: With the explosive growth of UGC videos, video quality assessment (VQA) has become crucial for content monitoring, optimization, and recommendation on platforms. However, existing UGC-VQA methods (such as VSFA, SimpleVQA, and DOVER) typically output only a single overall quality score.
Limitations of Prior Work: A single overall quality score cannot satisfy the diverse needs of downstream applications. Video processing pipelines need to identify which specific dimension (e.g., distortion, blur, color) has issues; recommender systems require multi-dimensional quality signals; and content creators need to understand specific quality defects. Existing databases also lack multi-dimensional, fine-grained annotations.
Key Challenge: Application scenarios demand fine-grained, multi-dimensional quality information, whereas existing databases only provide coarse-grained overall scores, which limits models to outputting a single score. Furthermore, using separate models to evaluate each dimension individually is highly inefficient and suffers from poor consistency.
Goal: (1) To establish the first large-scale UGC-VQA database with multi-dimensional, fine-grained quality annotations; (2) to design a "one-for-all" fine-grained quality assessment method.
Key Insight: Large Multimodal Models (LMMs) combine strong visual understanding with text generation, so instruction tuning can give a single model multi-task capabilities: quality scoring (regression), quality rating (classification), and quality description (generation).
Core Idea: To construct a database with fine-grained annotations across 6 dimensions (color, noise, artifacts, blur, temporal, and overall quality), and to train a "one-for-all" video quality assessment model based on the InternVL framework, utilizing dual spatial-motion visual encoders and LoRA fine-tuning.
Method¶
Overall Architecture¶
Given an input UGC video and a user prompt, the input is encoded through three parallel pathways: (1) a spatial image encoder (InternViT) extracts spatial features from 8 uniformly sampled frames; (2) a motion encoder (SlowFast) extracts temporal motion features from the entire video; and (3) a text tokenizer encodes the user prompt. The tokens from these three pathways are concatenated and fed into a pre-trained Large Language Model (InternLM), which generates quality-related responses: quality ratings (classification), quality scores (regression), or quality descriptions (attribution).
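The frame sampling and token concatenation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the choice of segment centres as sample points, and the spatial-motion-text token order are all assumptions.

```python
def uniform_frame_indices(num_frames: int, num_samples: int = 8) -> list[int]:
    """Pick `num_samples` frame indices spread evenly over a clip."""
    if num_frames <= 0:
        raise ValueError("video must contain at least one frame")
    # Assumption: take the centre of each of `num_samples` equal-width segments.
    # Clips shorter than `num_samples` frames simply repeat indices.
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / num_samples))
        for i in range(num_samples)
    ]


def build_llm_input(spatial_tokens, motion_tokens, text_tokens) -> list:
    """Concatenate the three token streams into one sequence for the LLM."""
    # Assumption: the order spatial -> motion -> text is illustrative only.
    return list(spatial_tokens) + list(motion_tokens) + list(text_tokens)


if __name__ == "__main__":
    print(uniform_frame_indices(80))
    print(build_llm_input(["s0", "s1"], ["m0"], ["How", "is", "the", "blur?"]))
```

The motion pathway sees the whole clip rather than the 8 sampled frames, which is why it is kept as a separate token stream in the sketch.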
Key Designs¶
- FineVD Database Construction:
  - Function: Provides the first large-scale, multi-dimensional, fine-grained UGC-VQA annotated dataset.
  - Mechanism: 6,104 UGC videos (including professional and user-shot live broadcasts and VODs) were collected from Bilibili. Twenty-two professional annotators rated the videos in a laboratory environment across six dimensions (color, noise, artifacts, blur, temporal, and overall) using a 5-level scale, yielding over 800k ratings in total (6,104 × 6 × 22 = 805,728 if every annotator rates every video on every dimension). Distortion types were also annotated, and GPT-4 was used to generate quality-related QA pairs, which were manually verified. Together these annotations form the training data for three tasks: quality rating, scoring, and description.
  - Design Motivation: Existing databases contain only overall ratings and therefore cannot support fine-grained quality assessment research. Laboratory annotation also allows tighter quality control than crowdsourcing.
- Dual Visual Encoders + LoRA Fine-Tuning:
  - Function: Gives the pre-trained LMM quality-aware capabilities without substantially increasing the number of trainable parameters.
  - Mechanism: Spatial features are extracted by InternViT (from 8 sampled frames), while motion features are derived from the entire video using a SlowFast network. Both feature pathways are projected into the language space via 2-layer MLPs. LoRA weights are applied to both the image encoder and the LLM for low-rank adaptation, injecting quality assessment domain knowledge while preserving the general capabilities of the foundation model.
  - Design Motivation: Sparse frame sampling is insufficient for capturing temporal quality issues such as jitter and lag, so an auxiliary motion feature extractor is needed. LoRA avoids the high computational cost of full-parameter fine-tuning while retaining flexibility across the different quality dimensions.
- Multi-Task Unification via Instruction Tuning:
  - Function: Enables a single model to handle quality rating (classification), quality scoring (regression), and quality description (text generation).
  - Mechanism: Different types of instruction-answer pairs (QA pairs) are designed. The rating task requires the model to output classes such as "good/fair/poor"; the scoring task requires a numerical output between 0 and 100; and the description task requires free-form natural-language text describing the quality degradation. Mixing these three types of QA pairs during training achieves multi-task integration.
  - Design Motivation: Different applications require different granularities of quality feedback. Unifying them in a single LMM framework lets the tasks share low-level visual representations while natural-language instructions distinguish the task type.
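The three instruction types described in the last item could be assembled roughly as follows. This is a hedged sketch: the prompt wording, the dictionary layout, and the helper names `build_qa_pairs` and `mix_tasks` are hypothetical; only the three task types, the six dimensions, and the 0 to 100 score range come from the notes above.

```python
import random

DIMENSIONS = ("color", "noise", "artifacts", "blur", "temporal", "overall")


def build_qa_pairs(dimension: str, score: float, rating: str, description: str) -> list[dict]:
    """Return one instruction-answer pair per task for one video and dimension."""
    if dimension not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    if not 0 <= score <= 100:
        raise ValueError("score must lie in [0, 100]")
    # Prompt wording is hypothetical; the real instruction templates may differ.
    return [
        {"task": "rating",
         "question": f"How would you rate the {dimension} quality of this video?",
         "answer": rating},
        {"task": "scoring",
         "question": f"Score the {dimension} quality of this video from 0 to 100.",
         "answer": score},
        {"task": "description",
         "question": f"Describe the {dimension} quality of this video.",
         "answer": description},
    ]


def mix_tasks(per_video_pairs: list[list[dict]], seed: int = 0) -> list[dict]:
    """Flatten all pairs and shuffle them so training batches mix the three tasks."""
    mixed = [pair for pairs in per_video_pairs for pair in pairs]
    random.Random(seed).shuffle(mixed)
    return mixed


if __name__ == "__main__":
    pairs = build_qa_pairs("blur", 62.5, "fair", "Mild defocus blur in the background.")
    for pair in mix_tasks([pairs]):
        print(pair["task"], "->", pair["answer"])
```

Keeping the task type implicit in the question text, rather than in a separate model head, is what lets one set of weights serve all three tasks.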
Loss & Training¶
The scoring task uses a regression loss (MSE), while the rating and description tasks use the standard language-modeling cross-entropy loss. Training follows a two-stage strategy: the first stage freezes the visual encoders and the LLM and trains only the projection layers; the second stage unfreezes the LoRA weights for end-to-end fine-tuning.
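For reference, the two loss types named above can be written out in plain Python. This is an illustrative sketch of the generic definitions (mean squared error and token-level cross-entropy), not the training code of FineVQ; the function names are assumptions.

```python
import math


def mse_loss(predicted: list[float], target: list[float]) -> float:
    """Mean squared error between predicted and annotated quality scores."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def token_cross_entropy(logits: list[list[float]], target_ids: list[int]) -> float:
    """Average negative log-likelihood of the target tokens (language-modeling loss)."""
    if len(logits) != len(target_ids) or not logits:
        raise ValueError("need one row of logits per target token")
    total = 0.0
    for row, target in zip(logits, target_ids):
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
    return total / len(target_ids)


if __name__ == "__main__":
    print(mse_loss([70.0, 55.0], [72.0, 50.0]))
    print(token_cross_entropy([[2.0, 0.5, -1.0]], [0]))
```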
Key Experimental Results¶
Main Results (FineVD Scoring Task)¶
| Method | Overall SRCC | Overall PLCC | Noise SRCC | Blur SRCC |
|---|---|---|---|---|
| Ours (FineVQ) | 0.8834 | 0.8891 | 0.8444 | 0.8711 |
| DOVER | 0.8422 | 0.8393 | 0.8018 | 0.8404 |
| SimpleVQA | 0.8311 | 0.8358 | 0.8070 | 0.8466 |
| FAST-VQA | 0.8348 | 0.8474 | 0.8093 | 0.8352 |
| VIDEVAL (Traditional) | 0.7310 | 0.7307 | 0.6912 | 0.7610 |
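
The two metrics in the table, SRCC (Spearman rank-order correlation) and PLCC (Pearson linear correlation), can be computed as in the sketch below. It uses the textbook definitions only; VQA papers often apply a nonlinear fit to the predictions before computing PLCC, and that step is omitted here. The function names are illustrative.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Raises ZeroDivisionError if either input is constant.
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank-order correlation coefficient (SRCC)."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    predicted = [61.0, 48.5, 77.2, 30.1]
    annotated = [58.0, 52.0, 80.0, 25.0]
    print(spearman(predicted, annotated), pearson(predicted, annotated))
```

SRCC depends only on the ordering of the predictions, while PLCC also rewards a linear relationship with the annotated scores, which is why the two columns can differ for the same method.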
Ablation Study¶
| Configuration | Setup / Result | Notes |
|---|---|---|
| DNN Method (per-dim) | Each dimension trained individually | Separate model for each dimension |
| FineVQ (one-for-all) | Unified across six dimensions | Single model with multiple dimensions, achieving superior performance |
| General LMM (Zero-Shot) | Lower quality attribution accuracy | InternVL2 zero-shot ~50-70% |
| FineVQ Quality Attribution | 87-95% across all dimensions | Significant improvement after instruction tuning |
Key Findings¶
- FineVQ, using a one-for-all strategy, outperforms individually trained DNN-based methods (e.g., DOVER, SimpleVQA) across all six dimensions, illustrating the benefit of a unified model.
- The integration of the motion feature encoder yields notable performance gains in temporal quality assessment.
- General-purpose LMMs (such as InternVL2 and Qwen2-VL) show poor zero-shot performance in quality attribution tasks (~50-70%), whereas fine-tuning via FineVQ boosts accuracy to 87-95%.
- FineVQ displays competitive cross-dataset generalization capabilities on external benchmarks (e.g., LSVQ, KoNViD-1k).
Highlights & Insights¶
- First Multi-Dimensional, Fine-Grained VQA Database: FineVD fills the gap left by the absence of fine-grained annotations in the UGC-VQA field. The scale of 6 dimensions × 6,104 videos × 22 annotators is sufficient to support modern deep learning research.
- One-for-All Design Philosophy: Unifying scoring, rating, and description into a single model via instruction tuning eliminates the complexity of maintaining multiple dedicated models. Notably, shared representations in turn boost the performance of each sub-task.
- Emergent Application of LMMs to Low-Level Vision: The work showcases the untapped capability of LMMs in traditional low-level visual quality evaluation, demonstrating that low-rank adaptation with a fraction of parameters is sufficient to inject strong quality perception.
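To make the "fraction of parameters" point in the last highlight concrete, the sketch below counts the trainable parameters of a LoRA adapter relative to the frozen weight it adapts. It relies only on the standard LoRA formulation (a rank-r update B·A added to a frozen weight W); the layer size and rank in the example are illustrative and are not FineVQ's actual configuration.

```python
def lora_param_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Trainable share when a d_out x d_in weight receives a rank-`rank` LoRA update."""
    if min(d_in, d_out, rank) <= 0:
        raise ValueError("dimensions and rank must be positive")
    full = d_in * d_out              # frozen weight W
    adapter = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return adapter / full


if __name__ == "__main__":
    # Hypothetical layer: 4096 x 4096 projection adapted with rank 16.
    fraction = lora_param_fraction(4096, 4096, 16)
    print(f"{fraction:.4%} of the layer's parameters are trainable")
```

For this hypothetical layer the adapter holds under 1% of the parameters of the weight it modifies, which is the sense in which low-rank adaptation is cheap compared with full fine-tuning.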
Limitations & Future Work¶
- All source videos in the database are collected from a single platform (Bilibili), which may introduce platform-specific content and quality distribution biases.
- The six quality dimensions are pre-defined, limiting direct exploration of finer grains (e.g., specific severity levels of individual distortion types) or adaptive dimension discovery.
- Model inference demands LMM-scale computational resources, making it less suitable for edge devices or real-time application scenarios.
- The PLCC of the scoring task on the noise dimension (0.7986) is lower than on the other dimensions, indicating room for improvement in noise quality assessment.
Related Work & Insights¶
- vs DOVER: DOVER covers technical and aesthetic qualities but only across two dimensions without rating or description capabilities. FineVQ expands this to six dimensions and supports multiple tasks.
- vs Q-Align: Q-Align also uses LMMs for quality assessment but primarily targets images and yields only overall scores. FineVQ focuses on videos, assesses multiple quality dimensions, and integrates a motion encoder.
- vs SimpleVQA: SimpleVQA simplifies quality prediction via pre-trained features and linear regression layers, but fails to provide text descriptions or quality attribution. FineVQ also demonstrates superior scoring performance over SimpleVQA.
Rating¶
- Novelty: ⭐⭐⭐⭐ The first fine-grained VQA database combined with an LMM-based multi-task unified framework fills a critical gap in the field.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across multiple benchmarks, comparative studies against diverse baselines, and rigorous evaluation of all three tasks (rating, scoring, and description).
- Writing Quality: ⭐⭐⭐⭐ Clear structure, logical descriptions of database construction steps, and rich visualization.
- Value: ⭐⭐⭐⭐ Both the dataset and models are open-sourced, delivering high practical value to the UGC video quality assessment community.