
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Conference: CVPR 2025
arXiv: 2501.12380
Code: https://github.com/yale-nlp/MMVU
Area: Video Understanding
Keywords: Video understanding benchmark, expert-level reasoning, multi-disciplinary evaluation, domain knowledge, multimodal foundation models

TL;DR

This paper proposes MMVU, an expert-annotated benchmark of 3,000 video understanding questions across 27 disciplines. It evaluates how well multimodal foundation models apply expert-level knowledge and reasoning to videos from professional domains, and shows that even the strongest models still lag well behind human experts.

Background & Motivation

Existing video understanding benchmarks primarily focus on general scenarios (e.g., action recognition, captioning) but lack evaluations of expert-level reasoning in professional domains. However, video serves as a crucial modality for conveying complex dynamic information in many specialized fields (such as medicine, engineering, and scientific research). For instance, when analyzing a chemical reaction video, models must identify visual cues such as color changes and reason using domain-specific chemical knowledge.

While multi-disciplinary benchmarks (e.g., MMLU, MMMU) exist, they target text or images, leaving expert-level reasoning in the video modality largely unevaluated. In MMWorld, the only closely related work, only 39.5% of samples require professional domain knowledge, and 76.4% of them are synthetically generated by GPT-4V. MMVU fills this gap with a fully manual, textbook-guided annotation pipeline in which every example is written from scratch.

Method

Overall Architecture

The construction of MMVU involves three stages: (1) Preparation — identifying 27 disciplines through a user study with 133 students and hiring 67 expert annotators; (2) Textbook-guided QA Annotation — annotators start from textbook concepts to find CC-licensed videos and construct expert-level questions and answers; (3) Data Quality Control — including time-based annotation compensation and manual verification by experts.

Key Designs

Textbook-guided Annotation Pipeline: Annotators first identify core concepts (e.g., experimental procedures, mechanical operations) from textbooks, then search YouTube for corresponding CC-licensed videos, and finally design questions that require domain knowledge and expert reasoning to answer. This ensures both breadth of knowledge (covering diverse textbook chapters) and depth of reasoning (requiring specialized reasoning rather than simple visual recognition). Each instance comes with an expert-annotated reasoning process and the relevant domain knowledge (linked to Wikipedia pages), which supports fine-grained analysis.
Strict Video Quality Constraints: Videos must be visually intensive. Audio is excluded (to prevent speech shortcuts), as are videos with excessive on-screen text (such as lecture recordings), so models must rely on visual understanding to answer. Experts validate each sample to ensure that watching the video is indispensable: the question cannot be answered from text alone or from a single static frame.
Multi-level Human Baseline Evaluation: A three-stage human evaluation is designed: closed-book (3.5 hours, 49.7% average accuracy), open-book (searching allowed, 86.8%), and Oracle (answers revised after the correct domain knowledge is provided, 95.3%). Together, these stages calibrate task difficulty.
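
The annotations described above suggest a natural per-example record. The following is a minimal sketch, assuming hypothetical field names (`video_url`, `textbook_concept`, `knowledge_links`, and so on) that are not taken from the MMVU release. It only illustrates how a question, its answer, the expert reasoning, and the linked domain knowledge could be kept together and checked for completeness.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BenchmarkInstance:
    """One hypothetical MMVU-style example; all field names are illustrative."""

    video_url: str          # CC-licensed source video
    discipline: str         # e.g. "Chemistry"
    textbook_concept: str   # concept the annotator started from
    question: str
    answer: str
    reasoning: str          # expert-written reasoning process
    knowledge_links: List[str] = field(default_factory=list)  # Wikipedia pages

    def missing_annotations(self) -> List[str]:
        """Return the names of required annotation fields that are empty."""
        required = {
            "question": self.question,
            "answer": self.answer,
            "reasoning": self.reasoning,
            "knowledge_links": self.knowledge_links,
        }
        return [name for name, value in required.items() if not value]


# Made-up example content, for illustration only.
complete = BenchmarkInstance(
    video_url="https://example.com/video",
    discipline="Chemistry",
    textbook_concept="Redox reactions",
    question="Which reaction type explains the color change in the video?",
    answer="Oxidation",
    reasoning="The solution changes color as the metal ion is oxidized.",
    knowledge_links=["https://en.wikipedia.org/wiki/Redox"],
)
incomplete = BenchmarkInstance(
    video_url="https://example.com/video",
    discipline="Chemistry",
    textbook_concept="Redox reactions",
    question="Which reaction type explains the color change in the video?",
    answer="Oxidation",
    reasoning="",
)
```

Here `complete.missing_annotations()` returns an empty list, while `incomplete.missing_annotations()` flags the missing reasoning and knowledge links, which is the kind of check a quality-control pass might run before expert verification.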

Loss & Training

This paper is a benchmarking work and does not involve model training. The evaluation employs two prompting strategies: Direct Answer and Chain-of-Thought (CoT). GPT-4o extracts each model's final answer and judges its correctness, and accuracy is computed from these judgments. The benchmark covers 32 cutting-edge multimodal models, including 16 open-source and 8 closed-source model series.
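
The difference between the two prompting strategies can be sketched as follows. This is only an illustration under stated assumptions: the function `build_prompt`, the prompt wording, and the example question are all hypothetical and are not the prompts used in the paper. The sketch covers a multiple-choice question only, and the video input and the GPT-4o judging step are left out.

```python
from typing import List


def build_prompt(question: str, options: List[str], strategy: str) -> str:
    """Build the text part of a prompt for one multiple-choice question.

    The wording is hypothetical; it is not the prompt used in the paper.
    """
    letters = "ABCDEFGHIJ"
    lines = [f"Question: {question}"]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    if strategy == "direct":
        # Direct Answer: ask for the final choice with no explanation.
        lines.append("Answer with the letter of the correct option only.")
    elif strategy == "cot":
        # Chain-of-Thought: ask for reasoning first, then a final answer.
        lines.append(
            "Think step by step about what the video shows and which domain "
            "knowledge applies, then end with 'Answer: <letter>'."
        )
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return "\n".join(lines)


# Made-up question, for illustration only.
question = "Which reaction type explains the color change in the video?"
options = ["Oxidation", "Neutralization", "Precipitation"]
direct_prompt = build_prompt(question, options, "direct")
cot_prompt = build_prompt(question, options, "cot")
```

Both prompts share the question and options and differ only in the final instruction, so any accuracy difference between the two settings can be attributed to the reasoning step rather than to the question text.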

Key Experimental Results

Main Results

Model	Science	Medicine	Humanities & Social Sciences	Engineering	Test Set Average
Human (Open-book)	84.7	92.7	83.3	86.8	86.8
o1	78.0	76.0	74.0	79.0	77.0
Gemini 2.0 Flash Thinking	71.2	73.4	67.3	69.1	69.5
GPT-4o	71.8	72.0	61.6	67.4	66.7
Claude 3.5 Sonnet	64.0	70.9	64.5	65.2	64.1
Qwen2-VL-72B (Best Open Source)	53.6	61.7	53.9	53.0	53.2
Human (Closed-book)	54.7	42.7	44.7	56.7	49.7
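
To make the distance to the human baselines concrete, the short sketch below recomputes each model's gap to the open-book human average from the "Test Set Average" column above. The variable names are illustrative; the numbers are copied from the table.

```python
# Test-set averages (accuracy, %) copied from the table above.
HUMAN_OPEN_BOOK = 86.8
HUMAN_CLOSED_BOOK = 49.7

model_average = {
    "o1": 77.0,
    "Gemini 2.0 Flash Thinking": 69.5,
    "GPT-4o": 66.7,
    "Claude 3.5 Sonnet": 64.1,
    "Qwen2-VL-72B": 53.2,
}

# Gap in percentage points between open-book humans and each model.
gap_to_open_book = {
    name: round(HUMAN_OPEN_BOOK - acc, 1) for name, acc in model_average.items()
}

# Models that score above the closed-book human average.
above_closed_book = [
    name for name, acc in model_average.items() if acc > HUMAN_CLOSED_BOOK
]

for name, gap in gap_to_open_book.items():
    print(f"{name}: {gap} points below open-book humans")
```

All five listed models exceed the closed-book human average, while none reaches the open-book average.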

Ablation Study

Configuration	Key Metric	Description
GPT-4o Direct Answer \(\rightarrow\) CoT	65.4 \(\rightarrow\) 66.7 (+1.3)	CoT provides limited improvement for GPT-4o
Claude 3.5 Sonnet Direct \(\rightarrow\) CoT	53.1 \(\rightarrow\) 64.1 (+11.0)	CoT significantly improves Claude
o1 (System-2 Reasoning)	77.0	Long-chain reasoning yields the best performance
Gemini 2.0 Flash Thinking	69.5	System-2 reasoning shows significant efficacy
Qwen2-VL-72B (Open Source) vs GPT-4o	53.2 vs 66.7	13.5-point gap between the best open-source model and GPT-4o
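
The CoT gains in the table are differences in accuracy, that is, percentage points. A minimal sketch, assuming a hypothetical helper `cot_gain`, shows how the same numbers look as an absolute gain and as a gain relative to the Direct Answer baseline:

```python
def cot_gain(direct: float, cot: float) -> tuple:
    """Return (absolute gain in points, relative gain in % of the baseline)."""
    absolute = round(cot - direct, 1)
    relative = round(100 * (cot - direct) / direct, 1)
    return absolute, relative


# Direct Answer and CoT accuracies copied from the table above.
gpt4o_gain = cot_gain(65.4, 66.7)
claude_gain = cot_gain(53.1, 64.1)

print("GPT-4o:", gpt4o_gain)
print("Claude 3.5 Sonnet:", claude_gain)
```

Keeping the two measures apart avoids reading a difference in points as a relative improvement, which matters most when the baseline is low.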

Key Findings

Even the strongest model, o1 (77.0%), still noticeably lags behind open-book human experts (86.8%), a gap of 9.8 percentage points.
While GPT-4o is close to human performance on MMLU (text) and MMMU (image), a wide gap remains in expert-level video reasoning.
CoT reasoning generally improves performance, but the gains vary dramatically across models (Claude +11.0 points vs. GPT-4o +1.3 points).
System-2 thinking (o1, Gemini Thinking) shows a significant advantage in expert-level video reasoning.
The error analysis identifies six major error categories. The four most frequent are domain knowledge reasoning errors (27%), lack of domain knowledge in visual perception (20%), over-reliance on text (20%), and visual perception errors (18%).
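
Only four of the six error categories are given with percentages. A short sketch, assuming the six shares add up to 100%, shows how much is left for the two categories not listed here:

```python
# Shares (%) of the four error categories reported in this summary.
reported_shares = {
    "Domain knowledge reasoning error": 27,
    "Lack of domain knowledge in visual perception": 20,
    "Over-reliance on text": 20,
    "Visual perception error": 18,
}

reported_total = sum(reported_shares.values())

# Assumption: all six categories together account for 100% of errors.
remaining_share = 100 - reported_total

print(f"Reported: {reported_total}%, remaining two categories: {remaining_share}%")
```

Under that assumption, the four listed categories cover most of the analyzed errors, and the three categories tied to domain knowledge or text reliance account for 67% on their own.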

Highlights & Insights

Each sample is accompanied by annotations for the expert-level reasoning process and domain knowledge. This enables error analysis to precisely locate where models fail among "what is seen", "what is known", and "how to reason".
Textbook-guided annotation ensures systematic coverage of knowledge rather than arbitrary topic selection.
Restricting the dataset to CC-licensed videos increases annotation difficulty, but it avoids copyright issues and keeps the benchmark sustainable to maintain.
The three-stage human evaluation (closed-book \(\rightarrow\) open-book \(\rightarrow\) Oracle) strictly calibrates task difficulty, providing a powerful reference for analyzing model capabilities.

Limitations & Future Work

Restricting to CC licenses leads to a scarcity of videos in certain disciplines (e.g., sports), limiting coverage depth.
The average video length is only 51.4 seconds, so long-video scenarios are not covered.
Evaluation of open-ended questions relies on GPT-4o, which may introduce evaluation bias.
Audio information is excluded, though audio is vital in certain professional scenarios (e.g., music, language learning).
The benchmark currently covers only 27 disciplines and can be expanded further.

Related Work & Insights

MMVU complements MMLU (text) and MMMU (image), extending this line of evaluation across modalities and filling the gap in the video modality.
Compared to MMWorld, MMVU features higher-quality annotations (100% manual vs. 76% GPT-generated) and focuses on questions that genuinely require domain knowledge.
The textbook-guided annotation pipeline can be generalized to the construction of other benchmarks requiring systematic knowledge coverage.
The error classification taxonomy (visual perception, domain knowledge, reasoning, and text dependency) points to clear directions for model improvement.

Rating

Novelty: ⭐⭐⭐⭐ The first high-quality multi-disciplinary video expert-level reasoning benchmark, filling a vital gap.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation on 32 models with detailed error analysis and human baselines.
Writing Quality: ⭐⭐⭐⭐⭐ Clear problem definitions, rigorous annotation pipeline, and profound analysis.
Value: ⭐⭐⭐⭐⭐ Highly valuable for promoting the deployment of multimodal models in professional domains.