VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Conference: CVPR 2025
arXiv: 2411.14794
Code: https://github.com/hshjerry/VideoEspresso
Area: LLM Reasoning
Keywords: Video Chain-of-Thought, Core Frame Selection, Video QA, Multimodal Reasoning, Dataset

TL;DR

VideoEspresso constructs a large-scale video CoT reasoning dataset of over 200k samples with spatial bounding-box and temporal grounding annotations. It also proposes a hybrid framework, VideoQA-SC, in which a lightweight 1.5B model selects an average of 2.36 core frames and an 8B reasoning model then performs two-stage evidence extraction and answer generation. With only 1.8% of the frames and 14.7% of the computation, it outperforms GPT-4o and the compared open-source LVLMs.

Background & Motivation

Background: LVLMs have made significant progress in multimodal understanding, but still fall short in video reasoning tasks, primarily limited by the scarcity of high-quality, large-scale VideoQA datasets.

Limitations of Prior Work: (1) Manual annotation is costly and lacks fine-grained details; (2) Automatic construction approaches rely on frame-by-frame analysis, which is computationally expensive and introduces heavy redundancy; (3) Existing video CoT efforts reason mainly at the textual level and neglect visual grounding (e.g., where is the object, and in which frame does it appear?); (4) Current LVLMs typically sample a large number of frames uniformly (e.g., 128 frames) during video processing, which incurs a large computational overhead even though most frames are redundant.

Key Challenge: Video information is highly redundant, yet complex reasoning demands precise localization of key frames and key objects. Uniform sampling wastes computational resources, while under-sampling can potentially lose critical information.

Goal: (1) Construct a large-scale VideoQA dataset containing multimodal CoT annotations (spatial + temporal grounding); (2) Design an efficient video reasoning framework that achieves high-quality reasoning using minimal frames.

Key Insight: First use a semantic-aware approach for redundancy removal and QA pair generation, then use GPT-4o to annotate multimodal CoT (including core frames, key objects, and evidence chains), and finally train a hybrid framework that performs frame selection before reasoning.

Core Idea: Employ a lightweight model to select 2–3 core frames, and then utilize a larger model to perform two-stage fine-grained video reasoning in an "evidence extraction \(\rightarrow\) evidence-based reasoning" manner.

Method

Overall Architecture

The proposed method consists of two parts: data construction and the model framework.

Data Construction: Videos are collected from 7 video datasets, followed by adaptive FPS sampling \(\rightarrow\) InternVL2-8B frame description \(\rightarrow\) BGE-M3 semantic redundancy removal \(\rightarrow\) GPT-4o QA pair generation \(\rightarrow\) Claude/Gemini quality filtering \(\rightarrow\) GPT-4o multimodal CoT annotation (core frame selection + key object extraction + evidence generation) \(\rightarrow\) GroundingDINO spatial labeling + BGE-M3 temporal retrieval.

Model Framework (VideoQA-SC): Stage 1: A lightweight Frame Selector (InternVL2-1B + Qwen-0.5B) selects core frames; Stage 2: Two-stage SFT for the reasoning LVLM: first training on evidence extraction, and then training on evidence-based reasoning for answer generation.

Key Designs

Semantic-Aware Frame Redundancy Elimination:
- Function: Efficiently remove redundant frames from videos while retaining key information.
- Mechanism: First apply adaptive FPS sampling (FPS 2–4 for dynamic scenes, FPS 1 for static scenes), then use InternVL2-8B to generate descriptions for each frame. Compute semantic similarity of adjacent frame descriptions via the BGE-M3 model, and eliminate redundant frames whose similarity exceeds a threshold \(\tau\) (using a LIFO filtering strategy).
- Design Motivation: Determining redundancy based on textual semantics rather than pixel-level differences better captures changes at the content level.
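
The filtering step can be illustrated with a minimal sketch. It assumes the frame descriptions have already been embedded (the pipeline uses BGE-M3; plain Python lists stand in for the vectors here), and it reads the "LIFO" strategy as comparing each incoming frame with the most recently kept frame. The function name `drop_redundant_frames` and the default `tau=0.9` are illustrative assumptions, not values from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def drop_redundant_frames(embeddings, tau=0.9):
    """Return indices of frames to keep.

    Assumption: a frame is redundant when the embedding of its description
    is more similar than `tau` to that of the last kept frame.
    """
    kept = []  # used as a stack; kept[-1] is the most recently kept frame
    for idx, emb in enumerate(embeddings):
        if kept and cosine(embeddings[kept[-1]], emb) > tau:
            continue  # too similar to the last kept frame: drop it
        kept.append(idx)
    return kept


# Toy 2-D "embeddings": frames 0 and 1 are near-duplicates, as are 2 and 3.
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.0, 1.0]]
kept_indices = drop_redundant_frames(toy, tau=0.9)
print(kept_indices)
```

Comparing against the last kept frame, rather than the immediately preceding raw frame, avoids keeping a long run of slowly drifting frames whose pairwise neighbours are each similar.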
Multimodal CoT Annotation Pipeline:
- Function: Generate a reasoning chain containing spatial and temporal grounding details for each QA pair.
- Mechanism: A three-step process: (1) GPT-4o selects the core frame descriptions most relevant to the question from the generated frame descriptions; (2) Key objects are extracted from these core frame descriptions to serve as evidence; (3) These key objects are organized into a natural language reasoning chain. For spatial annotations, GroundingDINO is used to label bounding boxes, which are then verified for consistency via CLIP. For temporal annotations, BGE-M3 semantic retrieval is employed to match the original frames.
- Design Motivation: Text-only CoT lacks visual grounding. Multimodal CoT uses spatial and temporal localization to make the reasoning process traceable and verifiable.
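
The temporal-grounding step (matching a selected core-frame description back to an original frame) amounts to nearest-neighbour retrieval in embedding space. The sketch below is a simplified illustration under that assumption: embeddings are passed in as plain lists rather than computed with BGE-M3, and `retrieve_frame` is a hypothetical helper name. The GroundingDINO and CLIP steps are not modelled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_frame(core_embedding, frame_embeddings):
    """Return (index, score) of the frame most similar to the core description."""
    if not frame_embeddings:
        raise ValueError("frame_embeddings must not be empty")
    best_idx, best_score = 0, float("-inf")
    for idx, emb in enumerate(frame_embeddings):
        score = cosine(core_embedding, emb)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score


# Toy example: the core description is closest to frame 1.
frames = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
match_idx, match_score = retrieve_frame([0.5, 0.9], frames)
print(match_idx, round(match_score, 3))
```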
VideoQA-SC Hybrid LVLM Collaborative Framework:
- Function: Achieve highly efficient and precise video reasoning.
- Mechanism: Frame Selector: A 1B-parameter LVLM generates frame descriptions, and a 0.5B-parameter LLM selects core frames based on the question (averaging only 2.36 frames). Two-Stage Reasoning LVLM: Stage 1 trains on evidence extraction ("Please provide evidence helpful to answer the question"), and Stage 2 trains on answer generation ("Please answer the question based on the evidence"). This progressive two-stage training ensures step-by-step integration of multimodal information.
- Design Motivation: Decoupling frame selection and reasoning allows a highly compact model to perform frame selection (saving compute) so that the larger model only processes the 2–3 most critical frames. Meanwhile, two-stage SFT prevents the model from bypassing evidence and directly guessing the answer.
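
A minimal sketch of how the two components might be chained at inference time. The source describes the two stages as training objectives; running them back to back at inference, as well as the `select_frames` and `reason` callables and the prompt layout, are assumptions made for illustration. Only the two instruction strings are taken from the description above.

```python
from typing import Callable, Dict, List

EVIDENCE_PROMPT = "Please provide evidence helpful to answer the question"
ANSWER_PROMPT = "Please answer the question based on the evidence"


def answer_question(
    frames: List[str],
    question: str,
    select_frames: Callable[[List[str], str], List[str]],
    reason: Callable[[List[str], str], str],
) -> Dict[str, object]:
    """Select core frames, extract evidence, then answer from that evidence.

    `select_frames` stands in for the small frame selector and `reason`
    for the larger reasoning LVLM; both are hypothetical callables.
    """
    core_frames = select_frames(frames, question)
    evidence = reason(core_frames, f"{EVIDENCE_PROMPT}\nQuestion: {question}")
    answer = reason(
        core_frames,
        f"{ANSWER_PROMPT}\nQuestion: {question}\nEvidence: {evidence}",
    )
    return {"core_frames": core_frames, "evidence": evidence, "answer": answer}
```

Passing the models in as callables keeps the sketch independent of any particular LVLM API, and makes it easy to swap in a different reasoning model behind the same selector.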

Loss & Training

Both stages of SFT utilize LoRA fine-tuning: learning rates are \(2 \times 10^{-5}\) and \(10^{-5}\) respectively, batch size = 16, \(8 \times \text{A100}\), LoRA rank = 16, alpha = 32. The input resolution is \(224 \times 224\), max tokens = 6144, 1 epoch of training with cosine decay.
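
The reported hyperparameters can be collected into a small configuration sketch. Several points are assumptions rather than facts from the source: that the two learning rates map to stage 1 and stage 2 in that order, that the cosine schedule has no warmup and decays to zero, and the names `StageConfig` and `cosine_lr`.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one SFT stage, as listed in the summary above."""
    name: str
    learning_rate: float
    batch_size: int = 16
    lora_rank: int = 16
    lora_alpha: int = 32
    image_size: int = 224
    max_tokens: int = 6144
    epochs: int = 1


# Assumed mapping of the two reported learning rates to the two stages.
STAGES = (
    StageConfig("evidence_extraction", 2e-5),
    StageConfig("answer_generation", 1e-5),
)


def cosine_lr(step, total_steps, base_lr):
    """Cosine decay from base_lr to 0 (assumes no warmup)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for stage in STAGES:
    halfway = cosine_lr(50, 100, stage.learning_rate)
    print(stage.name, stage.learning_rate, halfway)
```

Under this schedule the learning rate at the midpoint of training is half the base value.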

Key Experimental Results

Main Results

| Model | Params | Frames | TFLOPs | Avg Acc (14 tasks) |
|---|---|---|---|---|
| GPT-4o | - | 3 FPS | - | 26.4% |
| Qwen-VL-Max | - | 3 FPS | - | 26.0% |
| InternVL2-8B | 8B | 1 FPS | 73.2 | 28.7% |
| Qwen2-VL-7B | 7B | 1 FPS | 64.6 | 28.5% |
| LongVA-DPO-7B | 7B | 128 frames | 465.4 | 24.4% |
| VideoEspresso | 8.5B | 2.36 frames | 9.26 | 34.1% |

Subjective evaluation (described as rated on a 10-point scale; the scores below appear to be rescaled to 0–100):

| Model | Logicality | Factuality | Accuracy | Conciseness | Overall |
|---|---|---|---|---|---|
| GPT-4o | 73.2 | 63.1 | 61.7 | 70.0 | 66.1 |
| InternVL2 | 70.6 | 56.3 | 54.5 | 66.8 | 60.1 |
| VideoEspresso | 72.3 | 61.3 | 59.7 | 75.7 | 65.8 |

Ablation Study

| Configuration | Accuracy | Delta |
|---|---|---|
| Full model | 34.13% | - |
| GT-CoT (oracle evidence) | 72.95% | +38.82% |
| w/o Bbox (w/o spatial labels) | 33.14% | -0.99% |
| w/o CoT (w/o reasoning chain) | 31.32% | -2.81% |

Generalizability of the Core Frame Selector:

| Model | Uniform Sampling | + Selector | Delta |
|---|---|---|---|
| GPT-4o (16 frames) | 26.9% | 29.5% (2.36 frames) | +2.6% |
| InternVL2 (16 frames) | 28.6% | 30.0% (2.36 frames) | +1.4% |

Key Findings

2.36 frames outperform 128 frames: Using only 1.8% of the frames and 2% of the FLOPs of LongVA-DPO, VideoEspresso surpasses it by 9.7 percentage points in accuracy, suggesting that carefully selected sparse frames can beat brute-force dense sampling.
Large headroom with GT-CoT: Supplying ground-truth evidence raises accuracy from 34.1% to 73.0% (+38.8 percentage points), indicating that the evidence extraction capabilities of current models still have substantial room for improvement.
Core Frame Selector as a plug-and-play module: Plugging it directly into GPT-4o yields a 2.6 percentage point accuracy improvement while reducing the frame count by 85% (from 16 to 2.36 frames).
Better conciseness than GPT-4o: The conciseness score in the subjective evaluation reaches 75.7 vs. 70.0 for GPT-4o, suggesting that selecting core frames reduces redundant output.
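
The ratios quoted in these findings can be rechecked from the numbers in the results tables. The snippet below is only an arithmetic check; the variable names are illustrative.

```python
# Values copied from the main results and selector tables.
frames_ours, frames_longva, frames_uniform = 2.36, 128, 16
tflops_ours, tflops_longva = 9.26, 465.4
acc_ours, acc_longva = 34.1, 24.4

frame_ratio = frames_ours / frames_longva            # about 1.8% of LongVA-DPO's frames
flops_ratio = tflops_ours / tflops_longva            # about 2% of LongVA-DPO's TFLOPs
acc_gap_pp = acc_ours - acc_longva                   # gap in percentage points
frame_reduction = 1 - frames_ours / frames_uniform   # reduction vs. 16 uniform frames

print(f"frame ratio:     {frame_ratio:.1%}")
print(f"FLOPs ratio:     {flops_ratio:.1%}")
print(f"accuracy gap:    {acc_gap_pp:.1f} pp")
print(f"frame reduction: {frame_reduction:.0%}")
```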

Highlights & Insights

"Less is more" works surprisingly well in video reasoning: an average of 2.36 selected frames beats 128 uniformly sampled frames. Instead of sampling uniformly, the core frame selector compresses the video by choosing frames based on their semantic relevance to the question.
The two-stage "evidence-first, reasoning-second" training strategy is highly elegant: it prevents the model from taking shortcuts like "predicting the answer from the question directly" and forces it to explicitly extract visual evidence first.
Highly automated dataset construction pipeline: The entire pipeline (InternVL2 frame description \(\rightarrow\) BGE-M3 redundancy removal \(\rightarrow\) GPT-4o QA generation \(\rightarrow\) multi-LLM cross-verification \(\rightarrow\) GroundingDINO spatial annotation) exhibits strong scalability.

Limitations & Future Work

The GT-CoT experiment reveals a large gap between model-extracted evidence and ground-truth evidence (34.1% vs. 73.0%), leaving substantial room to improve evidence extraction.
The test set contains only 1,382 items, which is relatively small and makes the reported accuracies sensitive to statistical variance.
The selector slightly degrades performance on certain models (e.g., -0.6% for LongVA), suggesting that the generalizability of the selector needs further improvement.
The distribution among the 14 tasks is highly imbalanced (e.g., 87k for causal reasoning vs. 276 for cooking steps), which might affect the fairness of the evaluation.
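
To give a rough sense of the variance concern for a 1,382-item test set, the sketch below computes a binomial standard error for the reported 34.1% accuracy. It is a back-of-the-envelope estimate under a strong simplifying assumption: the average accuracy is treated as a single proportion over independent items, which ignores the per-task averaging and the task imbalance noted above.

```python
import math

n_items = 1382    # test set size reported above
accuracy = 0.341  # reported average accuracy of the full model

# Standard error of a binomial proportion: sqrt(p * (1 - p) / n).
std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_items)

# Half-width of an approximate 95% interval (normal approximation).
half_width_95 = 1.96 * std_err

print(f"standard error: {std_err * 100:.2f} pp")
print(f"95% half-width: {half_width_95 * 100:.2f} pp")
```

Under these assumptions the interval half-width is about 2.5 percentage points, which is comparable in size to several of the differences reported in the ablation and selector tables.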

vs. LongVA-DPO: LongVA processes 128 frames per video in a brute-force manner, whereas VideoEspresso outperforms it by 9.7 percentage points using only 2.36 frames and about 2% of the FLOPs.
vs. MVBench: MVBench focuses on basic video understanding (e.g., "what object is this"), whereas VideoEspresso emphasizes complex reasoning ("why", "inferring"), which presents a fundamental difference at the dataset design level.
vs. Text-only CoT methods: Text-only CoT lacks visual grounding, whereas VideoEspresso introduces spatial bounding boxes and temporal localization to make the reasoning chain traceable.

Rating

Novelty: ⭐⭐⭐⭐ The multimodal CoT annotation and the hybrid frame selection-reasoning framework are novel designs, though individual sub-modules (frame descriptions, semantic retrieval, etc.) are combinations of existing techniques.
Experimental Thoroughness: ⭐⭐⭐⭐ Extensive comparisons across 14 tasks and 9 models, coupled with thorough ablation studies and cross-model generalization validation.
Writing Quality: ⭐⭐⭐⭐ The pipeline is clearly described, accompanied by rich figures and tables.
Value: ⭐⭐⭐⭐⭐ The 200k high-quality video CoT dataset combined with an efficient reasoning framework acts as a significant catalyst for video understanding research.