Generative Frame Sampler for Long Video Understanding¶

Conference: ACL 2025
arXiv: 2503.09146
Code: https://generative-sampler.github.io
Area: Video Understanding / Multimodal VLM
Keywords: Long Video Understanding, Frame Sampling, VideoLLM, Question-Aware Sampling, Plug-and-Play

TL;DR¶

GenS is a generative frame sampling module built on a VideoLLM. Given a question, it outputs the relevant frame intervals and their confidence scores in natural language. As a plug-and-play module, it improves multiple VideoLLMs on LongVideoBench, MLVU, and HourVideo, typically by 2-4 points (gains in the main results table range from +0.9 to +7.4).

Background & Motivation¶

Background: Long video understanding is a core challenge for current VideoLLMs. Due to context window limitations, VideoLLMs must sample a limited number of frames from long videos. Mainstream methods use uniform sampling or fixed FPS, ignoring the relevance between the question and the video content.

Limitations of Prior Work: (1) Uniform sampling wastes a large number of tokens on irrelevant frames in long videos; (2) Semantic matching sampling based on CLIP or SigLIP can only score single frames independently, failing to understand temporal relationships between frames; (3) There is a lack of sampling methods that consider multi-hop reasoning and temporal logic.

Key Challenge: Efficient frame sampling is the bottleneck of long video understanding: the sampling strategy directly determines whether the VideoLLM can "see" the key frames required to answer the question.

Goal: To design a question-aware frame sampler that can understand temporal relationships between frames and select the most relevant frames.

Key Insight: Modeling frame sampling as a generative task, using a VideoLLM to directly output relevant frame intervals and confidence scores in natural language format.

Core Idea: A VideoLLM (Aria, with a 256-frame window) processes sparsely sampled video frames and generates continuous frame intervals with confidence scores (0-5). Frames are then densely sampled from the high-confidence intervals.

Method¶

Overall Architecture¶

Four-stage dataset construction (GenS-Video-150K) \(\rightarrow\) Training the GenS sampler (based on Aria) \(\rightarrow\) At inference, sampling first with GenS and then feeding the sampled frames into any downstream VideoLLM for answering.

Key Designs¶

GenS-Video-150K Dataset Construction:
- Function: Creating large-scale frame sampling training data.
- Mechanism: (1) Densely sampling frames from video and using a VLM to generate descriptions for each frame; (2) Using an LLM to generate QA pairs based on the frame descriptions while labeling grounded frames; (3) Using CLIP to expand relevant frame windows around the labeled frames; (4) Labeling the relevance of each frame to the question with fine-grained scores (0-5).
- Design Motivation: Approximately 20% of the frames are labeled (sparse yet precise). Continuous scores from 0 to 5 are more flexible than binary labels and support top-K retrieval.
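
Step (3) of the pipeline above, expanding a window around a grounded frame, can be illustrated with a minimal sketch. The source only states that CLIP is used for the expansion; the cosine-similarity threshold, the stopping rule, and the names `cosine` and `expand_window` are assumptions made for illustration, and the small vectors stand in for CLIP image embeddings.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def expand_window(embeddings, anchor, threshold=0.9):
    """Grow a contiguous window around `anchor` (hypothetical rule).

    A neighbor joins the window while its similarity to the anchor
    frame stays at or above `threshold` (an assumed value).
    """
    lo = hi = anchor
    while lo > 0 and cosine(embeddings[lo - 1], embeddings[anchor]) >= threshold:
        lo -= 1
    while (hi < len(embeddings) - 1
           and cosine(embeddings[hi + 1], embeddings[anchor]) >= threshold):
        hi += 1
    return lo, hi


# Toy embeddings: frames 0-3 look alike, frame 4 is a different scene.
toy = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.0], [0.95, 0.05], [0.0, 1.0]]
window = expand_window(toy, anchor=2)
```

Here the window stops at the scene change, so the grounded frame 2 is expanded to frames 0 through 3.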
Generative Frame Sampling (GenS):
- Function: Modeling frame sampling as a text-generation task.
- Mechanism: Inputting sparsely sampled frames and a question, and outputting continuous frame intervals (e.g., "frames 10-25") along with corresponding confidence scores. Frames are densely sampled from high-scoring intervals after sorting by confidence.
- Design Motivation: Generating continuous intervals (instead of discrete frame indices) captures temporal continuity; sorting by confidence enables adaptive top-K sampling.
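
The parse, sort, and sample steps described above can be sketched as follows. The exact text format GenS emits is not given in this note, so the `frames START-END: SCORE` pattern, the function names, and the frame `budget` parameter are assumptions for illustration only.

```python
import re

# Assumed output format, e.g. "frames 10-25: 4; frames 40-52: 5".
PATTERN = re.compile(r"frames\s+(\d+)-(\d+)\s*:\s*(\d+)")


def parse_intervals(text):
    """Return (start, end, score) tuples, dropping malformed entries."""
    intervals = []
    for start, end, score in PATTERN.findall(text):
        start, end, score = int(start), int(end), int(score)
        if start <= end and 0 <= score <= 5:  # scores are on a 0-5 scale
            intervals.append((start, end, score))
    return intervals


def select_frames(intervals, budget):
    """Fill a frame budget from intervals, highest confidence first."""
    chosen, seen = [], set()
    for start, end, _ in sorted(intervals, key=lambda iv: -iv[2]):
        for idx in range(start, end + 1):
            if len(chosen) == budget:
                return sorted(chosen)
            if idx not in seen:
                seen.add(idx)
                chosen.append(idx)
    return sorted(chosen)


generated = "frames 10-12: 3; frames 40-41: 5; frames 7-9: 9"
parsed = parse_intervals(generated)
picked = select_frames(parsed, budget=4)
```

Because intervals are consumed in confidence order, a downstream model with a smaller budget simply receives a prefix of the same ranking, which is the adaptive top-K behavior the design motivation refers to.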
Plug-and-Play Design:
- Function: GenS operates independently of downstream VideoLLMs.
- Mechanism: GenS first samples key frames \(\rightarrow\) Key frames are fed into any downstream VideoLLM \(\rightarrow\) Downstream VideoLLM answers the question. GenS is based on Aria (256-frame context), while downstream VLMs can be any model.
- Design Motivation: Decoupling sampling and understanding, allowing a single GenS to serve all downstream VideoLLMs.
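
The decoupling described above amounts to composing two independent callables. The sketch below is illustrative only: `Sampler`, `Answerer`, `answer_with_sampler`, and the stub models are hypothetical names, frames are represented by string labels, and no real VideoLLM API is used.

```python
from typing import Callable, List, Sequence

Frame = str  # placeholder for a decoded video frame

# Hypothetical interfaces: a sampler returns frame indices,
# an answerer maps (frames, question) to an answer string.
Sampler = Callable[[Sequence[Frame], str, int], List[int]]
Answerer = Callable[[Sequence[Frame], str], str]


def answer_with_sampler(video: Sequence[Frame], question: str,
                        sampler: Sampler, answerer: Answerer,
                        budget: int) -> str:
    """Sample key frames first, then hand them to any downstream model."""
    indices = sampler(video, question, budget)
    key_frames = [video[i] for i in indices]
    return answerer(key_frames, question)


def keyword_sampler(video, question, budget):
    """Stand-in for GenS: keep frames whose label occurs in the question."""
    hits = [i for i, frame in enumerate(video) if frame in question]
    return hits[:budget]


def model_a(frames, question):
    """Stub downstream model that reports how many frames it received."""
    return f"A saw {len(frames)} frames"


def model_b(frames, question):
    """Second stub downstream model that lists the frames it received."""
    return "B saw " + ",".join(frames)


video = ["cat", "dog", "car", "dog"]
question = "where is the dog?"
out_a = answer_with_sampler(video, question, keyword_sampler, model_a, budget=2)
out_b = answer_with_sampler(video, question, keyword_sampler, model_b, budget=2)
```

The same sampler serves both stub models without either knowing about the other, which mirrors how a single GenS instance can sit in front of different downstream VideoLLMs.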

Loss & Training¶

GenS is fine-tuned from the Aria model with the standard next-token prediction loss.
Task-specific prompts outperform unified prompts.
Text-only indices outperform mixed vision-text indices.
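
For reference, the standard next-token prediction loss mentioned above can be written as follows. The notation is introduced here rather than taken from the source: \(V\) denotes the sparsely sampled input frames, \(q\) the question, and \(y = (y_1, \dots, y_T)\) the target text containing the frame intervals and confidence scores. Restricting the sum to the target tokens is an assumption based on common fine-tuning practice.

\[
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, V,\, q\right)
\]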

Key Experimental Results¶

Main Results¶

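Each cell shows the score without GenS \(\rightarrow\) with GenS, with the gain in parentheses; "-" means not reported.

| Downstream VideoLLM | LongVideoBench | MLVU | HourVideo |
| --- | --- | --- | --- |
| LLaVA-Video-72B | 62.5\(\rightarrow\)66.8 (+4.3) | 74.3\(\rightarrow\)77.0 (+2.7) | - |
| Aria | 58.7\(\rightarrow\)66.1 (+7.4) | 69.5\(\rightarrow\)72.6 (+3.1) | 37.3\(\rightarrow\)39.2 (+1.9) |
| Qwen2-VL-7B | 58.7\(\rightarrow\)60.3 (+1.6) | 64.7\(\rightarrow\)66.9 (+2.2) | - |
| GPT-4o | 66.7\(\rightarrow\)67.6 (+0.9) | - | - |
| Gemini-1.5-pro | - | - | 37.3\(\rightarrow\)40.7 (+3.4) |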

Ablation Study¶

| Sampling Strategy | LongVideoBench (Aria) |
| --- | --- |
| Uniform | 54.4 |
| CLIP sampler | ~55 |
| GenS (GenS-Video-150K data only) | 57.7 (+3.3) |
| GenS full | 66.1 (+11.7) |

Key Findings¶

Consistent Improvement Across All Models: Both open-source (Qwen, LLaVA, Aria, VILA) and closed-source (GPT-4o, Gemini) models show improvements of 1-7 points.
Continuous Intervals > Discrete Frame Indices: Outputting continuous frame intervals with confidence scores (56.1) outperforms outputting discrete frame indices.
Largest Improvement on LongVideoBench: This benchmark requires multi-hop reasoning across time, showcasing the advantage of question-aware sampling most clearly.
GenS Can Generalize from Small to Large Models: GenS trained on Aria effectively improves the much larger LLaVA-Video-72B model.

Highlights & Insights¶

Paradigm Shift of "Frame Sampling as Generation": Redefines sampling from a retrieval/matching problem to a generation problem, allowing the VideoLLM to directly "speak out" which frames are important. This enables the model to leverage its temporal reasoning capabilities for sampling.
Soft Ranking with Confidence Scores: More flexible than a hard binary selection, supporting different downstream models in retrieving top-K frames as needed.
Small Model for Sampling, Large Model for Understanding: A cost-effective division of labor strategy.

Limitations & Future Work¶

GenS needs to process a window of 256 frames, leading to non-negligible computational overhead (which can be mitigated by parallel windows).
Currently a single-round sampling method; multi-round iterative retrieval and Video Agent integration have not yet been explored.
Training data relies on the quality of frame descriptions from the VLM; errors in description propagate to the sampling labels.
Evaluated only on question-answering tasks; other tasks like video summarization and video grounding have not been tested.

Related Work & Insights¶

vs. CLIP/SigLIP Sampling: These methods score single frames independently and fail to understand temporal relationships. GenS outputs intervals, inherently taking temporal continuity into account.
vs. TimeChat: TimeChat uses time-aware encoding but does not perform explicit sampling. GenS serves as an explicit sampling front-end.
vs. FPS/Uniform Sampling: These strategies ignore the question and waste a large number of tokens on irrelevant frames in long videos. GenS instead samples frames according to their relevance to the question.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Modeling frame sampling as a generative task is a fresh perspective, and the dataset construction pipeline is highly complete.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across 6+ models and 3 benchmarks, with detailed ablation studies.
Writing Quality: ⭐⭐⭐⭐ Well-structured and clear.
Value: ⭐⭐⭐⭐⭐ A plug-and-play, general-purpose solution with direct practical value for long video understanding.
