# Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
- **Conference**: ICCV 2025
- **arXiv**: 2506.22139
- **Code**: N/A
- **Area**: Video Understanding / Video Large Language Models
- **Keywords**: Video Frame Selection, Multi-Resolution Adaptation, Video-LLM, CLIP, Gumbel-Max Sampling
## TL;DR
This paper proposes Q-Frame, a training-free plug-and-play framework for video frame selection and multi-resolution adaptation. By leveraging CLIP cross-modal matching and the Gumbel-Max trick, Q-Frame achieves query-aware frame selection, enabling Video-LLMs to process more informative frames under the same computational budget. It achieves significant performance gains on three benchmarks: MLVU, LongVideoBench, and Video-MME.
## Background & Motivation
Video-LLMs face a fundamental tension between the large number of video frames and the limited context length. A 3-minute video at 24 fps contains approximately 4,320 frames, yet VideoLLaMA2 supports only 2,000 tokens and VILA-V1.5 approximately 4,000 tokens.
Uniform frame sampling suffers from three key issues:
Temporal Sparsity: A fixed number of frames becomes increasingly sparse in long videos, disrupting temporal continuity and causing critical transitions to be missed.
Query Agnosticism: The same set of frames is used for every question, regardless of query-specific requirements.
One-Size-Fits-All Resolution: All frames are processed at the same resolution, causing detail loss in high-information-density frames through downsampling, while wasting computation on low-density frames.
Existing improvements—such as Top-K semantic retrieval and frame ranking—still fail to capture complex temporal dependencies and do not explore dynamic resolution adaptation.
## Method

### Overall Architecture

Q-Frame consists of three components:

1. **CQR** (Cross-modal Query Retrieval)
2. **QFS** (Query-aware Frame Selection)
3. **MRA** (Multi-Resolution Adaptation)
### Key Designs
- **Cross-modal Query Retrieval (CQR)**:
    - Uniformly subsamples \(T\) frames from the original video as the candidate frame sequence \(\mathcal{F}\)
    - Maps video frames and the text query into a shared semantic space using pretrained CLIP/Long-CLIP
    - Computes per-frame similarity to the query: \(I = QF^T \in \mathbb{R}^{1 \times T}\), where \(Q\) is the query embedding and \(F\) stacks the frame embeddings
    - Adopts Long-CLIP to overcome the 77-token limit of the original CLIP text encoder
- **Query-aware Frame Selection (QFS)**:
    - Since CLIP is trained primarily on image-text pairs, it does not model temporal relationships within a video
    - Introduces a probability-guided sampling strategy based on the Gumbel-Max trick (see the sketch after this list)
    - Converts matching scores into a probability distribution: \(\pi = \text{Softmax}(I/\tau)\), where \(\tau\) is a temperature parameter
    - Perturbs the log-probabilities with independent Gumbel noise: \(p = \log\pi + g\), where \(g = -\log(-\log\epsilon)\) and \(\epsilon \sim \text{Uniform}(0, 1)\)
    - Selects the Top-K frames by perturbed score: \(\text{idx}^{\text{select}} = \{i \mid \text{rank}(i) \leq K\}\)
    - Core advantage: an exploration–exploitation balance, where highly relevant frames are selected with high probability while the injected noise preserves diversity
- **Multi-Resolution Adaptation (MRA)**:
    - Assigns each selected frame one of three resolution levels based on query relevance
    - High-relevance frames (\(\text{rank} \leq K\)) → high resolution \(r^{(3)}\)
    - Mid-relevance frames (\(K < \text{rank} \leq K + M\)) → medium resolution \(r^{(2)}\)
    - Low-relevance frames (\(K + M < \text{rank} \leq K + M + N\)) → low resolution \(r^{(1)}\)
    - Token relationship: a frame at \(r^{(3)}\) yields 4× the visual tokens of one at \(r^{(2)}\) and 16× those of one at \(r^{(1)}\)
    - Token budget constraint: \(K + M/4 + N/16 = 8\), i.e., the total cost equals that of 8 high-resolution frames
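
To make the three steps concrete, below is a minimal NumPy sketch of the selection pipeline. It is an illustration under stated assumptions, not the authors' implementation (no official code is released): CLIP/Long-CLIP embeddings are assumed precomputed and L2-normalized, tiers are encoded 3/2/1 for high/mid/low resolution, and `qframe_select` with its default K/M/N values is a name of our choosing.

```python
import numpy as np


def qframe_select(query_emb, frame_embs, K=4, M=8, N=32, tau=0.8, rng=None):
    """Select K+M+N frames and assign each a resolution tier (3/2/1 = high/mid/low)."""
    rng = rng or np.random.default_rng()

    # CQR: cosine similarity between the query and each candidate frame
    # (embeddings assumed L2-normalized); I = Q F^T, shape (T,).
    scores = frame_embs @ query_emb

    # QFS: Gumbel-Max sampling over pi = softmax(I / tau).
    logits = scores / tau
    log_pi = logits - (logits.max() + np.log(np.exp(logits - logits.max()).sum()))
    eps = rng.uniform(size=logits.shape)        # epsilon ~ Uniform(0, 1)
    gumbel = -np.log(-np.log(eps))              # g = -log(-log(epsilon))
    order = np.argsort(-(log_pi + gumbel))      # rank frames by p = log(pi) + g

    # MRA: top K ranks -> high res (tier 3), next M -> mid (2), next N -> low (1).
    selected = order[:K + M + N]
    tiers = np.concatenate([np.full(K, 3), np.full(M, 2), np.full(N, 1)])
    return selected, tiers


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F = rng.normal(size=(128, 512))             # 128 candidate frame embeddings
    F /= np.linalg.norm(F, axis=1, keepdims=True)
    q = rng.normal(size=512)
    q /= np.linalg.norm(q)
    idx, tiers = qframe_select(q, F, rng=rng)
    print(idx[:4], tiers[:4])                   # the four high-resolution picks
```

Taking the Top-K of \(\log\pi + g\) is the standard Gumbel-Top-K construction: it draws K items without replacement with probability proportional to \(\pi\), which is exactly the exploration–exploitation balance described above.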
### Loss & Training
No training is required. Q-Frame operates entirely at inference time:

- Similarity scores are computed with pretrained Long-CLIP
- Gumbel-Max sampling requires no gradient computation
- The framework is plug-and-play and compatible with any Video-LLM, both open-source models and closed-source APIs, as sketched below
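
Because selection happens entirely before the model call, integrating Q-Frame amounts to a pre-processing step. A hypothetical wrapper, where `load_frames`, `resize_for_tier`, `clip_encoder`, and `video_llm.generate` are all placeholders rather than a real API:

```python
# Hypothetical plug-and-play usage: Q-Frame as pure pre-processing.
# load_frames / resize_for_tier / clip_encoder / video_llm are placeholders.
def answer(video_path: str, query: str, video_llm, clip_encoder) -> str:
    frames = load_frames(video_path, num_frames=128)        # candidate pool (T=128)
    q = clip_encoder.encode_text(query)                     # Long-CLIP text embedding
    f = clip_encoder.encode_images(frames)                  # per-frame embeddings
    idx, tiers = qframe_select(q, f)                        # sketch above
    images = [resize_for_tier(frames[i], t) for i, t in zip(idx, tiers)]
    return video_llm.generate(images=images, prompt=query)  # model-specific call
```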
## Key Experimental Results

### Main Results
Evaluated on MLVU, LongVideoBench, and Video-MME. Frames are selected from 128 candidates, with a budget of 8 frames (or equivalent token budget).
| Model | #Frames | MLVU | LongVideoBench | Video-MME (w/o / w subs) |
|---|---|---|---|---|
| VILA-V1.5 | 8 | 46.3 | 47.1 | 47.5 / 50.0 |
| +Frame-Voyager | 8 | 49.8 | - | 50.5 / 53.6 |
| +Q-Frame | 8 | 54.4 | 51.6 | 50.7 / 55.0 |
| Qwen2-VL | 8 | 56.9 | 53.5 | 53.7 / 59.4 |
| +Q-Frame | 4+8+32 (high/mid/low) | 65.4 | 58.4 | 58.3 / 61.8 |
| GPT-4o | 8 | 28.6 | 53.3 | 61.9 / 64.5 |
| +Q-Frame | 8 | 29.3 | 58.6 | 63.8 / 66.5 |
Q-Frame consistently improves performance across all models and benchmarks. Qwen2-VL + Q-Frame achieves state-of-the-art on MLVU (65.4%), while GPT-4o + Q-Frame achieves state-of-the-art on the other two benchmarks.
### Ablation Study
| Frame Selection | Resolution | Acc (%) |
|---|---|---|
| Uniform | Fixed | 53.5 |
| CLIP Top-K | Fixed | 56.0 |
| QFS | Fixed | 57.6 |
| QFS | MRA | 58.4 |
Ablation on frame resolution allocation (token budget equivalent to 8 high-resolution frames):
| K (High) | M (Mid) | N (Low) | Tokens/Video | Acc (%) |
|---|---|---|---|---|
| 8 | 0 | 0 | 2265 | 57.6 |
| 6 | 4 | 16 | 2313 (+2%) | 58.3 |
| 4 | 8 | 32 | 2345 (+3.5%) | 58.4 |
| 4 | 4 | 48 | 2370 (+4.6%) | 57.4 |
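
As a quick sanity check, every row satisfies the nominal \(K + M/4 + N/16 = 8\) budget; the small Tokens/Video differences presumably come from per-frame encoding overheads rather than budget violations:

```python
# Each (K, M, N) configuration costs exactly 8 high-resolution frame equivalents.
for K, M, N in [(8, 0, 0), (6, 4, 16), (4, 8, 32), (4, 4, 48)]:
    print(f"K={K}, M={M}, N={N}: budget = {K + M / 4 + N / 16}")  # -> 8.0 each
```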
### Key Findings
- Q-Frame yields larger gains on longer videos: Qwen2-VL improves by 10.5 points on 15–60 min videos
- Among the six video task categories, Reasoning, Recognition, and Counting benefit the most
- Temperature \(\tau = 0.8\) is optimal: lower values over-exploit the top-ranked frames, while higher values introduce excessive randomness (see the numeric example after this list)
- The optimal resolution configuration is 4 high + 8 medium + 32 low resolution frames
- Increasing low-resolution frames to 48 degrades performance, indicating that more frames do not always help
- Q-Frame adds only 3–5% token overhead, with negligible impact on overall inference time
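
The temperature effect is easy to see numerically; the similarity scores below are illustrative toy values, not taken from the paper:

```python
# pi = softmax(I / tau): small tau collapses toward argmax (over-exploitation),
# large tau flattens toward uniform (excessive randomness).
import numpy as np

scores = np.array([0.31, 0.28, 0.22, 0.10])   # toy CLIP similarities
for tau in (0.05, 0.8, 5.0):
    logits = scores / tau
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    print(f"tau={tau}: {np.round(pi, 3)}")
```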
## Highlights & Insights
- Elegant training-free design: The Gumbel-Max trick converts deterministic ranking into probabilistic sampling, balancing diversity and relevance
- Model agnosticism: Effective for both open-source (VILA-V1.5, Qwen2-VL) and closed-source (GPT-4o) models
- Multi-resolution innovation: The first work to introduce dynamic resolution adaptation for Video-LLMs, retaining more information under the same computational budget
- Rigorous experimental design: Both fixed-frame-count and fixed-token-budget settings provide comprehensive and fair comparisons
## Limitations & Future Work
- Relies on CLIP's image-text matching capability; frame selection may be inaccurate for queries requiring complex temporal reasoning (e.g., event ordering)
- Gumbel-Max sampling introduces stochasticity, so results are not deterministic across runs (though the added diversity is often a benefit)
- MRA is only applicable to models supporting dynamic resolution input (e.g., Qwen2-VL), not all Video-LLMs
- The candidate frame count \(T=128\) is fixed, which may be insufficient for very long videos
- Explicit modeling of temporal causal relationships is absent; case analysis reveals persistent difficulties on temporal reasoning tasks
## Related Work & Insights
- KeyVideoLLM's Top-K semantic retrieval and Frame-Voyager's loss-driven optimization serve as comparison baselines
- Qwen2-VL's native dynamic resolution framework provides the underlying support for MRA
- The Gumbel-Max trick originates from discrete probabilistic sampling theory; its application to frame selection represents an elegant cross-domain transfer
- Takeaway: Intelligent input-side filtering during large model inference may be more efficient than model-side improvements
## Rating
- Novelty: ⭐⭐⭐⭐ The combination of Gumbel-Max frame selection and multi-resolution adaptation is original, though individual components are not entirely novel
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three benchmarks, three baseline models, multi-dimensional ablations, cross-video-length analysis, and sub-task analysis—very comprehensive
- Writing Quality: ⭐⭐⭐⭐ Clear exposition with intuitive figures and tables
- Value: ⭐⭐⭐⭐ The training-free plug-and-play design is highly practical and directly applicable to Video-LLM deployment