Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation

Conference: ICML 2025
arXiv: 2504.02438
Code: Yes
Area: Video Understanding / Video Language Models
Keywords: Long Video Understanding, Visual Token Compression, Keyframe Selection, Feature Fusion, Mixed-Precision

TL;DR

ViLaMP proposes the Differential Distillation principle to achieve "mixed-precision" video processing through two mechanisms: hierarchical frame-level Differential Keyframe Selection (DKS) and patch-level Differential Feature Merging (DFM). In this paradigm, keyframes retain all visual tokens, while non-keyframes are compressed into a single token. This enables processing ultra-long videos of up to 10K frames (approximately 2.7 hours) on a single A100 GPU.

Background & Motivation

Background

Vision-Language Models (VLMs) face a fundamental challenge when processing long videos: the sequence of visual tokens produced by a video far exceeds the context length of LLMs. For instance, a 1-minute 24fps video contains 1,440 frames and, at several hundred tokens per frame, generates over 1 million visual tokens, vastly exceeding the 4K-128K token capacity of mainstream LLMs. Existing methods fall into two groups: token pruning (uniform or content-aware sampling, such as LongVU) and feature fusion (heuristic or learnable mechanisms, such as Q-Former).

Limitations of Prior Work

Token Pruning: May lose critical temporal dependencies, and improper frame selection can lead to the loss of key information.

Feature Fusion: Often leads to the dilution of semantic information, making it difficult to maintain semantic fidelity.

Redundant Computation: Analysis shows that ~90% of query attention is concentrated on only 5% of the frames, and these high-attention frames are highly redundant and similar to each other.

Ours

The paper proposes the Differential Distillation principle: truly critical information must simultaneously satisfy (1) high relevance to the query and (2) low redundancy with the temporal context. Based on this principle, ViLaMP keeps the full set of tokens for keyframes, while each non-keyframe is compressed into a single token that retains its most salient features.

Method

Overall Architecture

ViLaMP is a hierarchical architecture consisting of a frame-level Differential Keyframe Selector (DKS) and a patch-level Differential Feature Merger (DFM), which together achieve "mixed-precision" video processing. The total number of visual tokens is reduced from $MN$ to $MK + (N-K)$, where $N$ is the number of frames, $M$ is the number of patch tokens per frame, and $K \ll N$ is the number of keyframes. A dual-stream visual connector projects the keyframes and the compressed non-keyframes separately into the language model space.
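The token budget above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name and the example values (729 patch tokens per frame, 32 keyframes) are illustrative assumptions, not settings confirmed by the paper.

```python
def visual_token_counts(num_frames, patches_per_frame, num_keyframes):
    """Return (full, mixed) visual token counts.

    full  = M * N            every frame keeps all M patch tokens
    mixed = M * K + (N - K)  keyframes keep M tokens, other frames keep 1
    """
    if not 0 <= num_keyframes <= num_frames:
        raise ValueError("num_keyframes must lie between 0 and num_frames")
    full = patches_per_frame * num_frames
    mixed = patches_per_frame * num_keyframes + (num_frames - num_keyframes)
    return full, mixed


# Hypothetical setting: N = 10,000 frames, M = 729 tokens per frame, K = 32 keyframes.
full, mixed = visual_token_counts(num_frames=10_000, patches_per_frame=729, num_keyframes=32)
print(f"full precision:  {full:,} tokens")
print(f"mixed precision: {mixed:,} tokens")
print(f"reduction:       {full / mixed:.0f}x")
```

Under these assumed values the sequence shrinks from about 7.29M tokens to about 33K, which is what brings a 10K-frame video within reach of a long-context LLM.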

Key Designs

Differential Distillation Principle: For any video component $v$ (a frame or a patch) and query $Q$, the differential information saliency score is defined as: $$D(v) = R(v, Q) - T(v, \mathcal{C}(v))$$ where $R(v,Q)$ measures query relevance and $T(v, \mathcal{C}(v))$ measures the redundancy of $v$ with respect to its temporal context $\mathcal{C}(v)$. A higher $D(v)$ indicates information that is both more task-relevant and less redundant. The principle is motivated by empirical analysis: at the frame level, ~90% of attention is concentrated on ~5% of the frames, which are highly similar to each other; at the patch level, ~50% of the patches in low-attention frames account for 80% of the attention and are highly similar to the keyframes.
Differential Keyframe Selection (DKS): A CLIP encoder is used to compute the cosine similarity between each frame embedding $\boldsymbol{f}_n$ and the query embedding $\boldsymbol{q}$ as the relevance score: $$R_f(f_n, Q) = \cos(\boldsymbol{f}_n, \boldsymbol{q})$$ Temporal redundancy is defined as the maximum similarity with the already selected keyframes: $$T_f(f_n, \mathcal{C}(f_n)) = \max_{f \in \mathcal{C}(f_n)} \cos(\boldsymbol{f}_n, E_f(f))$$ A greedy algorithm (Algorithm 1) is used: frames are visited in descending order of query relevance, and a frame is selected only if its maximum similarity to the already selected keyframes is below a threshold $\tau$. The procedure has a complexity of $O(\max(NK, N\log N))$, where the $N\log N$ term comes from sorting and the $NK$ term from comparing each candidate with up to $K$ selected keyframes. It ensures both semantic relevance and temporal diversity.
Differential Feature Merging (DFM): For each patch $p_n^m$ of a non-keyframe $f_n$, the differential saliency is calculated against the patch $p_k^m$ with the same index $m$ in a keyframe $f_k$: $$D_p(p_n^m) = R_p(p_n^m, Q) - \lambda T_p(p_n^m, p_k^m)$$ $$R_p(p_n^m, Q) = \cos(\boldsymbol{p}_n^m, \boldsymbol{q}), \quad T_p(p_n^m, p_k^m) = \cos(\boldsymbol{p}_n^m, \boldsymbol{p}_k^m)$$ Each non-keyframe is then compressed into a single token via differential weighted pooling: $$\boldsymbol{t}_n = \frac{\sum_{m=1}^M w_n^m \boldsymbol{p}_n^m}{\sum_{m=1}^M w_n^m}, \quad w_n^m = \text{softmax}\left(\frac{1}{\alpha}[D_p(p_n^1), \cdots, D_p(p_n^M)]\right)\bigg|_m$$ where $\alpha$ controls the sharpness of the weight distribution (a smaller $\alpha$ concentrates the weight on the most salient patches), and $\lambda$ balances query relevance against temporal uniqueness.
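The two mechanisms above can be sketched in plain Python. This is a minimal illustration of the formulas as summarized here, not the authors' implementation: the function names, the `max_keyframes` cap, and the toy embeddings are hypothetical, and `key_patches` is assumed to hold the patches of whichever keyframe the non-keyframe is compared against.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_keyframes(frames, query, tau, max_keyframes):
    """Greedy DKS sketch: visit frames by descending query relevance and keep
    a frame only if its max similarity to the kept frames is below tau."""
    order = sorted(range(len(frames)),
                   key=lambda n: cosine(frames[n], query), reverse=True)
    selected = []
    for n in order:
        if len(selected) >= max_keyframes:
            break
        redundancy = max((cosine(frames[n], frames[k]) for k in selected),
                         default=-1.0)  # nothing selected yet: no redundancy
        if redundancy < tau:
            selected.append(n)
    return sorted(selected)  # restore temporal order


def merge_non_keyframe(patches, key_patches, query, lam, alpha):
    """DFM sketch: pool the patches of one non-keyframe into a single token,
    weighting each patch by softmax(D_p / alpha)."""
    scores = [cosine(p, query) - lam * cosine(p, pk)
              for p, pk in zip(patches, key_patches)]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / alpha) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(patches[0])
    return [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]


# Toy 2-D embeddings: frames 0 and 1 are near-duplicates.
frames = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.2]
keyframes = select_keyframes(frames, query, tau=0.9, max_keyframes=3)
print(keyframes)  # frame 0 is skipped as redundant with frame 1

# One non-keyframe with two patches; the second is query-relevant and novel.
token = merge_non_keyframe(patches=[[1.0, 0.0], [0.0, 1.0]],
                           key_patches=[[1.0, 0.0], [1.0, 0.0]],
                           query=[0.0, 1.0], lam=1.0, alpha=0.1)
print(token)
```

Because the softmax weights already sum to one, the normalizing denominator in the pooling formula is a no-op in this sketch. With a small `alpha`, the merged token is dominated by the patch that is relevant to the query and dissimilar from the keyframe.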

Loss & Training

The model is trained using a language modeling objective. Each keyframe patch embedding is projected through $\text{MLP}_k$ to give $\boldsymbol{h}_k^m$, while each compressed non-keyframe token is projected through $\text{MLP}_n$ to give $\boldsymbol{h}_n$: $$\mathcal{L} = -\log P(A | \{\boldsymbol{h}_k^m | f_k \in \mathcal{K}\} \cup \{\boldsymbol{h}_n | f_n \notin \mathcal{K}\}, Q)$$ where $A$ is the answer and $\mathcal{K}$ is the set of selected keyframes. The embeddings of keyframes and non-keyframes are arranged in temporal order, and the parameters of DFM are learned through end-to-end optimization.
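Assuming the standard autoregressive factorization of a language modeling objective (the summary gives only the sequence-level form), the loss can be expanded token by token. Here $a_t$ is the $t$-th answer token, $|A|$ is the answer length, and $\mathcal{V}$ is a shorthand introduced only for this sketch to denote the temporally ordered visual embeddings: $$\mathcal{V} = \{\boldsymbol{h}_k^m \mid f_k \in \mathcal{K}\} \cup \{\boldsymbol{h}_n \mid f_n \notin \mathcal{K}\}, \qquad \mathcal{L} = -\sum_{t=1}^{|A|} \log P(a_t \mid a_{<t}, \mathcal{V}, Q)$$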

Key Experimental Results

Main Results

Benchmark	Subset	ViLaMP	Prev. SOTA	Gain
Video-MME (Long, w/o sub)	Long video (>39min)	-	Previous Best	+3.5%
Video-MME (Long, w/ sub)	Long video (>39min)	-	Previous Best	+1.6%
VideoNIAH	10K frames	-	VideoChat-Flash	+12.82%

VideoNIAH Ultra-long Video Benchmark

Model	10K Frame Processing Capability	Performance Degradation	Description
LLaMA-VID	OOM	-	Out of Memory
VideoChat-Flash	Executable	>24.50% Degradation	Severe performance drop from 2K to 10K frames
ViLaMP	Single A100	Stable	Performance largely maintained

Key Findings

Highly Concentrated Attention: Analysis across 4 VLMs reveals that ~90% of query attention is concentrated on <5% of frames.
Pervasive Attention Redundancy: The cosine similarity between high-attention frames is >0.8, far exceeding the random baseline (0.54-0.61).
Complementary Information in Non-Keyframes: Approximately 50% of patches in non-keyframes contribute 80% of attention and are highly similar to keyframes.
Effectiveness of Mixed Precision: The strategy of keyframes retaining full tokens + non-keyframes compressing into a single token achieves the best balance between efficiency and performance.

Highlights & Insights

Differential Distillation Principle: Unifies the definitions of frame-level and patch-level saliency into a concise and elegant formulation, $D(v) = R(v,Q) - T(v, \mathcal{C}(v))$.
Empirical Analysis-Driven Design: Discovers redundancy through in-depth analysis of attention patterns in 4 VLMs before designing the compression strategy.
Apt "Mixed Precision" Analogy: Borrows the concept of mixed-precision training to allocate representations of different "precision" to frames of varying importance.
10K Frame Processing Capacity: Processes ~2.7 hours of video on a single A100; the proposed VideoNIAH benchmark fills the gap in ultra-long video evaluation.
Greedy DKS Algorithm: Sorts by relevance before filtering redundancy, ensuring query relevance is prioritized over diversity.

Limitations & Future Work

DKS relies on the CLIP encoder to calculate frame-query similarity; the encoding capability of CLIP itself may become a bottleneck.
Compressing non-keyframes into a single token may lose some fine-grained temporal information.
Hyperparameters such as thresholds $\tau$, $\lambda$, and $\alpha$ require tuning, and different video types may require different settings.
The results tables in this summary are incomplete because the cached copy of the paper was truncated; specific values should be verified against the original paper.

ViLaMP's ideas are complementary to LongVU (content-aware frame grouping + similarity-based patch selection) and VideoChat-Flash (similarity-guided fusion).
The differential distillation principle can be extended to other scenarios requiring information compression (e.g., long document or long audio processing).
The "mixed-precision" concept can inspire more flexible multi-granularity representation designs.

Rating

Novelty: ⭐⭐⭐⭐ The differential distillation principle unifies frame-level and patch-level operations, and the mixed-precision analogy is novel.
Experimental Thoroughness: ⭐⭐⭐⭐ 5 benchmarks + self-built VideoNIAH + attention analysis of 4 VLMs (the results tables available for this summary were truncated).
Writing Quality: ⭐⭐⭐⭐⭐ Driven by preliminary studies, logical, and structurally rigorous.
Value: ⭐⭐⭐⭐⭐ 10K frame processing is a critical need for practical applications; the direction is clear and practical.