AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

Conference: CVPR 2026
arXiv: 2604.08077
Code: None
Area: Video Understanding / Efficient Inference
Keywords: long video, adaptive sparsity, Video-LLM, efficient inference, 3D cube

TL;DR

AdaSpark reduces FLOPs for long-video processing by up to 57% while maintaining comparable performance. It partitions video tokens into 3D spatiotemporal cubes and applies two complementary adaptive sparsity mechanisms: cube-level selection in attention and token-level selection in the FFN.

Background & Motivation

Long videos can produce token sequences on the order of hundreds of thousands to millions, rendering standard Video-LLMs infeasible due to quadratic attention complexity and FFN activation costs. Existing efficiency methods suffer from two key limitations: (1) irreversible information loss from frame sampling or token pruning harms fine-grained perception; (2) rigid predefined patterns such as local attention constrain long-range temporal modeling.

Preliminary analysis reveals two key phenomena: (1) video attention is intrinsically sparse: a small number of tokens receive most of the attention probability, and the number of tokens needed varies significantly across layers; (2) FFN layers exhibit "computational laziness" toward visual tokens: text tokens are transformed substantially by the FFN (high variance), whereas visual tokens change only slightly and consistently (low variance).

Method

Overall Architecture

Video tokens are partitioned into 3D spatiotemporal cubes (\(h \times w \times t\)), and entropy-based adaptive sparse computation is applied separately at the attention layer and the FFN layer.
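As a rough illustration of the partitioning step, the sketch below assigns each token on a T x H x W grid to a cube of size t x h x w. The function name `partition_into_cubes`, the frame-major token layout, and the time-major cube ordering are assumptions made for illustration; the source does not specify them.

```python
from collections import defaultdict


def partition_into_cubes(T, H, W, t, h, w):
    """Group flat token indices of a T x H x W grid into t x h x w cubes.

    Assumptions (not from the paper): tokens are laid out frame-major,
    and cubes are numbered time-major, then by row, then by column.
    Edge cubes are smaller when the grid is not divisible by the cube size.
    """
    cubes_h = -(-H // h)  # ceiling division: cubes along the height axis
    cubes_w = -(-W // w)  # ceiling division: cubes along the width axis
    cubes = defaultdict(list)
    for f in range(T):
        for r in range(H):
            for c in range(W):
                cube_id = ((f // t) * cubes_h + (r // h)) * cubes_w + (c // w)
                token_index = (f * H + r) * W + c  # flat frame-major index
                cubes[cube_id].append(token_index)
    return dict(cubes)


cubes = partition_into_cubes(T=4, H=4, W=4, t=2, h=2, w=2)
print(len(cubes), len(cubes[0]))  # 8 cubes of 8 tokens each
```

Each cube groups tokens that are close in both space and time, which is what lets a single summary (such as the mean key used by AdaS-Attn) stand in for the whole cube.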

Key Designs

Adaptive Cube Selection Attention (AdaS-Attn): Each query token computes a relevance score against all preceding cubes (via similarity with the cube mean key \(\bar{k}_j\)), and Top-p (nucleus) selection determines the set of attended cubes: \(P_i = \text{Softmax}([q \cdot \bar{k}_1/\sqrt{d_k}, ..., q \cdot \bar{k}_{i-1}/\sqrt{d_k}]^T)\), \(\mathcal{S}_i = \{j | p_j \in \text{Top-p}(P_i, p)\}\). High-entropy distributions (dispersed attention) lead to selecting more cubes; low-entropy distributions (concentrated attention) select only a few. Full attention within the token's own cube is always retained.
Adaptive Token Selection FFN (AdaS-FFN): Token importance is estimated by L2 norm, and Top-p selection similarly determines which tokens pass through the FFN. Skipped tokens are updated via mean compensation—the mean of FFN transformations from active tokens—to prevent complete information stagnation. Text tokens always pass through the FFN in a dense manner.
Entropy-Based Top-p Selection: Unified across both AdaS-Attn and AdaS-FFN. Adaptive sparsity allocates computation according to input complexity—more computation is allocated when information density is high, and computation is largely bypassed when information is sparse.
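The cube selection rule of AdaS-Attn can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`softmax`, `top_p_select`, `select_cubes`, `entropy`) are hypothetical, vectors are plain lists, and Top-p is taken to mean the smallest set of cubes, in descending probability, whose cumulative mass reaches \(p\).

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_select(probs, p):
    """Smallest index set, in descending probability, with mass >= p."""
    order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    selected, mass = [], 0.0
    for j in order:
        selected.append(j)
        mass += probs[j]
        if mass >= p:
            break
    return sorted(selected)


def select_cubes(q, mean_keys, p):
    """Score query q against each preceding cube's mean key, then apply Top-p."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
              for k in mean_keys]
    return top_p_select(softmax(scores), p)


def entropy(probs):
    return -sum(x * math.log(x) for x in probs if x > 0)


peaked = [0.90, 0.05, 0.03, 0.02]  # low entropy: attention is concentrated
flat = [0.25, 0.25, 0.25, 0.25]    # high entropy: attention is dispersed
print(top_p_select(peaked, 0.8), top_p_select(flat, 0.8))
```

With the same threshold, the peaked distribution keeps a single cube while the flat one keeps all four, which is the adaptive behavior described above: the budget follows the entropy of the relevance distribution rather than a fixed ratio.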

Loss & Training

The sparsity strategy is applied on top of Qwen2.5-VL with lightweight fine-tuning. A single sparsity threshold \(p\) controls the computational budget of both modules.

Key Experimental Results

Main Results

Benchmark	AdaSpark	Dense Baseline	FLOPs Reduction
MLVU Dev	Comparable	baseline	Up to 57%
VideoMME	Comparable	baseline	Up to 57%
VideoNIAH (ultra-long video)	Comparable	baseline	Significant

Key Findings

Up to 57% FLOPs reduction while maintaining comparable performance across multiple benchmarks
Top-p selection outperforms fixed sparsity ratios—different layers and inputs require different sparsity levels
Mean compensation is critical for preserving information flow in skipped tokens
The semantic homogeneity of cube partitioning underpins the accuracy of sparse selection
Skipped tokens in AdaS-FFN are updated as \(y_k = x_k + \bar{m}_i\), where \(\bar{m}_i = \frac{1}{|\mathcal{M}_i|}\sum_{j \in \mathcal{M}_i} FFN(x_j)\)
Preliminary analysis shows that FFN exhibits "computational laziness" toward visual tokens: the variance of the L2-norm ratio is far lower than that of text tokens
The sparsity strategy is applied on top of Qwen2.5-VL via lightweight fine-tuning
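The mean-compensation update listed above, \(y_k = x_k + \bar{m}_i\), can be sketched as follows. This is an illustrative toy under several assumptions that the source does not confirm: the function name `adas_ffn` is hypothetical, L2 norms are normalized into a distribution before Top-p selection, active tokens receive the usual residual update \(x_j + FFN(x_j)\), and `ffn` is any callable mapping a vector to a vector.

```python
import math


def adas_ffn(tokens, ffn, p):
    """Apply ffn only to high-norm tokens; give the rest the mean update.

    tokens: list of equal-length vectors (lists of floats) from one group.
    Returns (outputs, sorted indices of active tokens).
    """
    norms = [math.sqrt(sum(v * v for v in x)) for x in tokens]
    total = sum(norms)
    if total == 0:
        probs = [1.0 / len(tokens)] * len(tokens)  # degenerate case: uniform
    else:
        probs = [n / total for n in norms]  # assumption: normalized L2 norms

    # Top-p: take tokens in descending importance until mass reaches p.
    order = sorted(range(len(tokens)), key=lambda k: probs[k], reverse=True)
    active, mass = [], 0.0
    for k in order:
        active.append(k)
        mass += probs[k]
        if mass >= p:
            break

    updates = {j: ffn(tokens[j]) for j in active}  # FFN runs on active tokens only
    dim = len(tokens[0])
    mean_update = [sum(updates[j][d] for j in active) / len(active)
                   for d in range(dim)]

    outputs = []
    for k, x in enumerate(tokens):
        delta = updates[k] if k in updates else mean_update  # mean compensation
        outputs.append([a + b for a, b in zip(x, delta)])
    return outputs, sorted(active)


def toy_ffn(x):
    return [2.0 * v for v in x]  # stand-in for a real FFN


tokens = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0]]
outputs, active = adas_ffn(tokens, toy_ffn, p=0.9)
print(active, outputs)
```

In this toy input the low-norm middle token is skipped, yet it still receives the average of the two active updates instead of being left unchanged, which is the role the notes attribute to mean compensation.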

Highlights & Insights

The cube-token two-level hierarchical sparsity design is systematic and comprehensive
The entropy-based adaptive mechanism elegantly avoids the suboptimality of fixed sparsity ratios
The discovery of FFN "laziness" toward visual tokens in the preliminary analysis provides a solid motivation for token-selective FFN computation
The mean compensation strategy is simple yet effective

Limitations & Future Work

The Top-p threshold still requires manual specification
Hardware efficiency of sparse patterns depends on underlying framework support
The impact on fine-grained temporal reasoning (e.g., precise temporal localization) warrants further evaluation
Video cube partitioning uses a fixed window \(h \times w \times t\); adaptive partitioning could yield further improvements
Text tokens always pass through the FFN densely, preserving rich instruction and semantic content
Comparable performance is maintained on MLVU Dev, VideoMME, and VideoNIAH, including on hour-long videos

Rating

Novelty: ⭐⭐⭐⭐ — Unified cube-token two-level sparsity framework
Technical Depth: ⭐⭐⭐⭐ — Rigorous logic from preliminary analysis to method design
Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-benchmark validation including ultra-long videos
Practical Value: ⭐⭐⭐⭐⭐ — 57% FLOPs reduction offers strong practical utility