Sharper and Faster mean Better: Towards More Efficient Vision-Language Model for Hour-scale Long Video Understanding

Conference: ACL 2025 (Long Paper)
Code: HuggingFace
Area: Multimodal VLM / Video Understanding / Efficient Inference
Keywords: Long Video Understanding, Shot-adaptive Frame Pruning, Hierarchical Attention, Shot Detection, Sparse Attention

TL;DR

Proposes Sophia, a model for hour-scale long videos. Shot-adaptive Frame Pruning (two-stage pruning built on shot segmentation) selects query-relevant frames, and Hierarchical Attention with \(O(N)\) complexity replaces full attention. Sophia achieves state-of-the-art (SOTA) performance on 6 of 8 long video benchmarks while using about 1/8.5 of the attention FLOPs of InternVL2.

Background & Motivation

Triple Challenges of Long Videos: (1) Context length overflow—a 10-minute video sampled at 2 fps amounts to 1200 frames, generating tens of thousands of visual tokens; (2) Massive memory consumption—the KV cache of standard quadratic attention requires 70GB+ at 128 frames; (3) Prohibitive computational complexity—full attention FLOPs grow quadratically with the number of frames.

Limitations of Prior Work: Existing approaches either compress the token count per frame (e.g., spatial pooling in LLaVA-OneVision), sacrificing spatial details, or apply uniform temporal segmentation, which discards substantial segments and ignores the temporal non-uniformity of events/shots in videos. LongVU selects frames based on DINOv2 feature clustering, but fails to leverage query information for targeted filtering.

Key Insight: Videos possess a natural structure—shot transitions. Leveraging this structure for two-level pruning (filtering shots first, then removing redundant frames) aligns better with video semantics than uniform segmentation or unstructured clustering. Meanwhile, frame-to-frame attention can be replaced with a hierarchical structure (intra-frame local + inter-frame global) instead of full connections, theoretically guaranteeing an information propagation distance of \(O(1)\).

Method

Overall Architecture

Two core modules: (1) Shot-adaptive Frame Pruning—two-stage frame pruning based on shot detection; (2) Hierarchical Attention—replacing full attention with hierarchical sparse attention. The backbone consists of an InternViT-300M encoder, an MLP projector, and an InternLM2-Chat-7B language model.

Key Designs

  1. Shot-adaptive Frame Pruning (Two-Stage)

    • Shot Detection: Employs a pre-trained TransNet to detect shot transitions, naturally segmenting the video into variable-length shot clips.
    • Inter-shot Pruning: Extracts the visual embeddings of the key frames from each shot, computes the cosine similarity with the MLP-mapped query text, and discards the least relevant \(\alpha\%\) of the shots.
    • Intra-shot Pruning: Calculates the cosine similarity between adjacent frames within the same shot, removing the highly redundant \(\beta\%\) of frames (e.g., duplicated frames in prolonged static scenes).
    • Differentiable Indexing: Utilizes Gumbel-Softmax during training to achieve a differentiable approximation of frame selection, enabling end-to-end gradient propagation.
  2. Hierarchical Attention (\(O(N)\) Complexity)

    • Groups video tokens by frame, dividing attention into two levels: (a) Intra-frame local attention—fully connected among all tokens within the same frame; (b) Inter-frame global attention—fully connected among frame-level summary tokens.
    • IPD Theoretical Guarantee: The Information Propagation Distance (IPD) is \(O(1)\)—any two frames require at most 2 attention layers to exchange information (first aggregating to the frame summary token \(\rightarrow\) propagating inter-frame \(\rightarrow\) distributing to target frames), significantly superior to the \(O(F/w)\) of sliding window attention.
    • Efficient Implementation: Implemented with custom Triton GPU kernels to avoid the overhead of PyTorch sparse attention.
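
The two pruning stages in design 1 above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `inter_shot_prune` and `intra_shot_prune` are hypothetical, `alpha` and `beta` are treated as fractions in [0, 1], the embeddings are plain arrays assumed to already live in a shared space (the MLP mapping of the query is omitted), and the first frame of each shot is always kept.

```python
import numpy as np

def cosine(a, b):
    """Row-wise cosine similarity between two broadcastable arrays."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return np.sum(a * b, axis=-1)

def inter_shot_prune(key_frame_emb, query_emb, alpha):
    """Drop the alpha fraction of shots least similar to the query.

    Returns the indices of the kept shots in temporal order.
    """
    sims = cosine(key_frame_emb, query_emb[None, :])
    n_drop = int(np.floor(alpha * len(sims)))
    order = np.argsort(sims)            # ascending: least relevant first
    return np.sort(order[n_drop:])

def intra_shot_prune(frame_emb, beta):
    """Within one shot, drop the beta fraction of frames that are most
    similar to their predecessor. Frame 0 is never a drop candidate
    (an assumption of this sketch).
    """
    n = len(frame_emb)
    if n <= 1:
        return np.arange(n)
    adj = cosine(frame_emb[1:], frame_emb[:-1])   # adj[i] = sim(frame i+1, frame i)
    n_drop = int(np.floor(beta * n))
    drop = np.argsort(-adj)[:n_drop] + 1          # most redundant frames
    return np.setdiff1d(np.arange(n), drop)

# Toy data: 4 shots (one key-frame embedding each) and a query.
key_frame_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
query_emb = np.array([1.0, 0.0])
kept_shots = inter_shot_prune(key_frame_emb, query_emb, alpha=0.5)
# Shots 1 and 3 are least similar to the query and are dropped.

# Toy shot with 5 frames; frames 1 and 3 duplicate their predecessors.
shot_frames = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]])
kept_frames = intra_shot_prune(shot_frames, beta=0.5)
```

Ranking by a single similarity pass is the simplest reading of the description; a real system might instead threshold the similarities or re-rank after each removal.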

Loss & Training

  • Three-stage training: (1) MLP projector alignment \(\rightarrow\) (2) Joint full-parameter fine-tuning \(\rightarrow\) (3) Video instruction tuning.
  • The Gumbel-Softmax temperature is gradually annealed during training.
  • The TransNet shot detector remains frozen and does not participate in training.
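
The role of the annealed temperature can be illustrated with a small NumPy sketch of Gumbel-Softmax sampling. The source only states that the temperature is gradually annealed, so the exponential schedule, the `tau_start` and `tau_end` values, and the function names here are assumptions chosen for illustration.

```python
import numpy as np

def gumbel_softmax(logits, tau, rng):
    """Relaxed (soft) selection weights over frames via Gumbel-Softmax."""
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))            # Gumbel(0, 1) noise
    z = (logits + gumbel) / tau
    z = z - z.max()                         # numerical stability
    w = np.exp(z)
    return w / w.sum()

def annealed_tau(step, total_steps, tau_start=1.0, tau_end=0.1):
    """Hypothetical exponential decay from tau_start to tau_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start * (tau_end / tau_start) ** frac

# Per-frame selection scores (toy values).
logits = np.array([2.0, 1.0, 0.5, 0.1])

# Same noise (same seed) at the start and end of the schedule, so the
# only difference between the two samples is the temperature.
soft = gumbel_softmax(logits, annealed_tau(0, 100), np.random.default_rng(0))
sharp = gumbel_softmax(logits, annealed_tau(100, 100), np.random.default_rng(0))
```

At a high temperature the weights are spread across frames, which keeps gradients flowing to all candidates; as the temperature falls, the same noisy scores produce weights closer to a one-hot (hard) selection.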

Experiments

Main Results: Long Video Understanding

| Benchmark | Sophia | Prev. SOTA | Gain (relative) |
|---|---|---|---|
| EgoSchema | 64.4 | 54.9 (LongVU) | +17.3% |
| MovieChat-1K | 78.2 | 74.7 (LLaVA-OneVision) | +4.7% |
| LongVideoBench | 57.9 | 55.0 (InternVL2) | +5.3% |
| LVBench | 46.2 | 44.3 (LongVU) | +4.3% |
| MLVU | 68.3 | 65.4 (LongVU) | +4.4% |
| Video-MME (Long) | 47.1 | 45.5 (InternVL2) | +3.5% |
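
The Gain figures appear to be relative improvements over the previous best score rather than percentage-point differences. A short check, using only the numbers in the table (the helper name `relative_gain_pct` is introduced here for illustration):

```python
# (Sophia score, previous SOTA score) taken from the table above.
results = {
    "EgoSchema": (64.4, 54.9),
    "MovieChat-1K": (78.2, 74.7),
    "LongVideoBench": (57.9, 55.0),
    "LVBench": (46.2, 44.3),
    "MLVU": (68.3, 65.4),
    "Video-MME (Long)": (47.1, 45.5),
}

def relative_gain_pct(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

gains = {name: round(relative_gain_pct(new, old), 1)
         for name, (new, old) in results.items()}
```

For EgoSchema, for example, the absolute difference is 9.5 points, which corresponds to the reported 17.3% relative gain.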

Efficiency Comparison (128-frame input)

| Model | Attention FLOPs | Memory Usage |
|---|---|---|
| LongVU | 87.03T | ~80GB |
| InternVL2-8B | 22.33T | ~70GB |
| Qwen2-VL-7B | 19.06T | ~65GB |
| Sophia | 2.64T | ~27GB |

Sophia’s attention FLOPs are only 1/8.5 of InternVL2 and 1/33 of LongVU.
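
These ratios follow directly from the table. The sketch below recomputes them, reading the "T" suffix as tera-FLOPs (an assumption about the unit; the ratios do not depend on it):

```python
# Attention FLOPs at 128 frames, as listed in the efficiency table.
attention_tflops = {
    "LongVU": 87.03,
    "InternVL2-8B": 22.33,
    "Qwen2-VL-7B": 19.06,
    "Sophia": 2.64,
}

# How many times more attention FLOPs each baseline needs than Sophia.
ratios = {
    name: round(value / attention_tflops["Sophia"], 1)
    for name, value in attention_tflops.items()
    if name != "Sophia"
}
```

The same calculation puts Qwen2-VL-7B at roughly 7.2 times Sophia's attention FLOPs.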

Ablation Study

| Ablation Dimension | Conclusion |
|---|---|
| Shot detection vs. Uniform splitting | Shot-adaptive is 3.2% higher on EgoSchema, supporting that shot awareness aligns better with video semantics. |
| Two-stage (Inter + Intra) | Removing either stage leads to performance degradation, demonstrating their complementary and non-substitutable nature. |
| Hierarchical vs. Dense Attention | Performance discrepancy is <1%, while FLOPs are reduced by over 10 times, offering an exceptional efficiency-performance trade-off. |
| Query-guided vs. Query-free pruning | Query-guided Inter-shot Pruning contributes approximately 2-3% absolute improvement. |
| Gumbel-Softmax vs. Hard selection | Differentiable selection stabilizes training and accelerates convergence. |

Key Findings

  • The 8B-parameter Sophia outperforms the 34B LLaVA-NeXT-Video and the 40B InternVL2, demonstrating that architectural efficiency can bridge the gap in parameter scale.
  • \(\text{IPD} = O(1)\) implies that even when processing a 1-hour video (thousands of frames), there is no information decay across long distances between frames.
  • The quality of shot detection significantly impacts final performance—TransNet performs best on videos with explicit shots, such as movies.
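
The constant-distance claim can be checked on a toy attention mask. The sketch below is an illustration under assumptions, not the paper's kernel: token 0 of every frame is treated as that frame's summary token, the mask is non-causal, text tokens are ignored, and the function names are hypothetical. It counts how many stacked attention layers are needed before every token can have received information from every other token, and compares the result with a frame-level sliding window.

```python
import numpy as np

def hierarchical_mask(num_frames, tokens_per_frame):
    """True where attention is allowed: within a frame, or between
    the summary tokens (token 0) of any two frames."""
    n = num_frames * tokens_per_frame
    idx = np.arange(n)
    frame_id = idx // tokens_per_frame
    is_summary = (idx % tokens_per_frame) == 0
    same_frame = frame_id[:, None] == frame_id[None, :]
    summary_pair = is_summary[:, None] & is_summary[None, :]
    return same_frame | summary_pair

def sliding_window_mask(num_frames, tokens_per_frame, window):
    """True where two tokens lie at most `window` frames apart."""
    n = num_frames * tokens_per_frame
    frame_id = np.arange(n) // tokens_per_frame
    return np.abs(frame_id[:, None] - frame_id[None, :]) <= window

def layers_to_full_reach(mask):
    """Smallest number of stacked layers after which every token can
    have received information from every other token."""
    step = mask.astype(np.int64)
    reach = mask.copy()
    layers = 1
    while not reach.all():
        # Token i attends j in the new layer; j already carries info from k.
        new_reach = (step @ reach.astype(np.int64)) > 0
        if (new_reach == reach).all():
            raise ValueError("mask never connects all tokens")
        reach = new_reach
        layers += 1
    return layers

hier_small = layers_to_full_reach(hierarchical_mask(4, 3))
hier_large = layers_to_full_reach(hierarchical_mask(16, 3))
slide_small = layers_to_full_reach(sliding_window_mask(4, 3, window=1))
slide_large = layers_to_full_reach(sliding_window_mask(16, 3, window=1))
```

On this toy mask the hierarchical count stays the same when the number of frames grows from 4 to 16, while the sliding-window count grows with the frame span. The exact constant depends on how hops are counted and on the mask construction, so it should not be read as the paper's IPD value.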

Highlights & Insights

  • Shot awareness is the core innovation: Leveraging the natural structure of videos (shot boundaries) instead of artificial partitioning aligns better with video semantic distribution.
  • Theoretically guaranteed \(O(N)\) attention: \(\text{IPD} = O(1)\) balances efficiency and long-range modeling capability, unlike sliding window attention which decays with distance.
  • Solid engineering implementation: Custom Triton kernel implementation combined with practical memory/speed comparisons provides empirical data support alongside theoretical advantages.
  • Smaller models outperforming larger ones: The 8B Sophia outperforms 34-40B models on 6 out of 8 benchmarks, highlighting the importance of architectural design.

Limitations & Future Work

  • The \(\alpha\) and \(\beta\) values for frame pruning are fixed hyperparameters without adaptive adjustments (different videos/queries should have different optimal pruning rates).
  • The TransNet shot detector remains frozen and is not jointly trained with the VLM, posing a pipeline bottleneck where detector errors will propagate downstream.
  • Hierarchical Attention assumes visual tokens far outnumber textual tokens, which might not be applicable to short video scenarios.
  • Not validated in real-time or streaming video understanding scenarios.
  • Shot detection may have limited efficacy on videos without distinct shot boundaries (e.g., surveillance footage, continuous screen recordings).

Comparison with Related Work

  • vs. LongVU: LongVU selects frames by DINOv2 feature clustering without leveraging query information; Sophia's shot-aware, query-guided selection is more precise.
  • vs. Qwen2-VL: Qwen2-VL handles dynamic resolution but still relies on full attention, whereas Sophia's hierarchical attention is more efficient.
  • vs. InternVL2: Sophia achieves comparable performance with roughly 8.5 times fewer attention FLOPs.
  • vs. Video-LLaMA series: Video-LLaMA compresses via a video Q-Former but loses details, whereas Sophia's frame pruning retains the complete information of key frames.

Rating

  • Novelty: ⭐⭐⭐⭐ Shot-aware segmentation and IPD theoretical analysis are novel; their combination addresses practical bottlenecks.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on 8 benchmarks, offering detailed efficiency analysis and comprehensive ablations.
  • Writing Quality: ⭐⭐⭐⭐ Excellent integration of theory and practice, with clear and intuitive efficiency comparison diagrams.
  • Value: ⭐⭐⭐⭐⭐ Addresses core efficiency bottlenecks in long video understanding, possessing both solid engineering and academic value.