Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs¶
Conference: NeurIPS 2025 · arXiv: 2510.17364 · Code: Unavailable · Area: Model Compression · Keywords: Streaming video understanding, visual token compression, attention-based selection, long video, Video-LLM
TL;DR¶
This paper proposes rLiVS (Recurrent LLM-informed Visual Selection), a training-free and model-agnostic method for streaming video understanding. It achieves state-of-the-art performance on streaming video benchmarks through three complementary designs: LLM attention-guided visual token selection (retaining only ~6% of tokens), recurrent reuse of historical tokens, and caption-based retrieval for question answering.
Background & Motivation¶
- Background: Video-LLMs achieve strong performance on short video understanding, but face severe challenges in streaming scenarios involving hour-long online video processing with real-time question answering. The number of visual tokens grows linearly with frame count, making brute-force full-frame processing computationally infeasible for long videos and often exceeding context length limits.
- Limitations of Prior Work: Existing long-video understanding approaches each have notable drawbacks:
- Training-based methods (e.g., VideoStreaming, Flash-VStream): require additional training, suffer from extrapolation issues on arbitrary-length videos, and incur high training costs.
- KV-cache methods (e.g., ReKV): store complete decoder KV-caches, leading to large memory consumption (18.8 GB/hour) with significant redundancy.
- Caption-only methods (e.g., Goldfish): process short clips independently, lacking temporal continuity and making entity tracking difficult.
- Key Insight: Inspired by cognitive neuroscience—where selective attention governs encoding under limited memory capacity and past experience shapes current attention—the authors propose combining LLM self-attention for visual token selection, recurrent propagation of historical context, and text-based retrieval for question answering.
Method¶
Overall Architecture¶
Long videos are segmented into short clips (e.g., 16 frames) and processed in a streaming fashion. For each clip: (1) historical selected tokens are concatenated with the current clip's tokens and fed into the LLM to generate a caption; (2) a small number of key tokens are selected from the current clip based on attention weights and appended to a FIFO history queue; (3) the generated caption is stored in long-term textual memory. At inference time, the most relevant captions are retrieved from textual memory and fed to the LLM to generate answers.
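To make the clip-by-clip flow concrete, below is a minimal structural sketch of this loop. The callables `encode`, `caption_with_attn`, and `select` are placeholder stand-ins for the backbone's actual interfaces, and the history here is bounded by a clip count for simplicity (the paper bounds it by a token budget \(W\)); this is an illustration of the order of operations, not the authors' implementation.

```python
from collections import deque
import numpy as np

def stream_video(clips, encode, caption_with_attn, select, max_clips=16, n_select=196):
    """Illustrative streaming loop: (1) caption the current clip together with
    historical selected tokens, (2) keep only the attention-selected tokens in a
    FIFO history, (3) store the caption in long-term textual memory."""
    history = deque(maxlen=max_clips)   # FIFO; the paper evicts by a token budget W instead
    captions = []                       # long-term textual memory, queried at answer time
    for clip in clips:
        visual = encode(clip)                                  # current clip's visual tokens
        prefix = np.concatenate(list(history)) if history else visual[:0]
        caption, attn = caption_with_attn(prefix, visual)      # caption conditioned on history
        history.append(select(attn, visual, n_select))         # attention-based token selection
        captions.append(caption)
    return captions

# toy stand-ins so the loop runs end to end (a "clip" is just its token count here)
dim = 8
encode = lambda n: np.random.randn(n, dim)
caption_with_attn = lambda pre, vis: (f"caption of {len(vis)} tokens", np.random.rand(len(vis)))
select = lambda attn, vis, k: vis[np.sort(np.argsort(attn)[-k:])]  # top-k tokens, temporal order kept
print(stream_video([3136] * 3, encode, caption_with_attn, select)[-1])
```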
Key Designs¶
- Attention-based Visual Token Selection
After caption generation, the already-computed attention matrices are used to measure the importance of each visual token. The attention coefficients from caption tokens to visual tokens are extracted from layer \(l\), head \(h\):
\(\mathbf{A}^{l,h}_V = \mathbf{A}^{l,h}[TN_V+N_I : TN_V+N_I+N_C, \; 0:TN_V]\)
A global importance score for each visual token \(j\) is computed by averaging across all caption tokens, attention heads, and layers:
\(a_j = \frac{1}{L}\sum_{l=1}^{L}\frac{1}{H}\sum_{h=1}^{H}\left(\frac{1}{N_C}\sum_{i=1}^{N_C}\mathbf{A}^{l,h}_{V_{ij}}\right)\)
The top-\(N_S\) tokens by score are retained (\(N_S \ll N_V\)); in practice, only 6.25% are kept (196 out of 3,136 tokens). Uniformly sampling 4 out of \(L\) layers is sufficient for robust results.
Design Motivation: Attention scores are signals already computed during caption generation, introducing no additional overhead, and naturally reflect which visual tokens are most relevant to the current linguistic understanding.
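A toy reproduction of this scoring rule, assuming the sequence layout [visual | instruction | caption] implied by the slice above, with the recurrent prefix omitted and down-scaled sizes for the demo:

```python
import torch

def select_visual_tokens(attn, n_visual, n_instr, n_caption, n_select):
    """Score each visual token by the attention it receives from caption tokens,
    averaged over caption tokens, heads, and (sampled) layers; keep the top n_select.

    attn: (num_sampled_layers, num_heads, seq_len, seq_len) attention maps that were
          already computed while generating the caption.
    """
    cap_start = n_visual + n_instr
    attn_v = attn[:, :, cap_start:cap_start + n_caption, :n_visual]  # A_V: caption rows -> visual cols
    scores = attn_v.mean(dim=(0, 1, 2))                              # a_j, one score per visual token
    keep = torch.topk(scores, k=n_select).indices
    return torch.sort(keep).values                                   # preserve temporal order

# down-scaled toy sizes (the paper keeps 196 of 3,136 tokens and samples 4 layers)
L_sub, H, N_V, N_I, N_C, N_S = 4, 8, 784, 16, 32, 49
seq = N_V + N_I + N_C
attn = torch.rand(L_sub, H, seq, seq).softmax(dim=-1)
print(select_visual_tokens(attn, N_V, N_I, N_C, N_S).shape)  # torch.Size([49])
```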
- Recurrent Long-Video Processing
A FIFO queue maintains historically selected tokens \([\mathbf{S}^{(0)}, \mathbf{S}^{(1)}, \ldots, \mathbf{S}^{(t)}]\), which are prepended as a context prefix when processing the next clip. Tokens from the earliest clip are discarded when the context window limit \(W\) is exceeded.
The recurrent design serves a dual purpose: (1) it enhances visual continuity and coherence across clips; (2) it guides LLM attention toward content consistent with historical context, reinforcing the selection effect.
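The FIFO eviction under a token budget can be sketched as follows; the class and the 10K-token / 196-tokens-per-clip numbers are illustrative, simply tying the window limit \(W\) to the per-clip selection size.

```python
from collections import deque

class TokenHistory:
    """Hypothetical short-term visual memory: per-clip selected tokens in a FIFO
    queue, with whole clips evicted oldest-first once the window limit W is exceeded."""

    def __init__(self, window_limit):
        self.window_limit = window_limit
        self.clips = deque()
        self.total = 0

    def push(self, selected_tokens):
        self.clips.append(selected_tokens)
        self.total += len(selected_tokens)
        while self.total > self.window_limit and len(self.clips) > 1:
            self.total -= len(self.clips.popleft())  # discard the earliest clip's tokens

    def prefix(self):
        """History to prepend before the next clip's visual tokens."""
        return [tok for clip in self.clips for tok in clip]

# with 196 selected tokens per clip, a 10K-token window holds about 51 clips
history = TokenHistory(window_limit=10_000)
for t in range(100):
    history.push([(t, j) for j in range(196)])
print(len(history.clips), history.total)  # 51 9996
```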
- Caption-based Retrieval for Question Answering
The embeddings of captions from all clips \(\{\mathbf{X}_C^{(t)}\}\) are stored. Given a question \(q\), the average cosine similarity between question tokens \(\mathbf{X}_q\) and caption tokens is computed. Maximal Marginal Relevance (MMR) is applied to balance relevance and diversity, and top-\(K\) captions are retrieved. Only the retrieved captions (not visual tokens) are fed to the LLM for answer generation.
Design Motivation: Experiments show that cosine similarities between visual tokens and questions are concentrated in \([-0.02, 0.06]\), offering nearly no discriminative power, whereas caption similarities span \([0.4, 0.9]\) with strong discriminability. Moreover, since LLMs are well-suited for long-form text reasoning, converting video QA into textual QA proves more effective.
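A small numpy sketch of the retrieval step (cosine relevance plus an MMR re-ranking); for brevity it uses one pooled embedding per caption and question rather than the per-token averaging described above, and the \(\lambda\) weight and embedding sizes are illustrative assumptions, not values from the paper:

```python
import numpy as np

def mmr_retrieve(question_emb, caption_embs, k=5, lam=0.7):
    """Pick k captions balancing relevance to the question against redundancy
    with captions already selected (Maximal Marginal Relevance)."""
    q = question_emb / np.linalg.norm(question_emb)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    rel = c @ q                                   # cosine similarity caption <-> question
    sim = c @ c.T                                 # caption <-> caption, for the diversity term

    selected, candidates = [], list(range(len(c)))
    while candidates and len(selected) < k:
        redundancy = (sim[candidates][:, selected].max(axis=1)
                      if selected else np.zeros(len(candidates)))
        mmr = lam * rel[candidates] - (1 - lam) * redundancy
        best = candidates[int(np.argmax(mmr))]
        selected.append(best)
        candidates.remove(best)
    return selected                               # indices of clips whose captions go to the LLM

# toy memory of 200 clip captions with 384-dim embeddings
rng = np.random.default_rng(0)
caption_embs = rng.standard_normal((200, 384))
question_emb = rng.standard_normal(384)
print(mmr_retrieve(question_emb, caption_embs, k=5))
```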
Loss & Training¶
The method is entirely training-free; it operates directly on pretrained Video-LLMs without any architectural modification, making it applicable to any short-video-pretrained Video-LLM.
Key Experimental Results¶
Main Results¶
Streaming benchmarks (RVS-Ego / RVS-Movie):
| Method | Backbone | RVS-Ego Acc | RVS-Movie Acc | Latency | VRAM |
|---|---|---|---|---|---|
| Flash-VStream-7B | Dedicated | 57.3 | 53.1 | 2.1s | 19GB |
| ReKV | LLaVA-OV 7B | 63.7 | 54.4 | 2.7s | 36GB |
| rLiVS | LLaVA-OV 7B | 65.3 | 57.7 | 1.9s | 25GB |
| rLiVS | Qwen2.5-VL 7B | 68.1 | 56.1 | 2.7s | 19GB |
| ReKV | LLaVA-OV 0.5B | 54.7 | 44.6 | 1.6s | 19GB |
| rLiVS | LLaVA-OV 0.5B | 57.6 | 51.3 | 1.5s | 11GB |
Offline benchmarks:
| Method | VS-Ego Acc | VS-Movie Acc | MovieChat Acc | CG-Bench Acc |
|---|---|---|---|---|
| Flash-VStream-7B | 59.0 | 56.1 | - | - |
| Goldfish | - | - | 67.6 | - |
| rLiVS | 61.0 | 59.3 | 78.0 | 33.1 |
Ablation Study¶
Token selection method comparison on NextQA (retention ratio in parentheses):
| Selection Method | Accuracy |
|---|---|
| Full model (100%) | 78.6 |
| Uniform sampling (6%) | 75.5 |
| Mean pooling (6%) | 70.7 |
| K-Means (6%) | 76.8 |
| Attention selection (6%) | 77.0 |
| Attention selection (12%) | 78.4 |
Design choice ablation (streaming benchmarks):
| Configuration | RVS-Ego Acc | RVS-Movie Acc | Note |
|---|---|---|---|
| rLiVS (full) | 65.3 | 57.7 | Recurrent + attention selection + caption QA |
| w/o recurrence | 62.5 | 53.7 | Recurrence contributes 3–4% gain |
| Visual token retrieval for QA | 58.2 | 48.4 | Captions far outperform visual tokens |
| Uniform sampling instead of attention | 64.2 | 56.0 | Attention selection gains 1–2% |
Key Findings¶
- Retaining only 6% of visual tokens costs merely 1.6 accuracy points on NextQA (78.6 → 77.0); at 12%, performance is nearly lossless.
- Recurrently propagating historical tokens improves long-video understanding by 3–4 percentage points.
- Captions substantially outperform visual tokens as the information carrier for both retrieval and question answering.
- A 0.5B model equipped with rLiVS surpasses most competing methods that require 7B parameters.
- A context length of 10K tokens represents the optimal trade-off between efficiency and effectiveness.
Highlights & Insights¶
- The method is remarkably simple and elegant: it repurposes the LLM's already-computed attention for token selection at zero additional cost.
- The model-agnostic design enables plug-and-play integration with arbitrary Video-LLMs such as LLaVA-OV and Qwen2.5-VL.
- Zero KV-cache storage: unlike ReKV, no full KV-cache needs to be stored (saving 18.8 GB/hour).
- The design is inspired by cognitive science: attention → selective memory → recurrent processing mirrors human visual information processing.
Limitations & Future Work¶
- Focusing exclusively on selected tokens may cause the method to miss fine-grained details.
- The FIFO memory buffer operates on temporal rather than semantic priority, potentially discarding critical early-stage information.
- Recurrent caption generation may introduce cross-segment redundancy.
- The method fully inherits the capabilities and limitations of the pretrained backbone.
- Future work could explore adaptive compression rates that dynamically adjust the retention ratio according to scene complexity.
Related Work & Insights¶
- ReKV stores complete KV-caches for streaming understanding → rLiVS achieves better results with far fewer tokens.
- Goldfish processes short clips independently → rLiVS enhances temporal continuity through recurrence.
- Using attention as a token importance indicator → this principle can be generalized to other multimodal long-context scenarios.
- The insight of "converting video QA into textual QA" is worth adopting in other long-video systems.
Rating¶
- Novelty: ⭐⭐⭐⭐ Integrates known concepts (attention selection, recurrent processing, caption QA) into an elegant unified framework.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers both streaming and offline benchmarks with comprehensive ablations and efficiency comparisons.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear exposition with complete algorithmic pseudocode.
- Value: ⭐⭐⭐⭐⭐ Training-free, plug-and-play, efficient, and practical — establishes a strong baseline for streaming video understanding.