Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs

Conference: NeurIPS 2025 · arXiv: 2510.17364 · Code: Unavailable · Area: Model Compression · Keywords: Streaming video understanding, visual token compression, attention-based selection, long video, Video-LLM

TL;DR

This paper proposes rLiVS (Recurrent LLM-informed Visual Selection), a training-free and model-agnostic method for streaming video understanding. It achieves state-of-the-art performance on streaming video benchmarks through three complementary designs: LLM attention-guided visual token selection (retaining only ~6% of tokens), recurrent reuse of historical tokens, and caption-based retrieval for question answering.

Background & Motivation

  • Background: Video-LLMs achieve strong performance on short video understanding, but face severe challenges in streaming scenarios involving hour-long online video processing with real-time question answering. The number of visual tokens grows linearly with frame count, making brute-force full-frame processing computationally infeasible for long videos and often exceeding context length limits.
  • Limitations of Prior Work: Existing long-video understanding approaches each have notable drawbacks:
      • Training-based methods (e.g., VideoStreaming, Flash-VStream): require additional training, suffer from extrapolation issues on arbitrary-length videos, and incur high training costs.
      • KV-cache methods (e.g., ReKV): store complete decoder KV-caches, leading to large memory consumption (18.8 GB/hour) with significant redundancy.
      • Caption-only methods (e.g., Goldfish): process short clips independently, lacking temporal continuity and making entity tracking difficult.
  • Key Insight: Inspired by cognitive neuroscience—where selective attention governs encoding under limited memory capacity and past experience shapes current attention—the authors propose combining LLM self-attention for visual token selection, recurrent propagation of historical context, and text-based retrieval for question answering.

Method

Overall Architecture

Long videos are segmented into short clips (e.g., 16 frames) and processed in a streaming fashion. For each clip: (1) historical selected tokens are concatenated with the current clip's tokens and fed into the LLM to generate a caption; (2) a small number of key tokens are selected from the current clip based on attention weights and appended to a FIFO history queue; (3) the generated caption is stored in long-term textual memory. When a question arrives, the most relevant captions are retrieved from textual memory and fed to the LLM to generate the answer.
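To make the three steps concrete, the following is a minimal, self-contained sketch of the streaming loop under stated assumptions: the encoder, captioner, shapes, and all names are random stand-ins rather than the paper's code, and the attention-based selection is reduced to a simple top-k over averaged scores.

```python
# Minimal, self-contained sketch of the streaming loop; the encoder, captioner,
# shapes, and variable names are random stand-ins, not the paper's implementation.
import numpy as np

N_V, N_S, D, W = 3136, 196, 32, 10_000   # tokens per clip, kept tokens, feature dim, context budget

def encode_clip(clip_id):
    """Stand-in for the vision encoder: one row per visual token of the clip."""
    return np.random.randn(N_V, D)

def caption_clip(history_tokens, clip_tokens):
    """Stand-in for LLM captioning with the history prefix; returns a caption
    and dummy attention of caption tokens over the current clip's visual tokens."""
    attn = np.random.rand(16, clip_tokens.shape[0])   # (N_C, N_V)
    return "a person walks through a kitchen", attn

history, text_memory = [], []
for clip_id in range(4):                              # pretend 4 streamed clips
    clip_tokens = encode_clip(clip_id)
    caption, attn = caption_clip(history, clip_tokens)        # (1) caption the clip
    scores = attn.mean(axis=0)                                # importance per visual token
    keep = np.argsort(scores)[-N_S:]                          # (2) keep the top ~6% tokens
    history.append(clip_tokens[keep])
    while sum(h.shape[0] for h in history) > W:               # FIFO eviction at budget W
        history.pop(0)
    text_memory.append(caption)                               # (3) long-term textual memory

print(len(history), len(text_memory))
```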

Key Designs

  1. Attention-based Visual Token Selection

After caption generation, the attention matrices already computed during decoding are reused to measure the importance of each visual token. The attention coefficients from caption tokens to visual tokens are extracted for each layer \(l\) and head \(h\):

\(\mathbf{A}^{l,h}_V = \mathbf{A}^{l,h}[TN_V+N_I : TN_V+N_I+N_C, \; 0:TN_V]\)

A global importance score for each visual token \(j\) is computed by averaging across all caption tokens, attention heads, and layers:

\(a_j = \frac{1}{L}\sum_{l=1}^{L}\frac{1}{H}\sum_{h=1}^{H}\left(\frac{1}{N_C}\sum_{i=1}^{N_C}\mathbf{A}^{l,h}_{V_{ij}}\right)\)

The top-\(N_S\) tokens by score are retained (\(N_S \ll N_V\)); in practice, only 6.25% are kept (196 out of 3,136 tokens). Uniformly sampling 4 out of \(L\) layers is sufficient for robust results.

Design Motivation: Attention scores are already computed during caption generation, so reusing them introduces no additional overhead, and they naturally reflect which visual tokens are most relevant to the LLM's current linguistic understanding.
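The scoring rule can be sketched in a few lines, assuming the caption-to-visual attention slice \(\mathbf{A}^{l,h}_V\) for the current clip has already been gathered into a tensor of shape (layers, heads, caption tokens, visual tokens); the shapes, normalization, and variable names below are illustrative, not the paper's implementation.

```python
# Sketch of the importance score a_j and top-N_S selection, assuming the
# caption-to-visual attention slice A_V of shape (L, H, N_C, N_V) is already
# gathered for the current clip. Shapes and names are illustrative only.
import torch

L, H, N_C, N_V, N_S = 4, 28, 24, 3136, 196        # 4 sampled layers; keep 6.25% of tokens

A_V = torch.rand(L, H, N_C, N_V)                  # reused from caption generation
A_V = A_V / A_V.sum(dim=-1, keepdim=True)         # each row behaves like a softmax distribution

# a_j: average over layers, heads, and caption tokens of attention to visual token j
scores = A_V.mean(dim=(0, 1, 2))                  # shape (N_V,)

keep_idx = torch.topk(scores, k=N_S).indices      # indices of the most attended tokens
keep_idx, _ = torch.sort(keep_idx)                # preserve the original token order

visual_tokens = torch.randn(N_V, 3584)            # dummy visual token embeddings
selected = visual_tokens[keep_idx]                # appended to the FIFO history queue
print(selected.shape)                             # torch.Size([196, 3584])
```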

  2. Recurrent Long-Video Processing

A FIFO queue maintains historically selected tokens \([\mathbf{S}^{(0)}, \mathbf{S}^{(1)}, \ldots, \mathbf{S}^{(t)}]\), which are prepended as a context prefix when processing the next clip. Tokens from the earliest clip are discarded when the context window limit \(W\) is exceeded.

The recurrent design serves a dual purpose: (1) it enhances visual continuity and coherence across clips; (2) it guides LLM attention toward content consistent with historical context, reinforcing the selection effect.
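One possible realization of the FIFO history is sketched below, assuming a simple per-clip eviction policy once the token budget \(W\) is exceeded; the class and method names are hypothetical.

```python
# Possible realization of the FIFO history with a per-clip eviction policy once
# the token budget W is exceeded. Class and method names are hypothetical.
from collections import deque

import torch

class TokenHistory:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens                  # context window limit W
        self.clips = deque()                          # selected tokens S^(t), one entry per clip

    def append(self, selected_tokens: torch.Tensor):
        self.clips.append(selected_tokens)
        # Discard the earliest clips until the prefix fits the budget again.
        while sum(t.shape[0] for t in self.clips) > self.max_tokens:
            self.clips.popleft()

    def prefix(self) -> torch.Tensor:
        """Concatenated historical tokens, prepended before the next clip."""
        return torch.cat(list(self.clips), dim=0) if self.clips else torch.empty(0)

history = TokenHistory(max_tokens=10_000)             # 10K-token context per the ablation
for _ in range(60):                                   # e.g. 60 streamed clips
    history.append(torch.randn(196, 256))             # 196 selected tokens (dummy 256-dim features)
print(history.prefix().shape)                         # stays within the 10K budget
```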

  3. Caption-based Retrieval for Question Answering

The embeddings of captions from all clips \(\{\mathbf{X}_C^{(t)}\}\) are stored. Given a question \(q\), the average cosine similarity between question tokens \(\mathbf{X}_q\) and caption tokens is computed. Maximal Marginal Relevance (MMR) is applied to balance relevance and diversity, and top-\(K\) captions are retrieved. Only the retrieved captions (not visual tokens) are fed to the LLM for answer generation.

Design Motivation: Experiments show that cosine similarities between visual tokens and questions are concentrated in \([-0.02, 0.06]\), offering nearly no discriminative power, whereas caption similarities span \([0.4, 0.9]\) with strong discriminability. Moreover, since LLMs are well-suited for long-form text reasoning, converting video QA into textual QA proves more effective.
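A small self-contained sketch of the retrieval step is given below, using a single embedding per caption and per question for brevity (the paper averages token-level similarities); the MMR weighting \(\lambda\) and the embedding model are illustrative assumptions.

```python
# Self-contained sketch of caption retrieval with Maximal Marginal Relevance (MMR),
# using one embedding per caption/question for brevity; the lambda weighting and
# embedding model are illustrative assumptions.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def mmr_retrieve(question_emb, caption_embs, k=5, lam=0.7):
    """Pick k captions, trading off relevance to the question (lam) against
    redundancy with captions already selected (1 - lam)."""
    relevance = np.array([cosine(question_emb, c) for c in caption_embs])
    selected, candidates = [], list(range(len(caption_embs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max((cosine(caption_embs[i], caption_embs[j]) for j in selected),
                             default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected                                   # indices of retrieved captions

# Toy usage: random embeddings stand in for encoded captions and the question.
rng = np.random.default_rng(0)
caption_embs = rng.normal(size=(40, 256))             # one embedding per clip caption
question_emb = rng.normal(size=256)
print(mmr_retrieve(question_emb, caption_embs, k=5))  # only these captions go to the LLM
```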

Loss & Training

The method is entirely training-free; it operates directly on pretrained Video-LLMs without any architectural modification, making it applicable to any short-video-pretrained Video-LLM.

Key Experimental Results

Main Results

Streaming benchmarks (RVS-Ego / RVS-Movie):

| Method | Backbone | RVS-Ego Acc | RVS-Movie Acc | Latency | VRAM |
| --- | --- | --- | --- | --- | --- |
| Flash-VStream-7B | Dedicated | 57.3 | 53.1 | 2.1s | 19GB |
| ReKV | LLaVA-OV 7B | 63.7 | 54.4 | 2.7s | 36GB |
| rLiVS | LLaVA-OV 7B | 65.3 | 57.7 | 1.9s | 25GB |
| rLiVS | Qwen2.5-VL 7B | 68.1 | 56.1 | 2.7s | 19GB |
| ReKV | LLaVA-OV 0.5B | 54.7 | 44.6 | 1.6s | 19GB |
| rLiVS | LLaVA-OV 0.5B | 57.6 | 51.3 | 1.5s | 11GB |

Offline benchmarks:

| Method | VS-Ego Acc | VS-Movie Acc | MovieChat Acc | CG-Bench Acc |
| --- | --- | --- | --- | --- |
| Flash-VStream-7B | 59.0 | 56.1 | - | - |
| Goldfish | - | - | 67.6 | - |
| rLiVS | 61.0 | 59.3 | 78.0 | 33.1 |

Ablation Study

Token selection method comparison on NextQA (retention ratio in parentheses):

| Selection Method | Accuracy |
| --- | --- |
| Full model (100%) | 78.6 |
| Uniform sampling (6%) | 75.5 |
| Mean pooling (6%) | 70.7 |
| K-Means (6%) | 76.8 |
| Attention selection (6%) | 77.0 |
| Attention selection (12%) | 78.4 |

Design choice ablation (streaming benchmarks):

| Configuration | RVS-Ego Acc | RVS-Movie Acc | Note |
| --- | --- | --- | --- |
| rLiVS (full) | 65.3 | 57.7 | Recurrent + attention selection + caption QA |
| w/o recurrence | 62.5 | 53.7 | Recurrence contributes 3–4% gain |
| Visual token retrieval for QA | 58.2 | 48.4 | Captions far outperform visual tokens |
| Uniform sampling instead of attention | 64.2 | 56.0 | Attention selection gains 1–2% |

Key Findings

  • Retaining only 6% of visual tokens results in a performance drop of merely 1.6% on NextQA; at 12%, performance is nearly lossless.
  • Recurrently propagating historical tokens improves long-video understanding by 3–4 percentage points.
  • Captions substantially outperform visual tokens as the information carrier for both retrieval and question answering.
  • A 0.5B model equipped with rLiVS surpasses most competing methods that require 7B parameters.
  • A context length of 10K tokens represents the optimal trade-off between efficiency and effectiveness.

Highlights & Insights

  • The method is remarkably simple and elegant: it repurposes the LLM's already-computed attention for token selection at zero additional cost.
  • The model-agnostic design enables plug-and-play integration with arbitrary Video-LLMs such as LLaVA-OV and Qwen2.5-VL.
  • Zero KV-cache storage: unlike ReKV, no full KV-cache needs to be stored (saving 18.8 GB/hour).
  • The design is inspired by cognitive science: attention → selective memory → recurrent processing mirrors human visual information processing.

Limitations & Future Work

  • Focusing exclusively on selected tokens may cause the method to miss fine-grained details.
  • The FIFO memory buffer operates on temporal rather than semantic priority, potentially discarding critical early-stage information.
  • Recurrent caption generation may introduce cross-segment redundancy.
  • The method fully inherits the capabilities and limitations of the pretrained backbone.
  • Future work could explore adaptive compression rates that dynamically adjust the retention ratio according to scene complexity.
Comparisons & Transferable Insights

  • ReKV stores complete KV-caches for streaming understanding → rLiVS achieves better results with far fewer tokens.
  • Goldfish processes short clips independently → rLiVS enhances temporal continuity through recurrence.
  • Using attention as a token importance indicator → this principle can be generalized to other multimodal long-context scenarios.
  • The insight of "converting video QA into textual QA" is worth adopting in other long-video systems.

Rating

  • Novelty: ⭐⭐⭐⭐ Integrates known concepts (attention selection, recurrent processing, caption QA) into an elegant unified framework.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers both streaming and offline benchmarks with comprehensive ablations and efficiency comparisons.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear exposition with complete algorithmic pseudocode.
  • Value: ⭐⭐⭐⭐⭐ Training-free, plug-and-play, efficient, and practical — establishes a strong baseline for streaming video understanding.