VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges¶
Conference: ICCV 2025
arXiv: 2409.01071
Code: https://github.com/bigai-nlco/VideoLLaMB
Area: Video Understanding
Keywords: Long video understanding, recurrent memory, streaming video, video-language models, frame retrieval
TL;DR¶
The paper proposes VideoLLaMB, which achieves long streaming video understanding with linear GPU memory scaling by combining SceneTiling semantic segmentation, recurrent memory bridge layers, and a memory cache retrieval mechanism, yielding an average improvement of 4.2 points across 4 VideoQA benchmarks.
Background & Motivation¶
Large-scale video-language models (e.g., GPT-4o) have demonstrated strong potential for streaming video understanding, yet face the following challenges:
Computational bottleneck: Processing the high-dimensional content of long videos is computationally unaffordable for most academic researchers.
Information loss from compression: Sampling, aggregation, and semantic merging strategies discard critical visual cues.
Semantic discontinuity from segmentation: Splitting videos into short clips disrupts semantic flow and impairs holistic understanding.
Evaluation bias: Existing benchmarks suffer from static and language biases, failing to comprehensively assess long-video capabilities.
Core motivation: To design an efficient framework that encodes the entire video sequence via a recurrent memory mechanism while preserving semantic continuity, without discarding visual information.
Method¶
Overall Architecture¶
VideoLLaMB consists of three core modules: (1) a SceneTiling semantic segmenter, (2) recurrent memory bridge layers, and (3) a memory cache retriever. After ViT encoding, the video is segmented by SceneTiling; the bridge layers recurrently encode the semantic segments one by one; the memory cache maintains long-range dependencies via retrieval; and the memory-enhanced representations are finally fed into the LLM.
Key Designs¶
- SceneTiling Semantic Segmentation Algorithm: A model-free scene segmentation algorithm inspired by TextTiling. It computes the cosine similarity between adjacent frame [CLS] tokens, \(c_i = S_C(\text{ViT}(v_i), \text{ViT}(v_{i+1}))\), then a depth score \(d_i = (cl_i + cr_i - 2c_i)/2\), where \(cl_i\) and \(cr_i\) are the nearest similarity peaks to the left and right of position \(i\). Segment boundaries are placed where the depth score exceeds the threshold \(\mu + \alpha \cdot \sigma\), with \(\mu\) and \(\sigma\) the mean and standard deviation of the depth scores. This keeps intra-segment semantics consistent and adapts to streaming video captioning without any training (see the first sketch after this list).
- Recurrent Memory Bridge Layers: Recurrent memory tokens are introduced into a single-layer Transformer (the bridge layer). For each semantic segment \(s_i\), the memory tokens are prepended as \([m_i; s_i]\), and self-attention yields \([m_{i+1}; o_i] = \text{BridgeLayer}([m_i; s_i])\). Iterating over all semantic segments updates the memory tokens, compressing historical video content into memory while preserving per-frame detail via the projection (see the second sketch below).
- Memory Cache with Retrieval: At each timestep \(i\), all historical memory tokens \(M_i = [m_1, ..., m_i]\) are stored. The current memory is updated via a cross-attention self-retrieval mechanism, \(m_{i+1} = \text{Softmax}\!\left(W_i^Q m_i (W_i^K M_i)^\top / \sqrt{d_k}\right) W_i^V M_i\), which alleviates gradient vanishing and maintains long-range dependencies (see the third sketch below).
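A minimal sketch of SceneTiling's boundary detection as described above, in Python with NumPy. It assumes the per-frame [CLS] features have already been extracted; the function and argument names (`scene_tiling`, `frame_feats`, `alpha`) are illustrative, not taken from the released code.

```python
import numpy as np

def scene_tiling(frame_feats: np.ndarray, alpha: float = 0.5) -> list[int]:
    """Split a video into semantic segments from per-frame ViT [CLS] features.

    frame_feats: (N, D) array, one [CLS] embedding per frame.
    Returns indices i such that a boundary is placed between frames i and i+1.
    """
    # Adjacent-frame cosine similarity: c_i = S_C(ViT(v_i), ViT(v_{i+1})).
    normed = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    c = np.sum(normed[:-1] * normed[1:], axis=1)  # shape (N-1,)

    def peak(start: int, step: int) -> float:
        # Walk left (step=-1) or right (step=+1) while similarity keeps rising,
        # returning the nearest local maximum cl_i / cr_i.
        best, j = c[start], start + step
        while 0 <= j < len(c) and c[j] >= best:
            best, j = c[j], j + step
        return best

    # Depth score d_i = (cl_i + cr_i - 2 * c_i) / 2.
    d = np.array([(peak(i, -1) + peak(i, 1) - 2 * c[i]) / 2 for i in range(len(c))])

    # Cut wherever the depth score exceeds mu + alpha * sigma.
    threshold = d.mean() + alpha * d.std()
    return [i for i, di in enumerate(d) if di > threshold]
```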
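The recurrent bridge layer can be sketched as a single Transformer encoder layer that carries a block of memory tokens across segments. The class and hyperparameter names below (`MemoryBridgeLayer`, `num_memory_tokens`, `d_model`) are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MemoryBridgeLayer(nn.Module):
    """Single-layer Transformer that threads memory tokens through segments."""

    def __init__(self, d_model: int = 1024, n_heads: int = 8, num_memory_tokens: int = 8):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Learnable initial memory m_0, prepended to the first segment.
        self.init_memory = nn.Parameter(torch.zeros(1, num_memory_tokens, d_model))
        self.num_mem = num_memory_tokens

    def forward(self, segments: list[torch.Tensor]):
        """segments: list of (B, T_k, d_model) tensors, one per semantic segment.
        Returns the per-segment outputs o_k and the final memory tokens."""
        memory = self.init_memory.expand(segments[0].size(0), -1, -1)
        outputs = []
        for seg in segments:
            # [m_{k+1}; o_k] = BridgeLayer([m_k; s_k])
            joint = self.layer(torch.cat([memory, seg], dim=1))
            memory, out = joint[:, :self.num_mem], joint[:, self.num_mem:]
            outputs.append(out)
        return outputs, memory
```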
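Finally, a single-head sketch of the memory-cache retrieval update following the softmax formula above; splitting the projections into separate `w_q`/`w_k`/`w_v` linear maps of size `d_model` is an assumption made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryCacheRetrieval(nn.Module):
    """Update the current memory by cross-attending over all cached memories."""

    def __init__(self, d_model: int = 1024):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, memory: torch.Tensor, cache: list[torch.Tensor]) -> torch.Tensor:
        """memory: (B, M, D) current memory tokens m_i.
        cache: list of stored memories [m_1, ..., m_i], each (B, M, D)."""
        cached = torch.cat(cache, dim=1)                       # (B, i*M, D)
        q, k, v = self.w_q(memory), self.w_k(cached), self.w_v(cached)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                                        # m_{i+1}: (B, M, D)
```

In the streaming loop, the retrieved tensor would replace the memory passed to the next bridge-layer step, giving gradients a shortcut back to distant segments instead of flowing only through the chain of recurrent updates.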
Loss & Training¶
- Training follows the same video data protocol as PLLaVA.
- The LLM backbone is Vicuna-7B-v1.5; the visual backbone is ViT-L/14.
- Both training and evaluation use 16 frames across 4 semantic segments.
- Time complexity is \(\mathcal{O}(K^2)\); space complexity is \(\mathcal{O}(K)\) (where \(K\) is the number of segments), enabling linear GPU memory scaling.
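A quick sanity check of these complexities, under the assumption (not spelled out in the note) that the \(\mathcal{O}(K^2)\) term comes from retrieval over the growing memory cache, while only the fixed-size memory tokens are kept across steps:

\[
\text{time: } \sum_{i=1}^{K} \mathcal{O}(i) = \mathcal{O}(K^2), \qquad \text{space: } |M_K| = K \cdot |m| = \mathcal{O}(K).
\]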
Key Experimental Results¶
Main Results¶
EgoSchema zero-shot accuracy:
| Model | LLM | Frames | Accuracy |
|---|---|---|---|
| GPT-4o | OpenAI API | 16 | 72.2 |
| Video-LLaVA | Vicuna-7B | 8 | 40.2 |
| PLLaVA | Vicuna-7B | 16 | 45.6 |
| PLLaVA | Vicuna-7B | 32 | 43.8 |
| VideoLLaMB | Vicuna-7B | 32 (trained on 8) | 53.8 |
NExT-QA accuracy comparison:
| Model | Temporal | Causal | Description | All |
|---|---|---|---|---|
| PLLaVA* | 62.2 | 68.5 | 79.7 | 68.2 |
| VideoLLaMB* | 66.8 | 71.6 | 78.4 | 71.1 |
Ablation Study¶
| Configuration | Key Improvement | Notes |
|---|---|---|
| Base linear projection | — | Strong detail retention, weak memory |
| + Resampler | Semantic compression | Strong compression, detail loss |
| + Recurrent memory bridge layers | +4.2 avg | Balances compression and detail |
| + Memory cache retrieval | +Robustness on long videos | Mitigates gradient vanishing |
| + SceneTiling | +Semantic coherence | Training-free streaming captioning |
Key Findings¶
- Robust performance is maintained when video length is scaled up to 8× the training length.
- Target frames are accurately retrieved across videos of 1–320 seconds on the NIAVH (Needle in a Video Haystack) benchmark.
- A single A100 GPU can process 320 frames (trained on only 16 frames).
- On the EgoPlan task, VideoLLaMB achieves the best performance among all 7B models, outperforming PLLaVA by 2.06 points.
Highlights & Insights¶
- SceneTiling elegantly transfers the TextTiling concept to video segmentation, maintaining semantic consistency without any training.
- The recurrent memory bridge layers are implemented without modifying the visual encoder or LLM architecture, enabling a plug-and-play design.
- Linear memory scaling makes long-video understanding feasible for academic research.
- The NIAVH benchmark fills the gap in frame-level retrieval evaluation.
Limitations & Future Work¶
- The method is built on a 7B LLM, so a notable gap remains compared to large proprietary models such as GPT-4o.
- Segmentation quality depends on the representational capability of the ViT [CLS] token.
- The growing memory cache demands more efficient eviction and compression strategies for very long videos.
- Training uses a limited number of frames (16), so generalization to ultra-long videos remains to be validated.
Related Work & Insights¶
- The combination of recurrent memory and retrieval can be generalized to other multimodal tasks requiring long-range dependency modeling.
- The training-free streaming paradigm of SceneTiling has practical value for real-time video understanding.
- The bridge layer concept—balancing projection and compression—is worth adopting in other video-language models.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Overall | ⭐⭐⭐⭐ |