# StreamForest: Efficient Online Video Understanding with Persistent Event Memory
Conference: NeurIPS 2025 arXiv: 2509.24871 Code: GitHub Area: Autonomous Driving / Online Video Understanding Keywords: streaming video understanding, persistent event memory, memory tree structure, visual token compression, multimodal large language models
## TL;DR
This paper proposes StreamForest, an architecture that adaptively organizes streaming video frames into multiple event-level tree structures via a "Persistent Event Memory Forest," combined with a "Fine-grained Spatiotemporal Window" to capture short-term visual cues. The method achieves 77.3% accuracy on StreamingBench and retains 96.8% of performance under extreme compression (only 1024 visual tokens).
## Background & Motivation
- Background: Multimodal large language models (MLLMs) have achieved remarkable progress in offline video understanding, yet face two major challenges in real-time streaming scenarios: (1) the memory burden of storing historical features from continuously arriving frames; and (2) insufficient capacity for real-time spatiotemporal reasoning.
- Limitations of Prior Work: Existing streaming video processing strategies exhibit clear deficiencies. Sampling-stage compression (e.g., aggressively discarding frames) sacrifices fine-grained spatiotemporal reasoning; storage-stage compression (merging frames based on inter-frame similarity) tends to miss critical foreground actions due to background noise, and excessive local merging introduces spatiotemporal irregularities.
- Key Insight: The paper addresses memory management at the semantic event level — decomposing video into event segments and constructing hierarchical event tree structures. A multi-dimensional penalty function guides adaptive merging to preserve semantic richness while controlling memory overhead. A fine-grained spatiotemporal window is further introduced to focus on detailed visual features at the current timestep.
## Method

### Overall Architecture
StreamForest processes streaming video frames at 1 FPS and comprises two core components: (1) the Fine-grained Spatiotemporal Window (FSTW) for high-resolution perception and short-term memory at the current timestep; and (2) the Persistent Event Memory Forest (PEMF) for hierarchical storage and adaptive compression of long-term history. The visual encoder is SigLIP-SO400M and the LLM is Qwen2-7B. Upon receiving a user query, all root node features from the PEMF and the full visual features from the FSTW are fed into the LLM.
### Key Designs
- Fine-grained Spatiotemporal Window (FSTW):
- Real-time perception: directly samples high-resolution visual features (729 tokens) from the current frame with encoded spatiotemporal positional information.
- Short-term spatiotemporal memory: maintains a frame buffer spanning \(t_s\) seconds (18 frames, 128 tokens per frame); incoming frames cause older frames to be compressed along the spatial dimension.
- Computes inter-frame similarity for subsequent event-level segmentation.
- Upon buffer overflow, a "meta-event" is segmented at the position of the local minimum inter-frame similarity and transferred to the PEMF.
- A meta-event is a collection of visual tokens from a group of similar consecutive frames, serving as an independent node in the PEMF.
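The buffer-overflow segmentation step above can be sketched as follows. This is a minimal illustration, assuming frame features are plain vectors and using a single cut at the global similarity minimum as a stand-in for the paper's local-minimum rule; all names are hypothetical:

```python
import math

def cosine(u, v):
    # Cosine similarity between two flattened frame-feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def segment_meta_event(buffer_feats):
    """On FSTW buffer overflow, cut at the minimum inter-frame
    similarity: frames before the cut form a meta-event handed
    off to the PEMF; the remaining frames stay in the buffer."""
    sims = [cosine(buffer_feats[i], buffer_feats[i + 1])
            for i in range(len(buffer_feats) - 1)]
    cut = min(range(len(sims)), key=sims.__getitem__) + 1
    return buffer_feats[:cut], buffer_feats[cut:]
```

The cut falls at the sharpest visual transition, so each meta-event groups visually similar consecutive frames, matching the definition above.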
- Persistent Event Memory Forest (PEMF):
- Unlike conventional frame-level compression, PEMF organizes memory hierarchically at the semantic event level.
- Events are managed in a tree structure: when the number of long-term memory tokens exceeds the upper limit \(L_q\), the adjacent node pair with the lowest penalty score is selected for merging.
- Merging employs Token Merging (ToMe), compressing the visual tokens of the selected node pair to half their combined count.
- A triple penalty function jointly guides merging decisions to ensure adaptivity.
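The budget-enforcement loop described above can be sketched as below. `EventNode`, `tome_like_merge`, and the toy penalty passed in the usage are illustrative stand-ins, not the paper's implementation (the real method uses ToMe for merging and the triple penalty function for pair selection):

```python
from dataclasses import dataclass

@dataclass
class EventNode:
    tokens: list          # visual tokens of the (merged) meta-event
    merge_count: int = 0  # how many merges produced this node

def enforce_budget(nodes, L_q, penalty_fn, merge_fn):
    """Keep total long-term-memory tokens within L_q by repeatedly
    merging the adjacent node pair with the lowest penalty score."""
    while sum(len(n.tokens) for n in nodes) > L_q and len(nodes) > 1:
        i = min(range(len(nodes) - 1),
                key=lambda k: penalty_fn(nodes[k], nodes[k + 1]))
        nodes[i:i + 2] = [merge_fn(nodes[i], nodes[i + 1])]
    return nodes

def tome_like_merge(a, b):
    # Crude stand-in for ToMe: keep half of the combined tokens.
    combined = a.tokens + b.tokens
    return EventNode(tokens=combined[: len(combined) // 2],
                     merge_count=max(a.merge_count, b.merge_count) + 1)
```

Because each merge halves the pair's combined token count, the loop monotonically shrinks memory until it fits the budget, while the penalty function decides *which* events pay the compression cost.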
- Triple Penalty Function Design:
- Similarity penalty \(P_s\): matches tokens between the two event nodes via bipartite graph matching, takes the mean \(\bar{s}\) of the top-\(k\) cosine similarity scores, and sets \(P_s = 1 - \bar{s}\). A low \(P_s\) encourages merging of highly similar, redundant events.
- Merge-count penalty \(P_m\): \(P_m = (c_i + c_{i+1}) / (2c_{max})\), where \(c_i\) is the merge count of node \(i\). Penalizes nodes that have already been merged repeatedly, preventing spatiotemporal inconsistency from accumulated information loss.
- Temporal distance penalty \(P_t\): \(P_t = 1 - (d_i + d_{i+1})/2\), where \(d_i\) is the normalized temporal distance of node \(i\) from the current timestep. Recent events receive a high penalty and are preserved in detail, while distant events are compressed more aggressively.
- Total penalty: \(P = w_s P_s + w_m P_m + w_t P_t\) (default weights: 0.4, 0.4, 0.2).
- Degeneration analysis: using only \(P_s\) degenerates to similarity-based compression; only \(P_m\) degenerates to uniform downsampling; only \(P_t\) degenerates to FIFO.
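A sketch of the penalty computation, with a simplified greedy matcher standing in for the paper's bipartite matching; function names and the top-\(k\) default are assumptions:

```python
import math

def cos(u, v):
    # Cosine similarity between two token vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)) + 1e-8)

def similarity_penalty(tokens_a, tokens_b, k=4):
    # Simplified stand-in for bipartite matching: best match in B
    # for each token in A, then the mean of the top-k scores.
    best = sorted((max(cos(ta, tb) for tb in tokens_b)
                   for ta in tokens_a), reverse=True)
    return 1 - sum(best[:k]) / min(k, len(best))

def total_penalty(p_s, c_i, c_j, c_max, d_i, d_j,
                  w_s=0.4, w_m=0.4, w_t=0.2):
    p_m = (c_i + c_j) / (2 * c_max)  # discourage repeated merging
    p_t = 1 - (d_i + d_j) / 2        # protect recent events (d in [0, 1])
    return w_s * p_s + w_m * p_m + w_t * p_t
```

Setting the weights to \((1, 0, 0)\), \((0, 1, 0)\), or \((0, 0, 1)\) recovers the three degenerate strategies noted in the degeneration analysis.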
- OnlineIT Training Dataset:
- OnlineIT-general (32K): integrates multiple streaming video understanding datasets to address hallucinations caused by spatiotemporal distribution shift.
- OnlineIT-drive (89K): streaming QA data for autonomous driving scenarios, covering real-time localization, static/dynamic traffic entity understanding, and risk assessment.
- ODV-Bench:
- A streaming video understanding benchmark for autonomous driving, comprising three task categories: static objects, dynamic objects, and multi-agent interaction events.
- Constructed via a semi-automatic pipeline: YOLO detection + VLLM annotation + human verification.
### Loss & Training
A five-stage training strategy is adopted: the first three stages follow the offline long-video MLLM training paradigm (VideoChat-Flash); the fourth stage fine-tunes on OnlineIT to yield the base StreamForest; an optional fifth stage fine-tunes on OnlineIT-Drive to yield StreamForest(FT-drive). Training uses 32 A100 GPUs.
## Key Experimental Results

### Main Results
Online video understanding benchmarks:
| Method | Scale | StreamingBench | OVBench | OVO-Bench |
|---|---|---|---|---|
| VideoChat-Online | 4B | - | 62.9 | - |
| Dispider | 7B | - | 52.7 | - |
| Flash-VStream | 7B | - | 40.2 | - |
| StreamForest | 7B | 77.3 | 62.3 | 55.6 |
ODV-Bench (autonomous driving):
| Method | Static Obj. Avg | Dynamic Obj. Avg | Event-level Avg | Overall |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 48.3 | 57.5 | 59.4 | 55.6 |
| StreamForest | 51.5 | 62.3 | 63.8 | 59.9 |
| StreamForest(FT-drive) | 62.6 | 64.0 | 67.5 | 65.0 |
| Human | 95.9 | 88.2 | 92.5 | 91.4 |
### Ablation Study
| Configuration | Performance Retention | Notes |
|---|---|---|
| Default 8192 tokens | 100% (baseline) | Full setting |
| 4096 tokens | ~99% | Mild compression, nearly lossless |
| 2048 tokens | ~98% | Moderate compression, well maintained |
| 1024 tokens | 96.8% | Extreme compression retains most performance |
| \(P_s\) only | Degraded | Similarity-based compression |
| \(P_m\) only | Degraded | Similar to uniform downsampling |
| \(P_t\) only | Degraded | Similar to FIFO |
### Key Findings
- StreamForest loses only 3.2% of performance under extreme compression (1024 tokens), demonstrating the effectiveness of event-level memory management.
- On offline video benchmarks, the method matches or surpasses state-of-the-art offline models, indicating that streaming processing does not sacrifice understanding quality.
- Among the three penalty terms, the merge-count penalty and similarity penalty carry the highest weights (0.4 each), suggesting that preventing excessive merging and eliminating redundancy are equally important.
- Fine-tuning for autonomous driving (FT-drive) substantially improves driving-scene performance (+5.1 overall accuracy), yet a gap of 26 percentage points with human performance remains.
## Highlights & Insights
- The event-level memory tree design is elegant and closely mirrors the human cognitive approach to video (episodic memory in event units).
- The triple penalty function elegantly balances content redundancy, information fidelity, and temporal importance, with tunable weights that recover multiple known strategies as special cases.
- The 96.8% retention rate under extreme compression is a highly compelling result that directly demonstrates the robustness of the method.
- ODV-Bench fills an important gap in evaluation for streaming video understanding in autonomous driving.
## Limitations & Future Work
- The fixed 1 FPS processing rate may not satisfy applications requiring higher frame rates (e.g., fast-motion scenes).
- Event boundary detection relies on local minima of inter-frame similarity, which may be insufficiently robust to gradual scene transitions.
- While ToMe merging is efficient, it continuously discards details; information from early events may degrade significantly over long operating periods.
- Only 7B-scale models are evaluated; the behavior of larger models remains unknown.
- The substantial gap between ODV-Bench performance and human-level performance (59.9 vs. 91.4) indicates that understanding of autonomous driving scenes requires considerable further improvement.
## Related Work & Insights
- vs. VideoChat-Online: StreamForest outperforms VideoChat-Online across online benchmarks overall (though the 4B VideoChat-Online remains marginally ahead on OVBench, 62.9 vs. 62.3), owing to event-level memory management (vs. VideoChat-Online's static hierarchical memory).
- vs. Flash-VStream: Flash-VStream employs a similarity-based compression strategy, which is equivalent to a degenerate version of PEMF using only the \(P_s\) factor.
- Implications for streaming AI agents: Event-level memory management can be directly applied to AI agents requiring long-term memory (e.g., robots, live-streaming assistants).
## Rating
- Novelty: ⭐⭐⭐⭐ The event memory forest constitutes a novel memory management paradigm, and the triple penalty function is cleverly designed.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across 4 online benchmarks and offline benchmarks; the extreme compression experiments are particularly impressive.
- Writing Quality: ⭐⭐⭐⭐ Structure is clear and method descriptions are thorough, though some mathematical notation could be made more consistent.
- Value: ⭐⭐⭐⭐⭐ Makes an important contribution to the streaming video understanding field; ODV-Bench and OnlineIT are valuable community resources.