StreamingTOM: Streaming Token Compression for Efficient Video Understanding

Conference: CVPR2026
arXiv: 2510.18269
Code: yige24/StreamingTOM
Area: Video Understanding / Streaming Video QA / Token Compression
Keywords: streaming video understanding, token compression, kv-cache quantization, training-free, causal inference

TL;DR

This paper proposes StreamingTOM, a training-free, two-stage framework for streaming video understanding. Causal Temporal Reduction (CTR) compresses each frame's tokens from 196 to 50 via causal temporal selection before the LLM, while Online Quantized Memory (OQM) bounds kv-cache growth after the LLM through 4-bit quantization and on-demand retrieval. Together they achieve a 15.7× kv-cache compression ratio, 1.2× lower peak memory, and 2× faster time-to-first-token (TTFT) than the prior training-free state of the art.

Background & Motivation

  1. Dual constraints of streaming video: Unlike offline processing, streaming video VLMs face two fundamental constraints — causality (no access to future frames) and accumulation (unbounded token growth over time) — making token compression not merely an optimization but a necessity.
  2. Unbounded kv-cache growth: Using LLaVA-OV-7B as an example, a 1-hour video at 0.5 fps generates an 18.8 GB kv-cache, far exceeding typical GPU memory budgets and making real-time inference infeasible (a back-of-envelope check follows this list).
  3. Existing methods only manage post-LLM state: Current training-free streaming approaches (ReKV, LiveVLM, StreamMem) apply eviction or compression only to the kv-cache after the LLM, without reducing the \(O(tNLd^2)\) computational cost of pre-LLM prefill.
  4. Offline compression violates causality: Well-established offline token merging and pruning methods (ToMe, DyCoke, HoliTom) rely on global or bidirectional attention and future frame information, making them inapplicable to streaming scenarios.
  5. Training-based methods are costly: Training-based streaming approaches (Flash-VStream, Dispider) require expensive backbone-specific retraining and do not transfer easily across architectures.
  6. Gap in causal pre-LLM compression: To the authors' knowledge, no prior training-free streaming method performs strictly causal token reduction before the LLM, leaving a significant efficiency opportunity unexplored.
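
The 18.8 GB figure in item 2 checks out with simple arithmetic. Below is a minimal back-of-envelope script; the decoder configuration (Qwen2-7B: 28 layers, GQA with 4 kv heads of head dimension 128, FP16 entries) is my assumption about LLaVA-OV-7B's backbone, not stated in the summary above.

```python
# Back-of-envelope kv-cache size for 1 hour of video at 0.5 fps.
# Assumes LLaVA-OV-7B's Qwen2-7B decoder: 28 layers, GQA with
# 4 kv heads x head_dim 128, FP16 (2 bytes per cache entry).

frames = int(3600 * 0.5)                             # 1800 frames
tokens = frames * 196                                # 196 visual tokens per frame
layers, kv_dim, fp16_bytes = 28, 4 * 128, 2
bytes_per_token = 2 * layers * kv_dim * fp16_bytes   # K and V, every layer

print(f"{tokens * bytes_per_token / 2**30:.1f} GiB")  # -> 18.8 GiB
```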

Method

Overall Architecture: Two-Stage Pipeline

StreamingTOM composes the two stages as \( \mathrm{OQM}_{16 \to 4} \circ \mathrm{CTR}_{N \to G} \), using a group abstraction (frame-aligned groups of a fixed \(G = 50\) tokens per frame) as the interface between them (a minimal sketch of this group record follows the list below):

  • Visual pipeline: Visual encoder extracts features → CTR compression → written to online memory
  • Query pipeline: User question drives the decoder → OQM retrieves relevant groups → 4-bit dequantization → efficient generation
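
To make that interface concrete, here is a minimal sketch of one frame-aligned group record; the field names are my own, but the contents (4-bit keys/values with scale/offset, plus a representative key) follow the OQM description below:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Group:
    """One frame's G=50 compressed tokens: CTR's output, OQM's storage unit."""
    frame_idx: int        # which frame this group covers
    k_q: np.ndarray       # keys, quantized to 4-bit codes
    v_q: np.ndarray       # values, quantized to 4-bit codes
    scale: np.ndarray     # per-channel dequantization scale
    offset: np.ndarray    # per-channel dequantization offset
    rep_key: np.ndarray   # mean key, compared against the query at retrieval
```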

Stage 1: Causal Temporal Reduction (CTR)

CTR adheres to three design principles: strict causality (a 2-frame window), single-pass processing, and a fixed per-frame budget \(G\). It proceeds in the following steps (a runnable sketch follows the list):

  1. Temporal similarity computation: Cosine similarity \(s_t^{(i)}\) is computed between tokens at the same spatial position across adjacent frames \(t\) and \(t{-}1\), measuring cross-frame redundancy.
  2. Spatial saliency: Attention scores \(\alpha_t^{(i)}\) from the visual encoder are reused as a zero-cost byproduct; chunked attention is applied to avoid memory spikes.
  3. Static/dynamic classification: Tokens are partitioned into a static set \(\mathcal{S}_t\) (high similarity, redundant) and a dynamic set \(\mathcal{D}_t\) (low similarity, novel information) using threshold \(\tau_c = 0.9\).
  4. Adaptive budget allocation: The \(G\) slots are divided into \(k_s\) and \(k_d\) proportionally to the static/dynamic ratio, allocating more capacity to dynamic tokens when scene content changes rapidly.
  5. Dual-path processing:
    • Dynamic path: Top-\(k_d\) tokens selected by saliency (preserving key novel information)
    • Static path: Density-based clustering merges tokens into \(k_s\) representative tokens (removing redundancy)
  6. Complexity: \(O(N + G^2)\) per frame; state requires only the previous frame's features \(O(Nd)\), independent of stream length.
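
A minimal NumPy sketch of this per-frame selection, under the stated settings (\(G = 50\), \(\tau_c = 0.9\), saliency reused from the encoder). The function name ctr_select is hypothetical, and the static-path merge is simplified to uniform chunk averaging where the paper uses density-based clustering:

```python
import numpy as np

def ctr_select(feats, prev_feats, saliency, G=50, tau_c=0.9):
    """feats, prev_feats: (N, d) tokens of frames t and t-1; saliency: (N,)."""
    N, d = feats.shape
    # 1. temporal similarity at matching spatial positions
    sim = (feats * prev_feats).sum(-1) / (
        np.linalg.norm(feats, axis=-1) * np.linalg.norm(prev_feats, axis=-1) + 1e-8)
    # 2. static/dynamic partition by threshold
    static_idx, dynamic_idx = np.where(sim >= tau_c)[0], np.where(sim < tau_c)[0]
    # 3. budget split proportional to the dynamic fraction
    k_d = min(len(dynamic_idx), round(G * len(dynamic_idx) / N))
    k_s = G - k_d
    # 4a. dynamic path: keep the top-k_d most salient novel tokens
    keep = feats[dynamic_idx[np.argsort(saliency[dynamic_idx])[::-1][:k_d]]]
    # 4b. static path: merge redundant tokens into k_s representatives
    #     (uniform chunk means stand in for density-based clustering)
    if k_s > 0 and len(static_idx):
        chunks = np.array_split(static_idx, min(k_s, len(static_idx)))
        merged = np.stack([feats[c].mean(0) for c in chunks])
    else:
        merged = np.empty((0, d), feats.dtype)
    return np.concatenate([keep, merged])   # ~(G, d) tokens for frame t

# Example: a near-static frame with 20 novel tokens out of 196.
rng = np.random.default_rng(0)
prev = rng.normal(size=(196, 32))
cur = prev + 0.01 * rng.normal(size=(196, 32))
cur[:20] = rng.normal(size=(20, 32))
print(ctr_select(cur, prev, rng.random(196)).shape)   # -> (50, 32)
```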

Stage 2: Online Quantized Memory (OQM)

OQM addresses the residual linear growth of the kv-cache after CTR (a minimal sketch follows the list below):

  1. Incremental group quantization: Each group is independently quantized to 4-bit (per-head, per-channel scale/offset), with a representative key \(\bar{\mathbf{k}}_t\) stored alongside.
  2. Retrieve-then-dequantize paradigm: At query time, cosine similarity is computed between the decoder state and all group representative keys; the top-\(k\) most relevant groups are selected and dequantized from 4-bit to FP16.
  3. Bounded active memory: Total storage is \(O(T \cdot G \cdot d / 4)\) retaining the full history, while active kv is \(O(k \cdot G \cdot d)\) (\(k \ll T\)), keeping decoding latency independent of stream length.
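
A minimal sketch of this retrieve-then-dequantize path. I simplify the paper's per-head, per-channel quantization to plain per-channel affine 4-bit codes (left unpacked for clarity), and the class and function names are hypothetical:

```python
import numpy as np

def quantize4(x):
    """Per-channel affine quantization to 4-bit codes (kept unpacked here)."""
    lo = x.min(0)
    scale = (x.max(0) - lo) / 15 + 1e-8
    return np.clip(np.round((x - lo) / scale), 0, 15).astype(np.uint8), scale, lo

def dequantize4(q, scale, offset):
    return q.astype(np.float32) * scale + offset

class OnlineQuantizedMemory:
    def __init__(self):
        self.groups = []                           # one record per frame group

    def write(self, keys, values):                 # keys/values: (G, d) from CTR
        self.groups.append((quantize4(keys), quantize4(values), keys.mean(0)))

    def retrieve(self, query, top_k=8):
        """Dequantize only the top-k groups most similar to the decoder state."""
        reps = np.stack([rep for _, _, rep in self.groups])
        sims = reps @ query / (np.linalg.norm(reps, axis=1)
                               * np.linalg.norm(query) + 1e-8)
        picked = np.argsort(sims)[::-1][:top_k]
        ks = np.concatenate([dequantize4(*self.groups[i][0]) for i in picked])
        vs = np.concatenate([dequantize4(*self.groups[i][1]) for i in picked])
        return ks, vs                              # active kv stays O(k*G*d), k << T

# Example: stream 100 frame groups of 50 tokens, then serve one query.
rng = np.random.default_rng(0)
mem = OnlineQuantizedMemory()
for _ in range(100):
    mem.write(rng.normal(size=(50, 64)), rng.normal(size=(50, 64)))
k, v = mem.retrieve(rng.normal(size=64))
print(k.shape, v.shape)                            # -> (400, 64) (400, 64)
```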

Compression Ratio

Combining CTR (token reduction \(N \to G\)) with OQM (16-bit \(\to\) 4-bit storage, a 4× factor) gives an overall kv-cache compression ratio of \(4N/G = 4 \times 196/50 \approx 15.7\times\).

Key Experimental Results

Offline Long-Video Benchmarks (LLaVA-OV-7B backbone)

| Method | VideoMME Overall | MLVU | EgoSchema | Avg |
| --- | --- | --- | --- | --- |
| LLaVA-OV-7B (offline baseline) | 58.4 | 64.7 | 60.1 | 61.0 |
| +LiveVLM (training-free SOTA) | 57.3 | 66.3 | 59.0 | 60.9 |
| +StreamMem | 59.4 | 66.9 | 63.0 | 63.1 |
| +StreamingTOM (ours) | 59.9 | 67.9 | 63.7 | 63.8 |

Online Streaming Evaluation (RVS benchmark, 28 GB memory limit)

| Method | RVS-Ego (Acc / Score) | RVS-Movie (Acc / Score) | Avg (Acc / Score) |
| --- | --- | --- | --- |
| Flash-VStream (training-based) | 57.0 / 4.0 | 53.1 / 3.3 | 55.0 / 3.6 |
| StreamMem | 57.6 / 3.8 | 52.7 / 3.4 | 55.2 / 3.6 |
| StreamingTOM | 58.3 / 3.9 | 53.2 / 3.5 | 55.8 / 3.7 |

Efficiency Metrics

  • kv-cache compression ratio: 15.7×
  • Peak memory: 1.2× reduction compared to LiveVLM
  • TTFT: 2× speedup compared to LiveVLM
  • 1-hour video kv-cache: 18.8 GB → 1.2 GB
  • Memory growth: only 16.0 GB → 16.7 GB from 16 to 512 frames (sub-linear)
  • Throughput: stable at ~20 tokens/s on long sequences

Ablation Study

| Tokens per Frame | Quantization | kv Size (% of original) | VideoMME Overall |
| --- | --- | --- | --- |
| 40 | 4-bit | 5.1% | 58.9 |
| 50 | 4-bit | 6.4% | 59.9 |
| 60 | 4-bit | 7.7% | 59.3 |
| 50 | 2-bit | 3.2% | 58.5 |
  • 50 tokens per frame is the best trade-off: 40 drops critical details, while 60 reduces temporal coverage under a fixed memory budget
  • 4-bit quantization outperforms 2-bit, achieving the best accuracy–compression balance

Highlights & Insights

  1. First causal pre-LLM token compression: Fills the gap in training-free streaming methods by reducing prefill complexity from \(O(tNLd^2)\) to \(O(tGLd^2)\).
  2. Elegant group abstraction: Fixed-size, frame-aligned groups serve both as CTR output units and OQM storage/retrieval units, ensuring temporal consistency and predictable latency.
  3. Fully plug-and-play: Requires no training and can be directly applied to different backbones such as LLaVA-OV.
  4. Deployment-friendly: Runs on a single A6000 GPU, is batch-agnostic, and exhibits sub-linear memory growth.
  5. Complementary two-stage design: CTR reduces computation while OQM reduces memory; both stages are necessary, and their combination far surpasses either alone.

Limitations & Future Work

  1. Fixed \(G\) may be suboptimal: A uniform budget of 50 tokens per frame lacks flexibility for frames with heterogeneous information density (e.g., keyframes vs. static frames).
  2. Single backbone evaluated: Experiments are primarily conducted on LLaVA-OV-7B; generalization to larger models (e.g., 72B) or other architectures remains unverified.
  3. 2-frame window limitation: CTR's causal window considers only adjacent frame pairs, potentially accumulating errors in slowly evolving scenes.
  4. Retrieval quality of representative keys: OQM uses mean keys for retrieval, which may be insufficiently precise for fine-grained temporal reasoning.
  5. Multimodal audio streams not evaluated: Only the visual stream is considered; practical streaming applications typically involve concurrent audio streams.

Method Comparison

| Dimension | StreamingTOM | LiveVLM / StreamMem | DyCoke / HoliTom | Flash-VStream |
| --- | --- | --- | --- | --- |
| Pre-LLM compression | ✅ CTR | ❌ | ✅ (non-causal) | ✅ (requires training) |
| Post-LLM management | ✅ OQM 4-bit | ✅ kv-cache eviction | N/A (offline) | ✅ (requires training) |
| Causal constraint | ✅ strict | ✅ | ❌ requires future frames | ✅ |
| Training required | No | No | No | Yes (retraining) |
| Compression ratio | 15.7× | ~4× | ~4× | N/A |

Rating

  • Novelty: ⭐⭐⭐⭐ — First to introduce causal pre-LLM token compression in a training-free streaming framework; group abstraction elegantly unifies the two stages
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Covers both offline and online benchmarks with detailed efficiency analysis and complete ablations
  • Writing Quality: ⭐⭐⭐⭐⭐ — Problem formulation is clear, mathematical derivations are rigorous, and pipeline diagrams are intuitive
  • Value: ⭐⭐⭐⭐ — Addresses a real deployment bottleneck in streaming video VLMs; plug-and-play design ensures strong practical utility