InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding¶
Conference: NeurIPS 2025 arXiv: 2506.15745 Code: GitHub Area: Video Understanding / Model Efficiency Keywords: KV Cache Compression, Streaming Video Understanding, Multimodal Large Language Models, Temporal Redundancy, Edge Deployment
TL;DR¶
This paper proposes InfiniPot-V, the first training-free, query-agnostic framework for streaming video understanding. It performs online KV cache compression via two complementary metrics, Temporal-axis Redundancy (TaR) and Value-Norm (VaN), enabling video understanding of arbitrary length under a fixed memory budget.
Background & Motivation¶
Modern multimodal large language models (MLLMs) have demonstrated the capability to process hour-long videos, yet this brings a critical practical challenge: KV cache size grows linearly with the number of video frames, rapidly exceeding device memory limits.
This problem is particularly acute in the following scenarios:
Mobile/Edge Devices: Smartphones, AR glasses, and robots have fixed and limited GPU memory.
Streaming Video: Video length is unknown and continuously growing, making memory pre-allocation infeasible.
Multi-turn Dialogue: Users may query the model multiple times during video playback, requiring long-term cache maintenance.
Existing KV cache compression methods suffer from two fundamental limitations:
- Offline Assumption: Most methods assume the entire video and user query are available before processing, making them unsuitable for real-time streaming.
- Memory Still Scales with Length: Even "compression" methods require constructing the full KV cache before compressing, so peak memory remains proportional to video length.
InfiniPot-V's aim: perform compression online, during video encoding, enforcing a fixed memory upper bound independent of video length.
Method¶
Overall Architecture¶
InfiniPot-V adopts a block-wise processing strategy:
- Input video frames are grouped into fixed-size blocks and encoded sequentially.
- After encoding each block, the system checks whether the current KV cache size exceeds a user-defined threshold.
- If the threshold is exceeded, a lightweight compression procedure reduces the cache below the budget.
- The compression combines two complementary metrics: Temporal-axis Redundancy (TaR) and Value-Norm (VaN).
The entire process is training-free and query-agnostic, making it fully plug-and-play.
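The block-wise loop above can be sketched as follows. This is an illustrative outline, not the authors' code: `encode_block` and `compress` are hypothetical stand-ins for the model's prefill step and the TaR + VaN procedure described below.

```python
# Illustrative sketch of InfiniPot-V's block-wise processing loop
# (assumed interfaces, not the paper's implementation).
def stream_video(frames, block_size, budget, encode_block, compress):
    """Encode frames block by block, compressing the KV cache
    whenever it exceeds the fixed token budget."""
    kv_cache = []  # list of (key, value) token entries
    for start in range(0, len(frames), block_size):
        block = frames[start:start + block_size]
        kv_cache.extend(encode_block(block))       # append this block's KV entries
        if len(kv_cache) > budget:                 # budget check after each block
            kv_cache = compress(kv_cache, budget)  # lightweight TaR + VaN reduction
    return kv_cache
```

Because the check runs after every block, peak cache size is bounded by the budget plus one block, regardless of total video length.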
Key Designs¶
1. Temporal-axis Redundancy (TaR)
TaR identifies and removes tokens that are redundant along the temporal dimension:
- Computes the cosine similarity between key vectors at corresponding positions across adjacent frames.
- High similarity indicates that the position changes little over time (e.g., static background) and can be safely discarded.
- Formula: \(\text{TaR}(i) = \text{CosSim}(\mathbf{k}_i^{(t)}, \mathbf{k}_i^{(t-1)})\)
- Tokens with the highest TaR scores (most redundant) are removed.
Intuition: A large proportion of video tokens correspond to static backgrounds or slowly changing regions, which are highly redundant along the temporal axis.
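A minimal sketch of the TaR score from the formula above, using plain Python lists in place of key tensors (a real implementation would operate on the model's key cache per attention head):

```python
import math

def cosine_sim(a, b):
    # Standard cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def tar_scores(keys_prev, keys_curr):
    """TaR(i) = CosSim(k_i^(t), k_i^(t-1)).
    A high score means the token barely changed between frames,
    i.e., it is temporally redundant and a candidate for removal."""
    return [cosine_sim(kc, kp) for kc, kp in zip(keys_curr, keys_prev)]
```

A static-background position yields a score near 1.0, while a position whose content changed (e.g., a moving object) scores lower and is kept.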
2. Value-Norm (VaN)
VaN retains the semantically most important tokens:
- Computes the L2 norm of each token's Value vector.
- Larger value norms typically correspond to semantically salient content (moving objects, key actions, etc.).
- Tokens with the highest VaN scores are preserved.
Intuition: Tokens with larger Value norms contribute more to the attention aggregation output.
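The VaN ranking can be sketched in a few lines. Again, value vectors are plain lists here for illustration; the norm-and-select logic is the point:

```python
import math

def van_scores(values):
    # VaN(i) = ||v_i||_2, the L2 norm of each token's value vector.
    return [math.sqrt(sum(x * x for x in v)) for v in values]

def keep_top_van(values, keep):
    """Return the (position-sorted) indices of the `keep` tokens
    with the largest value norms, i.e., the most salient ones."""
    scores = van_scores(values)
    order = sorted(range(len(values)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:keep])
```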
3. Two-Stage Compression Pipeline
- Stage 1 (TaR Filtering): Remove the top-\(r_1\%\) tokens by temporal redundancy.
- Stage 2 (VaN Ranking): Among the remaining tokens, retain the top-\(r_2\%\) by VaN score.
- The two stages are complementary: TaR eliminates redundancy while VaN preserves importance, enabling precise compression.
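Putting the two stages together, the compression step might look like the sketch below. The interfaces and the exact rounding of \(r_1\), \(r_2\) are assumptions for illustration, not the paper's specification:

```python
import math

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def two_stage_compress(keys_prev, keys_curr, values_curr, r1, r2):
    """Illustrative two-stage pipeline (assumed details, not the paper's code).
    Stage 1: drop the top-r1 fraction of tokens by TaR (most temporally redundant).
    Stage 2: of the survivors, keep the top-r2 fraction by VaN (largest value norm).
    Returns indices of tokens retained in the cache."""
    n = len(keys_curr)
    # Stage 1 (TaR filtering): keep the tokens with the LOWEST temporal redundancy.
    tar = [_cos(kc, kp) for kc, kp in zip(keys_curr, keys_prev)]
    survivors = sorted(range(n), key=lambda i: tar[i])[: n - int(n * r1)]
    # Stage 2 (VaN ranking): among survivors, keep the largest value norms.
    van = {i: math.sqrt(sum(x * x for x in values_curr[i])) for i in survivors}
    keep = sorted(survivors, key=lambda i: van[i], reverse=True)
    keep = keep[: max(1, int(len(survivors) * r2))]
    return sorted(keep)
```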
Loss & Training¶
- Training-free: InfiniPot-V is entirely based on statistical metrics and introduces no learnable parameters.
- No fine-tuning of the underlying MLLM is required; the method is applied directly at inference time.
- Compatible with multiple open-source MLLMs (Qwen2-VL, Qwen2.5-VL, etc.).
Key Experimental Results¶
Main Results: Long Video Understanding Benchmarks¶
Performance of InfiniPot-V on Qwen2.5-VL-7B, compressing the KV cache from ~32K tokens to ~4K tokens:
| Benchmark | Full Cache | Uniform | SWA | InfiniPot-V | Compression Ratio |
|---|---|---|---|---|---|
| MLVU | 70.2 | 64.8 | 66.1 | 69.5 | 8× |
| Video-MME | 63.4 | 58.2 | 59.7 | 62.8 | 8× |
| LongVideoBench | 55.1 | 49.6 | 51.3 | 54.7 | 8× |
| EgoSchema | 67.8 | 61.5 | 63.2 | 67.1 | 8× |
At 8× compression, InfiniPot-V incurs only a 0.4–0.7 point drop relative to the full cache, significantly outperforming uniform sampling and sliding-window attention (SWA).
Cross-Model Generalization¶
Performance across different MLLMs (MLVU benchmark, 8× compression):
| Model | Full Cache | InfiniPot-V | Accuracy Retention |
|---|---|---|---|
| Qwen2-VL-7B | 65.3 | 64.1 | 98.2% |
| Qwen2.5-VL-7B | 70.2 | 69.5 | 99.0% |
| Qwen2-VL-72B | 78.1 | 77.3 | 99.0% |
| Qwen2.5-VL-72B | 80.5 | 79.8 | 99.1% |
Accuracy retention improves with model scale.
Ablation Study¶
Effectiveness of Compression Components:
| Configuration | MLVU | Video-MME |
|---|---|---|
| Full Cache (no compression) | 70.2 | 63.4 |
| TaR only | 67.2 | 60.8 |
| VaN only | 66.8 | 60.1 |
| TaR + VaN (InfiniPot-V) | 69.5 | 62.8 |
The two metrics are complementary: TaR excels at removing background redundancy, while VaN excels at preserving foreground-salient information.
Memory Reduction Efficiency:
| Configuration | Peak GPU Memory | Relative to Full Cache |
|---|---|---|
| Full Cache (768 frames) | ~48 GB | 100% |
| InfiniPot-V (4K budget) | ~3 GB | 6% (94% reduction) |
| InfiniPot-V (8K budget) | ~5 GB | 10% (90% reduction) |
Key Findings¶
- Up to 94% GPU memory reduction: From 48 GB to 3 GB, enabling hour-long video processing on consumer-grade GPUs.
- Near-lossless accuracy: Accuracy retention exceeds 98% across multiple benchmarks; in some cases, InfiniPot-V even surpasses full cache performance.
- Real-time generation speed: Inference throughput matches or exceeds that of the full cache, since fewer cached tokens reduce per-step attention computation.
- Multi-turn dialogue support: Under a fixed memory budget, the system continuously accepts new frames and queries, naturally supporting streaming multi-turn interaction.
Highlights & Insights¶
- First genuinely streaming solution: All prior "long video understanding" methods follow a "read-all-then-process" paradigm; InfiniPot-V truly achieves block-wise processing under a fixed memory constraint.
- Elegant training-free design: TaR and VaN are intuitively motivated, computationally lightweight, and require no additional training.
- Clever exploitation of video properties: TaR leverages the inherent temporal redundancy of video, a unique source of compressibility not present in text or images.
- Industry-deployment friendly: Plug-and-play, cross-model generalizable, and fixed memory budget — highly suitable for edge device deployment.
Limitations & Future Work¶
- TaR currently relies on adjacent-frame comparisons, which may inadvertently discard key tokens during rapid scene cuts.
- The semantic saliency assumption underlying VaN does not always hold; a large Value norm does not necessarily imply semantic importance.
- Fixed compression ratios (\(r_1\), \(r_2\)) are not adaptively adjusted based on video content complexity.
- Validation is limited to the Qwen model family; generalization to other architectures (e.g., LLaVA variants) remains to be confirmed.
- Robustness in extreme scenarios (e.g., hours of static surveillance footage followed by a sudden anomaly) has not been evaluated.
Related Work & Insights¶
- StreamingLLM: Achieves streaming inference by retaining attention sinks and recent tokens, but does not account for semantic importance.
- FastV / FreeVideoLLM: Token pruning methods, but not designed for streaming settings.
- KVzip: Query-aware KV cache compression, but requires knowledge of the query content.
- LiveVLM: A concurrent work on streaming video understanding using a retrieval-based approach.
- Insight: The TaR metric can be generalized to token compression in other temporal data modalities, such as audio and sensor streams.
Rating¶
- Novelty: ⭐⭐⭐⭐ (first streaming + fixed-memory solution)
- Technical Depth: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Practicality: ⭐⭐⭐⭐⭐ (directly deployable)
- Writing Quality: ⭐⭐⭐⭐