
MambaMia: State-Space Hierarchical Compression for Hour-Long Video Understanding in Large Multimodal Models

Conference: AAAI 2026
arXiv: 2506.13564
Code: https://github.com/naver-ai/mambamia
Area: Large Multimodal Models / Video Understanding
Keywords: Long video compression, state space models, Mamba, gated patch aggregation, adaptive frame sampling

TL;DR

MambaMia proposes a two-stage hierarchical video token compression framework based on bidirectional Mamba: Gated Patch Aggregation (GPA) for spatial-temporal local compression, and a Temporal Axis Aggregator (TAA) that leverages Mamba's adaptive step size \(\Delta_t\) for data-driven keyframe sampling. The method compresses hour-long videos to only 4.7K tokens, achieving 44.6 on LVBench and surpassing Qwen2-VL and mPLUG-Owl3.

Background & Motivation

Background: Large multimodal models (LMMs) excel at image and short video understanding, but processing hour-long videos poses a severe token explosion problem — hundreds of frames can generate hundreds of thousands of tokens, far exceeding the capacity of standard models and hardware.

Limitations of Prior Work: (1) Per-frame spatial pooling and token pruning address only single-frame redundancy, failing to resolve inter-frame temporal accumulation; (2) query-based selection methods are task-specific and sacrifice general-purpose context modeling; (3) brute-force context window scaling demands enormous computational resources, making it impractical for academic or production settings.

Key Challenge: Long videos contain two types of redundancy — intra-frame spatial redundancy (many similar patches) and inter-frame temporal redundancy (highly similar content across consecutive frames) — while simultaneously containing fine-grained key events that must be preserved. A general solution is needed that achieves aggressive compression without losing critical information.

Goal: Efficiently compress visual tokens from hour-long videos on standard hardware while maintaining understanding performance.

Key Insight: Exploit the linear complexity of state space models (Mamba) for processing ultra-long sequences, and repurpose Mamba's internal adaptive step size \(\Delta_t\) as a frame importance signal for adaptive sampling.

Core Idea: Apply bidirectional Mamba with gated aggregation for spatial compression, then reuse Mamba's step size for adaptive temporal frame selection, achieving hierarchical long-video compression.

Method

Overall Architecture

A two-stage compression pipeline: 384-frame video input → visual encoder extracts 576 patch tokens per frame (≈221K tokens total) → Stage 1: Spatiotemporal Compression (GPA) compresses each frame to 24 anchor tokens (≈9.2K total) → Stage 2: Temporal Axis Aggregator (TAA) further compresses via delta sampling to ≈4.7K tokens → fed into LLM.
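The token budget at each stage follows directly from these counts; a quick back-of-the-envelope check in plain Python (constants taken from the pipeline above):

```python
# Token budget through the MambaMia pipeline (constants from the paper's setup).
frames = 384              # frames sampled at inference
patches_per_frame = 576   # visual encoder output per frame
anchors_per_frame = 24    # GPA anchor tokens per frame
keep_ratio = 0.5          # delta sampling retains ~50% of frames by default

raw_tokens = frames * patches_per_frame                   # 221,184 ≈ 221K
after_gpa = frames * anchors_per_frame                    # 9,216   ≈ 9.2K
after_taa = int(frames * keep_ratio) * anchors_per_frame  # 4,608   ≈ 4.7K
print(f"raw {raw_tokens:,} -> GPA {after_gpa:,} -> TAA {after_taa:,}")
```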

Key Designs

  1. Gated Patch Aggregation (GPA):

    • Function: After bidirectional Mamba processing, learnable query anchors aggregate surrounding patch information within the sequence.
    • Mechanism: Each row of 24 patches corresponds to one query anchor. GPA applies query-conditioned weighted pooling over neighboring patches: \(\boldsymbol{\alpha} = \text{softmax}(\mathbf{W}_\alpha \mathbf{q} + \mathbf{b}_\alpha)\), \(\mathbf{a} = \sum_i \alpha_i \mathbf{x}_i\). A gating mechanism then adaptively blends the results: \(\mathbf{f} = (1-g)\mathbf{q} + g \cdot \mathbf{a}\), where \(g = \sigma(\mathbf{W}_g \mathbf{q} + b_g)\).
    • Design Motivation: When \(g \approx 0\), the anchor retains its own information (state-space context); when \(g \approx 1\), it absorbs local patch information. This is more flexible than the 3D average pooling used in BIMBA; ablations show GPA yields approximately 7 points average improvement (a minimal sketch of the gating follows this list).
  2. Temporal Axis Aggregator (TAA) + Delta Sampling:

    • Function: Models inter-frame dependencies along the temporal axis and performs data-driven keyframe selection using Mamba's adaptive step size \(\Delta_t\).
    • Mechanism: A unidirectional Mamba processes the frame-level anchor sequence, with its internal \(\Delta_t = \text{softplus}(\mathbf{W}_\Delta \mathbf{f}_t + \mathbf{b}_\Delta)\) learned end-to-end. \(\Delta_t\) is interpreted as a frame importance score — frames with larger \(\Delta_t\) are considered more informative by the model. A cumulative delta sampling algorithm accumulates \(\Delta_t\) and selects a frame when the cumulative value exceeds a threshold \(\delta_{\text{thresh}}\), then resets the accumulator. By default, approximately 50% of frames are retained (384→192).
    • Design Motivation: The SSM's internal \(\Delta_t\) is repurposed directly rather than training a separate selector — this step size inherently reflects the information content of the input (large \(\Delta_t\) = larger state update = more new information). Visualizations show \(\Delta_t\) peaks align with scene transitions and key events. A sketch of the sampling loop follows the training details below.
  3. Bidirectional Mamba Spatiotemporal Compressor:

    • Function: Processes the entire spatiotemporal token sequence prior to GPA, enabling spatial and temporal information sharing.
    • Mechanism: Three layers of bidirectional Mamba2 blocks process sequences of ≈230K tokens (the ≈221K patch tokens plus the interleaved query anchors). The bidirectional design allows each token to attend to both past and future context.
    • Design Motivation: Replacing bidirectional Mamba with unidirectional Mamba causes a drop of approximately 1.7 points, confirming that bidirectional modeling is important for spatiotemporal feature sharing.
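To make the GPA gating concrete, below is a minimal PyTorch-style sketch of query-conditioned pooling with the blending gate, matching the formulas in item 1. The module name, weight names, and the assumption that each anchor owns a fixed, non-overlapping group of 24 neighboring patches are ours for illustration; the paper's actual implementation may differ.

```python
import torch
import torch.nn as nn

class GatedPatchAggregation(nn.Module):
    """Sketch of GPA: query-conditioned weighted pooling plus a gated blend.

    Inputs are assumed to be already contextualized by the bidirectional
    Mamba layers; each anchor aggregates its own group of `group` patches.
    """

    def __init__(self, dim: int, group: int = 24):
        super().__init__()
        self.w_alpha = nn.Linear(dim, group)  # alpha = softmax(W_a q + b_a)
        self.w_gate = nn.Linear(dim, 1)       # g = sigmoid(W_g q + b_g)

    def forward(self, queries: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # queries: (B, N_anchor, D); patches: (B, N_anchor, group, D)
        alpha = torch.softmax(self.w_alpha(queries), dim=-1)  # (B, N, group)
        agg = torch.einsum("bng,bngd->bnd", alpha, patches)   # weighted pooling
        g = torch.sigmoid(self.w_gate(queries))               # (B, N, 1)
        return (1 - g) * queries + g * agg                    # gated blend f
```

Note how the gate recovers the two regimes from the design motivation: \(g \to 0\) keeps the anchor's state-space context, while \(g \to 1\) substitutes the locally pooled patch information.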

Loss & Training

Three-stage LLaVA-style training: image understanding → module alignment (compression layers only) → video instruction fine-tuning (LLM unfrozen). Training uses 128 frames; inference uses 384 frames. Delta sampling is applied at inference time only. The compression module contains approximately 247M parameters.
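Since delta sampling is an inference-time procedure, it can be stated as a standalone loop. The sketch below is our reading of the accumulate-and-reset description in the method section; the exact reset and tie-breaking details, and the heuristic threshold in the usage example, are assumptions rather than the paper's specification.

```python
import torch
import torch.nn.functional as F

def delta_sample(delta: torch.Tensor, thresh: float) -> list[int]:
    """Cumulative delta sampling: accumulate per-frame step sizes and keep a
    frame whenever the running sum crosses `thresh`, then reset.

    delta: (T,) softplus step sizes from the temporal-axis Mamba.
    Returns indices of the retained frames.
    """
    kept, acc = [], 0.0
    for t, d in enumerate(delta.tolist()):
        acc += d
        if acc >= thresh:
            kept.append(t)
            acc = 0.0  # reset the accumulator after selecting a frame
    return kept

# Hypothetical usage: pick thresh so roughly half of 384 frames survive.
deltas = F.softplus(torch.randn(384))  # stand-in for learned Δ_t values
thresh = 2.0 * deltas.mean().item()    # heuristic choice, not from the paper
kept = delta_sample(deltas, thresh)    # ≈192 frames retained
```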

Key Experimental Results

Main Results — Long Video Benchmarks

| Model       | LLM      | Max Tokens | LVBench | MLVU | VideoMME | VNBench |
|-------------|----------|------------|---------|------|----------|---------|
| Qwen2-VL    | Qwen2-7B | -          | 42.0    | 64.2 | 55.6     | 33.9    |
| LLaVA-Video | Qwen2-7B | 12.5K      | 43.8    | 70.8 | 63.3     | 37.0    |
| mPLUG-Owl3  | Qwen2-7B | -          | 43.5    | -    | 53.5     | -       |
| MambaMia    | Qwen2-7B | 4.7K       | 44.6    | 68.0 | 58.3     | 41.5    |

Ablation Study — Compression Module Design

| Configuration   | GPA | TAA | LVBench | MLVU | VideoMME | Avg  |
|-----------------|-----|-----|---------|------|----------|------|
| BIMBA (3D pool) | ✗   | ✗   | 35.3    | 53.8 | 47.3     | 45.4 |
| +GPA            | ✓   | ✗   | 41.1    | 62.4 | 53.2     | 52.2 |
| +GPA+TAA (Full) | ✓   | ✓   | 41.1    | 64.0 | 55.7     | 53.6 |

Key Findings

  • Only 4.7K tokens are needed to achieve performance comparable to LLaVA-Video using 12.5K tokens, representing approximately 2.6× improvement in token efficiency.
  • Replacing 3D average pooling with GPA yields approximately 7 points average improvement — learnable gated aggregation substantially outperforms fixed pooling.
  • Delta sampling outperforms uniform sampling by 1.2 points on LVBench (44.6 vs. 43.4), a statistically significant difference (\(p = 0.047\)).
  • Even when using Mamba as the LLM backbone, dedicated compression is necessary — a vanilla Mamba LLM performs substantially below the compressed variant.
  • Performance saturates at 384 frames; additional frames provide no further gain.
  • Strong performance on VNBench (needle-in-a-video-haystack, 41.5) demonstrates that compression does not discard critical fine-grained information.

Highlights & Insights

  • Repurposing Mamba \(\Delta_t\) as a frame importance signal: The most elegant design choice — the SSM step size inherently encodes input information content, eliminating the need for a separate importance predictor. This idea generalizes to any sequence processing scenario utilizing SSMs.
  • Modular pre-LLM compression: Unlike VAMBA, which compresses inside the LLM, MambaMia performs compression independently before the LLM, maintaining modularity and lightweight design.
  • Rigorous experimental methodology: The paper emphasizes from-scratch training, controlled variable comparisons, multi-seed statistical validation, and significance testing — the experimental design is exemplary.

Limitations & Future Work

  • Performance saturation at 384 frames suggests an information bottleneck in the compression layer; exploring more frames with improved compression strategies is worthwhile.
  • The pooling weights in GPA are computed from the query token alone (for efficiency), without content-aware attention over the patches, potentially losing inter-patch relationships.
  • \(\delta_{\text{thresh}}\) is manually set; an adaptive threshold would be preferable.
  • Evaluation is limited to 7B-scale models; effectiveness on larger LLMs remains to be verified.
  • The train-test mismatch between uniform sampling (training) and delta sampling (inference) may constrain performance.

Comparison with Related Work

  • vs. BIMBA: Both share a Mamba + periodic query architecture, but BIMBA uses 3D average pooling while MambaMia employs learnable gated aggregation and delta sampling; ablations show MambaMia consistently outperforms BIMBA.
  • vs. LLaVA-Video: LLaVA-Video uses 12.5K tokens; MambaMia achieves comparable performance with 4.7K tokens, demonstrating a clear efficiency advantage.
  • vs. Video-XL: Video-XL aggregates inside the LLM with CLIP-based frame selection, whereas MambaMia compresses independently outside the LLM with learned frame selection, yielding a more modular architecture.
  • Insight: The inter-frame temporal fusion approach in TTF-VLA and MambaMia's temporal compression are complementary — TTF-style enhancement followed by MambaMia-style compression is a promising direction.

Rating

  • Novelty: ⭐⭐⭐⭐ Repurposing \(\Delta_t\) for frame sampling is clever; GPA design is also novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Seven benchmarks, five compression comparisons, multi-seed statistical testing, and cost analysis — exceptionally comprehensive.
  • Writing Quality: ⭐⭐⭐⭐⭐ Method descriptions are precise; a 13-section appendix covers all implementation details with strong reproducibility.
  • Value: ⭐⭐⭐⭐⭐ Processing hour-long videos with only 4.7K tokens offers substantial practical value to the community.