MambaMia: State-Space Hierarchical Compression for Hour-Long Video Understanding in Large Multimodal Models¶
Conference: AAAI 2026 arXiv: 2506.13564 Code: https://github.com/naver-ai/mambamia Area: Large Multimodal Models / Video Understanding Keywords: Long video compression, state space models, Mamba, gated patch aggregation, adaptive frame sampling
TL;DR¶
MambaMia proposes a two-stage hierarchical video token compression framework based on bidirectional Mamba: Gated Patch Aggregation (GPA) for spatial-temporal local compression, and a Temporal Axis Aggregator (TAA) that leverages Mamba's adaptive step size \(\Delta_t\) for data-driven keyframe sampling. The method compresses hour-long videos to only 4.7K tokens, achieving 44.6 on LVBench and surpassing Qwen2-VL and mPLUG-Owl3.
Background & Motivation¶
Background: Large multimodal models (LMMs) excel at image and short video understanding, but processing hour-long videos poses a severe token explosion problem — hundreds of frames can generate hundreds of thousands of tokens, far exceeding the capacity of standard models and hardware.
Limitations of Prior Work: (1) Per-frame spatial pooling and token pruning address only single-frame redundancy, failing to resolve inter-frame temporal accumulation; (2) query-based selection methods are task-specific and sacrifice general-purpose context modeling; (3) brute-force context window scaling demands enormous computational resources, making it impractical for academic or production settings.
Key Challenge: Long videos contain two types of redundancy — intra-frame spatial redundancy (many similar patches) and inter-frame temporal redundancy (highly similar content across consecutive frames) — while simultaneously containing fine-grained key events that must be preserved. A general solution is needed that achieves aggressive compression without losing critical information.
Goal: Efficiently compress the visual tokens of hour-long videos on standard hardware while maintaining understanding performance.
Key Insight: Exploit the linear complexity of state space models (Mamba) for processing ultra-long sequences, and repurpose Mamba's internal adaptive step size \(\Delta_t\) as a frame importance signal for adaptive sampling.
Core Idea: Apply bidirectional Mamba with gated aggregation for spatial compression, then reuse Mamba's step size for adaptive temporal frame selection, achieving hierarchical long-video compression.
Method¶
Overall Architecture¶
A two-stage compression pipeline: 384-frame video input → visual encoder extracts 576 patch tokens per frame (≈221K tokens total) → Stage 1: Spatiotemporal Compression (GPA) compresses each frame to 24 anchor tokens (≈9.2K total) → Stage 2: Temporal Axis Aggregator (TAA) further compresses via delta sampling to ≈4.7K tokens → fed into LLM.
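The stage-by-stage token budget can be checked with back-of-the-envelope arithmetic (the constants below come from the pipeline description above; the ≈50% retention rate is the paper's default for delta sampling):

```python
# Token budget of the two-stage compression pipeline.
FRAMES = 384               # frames sampled at inference
PATCHES_PER_FRAME = 576    # 24 x 24 patch grid from the visual encoder
ANCHORS_PER_FRAME = 24     # one query anchor per row of 24 patches
KEEP_RATIO = 0.5           # delta sampling keeps ~50% of frames (384 -> 192)

raw_tokens = FRAMES * PATCHES_PER_FRAME                    # 221,184 (~221K)
after_gpa = FRAMES * ANCHORS_PER_FRAME                     # 9,216   (~9.2K)
after_taa = int(FRAMES * KEEP_RATIO) * ANCHORS_PER_FRAME   # 4,608   (~4.7K)

print(raw_tokens, after_gpa, after_taa)
```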
Key Designs¶
- Gated Patch Aggregation (GPA):
    - Function: After bidirectional Mamba processing, learnable query anchors aggregate surrounding patch information within the sequence.
    - Mechanism: Each row of 24 patches corresponds to one query anchor. GPA applies query-conditioned weighted pooling over neighboring patches: \(\boldsymbol{\alpha} = \text{softmax}(\mathbf{W}_\alpha \mathbf{q} + \mathbf{b}_\alpha)\), \(\mathbf{a} = \sum_i \alpha_i \mathbf{x}_i\). A gating mechanism then adaptively blends the results: \(\mathbf{f} = (1-g)\mathbf{q} + g \cdot \mathbf{a}\), where \(g = \sigma(\mathbf{W}_g \mathbf{q} + b_g)\).
    - Design Motivation: When \(g \approx 0\), the anchor retains its own information (state-space context); when \(g \approx 1\), it absorbs local patch information. This is more flexible than the 3D average pooling used in BIMBA; ablations show GPA yields an average improvement of approximately 7 points.
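A minimal NumPy sketch of one GPA step, following the formulas above directly (the weight shapes and the single-anchor framing are my assumptions for illustration, not the released implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_patch_aggregation(q, X, W_alpha, b_alpha, W_g, b_g):
    """One GPA step for a single query anchor q of shape (d,) and its
    local window of n patches X of shape (n, d)."""
    # Query-conditioned pooling weights over the n neighboring patches
    alpha = softmax(W_alpha @ q + b_alpha)     # (n,)
    a = alpha @ X                              # (d,) weighted patch aggregate
    # Scalar gate blends the anchor's own state with the aggregate
    g = sigmoid(W_g @ q + b_g)                 # in (0, 1)
    return (1.0 - g) * q + g * a

rng = np.random.default_rng(0)
d, n = 8, 24                                   # feature dim, patches per row
q = rng.normal(size=d)
X = rng.normal(size=(n, d))
f = gated_patch_aggregation(q, X,
                            rng.normal(size=(n, d)), np.zeros(n),
                            rng.normal(size=d), 0.0)
print(f.shape)  # (8,)
```

With \(g\) near 0 the anchor keeps its state-space context; with \(g\) near 1 it is replaced by the pooled patch summary, matching the design motivation above.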
- Temporal Axis Aggregator (TAA) + Delta Sampling:
    - Function: Models inter-frame dependencies along the temporal axis and performs data-driven keyframe selection using Mamba's adaptive step size \(\Delta_t\).
    - Mechanism: A unidirectional Mamba processes the frame-level anchor sequence, with its internal \(\Delta_t = \text{softplus}(\mathbf{W}_\Delta \mathbf{f}_t + \mathbf{b}_\Delta)\) learned end-to-end. \(\Delta_t\) is interpreted as a frame importance score — frames with larger \(\Delta_t\) are considered more informative by the model. A cumulative delta sampling algorithm accumulates \(\Delta_t\) and selects a frame when the cumulative value exceeds a threshold \(\delta_{\text{thresh}}\), then resets the accumulator. By default, approximately 50% of frames are retained (384→192).
    - Design Motivation: The SSM's internal \(\Delta_t\) is repurposed directly rather than training a separate selector — this step size inherently reflects the information content of the input (a large \(\Delta_t\) means a larger state update, i.e., more new information). Visualizations show that \(\Delta_t\) peaks align with scene transitions and key events.
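The cumulative delta sampling rule can be sketched in a few lines (variable names are mine; edge-case handling in the paper's implementation may differ):

```python
def delta_sample(deltas, thresh):
    """Cumulative delta sampling: accumulate the per-frame step sizes
    delta_t and keep a frame whenever the running sum crosses the
    threshold, then reset the accumulator."""
    keep, acc = [], 0.0
    for t, d in enumerate(deltas):
        acc += d
        if acc >= thresh:
            keep.append(t)
            acc = 0.0
    return keep

# Uniform delta_t keeps every other frame (~50% retention)
print(delta_sample([1.0] * 8, 2.0))                   # [1, 3, 5, 7]
# A spike in delta_t (e.g. a scene cut) is selected immediately
print(delta_sample([0.5, 0.5, 3.0, 0.5, 0.5], 2.0))   # [2]
```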
- Bidirectional Mamba Spatiotemporal Compressor:
    - Function: Processes the entire spatiotemporal token sequence prior to GPA, enabling spatial and temporal information sharing.
    - Mechanism: Three layers of bidirectional Mamba2 blocks process sequences of ≈230K tokens (≈221K patch tokens plus the interleaved query anchors). The bidirectional design allows each token to attend to both past and future context.
    - Design Motivation: Replacing bidirectional Mamba with unidirectional Mamba causes a drop of approximately 1.7 points, confirming that bidirectional modeling is important for spatiotemporal feature sharing.
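The directionality argument can be demonstrated with a toy linear-time scan (an exponential moving average standing in for a Mamba block; this illustrates the bidirectional construction only, not the Mamba2 recurrence):

```python
import numpy as np

def causal_scan(x, decay=0.9):
    """Stand-in for a unidirectional SSM block: an exponential moving
    average, computed left-to-right in linear time."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(len(x)):
        h = decay * h + (1.0 - decay) * x[t]
        out[t] = h
    return out

def bidirectional_scan(x):
    """Bidirectional variant: run the causal scan forward and over the
    reversed sequence, then sum, so every position sees both directions."""
    fwd = causal_scan(x)
    bwd = causal_scan(x[::-1])[::-1]
    return fwd + bwd

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))          # (time, features)
x_future_changed = x.copy()
x_future_changed[-1] += 1.0          # perturb only the last frame

# The causal scan at t=0 is blind to the future; the bidirectional one is not.
print(np.allclose(causal_scan(x)[0], causal_scan(x_future_changed)[0]))                # True
print(np.allclose(bidirectional_scan(x)[0], bidirectional_scan(x_future_changed)[0]))  # False
```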
Loss & Training¶
Three-stage LLaVA-style training: image understanding → module alignment (compression layers only) → video instruction fine-tuning (LLM unfrozen). Training uses 128 frames; inference uses 384 frames. Delta sampling is applied at inference time only. The compression module contains approximately 247M parameters.
Key Experimental Results¶
Main Results — Long Video Benchmarks¶
| Model | LLM | Max Tokens | LVBench | MLVU | VideoMME | VNBench |
|---|---|---|---|---|---|---|
| Qwen2-VL | Qwen2-7B | - | 42.0 | 64.2 | 55.6 | 33.9 |
| LLaVA-Video | Qwen2-7B | 12.5K | 43.8 | 70.8 | 63.3 | 37.0 |
| mPLUG-Owl3 | Qwen2-7B | - | 43.5 | - | 53.5 | - |
| MambaMia | Qwen2-7B | 4.7K | 44.6 | 68.0 | 58.3 | 41.5 |
Ablation Study — Compression Module Design¶
| Configuration | GPA | TAA | LVBench | MLVU | MME | Avg |
|---|---|---|---|---|---|---|
| BIMBA (3D pool) | ✗ | ✗ | 35.3 | 53.8 | 47.3 | 45.4 |
| +GPA | ✓ | ✗ | 41.1 | 62.4 | 53.2 | 52.2 |
| +GPA+TAA (Full) | ✓ | ✓ | 41.1 | 64.0 | 55.7 | 53.6 |
Key Findings¶
- Only 4.7K tokens are needed to achieve performance comparable to LLaVA-Video using 12.5K tokens, representing approximately 2.6× improvement in token efficiency.
- Replacing 3D average pooling with GPA yields approximately 7 points average improvement — learnable gated aggregation substantially outperforms fixed pooling.
- Delta sampling outperforms uniform sampling by 1.2 points on LVBench (44.6 vs. 43.4), a statistically significant difference (\(p = 0.047\)).
- Even when using Mamba as the LLM backbone, dedicated compression is necessary — a vanilla Mamba LLM performs substantially below the compressed variant.
- Performance saturates at 384 frames; additional frames provide no further gain.
- Strong performance on VNBench (needle-in-a-video-haystack, 41.5) demonstrates that compression does not discard critical fine-grained information.
Highlights & Insights¶
- Repurposing Mamba \(\Delta_t\) as a frame importance signal: The most elegant design choice — the SSM step size inherently encodes input information content, eliminating the need for a separate importance predictor. This idea generalizes to any sequence processing scenario utilizing SSMs.
- Modular pre-LLM compression: Unlike VAMBA, which compresses inside the LLM, MambaMia performs compression independently before the LLM, maintaining modularity and lightweight design.
- Rigorous experimental methodology: The paper emphasizes from-scratch training, controlled variable comparisons, multi-seed statistical validation, and significance testing — the experimental design is exemplary.
Limitations & Future Work¶
- Performance saturation at 384 frames suggests an information bottleneck in the compression layer; exploring more frames with improved compression strategies is worthwhile.
- The pooling weights in GPA are conditioned only on the query anchor, not on patch content (a deliberate efficiency trade-off), so fine-grained inter-patch relationships may be lost.
- \(\delta_{\text{thresh}}\) is manually set; an adaptive threshold would be preferable.
- Evaluation is limited to 7B-scale models; effectiveness on larger LLMs remains to be verified.
- The train-test mismatch between uniform sampling (training) and delta sampling (inference) may constrain performance.
Related Work & Insights¶
- vs. BIMBA: Both share a Mamba + periodic query architecture, but BIMBA uses 3D average pooling while MambaMia employs learnable gated aggregation and delta sampling; ablations show MambaMia consistently outperforms BIMBA.
- vs. LLaVA-Video: LLaVA-Video uses 12.5K tokens; MambaMia achieves comparable performance with 4.7K tokens, demonstrating a clear efficiency advantage.
- vs. Video-XL: Video-XL aggregates inside the LLM with CLIP-based frame selection, whereas MambaMia compresses independently outside the LLM with learned frame selection, yielding a more modular architecture.
- Insight: The inter-frame temporal fusion approach in TTF-VLA and MambaMia's temporal compression are complementary — TTF-style enhancement followed by MambaMia-style compression is a promising direction.
Rating¶
- Novelty: ⭐⭐⭐⭐ Repurposing \(\Delta_t\) for frame sampling is clever; GPA design is also novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Seven benchmarks, five compression comparisons, multi-seed statistical testing, and cost analysis — exceptionally comprehensive.
- Writing Quality: ⭐⭐⭐⭐⭐ Method descriptions are precise; a 13-section appendix covers all implementation details with strong reproducibility.
- Value: ⭐⭐⭐⭐⭐ Processing hour-long videos with only 4.7K tokens offers substantial practical value to the community.