# AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
**Conference:** ICCV 2025 · **arXiv:** 2412.03248 · **Code:** https://github.com/LaVi-Lab/AIM · **Area:** Video Understanding · **Keywords:** Multimodal LLM, adaptive inference, token merging, token pruning, video understanding efficiency
## TL;DR
This paper proposes AIM, a training-free adaptive-inference method for multimodal LLMs that achieves a 6.8× FLOPs reduction while maintaining performance, by combining similarity-based iterative visual token merging before the LLM with progressive PageRank-based token pruning inside the LLM layers. Under an equal compute budget, AIM even surpasses the full-token baseline on long-video understanding (+4.6 on MLVU).
## Background & Motivation
**Background:** Multimodal LLMs rely on large numbers of visual tokens (up to thousands for video), incurring substantial computational overhead that limits real-time deployment and long-video processing.
**Limitations of prior work:** Methods such as FastV and PDrop prune only at fixed LLM layers, lacking flexibility; LLaVA-PruMerge operates only before the LLM. None of them adapts to varying computational budgets.
**Core idea:** Merge similar tokens before the LLM to reduce redundancy, and progressively prune unimportant tokens within the LLM layers; these two tunable knobs offer flexible control over compute.
## Method
### Key Designs
- **Pre-LLM token merging:** Adjacent visual tokens are partitioned into sets A and B; the most similar pairs under cosine similarity are identified and averaged. For video, merging is performed only within frames, since cross-frame merging would disrupt temporal order. (A sketch follows this list.)
- **Progressive in-layer token pruning:** The PageRank algorithm is run over the self-attention weight matrix to compute an importance score for each token. Only visual tokens are pruned; text tokens are always retained, as pruning them causes severe performance degradation. (A sketch follows this list.)
- **Piecewise linear scheduler:** All tokens are retained for the first \(l_1\) layers; the token count decreases linearly from layer \(l_1\) to layer \(l_2\); visual tokens are fully removed beyond layer \(l_2\). Early layers handle cross-modal fusion and cannot be pruned aggressively, while later layers shift toward text reasoning and tolerate substantial pruning. (A sketch follows this list.)
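Since the method is training-free, the merging step is compact enough to sketch. Below is a minimal, hypothetical `merge_frame_tokens` illustrating the bipartite A/B matching described above; the alternating split, the merge count `r`, and the plain averaging are assumptions for illustration, and the paper's iterative schedule may differ.

```python
import torch
import torch.nn.functional as F

def merge_frame_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge r pairs of similar visual tokens within ONE frame.

    x: (N, D) tokens of a single frame; returns (N - r, D) tokens.
    """
    a, b = x[0::2], x[1::2]                                   # alternating split into sets A and B
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T   # (|A|, |B|) cosine similarities
    best_sim, best_b = sim.max(dim=-1)                        # best partner in B for each A token
    order = best_sim.argsort(descending=True)
    merged, kept = order[:r], order[r:]                       # merge the r most similar A tokens
    b = b.clone()
    # Average each merged A token into its matched B token (collisions where
    # several A tokens pick the same B token are resolved naively here).
    b[best_b[merged]] = (b[best_b[merged]] + a[merged]) / 2
    return torch.cat([a[kept], b], dim=0)
```

Because merging never crosses frame boundaries, temporal order is preserved; applying the function per frame (optionally over several iterations) yields the reduced token sequence fed to the LLM.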
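The pruning criterion can likewise be sketched. The function below runs standard PageRank over a head-averaged self-attention map; the damping factor `d = 0.85` and the iteration count are conventional PageRank defaults assumed here, not values taken from the paper.

```python
import torch

def pagerank_importance(attn: torch.Tensor, d: float = 0.85, iters: int = 20) -> torch.Tensor:
    """Score tokens by PageRank over an (N, N) head-averaged attention map
    whose rows sum to 1 (row i = token i's attention distribution)."""
    n = attn.size(0)
    scores = torch.full((n,), 1.0 / n)          # uniform initial importance
    for _ in range(iters):
        # A token is important if tokens that are themselves important attend to it.
        scores = (1.0 - d) / n + d * (attn.transpose(0, 1) @ scores)
    return scores
```

At each scheduled layer, the lowest-scoring visual tokens are dropped while all text tokens are kept, consistent with the finding that pruning text tokens is harmful.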
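Finally, a minimal sketch of the piecewise linear schedule, using the best ablation setting (\(l_1 = 4\), \(l_2 = 20\)) as hypothetical defaults; how this per-layer ratio is reconciled with the global retention budget is simplified away here.

```python
def retention_ratio(layer: int, l1: int = 4, l2: int = 20) -> float:
    """Fraction of visual tokens kept at a given (0-indexed) LLM layer."""
    if layer < l1:                               # early layers: full cross-modal fusion
        return 1.0
    if layer >= l2:                              # late layers: text-only reasoning
        return 0.0
    return 1.0 - (layer - l1) / (l2 - l1)        # linear decay in between
```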
## Key Experimental Results
| Model | TFLOPs | VideoMME | MLVU |
|---|---|---|---|
| LLaVA-OV-7B | 99.63 | 58.2 | 64.7 |
| AIM | 14.67 | 57.4 | 69.3 |
| FastV | 21.24 | 50.1 | 54.1 |
### Key Findings
- Retaining only 25% of visual tokens is sufficient to maintain near-full performance.
- Using fewer tokens per frame allows more frames to be processed, which actually improves long-video understanding.
- Pruning visual tokens in early layers severely degrades performance, whereas aggressive pruning in later layers has minimal impact.
### Piecewise Linear Scheduler Parameter Effects
| \(l_1\) | \(l_2\) | Retention Ratio | VideoMME | TFLOPs |
|---|---|---|---|---|
| 4 | 20 | 25% | 57.4 | 14.67 |
| 8 | 24 | 25% | 56.8 | 15.23 |
| 4 | 20 | 50% | 58.0 | 28.45 |
| 0 | 16 | 25% | 52.1 | 12.34 |
### Results Across Different Models
| Model | Original TFLOPs | AIM TFLOPs | Performance Retention |
|---|---|---|---|
| LLaVA-OV-7B | 99.6 | 14.7 | 98.6% |
| Qwen2-VL-7B | 85.3 | 12.5 | 97.8% |
## Highlights & Insights
- The counterintuitive finding that "reducing tokens can improve long-video performance" is particularly valuable: fewer tokens × more frames > more tokens × fewer frames.
- PageRank-based token importance estimation is more stable than naive attention weight aggregation.
## Limitations & Future Work
- Scheduler parameters (\(l_1\), \(l_2\), retention ratio) require manual selection; an automatic tuning mechanism is absent.
- Training-free methods may have a performance ceiling below that of training-based approaches, with potentially significant degradation under extreme compression.
- The computational overhead of the PageRank algorithm is not thoroughly analyzed and may partially offset the speedup from token pruning.
- Applicability to speech- and audio-enabled multimodal models (e.g., audio-visual LLMs) remains unexplored.
- Pruning only visual tokens while retaining all text tokens may not be optimal for visually dominated tasks.
- The balance between intra-frame and cross-frame merging lacks systematic investigation.
- Integration with complementary acceleration techniques such as model quantization and knowledge distillation has not been explored.
## Related Work & Insights
- vs. FastV/PDrop: These methods prune only at fixed LLM layers and lack flexibility; AIM operates both before the LLM and within its layers.
- vs. LLaVA-PruMerge: It operates only before the LLM; AIM additionally introduces progressive in-layer pruning.
- vs. ToMe: ToMe is a general-purpose token-merging approach; AIM adds intra-frame constraints and PageRank-based importance scoring tailored to video.
## Additional Discussion
- The core innovation lies in compressing tokens along two dimensions at once, before the LLM (merging) and inside it (progressive pruning), rather than at a single point in the pipeline.
- The experimental design covers diverse scenarios and baseline comparisons, with consistent gains across benchmarks.
- The modular design of the method facilitates extension to related tasks and new datasets.
- Open-sourcing the code and data is of considerable value to the community for reproduction and follow-up research.
- Compared to concurrent work, this paper demonstrates greater depth in problem formulation and more comprehensive experimental analysis.
- The paper is logically structured, forming a complete loop from problem definition to method design to experimental validation.
- The computational overhead of the method is reasonable, making it deployable in practical applications.
- Future work may consider integration with additional modalities such as audio and 3D point clouds.
- Validating the scalability of the method on larger datasets and models is an important subsequent direction.
- Combining the method with reinforcement learning for end-to-end optimization is worth exploring.
- Cross-domain transfer is worth investigating; the method's generalizability requires further validation.
- A lightweight variant of the method tailored for edge computing and mobile deployment scenarios warrants further study.
## Rating
- Novelty: ⭐⭐⭐⭐ The dual-stage merge-then-prune design is original
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers video and image benchmarks with thorough ablations
- Writing Quality: ⭐⭐⭐⭐ In-depth analysis with clear insights
- Value: ⭐⭐⭐⭐⭐ Substantial practical value for real-world deployment