RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs

Conference: ACL 2025
arXiv: 2501.19036
Code: https://github.com/L-Hugh/RedundancyLens
Area: Multimodal VLM / Visual Token Efficiency
Keywords: Visual Token Redundancy, Decoder-only MLLM, Training-free Acceleration, Dynamic FFN, Sparse Attention

TL;DR

This work proposes RedundancyLens, a training-free framework that systematically reveals extensive, structured, and clustered redundancy in the self-attention and FFN operations applied to visual tokens in decoder-only MLLMs. Building on this finding, it accelerates inference without retraining, and the approach is orthogonal to, and can be combined with, existing token compression methods.

Background & Motivation

Current MLLM architectures face a dilemma between performance and efficiency:

  • Decoder-only Architectures (e.g., LLaVA): Visual tokens are concatenated with textual tokens and jointly processed by the self-attention and FFN layers of the LLM, achieving high performance but low efficiency.
  • Cross-attention Architectures (e.g., Flamingo): Visual tokens bypass the self-attention and FFN layers of the LLM backbone, offering high efficiency but relatively poor performance.

Key observation: In decoder-only architectures, the number of visual tokens typically far exceeds that of textual tokens (accounting for over 90%), meaning that self-attention and FFN operations on visual tokens consume the vast majority of computational resources. This raises a natural question: Is it necessary to perform full self-attention and FFN computations on visual tokens at every single layer?

Directly training new architectures to verify this is prohibitively expensive. Therefore, the authors propose a training-free analysis framework to reveal redundancy patterns by progressively reducing computation.

Method

1. Probe-Activated Dynamic FFN

Inspired by MoE, only a subset of FFN parameters is activated to process visual tokens, without training a router. The core idea is to utilize a small number of sampled tokens as "probes" to determine which parameters to activate.

Given the visual input \(X \in \mathbb{R}^{N \times d_{\text{model}}}\), the standard FFN operation is:

\[H = \text{ReLU}(XW_1 + \mathbf{b_1}) \in \mathbb{R}^{N \times d_{\text{ff}}}\]
\[Y = HW_2 + \mathbf{b_2} \in \mathbb{R}^{N \times d_{\text{model}}}\]

The steps of Probe-Activated Dynamic FFN are:

  1. Sample Probes: Randomly sample \(M\) tokens (\(M \ll N\)) from the \(N\) visual tokens, and compute the hidden representation of the sampled tokens: \(H^{\text{sample}} = \text{ReLU}(X^{\text{sample}} W_1 + \mathbf{b_1})\)

  2. Compute Activation Importance: Take the absolute values of the sampled tokens' hidden representations and calculate the mean to obtain the importance score for each FFN dimension: \(\bar{\mathbf{h}} = \frac{1}{M} \sum_{i=1}^{M} |H_i^{\text{sample}}| \in \mathbb{R}^{d_{\text{ff}}}\)

  3. Select Top-K Dimensions: Select the \(K\) dimensions with the highest importance, \(S = \text{Top}_K(\bar{\mathbf{h}})\), and activate only the corresponding subset of weights: \(W_1^{\text{act}} = W_1[:, S] \in \mathbb{R}^{d_{\text{model}} \times K}, \quad W_2^{\text{act}} = W_2[S, :] \in \mathbb{R}^{K \times d_{\text{model}}}\)

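  4. Forward Pass: Compute the output for all visual tokens using the activated subset of parameters: \(H^{\text{act}} = \text{ReLU}(XW_1^{\text{act}} + \mathbf{b_1}^{\text{act}}), \quad Y = H^{\text{act}} W_2^{\text{act}} + \mathbf{b_2}\), where \(\mathbf{b_1}^{\text{act}} = \mathbf{b_1}[S] \in \mathbb{R}^{K}\) keeps the bias entries of the selected dimensions.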

This method uses only a subset of FFN parameters for visual tokens (by default, \(K\) is 20% of \(d_{\text{ff}}\)), while textual tokens still use the full FFN.
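The four steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the uniform sampling without replacement, and the `keep_ratio` parameter (standing in for \(K / d_{\text{ff}}\)) are choices made here for clarity.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def full_ffn(X, W1, b1, W2, b2):
    """Standard two-layer FFN with ReLU, as in the equations for H and Y."""
    return relu(X @ W1 + b1) @ W2 + b2


def probe_activated_ffn(X, W1, b1, W2, b2, num_probes, keep_ratio=0.2, rng=None):
    """Sketch of a probe-activated dynamic FFN for visual tokens X of shape (N, d_model).

    Assumptions (not from the source): probes are drawn uniformly without
    replacement, and K is derived from keep_ratio * d_ff.
    """
    rng = np.random.default_rng() if rng is None else rng
    N, d_ff = X.shape[0], W1.shape[1]
    K = max(1, int(round(keep_ratio * d_ff)))

    # Step 1: sample M probe tokens and run them through the full first layer.
    probe_idx = rng.choice(N, size=min(num_probes, N), replace=False)
    H_sample = relu(X[probe_idx] @ W1 + b1)

    # Step 2: mean absolute activation per FFN dimension.
    importance = np.abs(H_sample).mean(axis=0)

    # Step 3: indices of the K most important dimensions (order is irrelevant).
    S = np.argpartition(importance, -K)[-K:]
    W1_act, b1_act, W2_act = W1[:, S], b1[S], W2[S, :]

    # Step 4: process all visual tokens with the activated slice only.
    H_act = relu(X @ W1_act + b1_act)
    return H_act @ W2_act + b2, S
```

With `keep_ratio=1.0` the sketch reduces to the full FFN, which is a convenient sanity check. At `keep_ratio=0.2`, the two large matrix multiplications over all \(N\) tokens shrink to roughly one fifth of their original size, at the price of one extra first-layer pass over the \(M\) probes.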

2. Hollow Attention

A customized sparse attention pattern with the following core modifications:

  • Global attention among visual tokens \(\to\) Local attention (each visual token only attends to the preceding \(R_A\) visual tokens, with default \(R_A = 256\), approximately corresponding to the tokens of one sub-image)
  • Attention from visual tokens to textual tokens \(\to\) Unchanged
  • Attention of textual tokens \(\to\) Unchanged (they can still attend to all tokens)

Since visual tokens are far more numerous than textual tokens, this effectively eliminates the majority of attention computation overhead.
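The pattern can be made concrete as a boolean attention mask. The sketch below is illustrative only and rests on assumptions not stated in the source: the base mask is causal, the window \(R_A\) is measured in visual-token rank (so interleaved text tokens do not consume window slots), and a visual token always attends to itself. The function name `hollow_attention_mask` is hypothetical.

```python
import numpy as np


def hollow_attention_mask(is_visual, window):
    """Return an (n, n) boolean mask; mask[i, j] is True if query i may attend to key j.

    is_visual: length-n sequence of booleans marking visual tokens.
    window:    R_A, the number of preceding visual tokens a visual query may see.
    """
    is_visual = np.asarray(is_visual, dtype=bool)
    n = is_visual.shape[0]
    pos = np.arange(n)

    # Standard causal mask: key j is visible to query i iff j <= i.
    causal = pos[None, :] <= pos[:, None]

    # Rank of each token among the visual tokens (only used for visual pairs).
    vis_rank = np.cumsum(is_visual) - 1

    # Visual-to-visual pairs further apart than the window are masked out.
    both_visual = is_visual[:, None] & is_visual[None, :]
    distance = vis_rank[:, None] - vis_rank[None, :]
    too_far = both_visual & (distance > window)

    # Text queries and visual-to-text attention are left unchanged.
    return causal & ~too_far
```

Comparing `mask.sum()` against the number of pairs in the plain causal mask gives a quick estimate of how much attention computation a given \(R_A\) removes for a particular sequence layout.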

3. Layer Ranking Algorithm

A greedy search determines which layers are most redundant, so that they can be prioritized for computation reduction:

  • Construct a compact validation set (sampling approximately 2350 instances from multiple datasets)
  • Greedy iteration: In each round, the layer that causes the least degradation in model performance is selected from the unsorted layers and added to the sorted list.
  • Hybrid strategy: The last \(L_p\) layers are pre-assigned in descending order of position (as deeper layers show higher redundancy), while the remaining layers are sorted using the search algorithm.
  • Penalty coefficient \(\alpha = 2\): If performance drops after reduction, the penalty is doubled to encourage selecting layers that do not impact performance.
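The search can be sketched as below. This is a hedged illustration rather than the authors' algorithm: `evaluate` is a hypothetical callback that returns per-benchmark validation scores for a given set of reduced layers, changes are measured against the unreduced model, and the penalty is interpreted here as multiplying every per-benchmark drop by \(\alpha\) before summing. That reading of the penalty is an assumption; the source only states that drops are penalized with \(\alpha = 2\).

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]


def penalized_change(new: Scores, base: Scores, alpha: float = 2.0) -> float:
    """Sum of per-benchmark changes, with drops weighted by alpha (assumed form)."""
    total = 0.0
    for name, base_score in base.items():
        delta = new[name] - base_score
        total += delta if delta >= 0 else alpha * delta
    return total


def rank_layers(
    num_layers: int,
    evaluate: Callable[[Tuple[int, ...]], Scores],
    num_preassigned: int = 0,
    alpha: float = 2.0,
) -> List[int]:
    """Order layers from most to least redundant.

    The last `num_preassigned` layers (L_p) are placed first, deepest layer
    first; the rest are ordered greedily by penalized validation change.
    """
    ranked = [num_layers - 1 - i for i in range(num_preassigned)]
    remaining = [layer for layer in range(num_layers) if layer not in ranked]
    base = evaluate(())  # scores of the unreduced model

    while remaining:
        # Try reducing each candidate on top of the layers already chosen.
        best = max(
            remaining,
            key=lambda layer: penalized_change(
                evaluate(tuple(ranked + [layer])), base, alpha
            ),
        )
        ranked.append(best)
        remaining.remove(best)
    return ranked
```

In this sketch the number of `evaluate` calls grows quadratically with the number of searched layers, which is why pre-assigning the last \(L_p\) layers noticeably shortens the search. Benchmark scores on very different scales (for example MME versus DocVQA) would need normalizing before being summed.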

Experimental Results

Experimental Setup

  • Models: InternVL2-8B (32 layers), Qwen2-VL-7B (28 layers), MiniCPM-V 2.6 (28 layers), LLaVA-OneVision-7B (28 layers)
  • 8 Benchmarks: OCRBench, DocVQA, InfoVQA, ChartQA, TextVQA, MME, MMStar, HallusionBench
  • Hardware: NVIDIA A100 GPU
  • Baselines: FastV (token pruning), VTW (visual token withdrawal)

Table 1: Comparison of training-free acceleration methods (InternVL2-8B)

| Method | FLOPs Ratio | OCRBench | DocVQA | ChartQA | MME | MMStar |
|---|---|---|---|---|---|---|
| Original Model | 100% | 793 | 91.6 | 83.2 | 2210 | 61.3 |
| FastV (R=30%) | 72% | 793 | 90.6 | 82.9 | 2181 | 60.7 |
| Ours | 72% | 801 | 91.3 | 83.1 | 2212 | 61.7 |
| FastV (R=50%) | 53% | 768 | 85.4 | 80.6 | 2195 | 59.3 |
| Ours + FastV | 52% | 797 | 90.3 | 83.0 | 2192 | 60.9 |

Table 2: Comparison of training-free acceleration methods (Qwen2-VL-7B)

| Method | FLOPs Ratio | OCRBench | DocVQA | ChartQA | TextVQA | MME |
|---|---|---|---|---|---|---|
| Original Model | 100% | 865 | 94.5 | 83.2 | 84.3 | 2322 |
| FastV (R=30%) | 72% | 829 | 94.4 | 82.6 | 84.0 | 2306 |
| Ours | 71% | 859 | 94.5 | 83.0 | 84.6 | 2309 |
| FastV (R=50%) | 53% | 766 | 93.4 | 79.4 | 83.6 | 2309 |
| Ours + FastV | 53% | 832 | 94.3 | 81.8 | 84.2 | 2310 |

Key Findings

  1. Substantial Redundancy: After applying computation reduction for visual tokens to approximately half of the layers, the model performance remains largely unchanged or even improves.
  2. Exclusive to Visual Tokens: Applying the same reduction to textual tokens leads to a sharp performance drop, indicating that redundancy is a unique characteristic of visual tokens.
  3. Structural Clustering: Redundant layers tend to cluster in the latter half of the model (especially the final few layers).
  4. FFN is More Sensitive than Attention: When the number of reduced layers exceeds half of the total layers, FFN reduction causes a more severe performance degradation compared to attention reduction.
  5. Orthogonal Complementarity: When combined with token compression methods (e.g., FastV), performance at roughly 50% FLOPs is clearly better than with FastV alone on most benchmarks (e.g., DocVQA 90.3 vs. 85.4 and OCRBench 797 vs. 768 on InternVL2-8B).

Ablation Study

  • FFN Activation Parameter Ratio: Higher activation ratios allow more layers to be reduced; 20% represents a good trade-off between efficiency and performance.
  • Attention Range \(R_A\): \(R_A = 256\) yields the optimal performance across most benchmarks.
  • Layer Ranking Strategy: The hybrid strategy (position + search) outperforms purely position-based or search-based strategies.

Highlights & Insights

  • 🔍 Reveals an Important Architectural Insight: The processing of visual tokens in decoder-only MLLMs contains large-scale structured redundancy, providing valuable insights for future architectural designs.
  • 🔧 Training-Free: Achieves approximately 30% FLOPs reduction without retraining, offering high practical utility.
  • Orthogonality: Complements token compression methods, so the two can be combined for greater acceleration (approximately 50% FLOPs reduction).
  • 📊 Comprehensive Evaluation: Validated across 4 SOTA models and 8 benchmarks, demonstrating highly consistent findings.

Limitations & Future Work

  1. Layer Ranking Search Overhead: Constructing the validation set and conducting hundreds of evaluations incurs non-negligible computational overhead.
  2. Suboptimal Greedy Search: Constrained by the validation set size and search strategy, the greedy search may fail to find the globally optimal layer combination.
  3. Unexplored Causes of Redundancy: This work only verifies the existence of redundancy without explaining theoretically why specific layers are redundant for visual tokens.
  4. Insufficient Validation of Practical Speedup: The paper mainly reports FLOPs reduction, lacking detailed reports on actual inference latency improvements.

Related Work

  • MLLM Architectures: LLaVA, Flamingo, NVLM (comparing decoder-only vs. cross-attention), InternLM-XComposer2-4KHD
  • Visual Token Compression: FastV (attention-score-based pruning), VTW (visual token withdrawal), ZipVL (dynamic sparsification)
  • Efficient Inference: MoE (Mixture-of-Experts), Sparse Attention (BigBird)

Rating

⭐⭐⭐⭐ (4/5)

  • Novelty: ⭐⭐⭐⭐ — Approaches the problem from the perspective of "per-token computation reduction," which is complementary to mainstream token compression methods, offering a fresh perspective.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 4 SOTA models, 8 benchmarks, and thorough ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ — Clear analysis comparing the two architectures from a unified perspective.
  • Value: ⭐⭐⭐⭐ — Training-free and combinable, though the layer ranking search introduces extra overhead.
  • Impact: ⭐⭐⭐⭐ — The revealed redundancy patterns provide guiding insights for future MLLM architecture designs.