MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping¶
Conference: CVPR2026
arXiv: 2511.15690
Code: ModelTC/MoDES
Area: Multimodal VLM
Keywords: MoE acceleration, expert skipping, multimodal large language models, training-free, inference efficiency
TL;DR¶
This paper proposes MoDES, the first training-free expert-skipping framework for MoE multimodal large language models. Using Globally Modulated Local Gating (GMLG) and Dual-Modal Thresholding (DMT), MoDES adaptively skips redundant experts, retaining over 97% of the original performance at an 88% skipping rate and delivering a 2.16× prefill speedup.
Background & Motivation¶
Inference bottleneck in MoE MLLMs: MoE multimodal large language models (e.g., Qwen3-VL-MoE-30B) reduce computation via sparse activation, yet each token still interacts with multiple activated experts, resulting in non-trivial inference overhead.
Failure of existing expert skipping methods: Methods such as NAEE, MC-MoE, and DiEP are designed for unimodal LLMs; when directly applied to MLLMs, accuracy degrades by more than 10% at an 83% skipping rate.
Uneven layer-wise contribution (Insight i): Shallow-layer experts contribute far more to the final output than deep-layer ones—errors introduced in shallow layers are amplified by subsequent layers. However, existing methods make skipping decisions solely based on intra-layer routing probabilities, ignoring global layer-level importance.
Modality-specific behavioral differences (Insight ii): Text tokens and visual tokens exhibit significantly different update magnitudes in FFN layers. Visual tokens are more orthogonal to FFN weights (angles approaching 90°), and are therefore less affected by FFN transformations, exhibiting higher redundancy.
Lack of modality-aware skipping strategies: Prior work applies a uniform threshold across all modalities without accounting for the distinct characteristics of text and visual tokens, leading to suboptimal skipping decisions.
High cost of threshold search: Brute-force search for dual-modal thresholds requires \(\mathcal{O}(ND^2)\) time complexity, taking days to complete for 20–30B parameter models.
Method¶
Overall Architecture¶
MoDES is a training-free inference acceleration framework consisting of two core modules: Globally Modulated Local Gating (GMLG), which computes an importance score for each expert, and Dual-Modal Thresholding (DMT), which makes adaptive skipping decisions based on token modality.
Globally Modulated Local Gating (GMLG)¶
To address uneven layer-wise contribution, GMLG combines global layer-level importance with local routing probabilities:
- \(\pi_i^{(l)}\): local routing probability (softmax-normalized) of the \(i\)-th expert in layer \(l\)
- \(\alpha^{(l)}\): global modulation factor obtained via offline calibration, measuring the impact on the final output of skipping all experts in layer \(l\)
\(\alpha^{(l)}\) is computed as the KL divergence between the output distributions of the original model and the model with all experts in layer \(l\) skipped, averaged over a calibration set \(\mathcal{C}\).
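Written out, with notation assumed here (the paper may use a different argument order or normalization), this corresponds to:

\[
\alpha^{(l)} \;=\; \frac{1}{|\mathcal{C}|} \sum_{x \in \mathcal{C}} \mathrm{KL}\!\left( p_{\mathrm{orig}}(x) \,\middle\|\, p_{\neg l}(x) \right),
\]

where \(p_{\mathrm{orig}}(x)\) is the output distribution of the full model and \(p_{\neg l}(x)\) that of the model with all experts in layer \(l\) skipped.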
Calibration uses 1,024 samples from the GQA dataset and is performed entirely offline, so it adds no overhead at inference time.
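One natural way to turn these two signals into a per-expert score, consistent with the description above (the paper's exact combination rule may differ), is a simple multiplicative form:

\[
s_i^{(l)} \;=\; \alpha^{(l)} \cdot \pi_i^{(l)},
\]

so that experts in low-impact layers (small \(\alpha^{(l)}\)) become skip candidates even when their local routing probability alone would keep them active.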
Dual-Modal Thresholding (DMT)¶
To handle modality-specific behavioral differences, DMT assigns separate skipping thresholds \(\tau_t\) and \(\tau_v\) for text and visual tokens respectively:
Experts whose importance scores fall below the threshold corresponding to their token modality are skipped. Visual tokens, being more redundant, are typically assigned a higher skipping threshold.
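A minimal sketch of the resulting skip rule, assuming the multiplicative GMLG score above; the names (`dmt_skip_mask`, `expert_scores`, `is_visual`) are illustrative, not from the released code:

```python
import torch

def dmt_skip_mask(expert_scores: torch.Tensor,  # [num_tokens, num_experts] GMLG importance scores
                  is_visual: torch.Tensor,      # [num_tokens] bool, True for visual tokens
                  tau_t: float,
                  tau_v: float) -> torch.Tensor:
    """Return a per-token boolean mask of experts to skip.

    Each token is compared against the threshold of its own modality;
    because visual tokens are more redundant, tau_v is typically larger,
    so more of their experts fall below it and get skipped.
    """
    tau = torch.where(is_visual, torch.tensor(tau_v), torch.tensor(tau_t))  # [num_tokens]
    return expert_scores < tau.unsqueeze(-1)    # skip experts whose score is below the modality threshold
```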
Frontier Search Algorithm¶
To efficiently find the optimal \((\tau_t, \tau_v)\), the problem is formulated as minimizing the KL divergence to the unskipped model subject to a target skipping rate \(\rho\). Because both the KL objective and the achieved skipping rate are monotone in each threshold, a dual-pointer strategy traverses the Pareto frontier and identifies the optimal pair in \(\mathcal{O}(ND)\) time, roughly 45× faster than the \(\mathcal{O}(ND^2)\) brute-force search, reducing search time from days to a few hours.
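The dual-pointer idea can be sketched as follows. This is an illustrative reconstruction from the complexity argument, not the paper's implementation; `skip_rate` and `kl_div` stand for evaluations of a threshold pair on the \(N\) calibration samples, each costing \(\mathcal{O}(N)\):

```python
def frontier_search(tau_t_grid, tau_v_grid, skip_rate, kl_div, target_rho):
    """Sweep the Pareto frontier of (tau_t, tau_v) pairs with two pointers.

    Assumes skip_rate(tau_t, tau_v) is non-decreasing in both thresholds:
    as tau_t grows (grids sorted ascending), the smallest tau_v that still
    reaches target_rho can only shrink, so pointer j moves monotonically
    and only O(D) threshold pairs are evaluated in total.
    """
    best = None
    j = len(tau_v_grid) - 1                     # start at the largest visual threshold
    for tau_t in tau_t_grid:                    # text thresholds in ascending order
        # Slide the visual pointer down while the target skip rate still holds.
        while j > 0 and skip_rate(tau_t, tau_v_grid[j - 1]) >= target_rho:
            j -= 1
        if skip_rate(tau_t, tau_v_grid[j]) < target_rho:
            continue                            # this tau_t cannot yet reach the target rate
        kl = kl_div(tau_t, tau_v_grid[j])       # one O(N) evaluation per frontier point
        if best is None or kl < best[0]:
            best = (kl, tau_t, tau_v_grid[j])
    return best                                 # (kl, tau_t, tau_v), or None if rho is unreachable
```

A brute-force scan would instead evaluate all \(D^2\) threshold pairs, giving the \(\mathcal{O}(ND^2)\) cost quoted above.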
Experiments¶
Main Results: Kimi-VL-A3B-Instruct across 13 Benchmarks (representative subset shown)¶
| Method | Skip Rate | ChartQA | MME | MMBench | LVB | VMMMU | Avg. (% of full model) |
|---|---|---|---|---|---|---|---|
| Default k=6 | 0% | 89.48 | 2207 | 83.16 | 63.13 | 49.33 | 100.00 |
| DiEP | 83% | 78.31 | 2071 | 76.28 | 52.41 | 43.81 | 87.58 |
| MC-MoE | 83% | 80.25 | 2063 | 73.42 | 54.39 | 44.02 | 88.32 |
| MoDES | 83% | 84.20 | 2162 | 81.44 | 62.60 | 47.11 | 96.25 |
Cross-Model Generalization: Qwen3-VL-MoE-30B at 88% Skip Rate¶
| Method | ChartQA | MME | MMBench | VMMMU | Avg. (% of full model) |
|---|---|---|---|---|---|
| MC-MoE | 71.43 | 2168 | 75.42 | 37.41 | 86.66 |
| DiEP | 70.51 | 2074 | 73.21 | 34.79 | 85.30 |
| MoDES | 78.84 | 2403 | 85.57 | 46.56 | 97.33 |
At an aggressive 88% skipping rate, MoDES outperforms the strongest baseline MC-MoE by 10.67 percentage points.
Ablation Study¶
| Configuration | ChartQA | MME | MMBench | LVB | VMMMU |
|---|---|---|---|---|---|
| Single-threshold baseline | 76.74 | 1956 | 65.48 | 54.67 | 40.33 |
| +GMLG | 79.28 | 2107 | 75.19 | 60.02 | 43.87 |
| +DMT | 82.94 | 2081 | 79.42 | 61.16 | 45.08 |
| +GMLG+DMT (full) | 84.20 | 2162 | 81.44 | 62.60 | 47.11 |
(83% skip rate, Kimi-VL-A3B-Instruct.) GMLG and DMT each contribute substantially on their own, and their gains are complementary, with larger improvements at higher skipping rates.
Key Findings¶
- Inference speedup: MoDES achieves 2.16× prefill and 1.26× decoding speedup on Qwen3-VL-MoE-30B.
- Compatibility with quantization: Combined with 2.5-bit quantization, MoDES retains 94.43% of the original performance on Qwen3-VL-MoE-30B, versus 89.58% for MC-MoE.
- Skipping pattern visualization: Deep layers exhibit far higher skipping rates than shallow layers; visual tokens are skipped at a much higher rate than text tokens, validating both core insights.
- Calibration data robustness: Replacing the calibration set with COCO or VMMMU yields negligible performance differences.
- Search efficiency: Frontier search is roughly 45× faster than brute-force search; the total overhead (calibration plus threshold search) for 20–30B-parameter models ranges from about 20 minutes to 4 hours.
Highlights & Insights¶
- First work to systematically analyze uneven layer-wise contribution and modality-specific behavioral differences in MoE MLLMs, with both insights supported by thorough empirical evidence.
- GMLG elegantly combines offline global calibration with online local routing, incurring no additional overhead during inference.
- DMT replaces a uniform threshold with modality-aware dual thresholds, with a clear and logical design motivation.
- The frontier search algorithm exploits monotonicity to reduce complexity from \(\mathcal{O}(ND^2)\) to \(\mathcal{O}(ND)\), offering strong practical utility.
- Experiments span 3 model families × 13 benchmarks, achieving less than 3% accuracy loss when skipping 88% of experts.
Limitations & Future Work¶
- Only text and visual modalities are addressed; the framework has not been extended to additional modalities such as audio.
- \(\alpha^{(l)}\) operates at the layer level and does not differentiate global importance among individual experts within the same layer.
- Evaluation is limited to image/video understanding tasks; generative tasks (e.g., image captioning quality) are not thoroughly assessed.
- Decoding-stage speedup is modest (~1.2×), as decoding is memory-bound and processes only text tokens.
- The frontier search relies on a monotonicity assumption that, while reasonable in practice, lacks rigorous theoretical guarantees.
Related Work & Insights¶
- NAEE [Lu et al., 2024]: Skips minor experts based on routing-probability ratios, relying solely on intra-layer information.
- MC-MoE [Huang et al., 2024]: Extends NAEE with attention-aware expert protection and mixed-precision quantization.
- DiEP [Bai et al., 2025]: Differentiable expert pruning that jointly considers routing probabilities and expert similarity for skipping decisions.
- All aforementioned methods are designed for unimodal LLMs and transfer poorly to MLLMs. MoDES is the first to propose a global and modality-aware skipping strategy tailored to multimodal settings.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Both insights are convincing, and the GMLG+DMT combination is well-motivated.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 3 model families × 13 benchmarks × multiple skipping rates, with complete ablations.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with coherent motivation→method→experiment flow.
- Value: ⭐⭐⭐⭐ — Directly applicable to MoE MLLM deployment, with a concise and efficient design.