MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping¶
Conference: CVPR 2026 · arXiv: 2511.15690 · Code: https://github.com/ModelTC/MoDES
Area: Multimodal VLM / MoE Acceleration / Efficient Inference
Keywords: MoE, expert skipping, dual-modal threshold, global-modulated local gating, multimodal large model acceleration
TL;DR¶
MoDES is the first expert skipping framework for MoE multimodal large language models. It incorporates layer-level importance into routing probabilities via Global-Modulated Local Gating (GMLG), applies modality-specific skipping strategies for text and visual tokens via a Dual-Modal Threshold (DMT), and efficiently optimizes thresholds via frontier search. On Qwen3-VL-MoE-30B, MoDES retains 97.33% accuracy with 88% expert skipping, achieving a 2.16× prefill speedup.
Background & Motivation¶
MoE MLLMs (e.g., Kimi-VL, Qwen3-VL-MoE) reduce computational cost through sparse expert activation, yet an efficiency bottleneck remains: fixed top-\(k\) routing activates the same number of experts for every token. Existing expert skipping methods (NAEE, MC-MoE, DiEP) are designed for text-only LLMs; transferring them directly to MLLMs causes >10% performance degradation at an 83% skip rate. Analysis reveals two overlooked factors: (1) global contribution mismatch: shallow-layer experts exert far greater influence on final outputs than deep-layer experts (the error explosion effect); (2) modality discrepancy: visual tokens are nearly orthogonal to the FFN weight directions (angles approaching 90°), expert updates on visual tokens are smaller in magnitude, and their redundancy is higher.
Core Problem¶
How can a modality-aware, layer-aware expert skipping strategy be designed for MoE MLLMs so that near-baseline accuracy is preserved under extreme skip rates (>80%)?
Method¶
Overall Architecture¶
MoDES comprises two core components: (1) GMLG estimates an importance score for each token–expert pair; (2) DMT selects modality-specific skipping thresholds, which are efficiently determined via a frontier search algorithm. The entire pipeline is training-free.
Key Designs¶
- Global-Modulated Local Gating (GMLG): The importance score is defined as \(s_i^{(l)} = \alpha^{(l)} \cdot \pi_i^{(l)}\), where \(\pi_i^{(l)}\) is the standard routing probability (local signal) and \(\alpha^{(l)}\) is a layer-wise global weight computed via offline calibration, measured as the KL divergence incurred when all experts in layer \(l\) are skipped. \(\alpha^{(l)}\) is larger for shallow layers and smaller for deep layers, ensuring shallow-layer experts are skipped less frequently. Calibration requires only 1,024 samples and takes approximately 20 minutes to 4 hours (for 20–30B models).
- Dual-Modal Threshold (DMT): Separate skipping thresholds \(\tau_t\) and \(\tau_v\) are set for text tokens and visual tokens, respectively. The decision rule skips \(\{\mathrm{Expert}_i^{(l)} \mid s_i^{(l)} < \tau_t \cdot \mathbb{I}_t + \tau_v \cdot \mathbb{I}_v\}\). Visualization confirms that in practice the fraction of skipped experts is much higher for visual tokens than for text tokens across all layers (>90% vs. 50–70%), validating the insight that visual expert redundancy is higher.
- Frontier Search: Optimization is performed over a 2D grid \(\mathcal{B}^2\) of \((\tau_t, \tau_v)\). By exploiting the monotonicity of \(f\) (KL divergence) and \(g\) (skip rate), the search complexity is reduced from \(O(ND^2)\) to \(O(ND)\), an approximately 45× reduction in search time in practice. Correctness and optimality are formally established (Lemmas 1–4 and Propositions 1–2).
Loss & Training¶
The method is entirely training-free. Offline calibration of \(\alpha^{(l)}\) and frontier search for \((\tau_t^*, \tau_v^*)\) are both conducted on 1,024 GQA samples. At inference time, only a branch-free masked comparison is added to the MoE layer's router kernel, with no additional kernel launches.
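The frontier search over \((\tau_t, \tau_v)\) can be sketched as a monotone two-pointer sweep. This is an illustrative sketch under the stated monotonicity assumptions, with mock oracles `f` (KL divergence) and `g` (skip rate); the paper's actual grid construction and evaluation details may differ.

```python
def frontier_search(f, g, D, budget):
    """Sketch of frontier search over a D x D threshold grid.

    f(i, j): calibration KL divergence at (tau_t = grid[i], tau_v = grid[j]),
             assumed nondecreasing in both i and j (higher thresholds skip more).
    g(i, j): overall expert skip rate, also nondecreasing in both i and j.
    Maximizes g subject to f(i, j) <= budget while evaluating only O(D)
    grid points instead of all D^2 (as exhaustive search would).
    """
    best, best_g = None, -1.0
    j = D - 1
    for i in range(D):               # sweep tau_t from smallest to largest
        while j >= 0 and f(i, j) > budget:
            j -= 1                   # frontier j only moves down as i grows
        if j < 0:
            break                    # no feasible tau_v for this or any larger tau_t
        if g(i, j) > best_g:         # largest feasible j dominates row i,
            best, best_g = (i, j), g(i, j)  # since g is nondecreasing in j
    return best, best_g
```

Because the feasible boundary `j` never increases as `i` grows, the sweep touches each row and column at most a constant number of times, which is the source of the \(O(ND^2) \to O(ND)\) reduction (each grid-point evaluation costs \(O(N)\) over the calibration set).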
Key Experimental Results¶
Accuracy retained (%) relative to the unskipped baseline:

| Model | Skip Rate | MoDES | MC-MoE | DiEP | NAEE | Reduce-\(k\) |
|---|---|---|---|---|---|---|
| Kimi-VL-A3B | 50% | 99.91 | 97.69 | 98.17 | 96.44 | 95.93 |
| Kimi-VL-A3B | 67% | 98.46 | 95.45 | 94.81 | 94.03 | 93.88 |
| Kimi-VL-A3B | 83% | 96.25 | 88.32 | 87.58 | 82.81 | 71.60 |
| Qwen3-VL-MoE-30B | 88% | 97.33 | 86.66 | 85.30 | 80.60 | 60.11 |
| InternVL-3.5-30B | 88% | 97.03 | 86.20 | 83.26 | 78.88 | 59.63 |
Inference speedup (Qwen3-VL-MoE-30B): prefill 2.16×, decode 1.26×.
Compatible with quantization: 2.5-bit quantization combined with MoDES retains 94.43% accuracy on Qwen3 (vs. 89.58% for MC-MoE).
Ablation Study¶
- Both GMLG and DMT are essential: at an 83% skip rate, plain thresholding retains 82.81%; GMLG alone lifts this to 84.48%, DMT alone to 85.50%, and the two combined reach 96.25%.
- Modality discrepancy is real: Reducing top-\(k\) to 1 for visual tokens causes only marginal degradation, while the same reduction for text tokens leads to severe performance drops—confirming higher expert redundancy for visual tokens.
- Calibration data insensitivity: Results are nearly identical across GQA/COCO/VMMMU calibration sets (~96% in all cases).
- Consistent \(\alpha^{(l)}\) pattern: Layer-wise KL divergence distributions are similar across datasets, with shallow layers consistently exceeding deep layers.
- Frontier search vs. exhaustive search: Accuracy is nearly identical (96.24% vs. 96.25%) with a 45× reduction in search time.
Highlights & Insights¶
- First expert skipping framework for MoE MLLMs—all prior methods target unimodal LLMs and degrade significantly upon direct transfer.
- The finding that "visual tokens exhibit higher expert redundancy" aligns with V2Drop/ApET's observation that "a large number of visual tokens are redundant"—extended here from the token dimension to the expert dimension.
- The frontier search is supported by rigorous mathematical proofs (monotonicity → feasible region structure → optimality), demonstrating strong theoretical foundations.
- Retaining 97% accuracy at an 88% skip rate is remarkable, suggesting that MoE models substantially over-allocate experts.
- Orthogonal composability with quantization enables future three-stage compression: MoDES + quantization + token compression.
Limitations & Future Work¶
- Thresholds are determined via offline search and may not generalize across tasks or inputs—input-adaptive dynamic thresholds are worth exploring.
- Evaluation is limited to three MoE MLLMs (Kimi-VL / Qwen3-VL / InternVL3.5); broader architectural coverage is needed.
- Decode-stage speedup is modest (1.26×), primarily because decoding is memory-bound and processes only text tokens.
- Calibrating \(\alpha^{(l)}\) requires a forward pass with each layer's experts skipped, incurring overhead that grows linearly with the number of layers.
- Dynamic top-\(k\) adjustment strategies (as opposed to skipping after fixing top-\(k\)) remain unexplored.
Related Work & Insights¶
- vs. NAEE/MC-MoE (LLM expert skipping): Designed for unimodal LLMs; accuracy drops below 89% at an 83% skip rate when transferred to MLLMs. MoDES achieves 96.25%—a substantial gap.
- vs. DiEP (differentiable expert pruning): DiEP performs expert similarity and routing probability pruning in a training-aware framework but ignores layer-wise and modality-wise differences. MoDES is training-free and achieves superior results.
- vs. V2Drop/DUET-VLM (token compression): Orthogonally complementary—V2Drop reduces the number of visual tokens, while MoDES reduces the number of experts activated per token. The two approaches can be combined.
- vs. ApET (approximation error compression): ApET reduces tokens from an information-theoretic perspective; MoDES reduces computation from an expert perspective. Different mechanisms, same objective.
Highlights & Insights (Extended)¶
- The MoDES finding that "shallow layers are more important" corroborates the Overthinking paper's observation that "hypotheses become unstable from middle to deep layers → hallucinations"—both point to the conclusion that not all layers contribute equally.
- Combination idea: MoDES (expert skipping) + V2Drop (token dropping) + ApET (token merging)—three levels of compression for maximal VLM inference acceleration.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First adaptation of expert skipping to multimodal MoE; both insights (layer-wise and modality-wise) and the frontier search algorithm are original contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three MoE model families, 13 benchmarks, multiple skip rates, quantization combinations, detailed ablations, and mathematical proofs.
- Writing Quality: ⭐⭐⭐⭐⭐ Perfect logical flow from motivation → analysis → method → validation; appendix includes complete proofs.
- Value: ⭐⭐⭐⭐⭐ MoE MLLMs have become mainstream (Kimi-VL / DeepSeek / Qwen3 all adopt MoE); the method is directly deployable.