MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition¶
Conference: NeurIPS 2025
arXiv: 2510.04136
Code: N/A
Area: Audio & Speech
Keywords: audio-visual speech recognition, Matryoshka representation learning, Mixture-of-Experts, elastic inference, LLM
TL;DR¶
MoME integrates sparse MoE into the Matryoshka representation learning framework for LLM-based audio-visual speech recognition. Through a shared router, it enables cross-granularity knowledge transfer, supporting elastic inference at multiple compression rates under a single set of model weights, while achieving state-of-the-art performance on AVSR/ASR/VSR.
Background & Motivation¶
LLM-based AVSR faces a fundamental tension: token hunger vs. computational cost. The temporal resolution of audio-visual speech signals far exceeds that of text, resulting in an enormous number of input tokens. Existing token compression methods (concatenation, resampling, average pooling, etc.) require a fixed compression rate to be specified in advance, producing fixed-length output sequences that cannot dynamically balance accuracy and efficiency at inference time.
Matryoshka Representation Learning (MRL) trains a single model with multiple compression rates, enabling dynamic granularity adjustment at inference. However, existing Matryoshka approaches suffer from two major shortcomings:
Independent training per granularity — each resolution is treated as an independent problem, lacking cross-scale interaction, with severe information loss at high compression rates.
Uniform monolithic representations — all scales share the same monolithic network architecture, precluding specialization for different granularities.
The core idea of MoME is to leverage sparse MoE experts to achieve cross-granularity knowledge transfer: the same set of experts is activated similarly across different compression rates, allowing low-resolution sequences to reuse expert pathways trained on high-resolution sequences.
Method¶
Overall Architecture¶
Input audio/video → pretrained encoders (Whisper/AV-HuBERT) → Matryoshka token sequences at multiple compression rates → frozen LLM (Llama 3) with parallel MoME modules → autoregressive decoding to transcription text. All \(G \times L\) audio-visual compression rate combinations are trained jointly; any single combination can be selected at inference time.
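A minimal sketch of this front end, assuming tiny linear layers as stand-ins for the frozen Whisper/AV-HuBERT encoders and plain average pooling as the token-compression operator (the paper's actual projectors and pooling details may differ); it only illustrates how the \(G \times L\) Matryoshka token sequences are produced:

```python
import torch
import torch.nn as nn

# Toy placeholders for the frozen pretrained encoders (Whisper for audio,
# AV-HuBERT for video in the paper); all dimensions here are illustrative.
audio_encoder = nn.Linear(80, 256)
video_encoder = nn.Linear(512, 256)

def compress(tokens: torch.Tensor, rate: int) -> torch.Tensor:
    """Fixed-rate compression via non-overlapping average pooling over time."""
    b, t, d = tokens.shape
    return tokens[:, : t - t % rate].reshape(b, -1, rate, d).mean(dim=2)

def matryoshka_prompts(audio, video, audio_rates=(4, 16), video_rates=(2, 5)):
    """Build all G x L audio-visual token sequences used during joint training;
    at inference a single (i, j) pair is selected and fed to the frozen LLM."""
    a, v = audio_encoder(audio), video_encoder(video)
    return {(i, j): torch.cat([compress(a, i), compress(v, j)], dim=1)
            for i in audio_rates for j in video_rates}

prompts = matryoshka_prompts(torch.randn(1, 400, 80), torch.randn(1, 100, 512))
for (i, j), seq in prompts.items():
    print((i, j), tuple(seq.shape))  # higher rates -> fewer audio-visual tokens
```

At inference only one (i, j) entry needs to be computed; it is prepended to the text prompt and decoded by the frozen Llama backbone with the MoME adapters active.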
Key Designs¶
- MoME module structure: Each MoME module contains \(N_r\) routed experts and \(N_s\) shared experts. Each expert follows a bottleneck design (linear down-projection → GELU → linear up-projection), with the bottleneck dimension compressible to as small as 1. The router is a linear layer that selects \(K\) routed experts via top-k sparse gating: \(\text{MoME}(\mathbf{H}_l^{ij}) = \sum_{n=1}^{N_s} E_n(\mathbf{H}_l^{ij}) + \sum_{n=N_s+1}^{N_s+N_r} g_n E_n(\mathbf{H}_l^{ij})\), where the gate \(g_n\) is nonzero only for the \(K\) selected experts. (A PyTorch-style sketch of this module follows the list.)
- Shared router and cross-granularity alignment: The central innovation lies in sharing the experts and router of the MoME module across all Matryoshka sequences. This means the router simultaneously processes high-resolution (information-rich) and low-resolution (heavily compressed) inputs during training, naturally learning to activate similar expert subsets at different granularities. Experiments (Figure 5) confirm this implicit alignment: expert activation distributions within the same layer are highly consistent across compression rates, while activation patterns differ significantly across layers, achieving layer-wise diversity.
- Shared experts: Inspired by DeepSeekMoE and Llama 4, one or two always-active shared experts are introduced to capture global, cross-modal, scale-invariant knowledge. Ablation studies confirm that shared experts yield measurable WER improvements.
- Flexible insertion positions: MoME modules can be inserted in parallel at three positions within each LLM layer: the MHSA module, the FFN module, or the entire Transformer layer. The LLM backbone is frozen; only the MoME modules are trained (parameter-efficient fine-tuning).
Loss & Training¶
Multi-granularity average cross-entropy loss: \(\mathcal{L}_{LM} = -\frac{1}{G \cdot L} \sum_{i=1}^{G}\sum_{j=1}^{L} \log p(\mathbf{Y}|\mathbf{Z}^{ij}) \cdot c_{ij}\), where \(c_{ij}=1\) denotes equal weighting across granularities. A load balancing loss \(\mathcal{L}_B\) (coefficient 0.01) is added to prevent routing collapse. Training compression rates are audio \(\{4, 16\}\) and video \(\{2, 5\}\), yielding four combinations.
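A minimal sketch of this objective under equal weighting, assuming hypothetical helpers `lm_nll(a_rate, v_rate)` (negative log-likelihood of the transcription given the prompt compressed at those rates) and `load_balance()` (the routing balance penalty):

```python
import torch

def mome_training_loss(lm_nll, load_balance,
                       audio_rates=(4, 16), video_rates=(2, 5), balance_coeff=0.01):
    """Average the LM loss over all G x L rate combinations (c_ij = 1, i.e. equal
    weights) and add the load-balancing penalty that discourages routing collapse."""
    combos = [(a, v) for a in audio_rates for v in video_rates]
    lm_loss = torch.stack([lm_nll(a, v) for a, v in combos]).mean()
    return lm_loss + balance_coeff * load_balance()

# Toy usage with dummy callables standing in for the real forward passes.
loss = mome_training_loss(lambda a, v: torch.tensor(float(a + v)),
                          lambda: torch.tensor(0.5))
print(loss)  # tensor(13.5050)
```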
Key Experimental Results¶
Main Results (AVSR, WER%↓; columns give (audio, video) compression rates)¶
| Method | Active Params | LRS2 (4,2) | (4,5) | (16,2) | (16,5) | LRS3 (4,2) | (4,5) | (16,2) | (16,5) |
|---|---|---|---|---|---|---|---|---|---|
| Llama-AVSR (independent) | 27.5M | 4.1 | 4.5 | 5.3 | 8.1 | 2.4 | 2.8 | 3.3 | 4.1 |
| Llama-MTSK SS | 27.5M | 3.4 | 4.7 | 4.8 | 6.4 | 2.3 | 2.2 | 3.3 | 3.6 |
| Llama-MTSK MSS | 55.0M | 3.6 | 4.8 | 6.1 | 9.0 | 2.4 | 2.4 | 3.2 | 3.5 |
| MoME-23/4-MHSA | 12.7M | 2.9 | 3.0 | 4.2 | 4.3 | 1.8 | 1.7 | 2.9 | 2.9 |
| MoME-23/4-LAYER | 12.7M | 2.7 | 2.7 | 4.2 | 4.2 | 1.5 | 1.8 | 3.1 | 3.2 |
MoME outperforms all baselines across all compression rates while using 2–4× fewer active parameters.
Ablation Study (MoME-MHSA on LRS2)¶
| Routed Experts | Shared Experts | Bottleneck Size | Top-k | (4,2) | (4,5) | (16,2) | (16,5) |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 48 | / | 3.4 | 3.4 | 4.9 | 5.1 |
| 4 | 0 | 24 | 2 | 3.3 | 3.3 | 4.8 | 5.0 |
| 4 | 1 | 24 | 2 | 3.2 | 3.2 | 4.4 | 4.7 |
| 23 | 1 | 12 | 4 | 2.9 | 3.0 | 4.2 | 4.3 |
| 23 | 2 | 12 | 4 | 2.8 | 3.0 | 4.1 | 4.7 |
Key Findings¶
- Noise robustness: At SNR = −5 dB, MoME (32.6% WER) substantially outperforms Llama-AVSR (41.8%) and Llama-MTSK (44.9%).
- Extreme compression: With bottleneck dimension reduced to 1 (0.9M active parameters), WER degrades only marginally (LRS3: 1.8 → 2.0).
- Cross-modal token analysis (Figure 4): Audio-visual tokens at different compression rates exhibit strong linear correlation; high-compression tokens approximately correspond to 2–3 low-compression tokens.
- Computational efficiency: At the (16, 5) compression setting, TFLOPs are reduced by 8×, and inference time on a 23-second utterance drops from 12.75 s to 6.74 s.
Highlights & Insights¶
- The first framework to unify MoE and Matryoshka representation learning, cleverly leveraging sparse experts for cross-granularity knowledge transfer.
- The shared router design allows cross-scale alignment to emerge naturally as an implicit property rather than an explicit constraint.
- The analogy to the shallow brain hypothesis — deep LLM backbone plus parallel shallow MoME modules — is an insightful conceptual framing.
- A single set of model weights supporting elastic inference is a deployment-friendly design well suited for on-device applications.
Limitations & Future Work¶
- Validation is limited to English speech recognition; generalizability to multilingual and multi-task settings remains unexplored.
- The optimal MoME insertion position (MHSA/FFN/LAYER) varies by dataset, and no automatic selection mechanism is proposed.
- No comparison with the adaptive compression strategy (speech-rate-based) of MMS-LLaMA in terms of flexibility.
- Performance degrades when the number of shared experts exceeds 2; the underlying cause is not thoroughly analyzed.
Related Work & Insights¶
- Key distinction from Llama-MTSK: The latter employs multi-scale LoRA but treats each scale independently, whereas MoME achieves cross-scale coupling through a shared router.
- The shared expert design from DeepSeekMoE proves effective in the multimodal Matryoshka setting.
- Inspiration: The MoME framework is generalizable to other multimodal tasks such as vision-language, requiring only replacement of the encoder and compression strategy.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (first unification of MoE and MRL; elegant design)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (LRS2 + LRS3, three tasks, detailed ablations and visualizations)
- Writing Quality: ⭐⭐⭐⭐ (clear structure, rich experimental figures and tables)
- Value: ⭐⭐⭐⭐⭐ (elastic inference + SOTA performance; high value for on-device deployment)