Group then Scale: Dynamic Mixture-of-Experts Multilingual Language Model

Conference: ACL 2025
arXiv: 2506.12388
Code: GitHub
Area: LLM Efficiency/Multilingual
Keywords: multilingual LLM, mixture of experts, curse of multilinguality, language grouping, parameter deviation

TL;DR

DMoE combines dynamic language grouping based on parameter deviation with selective MoE layer expansion. Language similarity is quantified from only 10 steps of fine-tuning per language, and similar languages are grouped to share one expert. Only layers with large parameter deviations (the language-specific layers) are expanded into MoE layers. Across settings with 18 to 128 languages, DMoE reduces PPL by 11.4% relative to continual pre-training and outperforms X-ELM by 9.6% with 3.6x fewer parameters.

Background & Motivation

Background: Multilingual LLMs perform far worse on low- and medium-resource languages than on high-resource ones. The core reason is the "curse of multilinguality": many languages compete for limited model capacity, which causes negative transfer among dissimilar languages.

Limitations of Prior Work: (a) X-MOD trains an adapter for each language, which requires language identification and makes the parameter count grow linearly with the number of languages; (b) X-ELM trains a separate model for each language group and fuses the top-\(m\) models during inference, which is costly; (c) no existing method precisely quantifies language similarity or identifies which layers require more capacity.

Key Challenge: How to flexibly expand the LLM capacity for a massive number of languages while promoting positive transfer among similar languages?

Goal: (a) Identify which layers are language-specific (requiring expansion); (b) Group languages to maximize intra-group similarity; (c) Support the dynamic adaptation of new languages.

Key Insight: Parameter deviation (\(\Delta\theta^x\)) is utilized to resolve both issues simultaneously: large layer-wise deviation implies that the layer requires MoE expansion; high similarity in deviation between languages implies they should be grouped to share an expert.

Core Idea: Fine-tune on each language for 10 steps to obtain \(\Delta\theta^x\) -> Group languages based on cosine similarity -> Expand layers with large deviations into MoE (each expert serves one language group) -> Dynamically adapt new languages by copying and fine-tuning the expert of the nearest group.
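
The first two steps of this idea (computing \(\Delta\theta^x\) and comparing languages by cosine similarity) can be illustrated with a minimal sketch. It assumes each model is represented as a hypothetical dict mapping layer names to flat lists of weights. The helper names `deviation` and `cosine` and the toy numbers are illustrative, not taken from the paper's code.

```python
import math

def deviation(base, finetuned):
    """Return delta-theta as one flat vector: finetuned minus base, layer by layer."""
    delta = []
    for name in sorted(base):  # fixed layer order so vectors are comparable
        delta.extend(f - b for f, b in zip(finetuned[name], base[name]))
    return delta

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy weights (made up): a base model and three briefly fine-tuned copies.
base = {"layer0": [0.0, 0.0], "layer1": [1.0, 1.0]}
tuned = {
    "es": {"layer0": [0.2, 0.1], "layer1": [1.0, 1.1]},
    "pt": {"layer0": [0.2, 0.2], "layer1": [1.0, 1.1]},
    "zh": {"layer0": [-0.3, 0.1], "layer1": [1.1, 0.9]},
}
dev = {lang: deviation(base, weights) for lang, weights in tuned.items()}
sim_es_pt = cosine(dev["es"], dev["pt"])
sim_es_zh = cosine(dev["es"], dev["zh"])
print(f"es-pt: {sim_es_pt:.3f}, es-zh: {sim_es_zh:.3f}")
```

In this toy setup the two deviation vectors that point in a similar direction (es, pt) score higher than the dissimilar pair (es, zh), which is the signal the grouping step relies on.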

Method

Overall Architecture

A three-step pipeline: (1) Compute parameter deviations for each language -> (2) Group languages + Select layers to expand -> (3) Train the MoE model (fine-tune corresponding experts for each language group).
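
The layer-selection part of step (2) could look roughly like the sketch below. It assumes per-language deviations are stored as a hypothetical dict of `layer index -> list of weight differences`, and it scores each layer by mean absolute deviation averaged over languages. That scoring rule and the names `layer_scores` and `select_layers` are assumptions for illustration; the summary only states that the layers with the largest deviations are expanded.

```python
def layer_scores(deviations):
    """Mean absolute deviation per layer, averaged over all languages."""
    scores = {}
    for per_layer in deviations.values():
        for layer, values in per_layer.items():
            mean_abs = sum(abs(v) for v in values) / len(values)
            scores[layer] = scores.get(layer, 0.0) + mean_abs / len(deviations)
    return scores

def select_layers(deviations, num_layers):
    """Return the indices of the num_layers layers with the largest score."""
    scores = layer_scores(deviations)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sorted(ranked[:num_layers])

# Toy deviations (made up): layers 0 and 3 move a lot, layers 1 and 2 barely move.
deviations = {
    "es": {0: [0.30, -0.20], 1: [0.02, 0.01], 2: [0.01, -0.03], 3: [0.25, 0.15]},
    "zh": {0: [-0.40, 0.35], 1: [0.03, -0.02], 2: [0.02, 0.02], 3: [-0.20, 0.30]},
}
moe_layers = select_layers(deviations, num_layers=2)
print(moe_layers)
```

Using a per-element mean rather than a raw norm keeps layers of different sizes comparable; a real implementation might prefer a different normalization.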

Key Designs

  1. Parameter Deviation as Language Representation

    • Function: Characterize the features of each language using \(\Delta\theta^x = \theta_{\text{finetuned}} - \theta_{\text{base}}\) (with only 10 steps of fine-tuning).
    • Mechanism: Language similarity = \(\cos(\Delta\theta^x, \Delta\theta^y)\). It is observed that layers near the input/output exhibit the largest deviations (language-specific), while middle layers show smaller deviations (representing a shared "conceptual space").
    • Design Motivation: Better captures representation differences inside the model compared to purely linguistic features like LANG2VEC.
  2. Language Grouping (Maximin Optimization)

    • Function: Evenly partition \(K\) languages into \(G\) groups to maximize the minimum similarity within each group.
    • Mechanism: A greedy algorithm—merging the most similar language pairs at each step and gradually expanding until the groups are full.
    • Design Motivation: High intra-group similarity leads to fewer gradient conflicts, making expert sharing more effective.
  3. Dynamic MoE Layer Expansion

    • Function: Expand only the top-\(\epsilon\) layers with the largest deviations into MoE layers (rather than all layers).
    • Mechanism: Each MoE layer contains \(G\) experts (one per language group). During training, tokens from the same language group are routed to the same expert. The router is trained with a language group classification loss: \(\mathcal{L}_{RC} = -\sum_x \sum_i \log P_i(l|x;\theta)\).
    • Total Loss: \(\mathcal{L} = \mathcal{L}_{CLM} + \alpha \mathcal{L}_{RC}\)
  4. Dynamic Adaptation for New Languages

    • Function: Locate the closest expert for a new language, then copy and fine-tune it.
    • Mechanism: Evaluate the new language's data with each expert -> The expert yielding the lowest PPL is treated as the most similar -> Copy this expert and fine-tune only the copy (with all other parameters frozen).
    • Design Motivation: Updating only the new expert avoids affecting learned languages, thereby mitigating catastrophic forgetting.
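
The greedy grouping in design 2 can be sketched as follows. This is one possible reading of "merge the most similar pairs and expand until the groups are full", not the paper's exact procedure. It assumes the number of languages divides evenly by the number of groups and that similarities are given as a symmetric nested dict `sim`. The function name `greedy_group` and the toy similarity values are hypothetical.

```python
from itertools import combinations

def greedy_group(langs, sim, num_groups):
    """Greedily build equal-size groups with high minimum intra-group similarity."""
    if len(langs) % num_groups != 0:
        raise ValueError("this sketch assumes equal-size groups")
    size = len(langs) // num_groups
    remaining = list(langs)
    groups = []
    while remaining:
        if size == 1:
            group = [remaining[0]]
        else:
            # Seed the group with the most similar unassigned pair.
            a, b = max(combinations(remaining, 2), key=lambda p: sim[p[0]][p[1]])
            group = [a, b]
        # Grow the group: add the language whose worst-case similarity
        # to the current members is highest (a maximin criterion).
        while len(group) < size:
            candidates = [lang for lang in remaining if lang not in group]
            best = max(candidates, key=lambda lang: min(sim[lang][m] for m in group))
            group.append(best)
        for lang in group:
            remaining.remove(lang)
        groups.append(group)
    return groups

# Toy similarities (made up), stored symmetrically.
langs = ["es", "pt", "zh", "ja"]
pairs = {("es", "pt"): 0.9, ("zh", "ja"): 0.8, ("es", "zh"): 0.1,
         ("es", "ja"): 0.2, ("pt", "zh"): 0.15, ("pt", "ja"): 0.1}
sim = {lang: {} for lang in langs}
for (a, b), s in pairs.items():
    sim[a][b] = s
    sim[b][a] = s

groups = greedy_group(langs, sim, num_groups=2)
print(groups)
```

A greedy pass like this is not guaranteed to find the optimal maximin partition; it trades optimality for a cost that stays manageable as the number of languages grows.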

Key Experimental Results

18-Language PPL (BLOOM-560M)

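| Method | Params | High-Resource Avg | Mid-Resource Avg | Low-Resource Avg | Overall Avg |
|---|---|---|---|---|---|
| BLOOM (Original) | 560M | - | - | - | 56.9 |
| + Pre-train | 560M | - | - | - | 21.6 |
| X-ELM | 5.03B | - | - | - | 21.5 |
| Branch-Train-Mix | 1.57B | - | - | - | 21.2 |
| DMoE (6 groups) | 937M | - | - | - | 19.5 |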

Ablation: Comparison of Grouping Methods

| Grouping Method | 18-Language Avg PPL |
|---|---|
| Random Grouping | 20.8 |
| LANG2VEC Grouping | 19.8 |
| Parameter Deviation Grouping (Ours) | 19.5 |

128-Language Scaling

| Method | Avg PPL |
|---|---|
| + Pre-train | 26.3 |
| DMoE | 23.1 |

Key Findings

  • Expanding only the highest-deviation layers is sufficient: Layers near the inputs and outputs show the largest deviation. Expanding only these layers yields most of the benefits, while middle layers can be shared without performance degradation.
  • Parameter deviation grouping outperforms linguistic grouping: The similarity "learned by the model" is more effective than "a priori linguistic knowledge". For example, Chinese and Japanese are grouped together (both use Chinese characters, known as Kanji in Japanese).
  • DMoE outperforms X-ELM (5.03B params) with only 937M parameters, i.e., about 5.4x fewer parameters in the 18-language BLOOM-560M setting.
  • New language adaptation does not harm old languages: Freezing other parameters while only fine-tuning the new expert results in nearly unchanged PPL on old languages.
  • Scalable from 18 to 128 languages: The grouping strategy remains effective on large-scale language sets.
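
The ratios implied by the tables above can be checked with a few lines of arithmetic. The numbers are copied from the 18-language and 128-language tables; the helper name `relative_reduction` is illustrative. The results differ somewhat from the aggregate figures in the TL;DR (11.4%, 9.6%, 3.6x), which are presumably computed over a different set of settings; that is an assumption, not something stated in this summary.

```python
def relative_reduction(baseline, ours):
    """Relative PPL reduction of `ours` against `baseline`, in percent."""
    return (baseline - ours) / baseline * 100.0

# 18-language setting (BLOOM-560M), overall average PPL.
param_ratio = 5030 / 937                      # X-ELM 5.03B vs DMoE 937M, in millions
vs_pretrain_18 = relative_reduction(21.6, 19.5)
vs_xelm_18 = relative_reduction(21.5, 19.5)

# 128-language setting, average PPL.
vs_pretrain_128 = relative_reduction(26.3, 23.1)

print(f"parameter ratio X-ELM / DMoE: {param_ratio:.2f}x")
print(f"18 languages, vs + Pre-train: {vs_pretrain_18:.2f}%")
print(f"18 languages, vs X-ELM:       {vs_xelm_18:.2f}%")
print(f"128 languages, vs + Pre-train: {vs_pretrain_128:.2f}%")
```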

Highlights & Insights

  • "Quantifying language similarity with only 10 fine-tuning steps" is extremely efficient: It avoids training extra language vectors or calculating gradient matrices; only 10 steps are sufficient to produce meaningful parameter deviation representations.
  • The intuition "large deviation = more capacity required" is well supported by the layer-wise analysis: This aligns with the finding that the conceptual space resides in the middle layers, whereas language-specific layers lie at both ends.
  • The combination of language grouping and MoE is natural: Compared to one adapter per language (X-MOD) or one model per language group (X-ELM), language group-based MoE offers a better balance between parameter efficiency and performance.
  • The dynamic adaptation design is valuable for practical deployment: Adding new languages does not require retraining the entire model, but only copying and fine-tuning the nearest expert.
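
The nearest-expert lookup behind dynamic adaptation can be sketched as below. It assumes per-token log-probabilities of the new language's text have already been computed with each expert active, passed in as a hypothetical dict `logprobs_by_expert`. Perplexity is the exponential of the mean negative log-likelihood. The expert names and numbers are made up.

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(mean negative log-likelihood) over the given tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def nearest_expert(logprobs_by_expert):
    """Return the expert with the lowest PPL on the new language, plus all PPLs."""
    ppl = {name: perplexity(lp) for name, lp in logprobs_by_expert.items()}
    return min(ppl, key=ppl.get), ppl

# Toy natural-log token probabilities (made up) for a new language.
logprobs_by_expert = {
    "group_romance": [-2.0, -2.5, -1.5],
    "group_cjk": [-4.0, -3.5, -4.5],
}
best, ppl = nearest_expert(logprobs_by_expert)
print(best, {name: round(value, 2) for name, value in ppl.items()})
```

The selected expert would then be copied and fine-tuned on the new language while every other parameter stays frozen, as described under design 4.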

Limitations & Future Work

  • The number of groups \(G\) is a hyperparameter: It must be manually selected, and different models/language sets may require different values of \(G\).
  • Routing relies on language classification: Although explicit language IDs are not required, the router's training is essentially language group classification.
  • Validated only on small models: Primarily BLOOM-560M and Gemma-2B; no experiments were conducted on models larger than 7B.
  • Fixed language groups: The group structures cannot be dynamically adjusted (e.g., merged or split) after training.
  • No comparison with the latest MoE routing strategies: Such as ST-MoE or the load balancing of Switch Transformer.

Related Work

  • vs X-MOD (Pfeiffer et al., 2022): X-MOD assigns one adapter per language and requires language IDs, whereas DMoE assigns one expert per group with automatic routing.
  • vs X-ELM (Blevins et al., 2024): X-ELM trains a full model for each group, whereas DMoE shares most parameters and only expands key layers, achieving 5x+ higher parameter efficiency.
  • vs Branch-Train-Mix (Sukhbaatar et al., 2024): BTM employs domain branches, whereas DMoE utilizes language branches + dynamic layer selection.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Quantifying language similarity with parameter deviation + selective MoE expansion; the idea is precise and elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covering 18–128 languages + multiple models + multiple grouping strategy ablations, though the model scales are relatively small.
  • Writing Quality: ⭐⭐⭐⭐ Clearly described method, with an intuitive three-step flowchart in Figure 2.
  • Value: ⭐⭐⭐⭐⭐ Holds direct practical value for the capacity expansion and language specialization of multilingual LLMs.