Expert Divergence Learning for MoE-based Language Models
Conference: ICLR 2026 | arXiv: 2603.00054 | Code: Not released | Area: LLM Efficiency / MoE | Keywords: Mixture of Experts, expert homogenization, routing diversity, Jensen-Shannon divergence, domain specialization
TL;DR
This paper addresses the expert homogenization problem in MoE training by maximizing the Jensen-Shannon (JS) divergence between the routing distributions of different data domains, encouraging distinct expert subsets to be activated for each domain. The approach improves expert specialization and language-modeling performance across three model scales (3B-A0.3B, 8B-A0.8B, 15B-A1.5B), with the largest gains at 15B-A1.5B.
Background & Motivation
Background: Mixture-of-Experts (MoE) models achieve high parameter counts with low computation through sparse activation, but training frequently suffers from "expert homogenization"—different experts learn highly similar functions, wasting parameter capacity.
Limitations of Prior Work: Existing methods such as load balancing losses only ensure uniform expert utilization, without guaranteeing that different experts learn distinct skills. Experts may be used uniformly yet remain functionally equivalent.
Key Challenge: Load balancing and functional specialization are fundamentally different objectives—uniform utilization does not imply distinct expertise.
Core Idea: Different data domains should activate different combinations of experts. Expert specialization can be encouraged by maximizing the JS divergence between inter-domain routing distributions.
Method
Overall Architecture
An expert divergence loss \(\mathcal{L}_{ED}\) is added to the standard MoE training objective (language modeling loss \(\mathcal{L}_{LM}\) plus load balancing loss \(\mathcal{L}_{LB}\)): \(\mathcal{L}_{final} = \mathcal{L}_{LM} + \alpha \mathcal{L}_{LB} + \beta \mathcal{L}_{ED}\)
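A minimal sketch of this combination (the code is not released; the function and argument names below are illustrative assumptions):

```python
import torch

def combined_loss(lm_loss: torch.Tensor,
                  lb_loss: torch.Tensor,
                  ed_loss: torch.Tensor,
                  alpha: float, beta: float) -> torch.Tensor:
    # L_final = L_LM + alpha * L_LB + beta * L_ED
    return lm_loss + alpha * lb_loss + beta * ed_loss
```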
Key Designs
- Three-level aggregation: routing probabilities are aggregated hierarchically at the Token → Sequence → Domain levels
  - Token level: each token obtains a probability distribution over the \(N\) experts via the router, denoted \(p(x_t)\)
  - Sequence level: \(\bar{p}_s = \frac{1}{T}\sum_{t=1}^T p(x_t)\), the average of all token distributions within a sequence
  - Domain level: \(\bar{p}_j = \frac{1}{|\mathcal{B}_j|}\sum_{s \in \mathcal{B}_j} \bar{p}_s\), grouped and averaged by domain label
- JS divergence maximization: \(\mathcal{L}_{ED} = -\frac{1}{\binom{M_B}{2}}\sum_{j<k} \log\left(D_{JS}(\bar{p}_j \,\|\, \bar{p}_k) + \epsilon\right)\), where \(M_B\) is the number of domains present in the batch
  - Maximizes the Jensen-Shannon divergence between the routing distributions of every pair of domains
  - The negative log transformation amplifies gradients for small divergence values, preventing gradient vanishing (a code sketch of the full computation follows this list)
- Domain labeling schemes: two granularities
  - 3-Class: three coarse domains (English / Chinese / Math), derived directly from the data sources
  - 49-Class: a classifier maps English → 24 topics, Chinese → 24 topics, and Math → 1 category, yielding 49 fine-grained domains
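A minimal PyTorch sketch of the three-level aggregation and the ED loss, reconstructed from the description above (the official code is not released, so the function name, shapes, and `eps` default are assumptions):

```python
import torch
import torch.nn.functional as F

def expert_divergence_loss(router_probs: torch.Tensor,
                           domain_ids: torch.Tensor,
                           eps: float = 1e-6) -> torch.Tensor:
    """router_probs: [B, T, N] token-level routing distributions over N
    experts; domain_ids: [B] integer domain label per sequence."""
    # Sequence level: average the token distributions within each sequence.
    seq_probs = router_probs.mean(dim=1)                             # [B, N]

    # Domain level: average the sequence distributions per domain in the batch.
    domains = domain_ids.unique()
    if domains.numel() < 2:             # need at least two domains per batch
        return router_probs.new_zeros(())
    domain_probs = torch.stack(
        [seq_probs[domain_ids == d].mean(dim=0) for d in domains])   # [M_B, N]

    # -log(JSD + eps), averaged over all C(M_B, 2) domain pairs.
    terms = []
    for j in range(len(domains)):
        for k in range(j + 1, len(domains)):
            p, q = domain_probs[j], domain_probs[k]
            m = 0.5 * (p + q)
            # D_JS(p || q) = 0.5 * KL(p || m) + 0.5 * KL(q || m);
            # F.kl_div(input, target) computes KL(target || exp(input)).
            jsd = 0.5 * (F.kl_div(m.log(), p, reduction="sum")
                         + F.kl_div(m.log(), q, reduction="sum"))
            # -log amplifies gradients when the divergence is small.
            terms.append(-torch.log(jsd + eps))
    return torch.stack(terms).mean()
```

In training, this scalar would be added to the objective with weight \(\beta\), as in the formula above.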
Theoretical Motivation — Diversity Decomposition
- Decomposition Theorem (Proposition 1): Total routing diversity \(D_{total} = D_{inter} + D_{intra}\) (a worked form of this identity is sketched after this list)
- \(D_{inter}\): inter-domain divergence — the degree to which different domains activate different experts
- \(D_{intra}\): intra-domain divergence — the degree to which tokens within the same domain activate different experts
- Proposition 2: \(\mathcal{L}_{ED}\) directly increases \(D_{inter}\), redistributing global diversity toward inter-domain differences
- Standard \(\mathcal{L}_{LB}\) only attends to \(D_{total}\) without controlling its allocation, whereas \(\mathcal{L}_{ED}\) provides finer directional guidance
- The two losses are complementary: \(\mathcal{L}_{LB}\) ensures sufficient total diversity, while \(\mathcal{L}_{ED}\) channels that diversity into inter-domain differences, promoting expert specialization
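One plausible instantiation of the decomposition, offered as a sanity check rather than the paper's exact statement: measure diversity as the average KL divergence to the corresponding mean routing distribution, with \(p_t\) the token routings, \(\bar{p}_j\) the domain means, \(\bar{p}\) the global mean, and \(w_j = T_j / T\) the fraction of tokens in domain \(j\). A chain-rule argument then yields the exact identity

\[
D_{total} = \frac{1}{T}\sum_t D_{KL}(p_t \,\|\, \bar{p})
= \underbrace{\sum_j w_j\, D_{KL}(\bar{p}_j \,\|\, \bar{p})}_{D_{inter}}
+ \underbrace{\sum_j w_j \cdot \frac{1}{T_j}\sum_{t \in \text{domain } j} D_{KL}(p_t \,\|\, \bar{p}_j)}_{D_{intra}},
\]

which follows from splitting \(\log(p_t/\bar{p})\) into \(\log(p_t/\bar{p}_j) + \log(\bar{p}_j/\bar{p})\) and noting that the tokens of domain \(j\) average to \(\bar{p}_j\). The paper's sequence-then-domain averaging would change only the weights, not the structure of the identity.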
Key Experimental Results
Main Results (three model scales, pre-trained from scratch on 100B tokens)
| Model | Method | CEval | MMLU | CMMLU | ARC-e | ARC-c | RACE-m | RACE-h | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| 15B-A1.5B | Standard MoE | 28.0 | 25.8 | 25.6 | 47.4 | 28.2 | 50.5 | 43.6 | 35.59 |
| 15B-A1.5B | +ED (49-class) | 28.9 | 27.1 | 26.3 | 48.6 | 28.5 | 51.7 | 45.5 | 36.65 |
| 8B-A0.8B | Standard MoE | 25.8 | 24.5 | 25.0 | 43.2 | 23.6 | 42.7 | 36.5 | 31.61 |
| 8B-A0.8B | +ED (49-class) | 26.1 | 25.2 | 25.2 | 44.1 | 24.9 | 44.3 | 38.2 | 32.57 |
| 3B-A0.3B | Standard MoE | 23.8 | 23.1 | 24.2 | 35.0 | 22.6 | 37.8 | 32.1 | 28.37 |
| 3B-A0.3B | +ED (49-class) | 24.5 | 23.4 | 24.5 | 36.2 | 22.8 | 37.5 | 32.8 | 28.81 |
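Working out the average-score gains from the table: 15B: 36.65 − 35.59 = +1.06; 8B: 32.57 − 31.61 = +0.96; 3B: 28.81 − 28.37 = +0.44, consistent with the scaling trend reported below.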
Training Dynamics and Expert Analysis
| Analysis Dimension | Finding |
|---|---|
| LM loss | All ED configurations converge to lower \(\mathcal{L}_{LM}\); all \(\beta\) settings outperform the baseline |
| Domain granularity | 49-class > 3-class > baseline; finer-grained domain labels yield greater benefit |
| Expert specialization | Layer 4 exhibits substantially higher specialization than other layers (middle-layer experts are most differentiated) |
| Computational overhead | Negligible additional training cost (only inter-domain divergence computation per batch is required) |
| Scaling effect | Performance gains increase with model scale (15B > 8B > 3B) |
Key Findings
- Load balancing \(\neq\) functional specialization: uniform utilization does not guarantee distinct expertise
- The ED loss guides experts to develop differentiated routing strategies across domains, forming an organized expert team
- The 49-class fine-grained domain classification outperforms the 3-class scheme, indicating that the informativeness of domain labels directly affects specialization quality
Highlights & Insights
- Paradigm shift from balance to specialization: Standard MoE training focuses on load balancing (\(D_{total}\)), whereas this paper targets functional specialization (\(D_{inter}\))—a more fundamental objective
- Leveraging domain labels: Pre-training data domain labels serve as free supervisory signals for guiding expert specialization, requiring zero additional annotation cost
- Choice of JS divergence: The symmetric and bounded JS divergence is better suited than the asymmetric, unbounded KL divergence for measuring differences between routing distributions (definition recalled after this list)
- Theoretical clarity: The diversity decomposition theorem elegantly reveals the complementary relationship between \(\mathcal{L}_{LB}\) and \(\mathcal{L}_{ED}\)
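For reference, the standard definition behind that claim (a textbook identity, not specific to this paper): with mixture \(m = \frac{1}{2}(p + q)\),

\[
D_{JS}(p \,\|\, q) = \tfrac{1}{2} D_{KL}(p \,\|\, m) + \tfrac{1}{2} D_{KL}(q \,\|\, m),
\qquad 0 \le D_{JS} \le \log 2,
\]

so it is symmetric in \(p, q\) and bounded, whereas KL is asymmetric and can diverge when supports mismatch.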
Limitations & Future Work
- Domain labels are required; the method does not directly apply in fully unlabeled settings (though automatic labeling via a classifier is feasible, as demonstrated in this work)
- Validation is limited to three model scales (3B/8B/15B) and a relatively small training budget (100B tokens)
- The domain classification granularity (49 vs. 3) must be set manually; adaptive determination of optimal granularity remains an open problem
- Interaction effects with shared-expert architectures (e.g., DeepSeek-MoE) have not been explored
- Whether domain-label-guided fine-tuning after pre-training can retroactively induce specialization is left unexplored
Related Work & Insights
- vs. DeepSeek-MoE: DeepSeek employs shared experts to capture common knowledge and mitigate routing expert redundancy, whereas this work uses inter-domain divergence maximization to directly drive routing expert differentiation—the two approaches are orthogonal and potentially composable
- vs. ERNIE 4.5: ERNIE enforces orthogonality of router weight matrices (unsupervised), while this work uses domain labels (supervised) to guide specialization—the supervised approach proves more effective
- vs. Qiu et al. (global LB): Global batch load balancing enhances overall diversity; this work further steers the allocation of that diversity
- Insight: The "divide and conquer" design intent of MoE requires explicit support from the training objective; otherwise, experts degrade into "redundant generalists"
Supplementary Analysis
- Core insight: load balancing only encourages global routing diversity without directing how that diversity is distributed — \(\mathcal{L}_{ED}\) uses domain labels to redirect diversity toward inter-domain differences
- The diversity decomposition (\(D_{total} = D_{inter} + D_{intra}\)) is elegant — \(\mathcal{L}_{LB}\) promotes \(D_{total}\), while \(\mathcal{L}_{ED}\) steers it toward \(D_{inter}\)
- The superiority of 49-class over 3-class suggests that finer domain granularity enables more refined division of labor among experts
- Performance gains scale positively with model size (3B < 8B < 15B), indicating that larger models have greater untapped potential that benefits from structured specialization
- Computational overhead is nearly zero — \(\mathcal{L}_{ED}\) is computed solely from existing routing logits via JSD
Rating
- Novelty: ⭐⭐⭐⭐ Expert specialization via inter-domain divergence maximization is a novel angle; the theoretical decomposition is elegant
- Experimental Thoroughness: ⭐⭐⭐⭐ Three model scales + two domain classification granularities + expert behavior analysis
- Writing Quality: ⭐⭐⭐⭐ Problem analysis is clear; theoretical motivation is complete
- Value: ⭐⭐⭐⭐ Practically useful for MoE training; domain label utilization incurs low cost