# Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

- Conference: ICLR 2026
- arXiv: 2512.23447
- Code: None
- Area: Model Architecture / MoE
- Keywords: Mixture-of-Experts, router-expert coupling, auxiliary loss, expert specialization, large language models
## TL;DR
This paper proposes the Expert-Router Coupling (ERC) loss, a lightweight auxiliary objective that treats the rows of the router parameter matrix as proxy tokens for cluster centroids and constrains expert activation norms against them, tightly coupling routing decisions to expert capabilities. The method requires only \(n^2\) probe activations, independent of the token count, and yields significant performance gains in MoE-LLMs.
## Background & Motivation
Mixture-of-Experts (MoE) is a core architecture in modern large language models, employing a router to select top-K experts per token and enabling efficient parameter scaling through sparse activation. However, conventional MoE suffers from a fundamental issue: there is no explicit constraint ensuring that routing decisions align with the actual capabilities of the experts.
Specifically:

- The router is a linear classifier \(\mathbf{R} \in \mathbb{R}^{n \times d}\) that assigns tokens via \(\text{softmax}(\mathbf{x}\mathbf{R}^\top)\).
- Experts are independent FFN modules with their own parameters \(\mathbf{W}_g, \mathbf{W}_p, \mathbf{W}_o\).
- The router has no direct access to expert parameters and can only learn routing strategies through trial and error.
- This frequently leads to misrouting: tokens are dispatched to experts ill-suited for them, and the resulting gradients interfere with expert specialization.
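For reference, here is a minimal sketch of the standard top-K routing step just described (PyTorch; all shapes and names are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def route(x: torch.Tensor, R: torch.Tensor, k: int):
    """Standard top-K MoE routing: softmax(x R^T), then pick K experts per token.

    x: (T, d) token representations; R: (n, d) router matrix.
    """
    probs = F.softmax(x @ R.T, dim=-1)        # (T, n) routing distribution over experts
    gate, expert_idx = probs.topk(k, dim=-1)  # K gate weights and expert indices per token
    return gate, expert_idx
```

Nothing in this step looks at the experts' parameters, which is exactly the disconnect ERC sets out to fix.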
The prior solution Autonomy-of-Experts (AoE) addresses this by having all experts partially process every token to obtain routing signals, but incurs substantially higher computational and memory overhead than standard MoE (1.6× training time, 1.3× memory), with costs that scale linearly with the number of tokens.
## Method

### Overall Architecture
The core idea of ERC Loss stems from an elegant observation: each row \(\mathbf{R}[i]\) of the router parameter matrix \(\mathbf{R}\) can be interpreted as the cluster centroid of the token set \(\mathcal{X}_i\) assigned to expert \(i\). Consequently, \(\mathbf{R}[i]\) serves as a proxy for tokens in \(\mathcal{X}_i\) and can be used to probe expert \(i\)'s responsiveness, without routing all tokens through all experts.
### Key Designs
- Proxy Token Generation (Step 1: Noise Injection):
  - Bounded multiplicative random noise is applied to each cluster centroid: \(\tilde{\mathbf{R}}[i] = \mathbf{R}[i] \odot \boldsymbol{\delta}_i\)
  - Noise \(\boldsymbol{\delta}_i \sim \mathcal{U}(1-\epsilon_i, 1+\epsilon_i)^d\) simulates intra-cluster token variation
  - Noise bound derivation: \(\epsilon_i \leq \frac{\|\mathbf{R}[i] - \mathbf{R}[j]\|}{2\|\mathbf{R}[i]\|}\), where \(j\) is the nearest neighboring centroid, ensuring that perturbed proxies do not cross cluster boundaries
  - \(\epsilon_i\) is recomputed per layer at every step, tracking cluster evolution during training
  - Key: the perturbed \(\tilde{\mathbf{R}}\) is used solely for loss computation; the original \(\mathbf{R}\) performs the actual routing
- Activation Matrix Computation (Step 2: Probing Expert Responses):
  - Each proxy token \(\tilde{\mathbf{R}}[i]\) is passed through the \(\mathbf{W}_g\) parameters of all \(n\) experts
  - An \(n \times n\) activation matrix is constructed: \(\mathbf{M}[i,j] = \|\tilde{\mathbf{R}}[i] \cdot \mathbf{W}_g^j\|\)
  - \(\mathbf{M}[i,j]\) reflects the response strength of expert \(j\) to the proxy token representing expert \(i\)'s assigned cluster
  - The \(\mathbf{W}_g\) activation norm is chosen over the final expert output because experiments show it is the most effective intermediate signal
- ERC Loss Function (Step 3: Bidirectional Constraints; see the sketch after this list):

  $$\mathcal{L}_{\text{ERC}} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j \neq i} \Big( \max(\mathbf{M}[i,j] - \alpha \mathbf{M}[i,i],\, 0) + \max(\mathbf{M}[j,i] - \alpha \mathbf{M}[i,i],\, 0) \Big)$$

  The two constraint terms are interpreted as follows:
  - Constraint 1 (\(\mathbf{M}[i,j] < \alpha \mathbf{M}[i,i]\)): Expert specialization. The activation of proxy token \(\tilde{\mathbf{R}}[i]\) on expert \(i\) must be substantially stronger than on any other expert, ensuring that expert \(i\) has specialized for its assigned token cluster.
  - Constraint 2 (\(\mathbf{M}[j,i] < \alpha \mathbf{M}[i,i]\)): Routing precision. Expert \(i\)'s response to its own proxy \(\tilde{\mathbf{R}}[i]\) must exceed its response to any other proxy \(\tilde{\mathbf{R}}[j]\), ensuring that \(\mathbf{R}[i]\) accurately represents expert \(i\)'s capability.
- Role of the Hyperparameter \(\alpha\):
  - \(\alpha \in [0, 1]\) controls the strength of the coupling
  - \(\alpha \to 0\): encourages orthogonality between \(\mathbf{R}[i]\) and the other experts' parameters, maximizing specialization
  - \(\alpha \to 1\): relaxes the constraints, permitting greater overlap among experts
  - \(\alpha\) also serves as an exploratory tool for understanding expert specialization: comparing performance across different \(\alpha\) values reveals the trade-off between specialization and collaboration
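The sketch below strings the three steps together in PyTorch. The stacked gate-weight layout `W_g` of shape \((n, d, D)\) and all names are assumptions made for illustration, not the paper's implementation:

```python
import torch

def erc_loss(R: torch.Tensor, W_g: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Sketch of the ERC auxiliary loss for one MoE layer.

    R:   (n, d) router matrix; row i is treated as the centroid of expert i's cluster.
    W_g: (n, d, D) stacked gate projections of the n experts (assumed layout).
    """
    n = R.shape[0]

    # Step 1: proxy tokens. Noise bound eps_i = ||R[i] - R[j]|| / (2 ||R[i]||),
    # with j the nearest other centroid, so proxies stay inside their clusters.
    with torch.no_grad():
        dist = torch.cdist(R, R)                        # (n, n) pairwise centroid distances
        dist.fill_diagonal_(float("inf"))
        eps = dist.min(dim=1).values / (2 * R.norm(dim=1))
        # bounded multiplicative noise delta_i ~ U(1 - eps_i, 1 + eps_i)^d
        delta = 1 + (2 * torch.rand_like(R) - 1) * eps.unsqueeze(1)
    proxies = R * delta                                 # R~; the unperturbed R still does the routing

    # Step 2: n x n activation matrix M[i, j] = ||R~[i] . W_g^j||
    M = torch.einsum("id,jde->ije", proxies, W_g).norm(dim=-1)

    # Step 3: bidirectional hinge constraints against alpha * M[i, i]
    diag = M.diagonal().unsqueeze(1)                    # (n, 1) self-responses M[i, i]
    off = ~torch.eye(n, dtype=torch.bool, device=M.device)
    specialization = torch.clamp(M - alpha * diag, min=0)  # push M[i, j] below alpha * M[i, i]
    precision = torch.clamp(M.T - alpha * diag, min=0)     # push M[j, i] below alpha * M[i, i]
    return (specialization[off].sum() + precision[off].sum()) / n**2
```

In a training loop this term would simply be added, per MoE layer, to the language-modeling objective; the summary above does not pin down the loss weight the paper uses.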
### Efficiency Analysis
- Computational overhead: only \(2n^2 D d\) additional FLOPs (\(d\): hidden size, \(D\): expert gate width), independent of the number of tokens \(T\)
- Practical impact: a 0.18% increase for a 3B model (\(n=64\)) and 0.82% for a 15B model (\(n=256\))
- Comparison with AoE: AoE adds \(2T(n-K)dr\) FLOPs (\(r\): its low-rank dimension), scaling linearly with token count
- Zero inference overhead: ERC Loss is used exclusively during training
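To make the fixed-versus-variable cost distinction concrete, here is a back-of-the-envelope comparison; every dimension below is an illustrative assumption, not a configuration reported in the paper:

```python
# ERC's probe cost is fixed in n; AoE's extra cost grows with the token count T.
n, K = 64, 8           # experts and active experts per token (3B-scale setup)
d, D = 2048, 1024      # hidden size and expert gate width (assumed)
r = 512                # AoE's low-rank dimension (assumed)
T = 2_000_000          # tokens per pre-training batch (assumed)

erc = 2 * n**2 * D * d         # probe all n experts with n proxies, once per step
aoe = 2 * T * (n - K) * d * r  # every token computes low-rank activations for n-K unused experts

print(f"ERC extra FLOPs: {erc:.2e} (independent of T)")
print(f"AoE extra FLOPs: {aoe:.2e}")
print(f"AoE / ERC: {aoe / erc:,.0f}x")
```

Under these (assumed) numbers, the fixed ERC probe is several orders of magnitude cheaper per step than AoE's token-dependent overhead.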
## Key Experimental Results

### Main Results (3B Parameter Model)
- 64 experts, \(K=8\), trained on 500B tokens
- ERC Loss substantially outperforms vanilla MoE and narrows the performance gap with AoE
- AoE requires ~1.6× training time and ~1.3× memory
### Scaling to 15B Parameter Model
| Benchmark | MoE | MoE + ERC | Gain |
|---|---|---|---|
| ARC-C | 63.2 | 64.6 | +1.4 |
| HellaSwag | 67.5 | 69.0 | +1.5 |
| MMLU | 31.0 | 31.9 | +0.9 |
| MMLU-Pro | 42.0 | 44.2 | +2.2 |
| BBH | 44.3 | 45.6 | +1.3 |
| MATH | 25.7 | 26.1 | +0.4 |
| GSM8K | 45.2 | 45.8 | +0.6 |
| Average | 47.2 | 49.1 | +1.9 |
AoE is prohibitively expensive to train at the 15B scale.
### Ablation Study
| Configuration | Key Finding |
|---|---|
| Different \(\alpha\) values | \(\alpha=1\) is optimal for 3B (\(n=64\)); \(\alpha=0.5\) is optimal for 15B (\(n=256\)) |
| Removing noise \(\boldsymbol{\delta}\) | Significant performance drop; coupling overfits to \(\mathbf{R}\) itself |
| Router orthogonalization only | Limited gain, as baseline routers are already near-orthogonal (cosine similarity 0.15) |
| \(\alpha > 1\) | \(\alpha=2\) yields marginal improvement; \(\alpha=3\) is nearly ineffective |
| Different activation choices | \(\tilde{\mathbf{R}} \mathbf{W}_g\) yields the best performance |
### Key Findings
- Specialization–collaboration trade-off: Extreme specialization is not always beneficial; an optimal degree of specialization exists. Smaller \(n\) favors generalist experts, while larger \(n\) supports higher specialization: the optimal \(\alpha\) is 1 for the 3B model (\(n=64\)) and 0.5 for the 15B model (\(n=256\)).
- Noise bound \(\epsilon\) as a specialization metric: \(\epsilon\) is strongly correlated with \(\alpha\) and can quantitatively track changes in expert specialization throughout training.
- t-SNE visualization: Expert parameters in vanilla MoE form no meaningful clusters, whereas adding ERC Loss produces clearly defined clusters.
- Parameter norm analysis: The model reduces ERC Loss by learning meaningful coupling rather than simply manipulating parameter norms.
## Highlights & Insights
- Elegant clustering perspective: Interpreting router parameters as cluster centroids and using them as proxies to probe expert capabilities is both concise and powerful, circumventing the high cost of routing all tokens through all experts.
- Fixed vs. variable cost: The \(O(n^2)\) computation is independent of batch size; in pre-training settings where each batch contains millions of tokens, this fixed overhead is negligible.
- Controllable specialization exploration: \(\alpha\) functions both as a training hyperparameter and an exploratory tool, while \(\epsilon\) provides a quantitative measure, offering a new perspective for understanding MoE behavior.
- Challenging conventional wisdom: Experiments demonstrate that "more specialization is not always better"—over-specialization is detrimental in smaller-scale MoE models.
## Limitations & Future Work
- Manual tuning of \(\alpha\): The optimal \(\alpha\) varies across model configurations (\(n\), \(K\), depth), and no automatic selection method is currently available.
- Linear router assumption: The cluster centroid interpretation relies on the softmax linear router; applicability to non-linear routing mechanisms is unexplored.
- Not tested with shared expert mechanisms: Shared experts as used in DeepSeek-style models may alter the optimal degree of specialization.
- Validated only in pre-training: Effects on fine-tuning and continual learning remain unknown.
- Limited comparison with MoE variants: Comparisons with systems and architectures such as MegaBlocks and Switch Transformer are absent.
## Related Work & Insights
- Autonomy-of-Experts (AoE) (Lv et al., 2025): Encodes routing into expert parameters and selects routes via activation norms; effective but costly. The proposed method can be viewed as a lightweight alternative to AoE.
- Switch Transformer (Fedus et al., 2022): Introduces a load balancing loss; ERC Loss is compatible with it (load balance discrepancy on the order of \(10^{-5}\)).
- OLMoE: The implementation in this paper is built upon this open-source MoE framework.
- DeepSeek-MoE: Introduces shared experts to promote specialization, complementary in direction to ERC Loss.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — The clustering perspective combined with proxy token probing is a clever design; a fixed-cost auxiliary loss with strong practical utility.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Scales from 3B to 15B with extensive ablations, analyses, and specialization exploration.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear exposition, intuitive three-step framework, and comprehensive appendix.
- Value: ⭐⭐⭐⭐⭐ — Practical, efficient, and generalizable; directly improves MoE pre-training with a concise implementation.