
Online Mixture of Experts: No-Regret Learning for Optimal Collective Decision-Making

Conference: NeurIPS 2025 arXiv: 2510.21788 Code: None Area: Online Learning / Ensemble Methods Keywords: online learning, mixture of experts, bandit, no-regret, LLM ensemble

TL;DR

This paper proposes an Online Mixture of Experts (OMoE) framework comprising two algorithms — UCB-Successive Elimination and Online Weighted Majority Voting — with theoretical no-regret guarantees, and applies them to the online dynamic aggregation of LLM experts.

Background & Motivation

Background: Mixture of Experts (MoE) has been widely studied in offline settings, but systematic research on dynamically aggregating multiple expert outputs for optimal collective decision-making in online settings remains lacking.

Limitations of Prior Work: Traditional MoE assumes fixed gating/routing mechanisms and cannot adapt to dynamic changes in expert quality under streaming data; multi-armed bandit frameworks assume independent arms and fail to exploit the combinatorial structure among experts.

Key Challenge: The classical exploration–exploitation trade-off — trying new expert combinations versus leveraging known optimal ones — is considerably more complex in the aggregation dimension than standard bandit problems.

Key Insight: Expert aggregation can be cast as a contextual bandit problem, so UCB-style exploration and weighted voting can be combined in a principled, provable way.

Core Idea: MoE is reformulated as an online bandit learning problem, yielding theoretically grounded no-regret expert aggregation algorithms.

Method

Overall Architecture

Given \(K\) experts and context \(x_t\), at each round the algorithm selects a subset of experts (a committee) and aggregation weights, then observes the accuracy of the aggregated output as feedback. The objective is to minimize cumulative regret \(R_T = \sum_{t=1}^T [r^* - r_t]\), where \(r^*\) is the per-round reward of the best fixed committee in hindsight.
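The interaction protocol above can be sketched as a small loop; the function names (`select`, `observe`, `update`) are hypothetical placeholders for the learner's components, not the paper's interface:

```python
def run_online_moe(select, observe, update, T):
    """Generic protocol skeleton: each round pick a committee and weights,
    receive aggregated-accuracy feedback, and update the learner."""
    rewards = []
    for t in range(T):
        committee, weights = select(t)
        r = observe(committee, weights)   # accuracy of the aggregated output
        update(committee, weights, r)
        rewards.append(r)
    return rewards

def cumulative_regret(rewards, r_star):
    """R_T = sum_t (r* - r_t), with r* the best fixed committee's reward."""
    return sum(r_star - r for r in rewards)
```

Any concrete algorithm (UCB-SE, OWMV) then only needs to supply the three callbacks.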

Key Designs

  1. Algorithm 1: UCB-Successive Elimination (UCB-SE)

    • Function: Progressively eliminates suboptimal expert committees via UCB confidence bounds.
    • Mechanism: Maintains UCB estimates \(\hat{\mu}_k + \sqrt{\frac{2\ln t}{n_k}}\) for each candidate committee.
    • Elimination Rule: A committee is eliminated when its UCB falls below the LCB of the current best committee.
    • Regret Bound: \(R_T = O(\sqrt{KT \log T})\)
    • Novelty: Efficient pruning reduces exploration waste.
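A minimal sketch of the UCB-SE idea, assuming committees are treated as arms with rewards in \([0,1]\); the interface (`committees`, `pull`) and the round-robin sampling are illustrative assumptions, not the paper's exact procedure:

```python
import math

def ucb_se(committees, pull, T):
    """UCB-Successive Elimination sketch: round-robin over active
    committees, drop any whose UCB falls below the best LCB."""
    active = list(range(len(committees)))
    mean = [0.0] * len(committees)
    n = [0] * len(committees)
    for t in range(1, T + 1):
        k = active[t % len(active)]          # round-robin over survivors
        r = pull(committees[k])              # observe reward in [0, 1]
        n[k] += 1
        mean[k] += (r - mean[k]) / n[k]      # running average
        rad = lambda i: math.sqrt(2 * math.log(max(t, 2)) / max(n[i], 1))
        best_lcb = max(mean[i] - rad(i) for i in active)
        active = [i for i in active
                  if n[i] == 0 or mean[i] + rad(i) >= best_lcb]
        if len(active) == 1:
            break
    return committees[max(active, key=lambda i: mean[i])]
```

Pruned committees are never sampled again, which is where the exploration savings over plain UCB1 come from.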
  2. Algorithm 2: Online Weighted Majority Voting (OWMV)

    • Function: Dynamically assigns voting weights to experts based on their historical performance.
    • Update Rule: \(w_k^{(t+1)} = w_k^{(t)} \cdot \beta^{\mathbb{1}[\text{expert } k \text{ incorrect}]}\)
    • The decay factor \(\beta \in (0,1)\) penalizes incorrect experts.
    • Regret Bound: \(R_T = O(\sqrt{T \log K})\)
    • Novelty: Eliminates the need to enumerate all committee combinations.
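The update rule above is a one-line multiplicative-weights step; a sketch with an accompanying weighted-vote aggregator (function names are my own, not the paper's):

```python
def owmv_update(weights, correct, beta=0.9):
    """One OWMV step: multiply w_k by beta whenever expert k was wrong.

    beta in (0, 1) is the decay factor; smaller beta penalizes harder.
    """
    return [w * (1.0 if ok else beta) for w, ok in zip(weights, correct)]

def weighted_vote(weights, votes):
    """Aggregate expert votes by summing weight per candidate answer."""
    score = {}
    for w, v in zip(weights, votes):
        score[v] = score.get(v, 0.0) + w
    return max(score, key=score.get)
```

Because each expert carries its own weight, no enumeration of committee subsets is needed.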
  3. LLM Application: Online Expert LLM Aggregation

    • Scenario: Dynamic aggregation of \(K\) specialized LLMs (code, reasoning, creative writing, etc.).
    • Expert weights or the optimal committee are updated after each query.
    • Aggregation: Weighted voting combined with top-\(k\) selection.
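The "weighted voting combined with top-\(k\) selection" step can be sketched as follows; the interface is assumed for illustration (the paper does not specify this exact function):

```python
def topk_weighted_vote(weights, answers, k=3):
    """Keep only the k highest-weight experts, then weighted-vote
    over their answers (top-k selection + weighted voting)."""
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    score = {}
    for i in top:
        score[answers[i]] = score.get(answers[i], 0.0) + weights[i]
    return max(score, key=score.get)
```

Low-weight experts are excluded before voting, so a consistently wrong specialist cannot dilute the committee's answer.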

Theoretical Guarantees

  • Theorem 1 (UCB-SE): For \(K\) experts and \(T\) rounds, \(R_T \leq O\!\left(\sum_{k \neq k^*} \frac{\log T}{\Delta_k}\right)\), where \(\Delta_k\) denotes the suboptimality gap.
  • Theorem 2 (OWMV): \(R_T \leq O(\sqrt{T \log K})\), consistent with Hedge/MW algorithms but applicable under bandit feedback.
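For intuition on Theorem 2's rate, here is the standard randomized-weighted-majority bound under full information with \(\{0,1\}\) losses (the paper's bandit-feedback argument necessarily differs); \(m\) denotes the mistakes of the best expert:

```latex
\mathbb{E}[M] \;\le\; \frac{m \ln(1/\beta) + \ln K}{1 - \beta}
\qquad\Longrightarrow\qquad
\mathbb{E}[M] - m \;=\; O\!\left(\sqrt{T \ln K}\right)
\;\text{ for suitably tuned } \beta.
```

The same \(\sqrt{T \log K}\) dependence on the horizon and the number of experts appears in the OWMV bound.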

Key Experimental Results

Main Results — Synthetic Data: Cumulative Regret Comparison (\(K=10\), \(T=10000\))

| Algorithm | Cumulative Regret | Convergence Round |
| --- | --- | --- |
| Uniform Random | 2847 | |
| EXP3 | 892 | ~3000 |
| UCB1 | 743 | ~2500 |
| UCB-SE (Ours) | 531 | ~1800 |
| OWMV (Ours) | 487 | ~1500 |

LLM Expert Aggregation — Accuracy (\(K=5\) Expert LLMs)

| Method | MMLU↑ | GSM8K↑ | HumanEval↑ | Avg.↑ |
| --- | --- | --- | --- | --- |
| Best Single Expert | 68.2 | 71.5 | 62.8 | 67.5 |
| Fixed Uniform Weights | 65.1 | 68.3 | 60.2 | 64.5 |
| Oracle (Hindsight Optimal) | 73.8 | 76.2 | 68.1 | 72.7 |
| UCB-SE | 71.2 | 73.8 | 65.9 | 70.3 |
| OWMV | 72.1 | 74.5 | 66.7 | 71.1 |

Ablation Study — Effect of Number of Experts \(K\) (OWMV)

| \(K\) | Regret at \(T=1000\) | Regret at \(T=5000\) | Regret at \(T=10000\) |
| --- | --- | --- | --- |
| 3 | 87 | 195 | 312 |
| 5 | 134 | 289 | 421 |
| 10 | 198 | 387 | 487 |
| 20 | 276 | 512 | 634 |

Key Findings

  • OWMV achieves the lowest regret and fastest convergence across all settings.
  • UCB-SE is particularly efficient when the number of experts is small, owing to effective pruning.
  • In the LLM aggregation setting, OWMV surpasses the best single expert by roughly 3–4 accuracy points and approaches the Oracle upper bound.
  • The regret growth rate is consistent with the theoretical prediction of \(O(\sqrt{T \log K})\).
  • From approximately round 1000 onward, OWMV stably identifies the optimal expert committee.

Highlights & Insights

  • Dual Theoretical-Applied Contribution: The paper provides both rigorous regret bound proofs and practical application to LLM aggregation.
  • Algorithmic Simplicity: OWMV is straightforward to implement with negligible additional computational overhead.
  • Dynamic LLM Routing: The framework offers a theoretical foundation for expert selection during large model inference.

Limitations & Future Work

  • The current framework assumes stationary expert quality over time; non-stationary settings require extensions.
  • The aggregation mechanism in LLM experiments is relatively simple (majority voting); token-level mixing may be more effective.
  • Theoretical guidance for selecting committee size is lacking.
  • The relationship with MoE routing methods (e.g., Switch Transformer) warrants further investigation.

Related Work

  • Hedge (Freund & Schapire 1997): Classic weighted majority voting algorithm.
  • EXP3 (Auer et al. 2002): Online learning under bandit feedback.
  • Switch Transformer (Fedus et al. 2022): Sparse MoE routing.
  • Insight: No-regret theory can be applied to online optimization of MoE routing.

Rating

  • Novelty: ⭐⭐⭐⭐ Novel bandit perspective on MoE
  • Experimental Thoroughness: ⭐⭐⭐⭐ Validated on both synthetic and LLM settings
  • Writing Quality: ⭐⭐⭐⭐ Algorithm presentation is clear
  • Value: ⭐⭐⭐⭐ Practical value for dynamic LLM aggregation