Near-Optimal Online Deployment and Routing for Streaming LLMs¶
Conference: ICLR 2026 arXiv: 2506.17254 Code: None Area: LLM NLP / Systems Optimization Keywords: LLM routing, online deployment, streaming bandits, concurrency cap, budget constraints
TL;DR¶
This work provides the first formal treatment of joint online deployment and routing for streaming LLMs, where new models continuously arrive and existing ones may become obsolete. Under a concurrency deployment cap \(M_{\max}\) and cost budget constraints, the paper proposes StageRoute, a hierarchical algorithm that achieves a provable \(\tilde{\mathcal{O}}(T^{2/3})\) regret bound with a matching lower bound, establishing near-optimality.
Background & Motivation¶
Background: Extensive prior work on LLM routing (e.g., RouteLLM, Hybrid-LLM, Zooter) assumes a fixed model pool. In practice, however, new models are continuously released, older ones gradually become obsolete, and serving platforms impose hard resource limits: for instance, Azure OpenAI limits each resource to 32 standard deployments plus 5 fine-tuned deployments, and GPT-4.1 enforces rate limits of 1,000 RPM and 1M TPM.
Coupling Across Two Time Scales: Real-world LLM serving requires decisions at two fundamentally different time scales:

- Macro (stage-wise): deciding which models to keep deployed, subject to the concurrency cap \(M_{\max}\) and fixed for the entire stage. This is a high-stakes decision: activating an uncertain new model may necessitate evicting a known-reliable model for a full stage.
- Micro (per-query): routing each query to one of the currently deployed models, subject to budget and throughput constraints.

Deployment decisions determine the entire action space for routing, a fundamental prerequisite overlooked by prior routing work.
Limitations of Prior Work: Static-pool routing (RouteLLM) lacks dynamic deployment; budget-aware routing (TensorOpera) ignores concurrency caps; streaming bandits (UniRoute/CSCR) lack stage-level commitment; BwK models consumable budgets but not active set replacement. No existing framework simultaneously addresses all five dimensions: streaming arrivals + dynamic deployment + concurrency cap + budget + throughput.
Core Idea: The paper models LLM deployment and routing as a coupled two-timescale online decision problem and proposes StageRoute — using optimistic estimation for macro-level deployment selection and LP-based real-time routing at the micro level — achieving a near-optimal \(\tilde{\mathcal{O}}(T^{2/3})\) regret bound.
Method¶
Overall Architecture¶
StageRoute is a two-tier hierarchical algorithm mirroring the problem's layered structure:

- Strategic Layer (Deployment Stage): at each update point \(\tau_k\), selects at most \(M_{\max}\) models to deploy from a candidate pool that includes newly arrived models; the deployment set remains fixed throughout the stage.
- Tactical Layer (Routing Stage): upon each query arrival, solves an LP over the deployed models to obtain a routing distribution, then samples a model and dispatches the query accordingly.
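The two-tier control flow can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation (no code is released): simple greedy UCB rules stand in for the paper's MIP-based deployment step and LP-based routing step, and `quality` is a synthetic ground truth used only to simulate feedback.

```python
import random

def stage_route(arrivals, quality, M_max, K, T, seed=0):
    """Illustrative sketch of StageRoute's two-tier loop.
    arrivals[k]: models newly available at stage k.
    quality[m]:  each model's true mean reward (simulation only)."""
    rng = random.Random(seed)
    stats = {}               # model -> (reward_sum, pull_count)
    pool, total = [], 0.0

    def ucb(m):              # optimistic mean; infinite for unseen models
        s, n = stats.get(m, (0.0, 0))
        return float("inf") if n == 0 else s / n + (2.0 / n) ** 0.5

    for k in range(K):
        pool += arrivals[k]
        # Strategic layer: deploy the M_max most optimistic models,
        # fixed for the whole stage (stand-in for DeployOPT).
        deployed = sorted(pool, key=ucb, reverse=True)[:M_max]
        for _ in range(T // K):
            # Tactical layer: route each query to the current UCB
            # maximizer (stand-in for RouteOPT's routing distribution).
            m = max(deployed, key=ucb)
            r = 1.0 if rng.random() < quality[m] else 0.0
            s, n = stats.get(m, (0.0, 0))
            stats[m] = (s + r, n + 1)
            total += r
    return total, stats

# Example: a strong model arriving at stage 1 should attract most traffic.
# stage_route([["a", "b"], ["c"], [], []],
#             {"a": 0.2, "b": 0.5, "c": 0.9}, M_max=2, K=4, T=400)
```

The key structural point the sketch preserves is the asymmetry of commitment: the deployed set is frozen for `T // K` queries, while routing adapts at every query.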
Key Designs¶
- Deployment Optimization (DeployOPT): Formulated as a mixed-integer program (MIP) with binary variables \(z_m\) for model selection and continuous variables \(d_m\) for weights. UCB estimates \(\mu_m^U\) upper-bound performance and LCB estimates \(c_m^L\) lower-bound costs; the objective maximizes optimistic performance subject to budget and concurrency constraints. The solution \(d^*\) is used solely for model set selection and not as routing probabilities.
- Routing Optimization (RouteOPT): At each query arrival, solves a linear program over the deployed set to maximize UCB-estimated expected reward subject to budget and throughput constraints, with UCB/LCB updated in real time using current statistics.
- Confidence Bound Design: \(\mu_m^U = \text{proj}_{[0,1]}(\bar{\mu}_m + 2f_{rad})\), \(c_m^L = \text{proj}_{[c_1,c_2]}(\bar{c}_m - 2f_{rad})\), where \(f_{rad}(v,n) = \sqrt{\gamma v/n} + \gamma/n\).
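The confidence-bound formulas above translate directly into code. A small sketch, assuming `v` is the empirical variance proxy and `n` the sample count fed to \(f_{rad}\) (the paper's exact arguments; variable names are mine):

```python
def f_rad(v, n, gamma):
    """Confidence radius sqrt(gamma * v / n) + gamma / n."""
    return (gamma * v / n) ** 0.5 + gamma / n

def proj(x, lo, hi):
    """Projection onto the interval [lo, hi]."""
    return min(max(x, lo), hi)

def bounds(mu_bar, c_bar, v, n, gamma, c1, c2):
    """Optimistic reward UCB (projected to [0, 1]) and pessimistic
    cost LCB (projected to [c1, c2]), as in the paper's design."""
    rad = 2 * f_rad(v, n, gamma)
    mu_U = proj(mu_bar + rad, 0.0, 1.0)
    c_L = proj(c_bar - rad, c1, c2)
    return mu_U, c_L
```

The projection step matters: early on, when `n` is small, the raw bounds can leave the feasible range entirely, and the cost LCB in particular must stay above the minimum per-query cost \(c_1\) for the budget constraint to remain meaningful.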
Theoretical Guarantees¶
- Regret Upper Bound (Theorem 1): \(\text{Regret} \leq \mathcal{O}(\sqrt{M_{\max}KT\log(NT/\delta)} + NT/(M_{\max}K))\)
- First term: statistical learning cost of routing within the deployed set.
- Second term: structural model-discovery bottleneck (difficulty of identifying new models).
- Setting \(K = \Theta(T^{1/3})\) and \(M_{\max} = \Omega(N^{2/3})\) yields \(\tilde{\mathcal{O}}(N^{1/3}T^{2/3})\).
- Matching Lower Bound (Theorem 2): \(\Omega(T^{2/3})\), establishing near-optimality.
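The parameter choice in Theorem 1 can be checked numerically: with \(K = T^{1/3}\) and \(M_{\max} = N^{2/3}\), the two regret terms balance at the \(N^{1/3}T^{2/3}\) rate. A quick sanity check, dropping constants and log factors (the specific `T` and `N` values are arbitrary):

```python
import math

def regret_terms(T, N, M_max, K):
    """The two terms of Theorem 1's bound, with constants and
    log factors dropped: routing-learning cost and the
    model-discovery bottleneck."""
    learning = math.sqrt(M_max * K * T)
    discovery = N * T / (M_max * K)
    return learning, discovery

T, N = 10**6, 64
K = round(T ** (1 / 3))       # Theta(T^{1/3}) stages
M_max = round(N ** (2 / 3))   # Omega(N^{2/3}) concurrency cap
learning, discovery = regret_terms(T, N, M_max, K)
target = N ** (1 / 3) * T ** (2 / 3)
# learning, discovery, and target all come out around 4.0e4,
# matching the claimed N^{1/3} T^{2/3} scaling.
```

Increasing `K` shrinks the discovery term (more chances to try new models) but inflates the learning term (each stage restarts exploration over a possibly changed deployed set), which is exactly the trade-off the decomposition exposes.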
Key Experimental Results¶
Comparison with Existing LLM Routing Frameworks¶
| Method | Streaming Models | Dynamic Deployment (\(M_{\max}\) limit) | Budget-Aware | Throughput Limit |
|---|---|---|---|---|
| RouteLLM | ✗ | ✗ | ✓ | ✗ |
| UniRoute | ✓ | ✗ | ✓ | ✗ |
| CSCR | ✓ | ✗ | ✓ | ✗ |
| StageRoute | ✓ | ✓ | ✓ | ✓ |
Simulation Experiments (RouterBench Real Cost/Score Data)¶
| Experimental Setting | StageRoute vs. Oracle Gap | Notes |
|---|---|---|
| Tight budget \((b=0.3)\) | <5% sub-optimality | Closely tracks oracle under strict budget |
| Loose budget \((b=0.7)\) | <2% sub-optimality | Near-optimal when budget is relaxed |
| Varying \(M_{\max}\) | Performance monotonically improves with \(M_{\max}\) | Validates the theoretical role of the concurrency cap |
| Varying \(K\) | Optimal near \(K=T^{1/3}\) | Validates the theoretically optimal number of stages |
| Multi-task / multilingual queries | Consistently effective | Robust across diverse workloads |
Key Findings¶
- The two-term regret decomposition clearly reveals the fundamental trade-off between routing learning and model discovery.
- StageRoute outperforms static deployment strategies most significantly under tight budgets — dynamic replacement of obsolete models yields substantial gains.
- Throughput constraints naturally throttle high-load models, mitigating latency spikes.
Highlights & Insights¶
- Problem Formalization: The first work to formally model LLM deployment and routing as a coupled two-timescale online decision problem, capturing the complete practical scenario of streaming arrivals, concurrency caps, budget constraints, and throughput limits.
- The two-term regret decomposition is particularly insightful — the trade-off between statistical learning cost and structural discovery bottleneck offers principled guidance for system design.
- The matching lower bound confirms that \(\tilde{\mathcal{O}}(T^{2/3})\) is unimprovable — this reflects the intrinsic difficulty of the problem rather than any algorithmic shortcoming.
- DeployOPT solutions are used exclusively for model selection rather than as routing probabilities, decoupling deployment from routing execution and enabling rapid query-level adaptation within each stage.
Limitations & Future Work¶
- The performance distribution of each model is assumed stationary (fixed mean \(\mu_m\)); model degradation and query distribution shift are not modeled.
- Routing is context-free (not a contextual bandit), ignoring query feature information — incorporating a contextual estimator may yield further improvements.
- Throughput constraints are modeled via probabilities \(p_t(m) \leq \alpha_m\), assuming instantaneous load sharing; practical settings may require accounting for queuing delays.
- Simulation experiments are based on offline RouterBench data and have not been validated in real deployment environments.
Related Work & Insights¶
- vs. RouteLLM/Hybrid-LLM: Fixed model pool with no concurrency limit → StageRoute handles the full dynamic scenario.
- vs. BwK (Bandits with Knapsacks): BwK models consumable budgets but lacks stage-level commitment and active set replacement.
- vs. CMAB (Combinatorial MAB): CMAB selects super-arms from a fixed base set without streaming arrivals.
- vs. Streaming Bandits: Permits streaming arrivals but does not couple stage-level deployment with per-query routing.
- Insight: The two-tier decision architecture (strategic deployment + tactical routing) is generalizable to other resource-constrained online serving systems.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First complete formalization + near-optimal algorithm + matching lower bound
- Experimental Thoroughness: ⭐⭐⭐ Primarily simulation-based; lacks real deployment validation
- Writing Quality: ⭐⭐⭐⭐⭐ Rigorous theoretical derivations with clearly motivated problem formulation
- Value: ⭐⭐⭐⭐ Provides theoretically grounded guidance for LLM serving system design