Near-Optimal Online Deployment and Routing for Streaming LLMs¶
Conference: ICLR 2026 arXiv: 2506.17254 Code: None Area: LLM NLP / Systems Optimization Keywords: LLM routing, online deployment, streaming bandits, concurrency cap, budget constraints
TL;DR¶
This work provides the first formal treatment of joint online deployment and routing for streaming LLMs, where new models continuously arrive and existing ones may become obsolete. Under a concurrency deployment cap \(M_{\max}\) and cost budget constraints, the paper proposes StageRoute, a hierarchical algorithm that achieves a provable \(\tilde{\mathcal{O}}(T^{2/3})\) regret bound with a matching lower bound, establishing near-optimality.
Background & Motivation¶
Background: Extensive prior work on LLM routing (e.g., RouteLLM, Hybrid-LLM, Zooter) assumes a fixed model pool. In practice, however, new models are continuously released, older ones gradually become obsolete, and serving platforms impose hard resource limits: for instance, Azure OpenAI limits each resource to 32 standard deployments plus 5 fine-tuned deployments, and GPT-4.1 enforces rate limits of 1,000 RPM and 1M TPM.
Coupling Across Two Time Scales: Real-world LLM serving requires decisions at two fundamentally different time scales:

- Macro (stage-wise): deciding which models to keep deployed, subject to the concurrency cap \(M_{\max}\) and fixed for the entire stage. This is a high-stakes decision: activating an uncertain new model may necessitate evicting a known-reliable model for a full stage.
- Micro (per-query): routing each query to one of the currently deployed models, subject to budget and throughput constraints.

Deployment decisions determine the entire action space for routing, a fundamental prerequisite overlooked by prior routing work.
Limitations of Prior Work: Static-pool routing (RouteLLM) lacks dynamic deployment; budget-aware routing (TensorOpera) ignores concurrency caps; streaming bandits (UniRoute/CSCR) lack stage-level commitment; BwK models consumable budgets but not active set replacement. No existing framework simultaneously addresses all five dimensions: streaming arrivals + dynamic deployment + concurrency cap + budget + throughput.
Core Idea: The paper models LLM deployment and routing as a coupled two-timescale online decision problem and proposes StageRoute — using optimistic estimation for macro-level deployment selection and LP-based real-time routing at the micro level — achieving a near-optimal \(\tilde{\mathcal{O}}(T^{2/3})\) regret bound.
Method¶
Overall Architecture¶
StageRoute is a two-tier hierarchical algorithm mirroring the problem's layered structure:

- Strategic Layer (Deployment Stage): at each update point \(\tau_k\), selects at most \(M_{\max}\) models to deploy from a candidate pool that includes newly arrived models; the deployment set remains fixed throughout the stage.
- Tactical Layer (Routing Stage): upon each query arrival, solves an LP over the deployed models to obtain a routing distribution, then samples a model and dispatches the query accordingly.
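The two-tier control flow can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation (no code is released): simple greedy UCB rules stand in for the paper's MIP-based deployment step and LP-based routing step, and `quality` is a synthetic ground truth used only to simulate feedback.

```python
import random

def stage_route(arrivals, quality, M_max, K, T, seed=0):
    """Illustrative sketch of StageRoute's two-tier loop.
    arrivals[k]: models newly available at stage k.
    quality[m]:  each model's true mean reward (simulation only)."""
    rng = random.Random(seed)
    stats = {}               # model -> (reward_sum, pull_count)
    pool, total = [], 0.0

    def ucb(m):              # optimistic mean; infinite for unseen models
        s, n = stats.get(m, (0.0, 0))
        return float("inf") if n == 0 else s / n + (2.0 / n) ** 0.5

    for k in range(K):
        pool += arrivals[k]
        # Strategic layer: deploy the M_max most optimistic models,
        # fixed for the whole stage (stand-in for DeployOPT).
        deployed = sorted(pool, key=ucb, reverse=True)[:M_max]
        for _ in range(T // K):
            # Tactical layer: route each query to the current UCB
            # maximizer (stand-in for RouteOPT's routing distribution).
            m = max(deployed, key=ucb)
            r = 1.0 if rng.random() < quality[m] else 0.0
            s, n = stats.get(m, (0.0, 0))
            stats[m] = (s + r, n + 1)
            total += r
    return total, stats

# Example: a strong model arriving at stage 1 should attract most traffic.
# stage_route([["a", "b"], ["c"], [], []],
#             {"a": 0.2, "b": 0.5, "c": 0.9}, M_max=2, K=4, T=400)
```

The key structural point the sketch preserves is the asymmetry of commitment: the deployed set is frozen for `T // K` queries, while routing adapts at every query.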
Key Designs¶
- Deployment Optimization (DeployOPT): Formulated as a mixed-integer program (MIP) with binary variables \(z_m\) for model selection and continuous variables \(d_m\) for weights. UCB estimates \(\mu_m^U\) upper-bound performance and LCB estimates \(c_m^L\) lower-bound costs; the objective maximizes optimistic performance subject to budget and concurrency constraints. The solution \(d^*\) is used solely for model set selection and not as routing probabilities.
- Routing Optimization (RouteOPT): At each query arrival, solves a linear program over the deployed set to maximize UCB-estimated expected reward subject to budget and throughput constraints, with UCB/LCB updated in real time using current statistics.
- Confidence Bound Design: \(\mu_m^U = \text{proj}_{[0,1]}(\bar{\mu}_m + 2f_{rad})\), \(c_m^L = \text{proj}_{[c_1,c_2]}(\bar{c}_m - 2f_{rad})\), where \(f_{rad}(v,n) = \sqrt{\gamma v/n} + \gamma/n\).
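The confidence-bound formulas above translate directly into code. A small sketch, assuming `v` is the empirical variance proxy and `n` the sample count fed to \(f_{rad}\) (the paper's exact arguments; variable names are mine):

```python
def f_rad(v, n, gamma):
    """Confidence radius sqrt(gamma * v / n) + gamma / n."""
    return (gamma * v / n) ** 0.5 + gamma / n

def proj(x, lo, hi):
    """Projection onto the interval [lo, hi]."""
    return min(max(x, lo), hi)

def bounds(mu_bar, c_bar, v, n, gamma, c1, c2):
    """Optimistic reward UCB (projected to [0, 1]) and pessimistic
    cost LCB (projected to [c1, c2]), as in the paper's design."""
    rad = 2 * f_rad(v, n, gamma)
    mu_U = proj(mu_bar + rad, 0.0, 1.0)
    c_L = proj(c_bar - rad, c1, c2)
    return mu_U, c_L
```

The projection step matters: early on, when `n` is small, the raw bounds can leave the feasible range entirely, and the cost LCB in particular must stay above the minimum per-query cost \(c_1\) for the budget constraint to remain meaningful.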
Theoretical Guarantees¶
- Regret Upper Bound (Theorem 1): \(\text{Regret} \leq \mathcal{O}(\sqrt{M_{\max}KT\log(NT/\delta)} + NT/(M_{\max}K))\)
- First term: statistical learning cost of routing within the deployed set.
- Second term: structural model-discovery bottleneck (difficulty of identifying new models).
- Setting \(K = \Theta(T^{1/3})\) and \(M_{\max} = \Omega(N^{2/3})\) yields \(\tilde{\mathcal{O}}(N^{1/3}T^{2/3})\).
- Matching Lower Bound (Theorem 2): \(\Omega(T^{2/3})\), establishing near-optimality.
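The parameter choice in Theorem 1 can be checked numerically: with \(K = T^{1/3}\) and \(M_{\max} = N^{2/3}\), the two regret terms balance at the \(N^{1/3}T^{2/3}\) rate. A quick sanity check, dropping constants and log factors (the specific `T` and `N` values are arbitrary):

```python
import math

def regret_terms(T, N, M_max, K):
    """The two terms of Theorem 1's bound, with constants and
    log factors dropped: routing-learning cost and the
    model-discovery bottleneck."""
    learning = math.sqrt(M_max * K * T)
    discovery = N * T / (M_max * K)
    return learning, discovery

T, N = 10**6, 64
K = round(T ** (1 / 3))       # Theta(T^{1/3}) stages
M_max = round(N ** (2 / 3))   # Omega(N^{2/3}) concurrency cap
learning, discovery = regret_terms(T, N, M_max, K)
target = N ** (1 / 3) * T ** (2 / 3)
# learning, discovery, and target all come out around 4.0e4,
# matching the claimed N^{1/3} T^{2/3} scaling.
```

Increasing `K` shrinks the discovery term (more chances to try new models) but inflates the learning term (each stage restarts exploration over a possibly changed deployed set), which is exactly the trade-off the decomposition exposes.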
Key Experimental Results¶
Comparison with Existing LLM Routing Frameworks¶
| Method | Streaming Models | Dynamic Deployment (\(M_{\max}\) limit) | Budget-Aware | Throughput Limit |
|---|---|---|---|---|
| RouteLLM | ✗ | ✗ | ✓ | ✗ |
| UniRoute | ✓ | ✗ | ✓ | ✗ |
| CSCR | ✓ | ✗ | ✓ | ✗ |
| StageRoute | ✓ | ✓ | ✓ | ✓ |
Simulation Experiments (RouterBench Real Cost/Score Data)¶
| Experimental Setting | StageRoute vs. Oracle Gap | Notes |
|---|---|---|
| Tight budget \((b=0.3)\) | <5% sub-optimality | Closely tracks oracle under strict budget |
| Loose budget \((b=0.7)\) | <2% sub-optimality | Near-optimal when budget is relaxed |
| Varying \(M_{\max}\) | Performance monotonically improves with \(M_{\max}\) | Validates the theoretical role of the concurrency cap |
| Varying \(K\) | Optimal near \(K=T^{1/3}\) | Validates the theoretically optimal number of stages |
| Multi-task / multilingual queries | Consistently effective | Robust across diverse workloads |
Key Findings¶
- The two-term regret decomposition clearly reveals the fundamental trade-off between routing learning and model discovery.
- StageRoute outperforms static deployment strategies most significantly under tight budgets — dynamic replacement of obsolete models yields substantial gains.
- Throughput constraints naturally throttle high-load models, mitigating latency spikes.
Highlights & Insights¶
- Problem Formalization: The first work to formally model LLM deployment and routing as a coupled two-timescale online decision problem, capturing the complete practical scenario of streaming arrivals, concurrency caps, budget constraints, and throughput limits.
- The two-term regret decomposition is particularly insightful — the trade-off between statistical learning cost and structural discovery bottleneck offers principled guidance for system design.
- The matching lower bound confirms that \(\tilde{\mathcal{O}}(T^{2/3})\) is unimprovable — this reflects the intrinsic difficulty of the problem rather than any algorithmic shortcoming.
- DeployOPT solutions are used exclusively for model selection rather than as routing probabilities, decoupling deployment from routing execution and enabling rapid query-level adaptation within each stage.
Limitations & Future Work¶
- The performance distribution of each model is assumed stationary (fixed mean \(\mu_m\)); model degradation and query distribution shift are not modeled.
- Routing is context-free (not a contextual bandit), ignoring query feature information — incorporating a contextual estimator may yield further improvements.
- Throughput constraints are modeled via probabilities \(p_t(m) \leq \alpha_m\), assuming instantaneous load sharing; practical settings may require accounting for queuing delays.
- Simulation experiments are based on offline RouterBench data and have not been validated in real deployment environments.
Related Work & Insights¶
- vs. RouteLLM/Hybrid-LLM: Fixed model pool with no concurrency limit → StageRoute handles the full dynamic scenario.
- vs. BwK (Bandits with Knapsacks): BwK models consumable budgets but lacks stage-level commitment and active set replacement.
- vs. CMAB (Combinatorial MAB): CMAB selects super-arms from a fixed base set without streaming arrivals.
- vs. Streaming Bandits: Permits streaming arrivals but does not couple stage-level deployment with per-query routing.
- Insight: The two-tier decision architecture (strategic deployment + tactical routing) is generalizable to other resource-constrained online serving systems.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First complete formalization + near-optimal algorithm + matching lower bound
- Experimental Thoroughness: ⭐⭐⭐ Primarily simulation-based; lacks real deployment validation
- Writing Quality: ⭐⭐⭐⭐⭐ Rigorous theoretical derivations with clearly motivated problem formulation
- Value: ⭐⭐⭐⭐ Provides theoretically grounded guidance for LLM serving system design