Near-Optimal Online Deployment and Routing for Streaming LLMs

Conference: ICLR 2026 · arXiv: 2506.17254 · Code: None · Area: LLM NLP / Systems Optimization · Keywords: LLM routing, online deployment, streaming bandits, concurrency cap, budget constraints

TL;DR

This work provides the first formal treatment of the joint online deployment and routing problem for streaming LLMs, where new models continuously arrive and existing ones may become obsolete. Under a concurrency deployment cap \(M_{\max}\) and a cost budget, the paper proposes StageRoute, a hierarchical algorithm that achieves a provable \(\tilde{\mathcal{O}}(T^{2/3})\) regret bound with a matching lower bound, establishing near-optimality.

Background & Motivation

Background: Extensive prior work on LLM routing (e.g., RouteLLM, Hybrid-LLM, Zooter) assumes a fixed model pool. In practice, however, new models are continuously released, older ones gradually become obsolete, and serving platforms impose hard resource limits — for instance, Azure OpenAI caps each resource at 32 standard deployments plus 5 fine-tuned deployments, and GPT-4.1 enforces rate limits of 1,000 RPM and 1M TPM.

Coupling Across Two Time Scales: Real-world LLM serving requires decisions at two fundamentally different time scales:

  • Macro (stage-wise): Deciding which models to keep deployed (subject to the concurrency cap \(M_{\max}\), fixed for the entire stage). This is a high-stakes decision — activating an uncertain new model may necessitate evicting a known-reliable model for a full stage.
  • Micro (per-query): Routing each query to one of the currently deployed models (subject to budget and throughput constraints).

Deployment decisions determine the entire action space for routing — a fundamental prerequisite overlooked by prior routing work.

Limitations of Prior Work: Static-pool routing (RouteLLM) lacks dynamic deployment; budget-aware routing (TensorOpera) ignores concurrency caps; streaming bandits (UniRoute/CSCR) lack stage-level commitment; BwK models consumable budgets but not active set replacement. No existing framework simultaneously addresses all five dimensions: streaming arrivals + dynamic deployment + concurrency cap + budget + throughput.

Core Idea: The paper models LLM deployment and routing as a coupled two-timescale online decision problem and proposes StageRoute — using optimistic estimation for macro-level deployment selection and LP-based real-time routing at the micro level — achieving a near-optimal \(\tilde{\mathcal{O}}(T^{2/3})\) regret bound.

Method

Overall Architecture

StageRoute is a two-tier hierarchical algorithm mirroring the problem's layered structure:

  • Strategic Layer (Deployment Stage): At each update point \(\tau_k\), selects \(\leq M_{\max}\) models to deploy from a candidate pool that includes newly arrived models; the deployment set remains fixed throughout the stage.
  • Tactical Layer (Routing Stage): Upon each query arrival, solves an LP over the deployed models to obtain a routing distribution, then samples and dispatches accordingly.
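The two-tier loop can be sketched in a few lines of Python. This is a toy stand-in, not the paper's algorithm: greedy UCB selection replaces both DeployOPT's mixed-integer program and the RouteOPT LP, rewards are simulated as Bernoulli draws from hypothetical true means, and the \(\sqrt{1/n}\) exploration bonus is purely illustrative.

```python
import random

def stageroute_skeleton(models, T, K, M_max, seed=0):
    """Toy two-tier loop: stage-level deployment, query-level routing.

    `models` maps name -> hypothetical true mean reward (simulation only;
    the real algorithm observes sampled rewards). Greedy UCB selection
    stands in for DeployOPT's MIP and for the RouteOPT LP.
    """
    rng = random.Random(seed)
    stats = {m: [0.0, 0] for m in models}          # [reward sum, pull count]

    def ucb(m):
        s, n = stats[m]
        return float("inf") if n == 0 else s / n + (1.0 / n) ** 0.5

    total, stage_len = 0.0, max(1, T // K)
    deployed = []
    for t in range(T):
        if t % stage_len == 0:                     # strategic layer: new stage
            deployed = sorted(models, key=ucb, reverse=True)[:M_max]
        m = max(deployed, key=ucb)                 # tactical layer: route query
        r = 1.0 if rng.random() < models[m] else 0.0
        stats[m][0] += r
        stats[m][1] += 1
        total += r
    return total / T                               # average realized reward
```

Even this crude skeleton exhibits the stage structure: the deployed set is frozen between update points, while routing adapts to every query within the stage.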

Key Designs

  1. Deployment Optimization (DeployOPT): Formulated as a mixed-integer program (MIP) with binary variables \(z_m\) for model selection and continuous variables \(d_m\) for weights. UCB estimates \(\mu_m^U\) upper-bound performance and LCB estimates \(c_m^L\) lower-bound costs; the objective maximizes optimistic performance subject to budget and concurrency constraints. The solution \(d^*\) is used solely for model set selection and not as routing probabilities.
  2. Routing Optimization (RouteOPT): At each query arrival, solves a linear program over the deployed set to maximize UCB-estimated expected reward subject to budget and throughput constraints, with UCB/LCB updated in real time using current statistics.
  3. Confidence Bound Design: \(\mu_m^U = \text{proj}_{[0,1]}(\bar{\mu}_m + 2f_{rad})\), \(c_m^L = \text{proj}_{[c_1,c_2]}(\bar{c}_m - 2f_{rad})\), where \(f_{rad}(v,n) = \sqrt{\gamma v/n} + \gamma/n\).
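The confidence-bound construction is simple to state in code. A minimal sketch, assuming \(v\) is an empirical variance term and \(\gamma\) a confidence parameter of order \(\log(NT/\delta)\) (details the summary does not spell out):

```python
import math

def f_rad(v, n, gamma):
    """Bernstein-style radius: sqrt(gamma * v / n) + gamma / n."""
    return math.sqrt(gamma * v / n) + gamma / n

def proj(x, lo, hi):
    """Clip x to the interval [lo, hi]."""
    return min(max(x, lo), hi)

def mu_ucb(mean, v, n, gamma):
    # optimistic reward estimate, projected onto [0, 1]
    return proj(mean + 2 * f_rad(v, n, gamma), 0.0, 1.0)

def c_lcb(mean, v, n, gamma, c1, c2):
    # pessimistic (lower) cost estimate, projected onto [c1, c2]
    return proj(mean - 2 * f_rad(v, n, gamma), c1, c2)
```

Both radii shrink as \(1/\sqrt{n}\), so a freshly deployed model looks attractive (high \(\mu^U\), low \(c^L\)) and gets explored; with more samples the estimates tighten toward the empirical means.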

Theoretical Guarantees

  • Regret Upper Bound (Theorem 1): \(\text{Regret} \leq \mathcal{O}(\sqrt{M_{\max}KT\log(NT/\delta)} + NT/(M_{\max}K))\)
    • First term: statistical learning cost of routing within the deployed set.
    • Second term: structural model-discovery bottleneck (difficulty of identifying new models).
  • Setting \(K = \Theta(T^{1/3})\) and \(M_{\max} = \Omega(N^{2/3})\) yields \(\tilde{\mathcal{O}}(N^{1/3}T^{2/3})\).
  • Matching Lower Bound (Theorem 2): \(\Omega(T^{2/3})\), establishing near-optimality.
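The trade-off behind \(K = \Theta(T^{1/3})\) can be checked numerically. Dropping logarithmic factors and constants from Theorem 1, the sketch below minimizes the two-term bound over \(K\); the discrete minimizer tracks the \(T^{1/3}\) scaling (doubling when \(T\) grows 8-fold), though its absolute value differs from \(T^{1/3}\) by a constant.

```python
def bound(K, T, N, M):
    """Two-term regret bound of Theorem 1 with constants and logs dropped:
    routing-learning cost + model-discovery bottleneck."""
    return (M * K * T) ** 0.5 + N * T / (M * K)

def best_K(T, N, M):
    """Discrete minimizer of the bound over K = 1..T."""
    return min(range(1, T + 1), key=lambda K: bound(K, T, N, M))
```

For example, with \(N = 27\) candidate models and \(M_{\max} = N^{2/3} = 9\) (the regime of the corollary above), the optimal number of stages roughly doubles when the horizon grows from \(T = 1000\) to \(T = 8000\), exactly the \(T^{1/3}\) behavior.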

Key Experimental Results

Comparison with Existing LLM Routing Frameworks

| Method | Streaming Models | Dynamic Deployment (\(M_{\max}\) limit) | Budget-Aware | Throughput Limit |
|---|---|---|---|---|
| RouteLLM | ✗ | ✗ | | |
| UniRoute | ✓ | ✗ | | |
| CSCR | ✓ | ✗ | | |
| StageRoute | ✓ | ✓ | ✓ | ✓ |

Simulation Experiments (RouterBench Real Cost/Score Data)

| Experimental Setting | StageRoute vs. Oracle Gap | Notes |
|---|---|---|
| Tight budget \((b=0.3)\) | <5% sub-optimality | Closely tracks oracle under strict budget |
| Loose budget \((b=0.7)\) | <2% sub-optimality | Near-optimal when budget is relaxed |
| Varying \(M_{\max}\) | Performance monotonically improves with \(M_{\max}\) | Validates the theoretical role of the concurrency cap |
| Varying \(K\) | Optimal near \(K = T^{1/3}\) | Validates the theoretically optimal number of stages |
| Multi-task / multilingual queries | Consistently effective | Robust across diverse workloads |

Key Findings

  • The two-term regret decomposition clearly reveals the fundamental trade-off between routing learning and model discovery.
  • StageRoute outperforms static deployment strategies most significantly under tight budgets — dynamic replacement of obsolete models yields substantial gains.
  • Throughput constraints naturally throttle high-load models, mitigating latency spikes.

Highlights & Insights

  • Problem Formalization: The first work to formally model LLM deployment and routing as a coupled two-timescale online decision problem, capturing the complete practical scenario of streaming arrivals, concurrency caps, budget constraints, and throughput limits.
  • The two-term regret decomposition is particularly insightful — the trade-off between statistical learning cost and structural discovery bottleneck offers principled guidance for system design.
  • The matching lower bound confirms that \(\tilde{\mathcal{O}}(T^{2/3})\) is unimprovable — this reflects the intrinsic difficulty of the problem rather than any algorithmic shortcoming.
  • DeployOPT solutions are used exclusively for model selection rather than as routing probabilities, decoupling deployment from routing execution and enabling rapid query-level adaptation within each stage.

Limitations & Future Work

  • The performance distribution of each model is assumed stationary (fixed mean \(\mu_m\)); model degradation and query distribution shift are not modeled.
  • Routing is context-free (not a contextual bandit), ignoring query feature information — incorporating a contextual estimator may yield further improvements.
  • Throughput constraints are modeled via probabilities \(p_t(m) \leq \alpha_m\), assuming instantaneous load sharing; practical settings may require accounting for queuing delays.
  • Simulation experiments are based on offline RouterBench data and have not been validated in real deployment environments.

Comparison with Related Frameworks

  • vs. RouteLLM/Hybrid-LLM: Fixed model pool with no concurrency limit → StageRoute handles the full dynamic scenario.
  • vs. BwK (Bandits with Knapsacks): BwK models consumable budgets but lacks stage-level commitment and active set replacement.
  • vs. CMAB (Combinatorial MAB): CMAB selects super-arms from a fixed base set without streaming arrivals.
  • vs. Streaming Bandits: Permits streaming arrivals but does not couple stage-level deployment with per-query routing.
  • Insight: The two-tier decision architecture (strategic deployment + tactical routing) is generalizable to other resource-constrained online serving systems.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First complete formalization + near-optimal algorithm + matching lower bound
  • Experimental Thoroughness: ⭐⭐⭐ Primarily simulation-based; lacks real deployment validation
  • Writing Quality: ⭐⭐⭐⭐⭐ Rigorous theoretical derivations with clearly motivated problem formulation
  • Value: ⭐⭐⭐⭐ Provides theoretically grounded guidance for LLM serving system design