MESS+: Dynamically Learned Inference-Time LLM Routing in Model Zoos with Service Level Guarantees¶
Conference: NeurIPS 2025 | arXiv: 2505.19947 | Code: GitHub | Area: Optimization | Keywords: Cost Optimization, LLM Routing, Virtual Queue, SLA Guarantee, Online Learning
TL;DR¶
MESS+ is the first framework to formalize LLM request routing as a constrained stochastic optimization problem with SLA guarantees. By combining an online-learned request satisfaction predictor with a virtual queue mechanism, it dynamically selects models per request. Across 3 reasoning and 5 question-answering benchmarks, MESS+ achieves an average 2× cost reduction while satisfying SLA constraints, with theoretical guarantees on both cost optimality and constraint satisfaction.
Background & Motivation¶
Background: The open-source LLM ecosystem (Llama, Qwen, Granite) provides multiple models at varying scales, forming a "model zoo." Each model family contains at least three variants (e.g., 1B/8B/70B) with vastly different performance and cost profiles. Users face difficulty selecting appropriate models and typically default to the largest available—wasting resources and incurring uncontrolled costs.
Limitations of Prior Work: Existing routing approaches each have notable shortcomings—RouteLLM supports routing between only two models; Zooter and RouterDC lack formalized cost guarantees; and none provide theoretical guarantees for SLA compliance. What users require is a hard commitment such as "at least X% of requests are satisfactorily answered."
Key Challenge: A three-way conflict of requirements: (1) users demand high-quality responses without technical expertise; (2) service providers aim to minimize operational costs; (3) enterprise clients require SLA guarantees. These three objectives must be simultaneously optimized within a unified framework.
Goal: Design an LLM routing algorithm that strictly guarantees SLA compliance (i.e., a minimum request satisfaction rate over time) while minimizing operational cost.
Key Insight: Drawing on the Lyapunov drift-plus-penalty framework, SLA constraints are encoded as virtual queues, and request satisfaction prediction is integrated into per-request optimization as an online learning problem.
Core Idea: Virtual queues track cumulative SLA violations; an online-learned satisfaction predictor estimates per-model performance; and a lightweight optimization problem is solved for each request to achieve cost-optimal model selection.
Method¶
Overall Architecture¶
The problem is formalized as a constrained stochastic optimization: the objective is to minimize average operational cost \(\frac{1}{T}\sum_t\sum_m \mathbb{E}[y_{m,t}E_{m,t}]\), subject to a request satisfaction rate no lower than a target \(\alpha\): \(\frac{1}{T}\sum_t\sum_m \mathbb{E}[y_{m,t}s_{m,t}] \geq \alpha\), where \(y_{m,t} \in \{0,1\}\) is the model selection variable, \(E_{m,t}\) is the cost, and \(s_{m,t} \in \{0,1\}\) is the unknown satisfaction label. The system operates online, making routing decisions upon each request arrival.
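The online setting can be sketched as a simple replay loop (an illustrative sketch; `run_online`, the model dictionaries, and the router callable are assumptions for exposition, not the authors' code):

```python
# Minimal sketch of the online routing setting: at each step the router picks
# one model, the system pays that model's cost E_{m,t} and observes the binary
# satisfaction label s_{m,t}; the SLA requires the time-averaged satisfaction
# rate to stay at or above alpha.

def run_online(router, models, requests, alpha):
    """Replay a request stream; report (avg cost, satisfaction rate, SLA met)."""
    total_cost, satisfied = 0.0, 0
    for t, req in enumerate(requests, start=1):
        m = router(req, t)                      # model selection y_{m,t}
        total_cost += models[m]["cost"]         # E_{m,t}
        satisfied += models[m]["satisfy"](req)  # s_{m,t} in {0, 1}
    avg_cost = total_cost / len(requests)
    sat_rate = satisfied / len(requests)
    return avg_cost, sat_rate, sat_rate >= alpha
```

Any routing policy, including the MESS+ decision rule described below, plugs in as the `router` argument.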
Key Designs¶
- Virtual Queue Mechanism:
- Function: Transforms long-term SLA constraints into step-wise actionable signals.
- Mechanism: Maintains a virtual queue \(Q_{t+1} = \max\{0, Q_t + \alpha - s_{m^*,t}\}\), representing the cumulative SLA violation. When \(Q_t\) is large (i.e., SLA is frequently violated), the optimization automatically favors larger models with higher satisfaction rates; when \(Q_t\) is small (i.e., SLA headroom exists), cheaper smaller models are permitted.
- Design Motivation: Inspired by the Lyapunov drift-plus-penalty framework, with a critical extension—integration of an online-learned satisfaction predictor, which is absent from the original framework.
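The update rule above is a one-liner; a hedged sketch (the function name is an assumption, not the authors' code):

```python
# Virtual-queue update Q_{t+1} = max(0, Q_t + alpha - s_{m*,t}).
def update_queue(Q, alpha, s):
    """One queue step; s is the binary satisfaction of the model just used."""
    return max(0.0, Q + alpha - s)

# With alpha = 0.8, every miss (s = 0) adds 0.8 of backlog while every hit
# (s = 1) drains only 0.2, so repeated violations quickly inflate Q and push
# the per-request optimizer toward larger, more reliable models.
Q = 0.0
for s in [0, 0, 0, 1, 1, 1]:
    Q = update_queue(Q, 0.8, s)
```

The asymmetry between the `alpha` growth term and the `1 - alpha` drain term is what makes the queue a running measure of cumulative SLA violation rather than a simple hit/miss counter.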
- Online Request Satisfaction Predictor:
- Function: Prior to dispatching a request to any model, predicts whether each candidate model will satisfy the request.
- Mechanism: Uses ModernBERT as a frozen backbone with a trainable multi-label classifier (trained via SGD). Outputs a satisfaction probability \(\hat{s}_{m,t} \in [0,1]\) per model. A probabilistic exploration strategy (with probability \(p_t = \min(1, c/\sqrt[4]{t})\) decaying over time) balances exploration and exploitation—during exploration, all models are queried to obtain ground-truth labels for updating the predictor. Training uses regularized cross-entropy loss.
- Design Motivation: The decay rate \(p_t \propto 1/\sqrt[4]{t}\) is critical—too fast leads to an inaccurate predictor; too slow incurs excessive exploration costs. The fourth-root decay achieves an optimal balance between prediction accuracy and exploration overhead, with theoretical guarantees.
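The fourth-root schedule is easy to state in code (a minimal sketch; `explore_prob` and `should_explore` are illustrative names, not the authors' implementation):

```python
import random

# Exploration probability p_t = min(1, c / t^(1/4)), with c = 0.1 as in the
# paper's fixed hyperparameters. On an exploration step, all models in the
# zoo are queried so the true labels s_{m,t} can update the predictor.
def explore_prob(t, c=0.1):
    return min(1.0, c / t ** 0.25)

def should_explore(t, c=0.1, rng=random.random):
    return rng() < explore_prob(t, c)

# The decay is deliberately slow: p_1 = 0.1 but p_10000 is still 0.01,
# so the predictor keeps receiving fresh ground-truth labels late into
# the request stream.
```

A faster decay (e.g. \(1/\sqrt{t}\)) would starve the predictor of labels; a slower one would route too many requests to every model at once.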
- Per-Request Optimization Problem:
- Function: Makes a cost-optimal routing decision for each incoming request.
- Mechanism: \(m^* = \arg\min_m V \cdot E_{m,t} + Q_t(\alpha - \hat{s}_{m,t})\). The first term is the weighted cost (\(V\) controls cost emphasis); the second term penalizes SLA violations (queue length multiplied by predicted satisfaction gap). When the queue is long, the penalty term dominates and favors high-satisfaction large models; when the queue is short, the cost term dominates and favors cheaper small models.
- Design Motivation: The parameter \(V\) controls the tradeoff between convergence speed and cost—larger \(V\) yields lower cost but slower SLA convergence; smaller \(V\) achieves SLA compliance faster but at higher cost. Theoretically, constraint violation is \(O(V/T + 1/\sqrt{T})\) and the cost gap is \(O(M/\sqrt[4]{T} + 1/V)\).
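Because the argmin ranges over a handful of models, the per-request decision is a trivial scan (a hedged sketch under the paper's notation; `route` and the example numbers below are illustrative, with costs loosely echoing the reported 1B/8B/70B energy profiles):

```python
# Per-request rule m* = argmin_m  V * E_m + Q * (alpha - s_hat_m).
def route(costs, s_hat, Q, alpha, V=1e-4):
    """Pick the model index minimizing the drift-plus-penalty objective."""
    scores = [V * E + Q * (alpha - p) for E, p in zip(costs, s_hat)]
    return min(range(len(costs)), key=lambda m: scores[m])

costs = [0.01, 0.46, 2.35]   # per-request energy (MJ), small -> large model
s_hat = [0.30, 0.55, 0.80]   # predicted satisfaction probabilities

# Empty queue: the V * E_m cost term dominates, so the cheapest model wins.
# Large backlog: the Q * (alpha - s_hat_m) penalty dominates, so the model
# with the highest predicted satisfaction wins despite its cost.
```

Note that the predicted \(\hat{s}_{m,t}\) stands in for the unknown true label \(s_{m,t}\), which is exactly where the predictor's error enters the theoretical analysis.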
Loss & Training¶
- Satisfaction predictor: Regularized cross-entropy loss with SGD-based online learning.
- Hyperparameter settings: \(V=0.0001\), \(c=0.1\), held fixed across all benchmarks without per-task tuning.
- Cost metric: Energy consumption per request (MJ), obtained via offline profiling.
Key Experimental Results¶
Main Results¶
Representative results across 8 benchmarks (Llama 1B/8B/70B model zoo):
| Method | ARC-C (α=50%) Cost (MJ) | Satisfaction % | BoolQ (α=80%) Cost (MJ) | Satisfaction % |
|---|---|---|---|---|
| L70B only | 2.35 | 60.8% | 3.40 | 88.8% |
| L8B only | 0.46 | 54.4% | 0.43 | 84.2% |
| RouteLLM | 1.24 | 51.2% | 2.96 | 86.8% |
| RouterDC | 2.09 | 60.9% | 2.14 | 87.1% |
| MESS+ | 0.83 | 53.6% | 0.90 | 82.2% |
Average across 8 benchmarks (α=66%):
| Method | Avg. Cost (MJ) | Avg. Satisfaction % | Model Call Ratio (70B/8B/1B) |
|---|---|---|---|
| L70B only | 2.79 | 77.3% | 100/0/0 |
| RouteLLM | 2.11 | 74.2% | 75/0/25 |
| RouterDC | 2.09 | 76.1% | 76/22/2 |
| MESS+ | 1.07 | 67.8% | 35/38/27 |
Ablation Study¶
| Configuration | Description |
|---|---|
| Effect of exploration constant \(c\) | \(c=0.1\) is optimal; larger \(c\) incurs excessive exploration cost, smaller \(c\) yields an inaccurate predictor |
| Tradeoff under parameter \(V\) | Larger \(V\) lowers cost but slows SLA convergence; smaller \(V\) achieves SLA compliance quickly but at higher cost |
| Predictor accuracy | Prediction accuracy converges rapidly after the initial exploration phase, supporting theoretical assumptions |
| Non-i.i.d. setting | Although theory assumes i.i.d. inputs, experiments show robustness under non-i.i.d. (topic-sorted) settings |
Key Findings¶
- MESS+ achieves approximately 2× cost reduction: Under SLA-compliant conditions, MESS+ incurs an average cost of only 1.07 MJ, compared to approximately 2.09–2.11 MJ for the next-best methods (RouterDC and RouteLLM)—a near-halving of cost.
- Full utilization of model zoo diversity: MESS+ distributes calls at 35%/38%/27% (large/medium/small), achieving true demand-driven allocation. In contrast, RouteLLM routes to only two models (75%/0%/25%), and RouterDC sends the large majority of calls to the largest model (76%/22%/2%).
- SLA guarantees are met within a finite number of requests: Although the theoretical guarantee requires \(T \to \infty\), in practice SLA compliance stabilizes after approximately 1,000 requests, demonstrating strong practical utility.
- Fixed hyperparameters generalize across tasks: \(V=0.0001\) and \(c=0.1\) require no per-benchmark tuning across all 8 tasks, enabling deployment-friendly operation.
Highlights & Insights¶
- Theoretical guarantees as the core contribution: Theorem 1 proves constraint violation of \(O(V/T + 1/\sqrt{T})\); Theorem 2 proves a cost gap of \(O(M/\sqrt[4]{T} + 1/V + M \cdot F_{\min})\). This is the first LLM routing method with rigorous theoretical guarantees, enabling service providers to contractually commit to SLAs with mathematical assurance of compliance.
- Integration of virtual queues and online learning: The framework unifies Lyapunov optimization with online predictor training and analyzes how predictor error propagates into overall optimization—representing a substantive theoretical extension of the classical framework.
- Energy consumption as cost metric: Beyond API call fees, the framework accounts for energy consumption (MJ), aligning with the energy monitoring requirements of EU AI Act Article 95 and reinforcing its practical deployment relevance.
Limitations & Future Work¶
- Limited model zoo scale: Experiments use only three Llama models; performance in larger-scale or heterogeneous zoos (mixing different model families) remains unvalidated.
- Satisfaction label acquisition: The framework assumes users provide feedback immediately; in practice, delayed and sparse feedback is common.
- Simplified cost model: Costs are assumed known upon request receipt (estimated via token count); in reality, cost uncertainty arises from variable generation lengths across models.
- I.i.d. assumption: Theoretical analysis relies on i.i.d. inputs; although experiments demonstrate robustness under non-i.i.d. conditions, formal non-i.i.d. guarantees are absent.
- Uniform SLA targets: All requests share the same SLA objective; differentiated service for requests with varying priorities is not supported.
- Future directions include extending to heterogeneous model zoos, supporting delayed feedback, incorporating request prioritization, and modeling generation-length cost uncertainty.
Related Work & Insights¶
- vs. RouteLLM: RouteLLM trains a router on human preference data and supports only two models, with no SLA guarantees. MESS+ supports an arbitrary number of models, provides rigorous theoretical guarantees, and requires no human preference data (relying instead on binary satisfaction signals).
- vs. RouterDC: RouterDC employs contrastive learning to train a multi-model router but routes most calls to the largest model (76% call share), with no cost optimization or SLA guarantees. MESS+ distributes load across all models (35%/38%/27%).
- vs. Lyapunov optimization framework: The classical framework assumes constraint signals are known. MESS+ extends this to settings where constraint signals must be predicted via online learning—a substantive theoretical contribution.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First framework to formalize LLM routing as a constrained optimization problem with SLA guarantees; theoretically rigorous.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 8 benchmarks with multiple baselines and thorough hyperparameter analysis, though the model zoo scale is limited.
- Writing Quality: ⭐⭐⭐⭐ — Problem motivation is clearly framed through a three-party needs analysis; theoretical derivations are rigorous.
- Value: ⭐⭐⭐⭐⭐ — Addresses a core practical challenge in LLM deployment (cost vs. quality), with strong theoretical and empirical contributions.