Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective¶
Conference: ICLR 2026 · arXiv: 2601.18999 · Code: GitHub · Area: LLM Serving / KV Cache / Load Balancing · Keywords: KV cache eviction policy, randomized algorithms, load-balanced routing, multi-LLM serving, competitive ratio analysis
TL;DR¶
This paper proposes the first unified mathematical model for KV cache-aware load balancing, introducing RLT, a randomized leaf-node eviction algorithm with an \(O(\log n)\) competitive ratio, and LBGR, a learning-based greedy router; together they achieve up to 11.96× latency reduction and 14.06× TTFT reduction in multi-LLM serving scenarios.
Background & Motivation¶
KV caching is central to LLM inference acceleration: reusing key-value pairs from prior queries avoids redundant computation, but under memory constraints its effectiveness is highly sensitive to the choice of eviction policy.
LRU eviction has fundamental flaws: the traditional Least Recently Used (LRU) policy is fragile under dynamic query arrival patterns — in the worst case, it evicts precisely the tokens needed by the next query, causing a sharp drop in cache hit rate.
Inherent tension in multi-LLM serving: maximizing per-LLM cache hit rate (routing similar queries to the same LLM) and achieving global load balance (avoiding queue delays) are conflicting objectives.
Existing methods rely on heuristics: SGLang uses a fixed threshold to switch routing strategies; NVIDIA/llm-d uses a static linear scoring function. Neither approach offers formal modeling or adaptability to dynamic query patterns.
Lack of theoretical analysis: a significant gap exists between practical system design and theoretical understanding, with no unified model capturing the coupling between cache eviction and load balancing.
Coarse-grained queue load modeling: existing methods measure congestion solely by the number of pending queries, ignoring variation in actual service time across queries.
Method¶
Unified Formal Model¶
- \(M\) workers (LLMs), each equipped with a KV cache of size \(B_i\), process \(N\) queries \(Q = \{q_j\}\) online.
- Service time: \(\text{Cost}_{ij} = \alpha_{\text{CACHED}} \cdot h_{ij} + \alpha_{\text{MISS}} \cdot (|q_j| - h_{ij}) + O_{ij}\), where \(h_{ij}\) is the number of cache-hit tokens.
- Routing decision \(x_{ij} \in \{0, 1\}\) determines query assignment; the objective is to minimize makespan (maximum cumulative load across all workers).
- Cache state is updated via an UpdateCache function; queue load \(P_i\) accumulates the service time of each assigned query (the full online loop is sketched after this list).
- The model formally exposes the theoretical limitation of LRU: L-LRU (SGLang's leaf-node LRU) has a competitive ratio of \(O(n)\), degenerating severely when query lengths are highly imbalanced.
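A minimal Python sketch may help make the notation concrete. The cost constants, the worker interface (`cache_hit_tokens`), and the `choose_worker` / `update_cache` hooks below are illustrative placeholders, not the authors' implementation.

```python
# A minimal sketch of the formal model (illustrative names, not the authors' code).

ALPHA_CACHED = 0.1   # assumed per-token cost for cache-hit tokens
ALPHA_MISS = 1.0     # assumed per-token cost for cache-miss tokens

def service_time(query_len: int, hit_tokens: int, overhead: float = 0.0) -> float:
    """Cost_ij = alpha_CACHED * h_ij + alpha_MISS * (|q_j| - h_ij) + O_ij."""
    return ALPHA_CACHED * hit_tokens + ALPHA_MISS * (query_len - hit_tokens) + overhead

def makespan(loads: list[float]) -> float:
    """Objective: the maximum cumulative load across all workers."""
    return max(loads)

def route_online(queries, workers, choose_worker, update_cache):
    """Online loop: each query is assigned to exactly one worker (x_ij = 1), the
    worker's queue load P_i accumulates its service time, and the cache is updated."""
    loads = [0.0] * len(workers)
    for q in queries:
        i = choose_worker(q, workers, loads)    # routing decision x_ij
        h = workers[i].cache_hit_tokens(q)      # h_ij, from worker i's cache state
        loads[i] += service_time(len(q), h)     # P_i accumulates by service time
        update_cache(workers[i], q)             # UpdateCache
    return makespan(loads)
```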
RLT: Randomized Leaf-node evicTion¶
- A marking set \(T\) tracks visited tokens; when the mark count reaches \(B_i + 1\), marks are cleared (retaining only the most recent token).
- When the cache is full and new tokens must be loaded, a leaf node is chosen uniformly at random from the unmarked leaf nodes and evicted (sketched after this list).
- Theoretical guarantee: competitive ratio \(\Theta(\log(B_i - L))\), an exponential improvement over L-LRU's \(O(n)\).
- This ratio is proven to match the lower bound that holds for every randomized eviction algorithm, i.e., RLT is information-theoretically optimal.
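Below is a minimal sketch of the marking-plus-random-leaf-eviction idea on a deliberately simplified cache abstraction (a flat parent map standing in for the Radix tree); the class and method names are assumptions for illustration, not the paper's code.

```python
import random

class RLTCache:
    """Simplified sketch of RLT: marking set + uniformly random eviction of an
    unmarked leaf. A flat node -> parent map stands in for the Radix tree."""

    def __init__(self, capacity: int):
        self.capacity = capacity   # B_i
        self.parent = {}           # node -> parent (tree structure)
        self.marked = set()        # marking set T of recently visited nodes

    def _leaves(self):
        interior = set(self.parent.values())
        return [n for n in self.parent if n not in interior]

    def access(self, node, par=None):
        if node not in self.parent:
            # Cache miss: evict a uniformly random *unmarked* leaf until there is room.
            while len(self.parent) >= self.capacity:
                unmarked = [l for l in self._leaves() if l not in self.marked]
                victim = random.choice(unmarked or self._leaves())  # fallback is a simplification
                del self.parent[victim]
            self.parent[node] = par
        # Mark the visited node; once B_i + 1 marks accumulate, start a new phase,
        # clearing all marks except the most recently visited node.
        self.marked.add(node)
        if len(self.marked) >= self.capacity + 1:
            self.marked = {node}
```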
LBGR: Learning-Based Greedy Routing¶
- Service time estimation: a global Radix Tree tracks per-worker cache state to estimate cache-hit token count \(\tilde{h}_{ij}\).
- Queue load estimation: \(\tilde{P}_i\) tracks per-worker load using exponential decay (factor \(\rho\)) to simulate natural load reduction over time; residual load is released upon query completion.
- Residual correction: an online linear regression model \(\theta_i\) captures latency deviations caused by environmental fluctuations, with feature vector \(\phi\) containing hit count, miss count, and current load.
- Greedy decision: \(\hat{E}_{ij} = \text{Cost}_{ij} + \tilde{P}_i + \theta_i^T \cdot \phi_{ij}\); each query is routed to the worker with the lowest predicted latency (see the sketch after this list).
- Online update: the regression model is continuously updated by minimizing \((E_{ij} - \hat{E}_{ij})^2\) upon observing actual latency.
- Background decay thread: every \(\Delta t\) time units, all workers execute \(\tilde{P}_i \leftarrow \rho \cdot \tilde{P}_i\) to prevent accumulated distortion in load estimates.
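The routing loop can be sketched as follows; the cost constants, decay factor, learning rate, and the way hit estimates are passed in are illustrative assumptions (the paper obtains \(\tilde{h}_{ij}\) from a global Radix tree), not the authors' implementation.

```python
import numpy as np

ALPHA_CACHED, ALPHA_MISS = 0.1, 1.0   # same assumed per-token costs as in the sketch above

class LBGRRouter:
    """Sketch of LBGR: decayed load estimates, per-worker linear residual models,
    and greedy routing by predicted latency."""

    def __init__(self, num_workers: int, rho: float = 0.95, lr: float = 1e-4):
        self.rho, self.lr = rho, lr
        self.load = np.zeros(num_workers)        # decayed queue-load estimates P~_i
        self.theta = np.zeros((num_workers, 3))  # per-worker residual weights theta_i

    def route(self, query_len: int, est_hits: list[int]) -> dict:
        """est_hits[i] plays the role of h~_ij for worker i."""
        tickets = []
        for i, h in enumerate(est_hits):
            cost = ALPHA_CACHED * h + ALPHA_MISS * (query_len - h)      # Cost_ij
            phi = np.array([h, query_len - h, self.load[i]], float)     # phi_ij
            e_hat = cost + self.load[i] + float(self.theta[i] @ phi)    # E^_ij
            tickets.append((e_hat, i, cost, phi))
        e_hat, i, cost, phi = min(tickets)       # lowest predicted latency wins
        self.load[i] += cost                     # charge the estimated service time
        return {"worker": i, "predicted": e_hat, "cost": cost, "phi": phi}

    def on_complete(self, ticket: dict, observed_latency: float):
        """Release the residual load and take one SGD step on (E_ij - E^_ij)^2."""
        i = ticket["worker"]
        self.load[i] = max(0.0, self.load[i] - ticket["cost"])
        err = observed_latency - ticket["predicted"]
        self.theta[i] += self.lr * err * ticket["phi"]

    def decay_tick(self):
        """Background thread: every Delta-t, scale every load estimate by rho."""
        self.load *= self.rho
```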
Experiments¶
Main Results¶
| Baseline (metric) | Latency Reduction | TTFT Reduction | Cache Hit Rate Gain | Throughput Gain |
|---|---|---|---|---|
| vs Random+LRU (median) | 30.9× | 44.49× | — | — |
| vs Cache-Aware+LRU, SOTA (median) | 11.96× | 14.06× | 36.45% | 36.51% |
| vs Cache-Aware+LRU (P95) | 2.03× | 2.62× | — | — |
Ablation Study¶
| Method | P50 Latency (ms) | P50 TTFT (ms) | Hit Rate | Overhead |
|---|---|---|---|---|
| Cache-Aware+LRU | 26680 | 25022 | 23.89% | baseline |
| Cache-Aware+RLT | 19191 | 14332 | 26.36% | +0.58ms eviction |
| LBGR+LRU | 6025 | 2958 | 33.33% | +0.56ms routing |
| LBGR+RLT | 2263 | 1088 | 37.31% | +2ms total |
Generalization across Model Scales and Architectures¶
| Model | Hardware | Latency Reduction vs Cache-Aware+LRU | TTFT Reduction vs Cache-Aware+LRU |
|---|---|---|---|
| Llama-3.1-8B | 10×L40 | 11.96× | 14.06× |
| Llama-3.1-70B | 4×H200 | 5.46× | 7.19× |
| Mixtral-8×7B (MoE) | 4×H200 | significantly outperforms all baselines | significantly outperforms all baselines |
Key Findings¶
- Consistent gains across Llama-3.1-8B (L40), 70B (H200), and Mixtral-8×7B, generalizing across dense and sparse architectures.
- Under worst-case round-robin arrival order, LBGR+RLT still leads substantially (22.8× latency reduction, 15.5× TTFT reduction).
- Robust across cache sizes from 50% to 90%, worker counts from 2 to 10, and request rates of 4–20 req/s.
- Single-worker experiments also confirm RLT's superiority over L-LRU: up to 6.92× higher cache hit rate and 77.4% throughput gain.
- The performance gap between LBGR+RLT and baselines widens further as cache capacity increases.
- Throughput saturates across methods for worker counts \(\geq 6\) under fixed request rate (12 req/s), indicating sufficient system capacity.
Highlights & Insights¶
- The paper establishes the first unified mathematical model for KV cache-aware load balancing, bridging the gap between theory and practice.
- RLT's \(O(\log n)\) competitive ratio is proven to be information-theoretically optimal (not improvable), representing an elegant application of classical online algorithm theory to LLM systems.
- Both algorithms introduce negligible runtime overhead (~2ms per query), making them highly practical and directly plug-and-play in existing systems.
- LBGR's exponential decay combined with online regression is simple yet effective, avoiding the training cost of complex RL or predictive models.
- Experiments cover 4 benchmarks (GSP/ShareGPT/UltraChat/Loogle), 3 prefix-sharing scenarios, and multiple model scales and architectures, providing comprehensive empirical coverage.
Limitations & Future Work¶
- Evaluation is limited to text inference; multi-modal KV caching (where image token cache behavior may differ) is not addressed.
- Deployment is restricted to single-domain setups of up to 10 workers; the impact of network latency in large-scale or geo-distributed deployments remains unvalidated.
- Theoretical analysis assumes online adversarial arrivals; distributional properties of real workloads (e.g., Zipf distributions) could be exploited for tighter bounds.
- The linear regression residual model in LBGR may lack expressiveness in highly non-linear latency scenarios.
- Output token length experiments are limited to 128; longer generation settings (e.g., 4K+) are unexplored.
Related Work & Insights¶
- KV cache compression: H2O (retaining high-attention tokens), StreamingLLM (attention sink tokens) — model-level compression approaches, complementary to the system-level eviction strategy proposed here.
- KV cache systems: vLLM (paged virtual memory to reduce fragmentation), SGLang RadixAttention (Radix tree prefix sharing) — this paper replaces the eviction and routing modules on top of SGLang.
- Load-balanced routing: SGLang (threshold-based switching between hit-rate and least-load routing), NVIDIA TensorRT-LLM/llm-d (static linear scoring) — all heuristic-based with no theoretical guarantees.
- Online algorithm theory: the classical marking algorithm (Fiat et al., 1991) provides the theoretical foundation for RLT; this paper extends it to leaf-node eviction on Radix tree structures.
- Competitive ratio analysis: this work connects online algorithm competitive analysis with LLM system optimization, exemplifying a theory-guided systems research paradigm.
- This is the first work to jointly model cache eviction and routing in a unified framework, serving as a canonical example of theoretically driven system optimization.
Rating ⭐⭐⭐⭐⭐¶
- Novelty: 5/5 — First unified formalization with optimal competitive ratio proof; solid theoretical contribution.
- Experimental Thoroughness: 5/5 — 4 benchmarks × 3 settings × multiple models × extensive parameter sweeps; thorough ablation.
- Writing Quality: 5/5 — Theory and experiments are well-integrated; problem motivation is clearly articulated.
- Value: 5/5 — Only ~2ms overhead per query; directly integrable into systems such as SGLang.