
Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective

Conference: ICLR 2026
arXiv: 2601.18999
Code: GitHub
Area: LLM Serving / KV Cache / Load Balancing
Keywords: KV cache eviction policy, randomized algorithms, load-balanced routing, multi-LLM serving, competitive ratio analysis

TL;DR

This paper proposes the first unified mathematical model for KV cache-aware load balancing. It introduces RLT, a randomized leaf-node eviction algorithm with an \(O(\log n)\) competitive ratio, and LBGR, a learning-based greedy router; together they achieve up to 11.96× latency reduction and 14.06× TTFT reduction in multi-LLM serving scenarios.

Background & Motivation

KV caching is central to LLM inference acceleration: it reuses key-value pairs from prior queries to avoid redundant computation. Its effectiveness under memory constraints, however, is highly sensitive to the choice of eviction policy.

LRU eviction has fundamental flaws: the traditional Least Recently Used (LRU) policy is fragile under dynamic query arrival patterns — in the worst case, it evicts precisely the tokens needed by the next query, causing a sharp drop in cache hit rate.

Inherent tension in multi-LLM serving: maximizing per-LLM cache hit rate (routing similar queries to the same LLM) and achieving global load balance (avoiding queue delays) are conflicting objectives.

Existing methods rely on heuristics: SGLang uses a fixed threshold to switch routing strategies; NVIDIA/llm-d uses a static linear scoring function. Neither approach offers formal modeling or adaptability to dynamic query patterns.

Lack of theoretical analysis: a significant gap exists between practical system design and theoretical understanding, with no unified model capturing the coupling between cache eviction and load balancing.

Coarse-grained queue load modeling: existing methods measure congestion solely by the number of pending queries, ignoring variation in actual service time across queries.

Method

Unified Formal Model

  • \(M\) workers (LLMs), each equipped with a KV cache of size \(B_i\), process \(N\) queries \(Q = \{q_j\}\) online.
  • Service time: \(\text{Cost}_{ij} = \alpha_{\text{CACHED}} \cdot h_{ij} + \alpha_{\text{MISS}} \cdot (|q_j| - h_{ij}) + O_{ij}\), where \(h_{ij}\) is the number of cache-hit tokens.
  • Routing decision \(x_{ij} \in \{0, 1\}\) determines query assignment; the objective is to minimize makespan (maximum cumulative load across all workers).
  • Cache state is updated via an UpdateCache function; queue load \(P_i\) accumulates by service time.
  • The model formally exposes the theoretical limitation of LRU: L-LRU (SGLang's leaf-node LRU) has a competitive ratio of \(O(n)\), degenerating severely when query lengths are highly imbalanced.
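
The service-time and makespan objective above can be sketched in a few lines of Python; the \(\alpha\) values and the overhead term are illustrative placeholders, not the paper's calibrated constants.

```python
# Illustrative sketch of the paper's service-time model. The constants below
# (ALPHA_CACHED, ALPHA_MISS, OVERHEAD) are made-up placeholders: cached tokens
# should be much cheaper to process than cache-miss (prefill) tokens.
ALPHA_CACHED = 0.1   # per-token cost for cache-hit tokens
ALPHA_MISS = 1.0     # per-token cost for cache-miss tokens
OVERHEAD = 0.5       # fixed per-query overhead O_ij

def service_time(query_len: int, hit_tokens: int) -> float:
    """Cost_ij = a_CACHED * h_ij + a_MISS * (|q_j| - h_ij) + O_ij."""
    return (ALPHA_CACHED * hit_tokens
            + ALPHA_MISS * (query_len - hit_tokens)
            + OVERHEAD)

def makespan(loads: list[float]) -> float:
    """Objective: maximum cumulative load across all workers."""
    return max(loads)

# Routing x_ij assigns each query to exactly one worker;
# the chosen worker's queue load accumulates by the service time.
loads = [0.0, 0.0]
loads[0] += service_time(query_len=100, hit_tokens=80)  # mostly cached
loads[1] += service_time(query_len=100, hit_tokens=0)   # cold cache
```

A high hit count \(h_{ij}\) shifts tokens from the expensive miss term to the cheap cached term, which is exactly why routing similar queries to the same worker lowers service time, at the risk of unbalancing the queues.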

RLT: Randomized Leaf-node evicTion

  • A marking set \(T\) tracks visited tokens; when the mark count reaches \(B_i + 1\), marks are cleared (retaining only the most recent token).
  • When the cache is full and new tokens must be loaded, a leaf node is chosen uniformly at random from unmarked leaf nodes for eviction.
  • Theoretical guarantee: competitive ratio \(\Theta(\log(B_i - L))\), an exponential improvement over L-LRU's \(O(n)\).
  • This is proven to match the optimal lower bound for all randomized eviction algorithms (information-theoretically optimal).
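
A minimal sketch of the RLT marking-and-eviction loop, simplifying each radix-tree leaf to a flat cached item; the class name and internal layout are illustrative assumptions, not the paper's implementation.

```python
import random

class RLTCache:
    """Illustrative sketch of RLT (randomized leaf-node eviction).

    Simplification: each cached item stands in for a radix-tree leaf;
    the real algorithm evicts leaf nodes of a token-prefix tree.
    """
    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.cache = set()    # cached "leaf" items
        self.marked = set()   # marking set T of recently visited items
        self.rng = random.Random(seed)

    def access(self, item) -> bool:
        hit = item in self.cache
        # Mark the visited item; once the mark count reaches B + 1,
        # clear all marks, retaining only the most recent item.
        self.marked.add(item)
        if len(self.marked) >= self.capacity + 1:
            self.marked = {item}
        if not hit:
            if len(self.cache) >= self.capacity:
                # Evict uniformly at random among *unmarked* cached items.
                # Candidates are never empty: |marked| <= capacity and the
                # missing item is marked but not cached, so at least one
                # cached item is unmarked.
                candidates = list(self.cache - self.marked)
                self.cache.discard(self.rng.choice(candidates))
            self.cache.add(item)
        return hit
```

Randomizing the eviction choice is what defeats the adversarial pattern that breaks L-LRU: no fixed arrival order can deterministically force the next-needed leaf out of the cache.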

LBGR: Learning-Based Greedy Routing

  • Service time estimation: a global Radix Tree tracks per-worker cache state to estimate cache-hit token count \(\tilde{h}_{ij}\).
  • Queue load estimation: \(\tilde{P}_i\) tracks per-worker load using exponential decay (factor \(\rho\)) to simulate natural load reduction over time; residual load is released upon query completion.
  • Residual correction: an online linear regression model \(\theta_i\) captures latency deviations caused by environmental fluctuations, with feature vector \(\phi\) containing hit count, miss count, and current load.
  • Greedy decision: \(\hat{E}_{ij} = \text{Cost}_{ij} + \tilde{P}_i + \theta_i^T \cdot \phi_{ij}\); queries are routed to the worker with the lowest predicted latency.
  • Online update: the regression model is continuously updated by minimizing \((E_{ij} - \hat{E}_{ij})^2\) upon observing actual latency.
  • Background decay thread: every \(\Delta t\) time units, all workers execute \(\tilde{P}_i \leftarrow \rho \cdot \tilde{P}_i\) to prevent accumulated distortion in load estimates.
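
The LBGR pieces (decayed load estimate, linear residual model, greedy argmin, online SGD update) fit together as in the sketch below; the decay factor, learning rate, \(\alpha\) coefficients, and feature layout are assumptions for illustration, not the paper's settings.

```python
class LBGRRouter:
    """Illustrative sketch of learning-based greedy routing (LBGR)."""

    def __init__(self, n_workers: int, rho: float = 0.95, lr: float = 1e-4):
        self.rho = rho                      # decay factor (assumed value)
        self.lr = lr                        # SGD learning rate (assumed value)
        self.load = [0.0] * n_workers       # decayed queue-load estimates P~_i
        self.theta = [[0.0, 0.0, 0.0] for _ in range(n_workers)]  # residual models

    @staticmethod
    def _dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def route(self, query_len: int, hits: list[int]):
        """hits[i] = estimated cache-hit token count h~_ij on worker i."""
        best_i, best_e, best_cost = None, float("inf"), 0.0
        for i, h in enumerate(hits):
            cost = 0.1 * h + 1.0 * (query_len - h)  # illustrative alpha values
            phi = [h, query_len - h, self.load[i]]  # hit count, miss count, load
            e = cost + self.load[i] + self._dot(self.theta[i], phi)
            if e < best_e:
                best_i, best_e, best_cost = i, e, cost
        self.load[best_i] += best_cost              # enqueue estimated service time
        return best_i, best_e

    def observe(self, i: int, phi: list[float], predicted: float, actual: float):
        """One SGD step minimizing (E_ij - E^_ij)^2 for worker i's residual model."""
        err = predicted - actual
        self.theta[i] = [t - self.lr * 2 * err * f
                         for t, f in zip(self.theta[i], phi)]

    def decay_tick(self):
        """Background thread: every delta-t, apply P~_i <- rho * P~_i."""
        self.load = [self.rho * p for p in self.load]
```

With zeroed residual models, a query with a large cached prefix on one worker is routed there unless that worker's decayed load estimate already outweighs the cache advantage, which is the hit-rate-versus-balance trade-off resolved per query rather than by a fixed threshold.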

Experiments

Main Results

| Baseline | Median Latency Reduction | Median TTFT Reduction | Cache Hit Rate Gain | Throughput Gain |
|---|---|---|---|---|
| vs Random+LRU | 30.9× | 44.49× | | |
| vs Cache-Aware+LRU (SOTA) | 11.96× | 14.06× | +36.45% | +36.51% |
| vs Cache-Aware+LRU (P95 tail) | 2.03× | 2.62× | | |

Ablation Study

| Method | P50 Latency (ms) | P50 TTFT (ms) | Hit Rate | Overhead |
|---|---|---|---|---|
| Cache-Aware+LRU | 26680 | 25022 | 23.89% | baseline |
| Cache-Aware+RLT | 19191 | 14332 | 26.36% | +0.58 ms (eviction) |
| LBGR+LRU | 6025 | 2958 | 33.33% | +0.56 ms (routing) |
| LBGR+RLT | 2263 | 1088 | 37.31% | +2 ms (total) |

Generalization across Model Scales and Architectures

| Model | Hardware | Latency Reduction (vs Cache-Aware+LRU) | TTFT Reduction (vs Cache-Aware+LRU) |
|---|---|---|---|
| Llama-3.1-8B | 10×L40 | 11.96× | 14.06× |
| Llama-3.1-70B | 4×H200 | 5.46× | 7.19× |
| Mixtral-8×7B (MoE) | 4×H200 | outperforms all baselines | outperforms all baselines |

Key Findings

  • Consistent gains across Llama-3.1-8B (L40), 70B (H200), and Mixtral-8×7B, generalizing across dense and sparse architectures.
  • Under worst-case round-robin arrival order, LBGR+RLT still leads substantially (22.8× latency reduction, 15.5× TTFT reduction).
  • Robust across cache sizes from 50% to 90%, worker counts from 2 to 10, and request rates of 4–20 req/s.
  • Single-worker experiments also confirm RLT's superiority over L-LRU: up to 6.92× higher cache hit rate and 77.4% throughput gain.
  • The performance gap between LBGR+RLT and baselines widens further as cache capacity increases.
  • Throughput saturates across methods for worker counts \(\geq 6\) under fixed request rate (12 req/s), indicating sufficient system capacity.

Highlights & Insights

  • The paper establishes the first unified mathematical model for KV cache-aware load balancing, bridging the gap between theory and practice.
  • RLT's \(O(\log n)\) competitive ratio is proven to be information-theoretically optimal (not improvable), representing an elegant application of classical online algorithm theory to LLM systems.
  • Both algorithms introduce negligible runtime overhead (~2ms per query), making them highly practical and directly plug-and-play in existing systems.
  • LBGR's exponential decay combined with online regression is simple yet effective, avoiding the training cost of complex RL or predictive models.
  • Experiments cover 4 benchmarks (GSP/ShareGPT/UltraChat/Loogle), 3 prefix-sharing scenarios, and multiple model scales and architectures, providing comprehensive empirical coverage.

Limitations & Future Work

  • Evaluation is limited to text inference; multi-modal KV caching (where image token cache behavior may differ) is not addressed.
  • Deployment is restricted to single-domain setups of up to 10 workers; the impact of network latency in large-scale or geo-distributed deployments remains unvalidated.
  • Theoretical analysis assumes online adversarial arrivals; distributional properties of real workloads (e.g., Zipf distributions) could be exploited for tighter bounds.
  • The linear regression residual model in LBGR may lack expressiveness in highly non-linear latency scenarios.
  • Output token length experiments are limited to 128; longer generation settings (e.g., 4K+) are unexplored.
Related Work

  • KV cache compression: H2O (retaining high-attention tokens), StreamingLLM (attention-sink tokens) — model-level compression approaches, complementary to the system-level eviction strategy proposed here.
  • KV cache systems: vLLM (paged virtual memory to reduce fragmentation), SGLang RadixAttention (Radix tree prefix sharing) — this paper replaces the eviction and routing modules on top of SGLang.
  • Load-balanced routing: SGLang (threshold-based switching between hit-rate and least-load routing), NVIDIA TensorRT-LLM/llm-d (static linear scoring) — all heuristic-based with no theoretical guarantees.
  • Online algorithm theory: the classical marking algorithm (Fiat et al., 1991) provides the theoretical foundation for RLT; this paper extends it to leaf-node eviction on Radix tree structures.
  • Competitive ratio analysis: this work connects online algorithm competitive analysis with LLM system optimization, exemplifying a theory-guided systems research paradigm.
  • This is the first work to jointly model cache eviction and routing in a unified framework, serving as a canonical example of theoretically driven system optimization.

Rating ⭐⭐⭐⭐⭐

  • Novelty: 5/5 — First unified formalization with optimal competitive ratio proof; solid theoretical contribution.
  • Experimental Thoroughness: 5/5 — 4 benchmarks × 3 settings × multiple models × extensive parameter sweeps; thorough ablation.
  • Writing Quality: 5/5 — Theory and experiments are well-integrated; problem motivation is clearly articulated.
  • Value: 5/5 — Only ~2ms overhead per query; directly integrable into systems such as SGLang.