Skip to content

Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective

Conference: ICLR 2026
arXiv: 2601.18999
Code: GitHub
Area: LLM Serving / KV Cache / Load Balancing
Keywords: KV Cache Eviction Policy, Randomized Algorithms, Load Balancing Routing, Multi-LLM Serving, Competitive Analysis

TL;DR

The authors propose the first unified mathematical model for KV cache-aware load balancing, designing the Randomized Leaf-node Token eviction algorithm (RLT) with an \(O(\log n)\) competitive ratio and the Learning-Based Greedy Routing (LBGR). This approach reduces latency by up to 11.96× and TTFT by 14.06× in multi-LLM serving scenarios.

Background & Motivation

Background: KV caching is a core technology for LLM inference acceleration, avoiding redundant computation by reusing key-value pairs. However, its effectiveness under memory constraints depends heavily on eviction policies.

Limitations of Prior Work: - LRU Flaws: Traditional Least Recently Used (LRU) policies are fragile under dynamic query patterns. In the worst case, they evict tokens required for the next query, causing cache hit rates to plummet. - Key Challenge: Maximizing individual LLM cache hit rates (routing similar queries to the same worker) conflicts with global load balancing (avoiding queue delays). - Heuristic Reliance: Existing systems like SGLang use fixed thresholds, and NVIDIA/llm-d uses static linear scoring, both lacking formal modeling to adapt to dynamic patterns. - Theoretical Gap: There is a significant disconnect between system design and theoretical understanding, with no unified model capturing the coupling between cache eviction and load balancing. - Coarse Modeling: Current methods often use the number of pending queries to measure congestion, ignoring actual service time variations between queries.

Method

Overall Architecture

Mechanism: The paper integrates "cache eviction" and "query routing" into a single mathematical model: \(M\) workers (LLMs) each possess a KV cache of size \(B_i\), processing an online query sequence \(Q=\{q_j\}\). Each query is assigned to a worker, subsequently altering that worker's cache state and queue load. The objective is to minimize the makespan (maximum cumulative load). This unified model reveals that the Leaf-node LRU (L-LRU) used in SGLang has an \(O(n)\) worst-case competitive ratio. Consequently, two coupled components are designed: RLT for intra-worker eviction to achieve an \(O(\log n)\) ratio, and LBGR for inter-worker routing using greedy load estimation.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    Q["Online Query Sequence q_j"] --> MODEL["Unified Cost Model<br/>Service Time Cost_ij + Queue Load P_i<br/>Goal: Minimize Makespan"]
    MODEL -->|Reveals L-LRU competitive ratio degrades to O(n)| RLT["RLT Randomized Eviction<br/>Marking set size reaches B_i+1 -> Clear<br/>Unmarked leaf nodes: Uniform Random Eviction -> O(log n)"]
    RLT -->|Robust Worker Cache Hits| LBGR["LBGR Learning Greedy Routing<br/>Estimate Ê_ij = Cost + Decayed Load + Residual<br/>Select Worker with Min Predicted Latency"]
    LBGR --> OUT["Assign Query → Service → Update Radix Tree<br/>Approximate Min Makespan / E2E Latency"]
    OUT -.Observe Real Latency E_ij.-> UPD["Update Regression θ_i Online<br/>+ Exponential Load Decay ρ"]
    UPD -.Continuous Calibration.-> LBGR

Key Designs

1. Unified Formal Model: Coupling Eviction and Routing Design Motivation: Prior systems struggle with the trade-off between grouping similar queries for hits and spreading them for balancing. The paper formalizes service time as a weighted sum of hit and miss portions: \(\text{Cost}_{ij} = \alpha_{\text{CACHED}}\cdot h_{ij} + \alpha_{\text{MISS}}\cdot(|q_j| - h_{ij}) + O_{ij}\), where \(h_{ij}\) is the number of hit tokens. Routing decisions \(x_{ij} \in \{0, 1\}\) update the cache state and queue load \(P_i\). This formalization quantifies L-LRU's fragility (\(O(n)\) competitive ratio) under unbalanced query lengths.

2. RLT (Randomized Leaf-node Token Eviction) Novelty: Deterministic LRU is vulnerable to adversarial arrival orders. Inspired by marking algorithms in online theory, RLT introduces randomness. It maintains a marking set \(T\) for accessed tokens, clearing it when it reaches \(B_i+1\). When the cache is full, a token is selected for eviction uniformly at random from the unmarked leaf nodes. This randomness prevents adversarial exploitation, reducing the competitive ratio to \(\Theta(\log(B_i - L))\), which is an exponential improvement over \(O(n)\) and theoretically optimal.

3. LBGR (Learning-Based Greedy Routing) Function: LBGR estimates the expected latency for worker \(i\) as \(\hat{E}_{ij} = \text{Cost}_{ij} + \tilde{P}_i + \theta_i^{\top}\varphi_{ij}\) and routes to the worker with the minimum value. - \(\text{Cost}_{ij}\): Tracks cache hits via a global Radix Tree. - \(\tilde{P}_i\): Simulates load decay using \(\tilde{P}_i \leftarrow \rho\cdot\tilde{P}_i\) to account for service completion. - \(\theta_i^{\top}\varphi_{ij}\): An online linear regression residual term that captures environmental fluctuations. Parameters are updated online using \((E_{ij} - \hat{E}_{ij})^2\) to adapt to workload shifts.

Key Experimental Results

Main Results

Method Median Latency Gain Median TTFT Gain Cache Hit Rate Gain Throughput Gain
vs Random+LRU 30.9× 44.49× - -
vs Cache-Aware+LRU (SOTA) 11.96× 14.06× 36.45% 36.51%
vs Cache-Aware+LRU (P95) 2.03× 2.62× - -

Ablation Study

Method P50 Latency (ms) P50 TTFT (ms) Hit Rate Overhead
Cache-Aware+LRU (Baseline) 26680 25022 23.89% -
Cache-Aware+RLT 19191 14332 26.36% +0.58ms (Eviction)
LBGR+LRU 6025 2958 33.33% +0.56ms (Routing)
LBGR+RLT 2263 1088 37.31% +2ms (Total)

Key Findings

  • Generalization: Effective across Llama-3.1-8B/70B and Mixtral-8×7B (MoE) architectures and various GPU hardware (L40, H200).
  • Robustness: In worst-case round-robin arrival sequences, LBGR+RLT maintains a massive lead (22.8× latency reduction).
  • Scalability: Performance gains over baselines increase as cache capacity grows.
  • Single-Worker Efficiency: RLT alone improves hit rates by up to 6.92× and throughput by 77.4% compared to L-LRU.
  • Efficiency: Total runtime overhead is ~2ms per query, making it highly practical for production.

Highlights & Insights

  • Sets a theoretical foundation for KV cache load balancing, closing the gap between theory and system practice.
  • RLT's \(O(\log n)\) competitive ratio is information-theoretically optimal.
  • The "Analytical Cost + Decaying Load + Online Regression" combination in LBGR avoids heavy RL or prediction models while remaining adaptive.
  • Demonstrates that theoretical insights (randomization) can solve practical system bottlenecks (LRU fragility).

Limitations & Future Work

  • Modality: Evaluated only on text; multi-modal KV caches (e.g., image tokens) may exhibit different characteristics.
  • Scale: Focused on single-domain deployment; wide-area network latency in geo-distributed scenarios is not explored.
  • Theoretical Bounds: While \(O(\log n)\) is optimal for adversarial input, specific workload distributions (e.g., Zipf) might allow for tighter bounds.
  • Generation Length: Generation tests were limited to 128 tokens; behavior in extremely long-context scenarios (4K+) needs verification.
  • KV Compression: Works like H2O and StreamingLLM compress at the model level; this work optimizes the system-level eviction policy.
  • Serving Systems: Builds upon SGLang's RadixAttention, replacing the heuristic routing and eviction modules.
  • Online Algorithms: RLT is a non-trivial application of the classic marking algorithm (Fiat et al., 1991) to Radix tree structures.

Rating ⭐⭐⭐⭐⭐

  • Novelty: 5/5
  • Experimental Thoroughness: 5/5
  • Writing Quality: 5/5
  • Value: 5/5