vCache: Verified Semantic Prompt Caching¶

Conference: ICLR2026 arXiv: 2502.03771 Code: GitHub | Benchmarks Area: LLM Evaluation Keywords: Semantic Caching, LLM Inference Optimization, Error-Rate Guarantee, online learning, Per-Embedding Threshold Authors: Luis Gaspar Schroeder, Aditya Desai, Alejandro Cuadron, Kyle Chu, Shu Liu, Mark Zhao, Stephan Krusche, Alfons Kemper, Matei Zaharia, Joseph E. Gonzalez (UC Berkeley, TU Munich, ETH Zurich, Stanford)

TL;DR¶

This paper proposes vCache — the first semantic caching system with user-defined error-rate guarantees — which employs online learning to independently estimate the optimal similarity threshold for each cached embedding. Without any pre-training, vCache achieves up to a 12.5× improvement in cache hit rate and a 26× reduction in error rate while satisfying correctness constraints.

Background & Motivation¶

High LLM inference cost: Each prompt requires multiple forward passes, making deployment expensive and latency-intensive.

Value of semantic caching: When a response for "Which city is Canada's capital?" is already cached, a semantically equivalent query such as "加拿大首都是哪里？" should reuse the same response, potentially reducing latency by up to 100×.

Fundamental limitations of static thresholds: - Existing systems (GPTCache, Azure, AWS, etc.) employ a globally fixed threshold \(t\) applied uniformly to all prompts. - However, experiments (Figure 3) show that the similarity distributions of correct and incorrect cache hits overlap substantially. - The optimal threshold varies dramatically across embeddings, making a single threshold incapable of simultaneously achieving low error rates and high hit rates.

Absence of error-rate guarantees: Existing systems cannot guarantee the correctness of returned cached responses, making them difficult to trust in production environments.

Limitations of embedding fine-tuning: Fine-tuning requires supervised training, is restricted to open-source models, and generalizes poorly to out-of-distribution data.

Core Contributions¶

The first verifiable semantic cache providing user-defined error-rate guarantees.
An online threshold learning algorithm: no pre-training required; independently learns a threshold per cached embedding and is agnostic to the choice of embedding model.
Empirical and theoretical demonstration that dynamic per-embedding thresholds outperform static thresholds and fine-tuned embeddings.
Release of four open-source semantic caching benchmarks (LMArena / Classification / SearchQueries / Combo).

Method¶

System Overview¶

The basic pipeline of semantic caching: 1. Embed prompt \(x\) as \(\mathcal{E}(x) \in \mathbb{R}^d\). 2. Retrieve the nearest neighbor from the vector store: \(\text{nn}(x) = \arg\max_{y \in C} \text{sim}(\mathcal{E}(x), \mathcal{E}(y))\). 3. Compute similarity \(s(x) = \text{sim}(\mathcal{E}(x), \mathcal{E}(\text{nn}(x)))\). 4. Decision: if \(s(x) \ge t\), exploit (return the cached response); otherwise, explore (invoke the LLM).

Data Model¶

The cache stores triples:

\[\mathcal{D} = \{(\mathcal{E}(x_i), r(x_i), \mathcal{O}(x_i))\}_{i=0}^{n}\]

where the per-embedding metadata \(\mathcal{O}(x_i)\) records the similarity scores and match indicators for all subsequent prompts whose nearest neighbor is \(x_i\):

\[\mathcal{O}(x_i) = \{(s(x_j), c(x_j)) \mid \text{nn}(x_j) = x_i\}_{j=i+1}^{n}\]

\[c(x) = \begin{cases} 1 & \text{if } r(\text{nn}(x)) = r(x) \\ 0 & \text{otherwise} \end{cases}\]

User Guarantee¶

The user specifies a maximum error rate \(\delta\), and vCache guarantees:

\[\Pr(\mathbf{vCache}(x) = r(x)) \ge 1 - \delta\]

Decomposing the correctness probability into two mutually exclusive events:

\[\Pr(\mathbf{vCache}(x) = r(x)) = \Pr(\text{explore} \mid x, \mathcal{D}) + (1 - \Pr(\text{explore} \mid x, \mathcal{D})) \cdot \Pr(c(x) = 1 \mid x, \mathcal{D})\]

This yields a lower bound on the exploration probability:

\[\Pr(\text{explore} \mid x, \mathcal{D}) \ge \frac{(1-\delta) - \Pr(c(x)=1 \mid x, \mathcal{D})}{1 - \Pr(c(x)=1 \mid x, \mathcal{D})} = \tau_{\text{nn}(x)}(s(x))\]

Sigmoid Parameterization¶

For each embedding, a sigmoid function models the relationship between similarity and correctness:

\[\Pr(c(x) = 1 \mid x, \mathcal{D}) = \mathcal{L}(s(x), t, \gamma) = \frac{1}{1 + e^{-\gamma(s(x) - t)}}\]

where \(t\) is the embedding-specific decision boundary and \(\gamma\) controls the steepness. Parameters are estimated via maximum likelihood (binary cross-entropy loss):

\[\hat{t}, \hat{\gamma} = \arg\min_{t, \gamma} \mathbb{E}_{(s,c) \in \mathcal{O}_{\text{nn}(x)}} \left[ c \cdot \log(\mathcal{L}(s, t, \gamma)) + (1-c) \cdot \log(1 - \mathcal{L}(s, t, \gamma)) \right]\]

Confidence Interval Calibration¶

Due to finite samples, directly using \(\hat{t}, \hat{\gamma}\) cannot guarantee exactness. vCache adopts a pessimistic estimate — a conservative value \(t'(\epsilon), \gamma'(\epsilon)\) from the \((1-\epsilon)\) confidence band — to compute an upper bound on the exploration probability:

\[\hat{\tau} = \min_{\epsilon \in (0,1)} \frac{(1-\delta) - (1-\epsilon)\mathcal{L}(s(x), t'(\epsilon), \gamma'(\epsilon))}{1 - (1-\epsilon)\mathcal{L}(s(x), t'(\epsilon), \gamma'(\epsilon))} \ge \tau_{\text{nn}(x)}(s(x))\]

Decision Rule (Algorithm 2)¶

Compute the embedding and retrieve the nearest neighbor \(y = \text{nn}(x)\).
Fit a sigmoid to \(\mathcal{O}(y)\) to obtain \(\hat{t}, \hat{\gamma}\).
Sweep over \(\epsilon \in (0,1)\) to compute \(\hat{\tau}\).
Sample \(u \sim \text{Uniform}(0,1)\).
If \(u \le \hat{\tau}\): explore (invoke the LLM and update \(\mathcal{O}(y)\)).
Otherwise: exploit (return \(r(y)\)).

Theoretical guarantee (Theorem 4.1): Under the i.i.d. assumption and correct sigmoid specification, for any prompt \(x\) and any time step \(n\):

\[\Pr(\mathbf{vCache}(x) = r(x) \mid \mathcal{D}) \ge 1 - \delta\]

Key Experimental Results¶

Experimental Setup¶

Embedding models: GteLargeENv1-5, E5-large-v2, OpenAI text-embedding-3-small
LLMs: Llama-3-8B-Instruct, GPT-4o-mini
Vector store: HNSW with cosine similarity
Hardware: Intel Xeon Platinum 8570 + NVIDIA Blackwell 192 GB

Benchmark Datasets¶

Benchmark	Size	Characteristics
SemCacheLMArena	60K	Open user prompts from LM-Arena
SemCacheClassification	45K	3 classification datasets
SemCacheSearchQueries	150K	Web search queries
SemCacheCombo	27.5K	Mixed, including no-hit prompts

Main Results¶

vs. static threshold (GPTCache): - On SemCacheLMArena: up to 26× lower error rate and 12.5× higher cache hit rate. - On SemCacheClassification: comprehensively outperforms the static baseline when \(\delta > 1.5\%\). - At \(\delta < 1.5\%\), vCache is more conservative (prioritizing correctness by increasing exploration).

Error-rate guarantee validation (Figure 4): - vCache consistently satisfies the error-rate constraint across all values of \(\delta\). - GPTCache's error rate continuously increases with more samples, exposing the weakness of static thresholds. - vCache's cache hit rate grows steadily over time, reflecting the online learning effect.

Pareto frontier comparison (Figure 5): - On the error rate vs. cache hit rate Pareto plot, vCache dominates static threshold approaches across all datasets. - vCache also outperforms GPTCache with fine-tuned embeddings.

vs. Fine-tuned Embedding Baseline¶

vCache matches or exceeds the fine-tuned embedding method of Zhu et al. (2024) without any training.
Fine-tuned embeddings generalize poorly to out-of-distribution data, while vCache's online learning naturally adapts to distribution shifts.

Highlights & Insights¶

Formal correctness guarantee: vCache is the first semantic cache to provide a user-controllable error-rate upper bound \(\delta\), addressing the trust barrier for production-level deployment.
Per-embedding adaptive thresholds: Figure 3 demonstrates that the optimal threshold varies substantially across embeddings, rendering a single global threshold infeasible; vCache naturally captures this variability.
Zero-pretraining online learning: Model-agnostic, embedding-model-agnostic, and requiring no labeled data — vCache learns continuously during inference.
Elegant probabilistic framework: The explore/exploit decision is formalized as an optimization problem under a correctness-probability constraint.
Unified theory and practice: Combines the theoretical guarantee of Theorem 4.1 with large-scale benchmark validation.

Limitations & Future Work¶

Long responses require LLM-as-a-judge for equivalence checking (Algorithm 1, L8), introducing additional LLM calls (though these can be executed asynchronously without impacting latency).
Reliance on the i.i.d. assumption: If the prompt distribution shifts abruptly, the guarantee may be temporarily violated.
Sigmoid modeling assumption: If the true correctness probability does not follow a sigmoid relationship with similarity, the theoretical analysis may not hold.
Cold-start problem: For newly added embeddings with no historical observations, vCache defaults to a conservative strategy (full exploration).
Applicable only to semantically similar scenarios: Semantic caching is inherently unsuitable for prompts requiring precise reasoning (e.g., mathematical computation).
Multi-turn dialogue caching not evaluated (the paper focuses on single-turn prompts).

Method	Threshold Type	Error-Rate Guarantee	Training Required	Model-Agnostic	Online Learning
GPTCache	Global static	✗	✗	✓	✗
Azure/AWS Cache	Global static	✗	✗	✓	✗
Zhu et al. 2024	Global static	✗	✓	✗	✗
SCalM	Global static	✗	✗	✓	✗
EVM (Rudd et al.)	Per-class	✗	✓	—	✗
vCache	Per-embedding dynamic	✓	✗	✓	✓

Inspirations and connections: - Semantic caching + RAG: vCache can be directly integrated into RAG systems to cache retrieval results for high-frequency queries. - Connection to conformal prediction: vCache's probabilistic guarantee framework is conceptually analogous to conformal prediction, but requires no labeled calibration set. - Online learning paradigm: Combining the explore/exploit framework with correctness guarantees is a generalizable design pattern applicable to recommendation systems and A/B testing. - Multimodal extension: The semantic caching principle extends naturally to image/video prompts, and vCache's per-embedding threshold mechanism is directly applicable.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ — The first semantic cache with correctness guarantees; per-embedding online threshold learning is an entirely new direction.
Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluated across 3 embedding models × 2 LLMs × 4 benchmarks, though real-deployment latency comparisons are absent.
Writing Quality: ⭐⭐⭐⭐⭐ — Formally rigorous, clearly motivated, with outstanding figure design (Figures 1–3 are particularly intuitive).
Value: ⭐⭐⭐⭐⭐ — Addresses the core trust problem in semantic caching with an elegant theoretical framework and strong practical utility; fully open-sourced.