vCache: Verified Semantic Prompt Caching¶
Conference: ICLR 2026 arXiv: 2502.03771 Code: GitHub | Benchmarks Area: LLM Evaluation Keywords: Semantic Caching, LLM Inference Optimization, Error-Rate Guarantee, Online Learning, Per-Embedding Threshold Authors: Luis Gaspar Schroeder, Aditya Desai, Alejandro Cuadron, Kyle Chu, Shu Liu, Mark Zhao, Stephan Krusche, Alfons Kemper, Matei Zaharia, Joseph E. Gonzalez (UC Berkeley, TU Munich, ETH Zurich, Stanford)
TL;DR¶
This paper proposes vCache — the first semantic caching system with user-defined error-rate guarantees — which employs online learning to independently estimate the optimal similarity threshold for each cached embedding. Without any pre-training, vCache achieves up to a 12.5× improvement in cache hit rate and a 26× reduction in error rate while satisfying correctness constraints.
Background & Motivation¶
High LLM inference cost: Autoregressive generation requires one forward pass per output token, so every prompt incurs substantial compute cost and latency.
Value of semantic caching: When a response for "Which city is Canada's capital?" is already cached, a semantically equivalent query such as "Where is the capital of Canada?" should reuse the same response, potentially reducing latency by up to 100×.
Fundamental limitations of static thresholds:
- Existing systems (GPTCache, Azure, AWS, etc.) apply a single, globally fixed similarity threshold \(t\) uniformly to all prompts.
- However, experiments (Figure 3) show that the similarity distributions of correct and incorrect cache hits overlap substantially.
- The optimal threshold varies dramatically across embeddings, so no single threshold can achieve low error rates and high hit rates simultaneously.
Absence of error-rate guarantees: Existing systems cannot guarantee the correctness of returned cached responses, making them difficult to trust in production environments.
Limitations of embedding fine-tuning: Fine-tuning requires supervised training, is restricted to open-source models, and generalizes poorly to out-of-distribution data.
Core Contributions¶
- The first verifiable semantic cache providing user-defined error-rate guarantees.
- An online threshold learning algorithm: no pre-training required; independently learns a threshold per cached embedding and is agnostic to the choice of embedding model.
- Empirical and theoretical demonstration that dynamic per-embedding thresholds outperform static thresholds and fine-tuned embeddings.
- Release of four open-source semantic caching benchmarks (LMArena / Classification / SearchQueries / Combo).
Method¶
System Overview¶
The basic pipeline of semantic caching:
1. Embed the prompt \(x\) as \(\mathcal{E}(x) \in \mathbb{R}^d\).
2. Retrieve the nearest neighbor from the vector store: \(\text{nn}(x) = \arg\max_{y \in C} \text{sim}(\mathcal{E}(x), \mathcal{E}(y))\).
3. Compute the similarity \(s(x) = \text{sim}(\mathcal{E}(x), \mathcal{E}(\text{nn}(x)))\).
4. Decide: if \(s(x) \ge t\), exploit (return the cached response); otherwise, explore (invoke the LLM).
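A minimal sketch of this pipeline in Python, assuming the caller supplies `embed_fn` (the embedding model \(\mathcal{E}\)) and `llm_fn` (the fallback LLM call); the fixed `threshold` here is the static-\(t\) baseline that vCache replaces:

```python
import numpy as np

class SemanticCache:
    """Minimal static-threshold semantic cache (the baseline vCache improves on)."""

    def __init__(self, embed_fn, llm_fn, threshold: float = 0.85):
        self.embed_fn = embed_fn          # E(x): prompt -> vector in R^d
        self.llm_fn = llm_fn              # LLM call for cache misses
        self.threshold = threshold        # global static threshold t
        self.embeddings: list[np.ndarray] = []
        self.responses: list[str] = []

    def lookup(self, prompt: str) -> str:
        e = self.embed_fn(prompt)
        e = e / np.linalg.norm(e)                     # unit-normalize for cosine sim
        if self.embeddings:
            sims = np.stack(self.embeddings) @ e      # cosine sim via dot product
            i = int(np.argmax(sims))                  # nn(x)
            if sims[i] >= self.threshold:             # s(x) >= t -> exploit
                return self.responses[i]
        response = self.llm_fn(prompt)                # explore: invoke the LLM
        self.embeddings.append(e)                     # cache new (embedding, response)
        self.responses.append(response)
        return response
```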
Data Model¶
The cache stores triples:

\[
C = \{\, (\mathcal{E}(x_i),\; r(x_i),\; \mathcal{O}(x_i)) \,\},
\]

where \(r(x_i)\) is the cached response and the per-embedding metadata \(\mathcal{O}(x_i)\) records the similarity scores and match indicators for all subsequent prompts whose nearest neighbor is \(x_i\):

\[
\mathcal{O}(x_i) = \{\, (s(x_j),\; \mathbb{1}[\, r(x_j) \equiv r(x_i) \,]) \;:\; \text{nn}(x_j) = x_i \,\}.
\]
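One way to mirror this data model in code (the field names are my own, not the paper's):

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    similarity: float   # s(x_j): similarity of a later prompt to this embedding
    is_match: bool      # 1[r(x_j) == r(x_i)]: was the cached response equivalent?

@dataclass
class CacheEntry:
    embedding: list[float]   # E(x_i)
    response: str            # r(x_i)
    observations: list[Observation] = field(default_factory=list)  # O(x_i)
```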
User Guarantee¶
The user specifies a maximum error rate \(\delta\), and vCache guarantees:

\[
\Pr[\text{returned response is correct}] \ge 1 - \delta.
\]

Decomposing the correctness probability into two mutually exclusive events (exploring, which always yields a fresh and hence correct LLM response, or exploiting with a correct cached response):

\[
\Pr[\text{correct}] = \Pr[\text{explore}] + \Pr[\text{exploit}] \cdot \Pr[\text{correct} \mid \text{exploit}].
\]

Writing \(p = \Pr[\text{correct} \mid \text{exploit}]\), this yields a lower bound on the exploration probability (computed in the sketch below):

\[
\Pr[\text{explore}] \;\ge\; \frac{1 - \delta - p}{1 - p}.
\]
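A quick sanity check of this bound in Python (the function name is my own):

```python
def min_explore_prob(delta: float, p_correct_given_exploit: float) -> float:
    """Smallest exploration probability tau satisfying the guarantee,
    derived from: tau + (1 - tau) * p >= 1 - delta."""
    p = p_correct_given_exploit
    if p >= 1.0 - delta:   # cached response is already safe enough; no forced exploration
        return 0.0
    return (1.0 - delta - p) / (1.0 - p)

# Example: delta = 1% with an estimated 95% exploit correctness
# forces exploration with probability at least 0.04 / 0.05 = 0.8.
print(min_explore_prob(0.01, 0.95))  # ~0.8
```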
Sigmoid Parameterization¶
For each cached embedding, a sigmoid function models the probability that the cached response is correct as a function of similarity:

\[
\hat{P}_{t,\gamma}(\text{correct} \mid s) = \frac{1}{1 + e^{-\gamma (s - t)}},
\]

where \(t\) is the embedding-specific decision boundary and \(\gamma\) controls the steepness. The parameters are estimated via maximum likelihood, i.e., by minimizing the binary cross-entropy loss over the observations \((s_j, \ell_j) \in \mathcal{O}(x_i)\):

\[
(\hat{t}, \hat{\gamma}) = \arg\min_{t,\gamma} \; -\!\!\sum_{(s_j, \ell_j) \in \mathcal{O}(x_i)} \Big[ \ell_j \log \hat{P}_{t,\gamma}(s_j) + (1 - \ell_j) \log\big(1 - \hat{P}_{t,\gamma}(s_j)\big) \Big].
\]
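A minimal sketch of this fit using `scipy.optimize` (the paper does not specify the optimizer; Nelder-Mead and the initial guess are my assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(sims: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    """Fit (t_hat, gamma_hat) by minimizing binary cross-entropy over one
    embedding's observations O(x_i)."""
    def bce(params):
        t, gamma = params
        p = 1.0 / (1.0 + np.exp(-gamma * (sims - t)))
        p = np.clip(p, 1e-9, 1.0 - 1e-9)          # guard against log(0)
        return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    result = minimize(bce, x0=np.array([0.8, 10.0]), method="Nelder-Mead")
    t_hat, gamma_hat = result.x
    return float(t_hat), float(gamma_hat)

# Toy observations: matches (label 1) cluster at high similarity.
sims = np.array([0.70, 0.75, 0.80, 0.85, 0.90, 0.95])
labels = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
print(fit_sigmoid(sims, labels))   # boundary t_hat lands between 0.80 and 0.85
```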
Confidence Interval Calibration¶
Because the fit is based on finitely many samples, plugging the point estimates \(\hat{t}, \hat{\gamma}\) in directly cannot guarantee the error bound. vCache therefore adopts a pessimistic estimate: it takes conservative values \(t'(\epsilon), \gamma'(\epsilon)\) from a \((1-\epsilon)\) confidence band around the fit and uses them to bound the exploration probability \(\hat{\tau}\) conservatively, accounting for estimation uncertainty.
Decision Rule (Algorithm 2)¶
- Compute the embedding and retrieve the nearest neighbor \(y = \text{nn}(x)\).
- Fit a sigmoid to \(\mathcal{O}(y)\) to obtain \(\hat{t}, \hat{\gamma}\).
- Sweep over \(\epsilon \in (0,1)\) to compute \(\hat{\tau}\).
- Sample \(u \sim \text{Uniform}(0,1)\).
- If \(u \le \hat{\tau}\): explore (invoke the LLM and update \(\mathcal{O}(y)\)).
- Otherwise: exploit (return \(r(y)\)).
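Putting Algorithm 2 together as a hedged sketch, reusing `fit_sigmoid` and `min_explore_prob` from above and an `entry` with `observations` as in the data-model sketch. The `conservative_params` helper is a placeholder of my own for the paper's confidence-band calibration, and the way \(\epsilon\) is folded into \(\hat{\tau}\) is an assumption, not the paper's exact rule:

```python
import random
import numpy as np

def conservative_params(sims, labels, eps):
    """Placeholder for the (1 - eps) confidence-band step: shift the fitted
    boundary up by a Hoeffding-style margin (my assumption, not the paper's)."""
    t_hat, gamma_hat = fit_sigmoid(sims, labels)
    margin = np.sqrt(np.log(1.0 / eps) / (2 * len(sims)))
    return t_hat + margin, gamma_hat

def decide(entry, sim: float, delta: float) -> str:
    """Explore/exploit decision for a prompt whose nearest neighbor is `entry`."""
    obs = entry.observations
    if len(obs) < 2:                       # cold start: nothing to fit, always explore
        return "explore"
    sims = np.array([o.similarity for o in obs])
    labels = np.array([float(o.is_match) for o in obs])
    tau_hat = 1.0
    for eps in np.linspace(0.01, 0.99, 25):               # sweep eps in (0, 1)
        t_c, g_c = conservative_params(sims, labels, eps)
        p = 1.0 / (1.0 + np.exp(-g_c * (sim - t_c)))      # pessimistic P(correct | exploit)
        tau_eps = eps + (1.0 - eps) * min_explore_prob(delta, p)  # assumed combination
        tau_hat = min(tau_hat, tau_eps)
    return "explore" if random.random() <= tau_hat else "exploit"
```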
Theoretical guarantee (Theorem 4.1): Under the i.i.d. assumption and correct sigmoid specification, for any prompt \(x\) and any time step \(n\), the probability that vCache returns an incorrect response is at most \(\delta\):

\[
\Pr[\text{response to } x \text{ at step } n \text{ is incorrect}] \le \delta.
\]
Key Experimental Results¶
Experimental Setup¶
- Embedding models: GTE-large-en-v1.5, E5-large-v2, OpenAI text-embedding-3-small
- LLMs: Llama-3-8B-Instruct, GPT-4o-mini
- Vector store: HNSW index with cosine similarity (see the sketch after this list)
- Hardware: Intel Xeon Platinum 8570 CPU with an NVIDIA Blackwell GPU (192 GB)
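For reference, a minimal HNSW-with-cosine-similarity setup using the `hnswlib` library (the paper names HNSW but not a specific library; the 1536-d dimensionality matches text-embedding-3-small):

```python
import hnswlib
import numpy as np

dim = 1536                                    # e.g., text-embedding-3-small
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=200_000, ef_construction=200, M=16)

vectors = np.random.rand(1000, dim).astype(np.float32)   # stand-in for prompt embeddings
index.add_items(vectors, ids=np.arange(1000))

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=1)          # hnswlib returns cosine *distance*
similarity = 1.0 - distances[0][0]                       # convert back to cosine similarity
```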
Benchmark Datasets¶
| Benchmark | Size | Characteristics |
|---|---|---|
| SemCacheLMArena | 60K | Open user prompts from LM-Arena |
| SemCacheClassification | 45K | 3 classification datasets |
| SemCacheSearchQueries | 150K | Web search queries |
| SemCacheCombo | 27.5K | Mixed, including no-hit prompts |
Main Results¶
vs. static threshold (GPTCache):
- On SemCacheLMArena: up to 26× lower error rate and up to 12.5× higher cache hit rate.
- On SemCacheClassification: outperforms the static baseline across the board when \(\delta > 1.5\%\).
- At \(\delta < 1.5\%\), vCache is more conservative, prioritizing correctness by exploring more often.
Error-rate guarantee validation (Figure 4):
- vCache satisfies the error-rate constraint across all tested values of \(\delta\).
- GPTCache's error rate keeps rising as more samples arrive, exposing the weakness of static thresholds.
- vCache's cache hit rate grows steadily over time, reflecting the effect of online threshold learning.
Pareto frontier comparison (Figure 5):
- On the error-rate vs. cache-hit-rate Pareto plot, vCache dominates static-threshold approaches across all datasets.
- vCache also outperforms GPTCache equipped with fine-tuned embeddings.
vs. Fine-tuned Embedding Baseline¶
- vCache matches or exceeds the fine-tuned embedding method of Zhu et al. (2024) without any training.
- Fine-tuned embeddings generalize poorly to out-of-distribution data, while vCache's online learning naturally adapts to distribution shifts.
Highlights & Insights¶
- Formal correctness guarantee: vCache is the first semantic cache to provide a user-controllable error-rate upper bound \(\delta\), addressing the trust barrier for production-level deployment.
- Per-embedding adaptive thresholds: Figure 3 demonstrates that the optimal threshold varies substantially across embeddings, rendering a single global threshold infeasible; vCache naturally captures this variability.
- Zero-pretraining online learning: Model-agnostic, embedding-model-agnostic, and requiring no labeled data — vCache learns continuously during inference.
- Elegant probabilistic framework: The explore/exploit decision is formalized as an optimization problem under a correctness-probability constraint.
- Unified theory and practice: Combines the theoretical guarantee of Theorem 4.1 with large-scale benchmark validation.
Limitations & Future Work¶
- Long responses require LLM-as-a-judge for equivalence checking (Algorithm 1, L8), introducing additional LLM calls (though these can be executed asynchronously without impacting latency).
- Reliance on the i.i.d. assumption: If the prompt distribution shifts abruptly, the guarantee may be temporarily violated.
- Sigmoid modeling assumption: If the true correctness probability does not follow a sigmoid relationship with similarity, the theoretical analysis may not hold.
- Cold-start problem: For newly added embeddings with no historical observations, vCache defaults to a conservative strategy (full exploration).
- Limited to paraphrase-style workloads: semantic caching is inherently unsuitable for prompts requiring exact, instance-specific computation (e.g., mathematical calculation), where superficially similar prompts have different correct answers.
- Multi-turn dialogue caching not evaluated (the paper focuses on single-turn prompts).
Related Work & Insights¶
| Method | Threshold Type | Error-Rate Guarantee | Training Required | Model-Agnostic | Online Learning |
|---|---|---|---|---|---|
| GPTCache | Global static | ✗ | ✗ | ✓ | ✗ |
| Azure/AWS Cache | Global static | ✗ | ✗ | ✓ | ✗ |
| Zhu et al. 2024 | Global static | ✗ | ✓ | ✗ | ✗ |
| SCalM | Global static | ✗ | ✗ | ✓ | ✗ |
| EVM (Rudd et al.) | Per-class | ✗ | ✓ | — | ✗ |
| vCache | Per-embedding dynamic | ✓ | ✗ | ✓ | ✓ |
Inspirations and connections:
- Semantic caching + RAG: vCache can be integrated directly into RAG systems to cache retrieval results for high-frequency queries.
- Connection to conformal prediction: vCache's probabilistic guarantee framework is conceptually analogous to conformal prediction, but requires no labeled calibration set.
- Online learning paradigm: combining the explore/exploit framework with correctness guarantees is a generalizable design pattern, applicable to recommendation systems and A/B testing.
- Multimodal extension: the semantic caching principle extends naturally to image/video prompts, and vCache's per-embedding threshold mechanism applies directly.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The first semantic cache with correctness guarantees; per-embedding online threshold learning is an entirely new direction.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluated across 3 embedding models × 2 LLMs × 4 benchmarks, though real-deployment latency comparisons are absent.
- Writing Quality: ⭐⭐⭐⭐⭐ — Formally rigorous, clearly motivated, with outstanding figure design (Figures 1–3 are particularly intuitive).
- Value: ⭐⭐⭐⭐⭐ — Addresses the core trust problem in semantic caching with an elegant theoretical framework and strong practical utility; fully open-sourced.