Predictive Prefetching for Retrieval-Augmented Generation

Conference: ICML 2026
arXiv: 2605.17989
Code: TBD
Area: information_retrieval
Keywords: RAG, asynchronous retrieval, predictive prefetching, LLM serving, latency optimization

TL;DR

By learning "semantic precursors that appear 8–16 tokens before uncertainty" from Transformer hidden states and attention patterns, this paper transforms RAG from a synchronous blocking process into predictive asynchronous prefetching using a trio of RetrievalPredictor + ContextMonitor + QueryGenerator. On benchmarks such as HotpotQA, it reduces end-to-end latency by 43.5% and TTFT by 62.4%, while maintaining answer quality within 1% of synchronous RAG.

Background & Motivation

Background: RAG has become the mainstream solution for injecting real-time/factual knowledge into LLMs and suppressing hallucinations. However, in production deployments, retrieval itself becomes a latency bottleneck—a single external API retrieval takes 100–500 ms, and complex multi-hop queries may trigger hundreds of calls, leading to poor user experience.

Limitations of Prior Work: Current RAG retrieval is synchronous and blocking; once entropy exceeds a threshold to trigger retrieval, token generation stops completely to wait for the result. A few asynchronous methods (e.g., TeleRAG with fixed windows, PipeRAG using stale tokens as queries) merely "hide retrieval behind generation using heuristic schedules." These assume "information needs remain stable," which is fragile in real-world scenarios involving multi-domain topics or changing entity references: prefetched documents might not match the actual information need, introducing noise.

Key Challenge: A structural conflict exists between high factual precision and low latency. In synchronous architectures, quality requires tolerating multi-round retrieval latency, while low latency requires sacrificing completeness by reducing retrieval depth. Asynchronous architectures can "hide latency" but fail to correct the "mismatch between prefetched content and real information needs."

Goal: Decompose this conflict into three independent sub-problems: (1) When should a prefetch retrieval be triggered? (2) Is the accumulated context sufficient to support an effective query? (3) What should be retrieved so that it matches the information needs on the generation path?

Key Insight: The authors observe that retrieval needs do not appear out of thin air; they are pre-encoded by "semantic precursors" in generation dynamics (entropy trajectory features, attention patterns, value representation dynamics). These signals emerge 8–16 tokens before uncertainty actually explodes and can encode not just "when" it is needed, but also "what" is needed.

Core Idea: A lightweight predictor reads hidden/attention/value signals from Transformer middle layers to predict when future tokens will trigger high entropy. A context monitor then decides how many steps to wait before issuing the query. Finally, a fine-tuned T5-small generates a query oriented toward "future information needs" rather than just "restating current context," enabling true concurrency between retrieval and generation where prefetch content aligns with the generation path.

Method

Overall Architecture

The input is the LLM decoding stream, and the output remains the token sequence generated by the original LLM. A "Predict-Prefetch-Cache" pipeline is inserted into a parallel thread. For every generated token, the trio operates sequentially:

  1. RetrievalPredictor extracts hidden states (16-token sliding window), attention weights, and value vectors from mid-layers (depth 30–45%), combined with output distribution statistics, to estimate the probability \(\hat p_t \in [0,1]\) that "entropy will exceed the threshold within the next \(\Delta=10\) tokens."
  2. If \(\hat p_t\) is high, the ContextMonitor intervenes: it uses a ContextScore to select the optimal waiting steps \(k^* \in \{0,...,5\}\). At step \(t+k^*\), it uses a SufficiencyClassifier to determine "if the cache already contains the information" and a ClarityScore to judge "if the phrase is complete."
  3. If retrieval is still necessary, the QueryGenerator (fine-tuned T5-small) generates an asynchronous query based on the accumulated context \(\mathbf{c}_{t+k^*}\). Results are returned to a shared Result Cache. The generation thread is never blocked until entropy actually exceeds the threshold, at which point it pulls documents directly from the cache.
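The three steps above can be read as a gated decision cascade evaluated once per token. The following is a minimal sketch of that cascade, not the authors' implementation. The action names follow the four actions the paper uses for online adaptation (Generate/Reuse/Accumulate/Fetch), and the 0.8 sufficiency threshold, 0.7 clarity threshold, and 2-token clarity wait come from the text. The function name `choose_action`, the trigger threshold `p_trigger=0.5`, and the omission of the \(k^*\) selection phase are assumptions made for illustration.

```python
SUFFICIENCY_MIN = 0.8   # cache is reused above this score (from the text)
CLARITY_MIN = 0.7       # below this, wait for the phrase to complete (from the text)
MAX_CLARITY_WAIT = 2    # at most 2 extra tokens of waiting (from the text)


def choose_action(p_hat, sufficiency, clarity, waited, p_trigger=0.5):
    """Pick one of the four actions for the current decoding step.

    p_hat       -- predictor probability that entropy will soon exceed the threshold
    sufficiency -- SufficiencyClassifier score for the cached documents
    clarity     -- ClarityScore of the accumulated context
    waited      -- tokens already spent waiting for a clearer phrase
    p_trigger   -- hypothetical trigger threshold (not given in the source)
    """
    if p_hat < p_trigger:
        return "generate"      # no retrieval need predicted: keep decoding
    if sufficiency > SUFFICIENCY_MIN:
        return "reuse"         # cached documents already cover the need
    if clarity < CLARITY_MIN and waited < MAX_CLARITY_WAIT:
        return "accumulate"    # phrase is incomplete: wait one more token
    return "fetch"             # issue an asynchronous query


# Example trace: the monitor waits twice on an unclear phrase, then fetches.
trace = [choose_action(0.9, 0.3, 0.5, waited) for waited in range(3)]
print(trace)  # ['accumulate', 'accumulate', 'fetch']
```

Ordering the cache check before the clarity check means a cache hit never pays the waiting cost, which matches the order in which step 2 describes the two checks.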

The entire stack adds only ~62M parameters to an 8B backbone (2-layer transformer predictor ~2M + three MLP heads <0.3M + T5-small 60M). The components are jointly pre-trained with a unified multi-task loss and adapted online using policy gradient with action-specific feedback. If prefetching fails, it automatically falls back to synchronous retrieval.
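The non-blocking cache and the synchronous fallback described above can be sketched with a thread pool. This is an illustrative sketch under stated assumptions, not the paper's serving code: the class `ResultCache`, its methods, and the stand-in retriever `fake_retrieve` are hypothetical names, and the cache is keyed by the exact query string, whereas a real system would presumably match on the information need.

```python
from concurrent.futures import ThreadPoolExecutor


class ResultCache:
    """Shared cache of in-flight and finished retrievals, keyed by query string."""

    def __init__(self, retrieve, max_workers=4):
        self._retrieve = retrieve  # blocking retriever call (hypothetical)
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._futures = {}

    def prefetch(self, query):
        """Start a retrieval in the background and return immediately."""
        if query not in self._futures:
            self._futures[query] = self._pool.submit(self._retrieve, query)

    def get(self, query):
        """Called by the generation thread once entropy crosses the threshold."""
        future = self._futures.get(query)
        if future is not None:
            try:
                # Blocks only if the prefetch is still in flight.
                return future.result(), "prefetched"
            except Exception:
                pass  # failed prefetch: fall through to the synchronous path
        return self._retrieve(query), "sync_fallback"

    def close(self):
        self._pool.shutdown(wait=True)


def fake_retrieve(query):
    """Stand-in for an external retriever."""
    return [f"doc about {query}"]


cache = ResultCache(fake_retrieve)
cache.prefetch("2008 financial crisis causes")
hit = cache.get("2008 financial crisis causes")   # served from the prefetch
miss = cache.get("unpredicted need")              # nothing prefetched: sync path
cache.close()
```

In this sketch a missed or failed prefetch costs exactly one blocking retrieval, which is the same cost a synchronous pipeline would pay at that point.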

Key Designs

  1. RetrievalPredictor: predicting "when retrieval will be needed" via internal signals

    • Function: Outputs a probability \(\hat p_t\) at token \(t\), estimating if entropy \(\mathcal{H}_\tau\) will first exceed threshold \(\theta\) within \(\Delta=10\) tokens.
    • Mechanism: Concatenates hidden states \(\mathbf{H}_t\), attention matrices \(\mathbf{A}_t\), and value vectors \(\mathbf{V}_t\) from a 16-token window, passed through a 2-layer transformer encoder to get \(\mathbf{z}_t \in \mathbb{R}^{512}\). This is merged with output statistics \(\mathbf{o}_t\) (entropy, top-k margin) for a sigmoid head: \(\hat p_t = \sigma(\mathbf{W}_p \cdot [\mathbf{z}_t;\mathbf{o}_t] + b_p)\). Middle layers (e.g., layers 10–14 of Llama-3.1-8B) are chosen because they capture high-level semantic abstractions while retaining uncertainty signals.
    • Design Motivation: Pure entropy thresholds are "reactive"—one only knows to retrieve after entropy rises, which is too late. Mid-layer signals show identifiable precursors 8–16 tokens before entropy spikes (Pearson correlation 0.42 at 10-token offset). Advancing the retrieval decision into this window enables true prefetching. AUROC 0.81 vs. 0.66 using current entropy alone validates this.
  2. ContextMonitor: deciding "the most cost-effective waiting time"

    • Function: When the predictor signals a need, it avoids immediate retrieval, instead judging (a) if context is sufficient, (b) if it's in cache, and (c) phrase completeness.
    • Mechanism: Phase 1 uses ContextScore (on T5) to pick \(k^* = \arg\max_k \mathrm{ContextScore}(\mathbf{c}_{t+k})\) where \(k \in \{0, \dots, 5\}\). Phase 2 uses SufficiencyClassifier to calculate max cosine similarity between current context Contriever embedding \(\mathbf{e}_c\) and cached documents \(\mathbf{e}_d\): \(\sigma(\mathbf{W}_{\text{suff}} \cdot [\mathbf{e}_c; \max_d \cos(\mathbf{e}_c, \mathbf{e}_d)] + b_{\text{suff}})\). If >0.8, cache is reused. ClarityScore triggers a wait of up to 2 tokens if <0.7, completing phrases like "The main cause of the" to "The main cause of the 2008 financial crisis."
    • Design Motivation: Queries issued at the exact moment of trigger are often incomplete, leading to poor quality. Waiting a few steps completes phrases, disambiguates needs, and allows self-correction. "Waiting 3–4 tokens" improved Query Relevance Score by 23% (0.65 → 0.86) and avoided 21% redundant retrievals.
  3. QueryGenerator + Multi-task Joint Training + Online Contextual-Bandit Adaptation

    • Function: Uses T5-small to translate "accumulated context" into a query for future information needs; optimized via joint pre-training and online policy gradients.
    • Mechanism: The query is \(\mathbf{q} = \mathrm{T5}(\mathbf{c}_{t+k^*})\). The pre-training loss \(\mathcal{L} = \alpha\mathcal{L}_{\text{pred}} + \beta\mathcal{L}_{\text{timing}} + \gamma\mathcal{L}_{\text{suff}} + \delta\mathcal{L}_{\text{clarity}} + \epsilon\mathcal{L}_{\text{query}}\) is trained on automatically generated labels derived from the retrieval utility \(s = \mathrm{EM}_{\text{with}} - \mathrm{EM}_{\text{without}}\), i.e., the exact-match gain with versus without retrieval. Online, four actions (Generate/Reuse/Accumulate/Fetch) are modeled as a gated cascade and updated with the policy gradient \(\nabla_\phi J = \mathbb{E}_{s\sim\rho,\, a\sim\pi_\phi}[\nabla_\phi \log \pi_\phi(a|s) \cdot R(s,a)]\), where \(s\) denotes the decision state rather than the utility above. Rewards are action-specific: successful non-retrieval +0.5, cache reuse +1.0, effective retrieval +1.0, unnecessary retrieval −0.5, late retrieval −2.0.
    • Design Motivation: PipeRAG's use of stale tokens fails because "past tokens don't express future needs." Fine-tuned T5-small learns to "infer what will be asked from the precursor context," outperforming raw context and 8B LLM direct generation (QRS 0.79 vs 0.74). The contextual bandit converges quickly (AUROC 0.760 → 0.809 within 2000 queries).
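The policy-gradient update from the third design can be illustrated with a single REINFORCE step. This sketch makes a deliberate simplification: it uses one flat linear-softmax policy over the four actions, not the gated cascade with per-component reward attribution described in the paper. The reward values are the ones quoted in the text; the outcome labels, the function names, the feature vector, and the learning rate are assumptions for illustration.

```python
import numpy as np

ACTIONS = ["generate", "reuse", "accumulate", "fetch"]

# Reward values quoted in the text; the outcome keys are illustrative labels.
REWARDS = {
    "successful_non_retrieval": 0.5,
    "cache_reuse": 1.0,
    "effective_retrieval": 1.0,
    "unnecessary_retrieval": -0.5,
    "late_retrieval": -2.0,
}


def softmax(logits):
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def reinforce_step(W, state, action, reward, lr=0.1):
    """One REINFORCE update for the policy pi(a|s) = softmax(W @ s).

    For a linear-softmax policy, d log pi(a|s) / dW = outer(one_hot(a) - pi, s).
    """
    probs = softmax(W @ state)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    grad_log_pi = np.outer(one_hot - probs, state)
    return W + lr * reward * grad_log_pi


state = np.array([1.0, 0.5, -0.2])          # hypothetical decision features
W = np.zeros((len(ACTIONS), state.size))    # uniform policy to start
fetch = ACTIONS.index("fetch")

W_good = reinforce_step(W, state, fetch, REWARDS["effective_retrieval"])
W_late = reinforce_step(W, state, fetch, REWARDS["late_retrieval"])
p_good = softmax(W_good @ state)[fetch]     # rises above the uniform 0.25
p_late = softmax(W_late @ state)[fetch]     # falls below the uniform 0.25
```

Because the late-retrieval penalty (−2.0) is twice the magnitude of the effective-retrieval reward (+1.0), a single late fetch moves the policy further than a single successful one, which pushes the policy toward triggering earlier.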

Key Experimental Results

Main Results

Evaluation using Llama-3.1-8B on HotpotQA / 2WikiMultiHopQA / NQ / TriviaQA, sharing a Wikipedia corpus + FAISS IVF + Contriever (125ms median retrieval latency).

| Method | HotpotQA EM / F1 | NQ EM / F1 | TTFT (ms) | E2E (s) | Ret/1K tokens | Efficiency ↑ |
|---|---|---|---|---|---|---|
| Sync-RAG | 69.2 / 75.1 | 73.4 / 79.1 | 287 | 9.2 | 86.0 | 8.2 |
| Self-RAG | 67.8 / 73.6 | 72.1 / 77.8 | 234 | 7.8 | 72.0 | 9.4 |
| FLARE | 65.9 / 72.1 | 70.2 / 75.8 | 206 | 6.8 | 71.0 | 10.6 |
| PipeRAG | 66.8 / 72.9 | 70.1 / 75.8 | 118 | 5.6 | 66.8 | 13.0 |
| Ours | 68.7 / 75.1 | 72.5 / 78.7 | 108 | 5.2 | 59.0 | 14.4 |
| Oracle | 70.3 / 76.2 | 74.2 / 80.1 | 45 | 3.1 | 48.0 | 24.6 |

Conclusion: Compared to Sync-RAG, TTFT decreased by 62.4%, E2E decreased by 43.5%, and retrieval count per 1K tokens decreased by 31.4%. HotpotQA EM is within 0.5 points of Sync-RAG (68.7 vs. 69.2) while outperforming the other asynchronous/adaptive methods. Tail latency is robust: P95 is 33% lower than Sync-RAG.
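The headline percentages follow directly from the Sync-RAG and Ours rows of the table. A short check of that arithmetic is shown here; the dictionary keys and the helper name `relative_reduction` are illustrative.

```python
def relative_reduction(baseline, ours):
    """Percentage reduction of `ours` relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Values copied from the Sync-RAG and Ours rows of the main results table.
sync_rag = {"ttft_ms": 287, "e2e_s": 9.2, "ret_per_1k": 86.0}
ours = {"ttft_ms": 108, "e2e_s": 5.2, "ret_per_1k": 59.0}

reductions = {
    metric: round(relative_reduction(sync_rag[metric], ours[metric]), 1)
    for metric in sync_rag
}
print(reductions)  # {'ttft_ms': 62.4, 'e2e_s': 43.5, 'ret_per_1k': 31.4}
```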

Ablation Study (HotpotQA)

| Configuration | EM | F1 | TTFT (ms) | E2E (s) | Description |
|---|---|---|---|---|---|
| Full System | 68.7 | 75.1 | 108 | 5.2 | Complete system |
| w/o Async architecture | 68.4 | 74.8 | 287 | 7.8 | Removes async; TTFT ↑ 2.7x |
| w/o Retrieval Predictor | 65.1 | 71.3 | 118 | 5.8 | Reverts to entropy threshold; EM ↓ 3.6 points |
| w/o Online learning | 66.2 | 72.5 | 112 | 5.5 | Pre-training only; EM ↓ 2.5 points |
| w/o T5 query generator | 67.1 | 73.8 | 108 | 5.4 | Uses raw context as query; EM ↓ 1.6 points |

Key Findings

  • Async architecture is the main source of latency gains (TTFT increases 2.7x without it), while Retrieval Predictor is the main source of quality gains (EM drops 3.6 points without it), showing that "prediction" and "asynchrony" provide complementary benefits.
  • Gains scale with model size: with a Llama-70B backbone, predictor AUROC rises to 0.83 (versus the 0.81 reported above).
  • False positives account for 38.7% of triggers, but 72% of these are reused within the next 50 tokens, resulting in only 28% actual waste and a net latency overhead of only ~8% (factored into the 43.5% total gain).
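The false-positive figures above combine into a single share of triggers that is actually wasted. The short calculation below uses only the two percentages quoted in the text; the variable names are illustrative.

```python
false_positive_share = 0.387   # share of triggers that are false positives
reused_within_50 = 0.72        # share of those reused within the next 50 tokens

# Share of ALL triggers whose retrieved documents are never used.
wasted_share = false_positive_share * (1.0 - reused_within_50)
print(f"{wasted_share:.1%}")   # 10.8%
```

So roughly one trigger in nine leads to a retrieval that is never consumed.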

Highlights & Insights

  • "Semantic precursors appearing 8–16 tokens early" is the core insight: Rather than building a more accurate entropy estimator, the model uses mid-layer internal representations to advance the retrieval decision window. This "internal representation for external signal prediction" can be adapted to speculative decoding or tool calling.
  • Gated 4-action contextual bandit with attributed rewards is sophisticated: Assigning rewards only to the component responsible for a specific node in the decision tree avoids the instability of long-range credit assignment in RL.
  • ContextMonitor's "active waiting" breaks the reactive paradigm: Waiting 3–4 tokens to complete a phrase appears to sacrifice latency but actually improves query quality by 23% and avoids 21% redundancy.
  • The failure fallback to Sync-RAG ensures "worst-case latency = synchronous baseline," locking down the downside risk of the predictor for production use.

Limitations & Future Work

  • Dependency on internal states: The predictor needs hidden states, attention weights, and value vectors, which closed-source APIs (GPT-4/Claude) do not expose; logit-only variants lose most of the gains (AUROC 0.66).
  • OOD Adaptation: While transferable, OOD tasks still rely on online adaptation to recover 5–9 AUROC points.
  • Wasted Retrievals: 28% of false positives are never used; further compression of this overhead is a direct area for improvement.
  • Multi-hop constraints: Bridge-style questions where the second hop depends on the first hop's result remain a challenge for prediction.

Comparison with Related Work

  • vs FLARE / Self-RAG: These are "reactive synchronous" methods that block generation upon trigger; Ours advances the trigger and executes asynchronously, cutting TTFT by about 48% against FLARE (206 → 108 ms) and about 54% against Self-RAG (234 → 108 ms).
  • vs TeleRAG: TeleRAG uses fixed lookahead windows regardless of need; Ours uses \(\hat p_t\) to trigger dynamically, reducing retrieval frequency.
  • vs PipeRAG: PipeRAG uses stale tokens; Ours uses T5-small to generate future-oriented queries, leading to 14.4 vs 13.0 Efficiency.

Rating

  • Novelty: ⭐⭐⭐⭐ First complete "predictive asynchronous prefetching" scheme for RAG.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 6 benchmarks, 9 baselines, scale-up across 6 model families, and tail latency analysis.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure and quantitative explanations of trade-offs.
  • Value: ⭐⭐⭐⭐⭐ Retrieval latency is the #1 bottleneck for RAG; a 43.5% E2E reduction for <1% quality loss is highly practical for production.