SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications

Conference: NeurIPS 2025 arXiv: 2411.04975 Code: https://github.com/snowflakedb/ArcticInference Area: LLM Agent / Inference Optimization Keywords: Speculative Decoding, Suffix Tree, Agent Inference, Inference Acceleration, Training-Free

TL;DR

SuffixDecoding caches long token sequences from prompts and prior outputs in a suffix tree and reaches up to a 5.3× speedup through adaptive speculation lengths, targeting the highly predictable, repetitive inference patterns of agent workloads.

Background & Motivation

State of the Field

Background: Speculative decoding has become a standard technique for reducing LLM inference latency, with draft model + verifier pipelines widely adopted.

Limitations of Prior Work: Draft-model pipelines require training, are tuned for diverse request distributions, and fail to exploit the long repeated token sequences that dominate agent workloads.

Root Cause

Key Challenge: Conventional speculative decoding is optimized for diverse requests, whereas agent workloads involve repetitive inference (multi-agent pipelines, self-refinement loops).

Mechanism

Mechanism: Draft models must learn diverse task distributions and struggle to capture the repetitive nature of agent scenarios.

Key Challenge: Agent inference contains abundant long token sequences amenable to caching, which existing methods fail to exploit sufficiently.

Key Insight: Rather than training a draft model, use suffix trees to exactly match historical sequences and adaptively determine speculation length.

Core Idea: A suffix tree caches long token sequences from prompts and prior outputs, enabling training-free, extreme speculation.

Method

Overall Architecture

SuffixDecoding consists of two core components: (1) a suffix tree index that caches all suffixes of prompts and prior outputs, constructed in \(O(n)\) time for \(n\) cached tokens and queried in \(O(m)\) for the longest exact match to a length-\(m\) context; and (2) adaptive speculation that dynamically adjusts the speculation length based on match length and acceptance rate.
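
To make the index concrete, below is a minimal, self-contained Python sketch written under stated assumptions: it substitutes a naive suffix trie (quadratic construction) for the paper's linear-time suffix tree, drafts tokens by greedily following the most frequent historical continuation, and every name here (`SuffixSpeculationCache`, `add_sequence`, `speculate`) is hypothetical rather than taken from ArcticInference.

```python
# Illustrative sketch only: a suffix trie stands in for the paper's
# linear-time suffix tree, so construction here is O(n^2), not O(n).

class _Node:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # token id -> _Node
        self.count = 0      # how often history passed through this node


class SuffixSpeculationCache:
    def __init__(self):
        self.root = _Node()

    def add_sequence(self, tokens):
        """Index every suffix of a prompt or prior model output."""
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:]:
                node = node.children.setdefault(tok, _Node())
                node.count += 1

    def speculate(self, context, max_tokens):
        """Match the longest cached suffix of `context`, then emit up to
        `max_tokens` draft tokens via the most frequent continuations."""
        for start in range(len(context)):  # longest suffix tried first
            node = self.root
            for tok in context[start:]:
                node = node.children.get(tok)
                if node is None:
                    break  # this suffix is unseen; try a shorter one
            else:
                draft = []
                while node.children and len(draft) < max_tokens:
                    tok, node = max(node.children.items(),
                                    key=lambda kv: kv[1].count)
                    draft.append(tok)
                return draft
        return []  # no suffix of the context appears in history


# Example: index one prior output, then draft from a context whose
# suffix [13, 21] has been seen before.
cache = SuffixSpeculationCache()
cache.add_sequence([5, 8, 13, 21, 34, 55])
assert cache.speculate([2, 13, 21], max_tokens=4) == [34, 55]
```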

Key Designs

  1. Suffix Tree Index:

    • Function: Caches all token sequences from prompts and prior outputs.
    • Mechanism: Performs suffix queries on the current generation token sequence and returns subsequent tokens following the longest historical match.
    • Design Motivation: Approximately 70% of sequences in agent inference consist of predictable, repetitive patterns.
  2. Adaptive Speculation Length:

    • Function: Dynamically adjusts the number of speculated tokens based on acceptance rate.
    • Mechanism: If the recent acceptance rate rises above a threshold \(T_1\), increase the speculation length; if it falls below \(T_2\), decrease it (a minimal controller sketch follows this list).
    • Design Motivation: Maximize speculation under high certainty; remain conservative under uncertainty.
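
One simple way to realize the threshold rule in item 2 is a multiplicative increase/decrease controller. The sketch below is an illustration under assumed values: the thresholds \(T_1\) and \(T_2\), the doubling/halving steps, and the length bounds are invented here, not the paper's tuned parameters.

```python
# Hypothetical threshold controller for the speculation length; the
# constants are illustrative, not the paper's tuned values.

class AdaptiveSpecLength:
    def __init__(self, t1=0.8, t2=0.4, lo=1, hi=64):
        self.t1, self.t2 = t1, t2  # acceptance-rate thresholds (T1 > T2)
        self.lo, self.hi = lo, hi  # bounds on the speculation length
        self.length = lo           # start conservatively

    def update(self, accepted, proposed):
        """Adjust the next speculation length from the last step's outcome."""
        rate = accepted / proposed if proposed else 0.0
        if rate > self.t1:    # drafts keep landing: speculate more
            self.length = min(self.hi, self.length * 2)
        elif rate < self.t2:  # drafts keep failing: back off
            self.length = max(self.lo, self.length // 2)
        return self.length
```

Each decoding step proposes `length` draft tokens from the suffix cache; after verification, the accepted count is fed back through `update`, so the system speculates aggressively only while its drafts keep being accepted.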

Key Experimental Results

Main Results

| Method | Speedup | Notes |
| --- | --- | --- |
| EAGLE-2 | 1.9× | Model-based, requires training |
| Token Recycling | 1.9× | Training-free but limited speculation |
| Draft Model | 2.5× | Requires training |
| SuffixDecoding | 5.3× | Training-free, extreme speculation |

Workload Analysis

| Metric | SWE Workflow | Text-to-SQL | General Reasoning |
| --- | --- | --- | --- |
| Predictable sequence ratio | ~70% | ~65% | ~30% |
| Average match length | 8–12 tokens | 5–9 tokens | 1–3 tokens |
| Speculation acceptance rate | >90% | >85% | ~60% |
| Speedup | 5.3× | 4.7× | 2.1× |

Key Findings

  • Agent workloads exhibit pronounced sequence repetitiveness, clearly distinct from diverse reasoning (5.3× vs. 2.1×).
  • No model training or fine-tuning required; directly applicable to any LLM.
  • Memory overhead scales linearly with cache size and remains controllable.

Highlights & Insights

  • Revival of Suffix Trees: An innovative application of a classical data structure; training-free exact matching outperforms learned approximate speculation.
  • Advantage of Being Training-Free: Eliminates draft model training overhead, enabling plug-and-play deployment.
  • Practical Significance of 5.3× Speedup: A 5.3× cut in end-to-end latency yields roughly 5× more serving throughput from the same GPUs.

Limitations & Future Work

  • Agent-specificity: Limited benefit for diverse reasoning workloads (2.1×).
  • Cache memory pressure at large-scale deployment.
  • Interaction effects with KV cache compression, quantization, and other techniques are not discussed.
  • vs. EAGLE-2/3: These methods require training draft heads, whereas SuffixDecoding is training-free and faster in agent scenarios.
  • vs. Token Recycling: Also model-free, but its speculation depth is limited; SuffixDecoding's suffix tree caches significantly longer sequences.

Rating

  • Novelty: ⭐⭐⭐⭐ Innovative application of a classical data structure to agent inference.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on dual benchmarks: SWE-Bench and Text-to-SQL.
  • Writing Quality: ⭐⭐⭐⭐ Well-motivated with a concise solution.
  • Value: ⭐⭐⭐⭐⭐ Training-free, plug-and-play, 5.3× speedup. Code: released as part of ArcticInference (link above).