Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling¶
Conference: ACL 2025 (Outstanding Paper)
arXiv: 2504.15754
Code: None
Area: LLM/NLP
Keywords: Token Recycling, Speculative Decoding, Adjacency Matrix, BFS Tree Construction, Training-Free Acceleration
TL;DR¶
Proposes Token Recycling—a training-free speculative decoding method that stores rejected candidate tokens during the decoding process into a lightweight adjacency matrix, constructs a draft tree via a BFS algorithm, and verifies it utilizing tree attention. It achieves approximately a \(2\times\) speedup across all LLM scales with less than 2MB of storage overhead.
Background & Motivation¶
Background: The rapid growth of LLM parameter scales has made inference latency a core bottleneck. Speculative decoding provides lossless acceleration through a "draft-then-verify" paradigm and has become a mainstream direction for LLM inference acceleration. Existing methods fall into two main categories: draft-model-based methods (such as Medusa and EAGLE) which require additional training, and retrieval-based methods (such as REST) which rely on large n-gram retrieval databases.
Limitations of Prior Work: Draft-model-based methods require training small models as drafters, incurring high training costs and needing separate training for each target model. Retrieval-based methods (such as REST) require gigabyte-scale storage to build domain-specific n-gram datastores, which suffer from slow retrieval and poor cross-domain adaptability. Both approaches introduce significant resource overheads, limiting the widespread application of speculative decoding in practical deployments.
Key Challenge: The core value of speculative decoding lies in "zero-cost" inference acceleration, but existing methods introduce significant training or storage costs themselves—the overhead of the acceleration method offsets its benefits, making it counterproductive especially in small-scale deployment scenarios.
Goal: Design a truly training-free, extremely low-storage speculative decoding method that: (1) requires no additional draft model or training processes; (2) requires no pre-built large retrieval datastores; (3) has minimal storage overhead (\(<2\text{MB}\)); and (4) adaptively adjusts to the current task and domain.
Key Insight: Candidate tokens sampled during decoding but ultimately rejected are "incorrect" for the current step, yet they contain rich next-token probability information—these "discarded" tokens have a high probability of appearing in future sequences. The key insight is to treat these rejected tokens as recyclable "treasure," storing their successor relationships via an adjacency matrix to provide high-quality draft candidates for subsequent steps.
Core Idea: Recycle rejected candidate tokens from the decoding process and store them in a lightweight adjacency matrix, and construct a multi-path draft tree via BFS to achieve training-free, adaptive speculative decoding acceleration.
Method¶
Overall Architecture¶
Token Recycling maintains a lightweight adjacency matrix during inference to record successor relationships between tokens. Each decoding step consists of four stages: (1) Collection Phase—stores sampled but unselected candidate tokens (next-tokens with top-\(k\) probabilities) from the current step into the adjacency matrix, recording the \(k\) most likely successor tokens and their probabilities for each token; (2) Construction Phase—starting from the currently accepted tokens, performs a breadth-first search (BFS) on the adjacency matrix to construct a draft tree containing multiple potential paths, expanding the most likely successor nodes at each layer; (3) Verification Phase—feeds the entire draft tree into the target LLM in one forward pass via tree attention, using a special tree attention mask to ensure causality, and accepts token paths consistent with the target LLM's distribution (guaranteeing losslessness); (4) Update Phase—newly generated candidate tokens after verification update the adjacency matrix again, forming a closed loop of online learning. This entire process starts from scratch, and the adjacency matrix is progressively enriched during decoding.
Key Designs¶
-
Adjacency Matrix Storage:
- Function: Records global token successor relationships with extremely low storage overhead.
- Mechanism: The matrix size is \(vocabulary\_size \times k\) (where \(k\) is the maximum number of successors kept per token), storing the most likely successor tokens and their probabilities for each token.
- Design Motivation: Requires only \(<2\text{MB}\) of storage (compared to gigabyte-scale retrieval methods), dynamically updates during decoding, and naturally adapts to the current task and domain without pre-building domain-specific databases.
-
BFS Tree Construction:
- Function: Generates high-quality, multi-path draft trees starting from the current token.
- Mechanism: Performs a breadth-first search in the adjacency matrix, expanding successor nodes ordered by probability at each layer; the generated draft tree covers multiple parallel speculative paths.
- Design Motivation: The tree depth and width can be flexibly adjusted based on the computational budget—BFS ensures high-probability paths are expanded first, balancing draft coverage with verification computational overhead.
-
Tree Attention Parallel Verification:
- Function: Verifies multiple speculative paths in a single forward pass.
- Mechanism: Feeds the entire draft tree into the target LLM using a specialized tree-attention mask, simultaneously evaluating token probabilities for all paths in one forward pass.
- Design Motivation: The natural compatibility with tree attention allows the BFS-constructed multi-path trees to be verified efficiently in parallel, eliminating the need for sequential path-by-path checks.
Loss & Training¶
Completely training-free—the entire method self-organizes at inference time. The adjacency matrix starts as empty and is updated online during the decoding process. The adjacency matrix converges to an effective state after decoding approximately 100+ tokens.
Key Experimental Results¶
Main Results¶
| Baseline Methods | Type | Token Recycling Advantage |
|---|---|---|
| vs. Training-free Methods (REST, n-gram retrieval) | Training-free | Speedup improvement +30% |
| vs. Training-based Methods (Medusa, EAGLE) | Training required | Speedup improvement +25% |
| Storage Overhead | - | \(<2\text{MB}\) (REST requires GB-scale) |
| Applicable LLM Scales | - | Achieves \(\sim 2\times\) speedup across all scales |
Speedup Across Different Tasks¶
| Task Type | Speedup |
|---|---|
| Dialogue Generation | \(\sim 2.0\times\) |
| Code Generation | \(\sim 2.0\times\) |
| Translation | \(\sim 1.8\times\) |
Ablation Study¶
| Ablation Dimension | Findings |
|---|---|
| Convergence Speed of Adjacency Matrix | Converges to an effective state after decoding 100+ tokens |
| BFS Tree Width/Depth | Optimal balance point exists—too wide wastes verification computation, too deep yields low matching probability |
| Cross-task Adaptability | Online updates of the adjacency matrix allow the method to automatically adapt to different task types |
| Cold Start Phase | Speedup is relatively low during the first ~100 tokens, then rapidly improves |
Key Findings¶
- Candidate tokens discarded during decoding exhibit a highly significant recurrence rate, validating the core hypothesis of "rejected token recycling."
- The online learning property of the adjacency matrix naturally endows the method with domain adaptability, eliminating the need for pre-built domain-specific databases.
- Token Recycling is effective across all tested LLM scales (\(7\text{B} \sim 70\text{B}+\) ), maintaining a stable speedup of \(1.8\times \sim 2.2\times\).
Highlights & Insights¶
- The insight of "recycling discarded tokens" is remarkably simple yet elegant: Candidate tokens naturally produced during decoding contain rich next-token probability information but have historically been discarded. This paper turns this "trash" into "treasure" for inference acceleration.
- Training-Free + Extremely Low Storage: A \(<2\text{MB}\) adjacency matrix achieves \(2\times\) speedup. Compared to methods that require extra training for draft models or maintaining large retrieval stores, the practicality is vastly improved.
- Strong Adaptability: The adjacency matrix is updated online during decoding, naturally adapting to the current context, task, and domain.
- Seamless Integration with Tree Attention: The multi-path draft tree constructed via BFS can be efficiently verified in parallel using tree attention.
Limitations & Future Work¶
- At the start of a conversation, the adjacency matrix is empty, requiring a "cold start" period of roughly 100 tokens.
- The adjacency matrix is based on token-level local successor relationships and cannot capture long-range semantic dependencies.
- In high-temperature sampling (high-stochasticity generation) scenarios, the recurrence rate of tokens decreases, which may reduce the acceleration effect.
- The hybrid use with model-based speculative decoding methods (such as Medusa/EAGLE) has not yet been explored.
Related Work & Insights¶
- vs. Draft-model methods like Medusa/EAGLE: The latter require training a small model as a drafter, whereas Token Recycling is completely training-free and highly general.
- vs. Retrieval-augmented speculative decoding like REST: The latter requires GB-scale retrieval datastores, whereas Token Recycling only requires \(<2\text{MB}\) and updates adaptively.
- vs. N-gram matching methods: The latter rely on static n-gram tables, whereas the dynamic updates of the adjacency matrix in Token Recycling provide much stronger adaptability.
- Insights: The online learning paradigm of adjacency matrices can be extended to other scenarios requiring pattern prediction; it shares a common theme with NSA—making full use of "already computed" byproducts.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐⭐ |
| Technical Depth | ⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Practical Value | ⭐⭐⭐⭐⭐ |