Tetris: Optimal Draft Token Selection for Batch Speculative Decoding¶
Conference: ACL 2025
arXiv: 2502.15197
Code: GitHub
Area: LLM Efficiency
Keywords: Speculative decoding, batch inference, throughput optimization, draft token selection, LLM inference acceleration
TL;DR¶
Tetris proposes a method to dynamically select the optimal draft tokens across requests in batch speculative decoding scenarios, maximizing inference throughput under limited computational resources by greedily choosing tokens with the highest cumulative acceptance probability.
Background & Motivation¶
Speculative Decoding (SD) is an effective method to accelerate LLM inference, which rapidly generates candidate tokens using a smaller draft model and then validates them in parallel using a larger target model. However, existing methods have the following limitations:
- Limitations of Fixed Window Sizes: Traditional SD uses the same draft window size for each request, but the optimal window size varies significantly across different requests at different decoding steps (showing a long-tail distribution). A fixed window cannot adaptively adjust.
- Inadequacy of Single-Request Optimization: Most existing works optimize draft token selection solely for a single request, neglecting the resource allocation problem in multi-request batch processing scenarios.
- Resource Waste Issue: In SD, once a token is rejected, all subsequent tokens must be discarded (cascading failure). This causes substantial computational resource waste under fixed window sizes.
- Practical Demands of Service Providers: LLM inference providers need to maximize total throughput within a limited computational capacity, requiring a global optimization of resource allocation across requests.
Method¶
Overall Architecture¶
Tetris introduces a "Manager" between the draft model and the target model, which greedily selects a combination of tokens with the highest cumulative acceptance probability from the additional draft tokens generated by the draft model to send to the target model for verification. The core idea is to leverage the cascading failure characteristics of intra-sequence tokens and the independence of inter-sequence tokens to dynamically allocate longer draft windows to "easy" requests and shorter windows to "hard" requests.
Key Designs¶
-
Cross-Request Greedy Selection Algorithm: The cumulative acceptance probability of each token \((i, j)\) is defined as \(\prod p_{i, t}\). Using a max-heap data structure, the algorithm extracts the token with the highest cumulative acceptance probability from the heap at each step and adds it to the validation set \(D^*\), until the computational capacity \(C\) is fully utilized. The algorithm's time complexity is \(O(C \log N)\), implemented via the GPU's
scatter_maxoperation, incurring an overhead of less than 0.3ms. -
Intra-Sequence Cascading and Inter-Sequence Independence: Tokens within the same request have sequential dependencies (if the previous one is rejected, all subsequent tokens are invalid), whereas tokens across different requests are independent. Tetris leverages this property to prioritize selecting tokens from other requests (parallel tokens) when the risk of cascading failure is high, and continues to expand the current request's window (sequential tokens) when the risk is low.
-
Extra Draft Token Mechanism: The draft model is allowed to generate extra tokens beyond the server capacity (typically 1-2), providing more choices for Tetris. Experiments demonstrate that increasing the number of extra draft tokens consistently improves the Verification Success Rate (VSR).
Loss & Training¶
Tetris involves no extra training and is a purely inference-stage optimization strategy. In practical implementation, since the ground-truth acceptance rate cannot be obtained, the output probability of the draft model is used as a proxy metric. It is theoretically proven that:
- Step-wise Optimality (Theorem 1): Given the ground-truth acceptance rate, the algorithm produces the optimal throughput at each decoding step.
- Global Optimality (Theorem 2): Under the assumption that all tokens at the same sequence position share the same acceptance rate, Tetris achieves the globally optimal throughput.
Key Experimental Results¶
Main Results¶
Experiments are conducted under three configurations: Vicuna-68M/33B, Llama-1B/70B, and Llama-1B/405B.
| Dataset | Metric | Tetris vs. Best Baseline | Tetris vs. Standard SD | Note |
|---|---|---|---|---|
| ShareGPT (Setting 1) | Throughput | +3.50% | +6.70% | Vicuna-68M → 33B |
| Arena (Setting 1) | Throughput | +5.17% | +7.47% | Vicuna-68M → 33B |
| Tough (Setting 3) | Throughput | +5.25% | +5.25% | Llama-1B → 405B |
| Tough (Setting 1) | End-to-End Latency | +5.47% | +9.32% | Most significant latency reduction |
Ablation Study¶
| Configuration | Key Metric | Note |
|---|---|---|
| 0 extra draft tokens | VSR Baseline | Equivalent to standard SD |
| 1 extra draft token | VSR Gain ~2-4% | Most cost-effective |
| 2 extra draft tokens | Further VSR Gain | Diminishing returns |
| 3 extra draft tokens | Slight VSR Gain | Sub-optimal due to sequential pipeline, as additional drafting time offsets some gains |
Key Findings¶
- The distribution of the optimal draft window size per step is flat and long-tailed (Figure 2), validating the sub-optimality of fixed windows.
- Under a parallelized pipeline, Tetris's TER metric can achieve up to 12.04% throughput improvement (Setting 3, Tough dataset), far exceeding the results under the current sequential pipeline.
- DSD (Dynamic Speculative Decoding) does not always outperform standard SD in practice, potentially due to inaccurate conditional acceptance rate estimation.
- Tetris is robust to the choice of different draft window sizes, consistently outperforming baselines under all configurations.
Highlights & Insights¶
- Tetris-like Shape Intuition: The selected tokens form a staircase shape (similar to Tetris), where "easy" requests get longer draft windows, visually demonstrating adaptive resource allocation.
- Theoretical Guarantees: Provides theoretical proofs for both step-wise and global optimality, rather than relying solely on empirical findings.
- Plug-and-Play: Requires no extra training and can be seamlessly integrated with existing SD methods and frameworks (e.g., vLLM).
- Forward-looking Design: The paper points out that under a parallelized pipeline (e.g., PEARL, Minions), Tetris's drafting and selection time can be fully hidden, leading to even greater potential gains.
Limitations & Future Work¶
- Dependency on Sequential Pipelines: Currently, vLLM employs a sequential pipeline, where the generation time of extra draft tokens cannot be hidden, limiting the potential of Tetris.
- Acceptance Rate Estimation Accuracy: Using draft model probabilities as a proxy metric might be inaccurate when the quality of the draft model is low.
- Simplified Assumptions: The theoretical proof of global optimality relies on the assumption that "all tokens at the same sequence position share the same acceptance rate."
- No Consideration for Request Priority: The current method solely maximizes the total throughput without considering the priority or fairness of different requests.
Related Work & Insights¶
- EAGLE-2 and MDSD also utilize greedy token selection based on draft model probabilities but operate only at the single-request level; Tetris generalizes this to the batch level.
- DSD (Liu et al., 2024d) adaptively determines a single draft window for all requests in a batch, whereas Tetris independently optimizes the window size for each request, offering finer granularity.
- The emergence of parallelized pipelines (Minions, PEARL) will further unleash the potential of Tetris, which is a direction worthy of attention.
Rating¶
| Dimension | Score (1-5) |
|---|---|
| Novelty | 4 |
| Theoretical Depth | 4 |
| Experimental Thoroughness | 4 |
| Engineering Value | 5 |
| Writing Quality | 4 |
| Overall Score | 4.2 |