Parallel Token Prediction for Language Models

Conference: ICLR 2026
arXiv: 2512.21323
Code: GitHub
Area: Model Compression
Keywords: parallel decoding, speculative decoding, auxiliary variables, autoregressive models, inference acceleration

TL;DR

This paper proposes Parallel Token Prediction (PTP), which relocates sampling randomness from post-processing to model inputs via auxiliary variables, rendering future tokens deterministic functions of those variables and enabling joint prediction of multiple tokens in a single forward pass.

Background & Motivation

The sequential generation process of autoregressive Transformers is the primary bottleneck for inference latency: each token prediction requires one full forward pass. Existing acceleration methods each have limitations:
- Speculative decoding: employs a small draft model followed by verification, yet the draft model itself still generates sequentially.
- Independent multi-token prediction: assumes conditional independence among tokens, leading to semantic inconsistencies (e.g., generating "def numpy").
- Discrete diffusion: requires multi-step iteration with an irreducible sequential component.

The core insight of PTP is that if the random variable \(u_i \sim \mathcal{U}[0,1]\) used for sampling is supplied as model input, each token \(t_i\) becomes a deterministic function of \(u_i\) and the preceding context, allowing the model to predict all future tokens in parallel.

Method

Overall Architecture

PTP has two variants: O-PTP (predicting one-hot distributions) and C-PTP (recovering full conditional distributions), both supporting multi-token prediction in a single forward pass. Training can be performed via distillation or from scratch.

Key Designs

Auxiliary Variable Sampling Mechanism:
- Standard sampling: \(t_i = \text{Pick}(u_i, P_i)\), where \(u_i \sim \mathcal{U}[0,1]\) determines the token via inverse CDF.
- Key observation: given \(u_i\) and the preceding context, token \(t_i\) is fully determined, so \(u_i\) carries all the information in \(t_i\).
- Theorem 1: for every \(k \ge i\), \(t_k = f_P(t_{<i}; u_i, \ldots, u_k)\), i.e., each future token is a deterministic function of the known prefix and the auxiliary variables up to position \(k\).
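
The mechanism above can be illustrated with a minimal sketch. The toy model `toy_next_token_probs` is a hypothetical stand-in for an autoregressive model and is not taken from the paper; only the inverse-CDF `pick` step mirrors the \(\text{Pick}(u_i, P_i)\) operation described here. The point is that once the \(u\) values are fixed, ordinary sequential sampling produces the same tokens every time, which is the determinism that Theorem 1 relies on.

```python
import numpy as np

def pick(u, probs):
    """Inverse-CDF sampling: return the index j with u in [F_{j-1}, F_j)."""
    cdf = np.cumsum(probs)
    # side="right" finds the smallest j such that cdf[j] > u
    idx = int(np.searchsorted(cdf, u, side="right"))
    return min(idx, len(probs) - 1)  # guard against rounding at the top of the CDF

def toy_next_token_probs(context, vocab_size=4):
    """Hypothetical toy model: a fixed next-token distribution per context."""
    seed = sum((i + 1) * t for i, t in enumerate(context))
    logits = np.random.default_rng(seed).normal(size=vocab_size)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate(prefix, us):
    """Sequential sampling in which all randomness comes from the given u values."""
    tokens = list(prefix)
    for u in us:
        tokens.append(pick(u, toy_next_token_probs(tokens)))
    return tokens[len(prefix):]

us = [0.1, 0.5, 0.9]
first = generate([1, 2], us)
second = generate([1, 2], us)  # same prefix and same u values give the same tokens
```

A PTP model is trained to output the result of `generate` in one forward pass; the loop here only serves to define the target function.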
O-PTP (One-Hot PTP):
- The model receives all auxiliary variables \(u_i, \ldots, u_N\) simultaneously and predicts one-hot distributions.
- \(t_k = \arg\max(P(t_k | t_{<i}; u_i, \ldots, u_k))\)
- Advantage: efficient parallel prediction; Limitation: does not expose the underlying sampling distribution.
C-PTP (Categorical PTP):
- Theorem 2: \(P(t_k | t_{<i}, u_i, \ldots, u_{k-1}) = P(t_k | t_{<k})\)
- The full conditional probability distribution is recovered by withholding \(u_k\).
- Can be trained from scratch (inverse autoregressive training) or via distillation.
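
One way to see why withholding \(u_k\) recovers the full conditional is sketched below. It assumes that the auxiliary variables are drawn independently and that \(t_i, \ldots, t_{k-1}\) denote the tokens determined by \(u_i, \ldots, u_{k-1}\) through Theorem 1; it is a reading of the statement, not the paper's own proof.

\[
\begin{aligned}
P(t_k \mid t_{<i}, u_i, \ldots, u_{k-1})
&= P(t_k \mid t_{<k}, u_i, \ldots, u_{k-1}) && \text{(} t_i, \ldots, t_{k-1} \text{ are functions of the conditioning variables)} \\
&= P(t_k \mid t_{<k}) && \text{(} t_k = \text{Pick}(u_k, P_k),\ P_k \text{ depends only on } t_{<k},\ u_k \text{ is independent of } u_i, \ldots, u_{k-1} \text{)}
\end{aligned}
\]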
Partial Quadratic Decoding:
- Drafting and verification are executed in parallel, with branches prepared for all possible acceptance counts.
- Branch probabilities are estimated using model confidence: \(P(\#\text{correct}=m|t) \approx (1-c_{i+m})\prod_{k=i}^{i+m-1} c_k\)
- Continuation tokens are greedily allocated to high-probability branches to minimize computational waste.
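
The branch-probability estimate can be sketched as follows. `branch_probabilities` implements the product formula above, with the final branch (all draft tokens accepted) taking the remaining mass. The allocation rule in `allocate_continuations` is an assumption made for illustration only: the source says allocation is greedy toward high-probability branches but does not give a scoring rule, so the sketch uses a hypothetical diminishing-returns score \(p_m / (n_m + 1)\).

```python
import heapq

def branch_probabilities(confidences):
    """Estimate P(#correct = m) for m = 0..n from confidences c_i..c_{i+n-1}."""
    probs = []
    prefix = 1.0  # running product of c_k over the accepted tokens
    for c in confidences:
        probs.append(prefix * (1.0 - c))  # first m tokens accepted, next one rejected
        prefix *= c
    probs.append(prefix)  # every draft token accepted
    return probs

def allocate_continuations(probs, budget):
    """Greedily hand out `budget` continuation tokens across branches.

    Assumed scoring rule (not from the paper): the next token goes to the
    branch with the largest p_m / (tokens_already_allocated + 1).
    """
    alloc = [0] * len(probs)
    heap = [(-p, m) for m, p in enumerate(probs)]
    heapq.heapify(heap)
    for _ in range(budget):
        _, m = heapq.heappop(heap)
        alloc[m] += 1
        heapq.heappush(heap, (-probs[m] / (alloc[m] + 1), m))
    return alloc

probs = branch_probabilities([0.9, 0.8, 0.5])
alloc = allocate_continuations(probs, budget=4)
```

For confidences 0.9, 0.8, 0.5 the branch probabilities are roughly 0.10, 0.18, 0.36, 0.36, so most of the continuation budget goes to the branches in which two or three draft tokens are accepted.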

Loss & Training

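Distillation training: for each token, an auxiliary variable \(u_k \in [F_{k,t_k-1}, F_{k,t_k})\) is back-derived from the teacher model, where \(F_{k,j}\) is the teacher's cumulative probability up to and including vocabulary index \(j\) at position \(k\). Any \(u_k\) in this interval makes inverse-CDF sampling return \(t_k\).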
O-PTP loss: \(\mathcal{L}(\theta; t, i) = -\sum_{k=i}^N \log P_\theta(t_k | t_{<i}, u_i, \ldots, u_k)\)
C-PTP loss: \(\mathcal{L}(\theta; t, i) = -\sum_{k=i}^N \log P_\theta(t_k | t_{<i}, u_i, \ldots, u_{k-1})\)
Auxiliary variable encoding: \(\text{embed}(u) = W \cdot \text{binary}(u) + b\), mapping a float32 value to a 32-bit binary vector.
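
The two data-preparation steps above can be sketched in NumPy. Two details are assumptions rather than statements from the source: `back_derive_u` draws \(u_k\) uniformly within the CDF interval, and `binary_features` uses the IEEE-754 bit pattern of the float32 value as the 32-bit vector. The names `back_derive_u`, `binary_features`, and `embed` are hypothetical.

```python
import numpy as np

def back_derive_u(teacher_probs, token, rng):
    """Draw u from [F_{token-1}, F_{token}), the interval that inverse-CDF sampling maps to `token`.

    Assumption: u is drawn uniformly inside the interval.
    """
    cdf = np.cumsum(teacher_probs)
    lo = cdf[token - 1] if token > 0 else 0.0
    hi = cdf[token]
    return rng.uniform(lo, hi)

def binary_features(u):
    """Return the 32 bits of u as a float32 0/1 vector, most significant bit first.

    Assumption: the bits are the IEEE-754 float32 representation.
    """
    raw = np.array([u], dtype=">f4").view(np.uint8)  # 4 bytes, big-endian
    return np.unpackbits(raw).astype(np.float32)

def embed(u, W, b):
    """embed(u) = W . binary(u) + b, with W of shape (d_model, 32)."""
    return W @ binary_features(u) + b

rng = np.random.default_rng(0)
teacher_probs = np.array([0.2, 0.5, 0.3])
u = back_derive_u(teacher_probs, token=1, rng=rng)

d_model = 8  # illustrative embedding width
W = rng.normal(size=(d_model, 32)).astype(np.float32)
b = np.zeros(d_model, dtype=np.float32)
vec = embed(u, W, b)
```

Feeding the recovered `u` back through inverse-CDF sampling on `teacher_probs` returns token 1, so the student sees auxiliary variables that are consistent with the teacher's sampled tokens.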

Key Experimental Results

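Main Results (SpecBench, Vicuna-7B Distillation)

Column abbreviations: MTC = multi-turn conversation, TL = translation, SUM = summarization, QA = question answering, Math = mathematical reasoning, RAG = retrieval-augmented generation.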

Method	MTC	TL	SUM	QA	Math	RAG	Avg. #accepted
O-PTP	2.77	-	-	-	-	-	4.2
Autoregressive baseline	-	-	-	-	-	-	~2.0
Independent prediction	-	-	-	-	-	-	~3.5

Metric	Ours (O-PTP)	Note
Wall-clock speedup	2.4×	vs. standard autoregressive decoding
Accepted tokens per step	4.2	per speculative decoding step

Ablation Study

Configuration	#accepted ↑	Note
O-PTP (with auxiliary variables)	7.0 ± 0.1	coordinated across tokens
Independent prediction (no auxiliary variables)	6.2 ± 0.1	independent tokens, inconsistent pairs
C-PTP trained from scratch	PPL 19.88	close to autoregressive baseline (19.81)

Key Findings

- PTP draft models predict multiple tokens per call, shifting the optimal model size toward larger models (including direct fine-tuning of the teacher).
- Auxiliary variables introduce coordination among tokens, substantially reducing incompatible token pairs (e.g., "def numpy" drops to <1%).
- C-PTP trained from scratch achieves perplexity comparable to autoregressive models, empirically validating the theoretical expressiveness.

Highlights & Insights

- Strong theoretical contributions: Theorems 1 and 2 rigorously prove the feasibility of parallel sampling from a probabilistic perspective.
- The inverse autoregressive idea from Normalizing Flows is transferred to discrete sequence generation, representing a cross-domain innovation.
- The auxiliary variable mechanism naturally resolves the inconsistency problem inherent in independent multi-token prediction.
- Partial Quadratic Decoding leverages confidence estimates to allocate computation, offering strong practical utility.

Limitations & Future Work

- Practical speedup is bounded by model capacity: a Transformer of limited capacity can accurately predict only so many tokens in a single pass.
- Distillation requires the teacher model to back-derive auxiliary variables, incurring non-trivial training cost.
- The binary encoding of auxiliary variables may not be the optimal representation.
- Effectiveness on larger-scale models (70B+) and longer contexts remains unvalidated.

- Distinction from Medusa/EAGLE: PTP achieves inter-token coordination via auxiliary variables rather than independent multi-head prediction.
- Connection to Normalizing Flows: PTP is essentially a discrete analogue of Inverse Autoregressive Flow.
- Compatible with efficient training techniques such as GaLore and FlashAttention.

Rating

Novelty: ⭐⭐⭐⭐⭐ The auxiliary-variable parallel sampling framework constitutes an entirely novel theoretical contribution.
Experimental Thoroughness: ⭐⭐⭐⭐ Multi-task validation is provided, though large-scale model experiments are absent.
Writing Quality: ⭐⭐⭐⭐⭐ Theorem proofs are rigorous and figures are clear.
Value: ⭐⭐⭐⭐⭐ Opens a new design space for parallel token generation.