CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit¶
Conference: ACL 2026 arXiv: 2510.06133 Code: N/A Area: Natural Language Processing Keywords: Diffusion Language Model, Parallel Decoding, Trace Credit, Inference Acceleration, Confidence Enhancement
TL;DR¶
CreditDecoding is a training-free parallel decoding acceleration method that accumulates token-level historical evidence (trace credit) to boost correct but low-confidence tokens, achieving up to 5.48x speedup with +0.48 accuracy gain on LLaDA-8B-Instruct.
Background & Motivation¶
Background: Diffusion large language models (dLLMs) generate text through iterative denoising, supporting bidirectional attention and parallel token prediction. Existing parallel decoding schemes confirm only high-confidence positions at each step, re-masking others for subsequent refinement.
Limitations of Prior Work: (1) Computational redundancy — models often predict the correct token many steps before it is actually decoded, but it is repeatedly re-masked due to insufficient confidence; (2) History-agnostic decisions — each decoding step is made independently of previous predictions, failing to exploit the historical consistency of token predictions.
Key Challenge: Correct tokens are repeatedly re-masked due to temporarily insufficient confidence, causing massive redundant computation; yet directly lowering the decoding threshold introduces erroneous decoding.
Goal: Design a mechanism that leverages historical prediction consistency to safely decode correct tokens early, reducing redundant iterations.
Key Insight: Analyzing denoising trajectories reveals temporal consistency in token confidence — correct tokens show persistently rising confidence across steps, providing exploitable prior information.
Core Idea: Trace credit = cross-step accumulated historical logits, fused as a prior with current logits so that correct but low-confidence tokens cross the decoding threshold earlier.
Method¶
Overall Architecture¶
CreditDecoding augments standard parallel decoding with a token-level credit scoring system: (1) record each position's predicted token and confidence at every denoising step; (2) accumulate credit scores across steps; (3) fuse credit as log-gains into current logits, boosting correct token confidence for earlier decoding.
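The three steps above can be sketched as a single decoding-step function. This is a hypothetical simplification: the decay factor `alpha`, gain weight `beta`, and the `log1p` fusion form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def credit_decoding_step(logits, credit, confirmed, tau=0.9, alpha=0.9, beta=1.0):
    """One illustrative CreditDecoding step.

    logits:    (seq_len, vocab) current-step logits
    credit:    (seq_len, vocab) accumulated trace credit
    confirmed: (seq_len,) bool mask of already-decoded positions
    """
    probs = softmax(logits)                          # (1) record current predictions
    credit = alpha * credit + probs                  # (2) accumulate trace credit across steps
    fused_logits = logits + np.log1p(beta * credit)  # (3) fuse credit as a log-gain
    fused_probs = softmax(fused_logits)
    best = fused_probs.argmax(axis=-1)
    conf = fused_probs.max(axis=-1)
    confirmed = confirmed | (conf >= tau)            # decode positions crossing the threshold
    return best, confirmed, credit
```

Under this sketch, a token whose raw confidence sits just below the threshold (say 0.85 versus τ = 0.9) would be re-masked indefinitely by pure threshold decoding, but its accumulated credit pushes the fused confidence over τ within a step or two.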
Key Designs¶
- Trace Credit:
- Function: Quantifies a token's trustworthiness based on persistent correct prediction across historical steps
- Mechanism: For each position \(i\) and candidate token \(v\), accumulates cross-step historical logits to obtain credit score \(C_t^{i,v}\). Credit reflects convergence likelihood toward high confidence, providing adaptive gain
- Design Motivation: Single-step confidence is unstable and initially low, but temporal consistency indicates correct tokens' confidence trends are predictable
- Credit-Fused Decoding:
- Function: Fuses historical credit with current logits to accelerate decoding
- Mechanism: Adds \(\log X\) gain to the target token's logit: \(\hat{l}_t^{i,v} = l_t^{i,v} + \log X\), where \(X\) is adaptively determined by trace credit. The gain enables correct tokens' posterior probability to exceed decoding threshold \(\tau\) earlier
- Design Motivation: The minimum gain formula \(X_{\min} = \frac{\tau}{1-\tau} \cdot (\frac{1}{p_t^{i,v}} - 1)\) shows that direct use of instantaneous probability yields highly sensitive gains; historically accumulated credit provides more robust gains
- Hyperparameter-Free Variant:
- Function: Provides an out-of-the-box acceleration solution
- Mechanism: Automatically determines gain parameters based on denoising progress and credit distribution, requiring no manual hyperparameter tuning
- Design Motivation: Lowers the usage barrier, enabling CreditDecoding as a universal acceleration plugin
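The sensitivity argument behind the minimum-gain formula can be checked numerically. The sketch below (function names are mine, not the paper's) verifies that adding \(\log X_{\min}\) to the target logit lifts its posterior to exactly \(\tau\), and that \(X_{\min}\) blows up as the instantaneous probability \(p_t^{i,v}\) shrinks:

```python
def x_min(tau, p):
    """Minimum gain X such that multiplying the target token's probability
    mass by X (i.e., adding log X to its logit) lifts its posterior to tau."""
    return tau / (1.0 - tau) * (1.0 / p - 1.0)

def boosted_posterior(p, X):
    """Target-token posterior after the gain: renormalize X*p against the
    unchanged remaining probability mass (1 - p)."""
    return X * p / (X * p + (1.0 - p))
```

For τ = 0.9, a token at p = 0.6 needs X = 6, while p = 0.01 needs X = 891; this steep dependence on the instantaneous probability is the instability that a historically accumulated credit smooths out.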
Loss & Training¶
CreditDecoding is a fully training-free inference-time method that only modifies the decoding strategy. It is orthogonal to existing optimizations (KV cache, operator fusion) and can be stacked.
Key Experimental Results¶
Main Results¶
LLaDA-8B-Instruct performance across 8 benchmarks
| Method | Speedup | Accuracy Change | Note |
|---|---|---|---|
| Standard Parallel Decoding | 1× | Baseline | Threshold control |
| Fast-dLLM | ~3× | Slight drop | Adaptive steps |
| CreditDecoding | 5.48× | +0.48 | Historical credit enhancement |
| CreditDecoding + KV Cache | Higher | +0.48 | Orthogonal stacking |
Ablation Study¶
| Component | Effect | Note |
|---|---|---|
| No credit (pure threshold) | Baseline | Standard parallel decoding |
| Current-step credit only | Slight speedup | No accumulation effect |
| Full trace credit | Maximum speedup | Historical accumulation is key |
| Different dLLM architectures | All effective | Strong generalizability |
Key Findings¶
- CreditDecoding achieves speedup without harming accuracy across knowledge, reasoning, and code benchmarks
- Speedup becomes more pronounced with more denoising steps — more steps mean more redundancy
- Effective across LLaDA, Dream, and other dLLM architectures
- Orthogonal to KV cache, operator fusion, and other optimizations; stackable for greater speedup
- Extensible to long-context scenarios
Highlights & Insights¶
- The "early prediction, late decoding" redundancy analysis reveals the core bottleneck of dLLM inference
- Trace credit elegantly exploits token prediction temporal consistency — simple historical accumulation yields significant speedup
- Training-free and orthogonal properties make it a practical plug-and-play tool
Limitations & Future Work¶
- Credit accumulation may not gather sufficient signal in extremely short sequences or very few steps
- The linear gain assumption in credit fusion may not be optimal for all scenarios
- Validated only on discrete-token diffusion models; applicability to continuous diffusion models remains unexplored
Related Work & Insights¶
- vs Standard Threshold Decoding: Ignores historical information; CreditDecoding leverages temporal consistency
- vs Fast-dLLM: Adjusts step scheduling; CreditDecoding optimizes at the token confidence level
- vs KV Cache: KV cache reduces computational overhead; CreditDecoding reduces redundant steps; orthogonal
Rating¶
- Novelty: ⭐⭐⭐⭐ — Trace credit concept is intuitive and effective with unique insights into dLLM inference
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four models, eight benchmarks, multiple ablations, orthogonality verification
- Writing Quality: ⭐⭐⭐⭐ — Clear analysis, intuitive visualizations
- Value: ⭐⭐⭐⭐⭐ — Provides a practical and general solution for dLLM inference acceleration