CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit
Conference: ACL 2026
arXiv: 2510.06133
Code: None
Area: LLM Inference Acceleration
Keywords: Diffusion Language Models, Parallel Decoding, Trace Credit, Inference Acceleration, Confidence Enhancement
TL;DR
This paper proposes CreditDecoding, a training-free method for accelerating parallel decoding. It accumulates token-level historical evidence (Trace Credit) to boost the confidence of correct but under-confident tokens, achieving up to a 5.48× speedup on LLaDA-8B-Instruct with an accuracy gain of 0.48.
Background & Motivation
Background: Diffusion Large Language Models (dLLMs) generate text through iterative denoising, supporting bidirectional attention and parallel token prediction. Existing parallel decoding schemes only confirm high-confidence positions at each step, re-masking others for subsequent refinement.
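The baseline described above can be pictured with a minimal sketch of one threshold-based parallel decoding step. This is an illustration under stated assumptions, not the code of any specific dLLM: the `MASK` sentinel, the function names, and the fallback that commits the single most confident position when nothing clears the threshold are all hypothetical choices made here.

```python
import numpy as np

MASK = -1  # hypothetical sentinel marking a position that is still masked


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def threshold_step(tokens, logits, tau=0.9):
    """One step of confidence-threshold parallel decoding (illustrative).

    tokens: (seq_len,) int array, MASK where not yet decoded.
    logits: (seq_len, vocab) model outputs for the current step.
    """
    probs = softmax(logits)
    conf = probs.max(axis=-1)      # confidence of the top-1 token per position
    pred = probs.argmax(axis=-1)   # top-1 token per position
    masked = tokens == MASK
    accept = masked & (conf >= tau)
    # Assumed fallback: if no masked position clears tau, commit the single
    # most confident one so that every step makes progress.
    if masked.any() and not accept.any():
        best = np.where(masked, conf, -np.inf).argmax()
        accept[best] = True
    out = tokens.copy()
    out[accept] = pred[accept]     # all other masked positions stay masked
    return out
```

Positions that are predicted correctly but stay below `tau` are simply left masked and recomputed at the next step, which is the redundancy the paper targets.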
Limitations of Prior Work: (1) Computational redundancy: models often predict the correct token many steps before it is actually decoded, yet the position is repeatedly re-masked and re-predicted because its confidence stays below the threshold. (2) History-agnostic decisions: each decoding step is made independently of earlier predictions, so the consistency of a token across steps goes unused, and a temporary misprediction can make the confidence of an otherwise stable token fluctuate.
Key Challenge: Correct tokens are repeatedly re-masked due to temporary lack of confidence, causing significant redundant computation; however, directly lowering the decoding threshold introduces erroneous decoding.
Goal: Design a mechanism that utilizes historical prediction consistency to safely decode correct tokens early, reducing redundant iterations.
Key Insight: Analysis of denoising trajectories reveals that token confidence exhibits temporal consistency—the confidence of correct tokens continuously rises over multiple steps, providing exploitable prior information.
Core Idea: Trace Credit, the accumulation of a token's historical logits across denoising steps, serves as a prior that is fused with the current logits, so that correct but low-confidence tokens cross the decoding threshold earlier.
Method
Overall Architecture
CreditDecoding adds a token-level credit scoring system to standard parallel decoding: (1) Record predicted tokens and confidence at each denoising step; (2) Accumulate credit scores across steps; (3) Fuse credit into current logits as a log-gain to boost the confidence of correct tokens for earlier decoding.
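The three stages above can be sketched as follows. The summary does not give the exact update or gain formulas, so this sketch assumes a hypothetical form: credit is an exponentially decayed sum of the probability of each position's current top-1 token, and the gain is \(X = 1 + \alpha C\). The names `update_credit`, `fuse`, `first_step_above`, `decay`, and `alpha` are illustrative and not taken from the paper.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def update_credit(credit, probs, decay=0.9):
    """Decay past credit, then credit each position's current top-1 token
    with its probability. ASSUMED update rule, not the paper's exact one."""
    new_credit = decay * credit            # new array; the input is not mutated
    rows = np.arange(probs.shape[0])
    top = probs.argmax(axis=-1)
    new_credit[rows, top] += probs[rows, top]
    return new_credit


def fuse(logits, credit, alpha=1.0):
    """Add log X to every logit, with the ASSUMED gain X = 1 + alpha * credit."""
    return logits + np.log1p(alpha * credit)


def first_step_above(logits_per_step, tau, alpha=1.0, decay=0.9):
    """For each position, return the first 1-indexed step at which the fused
    confidence reaches tau (0 if it never does)."""
    seq_len, vocab = logits_per_step[0].shape
    credit = np.zeros((seq_len, vocab))
    first = np.zeros(seq_len, dtype=int)
    for step, logits in enumerate(logits_per_step, start=1):
        # (1) record predictions and (2) accumulate credit across steps
        credit = update_credit(credit, softmax(logits), decay)
        # (3) fuse credit into the current logits as a log-gain
        fused_conf = softmax(fuse(logits, credit, alpha)).max(axis=-1)
        newly = (first == 0) & (fused_conf >= tau)
        first[newly] = step
    return first


if __name__ == "__main__":
    # One position whose raw top-1 confidence is steady but below tau = 0.8.
    trajectory = [np.array([[2.0, 1.0, 0.0]])] * 3
    print(first_step_above(trajectory, tau=0.8))             # with credit
    print(first_step_above(trajectory, tau=0.8, alpha=0.0))  # credit disabled
```

In this toy trajectory the raw confidence never changes, so the plain threshold rule would never accept the token, while the accumulated credit lifts the fused confidence a little more at each step. Tokens that are predicted inconsistently gather little credit under this assumed rule and receive a correspondingly small gain.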
Key Designs
- Trace Credit:
  - Function: Quantifies the reliability of a token being consistently predicted as correct across historical steps.
  - Mechanism: Accumulates historical logits across steps for each position \(i\) and candidate token \(v\) to obtain a credit score \(C_t^{i,v}\). Credit reflects the likelihood of a candidate token converging to high confidence, providing an adaptive gain.
  - Design Motivation: Single-step confidence is unstable and low in early stages, but temporal consistency suggests that the confidence trend of correct tokens is predictable.
- Credit Fusion Decoding:
  - Function: Fuses historical credit with current logits to accelerate decoding.
  - Mechanism: Adds a gain of \(\log X\) to the target token logit: \(\hat{l}_t^{i,v} = l_t^{i,v} + \log X\), where \(X\) is adaptively determined by Trace Credit. This gain allows the posterior probability of correct tokens to exceed the decoding threshold \(\tau\) earlier.
  - Design Motivation: The minimum gain needed to reach the threshold is \(X_{\min} = \frac{\tau}{1-\tau} \left(\frac{1}{p_t^{i,v}} - 1\right)\), where \(p_t^{i,v}\) is the token's current probability. Because \(X_{\min}\) grows rapidly as \(p_t^{i,v}\) shrinks, a gain based solely on instantaneous probability is highly sensitive to step-to-step fluctuations; accumulated historical credit provides a more robust gain.
- Parameter-Free Variant:
  - Function: Provides an out-of-the-box acceleration solution.
  - Mechanism: Automatically determines gain parameters based on denoising progress and credit distribution, eliminating the need for manual hyperparameter tuning.
  - Design Motivation: Reduces the barrier to entry, allowing CreditDecoding to serve as a general-purpose acceleration plugin.
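As a worked check of the minimum-gain formula quoted under Credit Fusion Decoding: adding \(\log X\) to one logit multiplies that token's unnormalized probability by \(X\), so its probability becomes \(\hat{p} = \frac{Xp}{Xp + (1 - p)}\) with \(p = p_t^{i,v}\). Requiring \(\hat{p} \ge \tau\) and solving for \(X\) gives \(X \ge \frac{\tau}{1-\tau}\left(\frac{1}{p} - 1\right)\). The Python sketch below verifies this numerically; the function names are hypothetical and do not come from the paper.

```python
import math


def boosted_prob(p: float, gain: float) -> float:
    """Probability of the target token after log(gain) is added to its logit."""
    return gain * p / (gain * p + (1.0 - p))


def min_gain(p: float, tau: float) -> float:
    """Smallest gain X that lifts probability p up to the threshold tau."""
    return tau / (1.0 - tau) * (1.0 / p - 1.0)


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    tau = 0.9
    for p in (0.5, 0.25, 0.1):
        x = min_gain(p, tau)
        print(f"p={p:.2f}  X_min={x:.1f}  boosted={boosted_prob(p, x):.3f}")
```

With \(\tau = 0.9\), the required gain is 9 at \(p = 0.5\) but 81 at \(p = 0.1\), which illustrates why a gain tied only to the instantaneous probability reacts strongly to small confidence fluctuations.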
Loss & Training
CreditDecoding is a fully training-free, inference-time method that modifies only the decoding strategy. It is orthogonal to existing optimizations such as KV cache and operator fusion, and can be combined with them.
Key Experimental Results
Main Results
Performance of LLaDA-8B-Instruct on 8 Benchmarks
| Method | Speedup | Accuracy Change | Description |
|---|---|---|---|
| Standard Parallel Decoding | 1× | Baseline | Threshold control |
| Fast-dLLM | ~3× | Slight drop | Adaptive steps |
| CreditDecoding | 5.48× | +0.48 | Historical credit enhancement |
| CreditDecoding + KV Cache | Higher | +0.48 | Orthogonal superposition |
Ablation Study
| Component | Effect | Description |
|---|---|---|
| No Credit (Pure Threshold) | Baseline | Standard parallel decoding |
| Current-step Credit Only | Slight speedup | No accumulation effect |
| Full Trace Credit | Max speedup | Historical accumulation is key |
| Different dLLM Architectures | Effective | Strong generalizability |
Key Findings
- CreditDecoding achieves acceleration across knowledge, reasoning, and code benchmarks without compromising accuracy.
- The speedup grows with the number of denoising steps, since more steps mean more redundant re-prediction.
- The method is effective across different dLLM architectures such as LLaDA and Dream.
- It is orthogonal to optimizations like KV cache and operator fusion, allowing for greater cumulative speedups.
- The method extends to long-context scenarios.
Highlights & Insights
- The redundancy analysis of "early prediction, late decoding" reveals a core bottleneck in dLLM inference.
- Trace Credit is an elegant utilization of temporal consistency in token prediction—simple historical accumulation significantly accelerates the process.
- Training-free and orthogonal characteristics make it a practical, plug-and-play tool.
Limitations & Future Work
- Credit accumulation may not gather enough signal for extremely short sequences or when very few denoising steps are used.
- The linear gain assumption for credit fusion may not be optimal for all scenarios.
- Only validated on discrete token diffusion models; applicability to continuous diffusion models remains unexplored.
Related Work & Insights
- vs Standard Threshold Decoding: Threshold decoding ignores historical information, while CreditDecoding leverages temporal consistency for acceleration.
- vs Fast-dLLM: Fast-dLLM adjusts step scheduling, whereas CreditDecoding optimizes at the token confidence level.
- vs KV Cache: KV cache optimizes computational overhead, while CreditDecoding reduces redundant steps; the two are orthogonal.
Rating
- Novelty: ⭐⭐⭐⭐ Trace Credit is an intuitive and effective concept with unique insights into dLLM inference.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Validated across four models, eight benchmarks, multiple ablations, and orthogonality tests.
- Writing Quality: ⭐⭐⭐⭐ Clear analysis and intuitive visualizations.
- Value: ⭐⭐⭐⭐⭐ Provides a practical and general solution for accelerating dLLM inference.