CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit¶
Conference: ACL 2026 arXiv: 2510.06133 Code: N/A Area: Natural Language Processing Keywords: Diffusion Language Model, Parallel Decoding, Trace Credit, Inference Acceleration, Confidence Enhancement
TL;DR¶
CreditDecoding is a training-free parallel decoding acceleration method that accumulates token-level historical evidence (trace credit) to boost correct but low-confidence tokens, achieving up to 5.48x speedup with +0.48 accuracy gain on LLaDA-8B-Instruct.
Background & Motivation¶
Background: Diffusion large language models (dLLMs) generate text through iterative denoising, supporting bidirectional attention and parallel token prediction. Existing parallel decoding schemes confirm only high-confidence positions at each step, re-masking others for subsequent refinement.
Limitations of Prior Work: (1) Computational redundancy — models often predict the correct token many steps before it is actually decoded, but it is repeatedly re-masked due to insufficient confidence; (2) History-agnostic decisions — each decoding step is made independently of previous predictions, failing to exploit the historical consistency of token predictions.
Key Challenge: Correct tokens are repeatedly re-masked due to temporarily insufficient confidence, causing massive redundant computation; yet directly lowering the decoding threshold introduces erroneous decoding.
Goal: Design a mechanism that leverages historical prediction consistency to safely decode correct tokens early, reducing redundant iterations.
Key Insight: Analyzing denoising trajectories reveals temporal consistency in token confidence — correct tokens show persistently rising confidence across steps, providing exploitable prior information.
Core Idea: Trace credit = cross-step accumulated historical logits, fused as a prior with current logits so that correct but low-confidence tokens cross the decoding threshold earlier.
Method¶
Overall Architecture¶
CreditDecoding augments standard parallel decoding with a token-level credit scoring system: (1) record each position's predicted token and confidence at every denoising step; (2) accumulate credit scores across steps; (3) fuse credit as log-gains into current logits, boosting correct token confidence for earlier decoding.
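The three steps above can be sketched as a single decoding-step function. This is a hypothetical simplification: the decay factor `alpha`, gain weight `beta`, and the `log1p` fusion form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def credit_decoding_step(logits, credit, confirmed, tau=0.9, alpha=0.9, beta=1.0):
    """One illustrative CreditDecoding step.

    logits:    (seq_len, vocab) current-step logits
    credit:    (seq_len, vocab) accumulated trace credit
    confirmed: (seq_len,) bool mask of already-decoded positions
    """
    probs = softmax(logits)                          # (1) record current predictions
    credit = alpha * credit + probs                  # (2) accumulate trace credit across steps
    fused_logits = logits + np.log1p(beta * credit)  # (3) fuse credit as a log-gain
    fused_probs = softmax(fused_logits)
    best = fused_probs.argmax(axis=-1)
    conf = fused_probs.max(axis=-1)
    confirmed = confirmed | (conf >= tau)            # decode positions crossing the threshold
    return best, confirmed, credit
```

Under this sketch, a token whose raw confidence sits just below the threshold (say 0.85 versus τ = 0.9) would be re-masked indefinitely by pure threshold decoding, but its accumulated credit pushes the fused confidence over τ within a step or two.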
Key Designs¶
- Trace Credit:
- Function: Quantifies a token's trustworthiness based on persistent correct prediction across historical steps
- Mechanism: For each position \(i\) and candidate token \(v\), accumulates cross-step historical logits to obtain credit score \(C_t^{i,v}\). Credit reflects convergence likelihood toward high confidence, providing adaptive gain
- Design Motivation: Single-step confidence is unstable and initially low, but temporal consistency indicates correct tokens' confidence trends are predictable
- Credit-Fused Decoding:
- Function: Fuses historical credit with current logits to accelerate decoding
- Mechanism: Adds \(\log X\) gain to the target token's logit: \(\hat{l}_t^{i,v} = l_t^{i,v} + \log X\), where \(X\) is adaptively determined by trace credit. The gain enables correct tokens' posterior probability to exceed decoding threshold \(\tau\) earlier
- Design Motivation: The minimum gain formula \(X_{\min} = \frac{\tau}{1-\tau} \cdot (\frac{1}{p_t^{i,v}} - 1)\) shows that direct use of instantaneous probability yields highly sensitive gains; historically accumulated credit provides more robust gains
- Hyperparameter-Free Variant:
- Function: Provides an out-of-the-box acceleration solution
- Mechanism: Automatically determines gain parameters based on denoising progress and credit distribution, requiring no manual hyperparameter tuning
- Design Motivation: Lowers the usage barrier, enabling CreditDecoding as a universal acceleration plugin
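The sensitivity argument behind the minimum-gain formula can be checked numerically. The sketch below (function names are mine, not the paper's) verifies that adding \(\log X_{\min}\) to the target logit lifts its posterior to exactly \(\tau\), and that \(X_{\min}\) blows up as the instantaneous probability \(p_t^{i,v}\) shrinks:

```python
def x_min(tau, p):
    """Minimum gain X such that multiplying the target token's probability
    mass by X (i.e., adding log X to its logit) lifts its posterior to tau."""
    return tau / (1.0 - tau) * (1.0 / p - 1.0)

def boosted_posterior(p, X):
    """Target-token posterior after the gain: renormalize X*p against the
    unchanged remaining probability mass (1 - p)."""
    return X * p / (X * p + (1.0 - p))
```

For τ = 0.9, a token at p = 0.6 needs X = 6, while p = 0.01 needs X = 891; this steep dependence on the instantaneous probability is the instability that a historically accumulated credit smooths out.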
Loss & Training¶
CreditDecoding is a fully training-free inference-time method that only modifies the decoding strategy. It is orthogonal to existing optimizations (KV cache, operator fusion) and can be stacked.
Key Experimental Results¶
Main Results¶
LLaDA-8B-Instruct performance across 8 benchmarks
| Method | Speedup | Accuracy Change | Note |
|---|---|---|---|
| Standard Parallel Decoding | 1× | Baseline | Threshold control |
| Fast-dLLM | ~3× | Slight drop | Adaptive steps |
| CreditDecoding | 5.48× | +0.48 | Historical credit enhancement |
| CreditDecoding + KV Cache | Higher | +0.48 | Orthogonal stacking |
Ablation Study¶
| Component | Effect | Note |
|---|---|---|
| No credit (pure threshold) | Baseline | Standard parallel decoding |
| Current-step credit only | Slight speedup | No accumulation effect |
| Full trace credit | Maximum speedup | Historical accumulation is key |
| Different dLLM architectures | All effective | Strong generalizability |
Key Findings¶
- CreditDecoding achieves speedup without harming accuracy across knowledge, reasoning, and code benchmarks
- Speedup becomes more pronounced with more denoising steps — more steps mean more redundancy
- Effective across LLaDA, Dream, and other dLLM architectures
- Orthogonal to KV cache, operator fusion, and other optimizations; stackable for greater speedup
- Extensible to long-context scenarios
Highlights & Insights¶
- The "early prediction, late decoding" redundancy analysis reveals the core bottleneck of dLLM inference
- Trace credit elegantly exploits token prediction temporal consistency — simple historical accumulation yields significant speedup
- Training-free and orthogonal properties make it a practical plug-and-play tool
Limitations & Future Work¶
- Credit accumulation may not gather sufficient signal in extremely short sequences or very few steps
- The linear gain assumption in credit fusion may not be optimal for all scenarios
- Validated only on discrete-token diffusion models; applicability to continuous diffusion models remains unexplored
Related Work & Insights¶
- vs Standard Threshold Decoding: Ignores historical information; CreditDecoding leverages temporal consistency
- vs Fast-dLLM: Adjusts step scheduling; CreditDecoding optimizes at the token confidence level
- vs KV Cache: KV cache reduces computational overhead; CreditDecoding reduces redundant steps; orthogonal
Rating¶
- Novelty: ⭐⭐⭐⭐ — Trace credit concept is intuitive and effective with unique insights into dLLM inference
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four models, eight benchmarks, multiple ablations, orthogonality verification
- Writing Quality: ⭐⭐⭐⭐ — Clear analysis, intuitive visualizations
- Value: ⭐⭐⭐⭐⭐ — Provides a practical and general solution for dLLM inference acceleration