CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit

Conference: ACL 2026 · arXiv: 2510.06133 · Code: N/A · Area: Natural Language Processing · Keywords: Diffusion Language Model, Parallel Decoding, Trace Credit, Inference Acceleration, Confidence Enhancement

TL;DR

CreditDecoding is a training-free parallel decoding acceleration method that accumulates token-level historical evidence (trace credit) to boost correct but low-confidence tokens, achieving up to a 5.48× speedup with a +0.48 accuracy gain on LLaDA-8B-Instruct.

Background & Motivation

Background: Diffusion large language models (dLLMs) generate text through iterative denoising, supporting bidirectional attention and parallel token prediction. Existing parallel decoding schemes confirm only high-confidence positions at each step, re-masking others for subsequent refinement.

Limitations of Prior Work: (1) Computational redundancy: the model often predicts the correct token many steps before it is actually decoded, but the token is repeatedly re-masked due to insufficient confidence; (2) History-agnostic decisions: each decoding step is made independently of previous predictions, so the historical consistency of a token's predictions goes unused.

Key Challenge: Correct tokens are repeatedly re-masked because their confidence is temporarily insufficient, causing massive redundant computation; yet simply lowering the decoding threshold introduces decoding errors.

Goal: Design a mechanism that leverages historical prediction consistency to safely decode correct tokens early, reducing redundant iterations.

Key Insight: Analyzing denoising trajectories reveals temporal consistency in token confidence — correct tokens show persistently rising confidence across steps, providing exploitable prior information.

Core Idea: Trace credit = cross-step accumulated historical logits, fused as a prior with current logits so that correct but low-confidence tokens cross the decoding threshold earlier.

Method

Overall Architecture

CreditDecoding augments standard parallel decoding with a token-level credit scoring system: (1) record each position's predicted token and confidence at every denoising step; (2) accumulate credit scores across steps; (3) fuse credit as log-gains into current logits, boosting correct token confidence for earlier decoding.
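
A minimal, illustrative PyTorch sketch of one such credit-fused decoding step is given below. Since no official code is listed, the decay-based accumulation rule and the `alpha`/`decay` parameters are assumptions made for illustration rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def credit_decode_step(logits, credit, decoded, tau=0.9, alpha=1.0, decay=0.9):
    """One denoising step with credit-fused parallel decoding (illustrative sketch).

    logits:  [seq_len, vocab] current-step logits from the dLLM
    credit:  [seq_len, vocab] trace credit accumulated over earlier steps
    decoded: [seq_len] bool   positions already confirmed in previous steps
    """
    # (1) Record & accumulate: decay old credit and add the current logits.
    #     (The paper accumulates cross-step historical logits into C_t^{i,v};
    #      the exponential-decay form here is an assumption.)
    credit = decay * credit + logits

    # (2) Fuse: add a credit-derived log-domain gain to the current logits.
    #     `alpha` is a hypothetical scale on the gain.
    fused_logits = logits + alpha * torch.log1p(F.relu(credit))

    # (3) Confirm: decode positions whose fused posterior clears the threshold
    #     tau; the remaining positions stay masked for later refinement.
    probs = F.softmax(fused_logits, dim=-1)
    conf, tokens = probs.max(dim=-1)
    newly_decoded = (conf >= tau) & ~decoded

    return tokens, newly_decoded, credit
```

In a full decoder this step would run inside the denoising loop, with already-confirmed positions clamped to their tokens and the rest re-masked for the next iteration.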

Key Designs

  1. Trace Credit:

    • Function: Quantifies how trustworthy a token is, based on how persistently it has been predicted across historical steps
    • Mechanism: For each position \(i\) and candidate token \(v\), cross-step historical logits are accumulated into a credit score \(C_t^{i,v}\). Credit reflects how likely the token is to converge to high confidence, providing an adaptive gain
    • Design Motivation: Single-step confidence is unstable and initially low, but temporal consistency makes the confidence trend of a correct token predictable
  2. Credit-Fused Decoding:

    • Function: Fuses historical credit with current logits to accelerate decoding
    • Mechanism: Adds a \(\log X\) gain to the target token's logit, \(\hat{l}_t^{i,v} = l_t^{i,v} + \log X\), where \(X\) is adaptively determined by the trace credit. The gain lets a correct token's posterior probability exceed the decoding threshold \(\tau\) earlier
    • Design Motivation: The minimum-gain bound \(X_{\min} = \frac{\tau}{1-\tau} \cdot (\frac{1}{p_t^{i,v}} - 1)\) shows that a gain computed directly from the instantaneous probability is highly sensitive to single-step noise, whereas historically accumulated credit yields a more robust gain (a short derivation of this bound is given after this list)
  3. Hyperparameter-Free Variant:

    • Function: Provides an out-of-the-box acceleration solution
    • Mechanism: Automatically determines gain parameters based on denoising progress and credit distribution, requiring no manual hyperparameter tuning
    • Design Motivation: Lowers the usage barrier, enabling CreditDecoding as a universal acceleration plugin
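
For concreteness, the minimum-gain bound quoted in the second design can be derived in one step; the derivation below is a straightforward reconstruction from the requirement that the boosted posterior reach the threshold, not text quoted from the paper. Adding \(\log X\) to the target token's logit while leaving all other logits unchanged multiplies its unnormalized probability by \(X\), so

\[
\frac{X\,p_t^{i,v}}{X\,p_t^{i,v} + \bigl(1 - p_t^{i,v}\bigr)} \;\ge\; \tau
\quad\Longleftrightarrow\quad
X \;\ge\; \frac{\tau}{1-\tau}\cdot\Bigl(\frac{1}{p_t^{i,v}} - 1\Bigr) \;=\; X_{\min}.
\]

For example, with \(\tau = 0.9\) and a current probability \(p_t^{i,v} = 0.6\), any gain \(X \ge 6\) (roughly \(+1.8\) added in log space) is enough to confirm the token at the current step.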

Loss & Training

CreditDecoding is a fully training-free inference-time method that only modifies the decoding strategy. It is orthogonal to existing optimizations (KV cache, operator fusion) and can be stacked with them.

Key Experimental Results

Main Results

LLaDA-8B-Instruct performance across 8 benchmarks

| Method | Speedup | Accuracy Change | Note |
| --- | --- | --- | --- |
| Standard Parallel Decoding | Baseline | – | Threshold control |
| Fast-dLLM | ~3× | Slight drop | Adaptive steps |
| CreditDecoding | 5.48× | +0.48 | Historical credit enhancement |
| CreditDecoding + KV Cache | Higher | +0.48 | Orthogonal stacking |

Ablation Study

| Component | Effect | Note |
| --- | --- | --- |
| No credit (pure threshold) | Baseline | Standard parallel decoding |
| Current-step credit only | Slight speedup | No accumulation effect |
| Full trace credit | Maximum speedup | Historical accumulation is key |
| Different dLLM architectures | All effective | Strong generalizability |

Key Findings

  • CreditDecoding achieves speedup without harming accuracy across knowledge, reasoning, and code benchmarks
  • Speedup becomes more pronounced with more denoising steps — more steps mean more redundancy
  • Effective across LLaDA, Dream, and other dLLM architectures
  • Orthogonal to KV cache, operator fusion, and other optimizations; stackable for greater speedup
  • Extensible to long-context scenarios

Highlights & Insights

  • The "early prediction, late decoding" redundancy analysis reveals the core bottleneck of dLLM inference
  • Trace credit elegantly exploits token prediction temporal consistency — simple historical accumulation yields significant speedup
  • Training-free and orthogonal properties make it a practical plug-and-play tool

Limitations & Future Work

  • Credit accumulation may not gather sufficient signal in extremely short sequences or very few steps
  • The linear gain assumption in credit fusion may not be optimal for all scenarios
  • Validated only on discrete-token diffusion models; applicability to continuous diffusion models remains unexplored

Comparison with Related Methods

  • vs Standard Threshold Decoding: ignores historical information; CreditDecoding leverages temporal consistency
  • vs Fast-dLLM: adjusts step scheduling; CreditDecoding optimizes at the token-confidence level
  • vs KV Cache: KV caching reduces per-step computational overhead; CreditDecoding reduces the number of redundant steps; the two are orthogonal

Rating

  • Novelty: ⭐⭐⭐⭐ — Trace credit concept is intuitive and effective with unique insights into dLLM inference
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Four models, eight benchmarks, multiple ablations, orthogonality verification
  • Writing Quality: ⭐⭐⭐⭐ — Clear analysis, intuitive visualizations
  • Value: ⭐⭐⭐⭐⭐ — Provides a practical and general solution for dLLM inference acceleration