Unifying Continuous and Discrete Text Diffusion with Non-simultaneous Diffusion Processes
Conference: ACL 2025
arXiv: 2505.22165
Code: None
Area: Others
Keywords: Text diffusion models, Poisson diffusion process, non-simultaneous denoising, time predictor, continuous-discrete unification
TL;DR
This paper proposes NeoDiff, which unifies the theoretical framework of discrete and continuous text diffusion models. By introducing a dual-time framework consisting of "extrinsic time" (sentence-level diffusion progress) and "intrinsic time" (token-level diffusion progress), NeoDiff utilizes a Poisson process to independently allocate fine-grained noise levels to each token and adaptively adjusts denoising progress with a context-aware time predictor. NeoDiff outperforms existing diffusion baselines across multiple tasks, including machine translation, paraphrasing, and text simplification.
Background & Motivation
Background: Text diffusion models fall into two major paradigms:
- Discrete Diffusion (e.g., D3PM, Absorbing Diffusion): Performs state transitions independently for each token on categorical distributions. Different tokens can have different diffusion progress, but the noise control is coarse-grained (tokens are either preserved or completely corrupted).
- Continuous Diffusion (e.g., DiffuSeq, Difformer): Maps tokens into a continuous space and injects Gaussian noise. The noise control is fine-grained, but all tokens share a uniform noise level.

Limitations of Prior Work:
- The coarseness of discrete diffusion limits the benefits of multi-step generation: tokens only have "uncorrupted" and "corrupted" states, lacking fine-grained intermediate transitions.
- The uniform noise in continuous diffusion forces all tokens to stay at the same noise level, which prevents low-noise tokens from acting as context to help restore high-noise tokens.
- Existing improvements (such as mask blending in DiffuSeq-V2 or monotonic noise in AR-Diffusion) have not fully achieved token-level, fine-grained noise control.
Key Challenge: There is a need for a unified framework that enjoys both the fine-grained noise control of continuous diffusion and the varying token-level diffusion progress of discrete diffusion.
Goal:
- Achieve token-level non-simultaneous diffusion in the continuous space.
- Adaptively adjust the denoising rate of each token using contextual information during the reverse denoising process.
- Optimize the inference time schedule to improve generation quality.
Key Insight: Generalize the diffusion time variable into two dimensions: extrinsic time \(t\) (global progress) and intrinsic time \(\tau\) (token-level progress). By randomizing the intrinsic time via a Poisson process, token-level non-simultaneous diffusion is naturally achieved.
Core Idea: Unify discrete and continuous diffusion using a dual-time framework (extrinsic + intrinsic time). The forward process assigns token-level noise using a Poisson process, while the reverse process adaptively regulates denoising via a time predictor based on semantic contexts.
Method
Overall Architecture
Input text \(\to\) Word embeddings map tokens to the continuous space \(\mathbf{z}_0\) \(\to\) Poisson Forward Process: Each token independently samples an intrinsic time \(\tau_t\) and is corrupted with Gaussian noise whose intensity is set by \(\tau_t\) \(\to\) Reverse Denoising: An Encoder-Decoder Transformer predicts \(\hat{\mathbf{z}}_0\), and a time predictor predicts \(\tau_{t'}\) of each token for the next step \(\to\) Rounding back to discrete tokens \(\to\) Output text. Inference runs on an optimized extrinsic time schedule.
Key Designs
- Dual-Time Unified Framework:
    - Function: Integrate existing discrete and continuous diffusion models into a unified theory.
    - Mechanism:
        - Extrinsic time \(t \in [0,1]\): Global diffusion progress for the entire sentence.
        - Intrinsic time \(\tau_t \in [0,1]\): Independent diffusion progress for each token.
        - Discrete diffusion: \(\tau_t \in \{0, 1\}\) (a binary random function).
        - Continuous diffusion: \(\tau_t = t\) (a deterministic function, synchronized across all tokens).
        - DiffuSeq-V2: \(\tau_t = \min(t + \tau_{\text{mask}}(t), 1)\) (blending; the sum is capped at 1 so that \(\tau_t\) stays in \([0,1]\)).
        - NeoDiff: \(\tau_t \in [0,1]\), modeled as a continuous random function of \(t\).
    - Design Motivation: By generalizing the time variables, a unified theory covering all existing methods is established, with NeoDiff being the most general instantiation.
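To make the two simplest instantiations concrete, here is a minimal NumPy sketch. The function names are hypothetical, and the assumption that a discrete (absorbing-style) process corrupts each token independently with probability \(t\) is an illustrative choice, not a detail taken from the paper.

```python
import numpy as np

def tau_continuous(t, n_tokens):
    """Continuous diffusion: every token shares the same intrinsic time tau = t."""
    return np.full(n_tokens, float(t))

def tau_discrete(t, n_tokens, rng):
    """Discrete diffusion: tau is binary per token.

    Assumption for illustration: each token is fully corrupted (tau = 1)
    independently with probability t, otherwise left clean (tau = 0).
    """
    return (rng.random(n_tokens) < t).astype(float)

rng = np.random.default_rng(0)
print(tau_continuous(0.3, 6))     # identical values for all tokens
print(tau_discrete(0.3, 6, rng))  # a mix of 0s and 1s
```

The continuous case gives fine-grained but synchronized noise levels, while the discrete case gives per-token variation but only two levels; NeoDiff's intrinsic time is meant to combine both properties.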
- Poisson Forward Diffusion Process:
    - Function: Independently sample fine-grained diffusion progress for each token.
    - Mechanism:
        - Introduce a discrete state function \(s_t \in \{0, 1, ..., s_{\max}\}\), corresponding to the range from clean to maximum noise.
        - The evolution of \(s_t\) follows a Poisson process: \(s_t \sim \text{Poisson}(\lambda(t))\), where \(\lambda(t) = ks_{\max}t\).
        - Normalize and clip: \(\tau_t = \text{Clip}(s_t / s_{\max}, 1)\).
        - Key variance control issue: A Poisson variable has coefficient of variation \(\text{CV} = 1/\sqrt{\lambda(t)} \propto 1/\sqrt{s_{\max}}\). When \(s_{\max}\) is very large, \(\text{CV} \to 0\), so the \(\tau\) values of all tokens converge and the process degenerates into continuous diffusion.
        - Solution: A variance rescaling transformation:
          \[\tau_t = \text{Clip}\left(\text{Round}\left(\frac{s_t - \lambda(t)}{\sqrt{\lambda(t)}} \sigma(t) + \lambda(t)\right), s_{\max}\right) / s_{\max}\]
          Setting \(\sigma(t) = \lambda(t)\) guarantees that discrete characteristics are independent of \(s_{\max}\): before rounding and clipping, the rescaled state has standard deviation \(\sigma(t)\) equal to its mean \(\lambda(t)\), so its CV no longer shrinks as \(s_{\max}\) grows.
    - Design Motivation: The Poisson process is naturally suited for modeling "jump-like" state transitions. It has a single parameter (\(\lambda\)) and an analytical distribution, serving as a natural bridge between discrete jumps and continuous gradients.
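The forward sampling of intrinsic times described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the function name `sample_intrinsic_time`, the defaults `s_max=100` and `k=1.0`, and the lower clipping bound of 0 (the text only states the upper bound \(s_{\max}\)) are all choices made here.

```python
import numpy as np

def sample_intrinsic_time(t, n_tokens, s_max=100, k=1.0, rescale=True, rng=None):
    """Sample per-token intrinsic times tau_t in [0, 1] at extrinsic time t."""
    rng = np.random.default_rng() if rng is None else rng
    lam = k * s_max * t                        # lambda(t) = k * s_max * t
    s = rng.poisson(lam, size=n_tokens).astype(float)
    if rescale and lam > 0:
        sigma = lam                            # sigma(t) = lambda(t), as in the text
        # Standardize the Poisson draw, then rescale its spread to sigma(t).
        s = np.round((s - lam) / np.sqrt(lam) * sigma + lam)
    # Upper bound s_max follows the text; the lower bound 0 is an assumption.
    s = np.clip(s, 0, s_max)
    return s / s_max

rng = np.random.default_rng(0)
plain = sample_intrinsic_time(0.5, 2000, s_max=10_000, rescale=False, rng=rng)
scaled = sample_intrinsic_time(0.5, 2000, s_max=10_000, rescale=True, rng=rng)
print("std without rescaling:", plain.std())
print("std with rescaling:   ", scaled.std())
```

With a large `s_max`, the un-rescaled values should cluster tightly around \(kt\), which is the degeneration toward continuous diffusion noted above, while the rescaled values stay spread across tokens.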
- Context-Aware Time Predictor:
    - Function: Adaptively adjust the denoising rate of each token based on semantic context during the reverse process.
    - Mechanism:
        - Instead of rigidly mirroring the forward process (\(p_\theta(\tau_{t'}|\mathbf{z}_t, \tau_t) = q(\tau_{t'})\)), explicitly model \(p_\theta(\tau_{t'}|\mathbf{z}_t, \tau_t)\).
        - Input: the predicted generation \(\hat{\mathbf{z}}_0\) (rather than \(\mathbf{z}_t\)), the target time \(t'\), and the conditional sentence embedding \(\mathbf{x}\).
        - Train with cross-entropy, framing the prediction of \(\tau\) as a discrete classification problem.
        - Pseudo-label strategy: Instead of directly using \(\tau_{t'}\) as labels (which introduces bias), pseudo-labels are obtained by mapping the reconstruction loss ranking of \(\hat{\mathbf{z}}_0\) through the inverse Poisson CDF. Tokens with larger loss are assigned higher \(\tau\), so poorly reconstructed tokens keep more noise and denoise more slowly.
    - Design Motivation: Dynamically link the token-level denoising speed to generation quality: well-generated tokens complete denoising first, while difficult tokens retain more noise and recover gradually.
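The pseudo-label mapping (loss ranking, then inverse Poisson CDF) can be sketched as follows. Several details are assumptions made for this illustration: ranks are turned into mid-rank quantiles \((\text{rank} + 0.5)/n\), the Poisson CDF is truncated at \(s_{\max}\), variance rescaling is omitted for brevity, and the names `poisson_ppf` and `pseudo_labels` are hypothetical.

```python
import numpy as np

def poisson_ppf(u, lam, k_max):
    """Smallest k in [0, k_max] whose Poisson(lam) CDF is >= u (requires lam > 0)."""
    ks = np.arange(k_max + 1)
    # log(k!) computed as a running sum of log(1), ..., log(k_max).
    log_fact = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, k_max + 1)))))
    cdf = np.cumsum(np.exp(ks * np.log(lam) - lam - log_fact))
    return np.minimum(np.searchsorted(cdf, u, side="left"), k_max)

def pseudo_labels(token_losses, t_next, s_max=100, k=1.0):
    """Map per-token reconstruction losses to (class labels s, intrinsic times tau)."""
    losses = np.asarray(token_losses, dtype=float)
    n = losses.size
    lam = k * s_max * t_next
    if lam <= 0:
        return np.zeros(n, dtype=int), np.zeros(n)
    ranks = np.argsort(np.argsort(losses))   # 0 = smallest loss
    u = (ranks + 0.5) / n                    # mid-rank quantiles in (0, 1)
    s = poisson_ppf(u, lam, k_max=s_max)     # larger loss -> larger state
    return s, s / s_max

s, tau = pseudo_labels([0.1, 2.0, 0.5, 1.2], t_next=0.3)
print(s, tau)
```

The integer states `s` can serve directly as class targets for the cross-entropy objective, and the ordering of `tau` follows the ordering of the losses.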
- Bayes-Optimized Extrinsic Time Schedule:
    - Function: Optimize the global timestep sequence during inference.
    - Mechanism: Treat \(\{t_1, t_2, ..., t_K\}\) as continuous variables and search for the optimal schedule on the validation set using Bayesian optimization.
    - Design Motivation: Different tasks require different time allocation strategies.
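The idea of treating the schedule as continuous variables can be illustrated with a small search loop. Note that this sketch swaps in plain random search as a stand-in for Bayesian optimization, and the softmax-style parameterization, the decreasing direction from \(t = 1\), the function names, and the toy `score_fn` are all assumptions; in practice `score_fn` would be a validation metric such as BLEU.

```python
import numpy as np

def schedule_from_params(theta):
    """Map unconstrained parameters to a strictly decreasing schedule in (0, 1]."""
    w = np.exp(theta - np.max(theta))
    steps = w / w.sum()                       # positive step sizes summing to 1
    return 1.0 - np.concatenate(([0.0], np.cumsum(steps)[:-1]))

def random_search(score_fn, n_steps, n_trials=200, scale=1.0, rng=None):
    """Random search over schedules (a simple stand-in for Bayesian optimization)."""
    rng = np.random.default_rng() if rng is None else rng
    best_theta = np.zeros(n_steps)            # start from the uniform schedule
    best_score = score_fn(schedule_from_params(best_theta))
    for _ in range(n_trials):
        theta = rng.normal(0.0, scale, size=n_steps)
        score = score_fn(schedule_from_params(theta))
        if score > best_score:
            best_theta, best_score = theta, score
    return schedule_from_params(best_theta), best_score

# Toy objective standing in for a validation score (higher is better).
target = np.array([1.0, 0.6, 0.3, 0.1])
toy_score = lambda ts: -float(np.sum((ts - target) ** 2))
schedule, score = random_search(toy_score, n_steps=4, rng=np.random.default_rng(0))
print(schedule, score)
```

A Bayesian optimizer would replace the sampling loop with a surrogate model that proposes the next `theta`, which matters when each evaluation requires decoding the full validation set.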
Loss & Training
- Total loss:
    \[\mathcal{L} = \mathcal{L}_z + \mathcal{L}_\tau + \mathcal{L}_{\text{anchor}}\]
    - \(\mathcal{L}_z = \|\hat{\mathbf{z}}_0 - \mathbf{z}_0\|^2\) (prediction loss)
    - \(\mathcal{L}_\tau = \text{KL}(q(\tau_{t'}) \| p_\theta(\tau_{t'}|\mathbf{z}_t, \tau_t))\) (time prediction loss)
    - \(\mathcal{L}_{\text{anchor}} = -\log p_\theta(y|\hat{\mathbf{z}}_0)\) (anchor loss, preventing embedding space collapse)
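The three terms can be combined as in the following NumPy sketch. It rests on assumptions not stated in the summary: the time target is a one-hot label (in which case the KL term reduces to cross-entropy), the anchor distribution \(p_\theta(y|\hat{\mathbf{z}}_0)\) is a softmax over dot products with the embedding matrix, and losses are averaged over tokens. The function name `neodiff_style_loss` is hypothetical.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def neodiff_style_loss(z0_hat, z0, tau_logits, tau_labels, embeddings, token_ids):
    """Return (total, (L_z, L_tau, L_anchor)) for one sequence of n tokens.

    Shapes: z0_hat, z0 (n, d); tau_logits (n, s_max + 1); tau_labels (n,) ints;
    embeddings (V, d); token_ids (n,) ints.
    """
    idx = np.arange(z0.shape[0])
    # L_z: squared error between predicted and clean embeddings.
    loss_z = np.mean(np.sum((z0_hat - z0) ** 2, axis=-1))
    # L_tau: with one-hot targets, KL(q || p) equals cross-entropy.
    loss_tau = -np.mean(log_softmax(tau_logits)[idx, tau_labels])
    # L_anchor: assumed softmax over embedding dot products.
    loss_anchor = -np.mean(log_softmax(z0_hat @ embeddings.T)[idx, token_ids])
    return loss_z + loss_tau + loss_anchor, (loss_z, loss_tau, loss_anchor)

rng = np.random.default_rng(0)
E = rng.normal(size=(7, 4))
ids = np.array([1, 3, 5])
z0 = E[ids]
z0_hat = z0 + 0.1 * rng.normal(size=z0.shape)
total, parts = neodiff_style_loss(z0_hat, z0, rng.normal(size=(3, 11)),
                                  np.array([0, 4, 10]), E, ids)
print(total, parts)
```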
Key Experimental Results
Main Results (Machine Translation BLEU)
| Model | Type | Beam | IWSLT14 | WMT14 | WMT16 |
|---|---|---|---|---|---|
| Absorbing | Discrete | 5 | 28.32 | 21.62 | 30.41 |
| SeqDiffuSeq | Continuous | 10 | 30.03 | 24.24 | 26.17 |
| Difformer | Continuous | 10 | 32.09 | 23.80 | 30.93 |
| DiNoiSer | Continuous | 50 | 31.61 | 24.26 | 31.08 |
| NeoDiff | Hybrid | 10 | 33.14 | 25.28 | 32.31 |
NeoDiff outperforms all baselines on all three translation benchmarks. With a beam size of 10, it also surpasses DiNoiSer, the baseline that uses the largest beam size (50).
Other Tasks (BLEU)
| Model | QQP (Paraphrase) | Quasar-T (Question Gen) | Wiki-Auto (Text Simplification) |
|---|---|---|---|
| Difformer (b=10) | 30.43 | 16.66 | 40.77 |
| NeoDiff (b=10) | 30.87 | 18.35 | 41.33 |
Ablation Study
| Configuration | IWSLT BLEU | WMT14 BLEU |
|---|---|---|
| NeoDiff (Full) | 33.14 | 25.28 |
| w/o Poisson (uniform \(\tau=t\)) | 32.09 (-1.05) | 24.24 (-1.04) |
| w/o Time Predictor | 32.22 (-0.92) | 24.41 (-0.87) |
| w/o Bayesian Scheduler | 32.31 (-0.83) | 24.74 (-0.54) |
| w/o Variance Rescaling | 32.58 (-0.56) | — |
Key Findings
- Poisson diffusion is the most critical component: removing it degrades the model to standard continuous diffusion (Difformer), causing a ~1.0 BLEU drop.
- The contribution of the time predictor is second only to the Poisson process: adaptive denoising rates are more effective than rigidly mirroring the forward process.
- The necessity of variance rescaling: it prevents degeneration into continuous diffusion when \(s_{\max} \to \infty\).
- Bayes-optimized time scheduling consistently yields additional gains, showing that the optimal time allocation is task-specific.
- LLM-based evaluation (DeepSeek-V3) also rates NeoDiff ahead of the baselines on accuracy, fluency, and creativity.
Highlights & Insights
- Theoretical elegance of the dual-time framework: By generalizing the temporal variables, it unifies discrete, continuous, and hybrid diffusion models, providing a consolidated theoretical foundation for future research. This "parameterized unification" approach can be transferred to other domains requiring the reconciliation of distinct core methodologies.
- Ingenious design of Poisson + variance rescaling: The Poisson process naturally captures "jumps" (discreteness), while variance rescaling decouples the resolution from \(s_{\max}\), resolving the tension between fine-grained control and diversity.
- The pseudo-label strategy is a key technical innovation: by mapping generation quality (reconstruction loss ranking) \(\to\) inverse Poisson CDF \(\to\) temporal labels, it translates "which tokens are well-generated" into "which tokens should finish denoising first", achieving generation quality-guided adaptive denoising.
Limitations & Future Work
- The experiments are small-scale (at the Transformer-base level) and not validated on large models (such as LLaMA-based text diffusion).
- The generation quality does not clearly surpass autoregressive models: on QQP, a Transformer with \(b=5\) achieves 30.83, while NeoDiff needs \(b=10\) to reach 30.87, a margin of only 0.04 BLEU.
- Bayes-optimized time scheduling requires extra evaluation overhead on the validation set.
- The time predictor increases model complexity and training costs.
- The synergy with more advanced continuous diffusion techniques (such as flow matching) remains unexplored.
Related Work & Insights
- vs. Difformer: NeoDiff introduces token-level non-simultaneous diffusion on top of Difformer. It represents a fundamental improvement rather than an incremental optimization, boosting BLEU by roughly 1.0 to 1.5 points on the three translation benchmarks reported above.
- vs. DiffuSeq-V2: DiffuSeq-V2 blends discrete and continuous noise using [MASK] tokens, but falls short of true fine-grained token-level control. NeoDiff's Poisson process is more natural and theoretically sound.
- vs. AR-Diffusion: AR-Diffusion achieves semi-autoregressive generation via monotonically increasing noise but artificially restricts noise pattern design. NeoDiff's Poisson allocation is much more flexible.
- vs. D3PM (Discrete Diffusion): Discrete diffusion lacks fine-grained transitions. NeoDiff preserves the discrete behavior of state jumps within a continuous space.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The dual-time framework is a highly original theoretical contribution that unifies the formalization of discrete and continuous diffusion.
- Experimental Thoroughness: ⭐⭐⭐⭐ Tested on 6 tasks with multiple baselines and detailed ablations, though model scale is relatively small.
- Writing Quality: ⭐⭐⭐⭐ Mathematically rigorous with clear framework diagrams, though high notation density may affect readability.
- Value: ⭐⭐⭐⭐ Provides a unified theoretical framework and practical directions for improvement in the text diffusion domain.