EdiText: Controllable Coarse-to-Fine Text Editing with Diffusion Language Models
Conference: ACL 2025
arXiv: 2502.19765
Code: -
Area: NLP / Text Editing / Diffusion Models
Keywords: diffusion language model, text editing, SDEdit, self-conditioning, controllable generation
TL;DR
Proposes EdiText, a controllable text editing method based on latent diffusion models. It combines SDEdit for coarse-grained editing and self-conditioning for fine-grained editing, achieving multi-scale text editing control ranging from minor modifications to extensive rewriting.
Background & Motivation
Background: Text editing aims to modify a given reference text to satisfy target attributes. Existing methods include approaches based on autoregressive and non-autoregressive models. While diffusion models have demonstrated powerful multi-scale control capabilities in the image editing domain, their application to text editing remains under-explored.
Limitations of Prior Work: (1) Energy-based model approaches (e.g., Mireshghallah et al. 2022) offer only fine-grained control over a limited range; (2) ParaGuide (Horvitz et al. 2024) uses classifier guidance to adjust editing intensity, but its control range remains narrow; (3) Autoregressive models (e.g., Qwen2.5) respond only weakly to instructions about editing intensity: changing the prompt does not significantly alter the scale of the edit.
Core Problem: How to achieve both coarse-grained (large-scale adjustment) and fine-grained (precise tuning) editing control in text editing?
Method
Overall Architecture
EdiText uses LD4LG (Latent Diffusion for Language Generation) as the backbone model. LD4LG compresses discrete text into a fixed-length continuous latent representation via a Perceiver Resampler encoder, reconstructs the text with an autoregressive decoder, and trains a conditional diffusion model to learn the distribution of the latent representations. Two complementary editing techniques are integrated on top of this model.
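To make the fixed-length latent idea concrete, here is a minimal NumPy sketch of Perceiver-style compression, in which a fixed set of query vectors cross-attends over a variable number of token embeddings. It is an illustration under stated assumptions, not the LD4LG encoder: the sizes (32 latents, 64 dimensions), the random stand-in for learned queries, and the single attention head without projections are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_to_latents(token_embs, queries):
    """Cross-attend fixed queries over variable-length token embeddings.

    token_embs: (n_tokens, d), queries: (n_latents, d) -> (n_latents, d)
    """
    d = token_embs.shape[-1]
    scores = queries @ token_embs.T / np.sqrt(d)  # (n_latents, n_tokens)
    weights = softmax(scores, axis=-1)            # each query attends over all tokens
    return weights @ token_embs                   # (n_latents, d)

rng = np.random.default_rng(0)
queries = rng.normal(size=(32, 64))               # stand-in for learned queries (assumed sizes)
short_latent = compress_to_latents(rng.normal(size=(7, 64)), queries)
long_latent = compress_to_latents(rng.normal(size=(50, 64)), queries)
print(short_latent.shape, long_latent.shape)
```

Whatever the input length, the output has one row per query, which is what lets a diffusion model operate on a latent of fixed shape.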
Key Designs
- SDEdit Coarse-Grained Editing (EdiText-CE): The reference text is encoded into a latent representation \(x_0\), which the forward process diffuses to timestep \(t_{CE}\). The trained conditional diffusion model, conditioned on the target attributes, then denoises it in reverse. The parameter \(t_{CE}\) controls the editing intensity: when \(t_{CE}\) is close to \(T\), more noise is added, less of the original text is preserved, and the edit is large; when \(t_{CE}\) is close to 0, less noise is added, more is preserved, and the edit is small.
- Self-Conditioning Fine-Grained Editing (EdiText-FE): The self-conditioning mechanism is repurposed: instead of feeding the model its own previous prediction during sampling, the latent representation of the reference text is injected as the condition. The reference representation serves as the condition from \(t=T\) down to \(t_{FE}\), below which regular self-conditioning resumes. Smaller \(t_{FE}\) values prolong the influence of the reference text, leading to smaller edits.
- Integrated Coarse-to-Fine Editing: SDEdit provides large-scale but coarse control, while self-conditioning provides small-scale but precise control. When the two are stacked, SDEdit first establishes the overall editing scope, and self-conditioning then fine-tunes within that range, giving complete multi-scale coverage.
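The three designs above can be sketched as a single sampling loop. This is a minimal sketch under assumptions, not the authors' implementation: `T = 250`, the cosine schedule `alpha`, the DDIM-style update, and the way the two techniques are stacked (reference conditioning from \(t_{CE}\) down to \(t_{FE}\)) are all assumed for illustration, and `predict_x0` is a hypothetical callable standing in for the trained conditional diffusion model with the target attribute already bound.

```python
import numpy as np

T = 250  # assumed number of diffusion steps (hypothetical)

def alpha(t, T=T):
    """Assumed cosine schedule: alpha(0) = 1 (clean), alpha(T) ~ 0 (pure noise)."""
    return np.cos(0.5 * np.pi * t / T) ** 2

def sdedit_noise(x0, t_ce, rng):
    """Coarse control: forward-diffuse the reference latent x0 to timestep t_ce."""
    a = alpha(t_ce)
    eps = rng.normal(size=x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def self_cond_input(t, t_fe, x_ref, x0_prev):
    """Fine control: condition on the reference for t >= t_fe, else on the previous prediction."""
    return x_ref if t >= t_fe else x0_prev

def ddim_step(x_t, t, x0_hat):
    """Deterministic DDIM-style update from t to t - 1 given an x0 estimate."""
    a_t, a_prev = alpha(t), alpha(t - 1)
    eps_hat = (x_t - np.sqrt(a_t) * x0_hat) / np.sqrt(1.0 - a_t)
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat

def edit(x_ref, predict_x0, t_ce, t_fe, rng):
    """Noise the reference to t_ce, then denoise with the switched self-conditioning input."""
    x_t = sdedit_noise(x_ref, t_ce, rng)
    x0_hat = np.zeros_like(x_ref)
    for t in range(t_ce, 0, -1):
        cond = self_cond_input(t, t_fe, x_ref, x0_hat)
        x0_hat = predict_x0(x_t, t, cond)
        x_t = ddim_step(x_t, t, x0_hat)
    return x_t

# Toy usage: the "model" below is NOT trained; it only pulls its estimate
# halfway toward a fixed target latent so that the loop can run end to end.
rng = np.random.default_rng(0)
x_ref = rng.normal(size=(32, 64))    # stand-in for the encoded reference text
target = rng.normal(size=(32, 64))   # stand-in for a latent with the target attribute
toy_predict_x0 = lambda x_t, t, cond: 0.5 * target + 0.5 * cond
light_edit = edit(x_ref, toy_predict_x0, t_ce=200, t_fe=10, rng=rng)
heavy_edit = edit(x_ref, toy_predict_x0, t_ce=200, t_fe=100, rng=rng)
```

In this toy run, the smaller `t_fe` leaves the output closer to the reference latent, mirroring the qualitative behavior described above. How a real model responds depends on its training and on the decoder that maps latents back to text.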
Loss & Training
- LD4LG Training Loss: \(L(\theta) = \mathbb{E}_{t,x_0,\epsilon_t}[\lambda_t^{-1} \|x_\theta(x_t, t) - x_0\|_2^2]\), where \(\lambda_t = 1 - \alpha_t\)
- Self-conditioning training: each training step uses the conditional mode (conditioned on the model's previous prediction) with probability \(p=0.5\) and the unconditional mode otherwise.
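The two training items above can be sketched in NumPy as follows. This is a sketch under assumptions rather than the LD4LG training code: the variance-preserving forward process \(x_t = \sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon_t\), the two-pass self-conditioning recipe, and the callable `model(x_t, t, cond)` are assumed for illustration. In a real framework the first-pass estimate would also be detached from the gradient.

```python
import numpy as np

def weighted_x0_loss(x0_pred, x0, alpha_t):
    """lambda_t^{-1} * ||x_theta(x_t, t) - x_0||_2^2 with lambda_t = 1 - alpha_t."""
    lam = 1.0 - alpha_t
    return np.sum((x0_pred - x0) ** 2) / lam

def training_step(model, x0, t, alpha_t, rng, p_self_cond=0.5):
    """One loss evaluation; `model` is a hypothetical x0-predictor."""
    eps = rng.normal(size=x0.shape)
    # Assumed variance-preserving forward process.
    x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1.0 - alpha_t) * eps
    cond = np.zeros_like(x0)           # unconditional mode: empty self-conditioning input
    if rng.random() < p_self_cond:
        cond = model(x_t, t, cond)     # conditional mode: first-pass estimate as the condition
    x0_pred = model(x_t, t, cond)
    return weighted_x0_loss(x0_pred, x0, alpha_t)
```

Because the weight is \(1/(1-\alpha_t)\), errors at low-noise timesteps (where \(\alpha_t\) is close to 1) are penalized more heavily than errors at high-noise timesteps.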
Experiments
Main Results (Detoxifying)
| Method | Hamming ↓ | SacreBLEU ↑ | BERTScore ↑ | Moderation ↓ | PerspectiveAI ↓ |
|---|---|---|---|---|---|
| ParaGuide (λ=200) | 25.3 | 14.9 | 0.903 | 0.446 | 0.321 |
| ParaGuide (λ=10K) | 27.2 | 11.0 | 0.889 | 0.335 | 0.229 |
| Qwen2.5-0.5B | 27.2 | 31.1 | 0.903 | 0.347 | 0.312 |
| EdiText-CE (t=175) | 17.4 | 34.7 | 0.923 | 0.576 | 0.450 |
| EdiText-CE (t=200) | 28.9 | 7.6 | 0.865 | 0.105 | 0.136 |
| EdiText-FE (t=25) | 24.7 | 14.9 | 0.881 | 0.117 | 0.121 |
Sentiment Control Task (Neg → Pos)
| Method | Hamming ↓ | BERTScore ↑ | Accuracy ↑ |
|---|---|---|---|
| ParaGuide (λ=10K) | 18.0 | 0.857 | 0.89 |
| Qwen2.5-0.5B | 23.9 | 0.881 | 0.60 |
| EdiText-CE (t=200) | 15.1 | 0.879 | 0.77 |
| EdiText-CE (t=225) | 19.5 | 0.846 | 0.90 |
| EdiText-FE (t=25) | 10.7 | 0.916 | 0.60 |
Key Findings
- Control Range: By adjusting \(t_{CE}\), EdiText-CE covers the full range from near-zero edits to complete rewrites, whereas ParaGuide and Qwen2.5 exhibit extremely limited control ranges.
- Editing Quality: At comparable preservation rates, EdiText matches or exceeds the baselines in target-attribute satisfaction.
- Fine-Grained Control: EdiText-FE provides smoother gradations of editing strength, mitigating the abrupt changes observed in EdiText-CE.
- Integration Advantage: Combining coarse and fine controls enables continuous and seamless multi-scale editing coverage.
Highlights & Insights
- Adapts SDEdit, a method from the image domain, to text editing to achieve coarse-grained control.
- Cleverly re-interprets self-conditioning, reframing it from "enhancing generation quality" to "anchoring the reference text representation."
- The dual-layered coarse/fine control mechanism is highly complementary, covering the entire range of edits.
- The approach is clean and elegant, bypassing the need for additional classifiers (unlike ParaGuide).
Limitations & Future Work
- Since it is based on latent diffusion models, the generated text quality is still inferior to modern large-scale autoregressive models.
- LD4LG compresses text into a fixed-length latent representation, which might lose details in long texts.
- Evaluation is only conducted on toxicity detoxification and sentiment control tasks; generalizability remains to be verified.
- Optimal editing parameters (\(t_{CE}\), \(t_{FE}\)) need empirical tuning for specific tasks.
- Inference with diffusion models is slower than instruction-guided editing with autoregressive LLMs.
Related Work & Insights
- Diffusion Language Models: LD4LG (Lovelace et al. 2023) latent diffusion; MDLM (Sahoo et al. 2024) discrete diffusion.
- Text Editing: ParaGuide (Horvitz et al. 2024) classifier-guided; Mireshghallah et al. 2022 EBM-based.
- Transfer from Image Editing to Text: SDEdit (Meng et al. 2022) noise-denoise editing framework.
- Controllable Text Generation: Li et al. 2022 diffusion-based constrained generation; self-conditioning (Chen et al. 2023) for improved sampling quality.
Rating
| Dimension | Score (1-10) |
|---|---|
| Novelty | 8 |
| Technical Depth | 7 |
| Experimental Thoroughness | 7 |
| Writing Quality | 7 |
| Value | 6 |
| Overall Score | 7.0 |