Planning with Diffusion Models for Target-Oriented Dialogue Systems¶
Conference: ACL 2025
arXiv: 2504.16858
Code: https://github.com/ninglab/DiffTOD
Area: Dialogue Systems
Keywords: Dialogue Planning, Diffusion Models, Target-Oriented Dialogue, Non-Sequential Planning, Trajectory Generation
TL;DR¶
DiffTOD casts dialogue planning as a trajectory generation problem and uses a masked diffusion language model to plan non-sequentially. Three guidance mechanisms (word-level, semantic-level, and search-based) steer the dialogue toward the target, and the method outperforms baselines in negotiation, recommendation, and chitchat scenarios.
Background & Motivation¶
Background: Target-Oriented Dialogue (TOD) systems need to strategically guide dialogues toward specific goals (e.g., reaching a deal, recommending items).
Limitations of Prior Work:
- Existing dialogue planning methods (LLM prompting, RL policies) generate step by step and can only plan the next step based on past actions.
- Sequential planning leads to compounding errors and short-sighted decisions because it cannot look ahead at the future trajectory of the dialogue.
- LLMs are trained to passively follow instructions and lack the capability to proactively guide dialogues.
Key Challenge: Sequential planning cannot globally optimize dialogue strategies and easily falls into local optima (e.g., insisting on not lowering the price during negotiations, leading to a breakdown).
Goal: Design a non-sequential dialogue planning method that optimizes action strategies by simultaneously considering both the past and the future.
Key Insight: Dialogue trajectory generation and the denoising process of diffusion models are mathematically connected: both progressively fill in an incomplete sequence until it is complete.
Core Idea: Generate the entire dialogue trajectory simultaneously (instead of step-by-step) using a diffusion model, combined with conditional guidance to ensure the trajectory achieves the target.
Method¶
Overall Architecture¶
Model TOD as a conversational MDP \(\rightarrow\) translate dialogue planning into a trajectory generation problem \(\rightarrow\) estimate trajectory likelihood using a diffusion language model \(\rightarrow\) optimize action strategies via conditional guidance \(\rightarrow\) pass the generated plan to the LLM to execute dialogue. The planning and generation stages are decoupled.
Key Designs¶
- Trajectory Modeling with Diffusion Models:
  - Function: Rewrites the trajectory likelihood used in dialogue planning in a form equivalent to the diffusion denoising process.
  - Mechanism: With \(\tau^n\) denoting the trajectory at diffusion step \(n\), the reverse process factorizes as \(p_\theta(\tau^{0:N}) = p(\tau^N) \prod_{n=1}^N p_\theta(\tau^{n-1}|\tau^n)\), so trajectory generation is mathematically equivalent to diffusion denoising. A masked diffusion language model (MDLM) is fine-tuned on dialogue history to progressively recover the complete trajectory from an incomplete one.
  - Design Motivation: The non-sequential generation capability of diffusion models is naturally suited to considering the past and the future at the same time, which avoids the short-sightedness of sequential planning.
- Three Guidance Mechanisms:
  - Word-Level Guidance: Fixes target keywords at specific positions on the trajectory, and the diffusion model completes the dialogue around the keywords. Suitable for keyword-guided dialogues.
  - Semantic-Level Guidance: Uses semantic descriptions (e.g., "the system recommended the target item") as conditions, combined with MBR decoding to select the best one from multiple paraphrased versions. Suitable for semantic-level goals.
  - Search-Based Guidance: Constructs a conversational search tree, generates different actions at tree nodes using word-level or semantic-level guidance, and uses a search algorithm (MCTS) to select the path that maximizes cumulative reward. Suitable for complex negotiations requiring long-term strategies.
  - The three types of guidance can be used individually or in combination, so targets can be switched at test time without retraining.
- Conditional Trajectory Generation:
  - Factorizes the likelihood into \(p_\theta(\tau|\mathcal{O}=1) \propto p_\theta(\tau) \cdot p_\theta(\mathcal{O}=1|\tau)\), where the unconditional part is generated by the diffusion model and the conditional part is achieved through the guidance mechanism.
  - Implemented as trajectory inpainting: fixes known target states/actions and lets the diffusion model fill in the rest.
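The inpainting idea above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_denoiser`, the action vocabulary, and the unmasking schedule are hypothetical stand-ins for the fine-tuned diffusion model, chosen only to show how clamped target positions stay fixed while the remaining slots are filled in a non left-to-right order.

```python
import random

MASK = "[MASK]"

def toy_denoiser(seq, pos, rng):
    """Hypothetical stand-in for the diffusion model: propose a token for one masked slot."""
    vocab = ["greet", "ask_price", "counter_offer", "concede", "accept"]
    return rng.choice(vocab)

def inpaint_trajectory(length, fixed, steps, seed=0):
    """Fill a masked trajectory while keeping the `fixed` {position: token} entries clamped.

    `steps` must be at least 1; the final step always fills every remaining slot.
    """
    rng = random.Random(seed)
    # Start from a fully masked trajectory, except for the clamped target positions.
    seq = [fixed.get(i, MASK) for i in range(length)]
    for step in range(steps, 0, -1):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        # Unmask an equal share of the remaining slots, in random order.
        k = -(-len(masked) // step)  # ceiling division
        for pos in rng.sample(masked, k):
            seq[pos] = toy_denoiser(seq, pos, rng)
    return seq

# Word-level guidance in miniature: the first and last actions are fixed in advance.
plan = inpaint_trajectory(length=8, fixed={0: "greet", 7: "accept"}, steps=4)
print(plan)
```

Because the clamped entries are never masked, the denoiser can only change the slots between them, which is the sense in which the target is imposed on the plan rather than hoped for.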
Loss & Training¶
Fine-tunes the masked diffusion language model (MDLM) on dialogue history data. The three guidance mechanisms are applied during inference, requiring no retraining.
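As a rough illustration of what fine-tuning a masked diffusion model involves, the sketch below shows only the forward corruption step: each token is independently replaced by a mask token with a probability tied to the noise level. The linear schedule (mask probability equal to `t`) and the names `mask_tokens` and `masked_positions` are assumptions made for this example, not details taken from the paper; the model would then be trained to predict the original tokens at the masked positions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, t, rng):
    """Independently replace each token with MASK with probability t (assumed linear schedule)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return [MASK if rng.random() < t else tok for tok in tokens]

def masked_positions(noisy):
    """Positions the model would be asked to reconstruct for this training example."""
    return [i for i, tok in enumerate(noisy) if tok == MASK]

rng = random.Random(0)
dialogue = ["buyer:", "is", "the", "price", "negotiable", "?"]
noisy = mask_tokens(dialogue, t=0.5, rng=rng)
print(noisy, masked_positions(noisy))
```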
Key Experimental Results¶
Main Results¶
| Setting | Metric | DiffTOD | Best Baseline | Relative Gain |
|---|---|---|---|---|
| CraigslistBargain (Buyer) | SR | 0.901 | 0.798 (Claude-3.5) | +12.9% |
| CraigslistBargain (Seller) | SR | 0.793 | 0.689 (ProCoT) | +15.1% |
| TopDial (Recommendation) | SR | 0.663 | 0.620 (GPT-4o) | +6.9% |
| PersonaChat (Keyword) | KCR | 0.767 | 0.706 (GPT-4o) | +8.6% |
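The last column of the table is a relative improvement over the best baseline, not an absolute difference in score. The short check below recomputes it from the two score columns; the dictionary layout and the helper name `relative_gain` are illustrative only.

```python
# (DiffTOD score, best-baseline score) copied from the table above.
results = {
    "CraigslistBargain (Buyer)": (0.901, 0.798),
    "CraigslistBargain (Seller)": (0.793, 0.689),
    "TopDial (Recommendation)": (0.663, 0.620),
    "PersonaChat (Keyword)": (0.767, 0.706),
}

def relative_gain(ours, baseline):
    """Relative improvement over the baseline, in percent."""
    return (ours - baseline) / baseline * 100.0

gains = {name: round(relative_gain(o, b), 1) for name, (o, b) in results.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}%")
```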
Ablation Study¶
| Configuration | Description |
|---|---|
| w/o Non-sequential Planning | SR drops significantly, indicating that look-ahead capability is crucial |
| w/o Search-based | Negotiation performance drops to the level of ProCoT |
| w/o Semantic-level MBR | Recommendation success rate drops |
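To make the MBR step in the last ablation row concrete, here is a minimal sketch of minimum Bayes risk selection: among several candidate paraphrases, pick the one with the highest average similarity to the others. The Jaccard word-overlap utility and the function names are assumptions for illustration; the paper's actual utility function may differ.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings (illustrative utility only)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def mbr_select(candidates, utility=jaccard):
    """Return the candidate with the highest mean utility against all other candidates."""
    if not candidates:
        raise ValueError("need at least one candidate")

    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return 0.0
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]

candidates = [
    "the system recommended the target item",
    "the system recommended the item",
    "user talks about weather",
    "the system recommended the target movie",
]
print(mbr_select(candidates))
```

The selected candidate is the one closest to the consensus of the pool, so an outlier such as the off-topic third sentence is unlikely to be chosen.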
Key Findings¶
- The advantage is most pronounced in negotiation scenarios requiring long-term strategies: the performance gap over baselines continues to widen in the later turns of multi-turn negotiations.
- Search-based guidance shows significant effectiveness on complex goals, while word-level guidance suffices for simple goals (e.g., single keywords).
- DiffTOD can serve both buyer and seller roles using the same model simply by switching guidance strategies, demonstrating its flexibility.
- Non-sequential planning is especially effective in sparse reward scenarios (where feedback is only available at the end).
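The value of planning over whole paths under delayed reward can be seen in a toy search tree. The sketch below uses exhaustive depth-first search rather than the MCTS used by search-based guidance, purely for brevity; the tree, the action names, and the reward values are hypothetical.

```python
def best_path(node):
    """Return (cumulative_reward, actions) for the highest-reward root-to-leaf path."""
    children = node.get("children", [])
    if not children:
        return node["reward"], [node["action"]]
    total, path = max((best_path(c) for c in children), key=lambda r: r[0])
    return node["reward"] + total, [node["action"]] + path

# Holding the price looks better after one step, but the negotiation then breaks down.
tree = {
    "action": "open", "reward": 0.0,
    "children": [
        {"action": "hold_price", "reward": 0.25,
         "children": [{"action": "walk_away", "reward": -1.0}]},
        {"action": "small_concession", "reward": -0.25,
         "children": [{"action": "deal", "reward": 1.0}]},
    ],
}

total, actions = best_path(tree)
print(total, actions)
```

A greedy one-step planner would pick `hold_price` for its immediate reward, whereas scoring complete paths selects the concession that leads to a deal.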
Highlights & Insights¶
- The mathematical connection between diffusion models and dialogue planning is elegantly derived—showing an equivalent transformation from MDP trajectory likelihood to the diffusion denoising process.
- The three-tiered guidance mechanism is cleverly designed (word-level \(\rightarrow\) semantic-level \(\rightarrow\) search-based), progressively increasing complexity and strategic depth.
- The decoupled "planning and generation" architecture allows the planner's output to guide any LLM for dialogue execution.
- Flexible guidance (switching targets at test time without retraining) resolves the limitation of RL-based methods that require retraining for each goal.
Limitations & Future Work¶
- Diffusion language models still lag behind autoregressive LLMs in both generation quality and speed.
- Search-based guidance incurs high computational overhead (requiring the construction of search trees).
- States and actions are represented in natural language, so the trajectory length is limited by the model's context window.
- The user simulator (GPT-4o) might not fully reflect real user behavior.
Related Work & Insights¶
- vs PPDPP (RL-based): PPDPP optimizes sequentially using policy gradients, whereas DiffTOD generates non-sequentially using diffusion, with the latter showing a clear advantage in long-term planning.
- vs ProCoT/EnPL (Prompt-based): Prompt-based methods rely heavily on the LLM's inherent planning capabilities, while DiffTOD uses a dedicated diffusion model for more controllable planning.
- vs Diffuser (Janner et al.): Serving as the inspiration, Diffuser is designed for continuous control, while DiffTOD adapts this to discrete dialogue scenarios.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The mathematical modeling of using diffusion models for dialogue planning is highly elegant, and the three-tiered guidance design is exquisite.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets + multiple baselines + ablations + turn-by-turn analysis + human evaluation.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear mathematical derivations and intuitive diagrams.
- Value: ⭐⭐⭐⭐ Provides a brand-new non-sequential planning paradigm for target-oriented dialogue.