DEF-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation
Conference: ACL 2025
arXiv: 2505.21033
Code: https://github.com/ElPlaguister/Def-DTS
Area: Natural Language Processing (Dialogue Topic Segmentation)
Keywords: Dialogue Topic Segmentation, Deductive Reasoning, Intent Classification, LLM Prompting, Unsupervised
TL;DR
DEF-DTS is a dialogue topic segmentation method built on multi-step deductive reasoning with LLMs. A three-step pipeline (bidirectional context summarization \(\rightarrow\) 5-class utterance intent classification \(\rightarrow\) deductive topic-shift judgment) achieves state-of-the-art results among unsupervised and prompt-based methods on three datasets, TIAGE, SuperDialseg, and Dialseg711, and outperforms supervised methods on Dialseg711.
Background & Motivation
Background: Dialogue Topic Segmentation (DTS) aims to identify topic boundaries in dialogues. Supervised methods (fine-tuned BERT/RoBERTa) require substantial annotation and have poor cross-domain generalization; prompt-based methods (directly asking the LLM to judge whether the topic has changed) yield unstable performance.
Limitations of Prior Work: (1) Supervised models rely on domain-specific data, making them costly and poorly generalizable; (2) Existing LLM prompt methods judge topic changes too coarsely, merely checking whether contexts are "different" without intermediate reasoning; (3) Subtle topic transitions (e.g., moving from a QA exchange to a new topic) and explicit topic switches require different identification strategies.
Key Challenge: Determining whether a topic changes requires several reasoning steps: understanding what was previously said, identifying the intent of the current utterance, and recognizing whether that intent implies a topic shift. Asking an LLM to do all of this in a single prompt is too demanding.
Goal: Design a multi-step reasoning pipeline that lets LLMs determine topic boundaries in dialogues systematically.
Key Insight: Decompose topic segmentation into three subtasks (context understanding, intent classification, and deductive reasoning) to lower the task difficulty at each step.
Core Idea: The essence of topic shift is the change in utterance intent (shifting from "developing the topic" to "introducing a new topic" or "changing the topic"). Hence, topic boundaries are indirectly determined through intent classification.
Method
Overall Architecture
A three-step pipeline processes each utterance: (1) Bidirectional context summarization (prior 2 sentences + subsequent 3 sentences) \(\rightarrow\) (2) 5-class intent classification \(\rightarrow\) (3) Deductive reasoning to determine topic shifts. The overall system utilizes a structured XML prompt format.
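As a minimal sketch of the windowing in step (1), the snippet below extracts the 2 preceding and 3 succeeding utterances around a target index. The function name `context_window` and its parameters are illustrative assumptions, not identifiers from the DEF-DTS code; in the actual method each window would then be summarized by the LLM, which is not shown here.

```python
from typing import List, Tuple


def context_window(
    utterances: List[str], i: int, n_prev: int = 2, n_next: int = 3
) -> Tuple[List[str], str, List[str]]:
    """Return (preceding, current, succeeding) utterances around index i.

    Hypothetical helper: the window sizes default to the 2-before / 3-after
    setting described in the text. Windows are truncated at dialogue edges.
    """
    preceding = utterances[max(0, i - n_prev):i]
    succeeding = utterances[i + 1:i + 1 + n_next]
    return preceding, utterances[i], succeeding


if __name__ == "__main__":
    dialogue = ["u0", "u1", "u2", "u3", "u4", "u5", "u6"]
    before, current, after = context_window(dialogue, 3)
    print(before, current, after)
```

Near the start or end of a dialogue the corresponding window simply shrinks, so the first utterance has no preceding context and the last has no succeeding context.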
Key Designs
- Bidirectional Context Summarization:
    - Function: Generate concise summaries of the preceding and succeeding context for the current utterance.
    - Mechanism: Extract 2 preceding sentences and 3 succeeding sentences, then generate a summary of each with the LLM. An asymmetric window (2 before, 3 after) is used because topic shifts are typically more pronounced in subsequent utterances.
    - Design Motivation: Summaries are more concise than the raw context, reducing the context overhead for the LLM while retaining key semantics.
- 5-Class Utterance Intent Classification (Core):
    - Function: Classify each utterance into one of five domain-independent intents.
    - Mechanism: The 5 intents are: JUST_COMMENT (pure comment), JUST_ANSWER (answering a question), DEVELOP_TOPIC (developing the current topic), INTRODUCE_TOPIC (introducing a new subtopic), and CHANGE_TOPIC (shifting the topic). Each class is provided with 1-3 examples.
    - Design Motivation: This is the core innovation of the method: judging topic shifts is reframed as an easier intent classification task. INTRODUCE_TOPIC and CHANGE_TOPIC imply topic boundaries, whereas the other three imply topic continuation. The domain-independent intent definitions allow the method to be used across different domains.
- Deductive Reasoning:
    - Function: Rule-based deduction of whether a topic shift occurred, based on the classified intent.
    - Mechanism: If the intent is INTRODUCE_TOPIC or CHANGE_TOPIC, the utterance is marked as a topic boundary; otherwise, it is marked as topic continuation.
    - Design Motivation: Simplify the final judgment into rule-based reasoning over intents, so the LLM does not have to make a vague "is the topic changing" decision directly.
- Structured XML Prompt Format:
    - Function: Structure the inputs and outputs using XML tags.
    - Mechanism: Organize the prompt with XML tags (e.g., `<context>`, `<intent>`, `<reasoning>`).
    - Design Motivation: The XML format reaches an F1 of 0.699, outperforming JSON (0.658) and natural language (0.640), and produces more stable, easily parsable outputs.
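The intent-to-boundary rule and the XML-tagged output described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the five label names come from the text, but the exact output schema (a single `<intent>` tag holding the label), the regular expression, and the helper names `parse_intent`, `is_topic_shift`, and `segment` are hypothetical.

```python
import re
from enum import Enum
from typing import List


class Intent(Enum):
    JUST_COMMENT = "JUST_COMMENT"
    JUST_ANSWER = "JUST_ANSWER"
    DEVELOP_TOPIC = "DEVELOP_TOPIC"
    INTRODUCE_TOPIC = "INTRODUCE_TOPIC"
    CHANGE_TOPIC = "CHANGE_TOPIC"


# Intents that imply a topic boundary; the other three imply continuation.
BOUNDARY_INTENTS = {Intent.INTRODUCE_TOPIC, Intent.CHANGE_TOPIC}


def parse_intent(llm_output: str) -> Intent:
    """Extract the label from an <intent>...</intent> tag (assumed schema)."""
    match = re.search(r"<intent>\s*([A-Z_]+)\s*</intent>", llm_output)
    if match is None:
        raise ValueError("no <intent> tag found in model output")
    # Intent(...) raises ValueError for labels outside the five classes.
    return Intent(match.group(1))


def is_topic_shift(intent: Intent) -> bool:
    """Deductive rule: boundary iff the intent introduces or changes a topic."""
    return intent in BOUNDARY_INTENTS


def segment(llm_outputs: List[str]) -> List[int]:
    """Map one model output per utterance to a 0/1 boundary label."""
    return [int(is_topic_shift(parse_intent(out))) for out in llm_outputs]


if __name__ == "__main__":
    outputs = [
        "<reasoning>Continues the chat.</reasoning><intent>DEVELOP_TOPIC</intent>",
        "<intent> CHANGE_TOPIC </intent>",
        "<intent>JUST_ANSWER</intent>",
        "<intent>INTRODUCE_TOPIC</intent>",
    ]
    print(segment(outputs))
```

Keeping the final decision as a fixed rule over a closed label set means a malformed or unexpected label fails loudly at parse time instead of silently becoming a boundary decision.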
Loss & Training
- Training-free: a pure prompt-based method that relies on the in-context learning capabilities of LLMs.
- Supports various LLMs including GPT-4o, LLaMA-3.1-70B, Qwen2.5-72B, and DeepSeek-R1/V3.
Key Experimental Results
Main Results (3 Datasets)
| Dataset | DEF-DTS Pk↓ | DEF-DTS WD↓ | DEF-DTS F1↑ | Best Supervised Pk↓ |
|---|---|---|---|---|
| TIAGE | 0.232 | 0.256 | 0.699 | 0.130 (RoBERTa) |
| SuperDialseg | 0.315 | 0.324 | 0.686 | 0.185 (RoBERTa) |
| Dialseg711 | 0.015 | 0.018 | 0.979 | 0.034 (BERT) |
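For readers unfamiliar with the Pk and WD (WindowDiff) columns, the sketch below shows one common window-based formulation of both metrics over 0/1 boundary sequences; lower is better for each. It is an illustration, not the evaluation script behind the table: the boundary encoding (1 marks a boundary at that position), the default window size (half the mean reference segment length), and the function names are assumptions.

```python
from typing import List, Optional


def default_window(reference: List[int]) -> int:
    """Half the mean reference segment length (assumed convention), at least 1."""
    segments = sum(reference) + 1
    return max(1, round(len(reference) / (2 * segments)))


def _check(reference: List[int], hypothesis: List[int], k: int) -> None:
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have equal length")
    if k < 1 or k > len(reference):
        raise ValueError("window size must be between 1 and the sequence length")


def pk(reference: List[int], hypothesis: List[int], k: Optional[int] = None) -> float:
    """Fraction of windows where only one sequence contains a boundary."""
    if k is None:
        k = default_window(reference)
    _check(reference, hypothesis, k)
    windows = len(reference) - k + 1
    errors = 0
    for i in range(windows):
        ref_has = any(reference[i:i + k])
        hyp_has = any(hypothesis[i:i + k])
        errors += int(ref_has != hyp_has)
    return errors / windows


def window_diff(reference: List[int], hypothesis: List[int], k: Optional[int] = None) -> float:
    """Fraction of windows where the boundary counts differ."""
    if k is None:
        k = default_window(reference)
    _check(reference, hypothesis, k)
    windows = len(reference) - k + 1
    errors = 0
    for i in range(windows):
        errors += int(sum(reference[i:i + k]) != sum(hypothesis[i:i + k]))
    return errors / windows


if __name__ == "__main__":
    ref = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
    hyp = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]  # first boundary predicted one step late
    print(pk(ref, hyp), window_diff(ref, hyp))
```

Both metrics give partial credit for near misses, which is why they are reported alongside F1: a boundary predicted one utterance late is penalized in only some of the windows rather than counted as a full miss plus a false alarm.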
Ablation Study (TIAGE Dataset)
| Configuration | Pk↓ | F1↑ | Description |
|---|---|---|---|
| DEF-DTS full | 0.232 | 0.699 | |
| w/o Intent Classification | 0.366 | 0.524 | F1 drops by 17.5 points; intent classification is the core component |
| w/o Bidirectional Context | 0.248 | 0.682 | Slight decline |
| w/o Intent Examples | 0.290 | 0.617 | Examples are critical for accuracy |
| JSON Format | 0.257 | 0.658 | XML > JSON > Natural Language |
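The F1 drops implied by the table can be recomputed directly from the reported values. The short sketch below does this arithmetic in percentage points; the dictionary keys are just the row labels from the table, and nothing here goes beyond the numbers already listed.

```python
# F1 values copied from the TIAGE ablation table above.
FULL_F1 = 0.699
ABLATION_F1 = {
    "w/o Intent Classification": 0.524,
    "w/o Bidirectional Context": 0.682,
    "w/o Intent Examples": 0.617,
    "JSON Format": 0.658,
}


def f1_drops(full: float, ablations: dict) -> dict:
    """Return the F1 drop of each ablation, in percentage points."""
    return {name: round((full - f1) * 100, 1) for name, f1 in ablations.items()}


if __name__ == "__main__":
    drops = f1_drops(FULL_F1, ABLATION_F1)
    # Print ablations from largest to smallest drop.
    for name, drop in sorted(drops.items(), key=lambda item: -item[1]):
        print(f"{name}: -{drop} F1 points")
```

Sorted this way, removing intent classification costs the most, followed by removing the intent examples, switching to JSON, and removing the bidirectional context.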
Key Findings
- Intent classification is the key to success: Removing intent classification causes the F1 score to drop sharply from 0.699 to 0.524, which supports the effectiveness of the problem decomposition strategy.
- Surpasses supervised methods on the synthetic dataset Dialseg711: Pk of 0.015 vs. BERT's 0.034, with a near-perfect F1 of 0.979.
- Largest gain at topic shifts: On utterances containing a topic shift, DEF-DTS achieves an approximately 40% improvement over the baseline.
- \(\chi^2\) test supports the linguistic validity of the intent labels: \(\chi^2(32)=76.2263, p<0.001\), indicating that the 5 intent labels are significantly associated with topic shifts.
- Generalization across LLMs: Effective across various LLMs including GPT-4o, LLaMA-3.1-70B, and Qwen2.5-72B.
Highlights & Insights
- Problem reformulation of "Topic Segmentation = Intent Classification": Reframing vague topic-change judgments as a clear 5-class intent classification task significantly reduces task difficulty.
- Domain-independent intent definitions: The 5 intents (comment/answer/develop/introduce/change) represent general dialogue behaviors, free from domain-specific dependencies.
- Structural advantage of the XML prompt format: A simple formatting choice raises F1 to 0.699, compared with 0.658 for JSON and 0.640 for natural language.
Limitations & Future Work
- The 5 intent labels may not be fully comprehensive; certain conversational flows (e.g., rhetorical questions, topic drift) are not covered.
- Performance still lags behind supervised methods on real dialogue datasets (TIAGE/SuperDialseg), mostly due to poor performance near noisy boundaries.
- Smaller LLMs (\(<70\text{B}\)) exhibit formatting errors, requiring model-specific tuning.
- The selection strategy for intent examples has not been thoroughly explored; manual selection may not be optimal.
- Cohen's Kappa is moderate on TIAGE (0.485) and SuperDialseg (0.429), implying that the annotation consistency of the task itself is not high.
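For reference, Cohen's Kappa measures agreement between two label sequences after correcting for the agreement expected by chance. The sketch below is a generic implementation over boundary labels; which two sequences the reported values compare is not stated here, so the inputs and the function name `cohens_kappa` are illustrative assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the two marginal label distributions.
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label everywhere
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    rater_a = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
    rater_b = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]
    print(round(cohens_kappa(rater_a, rater_b), 3))
```

Because boundaries are rare, raw agreement is high even for weak segmenters; Kappa discounts that chance agreement, which is why values around 0.4 to 0.5 can coexist with a high proportion of matching labels.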
Related Work & Insights
- vs. S3-DST: S3-DST uses LLMs to directly judge whether a topic changes, lacking intermediate reasoning steps. DEF-DTS provides a reasoning chain through intent classification, significantly outperforming S3-DST.
- vs. BERT/RoBERTa Supervised Methods: Supervised methods perform better on TIAGE/SuperDialseg but require extensive annotations and exhibit poor cross-domain generalization. DEF-DTS is training-free.
- vs. TextTiling: TextTiling, a classic unsupervised method based on lexical similarity, fails to handle semantic-level topic shifts.
Rating
- Novelty: ⭐⭐⭐⭐ Restructuring topic segmentation as intent classification is an ingenious design.
- Experimental Thoroughness: ⭐⭐⭐⭐ 3 datasets, detailed ablation studies, and cross-LLM validation.
- Writing Quality: ⭐⭐⭐⭐ Clear structure and persuasive ablation analysis.
- Value: ⭐⭐⭐⭐ Provides a practical prompt engineering paradigm for unsupervised dialogue analysis.