DEF-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation

Conference: ACL 2025
arXiv: 2505.21033
Code: https://github.com/ElPlaguister/Def-DTS
Area: Dialogue Systems
Keywords: Dialogue Topic Segmentation, Deductive Reasoning, Intent Classification, LLM Prompting, Unsupervised

TL;DR

DEF-DTS is a dialogue topic segmentation method built on multi-step deductive reasoning with LLMs. A three-step pipeline (bidirectional context summarization \(\rightarrow\) 5-class utterance intent classification \(\rightarrow\) deductive topic shift judgment) achieves state-of-the-art results among unsupervised/prompt-based methods on three datasets (TIAGE, SuperDialseg, and Dialseg711) and outperforms supervised methods on Dialseg711.

Background & Motivation

Background: Dialogue Topic Segmentation (DTS) aims to identify topic boundaries in dialogues. Supervised methods (fine-tuned BERT/RoBERTa) require substantial annotation and have poor cross-domain generalization; prompt-based methods (directly asking the LLM to judge whether the topic has changed) yield unstable performance.

Limitations of Prior Work: (1) Supervised models rely on domain-specific data, making them costly and poorly generalizable; (2) Existing LLM prompt methods judge topic changes too coarsely, merely checking whether contexts are "different" without reasoning; (3) Implicit topic transitions (e.g., moving from a QA exchange to a new topic) and explicit topic switches require different identification strategies.

Key Challenge: Determining whether a topic changes requires complex reasoning capabilities—understanding what was previously said, identifying the intent of the current utterance, and recognizing whether the intent implies a topic shift. Relying on a single prompt for the LLM to make this judgment is overly challenging.

Goal: Design a multi-step reasoning pipeline that lets LLMs determine topic boundaries in dialogues systematically.

Key Insight: Decompose topic segmentation into three subtasks—context understanding, intent classification, and deductive reasoning—to lower the task difficulty at each step.

Core Idea: The essence of topic shift is the change in utterance intent (shifting from "developing the topic" to "introducing a new topic" or "changing the topic"). Hence, topic boundaries are indirectly determined through intent classification.

Method

Overall Architecture

A three-step pipeline processes each utterance: (1) Bidirectional context summarization (2 preceding and 3 succeeding utterances) \(\rightarrow\) (2) 5-class intent classification \(\rightarrow\) (3) Deductive reasoning to determine topic shifts. All prompts use a structured XML format.

Key Designs

Bidirectional Context Summarization:
- Function: Generate concise summaries of the preceding and succeeding context for the current utterance.
- Mechanism: Take the 2 preceding and 3 succeeding utterances, then have the LLM summarize each side separately. The window is asymmetric (2 before, 3 after) because topic shifts are typically more pronounced in subsequent utterances.
- Design Motivation: Summarizing provides more concise information than the raw context, reducing the context overhead for the LLM while retaining key semantics.
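
The window extraction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `context_window` and its parameters are hypothetical, and the summarization call to the LLM is left out.

```python
from typing import List, Tuple


def context_window(
    utterances: List[str], i: int, n_prev: int = 2, n_next: int = 3
) -> Tuple[List[str], List[str]]:
    """Return (preceding, succeeding) utterances around index i.

    Both sides are clipped at the dialogue edges, so the first utterance
    has no preceding context and the last has no succeeding context.
    """
    preceding = utterances[max(0, i - n_prev):i]
    succeeding = utterances[i + 1:i + 1 + n_next]
    return preceding, succeeding


dialogue = ["u0", "u1", "u2", "u3", "u4", "u5", "u6"]
before, after = context_window(dialogue, 3)
```

Each side would then be passed to the LLM for summarization before intent classification.
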

5-Class Utterance Intent Classification (Core):
- Function: Classify each utterance into one of five domain-independent intents.
- Mechanism: The 5 intents are: JUST_COMMENT (pure comment), JUST_ANSWER (answering a question), DEVELOP_TOPIC (developing the current topic), INTRODUCE_TOPIC (introducing a new subtopic), and CHANGE_TOPIC (shifting the topic). Each class is provided with 1-3 examples.
- Design Motivation: This is the core innovation of the method—judging topic shifts is reframed as an easier intent classification task. INTRODUCE_TOPIC and CHANGE_TOPIC imply topic boundaries, whereas the other three imply topic continuation. The domain-independent intent definitions allow the method to be used across different domains.
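
One way to represent the five labels in code is an enumeration with a tolerant parser for raw model output. The label names and glosses come from the list above; the `Intent` class and `parse_intent` helper are assumptions made for illustration.

```python
from enum import Enum


class Intent(Enum):
    JUST_COMMENT = "pure comment"
    JUST_ANSWER = "answering a question"
    DEVELOP_TOPIC = "developing the current topic"
    INTRODUCE_TOPIC = "introducing a new subtopic"
    CHANGE_TOPIC = "shifting the topic"


def parse_intent(raw: str) -> Intent:
    """Map a raw label string from the LLM onto an Intent member.

    Whitespace and case are normalized; anything else is rejected so that
    malformed outputs surface early instead of being silently mislabeled.
    """
    name = raw.strip().upper().replace(" ", "_")
    try:
        return Intent[name]
    except KeyError:
        raise ValueError(f"unknown intent label: {raw!r}") from None
```
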

Deductive Reasoning:
- Function: Rule-based deduction of whether a topic shift occurred based on the classified intent.
- Mechanism: If the intent is INTRODUCE_TOPIC or CHANGE_TOPIC, it is marked as a topic boundary; otherwise, it is marked as topic continuation.
- Design Motivation: Reduce the final judgment to a rule over intents, so the LLM never has to answer the vague question "is the topic changing?" directly.
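
The deduction rule stated above is simple enough to write down directly. The sketch below assumes intents arrive as plain label strings and that a label of 1 marks a topic boundary; the function names are hypothetical.

```python
from typing import Iterable, List

# Intents that imply a topic boundary, per the rule described above.
BOUNDARY_INTENTS = frozenset({"INTRODUCE_TOPIC", "CHANGE_TOPIC"})


def is_topic_shift(intent: str) -> bool:
    """True if the classified intent implies a topic boundary."""
    return intent in BOUNDARY_INTENTS


def boundary_labels(intents: Iterable[str]) -> List[int]:
    """Convert per-utterance intents into binary labels (1 = boundary)."""
    return [int(is_topic_shift(intent)) for intent in intents]


labels = boundary_labels(
    ["DEVELOP_TOPIC", "JUST_ANSWER", "CHANGE_TOPIC", "JUST_COMMENT", "INTRODUCE_TOPIC"]
)
```
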

Structured XML Prompt Format:
- Function: Structure the inputs and outputs using XML tags.
- Mechanism: Organize the prompt with XML tags (e.g., <context>, <intent>, <reasoning>).
- Design Motivation: The XML format reaches an F1 of 0.699, ahead of JSON (0.658) and natural language (0.640), and produces more stable, easily parsable outputs.
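
As a rough sketch of what XML-structured input and output handling might look like: only the `<context>`, `<intent>`, and `<reasoning>` tags are mentioned above, so the inner tags (`<previous>`, `<next>`, `<utterance>`) and both helper functions are assumptions, not the paper's actual prompt.

```python
import re
from typing import Optional
from xml.sax.saxutils import escape


def build_prompt(prev_summary: str, utterance: str, next_summary: str) -> str:
    """Wrap the inputs in XML tags; user text is escaped to keep tags intact."""
    return (
        "<context>\n"
        f"  <previous>{escape(prev_summary)}</previous>\n"
        f"  <next>{escape(next_summary)}</next>\n"
        "</context>\n"
        f"<utterance>{escape(utterance)}</utterance>"
    )


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>\s*(.*?)\s*</{tag}>", text, re.DOTALL)
    return match.group(1) if match else None


response = "<reasoning>The speaker starts a new subject.</reasoning>\n<intent> CHANGE_TOPIC </intent>"
intent = extract_tag(response, "intent")
```

A regular expression is used instead of a full XML parser because model output is not guaranteed to be a well-formed document.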

Loss & Training

Training-free—a pure prompt-based method that relies on the in-context learning capabilities of LLMs.
Supports various LLMs including GPT-4o, LLaMA-3.1-70B, Qwen2.5-72B, and DeepSeek-R1/V3.

Key Experimental Results

Main Results (3 Datasets)

| Dataset | DEF-DTS Pk↓ | DEF-DTS WD↓ | DEF-DTS F1↑ | Best Supervised Pk↓ |
| --- | --- | --- | --- | --- |
| TIAGE | 0.232 | 0.256 | 0.699 | 0.130 (RoBERTa) |
| SuperDialseg | 0.315 | 0.324 | 0.686 | 0.185 (RoBERTa) |
| Dialseg711 | 0.015 | 0.018 | 0.979 | 0.034 (BERT) |
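
For readers unfamiliar with the error metrics in this table, Pk and WindowDiff (WD) can be computed as sketched below; lower is better for both. The sketch assumes per-utterance binary labels where 1 means the utterance starts a new topic, and uses the common default of half the mean reference segment length for the window size `k`. The paper's own evaluation script may use different conventions.

```python
from itertools import accumulate
from typing import List, Optional


def _segment_ids(labels: List[int]) -> List[int]:
    """Cumulative sum turns boundary labels into a segment id per utterance."""
    return list(accumulate(labels))


def _window(ref_ids: List[int], k: Optional[int]) -> int:
    if k is None:
        k = max(1, round(len(ref_ids) / (2 * len(set(ref_ids)))))
    if k >= len(ref_ids):
        raise ValueError("window size k must be smaller than the dialogue length")
    return k


def pk(ref: List[int], hyp: List[int], k: Optional[int] = None) -> float:
    """Fraction of windows where ref and hyp disagree on 'same segment or not'."""
    r, h = _segment_ids(ref), _segment_ids(hyp)
    k = _window(r, k)
    n = len(r) - k
    return sum((r[i] == r[i + k]) != (h[i] == h[i + k]) for i in range(n)) / n


def window_diff(ref: List[int], hyp: List[int], k: Optional[int] = None) -> float:
    """Fraction of windows where ref and hyp contain a different boundary count."""
    r, h = _segment_ids(ref), _segment_ids(hyp)
    k = _window(r, k)
    n = len(r) - k
    return sum((r[i + k] - r[i]) != (h[i + k] - h[i]) for i in range(n)) / n


reference = [0, 0, 0, 1, 0, 0]   # one boundary before the fourth utterance
no_boundaries = [0, 0, 0, 0, 0, 0]
```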

Ablation Study (TIAGE Dataset)

| Configuration | Pk↓ | F1↑ | Description |
| --- | --- | --- | --- |
| DEF-DTS full | 0.232 | 0.699 | Reference configuration |
| w/o Intent Classification | 0.366 | 0.524 | F1 drops by 17.5 points; intent classification is core |
| w/o Bidirectional Context | 0.248 | 0.682 | Slight decline |
| w/o Intent Examples | 0.290 | 0.617 | Examples are critical for accuracy |
| JSON Format | 0.257 | 0.658 | XML > JSON > Natural Language |
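
For reference, the F1 drops relative to the full configuration follow directly from the ablation rows above:

\[
\begin{aligned}
\Delta F_1(\text{w/o intent classification}) &= 0.699 - 0.524 = 0.175 \\
\Delta F_1(\text{w/o intent examples}) &= 0.699 - 0.617 = 0.082 \\
\Delta F_1(\text{JSON format}) &= 0.699 - 0.658 = 0.041 \\
\Delta F_1(\text{w/o bidirectional context}) &= 0.699 - 0.682 = 0.017
\end{aligned}
\]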

Key Findings

Intent classification is the key to success: Removing intent classification causes the F1 score to drop sharply from 0.699 to 0.524, strong evidence that the problem decomposition strategy is effective.
Surpasses supervised methods on the synthetic dataset Dialseg711: Pk of 0.015 vs. BERT's 0.034, with a near-perfect F1 of 0.979.
Largest gain at topic shifts: On utterances containing a topic shift, DEF-DTS improves over the baseline by approximately 40%.
\(\chi^2\) test supports the linguistic validity of intent labels: \(\chi^2(32)=76.2263, p<0.001\), indicating a statistically significant association between the intent labels and topic shifts.
Generalization across LLMs: Effective across various LLMs including GPT-4o, LLaMA-3.1-70B, and Qwen2.5-72B.
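
The \(\chi^2\) test of independence cited in the findings above can be sketched in plain Python. The counts below are made-up toy numbers for illustration only, and the 5×2 layout (intent × shift/no shift) is an assumption: the reported 32 degrees of freedom imply that the authors' contingency table is larger, and its exact layout is not given here.

```python
from typing import List, Tuple


def chi_square(table: List[List[float]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Toy counts (NOT from the paper): rows are the five intents,
# columns are (no topic shift, topic shift).
toy_counts = [[40, 2], [35, 1], [50, 5], [4, 20], [3, 25]]
statistic, dof = chi_square(toy_counts)
```

The statistic is then compared against the \(\chi^2\) distribution with `dof` degrees of freedom to obtain a p-value.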

Highlights & Insights

Problem reformulation of "Topic Segmentation = Intent Classification": Reframing vague topic-change judgments into clear 5-class intent classification significantly reduces task difficulty.
Domain-independent intent definitions: The 5 intents (comment/answer/develop/introduce/change) represent general dialogue behaviors, free from domain-specific dependencies.
Structural advantage of the XML prompt format: A simple formatting choice yielded notable performance improvements.

Limitations & Future Work

The 5 intent labels may not be fully comprehensive; certain conversational flows (e.g., rhetorical questions, topic drift) are not covered.
Performance still lags behind supervised methods on real dialogue datasets (TIAGE/SuperDialseg), mostly due to poor performance near noisy boundaries.
Smaller LLMs (\(<70\text{B}\)) exhibit formatting errors, requiring model-specific tuning.
The selection strategy for intent examples has not been thoroughly explored; manual selection may not be optimal.
Cohen's Kappa is moderate on TIAGE (0.485) and SuperDialseg (0.429), implying that the annotation consistency of the task itself is not high.
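
Cohen's Kappa, as cited in the last item above, corrects raw agreement between two label sequences for agreement expected by chance. A minimal sketch, with a hypothetical function name and toy labels:

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two equally long label sequences.

    Undefined (division by zero) when chance agreement is exactly 1,
    i.e. both sequences use a single identical label throughout.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)


kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```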

vs. S3-DST: S3-DST uses LLMs to directly judge whether a topic changes, lacking intermediate reasoning steps. DEF-DTS provides a reasoning chain through intent classification, significantly outperforming S3-DST.
vs. BERT/RoBERTa Supervised Methods: Supervised methods perform better on TIAGE/SuperDialseg but require extensive annotations and exhibit poor cross-domain generalization. DEF-DTS is training-free.
vs. TextTiling: TextTiling, a classic unsupervised method based on lexical similarity, fails to handle semantic-level topic shifts.

Rating

Novelty: ⭐⭐⭐⭐ Restructuring topic segmentation as intent classification is an ingenious design.
Experimental Thoroughness: ⭐⭐⭐⭐ 3 datasets, detailed ablation studies, and cross-LLM validation.
Writing Quality: ⭐⭐⭐⭐ Clear structure and persuasive ablation analysis.
Value: ⭐⭐⭐⭐ Provides a practical prompt engineering paradigm for unsupervised dialogue analysis.