DEF-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation

Conference: ACL 2025
arXiv: 2505.21033
Code: https://github.com/ElPlaguister/Def-DTS
Area: Natural Language Processing (Dialogue Systems)
Keywords: Dialogue Topic Segmentation, Deductive Reasoning, Intent Classification, LLM Prompting, Unsupervised

TL;DR

DEF-DTS is a dialogue topic segmentation method built on multi-step deductive reasoning with LLMs. Its three-step pipeline, bidirectional context summarization \(\rightarrow\) 5-class utterance intent classification \(\rightarrow\) deductive topic shift judgment, achieves state-of-the-art results among unsupervised and prompt-based methods on three datasets (TIAGE, SuperDialseg, and Dialseg711) and outperforms supervised methods on Dialseg711.

Background & Motivation

Background: Dialogue Topic Segmentation (DTS) aims to identify topic boundaries in dialogues. Supervised methods (fine-tuned BERT/RoBERTa) require substantial annotation and have poor cross-domain generalization; prompt-based methods (directly asking the LLM to judge whether the topic has changed) yield unstable performance.

Limitations of Prior Work: (1) Supervised models rely on domain-specific data, making them costly and poorly generalizable; (2) Existing LLM prompting methods judge topic changes too coarsely, merely checking whether contexts are "different" without reasoning; (3) Subtle topic transitions (e.g., moving from a QA exchange to a new topic) and explicit switches require different identification strategies.

Key Challenge: Determining whether a topic changes requires complex reasoning: understanding what was previously said, identifying the intent of the current utterance, and recognizing whether that intent implies a topic shift. Asking the LLM to do all of this in a single prompt is too demanding.

Goal: Design a multi-step reasoning pipeline that lets LLMs determine topic boundaries in dialogues systematically.

Key Insight: Decompose topic segmentation into three subtasks—context understanding, intent classification, and deductive reasoning—to lower the task difficulty at each step.

Core Idea: The essence of topic shift is the change in utterance intent (shifting from "developing the topic" to "introducing a new topic" or "changing the topic"). Hence, topic boundaries are indirectly determined through intent classification.

Method

Overall Architecture

A three-step pipeline processes each utterance: (1) bidirectional context summarization (the 2 preceding and 3 succeeding sentences) \(\rightarrow\) (2) 5-class intent classification \(\rightarrow\) (3) deductive reasoning to determine topic shifts. Prompts follow a structured XML format.
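
As a rough illustration of step (1), the asymmetric context window can be sketched as a slicing helper. The function name `context_window` and its parameters are hypothetical and do not come from the DEF-DTS code base; the sketch only shows how a 2-before / 3-after window might be clipped at the edges of a dialogue before each side is summarized by the LLM.

```python
from typing import List, Tuple


def context_window(
    utterances: List[str], i: int, n_prev: int = 2, n_next: int = 3
) -> Tuple[List[str], List[str]]:
    """Return (preceding, succeeding) utterances around index i.

    Hypothetical helper: the window is clipped at the start and end of the
    dialogue, so either side may hold fewer items than requested.
    """
    preceding = utterances[max(0, i - n_prev):i]
    succeeding = utterances[i + 1:i + 1 + n_next]
    return preceding, succeeding


# Toy dialogue of seven placeholder utterances (not taken from any dataset).
dialogue = [f"u{idx}" for idx in range(7)]
before, after = context_window(dialogue, 3)
print(before, after)
```

Each side would then be passed to the LLM for summarization; that call is omitted here because the exact prompt wording is not given in these notes.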

Key Designs

  1. Bidirectional Context Summarization:

    • Function: Generate concise summaries of the preceding and succeeding context for the current utterance.
    • Mechanism: Extract 2 preceding sentences and 3 succeeding sentences, then generate summaries using the LLM for each. An asymmetric window (2 before, 3 after) is used because topic shifts are typically more pronounced in subsequent utterances.
    • Design Motivation: Summarizing provides more concise information than the raw context, reducing the context overhead for the LLM while retaining key semantics.
  2. 5-Class Utterance Intent Classification (Core):

    • Function: Classify each utterance into one of five domain-independent intents.
    • Mechanism: The 5 intents are: JUST_COMMENT (pure comment), JUST_ANSWER (answering a question), DEVELOP_TOPIC (developing the current topic), INTRODUCE_TOPIC (introducing a new subtopic), and CHANGE_TOPIC (shifting the topic). Each class is provided with 1-3 examples.
    • Design Motivation: The core innovation of the method is to reframe topic shift judgment as an easier intent classification task. INTRODUCE_TOPIC and CHANGE_TOPIC imply topic boundaries, whereas the other three imply topic continuation. Because the intent definitions are domain-independent, the method can be applied across domains.
  3. Deductive Reasoning:

    • Function: Rule-based deduction of whether a topic shift occurred based on the classified intent.
    • Mechanism: If the intent is INTRODUCE_TOPIC or CHANGE_TOPIC, it is marked as a topic boundary; otherwise, it is marked as topic continuation.
    • Design Motivation: Simplify the final judgment into rule-based reasoning using intents, preventing the LLM from directly making vague "is the topic changing" decisions.
  4. Structured XML Prompt Format:

    • Function: Structure the inputs and outputs using XML tags.
    • Mechanism: Organize the prompt with XML tags (e.g., <context>, <intent>, <reasoning>).
    • Design Motivation: On F1, the XML format (0.699) outperforms JSON (0.658) and natural language (0.640), and it produces more stable, easily parsable outputs.
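
The intent-to-boundary rule and the XML-tagged output described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' prompt: only the tags `<context>`, `<intent>`, and `<reasoning>` and the five intent names appear in these notes, while the nested tag names (`<preceding>`, `<succeeding>`, `<utterance>`), the function names, and the fallback for unparsable responses are hypothetical.

```python
import re
from typing import Optional
from xml.sax.saxutils import escape

INTENTS = (
    "JUST_COMMENT",
    "JUST_ANSWER",
    "DEVELOP_TOPIC",
    "INTRODUCE_TOPIC",
    "CHANGE_TOPIC",
)
# Per the deductive rule: only these two intents mark a topic boundary.
BOUNDARY_INTENTS = frozenset({"INTRODUCE_TOPIC", "CHANGE_TOPIC"})


def build_prompt(prev_summary: str, next_summary: str, utterance: str) -> str:
    """Assemble an XML-tagged prompt body (hypothetical layout)."""
    return (
        "<context>\n"
        f"  <preceding>{escape(prev_summary)}</preceding>\n"
        f"  <succeeding>{escape(next_summary)}</succeeding>\n"
        "</context>\n"
        f"<utterance>{escape(utterance)}</utterance>\n"
        "Reply with <reasoning>...</reasoning> and <intent>...</intent>."
    )


def parse_intent(response: str) -> Optional[str]:
    """Extract the intent label from a model response, or None if absent or invalid."""
    match = re.search(r"<intent>\s*([A-Z_]+)\s*</intent>", response)
    if match and match.group(1) in INTENTS:
        return match.group(1)
    return None


def is_topic_boundary(intent: Optional[str]) -> bool:
    """Rule-based deduction; unparsable output is treated as continuation (an assumption)."""
    return intent in BOUNDARY_INTENTS


# Canned response standing in for a real LLM call.
response = (
    "<reasoning>The speaker leaves the film discussion.</reasoning>\n"
    "<intent>CHANGE_TOPIC</intent>"
)
print(is_topic_boundary(parse_intent(response)))
```

Keeping the final decision in plain code rather than in the prompt mirrors the design motivation above: the LLM only has to pick one of five labels, and the boundary judgment follows mechanically.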

Loss & Training

  • Training-free—a pure prompt-based method that relies on the in-context learning capabilities of LLMs.
  • Supports various LLMs including GPT-4o, LLaMA-3.1-70B, Qwen2.5-72B, and DeepSeek-R1/V3.

Key Experimental Results

Main Results (3 Datasets)

| Dataset | DEF-DTS Pk↓ | DEF-DTS WD↓ | DEF-DTS F1↑ | Best Supervised Pk↓ |
|---|---|---|---|---|
| TIAGE | 0.232 | 0.256 | 0.699 | 0.130 (RoBERTa) |
| SuperDialseg | 0.315 | 0.324 | 0.686 | 0.185 (RoBERTa) |
| Dialseg711 | 0.015 | 0.018 | 0.979 | 0.034 (BERT) |
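
For readers unfamiliar with the Pk and WD (WindowDiff) columns, the sketch below shows one common windowed formulation of both metrics over boundary indicator sequences (1 = a topic boundary at that position). It is an illustration only: the window-size convention (half the average reference segment length) and the function names are assumptions, and the paper's evaluation script may differ in details.

```python
from typing import Sequence


def _window_size(ref: Sequence[int]) -> int:
    """Half the average reference segment length, at least 1 (a common convention)."""
    n_boundaries = sum(ref)
    if n_boundaries == 0:
        return max(1, len(ref) // 2)
    return max(1, int(round(len(ref) / (2.0 * n_boundaries))))


def pk(ref: Sequence[int], hyp: Sequence[int]) -> float:
    """Fraction of windows where ref and hyp disagree on whether any boundary is present."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same length")
    k = _window_size(ref)
    n_windows = len(ref) - k + 1
    errors = 0
    for i in range(n_windows):
        ref_has = any(ref[i:i + k])
        hyp_has = any(hyp[i:i + k])
        errors += ref_has != hyp_has
    return errors / n_windows


def window_diff(ref: Sequence[int], hyp: Sequence[int]) -> float:
    """Fraction of windows where ref and hyp contain a different number of boundaries."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same length")
    k = _window_size(ref)
    n_windows = len(ref) - k + 1
    errors = 0
    for i in range(n_windows):
        errors += sum(ref[i:i + k]) != sum(hyp[i:i + k])
    return errors / n_windows


# Toy sequences (not from any dataset): a near miss should score better than no prediction.
reference = [0, 0, 1, 0, 0, 1, 0, 0]
near_miss = [0, 1, 0, 0, 0, 1, 0, 0]
no_boundaries = [0] * 8
print(pk(reference, near_miss), pk(reference, no_boundaries))
print(window_diff(reference, near_miss), window_diff(reference, no_boundaries))
```

Both metrics are error rates, which is why lower values are better in the table; unlike F1, they give partial credit to predictions that land close to a true boundary.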

Ablation Study (TIAGE Dataset)

| Configuration | Pk↓ | F1↑ | Description |
|---|---|---|---|
| DEF-DTS full | 0.232 | 0.699 | |
| w/o Intent Classification | 0.366 | 0.524 | F1 drops by 17.5 points, intent classification is core |
| w/o Bidirectional Context | 0.248 | 0.682 | Slight decline |
| w/o Intent Examples | 0.290 | 0.617 | Examples are critical for accuracy |
| JSON Format | 0.257 | 0.658 | XML > JSON > Natural Language |

Key Findings

  • Intent classification is the key to success: Removing intent classification drops F1 sharply from 0.699 to 0.524, strong evidence that the problem decomposition strategy is effective.
  • Surpasses supervised methods on the synthetic dataset Dialseg711: Pk of 0.015 vs. BERT's 0.034, with a near-perfect F1 of 0.979.
  • Largest gain at topic shifts: On utterances containing a topic shift, DEF-DTS improves over the baseline by approximately 40%.
  • \(\chi^2\) test supports the linguistic validity of the intent labels: \(\chi^2(32)=76.2263,\ p<0.001\), indicating a statistically significant association between the 5 intent labels and topic shifts.
  • Generalization across LLMs: Effective across various LLMs including GPT-4o, LLaMA-3.1-70B, and Qwen2.5-72B.
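
To make the \(\chi^2\) finding more concrete, the sketch below computes a Pearson chi-square statistic for a contingency table of intent labels against topic-shift outcomes. The counts are made up, and the 5×2 layout is an assumption: it yields 4 degrees of freedom, so it is not the table behind the reported \(\chi^2(32)\), whose shape is not stated in these notes.

```python
from typing import List, Tuple


def chi_square(table: List[List[float]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for an r x c table.

    Assumes every row and column total is positive, so no expected count is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows are the five intents, columns are (no shift, shift).
toy_counts = [
    [120, 5],   # JUST_COMMENT
    [150, 4],   # JUST_ANSWER
    [300, 12],  # DEVELOP_TOPIC
    [20, 60],   # INTRODUCE_TOPIC
    [10, 80],   # CHANGE_TOPIC
]
stat, dof = chi_square(toy_counts)
print(f"chi2({dof}) = {stat:.2f}")
```

A large statistic relative to the degrees of freedom indicates that intent labels and topic shifts are not independent; converting it to a p-value requires the chi-square distribution, which is left out to keep the sketch dependency-free.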

Highlights & Insights

  • Problem reformulation of "Topic Segmentation = Intent Classification": Reframing vague topic-change judgments into clear 5-class intent classification significantly reduces task difficulty.
  • Domain-independent intent definitions: The 5 intents (comment/answer/develop/introduce/change) represent general dialogue behaviors, free from domain-specific dependencies.
  • Structural advantage of the XML prompt format: A simple formatting choice yielded notable performance improvements.

Limitations & Future Work

  • The 5 intent labels may not be fully comprehensive; certain conversational flows (e.g., rhetorical questions, topic drift) are not covered.
  • Performance still lags behind supervised methods on real dialogue datasets (TIAGE/SuperDialseg), mostly due to poor performance near noisy boundaries.
  • Smaller LLMs (\(<70\text{B}\)) exhibit formatting errors, requiring model-specific tuning.
  • The selection strategy for intent examples has not been thoroughly explored; manual selection may not be optimal.
  • Cohen's Kappa is moderate on TIAGE (0.485) and SuperDialseg (0.429), implying that the annotation consistency of the task itself is not high.

Related Work & Insights

  • vs. S3-DST: S3-DST uses LLMs to directly judge whether a topic changes, lacking intermediate reasoning steps. DEF-DTS provides a reasoning chain through intent classification, significantly outperforming S3-DST.
  • vs. BERT/RoBERTa Supervised Methods: Supervised methods perform better on TIAGE/SuperDialseg but require extensive annotations and exhibit poor cross-domain generalization. DEF-DTS is training-free.
  • vs. TextTiling: Classic unsupervised methods based on lexical similarity fail to handle semantic-level topic shifts.

Rating

  • Novelty: ⭐⭐⭐⭐ Restructuring topic segmentation as intent classification is an ingenious design.
  • Experimental Thoroughness: ⭐⭐⭐⭐ 3 datasets, detailed ablation studies, and cross-LLM validation.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure and persuasive ablation analysis.
  • Value: ⭐⭐⭐⭐ Provides a practical prompt engineering paradigm for unsupervised dialogue analysis.