
Template-assisted Contrastive Learning of Task-oriented Dialogue Sentence Embeddings

Conference: ACL 2026 arXiv: 2305.14299 Code: GitHub Area: Dialogue Systems Keywords: dialogue sentence embeddings, contrastive learning, template augmentation, intent classification, unsupervised representation learning

TL;DR

This paper proposes TaDSE, a framework that leverages the template information already present in dialogue data as auxiliary anchors. Through three stages (template-aware data augmentation, paired contrastive training, and semantic compression at inference), TaDSE substantially improves sentence-embedding quality for task-oriented dialogue in an unsupervised setting, surpassing the previous SOTA on five benchmarks and even outperforming supervised commercial embedding models on the SNIPS/ATIS average.

Background & Motivation

Background: Learning high-quality dialogue sentence embeddings is critical for downstream tasks such as intent classification and slot filling in low-annotation scenarios. Existing unsupervised sentence embedding methods (e.g., SimCSE, PromptBERT) perform well on general text but degrade noticeably when transferred to the dialogue domain, due to the unique semantic relational structure among dialogue utterances.

Limitations of Prior Work: Obtaining utterance-level semantic relation annotations in the dialogue domain is extremely difficult, whereas token-level annotations (e.g., entities, slots, templates) are relatively easy to acquire. However, existing sentence embedding frameworks operate as sentence-level self-supervised systems and cannot exploit these rich token-level auxiliary signals. General data augmentation methods (e.g., back-translation, rule-based substitution) are prone to semantic drift or require additional model support.

Key Challenge: Dialogues contain abundant structured template information—multiple utterances with different surface forms share the same underlying template—yet utterance-template pairing relationships have never been exploited for embedding learning. Existing methods perform contrastive learning solely in the utterance space, ignoring the potential of templates as semantic anchors that could constrain the structure of the embedding space.

Goal: Design an unsupervised framework that leverages template information to enhance dialogue sentence embeddings, yielding tighter clustering of semantically similar utterances and clearer decision boundaries.

Key Insight: The authors observe that templates serve as the "semantic skeleton" of utterances—utterances sharing the same template differ only in slot values while preserving the same core semantic structure. Introducing templates as auxiliary representations into contrastive learning enables the model to learn to distinguish correct utterance-template pairings, thereby improving the embedding space.

Core Idea: Template-aware data augmentation expands the diversity of utterance-template pairs; a triple contrastive loss (template loss + utterance loss + pairing loss) is then used for joint training; finally, semantic compression fuses template representations into utterance representations at inference time to further refine the embeddings.

Method

Overall Architecture

TaDSE consists of three stages: (1) Template Data Augmentation, which generates a large number of natural-sounding synthetic utterances by substituting frequent slot values into existing templates; (2) Paired Contrastive Training, which jointly learns template representations, utterance representations, and utterance-template pairing representations via three contrastive losses; (3) Semantic Compression Inference, which blends template representations into utterance representations at a tuned ratio during inference to enhance semantic discriminability. The inputs are dialogue utterances and their corresponding templates; the output is an optimized sentence embedding vector.
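As a concrete illustration of stage (1), the slot-filling step can be sketched as follows. The mini Slot Book below (slot types CITY/DEVICE and their values) is purely illustrative, not the paper's actual statistics-derived Slot Book:

```python
from itertools import product

# Hypothetical mini Slot Book: slot type -> top-k frequent values
# (illustrative entries; the paper derives these from dataset statistics).
SLOT_BOOK = {
    "CITY": ["Paris", "Tokyo", "London"],
    "DEVICE": ["thermostat", "speaker"],
}

def augment(template, slot_book):
    """Fill every {SLOT} placeholder in a template with each combination
    of frequent slot values, yielding synthetic utterances."""
    slots = [s for s in slot_book if "{" + s + "}" in template]
    utterances = []
    for values in product(*(slot_book[s] for s in slots)):
        utt = template
        for slot, value in zip(slots, values):
            utt = utt.replace("{" + slot + "}", value)
        utterances.append(utt)
    return utterances

print(augment("Book a flight to {CITY}", SLOT_BOOK))
# ['Book a flight to Paris', 'Book a flight to Tokyo', 'Book a flight to London']
```

Templates with multiple slots expand combinatorially, which is how a modest set of templates can yield the 834K augmented utterances reported below.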

Key Designs

  1. Template Data Augmentation:

    • Function: Expand the diversity of utterance-template pairs in the training data.
    • Mechanism: Slot types (e.g., CITY, DEVICE) and their high-frequency values are extracted from the dataset to construct a Slot Book. For each template, top-\(k\) frequent slot values are permuted to generate a large number of natural synthetic utterances. For example, "Book a flight to {CITY}" yields "Book a flight to Paris/Tokyo/London," etc. Across five datasets, 834K augmented utterances are generated, averaging 16 utterances per template.
    • Design Motivation: The effectiveness of paired contrastive learning depends on having sufficiently diverse utterances per template. The original datasets have a low utterance/template ratio; augmentation substantially increases pairing diversity, enabling the model to better learn discriminative representations.
  2. Triple Contrastive Loss:

    • Function: Jointly learn representations in the template space, utterance space, and utterance-template pairing space.
    • Mechanism: (a) The template loss \(L^t\) uses dropout noise to generate positive pairs, pulling together two encodings of the same template; (b) the utterance loss \(L^u\) follows the SimCSE framework to learn utterance representations; (c) the pairing loss \(L^{pair}\) treats correct utterance-template pairs as positives and other utterances as negatives, training the model to distinguish semantically matched pairs. The final loss is \(L^{train} = L^t + \lambda^u L^u + \lambda^{pair} L^{pair}\).
    • Design Motivation: Utterance-only contrastive learning cannot exploit the structural information in templates. The pairing loss enables the model to use templates as semantic anchors, pulling utterances under the same template closer together and pushing utterances from different templates apart, resulting in cleaner semantic clusters.
  3. Semantic Compression:

    • Function: Fuse template information into utterance representations at inference time to further refine embeddings.
    • Mechanism: The final representation is \(repr_i = \lambda^{comp} t_i + (1 - \lambda^{comp}) u_i\), where \(\lambda^{comp}\) is tuned on a validation set. Incorporating the template component enhances specific semantic dimensions, enabling distinction between utterances that are superficially similar but semantically different.
    • Design Motivation: Templates capture the semantic essence of utterances; blending in a moderate amount of template signal sharpens decision boundaries in ambiguous regions. Additionally, the optimal value of \(\lambda^{comp}\) serves as an analytical tool for measuring the quality of template-utterance semantic alignment.
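A toy 2-D sketch of the semantic compression step (the vectors are hand-picked for illustration, not real encoder outputs): two utterances whose raw embeddings are superficially similar but whose templates point in different semantic directions become easier to separate as the template component is blended in.

```python
import numpy as np

def compress(u, t, lam_comp):
    # repr_i = lam_comp * t_i + (1 - lam_comp) * u_i
    return lam_comp * t + (1.0 - lam_comp) * u

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Superficially similar utterances (u1, u2) whose templates (t1, t2)
# point in opposite semantic directions.
u1, u2 = np.array([1.0, 0.1]), np.array([1.0, -0.1])
t1, t2 = np.array([0.0, 1.0]), np.array([0.0, -1.0])

for lam in (0.0, 0.3, 0.6):
    sim = cos(compress(u1, t1, lam), compress(u2, t2, lam))
    print(f"lambda={lam}: cosine similarity {sim:.3f}")  # drops as lam grows
```

In this toy setting the cosine similarity between the two utterances falls monotonically as \(\lambda^{comp}\) grows, mirroring the paper's claim that template signal sharpens decision boundaries in ambiguous regions.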

Loss & Training

All three contrastive losses are based on the InfoNCE framework with in-batch negative sampling. Each loss branch has an independent temperature hyperparameter: \(\tau_t\), \(\tau_u\), and \(\tau_{pair}\). Training is transfer learning from a SimCSE-initialized BERT-base model; intent classification is evaluated by kNN retrieval against the training set. Optionally, a trainable MLP layer \(W_A\) is added to the template branch to adjust the template representation dimensionality.
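The combined objective can be sketched in numpy as three InfoNCE terms. The embeddings below are random stand-ins for encoder outputs, and the temperatures and \(\lambda\) weights are placeholders, not the paper's tuned hyperparameters:

```python
import numpy as np

def info_nce(anchors, positives, tau):
    """In-batch InfoNCE: row i of `anchors` should match row i of
    `positives`; the other rows in the batch serve as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / tau                       # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
B, D = 8, 16                                     # batch size, embedding dim
t1, t2 = rng.normal(size=(B, D)), rng.normal(size=(B, D))  # template: two dropout views
u1, u2 = rng.normal(size=(B, D)), rng.normal(size=(B, D))  # utterance: two dropout views
u, t = rng.normal(size=(B, D)), rng.normal(size=(B, D))    # paired utterance/template

lam_u, lam_pair = 1.0, 1.0                       # placeholder weights
loss = (info_nce(t1, t2, tau=0.05)               # L^t
        + lam_u * info_nce(u1, u2, tau=0.05)     # L^u
        + lam_pair * info_nce(u, t, tau=0.05))   # L^pair
print(float(loss))
```

The pairing term differs from the other two only in its inputs: instead of two dropout views of the same text, it contrasts an utterance against its matched template, with the rest of the batch's utterances as negatives.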

Key Experimental Results

Main Results

| Model | SNIPS | ATIS | MASSIVE | HWU64 | CLINC150 | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| BERT | 80.00 | 78.05 | 41.86 | 50.84 | 33.35 | 56.82 |
| SimCSE | 91.71 | 85.67 | 76.77 | 81.08 | 71.00 | 81.25 |
| DSE | 95.86 | 87.01 | 76.77 | 79.28 | 70.16 | 81.82 |
| TaDSE | 97.00 | 89.70 | 78.18 | 82.77 | 70.56 | 83.64 |
| TaDSE w/ MLP | 96.29 | 89.14 | 79.15 | 82.29 | 72.49 | 83.87 |
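The accuracies above come from kNN intent classification over frozen embeddings, per the evaluation protocol described under Loss & Training. A minimal numpy version of such an evaluation loop, with toy 2-D embeddings standing in for encoder outputs:

```python
import numpy as np

def knn_intent_accuracy(train_emb, train_labels, test_emb, test_labels, k=2):
    """Classify each test embedding by majority vote among its k nearest
    training embeddings under cosine similarity."""
    tr = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    te = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sim = te @ tr.T
    correct = 0
    for i in range(len(te)):
        nearest = np.argsort(-sim[i])[:k]
        votes = np.asarray(train_labels)[nearest]
        pred = np.bincount(votes).argmax()
        correct += int(pred == test_labels[i])
    return correct / len(te)

# Toy data: two well-separated intent clusters.
train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
test = np.array([[1.0, 0.05], [0.05, 1.0]])
print(knn_intent_accuracy(train, labels, test, np.array([0, 1])))  # 1.0
```

Because the classifier is non-parametric, any improvement in accuracy is attributable entirely to the embedding geometry, which is what makes this a clean probe of embedding quality.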

Comparison with supervised commercial embeddings (TaDSE is unsupervised, only 110M parameters):

| Model | SNIPS | ATIS | Avg. |
| --- | --- | --- | --- |
| OpenAI-large | 98.57 | 84.77 | 91.67 |
| Gemini-001 | 98.29 | 86.00 | 92.15 |
| TaDSE | 97.00 | 89.70 | 93.35 |

Ablation Study

| Configuration | SNIPS | ATIS | MASSIVE | CLINC150 |
| --- | --- | --- | --- | --- |
| w/o augmentation (SimCSE) | 91.71 | 85.67 | 77.00 | 71.05 |
| + augmentation | 93.29 | 86.00 | 77.37 | 70.98 |
| + \(L^t\) | 95.29 | 88.47 | 78.58 | 71.53 |
| + \(L^t\) + \(L^{pair}\) | 96.14 | 89.59 | 79.39 | 72.98 |
| + \(L^{t'}\) (MLP) + \(L^{pair}\) | 97.00 | 88.69 | 79.83 | 73.45 |

Key Findings

  • The pairing loss \(L^{pair}\) contributes the most; on SNIPS, introducing it alone improves performance from 93.29 to 96.14 (+2.85 points), validating the effectiveness of utterance-template paired learning.
  • The template loss \(L^t\) alone also yields substantial gains (+2.0 to +2.5 points), indicating that the salient semantic information encoded in templates makes an independent contribution to embedding learning.
  • Augmentation stability varies by dataset: SNIPS and ATIS improve consistently as augmentation increases (augmentation-stable), while MASSIVE and CLINC150 may degrade under high-order augmentation.
  • Semantic compression consistently yields positive gains on augmentation-stable datasets (SNIPS +0.29, ATIS +0.44 points), confirming the quality of template-utterance semantic alignment.

Highlights & Insights

  • The use of templates as semantic anchors is a particularly elegant design—it converts token-level annotations already present in dialogue data into auxiliary supervision signals for sentence-level contrastive learning, effectively injecting "free" supervision. This idea is generalizable to any domain with structured templates or schemas.
  • Semantic compression testing serves not only as an inference-time enhancement but also as an analytical tool: the optimal value of \(\lambda^{comp}\) reflects the quality of template-utterance semantic alignment in the embedding space, providing an interpretable window into the representation geometry.
  • An unsupervised 110M-parameter model surpasses the supervised commercial embeddings from OpenAI and Google in average accuracy, demonstrating the substantial potential of domain-specialized approaches.

Limitations & Future Work

  • The method depends on template and slot annotations; for dialogue datasets without annotated templates, an additional automatic template extraction step is required (the NER-based approach used for CLINC150 in this work shows limited effectiveness).
  • Evaluation is conducted solely on intent classification; the effectiveness on other downstream tasks (e.g., dialogue state tracking, response selection) remains unvalidated.
  • Semantic compression yields uncertain benefits on augmentation-unstable datasets, indicating some sensitivity to data quality.
  • Future work could consider combining LLMs to automatically generate high-quality templates, removing the dependence on manual annotation.

Comparison with Related Methods

  • vs SimCSE: SimCSE relies solely on dropout noise for positive pairs; TaDSE additionally introduces templates as semantic anchors for paired contrastive learning, enabling better exploitation of structural information in the dialogue domain.
  • vs DSE: DSE uses consecutive utterances as positive pairs for contrastive learning, but remains at the utterance-utterance level; TaDSE introduces cross-granularity utterance-template pairings, providing more precise semantic association signals.

Rating

  • Novelty: ⭐⭐⭐⭐ The design of templates as contrastive learning anchors and the semantic compression test are both innovative, though the base framework remains grounded in SimCSE.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across five datasets with thorough ablations and comparisons against commercial models, though validation on additional downstream tasks is lacking.
  • Writing Quality: ⭐⭐⭐⭐ The paper is clearly structured with complete methodological derivations and rich figures and tables.
  • Value: ⭐⭐⭐⭐ Provides an effective paradigm for leveraging template information in dialogue embedding learning, generalizable to other domains with structured annotations.