Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis

Conference: ACL 2025
arXiv: 2505.12597
Code: https://github.com/AI-S2-Lab/Chain-Talker
Area: Speech
Keywords: Conversational Speech Synthesis, Empathetic Speech, Chain Modeling, Emotion Captioning, Semantic Encoding

TL;DR

This paper proposes Chain-Talker, which achieves interpretable empathetic conversational speech synthesis through three-stage chain modeling (emotion understanding \(\rightarrow\) semantic understanding \(\rightarrow\) empathetic rendering), and develops CSS-EmCap, an automatic annotation pipeline that generates emotional captions for conversational speech.

Background & Motivation

Background: Conversational Speech Synthesis (CSS) aims to generate speech whose emotion and style fit the context of a user-agent conversation. In recent years, GPT-based CSS models (such as GPT-Talker) have significantly improved naturalness and expressiveness by directly predicting speech token sequences.

Limitations of Prior Work: Current generative CSS models suffer from two interpretability issues: (1) speech generation lacks a deep understanding of conversational emotion, making it difficult to achieve genuine empathy; (2) general discrete speech codes contain excessive redundant information, mixing semantic and acoustic information, which limits expressiveness.

Key Challenge: The end-to-end approach of directly predicting speech tokens from conversational context lacks interpretability: the model neither explicitly understands emotional shifts nor efficiently disentangles semantic from acoustic information.

Goal: To enable CSS systems to progressively understand dialogue emotions, extract semantic information, and eventually generate empathetic speech much like humans, while also automatically generating high-quality emotional caption annotations for conversational speech.

Key Insight: Drawing inspiration from the human Chain-of-Thought process, CSS is decomposed into three stages: first understanding emotion, then understanding semantics, and finally rendering empathetically. Concurrently, an automatic context-aware emotional captioning pipeline is constructed using LLMs.

Core Idea: Mimic the three-stage chain of human cognition (first perceiving emotion, then encoding semantics, and finally rendering speech) to achieve interpretable empathetic conversational speech synthesis.

Method

Overall Architecture

Chain-Talker consists of two main components: EmGPT and Synthesizer. EmGPT is responsible for emotion and semantic understanding based on an autoregressive GPT architecture, while the Synthesizer handles empathetic speech rendering based on an OT-CFM (Optimal Transport Conditional Flow Matching) model.

The input sequence is defined as \(\mathcal{Q} = (\langle BOS \rangle, \mathcal{H}, \mathcal{C}, \langle EOS \rangle)\), where \(\mathcal{H}\) denotes the dialogue history and \(\mathcal{C}\) represents the current utterance to be synthesized.

Key Designs

Module 1: Unified Context Tokenization

Function: Encode multi-modal dialogue information into a unified token sequence.
Mechanism: Concatenate user and agent utterances alternately, with each utterance represented by its speaker information, audio, textual content, and emotional caption. Text is encoded into \(T_n^t\) using BPE, emotional captions are encoded into \(T_n^d\), speaker embeddings \(T_n^p\) are extracted using a pre-trained speaker verification model, and speech is encoded into \(T_n^a\) by a supervised ASR model with vector quantization (VQ).
Design Motivation: To let the model understand the context before predicting emotion and then generating the corresponding speech, by unifying multi-modal information within a single sequence.
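As a rough illustration of the unified sequence \(\mathcal{Q} = (\langle BOS \rangle, \mathcal{H}, \mathcal{C}, \langle EOS \rangle)\), the sketch below flattens a toy dialogue into one token list. The `Utterance` class, the string tokens, and the within-utterance order used here (speaker, text, caption, audio) are assumptions made for illustration. That order was chosen to match the conditioning in the factorizations of Modules 2 and 3 and is not confirmed by the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    speaker: List[str]  # stand-in for the speaker embedding T_n^p
    text: List[str]     # BPE text tokens T_n^t
    caption: List[str]  # emotional caption tokens T_n^d
    audio: List[str]    # semantic speech tokens T_n^a


def flatten(utt: Utterance) -> List[str]:
    # Assumed within-utterance order: speaker, text, caption, audio.
    return utt.speaker + utt.text + utt.caption + utt.audio


def build_sequence(history: List[Utterance], current: Utterance) -> List[str]:
    """Build Q = (<BOS>, H, C, <EOS>) as one flat token list."""
    seq = ["<BOS>"]
    for utt in history:          # dialogue history H, turn by turn
        seq.extend(flatten(utt))
    seq.extend(flatten(current))  # current utterance C
    seq.append("<EOS>")
    return seq


user = Utterance(["<spk:user>"], ["how", "are", "you"], ["calm", "tone"], ["a3", "a9"])
agent = Utterance(["<spk:agent>"], ["great", "thanks"], ["cheerful"], ["a1", "a4", "a4"])
sequence = build_sequence([user], agent)
print(sequence)
```

At inference time only the speaker and text of the current utterance would be given, and the caption and audio tokens would be generated.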

Module 2: Emotion Understanding

Function: Predict the emotional caption tokens of the current utterance based on the dialogue context.
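Mechanism: Use EmGPT to autoregressively predict the emotional caption \(T_N^d\): \(p(T_{N,:}^d | \Re_{1 \to N-1}, T_{N,:}^p, T_{N,:}^t; \Theta) = \prod_{j=0}^{D} p(T_{N,j}^d | T_{N,<j}^d, \Re_{1 \to N-1}, T_{N,:}^p, T_{N,:}^t; \Theta)\). Here \(\Re_{1 \to N-1}\) denotes the tokenized dialogue history (utterances 1 through \(N-1\)), \(j\) indexes the caption tokens, and \(\Theta\) denotes the EmGPT parameters.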
Design Motivation: To explicitly comprehend emotional dynamics within dialogues, providing emotional guidance for subsequent speech generation.
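The autoregressive factorization above is an application of the chain rule: the probability of the whole caption is the product of the per-token conditional probabilities, which is usually computed as a sum of logs. A small numeric sketch, with made-up per-token probabilities standing in for EmGPT's outputs:

```python
import math
from typing import List


def caption_log_likelihood(step_probs: List[float]) -> float:
    """Sum of log p(token_j | previous tokens, context) over caption tokens."""
    return sum(math.log(p) for p in step_probs)


# Hypothetical conditional probabilities for a three-token caption.
step_probs = [0.5, 0.25, 0.8]
log_lik = caption_log_likelihood(step_probs)
joint_prob = math.exp(log_lik)           # equals 0.5 * 0.25 * 0.8
avg_nll = -log_lik / len(step_probs)     # per-token negative log-likelihood
print(joint_prob, avg_nll)
```

The averaged negative log-likelihood is the quantity a token-level cross-entropy loss minimizes.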

Module 3: Semantic Understanding

Function: Predict purely semantic speech encodings on top of the understood emotions.
Mechanism: EmGPT conditions on the predicted emotional caption \(T_N^d\) together with the context information to predict the semantic encodings \(T_N^a\): \(p(T_{N,:}^a | \Re_{1 \to N-1}, T_N^p, T_N^t, T_N^d; \Theta)\)
Design Motivation: To employ purely semantic encodings produced by a supervised ASR model, avoiding acoustic redundancy inherent in general discrete codes.
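Taken together, Modules 2 and 3 amount to two decoding passes with the same autoregressive model: caption tokens first, then semantic tokens conditioned on that caption. The sketch below shows only this control flow. The `next_token` callable, the `<EOC>` and `<CTX_END>` delimiters, and the scripted toy model are hypothetical stand-ins, not part of Chain-Talker's actual interface.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[str]], str]


def decode_until(prefix: List[str], next_token: NextToken,
                 stop: str, max_len: int = 50) -> List[str]:
    """Greedy autoregressive decoding until `stop` is emitted."""
    out: List[str] = []
    while len(out) < max_len:
        tok = next_token(prefix + out)
        if tok == stop:
            break
        out.append(tok)
    return out


def chain_decode(context: List[str], next_token: NextToken) -> Tuple[List[str], List[str]]:
    # Stage 1: emotion understanding (caption tokens).
    caption = decode_until(context, next_token, stop="<EOC>")
    # Stage 2: semantic understanding, conditioned on the predicted caption.
    semantic = decode_until(context + caption + ["<EOC>"], next_token, stop="<EOS>")
    return caption, semantic


def toy_next_token(seq: List[str]) -> str:
    # Scripted toy model: emits a fixed continuation after the context marker.
    script = ["warm", "gentle", "<EOC>", "s12", "s7", "s7", "<EOS>"]
    n_generated = len(seq) - seq.index("<CTX_END>") - 1
    return script[n_generated]


context = ["<BOS>", "history-and-current-text", "<CTX_END>"]
caption, semantic = chain_decode(context, toy_next_token)
print(caption, semantic)
```

Because the caption is an explicit intermediate output, it can be inspected or replaced before the second pass, which is where the interpretability claim comes from.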

Module 4: Empathetic Rendering

Function: Synthesize the final empathetic speech leveraging the emotional captions and semantic encodings.
Mechanism: Adopt OT-CFM as the backbone to predict Mel-spectrograms and utilize HiFi-GAN to synthesize waveforms. The OT-CFM leverages emotional captions \(U_N^d\), speaker information \(U_{agent}^p\), semantic encodings \(T_N^a\), and masked Mel-spectrograms \(U_{agent}^m\) simultaneously to predict vector fields: \(\frac{d\phi_t(X)}{dt} = \nu_t(\phi_t(X), t | U_{agent}^p, U_N^d, T_N^a, U_{agent}^m)\)
Design Motivation: To utilize emotional captions to guide emotion and style rendering during the decoding stage, rather than directly decoding from raw speech tokens.
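At inference, a flow-matching synthesizer obtains the Mel-spectrogram by numerically integrating the ODE above from noise at \(t=0\) to \(t=1\). The following is a minimal sketch using fixed-step Euler integration. The solver choice, the step count, the `cond` dictionary, and the closed-form `toy_field` (which replaces the learned network \(\nu_t\)) are all assumptions for illustration.

```python
from typing import Any, Callable, Dict, List

Vector = List[float]
VectorField = Callable[[Vector, float, Dict[str, Any]], Vector]


def euler_sample(x0: Vector, field: VectorField,
                 cond: Dict[str, Any], steps: int = 10) -> Vector:
    """Integrate dx/dt = field(x, t | cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = field(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_field(x: Vector, t: float, cond: Dict[str, Any]) -> Vector:
    # Toy stand-in for the learned network: points straight at a known target.
    target = cond["target_mel"]
    return [(m - xi) / (1.0 - t) for m, xi in zip(target, x)]


cond = {
    "speaker": "U_agent^p",   # placeholder for the speaker embedding
    "caption": "U_N^d",       # placeholder for the caption embedding
    "semantic": "T_N^a",      # placeholder for the semantic encodings
    "target_mel": [0.2, -1.0, 0.7],  # used only by the toy field
}
mel = euler_sample([0.0, 0.0, 0.0], toy_field, cond, steps=10)
print(mel)
```

In a real system the resulting Mel-spectrogram would then be passed to the vocoder (HiFi-GAN in the paper) to produce the waveform.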

Loss & Training

Training Loss:
- EmGPT is trained with two losses: \(\mathcal{L}_{caption}\) (cross-entropy loss on the emotional caption tokens) and \(\mathcal{L}_{speech}\) (cross-entropy loss on the semantic encodings).
- The Synthesizer employs the OT-CFM loss: \(\mathcal{L}_{OT\text{-}CFM} = \mathbb{E}_{t,X_0,X_1}[\|\omega_t(\phi_t^{OT}(X_0,X_1)|X_1) - \nu_t(\phi_t^{OT}(X_0,X_1)|\theta)\|]\)
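To make the OT-CFM objective concrete, the sketch below evaluates it for a single \((X_0, X_1, t)\) sample. It assumes the standard optimal-transport path \(\phi_t^{OT}(X_0,X_1) = (1-(1-\sigma_{min})t)X_0 + tX_1\) with target field \(\omega_t = X_1 - (1-\sigma_{min})X_0\), and uses a mean squared error as the distance. Neither detail is spelled out in this summary, and `predict` is a hypothetical stand-in for the network \(\nu_t(\cdot|\theta)\).

```python
import random
from typing import Callable, List

Vector = List[float]


def ot_cfm_loss(x0: Vector, x1: Vector, t: float,
                predict: Callable[[Vector, float], Vector],
                sigma_min: float = 1e-6) -> float:
    """Single-sample OT-CFM loss (mean squared error, assumed)."""
    scale = 1.0 - sigma_min
    # Point on the OT path between noise x0 and data x1 at time t.
    x_t = [(1.0 - scale * t) * a + t * b for a, b in zip(x0, x1)]
    # Target vector field along that path.
    target = [b - scale * a for a, b in zip(x0, x1)]
    pred = predict(x_t, t)
    return sum((p - w) ** 2 for p, w in zip(pred, target)) / len(target)


random.seed(0)
x1 = [0.2, -1.0, 0.7]                       # toy "Mel" frame
x0 = [random.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
t = random.random()                         # t ~ U[0, 1)

# An untrained model that always predicts zero velocity.
loss_zero_model = ot_cfm_loss(x0, x1, t, lambda x_t, t: [0.0] * len(x_t))
print(loss_zero_model)
```

Averaging this quantity over many sampled \(t\), \(X_0\), and \(X_1\) gives a Monte Carlo estimate of the expectation in the loss.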

Multi-Stage Training:
- First Stage: Train on large-scale single-sentence TTS data (based on CosyVoice-300M-25Hz, with approximately 170,000 hours of speech data).
- Second Stage: Fine-tune on dialogue data, learning to infer emotional captions from the dialogue context and to predict semantic encodings.
- The Synthesizer can be trained independently in single-sentence mode.

CSS-EmCap Pipeline:
- Multi-level attribute extraction: sentence-level style factors (gender, pitch, energy, speaking rate) plus dialogue-level emotion classification.
- Two-step generation: generate base captions from the context and attributes, then expand and enrich them via synonym substitution and variation in emotional intensity.
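The first half of this pipeline (attributes in, captioning prompt out) could look roughly like the sketch below. Everything concrete in it is hypothetical: the thresholds, the three-level bucketing, the function names, and the prompt wording are illustrative choices rather than details taken from CSS-EmCap, and no LLM is actually called.

```python
from typing import Dict


def bucket(value: float, low: float, high: float) -> str:
    """Map a continuous style factor to a coarse level (hypothetical thresholds)."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "medium"


def build_caption_prompt(attrs: Dict[str, str], emotion: str, context: str) -> str:
    """Assemble the text that would be sent to an LLM for a base caption."""
    attr_text = ", ".join(f"{name}: {level}" for name, level in attrs.items())
    return (
        "Dialogue context:\n"
        f"{context}\n"
        f"Speaker attributes: {attr_text}\n"
        f"Emotion label: {emotion}\n"
        "Describe in one sentence how the last utterance is spoken."
    )


attrs = {
    "gender": "female",
    "pitch": bucket(245.0, low=120.0, high=220.0),     # Hz, illustrative
    "energy": bucket(0.4, low=0.3, high=0.7),          # normalized, illustrative
    "speaking rate": bucket(5.8, low=3.0, high=5.0),   # syllables/s, illustrative
}
prompt = build_caption_prompt(attrs, "happy", "A: I passed the exam!\nB: That's wonderful!")
print(prompt)
```

The second step described above (synonym substitution and emotional-intensity variation) would operate on the caption the LLM returns for this prompt.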

Key Experimental Results

Main Results

Evaluated on three datasets: NCSSD, DailyTalk, and MultiDialog (totaling approximately 384 hours).

Method	DMOS-N ↑	DMOS-E ↑	ACCm ↑	DDTW ↓	SSIM ↑
Ground Truth	4.467	4.571	-	-	-
CCATTS	3.423	3.469	0.462	67.851	0.765
GPT-Talker	3.962	3.913	0.562	44.625	0.814
GPT-Talker_c	4.045	4.102	0.589	40.374	0.829
Chain-Talker	4.147	4.239	0.612	38.784	0.862

CSS-EmCap annotation quality:

Method	DMOS-C ↑	SIM_R ↑	SIM_G ↑	DIS-1 ↑	DIS-2 ↑
Qwen2-Audio	4.212	0.431	0.534	0.086	0.174
SECap	4.268	0.475	0.617	0.081	0.186
CSS-EmCap	4.462	0.568	0.694	0.106	0.296
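DIS-1 and DIS-2 are not defined in this summary. If they follow the usual Distinct-n diversity metric (an assumption), each is the ratio of unique n-grams to total n-grams over all generated captions, which can be sketched as:

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(captions: List[str], n: int) -> float:
    """Unique n-grams divided by total n-grams across all captions."""
    all_ngrams: List[Tuple[str, ...]] = []
    for caption in captions:
        all_ngrams.extend(ngrams(caption.split(), n))
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)


toy_captions = ["a b a", "a b c"]
dis1 = distinct_n(toy_captions, 1)  # 3 unique unigrams out of 6
dis2 = distinct_n(toy_captions, 2)  # 3 unique bigrams out of 4
print(dis1, dis2)
```

Under this reading, a higher value means the captions repeat themselves less.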

Ablation Study

Configuration	DMOS-N ↑	DMOS-E ↑	ACCm ↑	DDTW ↓	SSIM ↑
Chain-Talker	4.147	4.239	0.612	38.784	0.862
w/o context	3.982	3.984	0.564	43.589	0.847
w/o captions	4.037	4.084	0.571	43.479	0.836
w/o \(\mathcal{L}^{caption}\)	3.947	3.956	0.568	45.764	0.829
w/o First-Stage	3.756	3.789	0.517	52.640	0.793

Key Findings

Chain-Talker outperforms the strongest baseline (GPT-Talker_c) by 0.102 in naturalness MOS (DMOS-N, 4.147 vs. 4.045) and by 0.137 in expressiveness MOS (DMOS-E, 4.239 vs. 4.102).
Removing dialogue history (w/o context) lowers DMOS-E by 0.255 (4.239 to 3.984), confirming the importance of context modeling.
Removing the emotional caption loss (w/o \(\mathcal{L}^{caption}\)) lowers DMOS-N by 0.200 and DMOS-E by 0.283.
Removing first-stage pre-training causes the largest degradation across all metrics (for example, DMOS-N falls to 3.756 and DDTW rises to 52.640), indicating that large-scale pre-training is necessary.
Performance is best with a dialogue history length of \(N\)=3 and peaks at around 200 training epochs.
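The differences quoted in these findings can be recomputed directly from the rows of the main-results and ablation tables. A short sketch with the relevant values copied from those tables:

```python
from typing import Dict

# DMOS values copied from the main-results and ablation tables.
RESULTS: Dict[str, Dict[str, float]] = {
    "Chain-Talker": {"DMOS-N": 4.147, "DMOS-E": 4.239},
    "GPT-Talker_c": {"DMOS-N": 4.045, "DMOS-E": 4.102},
    "w/o context": {"DMOS-N": 3.982, "DMOS-E": 3.984},
    "w/o caption loss": {"DMOS-N": 3.947, "DMOS-E": 3.956},
    "w/o First-Stage": {"DMOS-N": 3.756, "DMOS-E": 3.789},
}


def gap(system: str, metric: str) -> float:
    """How far Chain-Talker is ahead of `system` on `metric`."""
    return round(RESULTS["Chain-Talker"][metric] - RESULTS[system][metric], 3)


for system in RESULTS:
    if system != "Chain-Talker":
        print(f"{system}: DMOS-N gap {gap(system, 'DMOS-N')}, "
              f"DMOS-E gap {gap(system, 'DMOS-E')}")
```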

Highlights & Insights

The introduction of the chain modeling paradigm is quite elegant, decomposing the complex CSS task into an interpretable three-step cognitive chain that aligns with the human intuition of "perceiving emotion first, then understanding content, and finally expressing".
The design of the CSS-EmCap pipeline is highly instructive: first extracting structured attributes, then prompting an LLM to generate descriptive natural language captions, and finally validating them to close the loop.
Employing supervised semantic encodings as a replacement for general discrete codes is a critical design choice—eliminating acoustic redundancies makes semantic understanding much cleaner.
Emotional captions (natural language) outperform conventional emotion labels (DMOS-C 4.462 vs. GT's 4.327), demonstrating that natural language is more expressive in controlling conversational styles.

Limitations & Future Work

Inference Latency: The average response latency of 2.5 seconds does not meet real-time interaction requirements; streaming inference is an important future direction.
Data Scale: The dialogue dataset is restricted to 384 hours and predominantly features young speakers, failing to adequately cover styles for children and elderly individuals.
Security Risks: Zero-shot voice cloning capabilities could potentially be abused for voice spoofing.
The generation of emotional captions relies on the Gemini API, increasing dependence on external LLMs.

Related Work

GPT-Talker pioneered the GPT-based CSS paradigm, but lacked explicit emotional understanding.
CosyVoice provides the foundational architecture for supervised semantic tokens and the OT-CFM synthesizer.
The successful application of Chain-of-Thought (CoT) in conversational tasks inspired the chain modeling paradigm in CSS.
The audio comprehension capability of Qwen2-Audio/SECap offers a comparative benchmark for emotional caption generation.

Rating

Novelty: ⭐⭐⭐⭐ (First implementation of chain modeling in CSS with clear logic)
Experimental Thoroughness: ⭐⭐⭐⭐ (Three datasets + comprehensive ablation studies + visualization analysis)
Writing Quality: ⭐⭐⭐⭐ (Clear structure and rich illustrations/tables)
Value: ⭐⭐⭐⭐ (Both the CSS-EmCap pipeline and the chain-based design hold solid reference value for the community)