Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations¶
Conference: ACL 2026
arXiv: 2603.14432
Code: https://github.com/choddeok/Affectron
Area: Audio & Speech / Speech Synthesis
Keywords: Nonverbal Vocalizations, Emotional Speech Synthesis, NV-Augmented Training, Affective Routing, Neural Codec Language Model
TL;DR¶
This paper proposes Affectron, a framework that achieves diverse and emotionally aligned nonverbal vocalization (NV) synthesis—such as laughter and sighs—on small-scale open-source disentangled corpora, via two training-time augmentation strategies: emotion-driven Top-K NV matching and emotion-aware Top-K routing. The proposed method substantially outperforms the purely language-pretrained VoiceCraft baseline.
Background & Motivation¶
Background: Nonverbal vocalizations (NVs), such as laughter, sighs, and crying, are critical means of conveying emotion in expressive speech synthesis. Existing expressive TTS systems primarily adopt two paradigms: label-controlled TTS (manually inserting NV labels to control type and position) and spontaneous-style TTS (implicitly predicting NVs from contextual cues).
Limitations of Prior Work: Label-controlled approaches rely on alignment annotations or NV detection models, whose biases and error propagation lead to temporal inconsistencies in NV placement. Spontaneous-style approaches are constrained by the irreproducibility of proprietary datasets. Publicly available NV corpora are overwhelmingly biased toward basic types (e.g., breath and laughter) and suffer from acoustic artifacts, making fine-grained NV variant modeling (e.g., distinguishing chuckles, giggles, and snickers) infeasible.
Key Challenge: The fundamental bottleneck is the absence of large-scale, diverse, high-quality public NV corpora. While neural codec language models (NCLMs) can generate natural speech from low-quality data, they are primarily designed for voice cloning and lack fine-grained prosodic control over NV variants.
Goal: To achieve emotionally consistent and contextually aligned diverse NV generation on small-scale open-source disentangled corpora, where linguistic speech and NVs are recorded separately.
Key Insight: The authors observe that affective attributes typically transition gradually rather than abruptly between adjacent utterance segments—segments with shorter temporal gaps exhibit smaller emotional angular distances. Consequently, positions with minimal emotional change serve as natural anchor points for NV insertion.
Core Idea: Design training-time NV augmentation strategies that select appropriate NV types via affective embedding matching and determine appropriate insertion positions via emotional angular distance routing, then fine-tune a pretrained VoiceCraft model on the augmented samples.
Method¶
Overall Architecture¶
Affectron adopts VoiceCraft (330M parameters), pretrained on linguistic speech only, as its backbone. During training, NV augmentation constructs training samples containing NVs, and the backbone is fine-tuned to acquire NV generation capability. At inference, the model directly generates output from NV-annotated text and an affective reference utterance, without requiring the matching and routing procedures.
Key Designs¶
- Emotion-Driven Top-K NV Matching (EDNM):
- Function: Selects emotionally consistent and diverse NV candidates for each linguistic utterance.
- Mechanism: Given a linguistic utterance \(u\) and speaker \(s\), all NV candidates from that speaker are retrieved. Emotion2Vec extracts affective embeddings for the utterance and each NV candidate, and their cosine similarities are computed. The Top-K candidates are kept and normalized into a probability distribution via temperature-scaled softmax, from which at most 2 NVs are sampled. The temperature is set to \(\tau=0.7\) and Top-K to 10.
- Design Motivation: Random NV pairing increases diversity but lacks emotional coherence. Affective-embedding-based matching ensures the selected NVs are aligned with the emotional state of the utterance, while probabilistic sampling rather than deterministic selection preserves diversity.
- Emotion-Aware Top-K Routing (EAR):
- Function: Determines the optimal insertion position for NVs within an utterance.
- Mechanism: Word-level segments are extracted using the Montreal Forced Aligner, and an affective attribute predictor generates affective pseudo-labels for each segment. Affective attributes are converted to spherical coordinates to compute angular distances. For each NV candidate, the emotional distance \(\Delta\) (based on arc-cosine distance on the sphere) to all potential insertion positions is computed, and the Top-K positions with the smallest distances are selected. The final insertion position is sampled from the softmax distribution over negative distances.
- Design Motivation: NVs should be inserted at positions where affective attributes change least (i.e., emotional stable points), which preserves emotional coherence while enhancing expressiveness. Spherical coordinates capture the directional variation of affective attributes more faithfully than Euclidean distance.
- NV Structure Masking (NSM):
- Function: Enables the model to generate NVs conditioned on the affective context of surrounding linguistic speech.
- Mechanism: The causal masking strategy of VoiceCraft is extended—NV codec token sequences are reordered according to routing-determined positions. A random NV segment along with surrounding linguistic tokens is selected to form a masking span, the masked content is moved to the end of the sequence, and delayed stacking is applied for efficient multi-codebook autoregressive modeling.
- Design Motivation: Through masking and reordering, the model can leverage both preceding and following affective context (bidirectional conditioning) when generating NVs, rather than relying solely on historical information. This is critical for NV naturalness and affective expression.
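To make EDNM concrete, the matching step can be sketched as below. This is a minimal sketch, not the authors' code: the Emotion2Vec embeddings are assumed to be precomputed plain vectors, and the function name `match_nvs` is hypothetical.

```python
import numpy as np

def match_nvs(utt_emb, nv_embs, k=10, tau=0.7, max_nvs=2, rng=None):
    """Emotion-driven Top-K NV matching (sketch).

    utt_emb: (D,) affective embedding of the linguistic utterance
             (assumed precomputed, e.g. by Emotion2Vec).
    nv_embs: (N, D) affective embeddings of the speaker's NV candidates.
    Returns indices of up to `max_nvs` sampled NV candidates.
    """
    rng = rng or np.random.default_rng(0)
    # Cosine similarity between the utterance and each NV candidate.
    u = utt_emb / np.linalg.norm(utt_emb)
    v = nv_embs / np.linalg.norm(nv_embs, axis=1, keepdims=True)
    sims = v @ u
    # Keep the Top-K most emotionally similar candidates.
    k = min(k, len(sims))
    topk = np.argsort(sims)[::-1][:k]
    # Temperature-scaled softmax over the Top-K similarities.
    logits = sims[topk] / tau
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Probabilistic sampling (rather than argmax) preserves diversity.
    n = min(max_nvs, k)
    return rng.choice(topk, size=n, replace=False, p=probs)
```

Sampling from the softmax rather than taking the single best match is what keeps the augmented data diverse while still emotionally coherent.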
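EAR's position selection admits a similarly compact sketch, assuming the per-segment affective attributes are given as vectors; `route_nv` and the unit-sphere projection details are illustrative assumptions, but the arc-cosine distance and softmax-over-negative-distances sampling follow the description above.

```python
import numpy as np

def route_nv(nv_attr, seg_attrs, k=10, rng=None):
    """Emotion-aware Top-K routing (sketch).

    nv_attr:   (D,) affective attribute vector of the NV candidate.
    seg_attrs: (P, D) affective attributes at the P candidate insertion
               positions (word boundaries from forced alignment).
    Returns the sampled insertion position index.
    """
    rng = rng or np.random.default_rng(0)
    # Project attributes onto the unit sphere and measure angular
    # (arc-cosine) distance rather than Euclidean distance.
    a = nv_attr / np.linalg.norm(nv_attr)
    b = seg_attrs / np.linalg.norm(seg_attrs, axis=1, keepdims=True)
    delta = np.arccos(np.clip(b @ a, -1.0, 1.0))  # (P,) angular distances
    # Top-K positions with the smallest emotional distance.
    k = min(k, len(delta))
    topk = np.argsort(delta)[:k]
    # Softmax over negative distances: emotionally closer positions
    # are more likely to be chosen as the insertion point.
    logits = -delta[topk]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(topk, p=probs))
```

With `k=1` this reduces to deterministically picking the emotionally most stable position; larger `k` trades some coherence for placement diversity.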
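The core rearrangement behind NSM can be sketched as follows, simplified to a single token stream (the paper's delayed stacking over multiple codec codebooks is omitted); `nsm_reorder` and the placeholder `mask_id` are hypothetical names.

```python
def nsm_reorder(tokens, span_start, span_end, mask_id=-1):
    """NV structure masking (sketch): VoiceCraft-style span rearrangement.

    The masked span (an NV segment plus surrounding linguistic tokens) is
    replaced by a mask placeholder in place and its content moved to the
    end of the sequence, so a causal model conditions on both the left
    and right context before generating the span.
    """
    prefix = tokens[:span_start]
    span = tokens[span_start:span_end]
    suffix = tokens[span_end:]
    # Full context (with a placeholder where the span was), then the span:
    return prefix + [mask_id] + suffix + [mask_id] + span
```

For example, masking positions 2–3 of `[1, 2, 3, 4, 5, 6]` yields `[1, 2, -1, 5, 6, -1, 3, 4]`: the model sees tokens on both sides of the gap before it must produce the NV span.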
Loss & Training¶
The AdamW optimizer is used with a learning rate of \(1\times10^{-5}\), batch size of 100 (via gradient accumulation), and training for 50,000 steps on 4 NVIDIA RTX A6000 GPUs for approximately 5 days. Training data is sourced from the EARS dataset (approximately 100 hours of clean speech and 4 hours of NVs, from 107 speakers).
Key Experimental Results¶
Main Results (Seen Speakers)¶
| Method | NV-Acc↑ | NV-Sim↑ | NV-EECS↑ | NV-SECS↑ | WER↓ | V-EECS↑ |
|---|---|---|---|---|---|---|
| VoiceCraft (baseline) | 10.49 | 0.5898 | 0.6149 | 0.8950 | 9.05 | 0.6212 |
| Affectron (full) | 37.75 | 0.6118 | 0.5748 | 0.8906 | 6.59 | 0.6216 |
Ablation Study¶
| Configuration | NV-Acc↑ | NV-EECS↑ | Notes |
|---|---|---|---|
| w/ DA only | 58.78 | 0.5455 | Data augmentation only; high Acc but poor emotional alignment |
| w/ DA + EDNM | 35.83 | 0.5648 | EECS improves with affective matching |
| w/ DA + EDNM + EAR | 32.93 | 0.5707 | EECS further improves with routing |
| Full (+ NSM) | 37.75 | 0.5748 | Complete model; best NV quality |
NV Type and Position Prediction vs. LLM¶
| Method | Type JSD↓ | Type Acc@1↑ | Location JSD↓ |
|---|---|---|---|
| GPT-oss-20B | 0.1130 | 16.98 | 0.1278 |
| Affectron-330M | 0.0051 | 75.77 | 0.0523 |
Key Findings¶
- Affectron's NV type distribution alignment far exceeds all LLM baselines (JSD of only 0.0051 vs. the best LLM's 0.1130).
- Removing EDNM paradoxically increases NV-Acc (random matching enhances diversity) but causes a significant drop in EECS, confirming the importance of emotional alignment.
- NSM leverages bidirectional affective context and is better suited for NV generation than standard causal masking.
- Consistent trends on unseen speakers validate zero-shot generalization capability.
Highlights & Insights¶
- Training-time augmentation, zero inference overhead: The matching and routing modules are used only during training; at inference the model directly generates from annotated text without additional cost. This train-time augmentation → inference-time simplification paradigm is worth borrowing.
- Spherical coordinate modeling of affective dynamics: Mapping multi-dimensional affective attributes onto a sphere and measuring change via angular distance captures directional variation more faithfully than Euclidean distance, and is transferable to other affective computing tasks.
- 330M specialized model outperforms 7B–20B LLMs: On NV type prediction, the domain-specific small model substantially outperforms general-purpose large models, demonstrating that domain-specific explicit affective modeling is more effective than pure textual reasoning.
Limitations & Future Work¶
- Validation is limited to the EARS dataset (approximately 100 hours), restricting scale.
- Linguistic speech and NVs are recorded separately, precluding modeling of their overlapping occurrences in natural settings.
- NV types cover only 15 categories, leaving richer nonverbal expressions unaddressed.
- No direct comparison with large-scale NV-capable TTS systems such as CosyVoice.
Related Work & Insights¶
- vs. VoiceCraft: The original model supports only voice cloning with minimal NV capability. Affectron endows it with NV generation through augmented training.
- vs. label-controlled TTS (ELaTE, EmoCtrl-TTS): These methods rely on NV detection models for data annotation and suffer from severe error propagation; Affectron's affective routing instead derives insertion decisions from affective attribute computation, without a separate NV detector.
- vs. CosyVoice: CosyVoice requires large-scale, high-quality annotated corpora, whereas Affectron operates effectively on small-scale open-source disentangled data.
Rating¶
- Novelty: ⭐⭐⭐⭐ Emotion-driven NV matching and routing constitute novel augmentation strategies.
- Experimental Thoroughness: ⭐⭐⭐⭐ Ablation study is meticulous; LLM comparisons are convincing.
- Writing Quality: ⭐⭐⭐⭐ Logic from background to method to experiments is clear and well-structured.
- Value: ⭐⭐⭐ The application domain is relatively specialized, but the augmentation strategy is broadly transferable.