ParaMETA: Towards Learning Disentangled Paralinguistic Speaking Styles Representations

Conference: AAAI 2026 arXiv: 2601.12289 Code: GitHub Area: Others Keywords: Speaking Style Representation, Disentangled Embedding, Contrastive Learning, Prototype Learning, text-to-speech

TL;DR

This paper proposes ParaMETA, a unified framework for learning paralinguistic speaking style representations. Through META space regularization and task-specific subspace projection, it learns disentangled representations of speaking styles (emotion, age, gender, and language) while simultaneously supporting downstream multi-task classification and style-controllable speech synthesis.

Background & Motivation

Core Problem

Understanding and modeling speaking styles (emotion, age, gender, language, etc.) from speech is critical for numerous applications:

  • Recognition tasks: Affective computing and human-computer interaction require recognition of speaker emotion, age, and gender.
  • Generation tasks: TTS systems require precise control over speaking styles to produce diverse and expressive speech.

The key challenge is: how to learn a set of disentangled, task-specific speaking style embeddings such that different style types do not interfere with one another?

Three Major Limitations of Prior Work

1. Inefficiency of single-task models: Training separate models for each style recognition task (emotion, age, gender, etc.) is computationally expensive and difficult to scale. Multi-task models are more efficient but frequently suffer from negative transfer due to inter-task interference.

2. Style entanglement in CLAP: CLAP (Contrastive Language-Audio Pretraining) is a prevailing speech representation method that aligns speech and text into a unified embedding space. However, this unified embedding compresses all speaking styles (emotion, age, gender, etc.) into a shared space, resulting in:

  • Dominant styles (e.g., gender) overshadowing others (e.g., emotion);
  • Difficulty in independently controlling individual styles;
  • Dependency on large-scale models and high computational resources.

3. Limitations of style control in TTS:

  • Text-prompt methods (CosyVoice, PromptTTS): descriptive text is inherently ambiguous (e.g., "happy male" admits multiple expressions).
  • Speech-prompt methods (F5-TTS, VALL-E): embeddings extracted from reference speech carry entangled style information.
  • UniStyle attempts to unify both prompt types, but its tightly coupled design causes generated speech to retain reference speech characteristics even when the text specifies conflicting styles.

Core Insight

Different types of speaking styles (emotion vs. gender vs. age) exhibit distinct discriminative boundaries and label spaces, and should be projected into their own independent subspaces rather than compressed into a single shared space.

Method

Overall Architecture

ParaMETA adopts a two-stage embedding learning strategy:

  1. META embedding space: A shared representation is learned through hierarchical-similarity-weighted contrastive regularization, drawing samples with more shared labels closer together.
  2. Task-specific subspaces: META embeddings are projected into low-dimensional, task-exclusive subspaces, each optimized independently.

The framework supports both speech-based and text-based prompting, and employs prototype alignment to ensure cross-modal semantic consistency.

Key Designs

1. Speech Encoder

ParaMETA is a model-agnostic representation learning framework, validated across four encoder backbones:

  • CNN: Convolutional layers followed by global mean pooling over the time dimension.
  • LSTM: Final hidden state used as the sequence representation.
  • Q-Former: Learnable latent queries attend to the spectrogram via cross-attention.
  • Transformer: Self-attention layers with summation pooling over time steps.

The input is a Mel spectrogram \(\mathrm{MEL} \in \mathbb{R}^{F \times t}\), encoded as \(x = \mathrm{Encoder}(\mathrm{MEL}) \in \mathbb{R}^D\).
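For concreteness, here is a minimal PyTorch sketch of the CNN variant; the layer widths and kernel sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """CNN backbone: conv layers + global mean pooling over time.

    n_mels (F) and d_model (D) are assumed values for illustration.
    """
    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2),
            nn.ReLU(),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (B, F, t) -> (B, D, t) -> mean pool over time -> (B, D)
        return self.conv(mel).mean(dim=-1)
```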

2. META Embedding Regularization

Mechanism: Conventional contrastive learning treats all samples with differing labels uniformly as "negative pairs," a binary partitioning that ignores partially overlapping style relationships. ParaMETA instead grades pairs hierarchically, from fully positive to partially positive, according to how many labels they share:

  • The class-level similarity between sample pair \((i, j)\) is computed as the proportion of shared labels across all tasks:
\[w_{i,j} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}[y_i^{(t)} = y_j^{(t)}]\]
  • After normalization, this serves as the weight in the contrastive loss.

Intuition: A speech sample labeled [female, happy] should be closer to [female, sad] than to [male, sad], since the former shares the gender label.

  • META regularization loss:
\[\mathcal{L}_{\mathrm{META}} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j \neq i}^{B} \hat{w}_{i,j} \log p_{i,j}\]

where \(\log p_{i,j}\) is the softmax log-probability based on cosine similarity.

Design Motivation: These graded weights transform the embedding space from a simple "attract same-class, repel others" structure into a hierarchical topology, providing a richer initial representation for subsequent task-specific projection.
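A minimal PyTorch sketch of this regularizer follows, assuming a softmax over cosine similarities with a temperature \(\tau\) (the temperature value is an assumption, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def meta_loss(x: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Hierarchical-similarity-weighted contrastive regularization.

    x:      (B, D) META embeddings
    labels: (B, T) integer labels, one column per style task
    """
    B = x.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=x.device)

    # w_ij = fraction of tasks on which samples i and j share a label
    w = (labels.unsqueeze(1) == labels.unsqueeze(0)).float().mean(dim=-1)
    w = w.masked_fill(eye, 0.0)
    # per-anchor normalization (the hat-w in the loss)
    w_hat = w / w.sum(dim=-1, keepdim=True).clamp_min(1e-8)

    # softmax log-probabilities over cosine similarities, self-pairs excluded
    sim = F.cosine_similarity(x.unsqueeze(1), x.unsqueeze(0), dim=-1) / tau
    sim = sim.masked_fill(eye, float("-inf"))
    log_p = F.log_softmax(sim, dim=-1).masked_fill(eye, 0.0)

    return -(w_hat * log_p).sum(dim=-1).mean()
```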

3. Task-Specific Subspace Projection

Mechanism: META embeddings are projected through \(T\) independent linear transformations into task-exclusive subspaces, \(z^{(t)} = f_t(Z) \in \mathbb{R}^{B \times d}\), where supervised contrastive loss is applied independently within each subspace:

\[\mathcal{L}_{\text{SCL}}^{(t)} = -\frac{1}{B} \sum_{i=1}^{B} \frac{1}{|\mathcal{P}_i^{(t)}|} \sum_{j \in \mathcal{P}_i^{(t)}} \log \frac{e^{\cos(z_i^{(t)}, z_j^{(t)})}}{\sum_{k \neq i} e^{\cos(z_i^{(t)}, z_k^{(t)})}}\]

where \(\mathcal{P}_i^{(t)} = \{j \mid j \neq i, y_j^{(t)} = y_i^{(t)}\}\) is the set of positive samples sharing the same task-\(t\) label as sample \(i\).

Effect: In the emotion subspace, all "happy"-labeled utterances cluster together regardless of gender or age. This design effectively eliminates inter-task interference.
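Below is a hedged sketch of the per-task projection heads and the within-subspace supervised contrastive loss; the dimensions and temperature are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskProjections(nn.Module):
    """T independent linear heads mapping META embeddings to task subspaces."""
    def __init__(self, d_meta: int = 256, d_sub: int = 64, n_tasks: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_meta, d_sub) for _ in range(n_tasks))

    def forward(self, x: torch.Tensor):            # x: (B, d_meta)
        return [head(x) for head in self.heads]    # list of (B, d_sub)

def supcon_loss(z: torch.Tensor, y: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss within one task subspace."""
    B = z.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=z.device)
    z = F.normalize(z, dim=-1)
    sim = (z @ z.t()) / tau
    sim = sim.masked_fill(eye, float("-inf"))      # exclude self-pairs
    log_p = F.log_softmax(sim, dim=-1).masked_fill(eye, 0.0)
    # positive set P_i: same task label, excluding the anchor itself
    pos = (y.unsqueeze(0) == y.unsqueeze(1)).float().masked_fill(eye, 0.0)
    return -((pos * log_p).sum(-1) / pos.sum(-1).clamp_min(1.0)).mean()
```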

4. Prototype Learning

Mechanism: A prototype vector \(p_c^{(t)} \in \mathbb{R}^d\) is maintained for each class of each task, serving as a class anchor.

  • Prototypes are updated via Exponential Moving Average (EMA), where \(\bar{z}_c^{(t)}\) denotes the batch mean of class-\(c\) embeddings in task \(t\)'s subspace:
\[p_c^{(t)} \leftarrow m \cdot p_c^{(t)} + (1-m) \cdot \bar{z}_c^{(t)}, \quad m = 0.99\]
  • The prototype alignment loss pulls each sample embedding toward the prototype of its own class \(y_i^{(t)}\):
\[\mathcal{L}_{\mathrm{PAL}} = \sum_{t=1}^{T} \frac{1}{B} \sum_{i=1}^{B} \left(1 - \cos\!\left(z_i^{(t)}, p_{y_i^{(t)}}^{(t)}\right)\right)\]

Design Motivation: Prototypes serve a dual role—acting as class anchors during training (via alignment loss) and as directly usable class representations at inference (nearest-prototype classification; replaceable modules for TTS style control).
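A minimal sketch of the prototype machinery, assuming prototypes are updated from the batch mean of each class's embeddings and kept L2-normalized (both assumptions):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(prototypes: torch.Tensor, z: torch.Tensor,
               y: torch.Tensor, m: float = 0.99) -> None:
    """EMA update of class prototypes for one task.

    prototypes: (C, d); z: (B, d) subspace embeddings; y: (B,) labels.
    """
    for c in y.unique():
        z_c = z[y == c].mean(dim=0)   # batch mean of class-c embeddings (assumption)
        prototypes[c] = m * prototypes[c] + (1 - m) * F.normalize(z_c, dim=-1)

def pal_loss(z: torch.Tensor, y: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Prototype alignment: pull each embedding toward its class prototype."""
    return (1 - F.cosine_similarity(z, prototypes[y], dim=-1)).mean()

def classify(z: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Inference: nearest-prototype classification by cosine similarity."""
    sims = F.normalize(z, dim=-1) @ F.normalize(prototypes, dim=-1).t()
    return sims.argmax(dim=-1)
```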

5. Text–Speech Alignment

A pretrained text encoder encodes style descriptions (e.g., "happy adult female"), whose embeddings are projected into the corresponding task-specific subspaces and subjected to the same prototype alignment loss, achieving cross-modal semantic consistency.
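A hedged sketch of the text branch, reusing `pal_loss` from the prototype sketch above; `text_encoder`, the per-task projectors `text_heads`, and the decision to apply PAL only to the tasks the description actually specifies are all illustrative assumptions, not the paper's exact design.

```python
import torch

# (1, D_text) pooled embedding from a frozen pretrained text encoder (assumed API)
text_emb = text_encoder(["happy adult female"])
# project into each task-specific subspace: one (1, d_sub) embedding per task
z_text = [g(text_emb) for g in text_heads]

# hypothetical task-index -> class-id map for the specified styles
specified = {0: HAPPY, 1: ADULT, 2: FEMALE}
loss_pal_text = sum(pal_loss(z_text[t], torch.tensor([c]), protos[t])
                    for t, c in specified.items())
```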

Loss & Training

\[\mathcal{L} = \mathcal{L}_{\mathrm{META}} + \sum_{t=1}^{T} \mathcal{L}_{\mathrm{SCL}}^{(t)} + \mathcal{L}_{\mathrm{PAL}}^{(\mathrm{Speech})} + \mathcal{L}_{\mathrm{PAL}}^{(\mathrm{Text})}\]
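Putting the pieces together, one illustrative training step (all terms unweighted, as in the total loss above) might look as follows, with names taken from the earlier sketches:

```python
# one illustrative training step; encoder, task_projections, protos,
# meta_loss, supcon_loss, pal_loss, ema_update are from the sketches above
x = encoder(mel)                                   # META embeddings (B, D)
loss = meta_loss(x, labels)                        # hierarchical regularization
z_tasks = task_projections(x)                      # per-task subspace embeddings
for t, z_t in enumerate(z_tasks):
    loss = loss + supcon_loss(z_t, labels[:, t])           # L_SCL^(t)
    loss = loss + pal_loss(z_t, labels[:, t], protos[t])   # L_PAL (speech)
loss = loss + loss_pal_text                        # L_PAL (text), from the text branch

optimizer.zero_grad()
loss.backward()
optimizer.step()

for t, z_t in enumerate(z_tasks):                  # EMA prototype refresh
    ema_update(protos[t], z_t.detach(), labels[:, t])
```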

Key Experimental Results

Experimental Setup

  • Datasets: Baker + LJSpeech + ESD + CREMA-D + Genshin Impact character speech
  • 16 speaking styles: 7 emotion classes, 5 age classes, 2 gender classes, 2 language classes
  • Approximately 93k speech samples, uniformly resampled to 22.05 kHz
  • Hardware: NVIDIA TITAN RTX, batch size 32, 40k training steps

Main Results: Speaking Style Classification (Subject-Independent, Transformer Backbone)

| Method | Emotion B.Acc | Gender B.Acc | Age B.Acc | Language B.Acc |
| --- | --- | --- | --- | --- |
| CLAP (General) | 14.3% | 50.0% | 25.0% | 50.0% |
| CLAP (Speech&Music) | 22.1% | 67.1% | 11.9% | 18.9% |
| ParaCLAP | 9.2% | 9.7% | 10.8% | 20.0% |
| Cross-Entropy | 35.0% | 76.8% | 20.6% | 89.5% |
| CLAP Objective | 55.2% | 39.4% | 25.3% | 56.6% |
| ParaMETA (w/o reg) | 44.2% | 77.9% | 26.1% | 90.7% |
| ParaMETA (w/ reg) | 50.1% | 78.4% | 29.7% | 91.1% |

Key observations:

  • Catastrophic failure of pretrained large models: CLAP and ParaCLAP perform extremely poorly under the subject-independent setting (emotion as low as 9–22%), indicating that their embedding spaces overfit to training speakers.
  • Negative transfer from the CLAP objective: CLAP-style contrastive learning yields strong emotion performance but substantially degrades gender and language accuracy, reflecting cross-task interference caused by style entanglement.
  • ParaMETA is the most consistent: it achieves the best performance in 12 out of 16 combinations (4 backbones × 4 tasks), demonstrating the superiority of disentangled representations.
  • META regularization is especially effective on difficult tasks: with the Transformer backbone, emotion improves by 5.9 points (44.2% → 50.1%) and age by 3.6 points (26.1% → 29.7%).

Speech Generation Quality Evaluation (TTS, Subjective Listening Tests)

| Prompt Type | N-MOS (Naturalness) | E-MOS (Expressiveness) |
| --- | --- | --- |
| Text Only | 2.02 ± 0.69 | 2.33 ± 0.97 |
| Speech Only | 2.89 ± 0.82 | 3.19 ± 0.88 |
| ParaMETA Text | 3.06 ± 0.71 | 2.91 ± 0.87 |
| ParaMETA Speech | 3.41 ± 0.86 | 3.41 ± 1.10 |

ParaMETA embeddings significantly improve perceptual quality for both prompt types: text-prompt naturalness improves by about 1.0 point (2.02 → 3.06) and speech-prompt naturalness by about 0.5 points (2.89 → 3.41). The authors attribute this to disentangled embeddings filtering out irrelevant information such as background noise.

Style Manipulation Experiment

| Manipulation Type | Original Similarity | Post-Manipulation Similarity | Classification Accuracy |
| --- | --- | --- | --- |
| Language | 0.4812 | 0.4850 | 55.0% |
| Age | 0.4707 | 0.5486 | 70.0% |
| Emotion | 0.4687 | 0.8367 | 90.0% |
| Gender | 0.4707 | 0.9888 | 100.0% |

Gender manipulation achieves the highest precision (100%), followed closely by emotion (90%). Language manipulation yields the weakest results (55%), as language is primarily conveyed through phonemes and lexical content, which are tightly bound to the text input.

Ablation Study: Computational Resource Comparison

| Method | RTF (Real-Time Factor) | Parameters | GPU Memory |
| --- | --- | --- | --- |
| CLAP | 0.091 | 198.48M | 1966 MB |
| ParaCLAP | 0.008 | 276.33M | 1345 MB |
| ParaMETA (LSTM) | 0.003 | 3.77M | 433 MB |
| ParaMETA (Transformer) | 0.005 | 1.86M | 429 MB |

ParaMETA-LSTM requires only 1.9% of CLAP's parameters, 22% of its GPU memory, and runs 30× faster, making it highly suitable for resource-constrained and real-time deployment scenarios.

Key Findings

  1. Necessity of disentanglement: t-SNE visualizations clearly show that in the META space, gender dominates the clustering structure (e.g., happy male is closer to sad male), whereas in task-specific subspaces, emotion clusters are substantially cleaner—confirming that projection achieves style disentanglement.
  2. Cross-entropy is a reasonable baseline: In direct training settings, CE-based multi-task learning is less susceptible to negative transfer than CLAP, yet still less stable than ParaMETA.
  3. Speech prompts outperform text prompts: Speech naturally encodes rich prosodic information including pitch, rate, and intonation, whereas text descriptions are inherently ambiguous (the same "happy male" may be expressed in numerous ways).

Highlights & Insights

  • Hierarchical contrastive learning: Rather than treating all non-identical-label samples as equidistant negatives, the framework assigns graded similarity weights proportional to the number of shared labels, yielding a richer and more principled embedding topology.
  • Dual role of prototypes: Prototypes serve both as class anchors during training (alignment loss) and as direct inference interfaces (nearest-prototype classification; replaceable modules for TTS style control), constituting an elegant and unified design.
  • Simplicity of style manipulation: Replacing the embedding in a given task subspace with the target class prototype suffices to modify a specific style while leaving others unchanged, a direct benefit of the disentangled design (see the sketch after this list).
  • Model-agnostic generality: ParaMETA demonstrates consistent effectiveness across CNN, LSTM, Q-Former, and Transformer backbones, validating the universality of the framework.
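As referenced above, here is a minimal sketch of prototype-swap style manipulation; the task/class indices and the concatenation-based TTS conditioning are illustrative assumptions.

```python
import torch

# embeddings extracted from the reference speech (names from earlier sketches)
z_tasks = task_projections(encoder(mel))
EMOTION, HAPPY = 0, 3                      # hypothetical task and class indices
# swap only the emotion subspace to the "happy" prototype; other styles untouched
z_tasks[EMOTION] = protos[EMOTION][HAPPY].expand_as(z_tasks[EMOTION])
# assumed conditioning scheme: concatenated subspace embeddings drive the TTS decoder
style_cond = torch.cat(z_tasks, dim=-1)
```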

Limitations & Future Work

  1. Poor language manipulation performance: Language is highly bound to phonemes and lexical content; substituting embeddings alone cannot alter text content, necessitating complementary modifications at the text level.
  2. Limited dataset scale: With 93k samples aggregated from multiple public datasets, the data distribution may be imbalanced, and the authenticity of Genshin Impact character speech is questionable.
  3. Coarse emotion granularity: Only 7 discrete emotion categories are used, failing to capture continuous affective dimensions (e.g., the valence-arousal model).
  4. TTS evaluation relies solely on subjective MOS: Quantitative objective metrics such as WER, speaker similarity, and F0 correlation are absent.
  5. Complex combined style manipulation unexplored: Whether simultaneous modification of multiple styles (e.g., emotion and age) introduces interference remains uninvestigated.

Broader Observations

  • The unified embedding space of CLAP is effective for general audio understanding, but exposes entanglement problems in paralinguistic tasks requiring fine-grained style control.
  • UniStyle's tightly coupled design serves as a cautionary example—generated speech retains reference speech characteristics even when the text specifies conflicting styles.
  • The prototype learning with EMA updates is inspired by MoCo (He et al. 2020), widely validated in visual contrastive learning, and is adapted here with notable elegance for speaking style representation.
  • The hierarchical contrastive learning paradigm may offer insights for other multi-label representation learning scenarios, such as multi-attribute person re-identification and multi-label image retrieval.
  • The ParaMETA framework is potentially extensible to disentangled learning of additional paralinguistic attributes such as accent, speech rate, and loudness.

Rating

Dimension Score (1–5) Comments
Novelty 4 Hierarchical contrastive learning + task-specific subspace disentanglement + prototype manipulation
Technical Depth 4 Motivation and design rationale of the four loss components are clearly articulated
Experimental Thoroughness 4 Covers classification, TTS, manipulation, and computational cost; compares four backbones
Writing Quality 4 Well-structured, intuitive figures, consistent notation
Practicality 4 Lightweight, model-agnostic, open-source
Overall 4 Unified framework addressing both recognition and generation; disentangled design is elegant and effective