Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition¶
Conference: ACL 2026 · arXiv: 2604.17647 · Code: Project Page · Area: Multilingual Translation · Keywords: non-verbal vocalization supervision, hyperbolic representation learning, optimal transport alignment, prosody codebook, cross-lingual emotion transfer
TL;DR¶
This paper proposes NOVA-ARC, the first framework to formulate multilingual speech emotion recognition (SER) as an unsupervised transfer problem from labeled non-verbal vocalizations (NVV) to unlabeled verbal speech (UVS). By leveraging a hyperbolic prosody vector-quantized codebook, a Hyperbolic Emotion Lens, and optimal transport prototype alignment, NOVA-ARC achieves cross-modal emotion transfer and validates the feasibility and superiority of NVV→UVS transfer across 6 datasets.
Background & Motivation¶
Background: Supervision signals for SER almost exclusively rely on labeled verbal speech, yet such annotations are extremely scarce for low-resource languages. Non-verbal vocalizations (laughter, sighs, crying) carry rich emotional signals and are naturally language-agnostic due to the absence of lexical content.
Limitations of Prior Work: (1) Emotion labels in verbal speech are inevitably entangled with lexical and phonological features, which do not transfer across languages; (2) existing UDA methods still assume emotional supervision comes from labeled verbal speech; (3) emotion recognition from non-verbal vocalizations has only recently gained attention and has never been used as a supervision source for cross-lingual SER.
Key Challenge: A language-agnostic emotional supervision signal is needed — emotion in verbal speech is mixed with language-specific expression conventions, whereas non-verbal vocalizations offer a purer alternative.
Goal: To verify whether non-verbal vocalizations can serve as a stronger and more transferable source of emotional supervision for multilingual SER.
Key Insight: Non-verbal vocalizations (laughter/sobbing/sighs) originate from shared physiological mechanisms, with dominant features at the prosodic level — voicing, spectral tilt, intensity dynamics, and temporal modulation — all of which are naturally language-independent.
Core Idea: Model the hierarchical structure of emotion (coarse-grained emotion families → fine-grained categories → intensity) in hyperbolic space, discretize prosodic patterns via a hyperbolic VQ codebook, and align NVV emotion prototypes to UVS representations via optimal transport.
Method¶
Overall Architecture¶
Input NVV/UVS audio is processed by a shared self-supervised encoder (voc2vec/WavLM/wav2vec 2.0/MMS) to extract frame-level features, which are projected onto the Poincaré ball. Features are discretized via a hyperbolic VQ codebook, fused with continuous representations via Möbius addition, compressed through a bottleneck, calibrated for intensity via the Hyperbolic Emotion Lens, and aggregated through attentive pooling to obtain utterance-level embeddings. A classifier trained on labeled NVV data computes class prototypes; UVS representations are aligned to these prototypes via optimal transport with consistency regularization.
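The stage ordering above can be sketched end-to-end. This is a hypothetical flow with every stage stubbed out (all function names, shapes, and constants are illustrative assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

# --- stage stubs; NOVA-ARC itself uses an SSL encoder, hyperbolic VQ, HEL, etc. ---
def encode(wave):
    return rng.normal(size=(len(wave) // 160, 32))        # frame-level features

def project_to_ball(h):
    return np.tanh(h) * 0.3 / np.sqrt(h.shape[-1])        # map inside the Poincare ball

def vq_and_fuse(z):
    return z                                              # codebook quantization + fusion (stub)

def lens(z, alpha=1.5):
    # power-law radial recalibration (the paper learns alpha)
    r = np.linalg.norm(z, axis=-1, keepdims=True)
    return z * np.clip(r, 1e-9, None) ** (alpha - 1)

def attentive_pool(z):
    return z.mean(axis=0)                                 # attentive pooling (mean stub)

wave = rng.normal(size=16000)                             # 1 s of audio at 16 kHz
emb = attentive_pool(lens(vq_and_fuse(project_to_ball(encode(wave)))))
```

The point of the sketch is only the data flow: shared encoder → ball projection → VQ fusion → intensity lens → pooling, identical for NVV and UVS inputs.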
Key Designs¶
- Hyperbolic Prosody Vector-Quantized Codebook:
- Function: Discretizes continuous prosodic features into a shared vocabulary of emotional patterns.
- Mechanism: A codebook \(\mathcal{C}\) of size \(K=256\) is maintained on the Poincaré ball; each frame \(\mathbf{x}_t\) is assigned to its nearest codeword \(\mathbf{q}_t\) using Poincaré distance. Continuous frames and discrete tokens are fused via Möbius addition, followed by bottleneck projection.
- Design Motivation: (1) VQ enforces discretization of prosodic patterns, enabling NVV and UVS to share the same prosodic vocabulary; (2) hyperbolic space is better suited than Euclidean space for encoding hierarchical relations — emotions exhibit a tree-like structure from broad categories to subcategories.
- Hyperbolic Emotion Lens (HEL) + Optimal Transport Prototype Alignment:
- Function: Calibrates emotional intensity discrepancies between NVV and UVS, and enables unsupervised transfer.
- Mechanism: HEL adjusts the radial position of embeddings on the Poincaré ball via a learnable power-law radial transformation \(\alpha\) (proximity to the boundary indicates higher intensity). Fréchet means of labeled NVV data are computed as class prototypes \(\mu^{(c)}\). For UVS batches, Sinkhorn iterations solve an entropy-regularized optimal transport plan \(\Pi^*\), inducing soft pseudo-labels \(q_{cj} = n \Pi^*_{cj}\) for training.
- Design Motivation: Emotional expressions in NVV tend to be more intense than in UVS (laughter is more salient than a smile). HEL's radial calibration bridges this intensity gap. Optimal transport is more flexible than hard clustering — it allows a UVS utterance to be matched to multiple emotion prototypes with varying weights.
- Shared Forward Pass + Consistency Regularization:
- Function: Ensures NVV and UVS follow identical network paths, promoting alignment of the representation space.
- Mechanism: NVV and UVS share all model parameters (encoder, projection layers, codebook, classifier). Consistency regularization stabilizes training on unlabeled UVS and reduces pseudo-label noise.
- Design Motivation: If NVV and UVS use separate encoding paths, their representation spaces may diverge — shared parameters enforce both input types to be represented in a common space.
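The mechanisms described above can be sketched in a minimal numpy implementation, assuming a unit-curvature Poincaré ball, a fixed lens exponent, and uniform OT marginals (the paper learns \(\alpha\) and uses trained embeddings; all shapes and constants here are illustrative):

```python
import numpy as np

def poincare_dist(x, y, eps=1e-9):
    """Geodesic distance on the unit-curvature Poincare ball."""
    sq = ((x - y) ** 2).sum(-1)
    denom = (1 - (x ** 2).sum(-1)) * (1 - (y ** 2).sum(-1))
    return np.arccosh(1 + 2 * sq / np.maximum(denom, eps))

def mobius_add(x, y):
    """Mobius addition: fuses continuous frames with their discrete codewords."""
    xy = (x * y).sum(-1, keepdims=True)
    x2 = (x ** 2).sum(-1, keepdims=True)
    y2 = (y ** 2).sum(-1, keepdims=True)
    return ((1 + 2 * xy + y2) * x + (1 - x2) * y) / (1 + 2 * xy + x2 * y2)

def quantize(frames, codebook):
    """Nearest-codeword assignment under Poincare distance (hyperbolic VQ)."""
    d = poincare_dist(frames[:, None, :], codebook[None, :, :])   # (T, K)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

def emotion_lens(x, alpha=1.5):
    """Power-law radial recalibration r -> r**alpha (alpha is learnable in the paper)."""
    r = np.linalg.norm(x, axis=-1, keepdims=True)
    return x * np.clip(r, 1e-9, None) ** (alpha - 1)

def sinkhorn(cost, n_iters=100, eps=0.05):
    """Entropy-regularized OT plan with uniform marginals (Sinkhorn-Knopp)."""
    C, N = cost.shape
    K = np.exp(-cost / eps)
    a, b = np.ones(C) / C, np.ones(N) / N
    v = np.ones(N)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
codebook = rng.uniform(-0.25, 0.25, size=(256, 8))    # K=256 codewords inside the ball
frames   = rng.uniform(-0.25, 0.25, size=(40, 8))     # one utterance's frame features
_, q  = quantize(frames, codebook)
fused = emotion_lens(mobius_add(frames, q))           # VQ fusion, then intensity lens

protos    = rng.uniform(-0.25, 0.25, size=(5, 8))     # NVV class prototypes (Frechet means)
uvs_batch = rng.uniform(-0.25, 0.25, size=(32, 8))    # unlabeled UVS utterance embeddings
cost = poincare_dist(protos[:, None, :], uvs_batch[None, :, :])  # (5, 32)
plan = sinkhorn(cost)
pseudo = (plan.shape[1] * plan).T                     # q_cj = n * Pi*_cj; rows sum to 1
```

Note how the transport plan yields soft assignments: each UVS utterance gets a distribution over the five emotion prototypes rather than a single hard cluster, which is exactly the flexibility argued for above.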
Loss & Training¶
The overall objective is: \(\mathcal{L} = L_S(\mathcal{B}_S) + \lambda_{\text{OPT}} L_{\text{OPT}}(\mathcal{B}_T) + \lambda_{\text{OT}} L_{\text{OT-CE}}(\mathcal{B}_T)\). \(L_S\) is the supervised cross-entropy on NVV; \(L_{\text{OPT}}\) encourages geometric alignment; \(L_{\text{OT-CE}}\) trains the classifier with transport-induced soft labels. AdamW is used for 30 epochs with cosine decay and 10% warmup.
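As a rough illustration of how the supervised and transport-induced terms combine, here is a sketch where the loss weights, batch shapes, and the geometric-alignment value are placeholders, not the paper's values:

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean CE; accepts hard label indices (NVV) or soft label rows (OT pseudo-labels)."""
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    if targets.ndim == 1:
        return -logp[np.arange(len(targets)), targets].mean()
    return -(targets * logp).sum(-1).mean()

rng = np.random.default_rng(1)
nvv_logits  = rng.normal(size=(8, 5))
nvv_labels  = rng.integers(0, 5, size=8)             # labeled NVV batch B_S
uvs_logits  = rng.normal(size=(8, 5))
soft_labels = rng.dirichlet(np.ones(5), size=8)      # stand-in for q = n * Pi*
l_align = 0.0                                        # placeholder for the geometric term L_OPT
lam_opt, lam_ot = 1.0, 0.5                           # assumed weights, not from the paper

loss = (cross_entropy(nvv_logits, nvv_labels)
        + lam_opt * l_align
        + lam_ot * cross_entropy(uvs_logits, soft_labels))
```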
Key Experimental Results¶
Main Results¶
NVV→UVS Transfer (NOVA-ARC + voc2vec)
| Target Dataset | Language | NOVA-ARC Acc (%) | Direct Transfer Acc (%) |
|---|---|---|---|
| ASVP-ESD (V) | Multilingual | 62.23 | 32.67 |
| MESD | Spanish | ~55 | 49.02 |
| AESDD | Greek | ~42 | 35.86 |
| RAVDESS | English | ~43 | 36.51 |
| Emo-DB | German | ~50 | 44.69 |
Ablation Study¶
- Hyperbolic vs. Euclidean comparisons consistently demonstrate the superiority of hyperbolic space.
- voc2vec (pre-trained specifically on non-verbal speech) performs best on the NVV source domain, while WavLM/MMS are stronger on the UVS target domain.
- NOVA-ARC also achieves the best performance in the V→V (verbal-to-verbal) transfer setting.
Key Findings¶
- NVV→UVS transfer is viable — NOVA-ARC substantially outperforms the direct transfer baseline (+15–30 pp), confirming that NVV contains effective cross-lingual emotional signals.
- voc2vec is strongest on NVV but weakest on UVS, indicating that specialized encoders capture patterns unique to NVV.
- The advantage of hyperbolic space is more pronounced in low-resource target domains, where hierarchical structural encoding provides a stronger inductive bias under data scarcity.
Highlights & Insights¶
- Reframing SER as NVV→UVS transfer represents a paradigm-level innovation — it fundamentally reconceives the source of emotional supervision.
- Hyperbolic space is a highly principled choice for emotion modeling, given that emotions exhibit a clear coarse-to-fine hierarchy (positive/negative → specific emotion → intensity).
- The shared-parameter design is both elegant and critical — it ensures NVV and UVS are represented in a unified space.
Limitations & Future Work¶
- The NVV dataset (ASVP-ESD) is limited in scale.
- Emotion categories are unified into only 5 classes; finer-grained classification remains unvalidated.
- Sensitivity analysis of hyperparameters such as codebook size is insufficient.
- Validation across more languages and larger-scale settings is needed.
Related Work & Insights¶
- vs. Standard UDA SER: Prior methods still assume emotional supervision derives from verbal speech; NOVA-ARC uses NVV as a purer supervision source.
- vs. Mote et al.: They rely on KNN-based voice conversion for cross-lingual adaptation; NOVA-ARC instead aligns class prototypes via optimal transport.
- vs. Phukan et al.: They focus on NVV recognition itself; NOVA-ARC uses NVV as a bridge for transfer learning.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First to propose the NVV→UVS transfer paradigm; hyperbolic prosody codebook design is distinctly original.
- Experimental Thoroughness: ⭐⭐⭐⭐ 6 datasets, 4 encoders, hyperbolic vs. Euclidean ablations.
- Writing Quality: ⭐⭐⭐⭐ Compelling motivation and complete theoretical framework.
- Value: ⭐⭐⭐⭐⭐ Opens an entirely new direction for low-resource SER research.