Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition

- Conference: ACL 2026
- arXiv: 2604.17647
- Code: Project Page
- Area: Multilingual Translation
- Keywords: Non-verbal vocalization supervision, hyperbolic representation learning, optimal transport alignment, prosody codebook, cross-lingual emotion transfer

TL;DR

This paper proposes NOVA-ARC, the first framework to formulate multilingual speech emotion recognition (SER) as an unsupervised transfer problem from labeled non-verbal vocalizations (NVV) to unlabeled verbal speech (UVS). NOVA-ARC combines a hyperbolic prosody vector-quantized codebook, a Hyperbolic Emotion Lens, and optimal transport prototype alignment to transfer emotion labels from NVV to UVS. Experiments on 6 datasets support the feasibility of NVV→UVS transfer and show gains over a direct transfer baseline.

Background & Motivation

Background: SER supervision relies almost exclusively on labeled verbal speech, yet such annotations are extremely scarce for low-resource languages. Non-verbal vocalizations (laughter, sighs, crying) carry rich emotional signals and, because they contain no lexical content, are naturally language-agnostic.

Limitations of Prior Work: (1) Emotion labels in verbal speech are inevitably entangled with lexical and phonological features, which do not transfer across languages; (2) existing UDA methods still assume emotional supervision comes from labeled verbal speech; (3) emotion recognition from non-verbal vocalizations has only recently gained attention and has never been used as a supervision source for cross-lingual SER.

Key Challenge: A language-agnostic emotional supervision signal is needed — emotion in verbal speech is mixed with language-specific expression conventions, whereas non-verbal vocalizations offer a purer alternative.

Goal: To verify whether non-verbal vocalizations can serve as a stronger and more transferable source of emotional supervision for multilingual SER.

Key Insight: Non-verbal vocalizations (laughter/sobbing/sighs) originate from shared physiological mechanisms, with dominant features at the prosodic level — voicing, spectral tilt, intensity dynamics, and temporal modulation — all of which are naturally language-independent.

Core Idea: Model the hierarchical structure of emotion (coarse-grained emotion families → fine-grained categories → intensity) in hyperbolic space, discretize prosodic patterns via a hyperbolic VQ codebook, and align NVV emotion prototypes to UVS representations via optimal transport.

Method

Overall Architecture

Input NVV/UVS audio is processed by a shared self-supervised encoder (voc2vec/WavLM/wav2vec 2.0/MMS) to extract frame-level features, which are projected onto the Poincaré ball. Features are discretized via a hyperbolic VQ codebook, fused with continuous representations via Möbius addition, compressed through a bottleneck, calibrated for intensity via the Hyperbolic Emotion Lens, and aggregated through attentive pooling to obtain utterance-level embeddings. A classifier trained on labeled NVV data computes class prototypes; UVS representations are aligned to these prototypes via optimal transport with consistency regularization.
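The projection step can be illustrated with the exponential map at the origin, a common way to place Euclidean features on the Poincaré ball. The summary does not say which projection NOVA-ARC uses, so the choice of map, the curvature parameter `c`, and the toy input below are assumptions for illustration only.

```python
import math

def expmap0(v, c=1.0, eps=1e-9):
    """Exponential map at the origin: Euclidean vector -> Poincare ball (curvature -c).

    Assumption: the paper's projection is not specified in this summary;
    this is one standard choice, not necessarily the authors' method.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm < eps:
        return [0.0] * len(v)  # the zero vector maps to the ball's origin
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]

frame = [3.0, -4.0]   # hypothetical 2-D encoder feature with Euclidean norm 5
point = expmap0(frame)
radius = math.sqrt(sum(x * x for x in point))  # equals tanh(5), strictly below 1
```

The map keeps the direction of the feature and squashes its norm into the open unit ball, so large Euclidean norms end up close to the boundary.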

Key Designs

Hyperbolic Prosody Vector-Quantized Codebook:
- Function: Discretizes continuous prosodic features into a shared vocabulary of emotional patterns.
- Mechanism: A codebook \(\mathcal{C}\) of size \(K=256\) is maintained on the Poincaré ball; each frame \(\mathbf{x}_t\) is assigned to its nearest codeword \(\mathbf{q}_t\) using Poincaré distance. Continuous frames and discrete tokens are fused via Möbius addition, followed by bottleneck projection.
- Design Motivation: (1) VQ enforces discretization of prosodic patterns, enabling NVV and UVS to share the same prosodic vocabulary; (2) hyperbolic space is better suited than Euclidean space for encoding hierarchical relations — emotions exhibit a tree-like structure from broad categories to subcategories.
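
The codebook lookup and fusion described above can be sketched with the standard unit-curvature formulas for Poincaré distance and Möbius addition. The three-entry codebook, the 2-D frame, and the helper names are hypothetical; the real model uses \(K=256\) codewords and a subsequent bottleneck projection that is omitted here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def poincare_dist(x, y):
    """Geodesic distance on the unit Poincare ball (curvature -1)."""
    diff = [a - b for a, b in zip(x, y)]
    denom = (1.0 - dot(x, x)) * (1.0 - dot(y, y))
    return math.acosh(1.0 + 2.0 * dot(diff, diff) / denom)

def mobius_add(x, y):
    """Mobius addition x (+) y on the unit Poincare ball."""
    xy, xx, yy = dot(x, y), dot(x, x), dot(y, y)
    denom = 1.0 + 2.0 * xy + xx * yy
    return [((1.0 + 2.0 * xy + yy) * a + (1.0 - xx) * b) / denom
            for a, b in zip(x, y)]

def quantize(frame, codebook):
    """Return the index and value of the codeword nearest in Poincare distance."""
    idx = min(range(len(codebook)),
              key=lambda k: poincare_dist(frame, codebook[k]))
    return idx, codebook[idx]

# Toy codebook with K=3 entries (hypothetical values inside the unit ball).
codebook = [[0.0, 0.0], [0.5, 0.0], [0.0, -0.7]]
frame = [0.4, 0.1]

idx, q = quantize(frame, codebook)   # nearest codeword for this frame
fused = mobius_add(frame, q)         # continuous frame fused with its token
```

Möbius addition keeps the fused point inside the ball, which ordinary vector addition would not guarantee.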
Hyperbolic Emotion Lens (HEL) + Optimal Transport Prototype Alignment:
- Function: Calibrates emotional intensity discrepancies between NVV and UVS, and enables unsupervised transfer.
- Mechanism: HEL adjusts the radial position of embeddings on the Poincaré ball via a learnable power-law radial transformation with exponent \(\alpha\) (proximity to the boundary indicates higher intensity). Fréchet means of the labeled NVV embeddings serve as class prototypes \(\mu^{(c)}\). For each UVS batch, Sinkhorn iterations solve for an entropy-regularized optimal transport plan \(\Pi^*\), which induces soft pseudo-labels \(q_{cj} = n \Pi^*_{cj}\) for training.
- Design Motivation: Emotional expressions in NVV tend to be more intense than in UVS (laughter is more salient than a smile). HEL's radial calibration bridges this intensity gap. Optimal transport is more flexible than hard clustering — it allows a UVS utterance to be matched to multiple emotion prototypes with varying weights.
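
A minimal sketch of the two mechanisms above, under stated assumptions: the radial transform is taken to be \(r \mapsto r^{\alpha}\) with direction preserved, both transport marginals are uniform, \(n\) is the number of UVS utterances in the batch, and the cost matrix is a toy stand-in for prototype-to-utterance distances. None of these specifics are confirmed by the summary, and the function names are hypothetical.

```python
import math

def hel_rescale(z, alpha, eps=1e-9):
    """Power-law radial rescaling r -> r**alpha, keeping direction (assumed form)."""
    r = math.sqrt(sum(x * x for x in z))
    if r < eps:
        return list(z)
    scale = (r ** alpha) / r
    return [scale * x for x in z]

def sinkhorn(cost, reg=0.5, n_iters=200):
    """Entropy-regularized OT between C prototypes (rows) and n samples (columns).

    Assumption: uniform marginals 1/C over classes and 1/n over samples.
    """
    C, n = len(cost), len(cost[0])
    a, b = 1.0 / C, 1.0 / n
    K = [[math.exp(-cost[c][j] / reg) for j in range(n)] for c in range(C)]
    u, v = [1.0] * C, [1.0] * n
    for _ in range(n_iters):
        u = [a / sum(K[c][j] * v[j] for j in range(n)) for c in range(C)]
        v = [b / sum(K[c][j] * u[c] for c in range(C)) for j in range(n)]
    return [[u[c] * K[c][j] * v[j] for j in range(n)] for c in range(C)]

# Radial calibration: with r in (0, 1), alpha < 1 pushes a point outward.
z = hel_rescale([0.3, 0.4], alpha=0.5)

# Toy cost matrix: 2 class prototypes x 4 unlabeled utterances.
cost = [[0.2, 1.5, 0.9, 2.0],
        [1.8, 0.3, 1.1, 0.4]]
plan = sinkhorn(cost)
n = len(cost[0])
# Soft pseudo-labels q_cj = n * plan_cj; each column is a class distribution.
soft_labels = [[n * plan[c][j] for j in range(n)] for c in range(len(cost))]
```

Because each column of the plan carries mass \(1/n\), multiplying by \(n\) turns it into a probability distribution over classes for that utterance, which is what makes the plan usable as a soft label.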
Shared Forward Pass + Consistency Regularization:
- Function: Ensures NVV and UVS follow identical network paths, promoting alignment of the representation space.
- Mechanism: NVV and UVS share all model parameters (encoder, projection layers, codebook, classifier). Consistency regularization stabilizes training on unlabeled UVS and reduces pseudo-label noise.
- Design Motivation: If NVV and UVS used separate encoding paths, their representation spaces could diverge. Sharing parameters forces both input types into a common space.

Loss & Training

The overall objective is: \(\mathcal{L} = L_S(\mathcal{B}_S) + \lambda_{\text{OPT}} L_{\text{OPT}}(\mathcal{B}_T) + \lambda_{\text{OT}} L_{\text{OT-CE}}(\mathcal{B}_T)\), where \(\mathcal{B}_S\) is a labeled NVV (source) batch and \(\mathcal{B}_T\) an unlabeled UVS (target) batch. \(L_S\) is the supervised cross-entropy on NVV; \(L_{\text{OPT}}\) encourages geometric alignment; \(L_{\text{OT-CE}}\) trains the classifier with transport-induced soft labels. Training uses AdamW for 30 epochs with cosine learning-rate decay and a 10% warmup.
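
The weighted objective and the schedule can be sketched as follows. The loss weights, the base learning rate, and the linear shape of the warmup are assumptions; the summary gives only the 10% warmup fraction and cosine decay.

```python
import math

def total_loss(l_sup, l_opt, l_ot_ce, lam_opt=1.0, lam_ot=1.0):
    """L = L_S + lam_opt * L_OPT + lam_ot * L_OT-CE (weights are hypothetical)."""
    return l_sup + lam_opt * l_opt + lam_ot * l_ot_ce

def lr_at(step, total_steps, base_lr=1e-4, warmup_frac=0.10):
    """Linear warmup over the first 10% of steps, then cosine decay to zero.

    Assumption: linear warmup and base_lr=1e-4 are illustrative choices.
    """
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

loss = total_loss(1.0, 0.5, 0.25, lam_opt=0.1, lam_ot=0.2)
schedule = [lr_at(s, total_steps=1000) for s in (0, 99, 100, 550, 1000)]
```

In a real training loop the value from `lr_at` would be written into the AdamW optimizer's learning rate at every step.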

Key Experimental Results

Main Results

NVV→UVS Transfer (NOVA-ARC + voc2vec)

| Target Dataset | Language | NOVA-ARC Acc (%) | Direct Transfer Baseline (%) |
| --- | --- | --- | --- |
| ASVP-ESD (V) | Multilingual | 62.23 | 32.67 |
| MESD | Spanish | ~55 | 49.02 |
| AESDD | Greek | ~42 | 35.86 |
| RAVDESS | English | ~43 | 36.51 |
| Emo-DB | German | ~50 | 44.69 |
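
The gain over the direct transfer baseline on each dataset follows from simple subtraction. In the sketch below the values marked "~" in the table are treated as exact, which is an assumption, so the resulting differences are approximate as well.

```python
# (NOVA-ARC accuracy, direct transfer baseline), copied from the table above.
# Entries given as "~55" etc. are approximate in the source.
results = {
    "ASVP-ESD (V)": (62.23, 32.67),
    "MESD": (55.0, 49.02),      # approximate NOVA-ARC value
    "AESDD": (42.0, 35.86),     # approximate NOVA-ARC value
    "RAVDESS": (43.0, 36.51),   # approximate NOVA-ARC value
    "Emo-DB": (50.0, 44.69),    # approximate NOVA-ARC value
}

# Absolute gain in percentage points for each target dataset.
gains = {name: round(nova - base, 2) for name, (nova, base) in results.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} pp")
```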

Ablation Study

- Hyperbolic vs. Euclidean comparisons consistently demonstrate the superiority of hyperbolic space.
- voc2vec (pre-trained specifically on non-verbal speech) performs best on the NVV source domain, while WavLM/MMS are stronger on the UVS target domain.
- NOVA-ARC also achieves the best performance in the V→V (verbal-to-verbal) transfer setting.

Key Findings

- NVV→UVS transfer is viable: NOVA-ARC outperforms the direct transfer baseline on every target dataset in the main results table, by roughly 5 to 7 pp on MESD, AESDD, RAVDESS, and Emo-DB and by about 30 pp on ASVP-ESD (V), which suggests that NVV carries usable cross-lingual emotional signals.
- voc2vec is strongest on NVV but weakest on UVS, indicating that specialized encoders capture patterns unique to NVV.
- The advantage of hyperbolic space is more pronounced in low-resource target domains, where hierarchical structural encoding provides a stronger inductive bias under data scarcity.

Highlights & Insights

- Reframing SER as NVV→UVS transfer changes where emotional supervision comes from: labeled non-verbal vocalizations rather than labeled verbal speech.
- Hyperbolic space is a natural fit for emotion modeling, given that emotions exhibit a coarse-to-fine hierarchy (positive/negative → specific emotion → intensity).
- The shared-parameter design is simple but critical: it ensures NVV and UVS are represented in a unified space.

Limitations & Future Work

- The NVV dataset (ASVP-ESD) is limited in scale.
- Emotion categories are unified into only 5 classes; finer-grained classification remains unvalidated.
- Sensitivity analysis of hyperparameters such as codebook size is insufficient.
- Validation across more languages and larger-scale settings is needed.

Comparison with Related Work

- vs. Standard UDA SER: Prior methods still assume emotional supervision derives from verbal speech; NOVA-ARC uses NVV as a purer supervision source.
- vs. Mote et al.: Mote et al. use KNN-based voice conversion for cross-lingual adaptation; NOVA-ARC employs optimal transport for prototype alignment.
- vs. Phukan et al.: Phukan et al. focus on NVV recognition itself; NOVA-ARC uses NVV as a bridge for transfer learning.

Rating

- Novelty: ⭐⭐⭐⭐⭐ First to propose the NVV→UVS transfer paradigm; hyperbolic prosody codebook design is distinctly original.
- Experimental Thoroughness: ⭐⭐⭐⭐ 6 datasets, 4 encoders, hyperbolic vs. Euclidean ablations.
- Writing Quality: ⭐⭐⭐⭐ Compelling motivation and complete theoretical framework.
- Value: ⭐⭐⭐⭐⭐ Opens an entirely new direction for low-resource SER research.