
Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition

Conference: ACL 2026
arXiv: 2604.17647
Code: Project Page
Area: Multilingual Translation
Keywords: Non-verbal vocalization supervision, hyperbolic representation learning, optimal transport alignment, prosody codebook, cross-lingual emotion transfer

TL;DR

This paper proposes NOVA-ARC, the first framework to formulate multilingual speech emotion recognition (SER) as an unsupervised transfer problem from labeled non-verbal vocalizations (NVV) to unlabeled verbal speech (UVS). Leveraging a hyperbolic prosody vector-quantized codebook, a Hyperbolic Emotion Lens, and optimal transport prototype alignment, NOVA-ARC performs cross-modal emotion transfer and demonstrates the feasibility and advantage of NVV→UVS transfer across six datasets.

Background & Motivation

Background: Supervision signals for SER almost exclusively rely on labeled verbal speech, yet such annotations are extremely scarce for low-resource languages. Non-verbal vocalizations (laughter, sighs, crying) carry rich emotional signals and are naturally language-agnostic due to the absence of lexical content.

Limitations of Prior Work: (1) Emotion labels in verbal speech are inevitably entangled with lexical and phonological features, which do not transfer across languages; (2) existing UDA methods still assume emotional supervision comes from labeled verbal speech; (3) emotion recognition from non-verbal vocalizations has only recently gained attention and has never been used as a supervision source for cross-lingual SER.

Key Challenge: A language-agnostic emotional supervision signal is needed — emotion in verbal speech is mixed with language-specific expression conventions, whereas non-verbal vocalizations offer a purer alternative.

Goal: To verify whether non-verbal vocalizations can serve as a stronger and more transferable source of emotional supervision for multilingual SER.

Key Insight: Non-verbal vocalizations (laughter/sobbing/sighs) originate from shared physiological mechanisms, with dominant features at the prosodic level — voicing, spectral tilt, intensity dynamics, and temporal modulation — all of which are naturally language-independent.

Core Idea: Model the hierarchical structure of emotion (coarse-grained emotion families → fine-grained categories → intensity) in hyperbolic space, discretize prosodic patterns via a hyperbolic VQ codebook, and align NVV emotion prototypes to UVS representations via optimal transport.

Method

Overall Architecture

Input NVV/UVS audio is processed by a shared self-supervised encoder (voc2vec/WavLM/wav2vec 2.0/MMS) to extract frame-level features, which are projected onto the Poincaré ball. Features are discretized via a hyperbolic VQ codebook, fused with continuous representations via Möbius addition, compressed through a bottleneck, calibrated for intensity via the Hyperbolic Emotion Lens, and aggregated through attentive pooling to obtain utterance-level embeddings. A classifier trained on labeled NVV data computes class prototypes; UVS representations are aligned to these prototypes via optimal transport with consistency regularization.
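
As a rough illustration of how encoder features land on the ball, here is a minimal sketch assuming the projection is the exponential map at the origin of a unit-curvature Poincaré ball (the paper's exact map and curvature are not stated here, so all names and constants below are illustrative):

```python
# Minimal sketch (not the authors' code): projecting frame-level encoder
# features onto the Poincaré ball via the exponential map at the origin,
# assuming unit curvature.
import torch

def expmap0(v: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Map Euclidean vectors (..., d) into the open unit Poincaré ball."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(norm) * v / norm   # ||output|| < 1 by construction

# e.g. frame-level features from a shared SSL encoder: (batch, frames, dim)
frames = torch.randn(2, 50, 256) * 0.1
ball_frames = expmap0(frames)            # points on the Poincaré ball
assert ball_frames.norm(dim=-1).max() < 1.0
```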

Key Designs

  1. Hyperbolic Prosody Vector-Quantized Codebook:

    • Function: Discretizes continuous prosodic features into a shared vocabulary of emotional patterns.
    • Mechanism: A codebook \(\mathcal{C}\) of size \(K=256\) is maintained on the Poincaré ball; each frame \(\mathbf{x}_t\) is assigned to its nearest codeword \(\mathbf{q}_t\) using Poincaré distance. Continuous frames and discrete tokens are fused via Möbius addition, followed by bottleneck projection.
    • Design Motivation: (1) VQ enforces discretization of prosodic patterns, enabling NVV and UVS to share the same prosodic vocabulary; (2) hyperbolic space is better suited than Euclidean space for encoding hierarchical relations, since emotions exhibit a tree-like structure from broad categories to subcategories. (A minimal code sketch of the codebook lookup appears after this list.)
  2. Hyperbolic Emotion Lens (HEL) + Optimal Transport Prototype Alignment:

    • Function: Calibrates emotional intensity discrepancies between NVV and UVS, and enables unsupervised transfer.
    • Mechanism: HEL adjusts the radial position of embeddings on the Poincaré ball via a learnable power-law radial transformation \(\alpha\) (proximity to the boundary indicates higher intensity). Fréchet means of labeled NVV data are computed as class prototypes \(\mu^{(c)}\). For UVS batches, Sinkhorn iterations solve an entropy-regularized optimal transport plan \(\Pi^*\), inducing soft pseudo-labels \(q_{cj} = n \Pi^*_{cj}\) for training.
    • Design Motivation: Emotional expressions in NVV tend to be more intense than in UVS (laughter is more salient than a smile). HEL's radial calibration bridges this intensity gap. Optimal transport is more flexible than hard clustering — it allows a UVS utterance to be matched to multiple emotion prototypes with varying weights.
  3. Shared Forward Pass + Consistency Regularization:

    • Function: Ensures NVV and UVS follow identical network paths, promoting alignment of the representation space.
    • Mechanism: NVV and UVS share all model parameters (encoder, projection layers, codebook, classifier). Consistency regularization stabilizes training on unlabeled UVS and reduces pseudo-label noise.
    • Design Motivation: If NVV and UVS use separate encoding paths, their representation spaces may diverge — shared parameters enforce both input types to be represented in a common space.
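
The sketch below illustrates the hyperbolic VQ lookup and Möbius fusion from design (1): a hedged, minimal sketch assuming unit curvature, where poincare_dist, mobius_add, and the toy tensors are illustrative names rather than the paper's API.

```python
# Hedged sketch of a hyperbolic VQ lookup with Möbius fusion (unit curvature).
import torch

def poincare_dist(x, y, eps=1e-6):
    """Pairwise Poincaré distance between x (n, d) and y (m, d) -> (n, m)."""
    x2 = x.pow(2).sum(-1, keepdim=True)                 # (n, 1)
    y2 = y.pow(2).sum(-1, keepdim=True).t()             # (1, m)
    sq = (x2 - 2 * x @ y.t() + y2).clamp_min(0)         # ||x - y||^2
    denom = (1 - x2).clamp_min(eps) * (1 - y2).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / denom)

def mobius_add(x, y, eps=1e-6):
    """Möbius addition x ⊕ y on the unit-curvature Poincaré ball."""
    x2 = x.pow(2).sum(-1, keepdim=True)
    y2 = y.pow(2).sum(-1, keepdim=True)
    xy = (x * y).sum(-1, keepdim=True)
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    den = (1 + 2 * xy + x2 * y2).clamp_min(eps)
    return num / den

K, d = 256, 64
codebook = torch.randn(K, d) * 0.05                     # toy codewords, assumed inside the ball
frames = torch.randn(100, d) * 0.05                     # toy frame embeddings on the ball

idx = poincare_dist(frames, codebook).argmin(dim=-1)    # nearest codeword per frame
quantized = codebook[idx]                               # discrete prosody tokens q_t
fused = mobius_add(frames, quantized)                   # continuous ⊕ discrete fusion
```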

Loss & Training

The overall objective is \(\mathcal{L} = L_S(\mathcal{B}_S) + \lambda_{\text{OPT}} L_{\text{OPT}}(\mathcal{B}_T) + \lambda_{\text{OT-CE}} L_{\text{OT-CE}}(\mathcal{B}_T)\). \(L_S\) is the supervised cross-entropy on labeled NVV; \(L_{\text{OPT}}\) encourages geometric alignment of UVS embeddings to the prototypes; \(L_{\text{OT-CE}}\) trains the classifier with transport-induced soft labels. Training uses AdamW for 30 epochs with cosine decay and 10% warmup.
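
To make the transfer step concrete, here is an illustrative sketch of the transport-induced pseudo-labels and the OT-CE term, assuming uniform marginals, a Poincaré-distance cost, and a simple power-law form for the HEL radial calibration (the parameterization, regularization strength, and loss weights are assumptions for illustration, not the paper's values):

```python
# Hedged sketch: HEL radial calibration + Sinkhorn-based soft pseudo-labels.
import torch
import torch.nn.functional as F

def poincare_dist(x, y, eps=1e-6):
    """Pairwise Poincaré distance between x (C, d) and y (n, d) -> (C, n)."""
    x2 = x.pow(2).sum(-1, keepdim=True)
    y2 = y.pow(2).sum(-1, keepdim=True).t()
    sq = (x2 - 2 * x @ y.t() + y2).clamp_min(0)
    return torch.acosh(1 + 2 * sq / ((1 - x2).clamp_min(eps) * (1 - y2).clamp_min(eps)))

def hel_calibrate(x, alpha, eps=1e-6):
    """Assumed power-law radial recalibration: keep direction, move norm to ||x||**alpha."""
    r = x.norm(dim=-1, keepdim=True).clamp(eps, 1 - eps)
    return x / r * r.pow(alpha)

def sinkhorn(cost, n_iters=50, reg=0.1):
    """Entropy-regularized OT plan with uniform marginals; cost has shape (C, n)."""
    C, n = cost.shape
    K = torch.exp(-cost / reg)
    a = torch.full((C,), 1.0 / C)       # prototype (class) marginal
    b = torch.full((n,), 1.0 / n)       # utterance marginal
    v = torch.ones(n)
    for _ in range(n_iters):
        u = a / (K @ v).clamp_min(1e-9)
        v = b / (K.t() @ u).clamp_min(1e-9)
    return u[:, None] * K * v[None, :]  # transport plan Pi*

C_classes, n, d = 5, 32, 64
prototypes = torch.randn(C_classes, d) * 0.05                 # stand-in for NVV Fréchet means
uvs_emb = hel_calibrate(torch.randn(n, d) * 0.05, alpha=0.8)  # intensity-calibrated UVS batch

plan = sinkhorn(poincare_dist(prototypes, uvs_emb))           # Pi*, shape (C, n)
soft_labels = (n * plan).t()                                  # q_j = n * Pi*_{:, j}; rows sum to ~1

logits = torch.randn(n, C_classes, requires_grad=True)        # stand-in classifier outputs on UVS
loss_ot_ce = -(soft_labels * F.log_softmax(logits, dim=-1)).sum(-1).mean()
# Full objective (weights are placeholders): L = L_S(NVV) + lambda_OPT * L_OPT + lambda_OT_CE * loss_ot_ce
```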

Key Experimental Results

Main Results

NVV→UVS Transfer (NOVA-ARC + voc2vec)

| Target Dataset | Language | NOVA-ARC Acc (%) | Direct Transfer Baseline (%) |
|---|---|---|---|
| ASVP-ESD (V) | Multilingual | 62.23 | 32.67 |
| MESD | Spanish | ~55 | 49.02 |
| AESDD | Greek | ~42 | 35.86 |
| RAVDESS | English | ~43 | 36.51 |
| Emo-DB | German | ~50 | 44.69 |

Ablation Study

  • Hyperbolic vs. Euclidean comparisons consistently demonstrate the superiority of hyperbolic space.
  • voc2vec (pre-trained specifically on non-verbal vocalizations) performs best on the NVV source domain, while WavLM/MMS are stronger on the UVS target domain.
  • NOVA-ARC also achieves the best performance in the V→V (verbal-to-verbal) transfer setting.

Key Findings

  • NVV→UVS transfer is viable: NOVA-ARC substantially outperforms the direct transfer baseline (by roughly 5–30 percentage points across targets), confirming that NVV carries effective cross-lingual emotional signal.
  • voc2vec is strongest on NVV but weakest on UVS, indicating that specialized encoders capture patterns unique to NVV.
  • The advantage of hyperbolic space is more pronounced in low-resource target domains, where hierarchical structural encoding provides a stronger inductive bias under data scarcity.

Highlights & Insights

  • Reframing SER as NVV→UVS transfer represents a paradigm-level innovation — it fundamentally reconceives the source of emotional supervision.
  • Hyperbolic space is a highly principled choice for emotion modeling, given that emotions exhibit a clear coarse-to-fine hierarchy (positive/negative → specific emotion → intensity).
  • The shared-parameter design is both elegant and critical — it ensures NVV and UVS are represented in a unified space.

Limitations & Future Work

  • The NVV dataset (ASVP-ESD) is limited in scale.
  • Emotion categories are unified into only 5 classes; finer-grained classification remains unvalidated.
  • Sensitivity analysis of hyperparameters such as codebook size is insufficient.
  • Validation across more languages and larger-scale settings is needed.

Comparison with Related Work

  • vs. Standard UDA SER: Prior methods still assume emotional supervision derives from verbal speech; NOVA-ARC uses NVV as a purer supervision source.
  • vs. Mote et al.: Uses KNN-based voice conversion for cross-lingual adaptation; NOVA-ARC employs optimal transport for prototype alignment.
  • vs. Phukan et al.: Focuses on NVV recognition itself; NOVA-ARC uses NVV as a bridge for transfer learning.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First to propose the NVV→UVS transfer paradigm; hyperbolic prosody codebook design is distinctly original.
  • Experimental Thoroughness: ⭐⭐⭐⭐ 6 datasets, 4 encoders, hyperbolic vs. Euclidean ablations.
  • Writing Quality: ⭐⭐⭐⭐ Compelling motivation and complete theoretical framework.
  • Value: ⭐⭐⭐⭐⭐ Opens an entirely new direction for low-resource SER research.