
VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation

Conference: ICLR 2026 · arXiv: 2602.06270 · Code: None · Area: Speech Emotion Recognition · Keywords: Speech Emotion Recognition, Prosodic Features, Vowel-level, LLM Reasoning, GRPO

TL;DR

This paper proposes VowelPrompt, which extracts vowel-level prosodic descriptors (pitch/energy/duration) grounded in phonetic evidence, converts them into natural language to augment LLM emotion recognition prompts, and employs a two-stage SFT+GRPO training pipeline. The method consistently outperforms state-of-the-art approaches under zero-shot, fine-tuning, cross-domain, and cross-lingual conditions, while generating interpretable emotion reasoning.

Background & Motivation

Background: Speech Emotion Recognition (SER) has undergone three generations of development: handcrafted features via openSMILE → deep self-supervised features via wav2vec/HuBERT → LLM-based emotion recognition via text prompting. Two technical paradigms currently coexist: Audio LLMs (e.g., Qwen2-Audio) that directly process audio embeddings but remain opaque, and text-only prompting methods (e.g., SpeechCueLLM) that describe prosody in natural language but at coarse granularity (e.g., "the voice is very loud").

Limitations of Prior Work: Deep features lack interpretability and cannot explain why a given emotion is predicted; text prompting methods rely on sentence-level prosodic descriptions (e.g., "high pitch, fast speech rate"), which discard fine-grained syllable-level prosodic variation—yet emotional expression is often concentrated in specific stressed syllables.

Key Challenge: How can interpretability be maintained while achieving performance on par with or superior to opaque deep features?

Phonetic Rationale: Vowels are the primary carriers of emotional prosody—they are voiced, acoustically stable (with well-defined F0 and formants), and dominate utterances in both duration and energy. By contrast, consonants contribute minimally to prosodic cues.

Core Idea: Extract vowel-level (phoneme-level) prosodic feature descriptors, convert them into natural language embedded in prompts, and enable LLMs to jointly reason over lexical semantics and local prosodic information.

Method

Overall Architecture

Speech + transcript → MFA forced alignment → vowel segment extraction → computation of 6 LLDs → speaker/vowel-type normalization → quantile discretization → natural-language prosodic description → concatenation with the transcript → joint emotion reasoning by the LLM. Training follows a two-stage pipeline: SFT (with reasoning traces generated by GPT-4o) → GRPO reinforcement learning.
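
As a concrete illustration of what this pipeline hands to the LLM, the sketch below builds the kind of prompt described above. It is a hypothetical reconstruction (the transcript, descriptor wording, and instruction text are my assumptions; the paper releases no code), not the authors' actual prompt template.

```python
# Hypothetical prompt construction; descriptor and instruction wording is assumed.
transcript = "I can't believe you did that."

# One natural-language descriptor per MFA-aligned vowel segment (output of the
# LLD -> normalization -> quantile-discretization steps described above).
vowel_descriptions = [
    "vowel /ae/ in \"can't\": very high pitch, rising pitch slope, high energy, long duration",
    "vowel /i/ in \"believe\": high pitch, medium energy, medium duration",
    "vowel /ae/ in \"that\": high energy variance, short duration",
]

prompt = (
    f"Transcript: {transcript}\n"
    "Vowel-level prosody:\n"
    + "".join(f"- {d}\n" for d in vowel_descriptions)
    + "Considering both the words and the vowel-level prosodic cues, reason inside "
      "<think></think>, then give a single emotion label inside <answer></answer>."
)
print(prompt)
```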

Key Designs

  1. Vowel-level Feature Extraction:

    • Forced alignment (MFA) obtains phoneme-level time boundaries → vowel segments are filtered using the IPA vowel inventory
    • 6 LLDs: pitch mean, pitch slope, pitch variance, energy mean, energy variance, duration
    • Two-stage normalization: speaker-level z-score → vowel-type-level normalization
    • Quantile discretization (K levels, e.g., "very low/low/medium/high/very high") → natural language description (see the sketch after this list)
    • Design Motivation: Vowels are the primary carriers of emotional prosody (acoustically stable, continuous voicing) and localize emotional cues more precisely than full-phoneme or sentence-level features
  2. Two-stage LLM Adaptation:

    • SFT Stage: Small-scale training data + GPT-4o-generated reasoning traces (with explicit references to prosodic features), fine-tuning the LLM with CE loss
    • GRPO Stage: \(R = R_{acc} + R_{format}\), accuracy reward (exact match) + format reward (completeness of <think>/<answer> tags), with KL constraint to prevent deviation from the SFT reference
    • Design Motivation: SFT provides cold-start alignment; GRPO further improves reasoning quality and output format compliance
  3. Multilingual Extension:

    • MFA supports 20+ languages → unified vowel representation via IPA → language-level normalization
    • Vowel prosodic features are described in English (even when the input is French/German), leveraging the cross-lingual capabilities of multilingual LLMs
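
The sketch below illustrates the per-vowel feature-to-text step referenced in item 1: speaker-level z-scoring, quantile discretization against a reference distribution (using a per-vowel-type reference stands in for the vowel-type-level normalization here), and rendering as a verbal level. Function names, the K = 5 level names, and the toy numbers are my assumptions; only three of the six LLDs are shown.

```python
import numpy as np

LEVELS = ["very low", "low", "medium", "high", "very high"]  # K = 5 assumed

def zscore(x, speaker_mean, speaker_std):
    """Speaker-level normalization; statistics are computed per speaker."""
    return (x - speaker_mean) / (speaker_std + 1e-8)

def discretize(value, reference, levels=LEVELS):
    """Map a normalized LLD value to a verbal level using quantile bins of a
    reference distribution (e.g., all training tokens of the same vowel type)."""
    edges = np.quantile(reference, np.linspace(0, 1, len(levels) + 1)[1:-1])
    return levels[int(np.searchsorted(edges, value))]

def describe_vowel(vowel, word, pitch, energy, duration, ref):
    """Render a subset of the 6 LLDs of one vowel segment as natural language."""
    return (f"vowel /{vowel}/ in '{word}': "
            f"{discretize(pitch, ref['pitch'])} pitch, "
            f"{discretize(energy, ref['energy'])} energy, "
            f"{discretize(duration, ref['duration'])} duration")

# Toy example: the reference distributions would come from the training corpus.
rng = np.random.default_rng(0)
ref = {k: rng.standard_normal(1000) for k in ("pitch", "energy", "duration")}
pitch = zscore(228.0, speaker_mean=190.0, speaker_std=25.0)  # raw F0 mean in Hz
print(describe_vowel("ae", "can't", pitch, energy=0.9, duration=1.6, ref=ref))
```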

Loss & Training

SFT: standard CE loss. GRPO: within-group relative advantage + KL regularization, with two verifiable rewards for accuracy and format.
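
A minimal sketch of the two verifiable rewards and the group-relative advantage used by GRPO. The regex patterns, equal reward weighting, and per-group standardization are assumptions consistent with the description above; the KL penalty against the SFT reference enters the policy loss itself and is omitted here.

```python
import re
import numpy as np

def format_reward(completion: str) -> float:
    """1.0 if the output contains <think>...</think> followed by <answer>...</answer>."""
    ok = re.search(r"<think>.+?</think>\s*<answer>.+?</answer>", completion, re.DOTALL)
    return 1.0 if ok else 0.0

def accuracy_reward(completion: str, gold: str) -> float:
    """1.0 on exact match between the label inside <answer> and the gold emotion."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if m and m.group(1).strip().lower() == gold.lower() else 0.0

def total_reward(completion: str, gold: str) -> float:
    return accuracy_reward(completion, gold) + format_reward(completion)  # R = R_acc + R_format

def group_relative_advantages(rewards):
    """GRPO advantage: each completion's reward, standardized within the group
    of completions sampled for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy group of 4 completions sampled for one prompt whose gold label is "angry".
group = [
    "<think>very high pitch on the stressed vowel</think><answer>angry</answer>",
    "<think>flat, low-energy vowels</think><answer>neutral</answer>",
    "angry",  # no tags: no format reward, and the answer cannot be matched
    "<think>loud, long vowels</think><answer>angry</answer>",
]
rewards = [total_reward(c, "angry") for c in group]
print(rewards, group_relative_advantages(rewards))
```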

Key Experimental Results

Main Results

Dataset          Condition       VowelPrompt   Prev. SOTA   Gain
IEMOCAP          Fine-tuning     72.8% WA      68.5%        +4.3%
MELD             Zero-shot       52.1% WA      46.3%        +5.8%
CaFE (French)    Cross-lingual   62.4%         54.1%        +8.3%
EmoDB (German)   Cross-lingual   78.9%         71.2%        +7.7%

Ablation Study

Configuration              IEMOCAP WA   Note
VowelPrompt (full)         72.8%        full
w/o prosodic descriptors   65.3%        text-only
w/o GRPO                   70.1%        SFT-only
Consonant-level features   68.7%        vowels > consonants
Shuffled prosody           58.2%        confirms no spurious correlation

Key Findings

  • Vowel-level prosody significantly outperforms coarse sentence-level descriptions (IEMOCAP zero-shot: +1.2% UACC over SpeechCueLLM)
  • The GRPO stage yields +2.7% WA improvement, primarily benefiting format compliance and cross-domain generalization
  • Counterfactual experiments (shuffling prosodic description order, assigning prosodic features to incorrect vowels) confirm that the model genuinely leverages prosodic information rather than spurious correlations; a sketch of this control follows the list
  • Vowel-level features outperform consonant-level features (ablation), and combining both yields no significant improvement—indicating vowels already capture the primary emotional cues
  • Cross-lingual generalization: a model fine-tuned on English transfers effectively to French CaFE (+8.3%) and German EmoDB (+7.7%)
  • Placebo experiments matching marginal distributions rule out statistical artifacts—random prosodic descriptions degrade performance to chance level
  • Human evaluation: prosodic references in reasoning traces are rated as "linguistically plausible" by annotators at a rate exceeding 85%
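
To make the counterfactual controls concrete, the sketch below shows one way the "shuffled prosody" condition could be built: the same descriptors are kept but reassigned to the wrong vowels, destroying the prosody-to-vowel mapping while preserving the lexical content. This is my own illustration of such a control, not the authors' script.

```python
import random

def shuffle_prosody(vowel_descriptions, seed=0):
    """Counterfactual control: keep the same prosodic descriptors but attach them
    to the wrong vowels, so only the vowel-to-prosody mapping is destroyed."""
    anchors = [d.split(":", 1)[0] for d in vowel_descriptions]  # e.g. "vowel /ae/ in 'that'"
    feats = [d.split(":", 1)[1] for d in vowel_descriptions]    # e.g. " low pitch, ..."
    rng = random.Random(seed)
    rng.shuffle(feats)
    return [a + ":" + f for a, f in zip(anchors, feats)]

original = [
    "vowel /ae/ in 'can't': very high pitch, high energy",
    "vowel /i/ in 'believe': medium pitch, medium energy",
    "vowel /ae/ in 'that': low pitch, short duration",
]
print(shuffle_prosody(original))
```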

Highlights & Insights

  • Interpretable Emotion Reasoning: The reasoning traces generated by the LLM explicitly identify which prosodic feature of which vowel drove the prediction—human evaluation finds >85% of such reasoning to be linguistically plausible
  • Text-only Deployment: At inference time, only the transcript and prosodic description text are required, with no audio encoder running on GPU—substantially reducing deployment complexity
  • Value of GRPO: Beyond accuracy gains, GRPO is critical for ensuring output format consistency (<think>/<answer> tags), which is essential in production environments
  • The linguistic hypothesis that vowels serve as emotional anchors is thoroughly validated—vowel-level > consonant-level > sentence-level
  • Deployment advantage of audio-encoder-free inference: Only a text LLM is required at inference time, with prosodic information passed as text, greatly simplifying deployment architecture

Limitations & Future Work

  • Depends on forced alignment quality—MFA alignment accuracy degrades in noisy environments or non-standard speech
  • Prosodic descriptors are extracted from audio, so audio input and signal-processing preprocessing are still required at inference time, even though the LLM reasoning itself runs on text only
  • Evaluation is limited to a small number of SER benchmarks such as IEMOCAP and MELD; applicability to additional domains (e.g., customer service, mental health) remains to be validated
  • The behavior of vowel-level features in tonal languages (e.g., Chinese) is unexplored—interactions between tone and emotion may be considerably more complex
  • The impact of GRPO hyperparameters (e.g., KL coefficient) on cross-domain generalization requires systematic ablation

Comparison with Related Work

  • vs SpeechCueLLM: Both use natural language to describe prosody, but SpeechCueLLM operates at coarse sentence-level granularity, whereas VowelPrompt resolves features at the individual vowel level
  • vs Emotion-LLaMA: Emotion-LLaMA directly fuses audio embeddings into the LLM and is not interpretable; VowelPrompt's intermediate representations are fully human-readable
  • vs wav2vec/HuBERT: Deep features are strong but opaque; VowelPrompt surpasses them on several benchmarks while providing reasoning explanations
  • Insight: Text-augmented speech understanding is a paradigm worth further exploration—"translating" audio information into natural language and exploiting the reasoning capabilities of LLMs

Rating

  • Novelty: ⭐⭐⭐⭐⭐ An elegant combination of vowel-level prosody and LLM reasoning with clear linguistic motivation
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 5 datasets + 15 ablation/counterfactual experiments covering zero-shot/fine-tuning/cross-domain/cross-lingual settings
  • Writing Quality: ⭐⭐⭐⭐ Detailed and comprehensive, though somewhat lengthy
  • Value: ⭐⭐⭐⭐⭐ Interpretable, high-performing, and cross-lingual—representing a substantive advance for the SER field