VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation¶
Conference: ICLR 2026 arXiv: 2602.06270 Code: None Area: Speech Emotion Recognition Keywords: Speech Emotion Recognition, Prosodic Features, Vowel-level, LLM Reasoning, GRPO
TL;DR¶
This paper proposes VowelPrompt, which extracts vowel-level prosodic descriptors (pitch/energy/duration) grounded in phonetic evidence, converts them into natural language to augment LLM emotion recognition prompts, and employs a two-stage SFT+GRPO training pipeline. The method consistently outperforms state-of-the-art approaches under zero-shot, fine-tuning, cross-domain, and cross-lingual conditions, while generating interpretable emotion reasoning.
Background & Motivation¶
Background: Speech Emotion Recognition (SER) has undergone three generations of development: handcrafted features via openSMILE → deep self-supervised features via wav2vec/HuBERT → LLM-based emotion recognition via text prompting. Two technical paradigms currently coexist: Audio LLMs (e.g., Qwen2-Audio) that directly process audio embeddings but remain opaque, and text-only prompting methods (e.g., SpeechCueLLM) that describe prosody in natural language but at coarse granularity (e.g., "the voice is very loud").
Limitations of Prior Work: Deep features lack interpretability and cannot explain why a given emotion is predicted; text prompting methods rely on sentence-level prosodic descriptions (e.g., "high pitch, fast speech rate"), which discard fine-grained syllable-level prosodic variation—yet emotional expression is often concentrated in specific stressed syllables.
Key Challenge: How can interpretability be maintained while achieving performance on par with or superior to opaque deep features?
Phonetic Rationale: Vowels are the primary carriers of emotional prosody—they are voiced, acoustically stable (with well-defined F0 and formants), and dominate utterances in both duration and energy. By contrast, consonants contribute minimally to prosodic cues.
Core Idea: Extract vowel-level (phoneme-level) prosodic feature descriptors, convert them into natural language embedded in prompts, and enable LLMs to jointly reason over lexical semantics and local prosodic information.
Method¶
Overall Architecture¶
Speech + transcript → MFA forced alignment → vowel segment extraction → computation of 6 LLDs (low-level descriptors) → speaker/vowel-type normalization → quantile discretization → natural language prosodic description → concatenation with transcript → joint LLM emotion reasoning. Training follows a two-stage pipeline: SFT (with reasoning traces generated by GPT-4o) → GRPO reinforcement learning.
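The final prompt-construction step of this pipeline can be sketched as follows. This is a minimal illustration: `build_prompt`, the triple format, and the exact prompt wording are assumptions for exposition, not the paper's actual template.

```python
def build_prompt(transcript, vowel_descriptions):
    """Concatenate the transcript with vowel-level prosodic descriptions.

    vowel_descriptions: list of (word, vowel, description) triples, e.g.
    ("believe", "IY", "very high pitch, high energy, long duration").
    """
    cue_lines = [
        f'- In "{word}", vowel /{vowel}/: {desc}'
        for word, vowel, desc in vowel_descriptions
    ]
    return (
        "Transcript: " + transcript + "\n"
        "Vowel-level prosody:\n" + "\n".join(cue_lines) + "\n"
        "Based on the transcript and the prosodic cues, reason step by step "
        "inside <think></think>, then give the emotion label inside "
        "<answer></answer>."
    )
```

The `<think>`/`<answer>` tag instruction mirrors the output format that the GRPO format reward later checks.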
Key Designs¶
- Vowel-level Feature Extraction:
- Forced alignment (MFA) obtains phoneme-level time boundaries → vowel segments are filtered using the IPA vowel inventory
- 6 LLDs: pitch mean, pitch slope, pitch variance, energy mean, energy variance, duration
- Two-stage normalization: speaker-level z-score → vowel-type-level normalization
- Quantile discretization (K levels, e.g., "very low/low/medium/high/very high") → natural language description
- Design Motivation: Vowels are the primary carriers of emotional prosody (acoustically stable, continuous voicing) and localize emotional cues more precisely than full-phoneme or sentence-level features
- Two-stage LLM Adaptation:
- SFT Stage: Small-scale training data + GPT-4o-generated reasoning traces (with explicit references to prosodic features), fine-tuning the LLM with CE loss
- GRPO Stage: \(R = R_{acc} + R_{format}\), accuracy reward (exact match) + format reward (completeness of \<think>/\<answer> tags), with KL constraint to prevent deviation from the SFT reference
- Design Motivation: SFT provides cold-start alignment; GRPO further improves reasoning quality and output format compliance
- Multilingual Extension:
- MFA supports 20+ languages → unified vowel representation via IPA → language-level normalization
- Vowel prosodic features are described in English (even when the input is French/German), leveraging the cross-lingual capabilities of multilingual LLMs
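The normalization-and-discretization steps above can be sketched in Python. This is a minimal single-speaker, single-vowel-type version (the paper applies a second, vowel-type-level normalization pass); the function name and the 5-level wording are assumptions.

```python
import numpy as np

LEVELS = ["very low", "low", "medium", "high", "very high"]

def describe_vowel_feature(values, k=5):
    """Discretize one LLD (e.g. per-vowel pitch means) into k verbal levels."""
    x = np.asarray(values, dtype=float)
    # Stage 1: speaker-level z-score (one speaker assumed here for brevity)
    z = (x - x.mean()) / (x.std() + 1e-8)
    # Quantile discretization: cut points at the k-1 inner quantiles,
    # so each verbal level covers roughly an equal share of the data
    cuts = np.quantile(z, np.linspace(0, 1, k + 1)[1:-1])
    return [LEVELS[b] for b in np.digitize(z, cuts)]

# e.g. describe_vowel_feature([180, 195, 210, 240, 290]) maps the lowest
# pitch mean to "very low" and the highest to "very high"
```

Quantile (rather than fixed-threshold) binning makes the verbal levels relative to the speaker's own distribution, which is what the two-stage normalization is designed to achieve.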
Loss & Training¶
SFT: standard CE loss. GRPO: within-group relative advantage + KL regularization, with two verifiable rewards for accuracy and format.
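The two verifiable rewards and the within-group relative advantage can be sketched as below. The KL regularization term and the policy update itself are omitted; the regex and function names are illustrative assumptions, not the paper's code.

```python
import re
import statistics

def reward(completion, gold_label):
    """R = R_acc + R_format, both verifiable from the raw completion."""
    # Format reward: well-formed <think>...</think><answer>...</answer> output
    m = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>", completion, re.S)
    r_format = 1.0 if m else 0.0
    # Accuracy reward: exact match of the predicted emotion label
    r_acc = 1.0 if (m and m.group(1).strip().lower() == gold_label.lower()) else 0.0
    return r_acc + r_format

def group_advantages(rewards):
    """GRPO advantage: each sampled completion scored relative to its group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]
```

Because both rewards are checkable by string matching, no learned reward model is needed; the KL constraint against the SFT reference (not shown) keeps the policy from drifting away from its cold-start behavior.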
Key Experimental Results¶
Main Results¶
| Dataset | Condition | VowelPrompt | Prev. SOTA | Gain |
|---|---|---|---|---|
| IEMOCAP | Fine-tuning | 72.8% WA | 68.5% | +4.3% |
| MELD | Zero-shot | 52.1% WA | 46.3% | +5.8% |
| CaFE (French) | Cross-lingual | 62.4% | 54.1% | +8.3% |
| EmoDB (German) | Cross-lingual | 78.9% | 71.2% | +7.7% |
Ablation Study¶
| Configuration | IEMOCAP WA | Note |
|---|---|---|
| VowelPrompt (full) | 72.8% | full |
| w/o prosodic descriptors | 65.3% | text-only |
| w/o GRPO | 70.1% | SFT-only |
| Consonant-level features | 68.7% | vowels > consonants |
| Shuffled prosody | 58.2% | confirms no spurious correlation |
Key Findings¶
- Vowel-level prosody significantly outperforms coarse sentence-level descriptions (IEMOCAP zero-shot: +1.2% UACC over SpeechCueLLM)
- The GRPO stage yields +2.7% WA improvement, primarily benefiting format compliance and cross-domain generalization
- Counterfactual experiments (shuffling prosodic description order, assigning prosodic features to incorrect vowels) confirm that the model genuinely leverages prosodic information rather than spurious correlations
- Vowel-level features outperform consonant-level features (ablation), and combining both yields no significant improvement—indicating vowels already capture the primary emotional cues
- Cross-lingual generalization: a model fine-tuned on English transfers effectively to French CaFE (+8.3%) and German EmoDB (+7.7%)
- Placebo experiments matching marginal distributions rule out statistical artifacts—random prosodic descriptions degrade performance to chance level
- Human evaluation: prosodic references in reasoning traces are rated as "linguistically plausible" by annotators at a rate exceeding 85%
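The shuffled-prosody counterfactual above amounts to a control that preserves the marginal distribution of description strings while breaking the vowel-description pairing. A minimal sketch (function name assumed):

```python
import random

def shuffle_prosody(vowel_descriptions, seed=0):
    """Reassign prosodic descriptions to the wrong vowels.

    Keeps the same multiset of description strings (so marginal statistics
    are unchanged) but destroys the vowel-cue correspondence; a model that
    genuinely uses prosody should degrade sharply under this control.
    """
    rng = random.Random(seed)
    vowels = [v for v, _ in vowel_descriptions]
    descs = [d for _, d in vowel_descriptions]
    rng.shuffle(descs)
    return list(zip(vowels, descs))
```

If accuracy under this control drops toward the text-only baseline (58.2% vs. 72.8% in the ablation table), the prosodic descriptions are doing real work rather than acting as spurious tokens.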
Highlights & Insights¶
- Interpretable Emotion Reasoning: The reasoning traces generated by the LLM explicitly identify which prosodic feature of which vowel drove the prediction; human evaluation finds >85% of such reasoning to be linguistically plausible
- Text-only Deployment: At inference time, only the transcript and prosodic description text are required, with no audio encoder running on GPU, substantially reducing deployment complexity
- Value of GRPO: Beyond accuracy gains, GRPO is critical for ensuring output format consistency (\<think>/\<answer> tags), which is essential in production environments
- The linguistic hypothesis that vowels serve as emotional anchors is thoroughly validated—vowel-level > consonant-level > sentence-level
Limitations & Future Work¶
- Depends on forced alignment quality—MFA alignment accuracy degrades in noisy environments or non-standard speech
- Prosodic descriptors are extracted from audio, so audio input is still required during inference (even though the LLM reasoning itself is text-only, preprocessing requires audio)
- Evaluation is limited to a small number of SER benchmarks such as IEMOCAP and MELD; applicability to additional domains (e.g., customer service, mental health) remains to be validated
- The behavior of vowel-level features in tonal languages (e.g., Chinese) is unexplored—interactions between tone and emotion may be considerably more complex
- The impact of GRPO hyperparameters (e.g., KL coefficient) on cross-domain generalization requires systematic ablation
Related Work & Insights¶
- vs SpeechCueLLM: Both use natural language to describe prosody, but SpeechCueLLM operates at coarse sentence-level granularity, whereas VowelPrompt resolves features at the individual vowel level
- vs Emotion-LLaMA: Emotion-LLaMA directly fuses audio embeddings into the LLM and is not interpretable; VowelPrompt's intermediate representations are fully human-readable
- vs wav2vec/HuBERT: Deep features are strong but opaque; VowelPrompt surpasses them on several benchmarks while providing reasoning explanations
- Insight: Text-augmented speech understanding is a paradigm worth further exploration—"translating" audio information into natural language and exploiting the reasoning capabilities of LLMs
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ An elegant combination of vowel-level prosody and LLM reasoning with clear linguistic motivation
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 5 datasets + 15 ablation/counterfactual experiments covering zero-shot/fine-tuning/cross-domain/cross-lingual settings
- Writing Quality: ⭐⭐⭐⭐ Detailed and comprehensive, though somewhat lengthy
- Value: ⭐⭐⭐⭐⭐ Interpretable, high-performing, and cross-lingual—representing a substantive advance for the SER field