VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation¶
Conference: ICLR 2026 arXiv: 2602.06270 Code: None Area: Speech Emotion Recognition Keywords: Speech Emotion Recognition, Prosodic Features, Vowel-level, LLM Reasoning, GRPO
TL;DR¶
This paper proposes VowelPrompt, which extracts vowel-level prosodic descriptors (pitch/energy/duration) grounded in phonetic evidence, converts them into natural language to augment LLM emotion recognition prompts, and employs a two-stage SFT+GRPO training pipeline. The method consistently outperforms state-of-the-art approaches under zero-shot, fine-tuning, cross-domain, and cross-lingual conditions, while generating interpretable emotion reasoning.
Background & Motivation¶
Background: Speech Emotion Recognition (SER) has undergone three generations of development: handcrafted features via openSMILE → deep self-supervised features via wav2vec/HuBERT → LLM-based emotion recognition via text prompting. Two technical paradigms currently coexist: Audio LLMs (e.g., Qwen2-Audio) that directly process audio embeddings but remain opaque, and text-only prompting methods (e.g., SpeechCueLLM) that describe prosody in natural language but at coarse granularity (e.g., "the voice is very loud").
Limitations of Prior Work: Deep features lack interpretability and cannot explain why a given emotion is predicted; text prompting methods rely on sentence-level prosodic descriptions (e.g., "high pitch, fast speech rate"), which discard fine-grained syllable-level prosodic variation—yet emotional expression is often concentrated in specific stressed syllables.
Key Challenge: How can interpretability be maintained while achieving performance on par with or superior to opaque deep features?
Phonetic Rationale: Vowels are the primary carriers of emotional prosody—they are voiced, acoustically stable (with well-defined F0 and formants), and dominate utterances in both duration and energy. By contrast, consonants contribute minimally to prosodic cues.
Core Idea: Extract vowel-level (phoneme-level) prosodic feature descriptors, convert them into natural language embedded in prompts, and enable LLMs to jointly reason over lexical semantics and local prosodic information.
Method¶
Overall Architecture¶
Speech + transcript → MFA forced alignment → vowel segment extraction → computation of 6 LLDs (low-level descriptors) → speaker/vowel-type normalization → quantile discretization → natural language prosodic description → concatenation with transcript → joint LLM emotion reasoning. Training follows a two-stage pipeline: SFT (with reasoning traces generated by GPT-4o) → GRPO reinforcement learning.
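The final prompt-construction step of this pipeline can be sketched as follows. This is a minimal illustration: `build_prompt`, the triple format, and the exact prompt wording are assumptions for exposition, not the paper's actual template.

```python
def build_prompt(transcript, vowel_descriptions):
    """Concatenate the transcript with vowel-level prosodic descriptions.

    vowel_descriptions: list of (word, vowel, description) triples, e.g.
    ("believe", "IY", "very high pitch, high energy, long duration").
    """
    cue_lines = [
        f'- In "{word}", vowel /{vowel}/: {desc}'
        for word, vowel, desc in vowel_descriptions
    ]
    return (
        "Transcript: " + transcript + "\n"
        "Vowel-level prosody:\n" + "\n".join(cue_lines) + "\n"
        "Based on the transcript and the prosodic cues, reason step by step "
        "inside <think></think>, then give the emotion label inside "
        "<answer></answer>."
    )
```

The `<think>`/`<answer>` tag instruction mirrors the output format that the GRPO format reward later checks.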
Key Designs¶
- Vowel-level Feature Extraction:
- Forced alignment (MFA) obtains phoneme-level time boundaries → vowel segments are filtered using the IPA vowel inventory
- 6 LLDs: pitch mean, pitch slope, pitch variance, energy mean, energy variance, duration
- Two-stage normalization: speaker-level z-score → vowel-type-level normalization
- Quantile discretization (K levels, e.g., "very low/low/medium/high/very high") → natural language description
- Design Motivation: Vowels are the primary carriers of emotional prosody (acoustically stable, continuous voicing) and localize emotional cues more precisely than full-phoneme or sentence-level features
- Two-stage LLM Adaptation:
- SFT Stage: Small-scale training data + GPT-4o-generated reasoning traces (with explicit references to prosodic features), fine-tuning the LLM with CE loss
- GRPO Stage: \(R = R_{acc} + R_{format}\), accuracy reward (exact match) + format reward (completeness of \<think>/\<answer> tags), with KL constraint to prevent deviation from the SFT reference
- Design Motivation: SFT provides cold-start alignment; GRPO further improves reasoning quality and output format compliance
- Multilingual Extension:
- MFA supports 20+ languages → unified vowel representation via IPA → language-level normalization
- Vowel prosodic features are described in English (even when the input is French/German), leveraging the cross-lingual capabilities of multilingual LLMs
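The normalization-and-discretization steps above can be sketched in Python. This is a minimal single-speaker, single-vowel-type version (the paper applies a second, vowel-type-level normalization pass); the function name and the 5-level wording are assumptions.

```python
import numpy as np

LEVELS = ["very low", "low", "medium", "high", "very high"]

def describe_vowel_feature(values, k=5):
    """Discretize one LLD (e.g. per-vowel pitch means) into k verbal levels."""
    x = np.asarray(values, dtype=float)
    # Stage 1: speaker-level z-score (one speaker assumed here for brevity)
    z = (x - x.mean()) / (x.std() + 1e-8)
    # Quantile discretization: cut points at the k-1 inner quantiles,
    # so each verbal level covers roughly an equal share of the data
    cuts = np.quantile(z, np.linspace(0, 1, k + 1)[1:-1])
    return [LEVELS[b] for b in np.digitize(z, cuts)]

# e.g. describe_vowel_feature([180, 195, 210, 240, 290]) maps the lowest
# pitch mean to "very low" and the highest to "very high"
```

Quantile (rather than fixed-threshold) binning makes the verbal levels relative to the speaker's own distribution, which is what the two-stage normalization is designed to achieve.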
Loss & Training¶
SFT: standard CE loss. GRPO: within-group relative advantage + KL regularization, with two verifiable rewards for accuracy and format.
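The two verifiable rewards and the within-group relative advantage can be sketched as below. The KL regularization term and the policy update itself are omitted; the regex and function names are illustrative assumptions, not the paper's code.

```python
import re
import statistics

def reward(completion, gold_label):
    """R = R_acc + R_format, both verifiable from the raw completion."""
    # Format reward: well-formed <think>...</think><answer>...</answer> output
    m = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>", completion, re.S)
    r_format = 1.0 if m else 0.0
    # Accuracy reward: exact match of the predicted emotion label
    r_acc = 1.0 if (m and m.group(1).strip().lower() == gold_label.lower()) else 0.0
    return r_acc + r_format

def group_advantages(rewards):
    """GRPO advantage: each sampled completion scored relative to its group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]
```

Because both rewards are checkable by string matching, no learned reward model is needed; the KL constraint against the SFT reference (not shown) keeps the policy from drifting away from its cold-start behavior.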
Key Experimental Results¶
Main Results¶
| Dataset | Condition | VowelPrompt | Prev. SOTA | Gain |
|---|---|---|---|---|
| IEMOCAP | Fine-tuning | 72.8% WA | 68.5% | +4.3% |
| MELD | Zero-shot | 52.1% WA | 46.3% | +5.8% |
| CaFE (French) | Cross-lingual | 62.4% | 54.1% | +8.3% |
| EmoDB (German) | Cross-lingual | 78.9% | 71.2% | +7.7% |
Ablation Study¶
| Configuration | IEMOCAP WA | Note |
|---|---|---|
| VowelPrompt (full) | 72.8% | full |
| w/o prosodic descriptors | 65.3% | text-only |
| w/o GRPO | 70.1% | SFT-only |
| Consonant-level features | 68.7% | vowels > consonants |
| Shuffled prosody | 58.2% | confirms no spurious correlation |
Key Findings¶
- Vowel-level prosody significantly outperforms coarse sentence-level descriptions (IEMOCAP zero-shot: +1.2% UACC over SpeechCueLLM)
- The GRPO stage yields +2.7% WA improvement, primarily benefiting format compliance and cross-domain generalization
- Counterfactual experiments (shuffling prosodic description order, assigning prosodic features to incorrect vowels) confirm that the model genuinely leverages prosodic information rather than spurious correlations
- Vowel-level features outperform consonant-level features (ablation), and combining both yields no significant improvement—indicating vowels already capture the primary emotional cues
- Cross-lingual generalization: a model fine-tuned on English transfers effectively to French CaFE (+8.3%) and German EmoDB (+7.7%)
- Placebo experiments matching marginal distributions rule out statistical artifacts—random prosodic descriptions degrade performance to chance level
- Human evaluation: prosodic references in reasoning traces are rated as "linguistically plausible" by annotators at a rate exceeding 85%
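The shuffled-prosody counterfactual above amounts to a control that preserves the marginal distribution of description strings while breaking the vowel-description pairing. A minimal sketch (function name assumed):

```python
import random

def shuffle_prosody(vowel_descriptions, seed=0):
    """Reassign prosodic descriptions to the wrong vowels.

    Keeps the same multiset of description strings (so marginal statistics
    are unchanged) but destroys the vowel-cue correspondence; a model that
    genuinely uses prosody should degrade sharply under this control.
    """
    rng = random.Random(seed)
    vowels = [v for v, _ in vowel_descriptions]
    descs = [d for _, d in vowel_descriptions]
    rng.shuffle(descs)
    return list(zip(vowels, descs))
```

If accuracy under this control drops toward the text-only baseline (58.2% vs. 72.8% in the ablation table), the prosodic descriptions are doing real work rather than acting as spurious tokens.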
Highlights & Insights¶
- Interpretable Emotion Reasoning: The reasoning traces generated by the LLM explicitly identify which prosodic feature of which vowel drove the prediction; human evaluation finds >85% of such reasoning to be linguistically plausible
- Text-only Deployment: At inference time, only the transcript and prosodic description text are required, with no audio encoder running on GPU, substantially reducing deployment complexity
- Value of GRPO: Beyond accuracy gains, GRPO is critical for ensuring output format consistency (\<think>/\<answer> tags), which is essential in production environments
- The linguistic hypothesis that vowels serve as emotional anchors is thoroughly validated—vowel-level > consonant-level > sentence-level
Limitations & Future Work¶
- Depends on forced alignment quality—MFA alignment accuracy degrades in noisy environments or non-standard speech
- Prosodic descriptors are extracted from audio, so audio input is still required during inference (even though the LLM reasoning itself is text-only, preprocessing requires audio)
- Evaluation is limited to a small number of SER benchmarks such as IEMOCAP and MELD; applicability to additional domains (e.g., customer service, mental health) remains to be validated
- The behavior of vowel-level features in tonal languages (e.g., Chinese) is unexplored—interactions between tone and emotion may be considerably more complex
- The impact of GRPO hyperparameters (e.g., KL coefficient) on cross-domain generalization requires systematic ablation
Related Work & Insights¶
- vs SpeechCueLLM: Both use natural language to describe prosody, but SpeechCueLLM operates at coarse sentence-level granularity, whereas VowelPrompt resolves features at the individual vowel level
- vs Emotion-LLaMA: Emotion-LLaMA directly fuses audio embeddings into the LLM and is not interpretable; VowelPrompt's intermediate representations are fully human-readable
- vs wav2vec/HuBERT: Deep features are strong but opaque; VowelPrompt surpasses them on several benchmarks while providing reasoning explanations
- Insight: Text-augmented speech understanding is a paradigm worth further exploration—"translating" audio information into natural language and exploiting the reasoning capabilities of LLMs
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ An elegant combination of vowel-level prosody and LLM reasoning with clear linguistic motivation
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 5 datasets + 15 ablation/counterfactual experiments covering zero-shot/fine-tuning/cross-domain/cross-lingual settings
- Writing Quality: ⭐⭐⭐⭐ Detailed and comprehensive, though somewhat lengthy
- Value: ⭐⭐⭐⭐⭐ Interpretable, high-performing, and cross-lingual—representing a substantive advance for the SER field