EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning¶
Conference: ICLR 2026 arXiv: 2601.15668 Code: Available Area: Audio & Speech Keywords: Speech Emotion Recognition, Explainable Reasoning, Reinforcement Learning, Prosody-Aware, Chain-of-Thought
TL;DR¶
This work is the first to reformulate Speech Emotion Recognition (SER) as a deep reasoning problem, leveraging a prosody-enhanced backbone model combined with GRPO-PTR (Progressive Trustworthy Reasoning reward) reinforcement learning to generate explainable emotion reasoning grounded in acoustic evidence.
Background & Motivation¶
- Current SpeechLLMs still treat emotion recognition as a simple classification task, producing labels without explaining "why."
- Existing SFT-based descriptive methods remain at the level of acoustic feature description, lacking a causal reasoning chain from acoustic observations to emotional judgments.
- Three major challenges:
  - Absence of high-quality reasoning datasets; existing emotion corpora lack fine-grained acoustic annotations.
  - Weak prosody perception in SpeechLLMs (insufficient sensitivity to pitch, energy, speaking rate, and stress).
  - Standard RL relies solely on rule-based rewards (outcome accuracy), which cannot supervise open-ended reasoning quality.
Method¶
Overall Architecture¶
Three-stage training pipeline:

1. Data Construction: Build the EmotionCoT-35K prosody-aware CoT reasoning dataset.
2. Prosody-Enhanced SFT: Train a prosody-aware backbone model, EmotionThinker-Base, on Qwen2.5-Omni-7B.
3. GRPO-PTR Reinforcement Learning: Progressively introduce trustworthy reasoning rewards to refine reasoning quality.
Key Designs¶
EmotionCoT-35K Dataset Construction:

- 35K audio–reasoning pairs (~200 hours) covering 9 emotion categories (Neutral/Happy/Sad/Angry/Contempt/Confused/Whisper/Surprise/Fear).
- Sources: IEMOCAP, MELD, Expresso, MEAD, EARS.
- Automated annotation pipeline extracting:
  - Low-level features: speaking rate, pitch, energy (standard speech tools).
  - Stressed words: identified from transcripts via WhiStress.
  - Intonation contours: frame-level pitch–energy trajectories smoothed with Savitzky-Golay filtering, then classified into coarse-grained styles (expressive/flat) and fine-grained patterns (rising/falling/rise-fall/fall-rise); a sketch of this step follows the list.
  - Speaker attributes: gender and age group via a wav2vec2.0 classifier.
- All prosodic annotations are used as contextual prompts for GPT-4o to generate step-by-step reasoning trajectories.
- The first prosody-aware CoT dataset, covering dimensions far beyond existing speech description datasets.
Prosody-Enhanced SFT (EmotionThinker-Base):

- ~500 hours of prosody-enhanced data across four task types (a sketch of the comparative task follows this list):
  1. Word-level stress perception (Stress-17K dataset).
  2. Prosodic attribute classification (pitch/energy/speaking rate/intonation level).
  3. Comparative prosody enhancement: utterances with modified prosodic parameters are concatenated, and the model is trained to identify the correct ordering.
  4. Cold-start reasoning on 5K EmotionCoT samples.
- Joint optimization of the audio encoder, audio adapter, and LLM backbone.
GRPO-PTR: Progressive Trustworthy Reasoning Reward:
Three reward signals:

1. Format Reward \(R_f\): binary (0/1) signal indicating whether the output follows the think/answer XML format.
2. Outcome Accuracy Reward \(R_o\): binary (0/1) signal indicating whether the predicted label matches the ground truth.
3. Reasoning Quality Reward \(R_t\): scored by a trained reward model (Qwen2.5-Omni-3B fine-tuned on 101.4K samples) across four dimensions (1–5 scale):
   - Factual alignment
   - Interpretative quality
   - Caption completeness
   - Fluency and structural clarity
Trustworthiness Weight \(\tau\):

- The sampled responses for an input are split into two groups (correct vs. incorrect outcomes), and the mean reasoning reward of each group is computed.
- When the correct group's mean reasoning reward is \(\geq\) the incorrect group's, \(\tau = 1\); otherwise \(\tau = \exp(\bar{R}_t^{\text{correct}} - \bar{R}_t^{\text{incorrect}}) < 1\), exponentially suppressing the reasoning reward.
- Acts as a group-level alignment gate to prevent reward hacking, where high reasoning scores co-occur with wrong answers.
- In essence, the reasoning reward signal is trusted only when reasoning quality and outcome correctness are consistent at the group level.
Progressive Scheduling:

- Early training uses only \(R_o + R_f\) until emotion accuracy stabilizes at ~50%.
- \(R_t\) is introduced afterward, avoiding interference from multiple unstable reward signals during early convergence.
Loss & Training¶
- Final reward: \(R_i = 0.3 \cdot R_f + 1.0 \cdot R_o + 0.5 \cdot \tau \cdot R_t\) (see the combined sketch after this list).
- KL divergence coefficient: 0.04; learning rate: 1e-6; \(K=8\) candidates sampled per input.
- RL training for 3,000 steps based on Qwen2.5-Omni-7B.
Key Experimental Results¶
Main Results¶
| Model | IEMOCAP | MELD | RAVDESS | SAVEE | Avg Acc | Reasoning Quality Avg |
|---|---|---|---|---|---|---|
| Kimi-Audio | 57.72 | 59.13 | 61.07 | 55.21 | 58.83 | 2.72 |
| BLSP-Emo | 76.00 | 57.30 | 72.00 | 63.73 | 65.41 | 2.73 |
| Qwen2.5-Omni-7B | 45.70 | 54.64 | 64.77 | 52.49 | 50.83 | 2.87 |
| MiniCPM-O | 35.54 | 52.78 | 40.93 | 35.47 | 43.60 | 3.01 |
| EmotionThinker | 77.68 | 59.71 | 71.56 | 73.96 | 68.89 | 3.98 |
EmotionThinker achieves the highest average emotion accuracy (68.89%) among 16 open-source models (selected baselines shown above) and substantially outperforms all of them in reasoning quality (3.98 vs. 3.04 for the second-best model).
Prosody perception test (accuracy, %):

| Model | Pitch | Speaking Rate | Energy | Intonation | Stress |
|---|---|---|---|---|---|
| Qwen2.5-Omni-7B | 25.71 | 29.94 | 27.67 | 25.83 | 30.24 |
| EmotionThinker-Base | 75.11 | 68.70 | 69.42 | 60.25 | 71.50 |
Ablation Study¶
| Variant | SER Acc | Reasoning Quality |
|---|---|---|
| Qwen2.5-Omni-7B (Baseline 1) | 50.83 | 2.87 |
| EmotionThinker-Base (Baseline 2) | 52.63 | 3.41 |
| SFT (V1) | 53.91 | 3.78 |
| GRPO (V2) | 62.91 | 3.45 |
| GRPO-PTR w/o trained RM (V3) | 66.67 | 3.36 |
| GRPO-PTR w/o trustworthiness weight (V4) | 67.71 | 3.74 |
| GRPO-PTR w/o progressive scheduling (V5) | 62.80 | 3.76 |
| GRPO-PTR full (V6) | 68.89 | 3.98 |
Key Findings¶
- SFT improves reasoning quality but yields limited accuracy gains; GRPO substantially boosts accuracy but at the cost of reasoning quality; GRPO-PTR achieves both simultaneously.
- Replacing the trained reward model with an untrained one (V3 vs. V6) introduces noisy reasoning-quality scores, confirming that fine-tuning the reward model is critical.
- Removing the trustworthiness weight (V4) has a minor impact on accuracy but degrades reasoning quality, indicating that \(\tau\) primarily prevents logically flawed reasoning from being rewarded.
- Disabling progressive scheduling (V5) causes a substantial accuracy drop to 62.80%, highlighting the stability challenges of multi-signal RL.
- Varying \(K\) from 4 to 16 yields limited performance differences; \(K=8\) is selected as an efficiency–performance trade-off.
Highlights & Insights¶
- First work to reformulate SER from a classification problem into an RL-driven deep reasoning task.
- Prosody-enhanced SFT is a critical prerequisite: without prosody perception capability, reasoning cannot be grounded in genuine acoustic cues.
- The trustworthiness weight \(\tau\) in GRPO-PTR is an elegant design; the group-level alignment mechanism effectively prevents reward hacking.
- The four-dimensional reasoning quality evaluation framework is transferable to reasoning quality assessment in other modalities.
- Human evaluation and GPT-based automatic evaluation rankings are consistent, validating the reliability of the evaluation scheme.
Limitations & Future Work¶
- The reward model is fine-tuned from only a 3B model, which may introduce evaluation bias.
- The nine emotion categories may lack sufficient granularity (e.g., they cannot capture sarcasm or mixed emotions).
- Validation is conducted exclusively on English datasets; cross-lingual generalization remains unknown.
- Reasoning generation increases inference latency, limiting applicability in real-time scenarios.
Related Work & Insights¶
- Conceptually aligned with DeepSeek-R1 (RL-incentivized reasoning), but extended to the speech modality with task-specific PTR customized for emotion recognition.
- Advances beyond descriptive methods such as SECap and OSUM-EChat by establishing a causal chain from acoustic features to emotional inference.
- The prosody-enhanced SFT strategy (particularly the comparative enhancement tasks) is generalizable to other speech understanding tasks.
Rating¶
- Novelty: 5/5 (First RL-driven explainable speech emotion reasoning framework; PTR strategy is highly original)
- Experimental Thoroughness: 4/5 (Four benchmarks, 16 baselines, human evaluation, and comprehensive ablation studies)
- Writing Quality: 4/5 (Modular and clear presentation; rigorous mathematical formulation)
- Value: 5/5 (Establishes a new paradigm for speech emotion reasoning; methodology is transferable)