EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning¶
Conference: ICLR 2026 arXiv: 2601.15668 Code: Available Area: Audio & Speech Keywords: Speech Emotion Recognition, Explainable Reasoning, Reinforcement Learning, Prosody-Aware, Chain-of-Thought
TL;DR¶
This work is the first to reformulate Speech Emotion Recognition (SER) as a deep reasoning problem, leveraging a prosody-enhanced backbone model combined with GRPO-PTR (Progressive Trustworthy Reasoning reward) reinforcement learning to generate explainable emotion reasoning grounded in acoustic evidence.
Background & Motivation¶
- Current SpeechLLMs still treat emotion recognition as a simple classification task, producing labels without explaining "why."
- Existing SFT-based descriptive methods remain at the level of acoustic feature description, lacking a causal reasoning chain from acoustic observations to emotional judgments.
- Three major challenges:
  - Absence of high-quality reasoning datasets; existing emotion corpora lack fine-grained acoustic annotations.
  - Weak prosody perception in SpeechLLMs (insufficient sensitivity to pitch, energy, speaking rate, and stress).
  - Standard RL relies solely on rule-based rewards (outcome accuracy), which cannot supervise open-ended reasoning quality.
Method¶
Overall Architecture¶
Three-stage training pipeline:

1. Data Construction: Build the EmotionCoT-35K prosody-aware CoT reasoning dataset.
2. Prosody-Enhanced SFT: Train a prosody-aware backbone model, EmotionThinker-Base, on Qwen2.5-Omni-7B.
3. GRPO-PTR Reinforcement Learning: Progressively introduce trustworthy reasoning rewards to refine reasoning quality.
Key Designs¶
EmotionCoT-35K Dataset Construction:

- 35K audio–reasoning pairs (~200 hours) covering 9 emotion categories (Neutral/Happy/Sad/Angry/Contempt/Confused/Whisper/Surprise/Fear).
- Sources: IEMOCAP, MELD, Expresso, MEAD, EARS.
- Automated annotation pipeline extracting:
  - Low-level features: speaking rate, pitch, energy (standard speech tools).
  - Stressed words: identified from transcripts via WhiStress.
  - Intonation contours: frame-level pitch–energy trajectories smoothed with Savitzky-Golay filtering, then classified into coarse-grained styles (expressive/flat) and fine-grained patterns (rising/falling/rise-fall/fall-rise); a sketch of this step follows the list.
  - Speaker attributes: gender and age group via a wav2vec2.0 classifier.
- All prosodic annotations are used as contextual prompts for GPT-4o to generate step-by-step reasoning trajectories.
- The first prosody-aware CoT dataset, covering dimensions far beyond existing speech description datasets.
Prosody-Enhanced SFT (EmotionThinker-Base):

- ~500 hours of prosody-enhanced data across four task types (a sketch of the comparative task follows this list):
  1. Word-level stress perception (Stress-17K dataset).
  2. Prosodic attribute classification (pitch/energy/speaking rate/intonation level).
  3. Comparative prosody enhancement: utterances with modified prosodic parameters are concatenated, and the model is trained to identify the correct ordering.
  4. Cold-start reasoning on 5K EmotionCoT samples.
- Joint optimization of the audio encoder, audio adapter, and LLM backbone.
GRPO-PTR: Progressive Trustworthy Reasoning Reward:
Three reward signals:

1. Format Reward \(R_f\): binary (0/1) signal indicating whether the output follows the think/answer XML format.
2. Outcome Accuracy Reward \(R_o\): binary (0/1) signal indicating whether the predicted label matches the ground truth.
3. Reasoning Quality Reward \(R_t\): scored by a trained reward model (Qwen2.5-Omni-3B fine-tuned on 101.4K samples) across four dimensions (1–5 scale):
   - Factual alignment
   - Interpretative quality
   - Caption completeness
   - Fluency and structural clarity
Trustworthiness Weight \(\tau\):

- The sampled responses for an input are split into two groups (correct vs. incorrect outcomes), and the mean reasoning reward of each group is computed.
- When the correct group's mean reasoning reward is \(\geq\) the incorrect group's, \(\tau = 1\); otherwise \(\tau = \exp(\bar{R}_t^{\text{correct}} - \bar{R}_t^{\text{incorrect}}) < 1\), exponentially suppressing the reasoning reward.
- Acts as a group-level alignment gate to prevent reward hacking, where high reasoning scores co-occur with wrong answers.
- In essence, the reasoning reward signal is trusted only when reasoning quality and outcome correctness are consistent at the group level.
Progressive Scheduling:

- Early training uses only \(R_o + R_f\) until emotion accuracy stabilizes at ~50%.
- \(R_t\) is introduced afterward, avoiding interference from multiple unstable reward signals during early convergence.
Loss & Training¶
- Final reward: \(R_i = 0.3 \cdot R_f + 1.0 \cdot R_o + 0.5 \cdot \tau \cdot R_t\) (see the combined sketch after this list).
- KL divergence coefficient: 0.04; learning rate: 1e-6; \(K=8\) candidates sampled per input.
- RL training for 3,000 steps based on Qwen2.5-Omni-7B.
Key Experimental Results¶
Main Results¶
| Model | IEMOCAP | MELD | RAVDESS | SAVEE | Avg Acc | Reasoning Quality Avg |
|---|---|---|---|---|---|---|
| Kimi-Audio | 57.72 | 59.13 | 61.07 | 55.21 | 58.83 | 2.72 |
| BLSP-Emo | 76.00 | 57.30 | 72.00 | 63.73 | 65.41 | 2.73 |
| Qwen2.5-Omni-7B | 45.70 | 54.64 | 64.77 | 52.49 | 50.83 | 2.87 |
| MiniCPM-O | 35.54 | 52.78 | 40.93 | 35.47 | 43.60 | 3.01 |
| EmotionThinker | 77.68 | 59.71 | 71.56 | 73.96 | 68.89 | 3.98 |
EmotionThinker achieves the highest average emotion accuracy (68.89%) among 16 open-source models (selected baselines shown above) and substantially outperforms all of them in reasoning quality (3.98 vs. 3.04 for the second-best model).
Prosody perception test (accuracy, %):

| Model | Pitch | Speaking Rate | Energy | Intonation | Stress |
|---|---|---|---|---|---|
| Qwen2.5-Omni-7B | 25.71 | 29.94 | 27.67 | 25.83 | 30.24 |
| EmotionThinker-Base | 75.11 | 68.70 | 69.42 | 60.25 | 71.50 |
Ablation Study¶
| Variant | SER Acc | Reasoning Quality |
|---|---|---|
| Qwen2.5-Omni-7B (Baseline 1) | 50.83 | 2.87 |
| EmotionThinker-Base (Baseline 2) | 52.63 | 3.41 |
| SFT (V1) | 53.91 | 3.78 |
| GRPO (V2) | 62.91 | 3.45 |
| GRPO-PTR w/o trained RM (V3) | 66.67 | 3.36 |
| GRPO-PTR w/o trustworthiness weight (V4) | 67.71 | 3.74 |
| GRPO-PTR w/o progressive scheduling (V5) | 62.80 | 3.76 |
| GRPO-PTR full (V6) | 68.89 | 3.98 |
Key Findings¶
- SFT improves reasoning quality but yields limited accuracy gains; GRPO substantially boosts accuracy but at the cost of reasoning quality; GRPO-PTR achieves both simultaneously.
- Replacing the trained reward model with an untrained one (V3 vs. V6) introduces noisy reasoning-quality scores, confirming that fine-tuning the reward model is critical.
- Removing the trustworthiness weight (V4) has a minor impact on accuracy but degrades reasoning quality, indicating that \(\tau\) primarily prevents logically flawed reasoning from being rewarded.
- Disabling progressive scheduling (V5) causes a substantial accuracy drop to 62.80%, highlighting the stability challenges of multi-signal RL.
- Varying \(K\) from 4 to 16 yields limited performance differences; \(K=8\) is selected as an efficiency–performance trade-off.
Highlights & Insights¶
- First work to reformulate SER from a classification problem into an RL-driven deep reasoning task.
- Prosody-enhanced SFT is a critical prerequisite: without prosody perception capability, reasoning cannot be grounded in genuine acoustic cues.
- The trustworthiness weight \(\tau\) in GRPO-PTR is an elegant design; the group-level alignment mechanism effectively prevents reward hacking.
- The four-dimensional reasoning quality evaluation framework is transferable to reasoning quality assessment in other modalities.
- Human evaluation and GPT-based automatic evaluation rankings are consistent, validating the reliability of the evaluation scheme.
Limitations & Future Work¶
- The reward model is fine-tuned from only a 3B model, which may introduce evaluation bias.
- The nine emotion categories may lack sufficient granularity (e.g., they cannot capture sarcasm or mixed emotions).
- Validation is conducted exclusively on English datasets; cross-lingual generalization remains unknown.
- Reasoning generation increases inference latency, limiting applicability in real-time scenarios.
Related Work & Insights¶
- Conceptually aligned with DeepSeek-R1 (RL-incentivized reasoning), but extended to the speech modality with task-specific PTR customized for emotion recognition.
- Advances beyond descriptive methods such as SECap and OSUM-EChat by establishing a causal chain from acoustic features to emotional inference.
- The prosody-enhanced SFT strategy (particularly the comparative enhancement tasks) is generalizable to other speech understanding tasks.
Rating¶
- Novelty: 5/5 (First RL-driven explainable speech emotion reasoning framework; PTR strategy is highly original)
- Experimental Thoroughness: 4/5 (Four benchmarks, 16 baselines, human evaluation, and comprehensive ablation studies)
- Writing Quality: 4/5 (Modular and clear presentation; rigorous mathematical formulation)
- Value: 5/5 (Establishes a new paradigm for speech emotion reasoning; methodology is transferable)