UniSRM: A Unified Speech Reward Model for Fine-Grained Speech Evaluation
Conference: ACL 2026
arXiv: 2605.23261
Code: https://github.com/lavendery/UniSRM
Area: Speech Evaluation / Reward Models
Keywords: Speech Reward Model, Multi-dimensional Evaluation, Reasoning Consistency, TTS Evaluation
TL;DR
This paper proposes UniSRM, a unified speech reward model. It combines two-stage training (SFT followed by GRPO) with a Reasoning Consistency Reward (RCR) to support multi-dimensional, interpretable speech evaluation ranging from utterance-level quality to dialogue-level coherence, and it outperforms the GPT-4o-Audio and Gemini-2.5-Flash baselines on all reported evaluation tasks.
Background & Motivation
Background: Speech generation quality assessment has long relied on Mean Opinion Scores (MOS), which are costly, subjective, and difficult to scale. Recent work has begun using Large Audio-Language Models (LALMs) as automatic scorers, such as WavReward, SageLM, and SpeechJudge.
Limitations of Prior Work: (1) Existing methods cover limited tasks—most only handle utterance-level quality or single-turn dialogue, ignoring multi-turn interactions and contextual consistency; (2) Incomplete evaluation dimensions—some methods miss key metrics like speaker similarity; (3) Uncontrollable reasoning processes—rule-based RL lacks sufficient supervision over reasoning steps, leading to inconsistencies between generated rationales and final decisions; (4) Lack of transparency in scoring—traditional metrics (WER, SIM, UTMOS) each capture only a single aspect.
Key Challenge: Two tensions: the diversity of speech evaluation tasks (from isolated utterances to dialogue contexts) versus the narrow, single-task design of existing reward models; and the freedom of rationale generation versus the accuracy of the final score.
Goal: Construct a unified reward model capable of (1) supporting multiple speech evaluation tasks; (2) outputting interpretable multi-dimensional scores and reasoning processes; (3) ensuring consistency between reasoning and decision-making.
Key Insight: Existing LLM-based scorers perform poorly when they must integrate text or multi-turn dialogue context, which suggests room for improvement through better training strategies and reasoning supervision. In addition, decomposing the evaluation into multiple dimensions mirrors how human raters arrive at a score.
Core Idea: Replace simple end-to-end fine-tuning with a "staged training + reasoning consistency reward" paradigm, allowing the model to receive explicit supervision while generating dimension-level intermediate reasoning, thereby improving overall reliability.
Method
Overall Architecture
UniSRM adopts a two-stage training pipeline. In the first stage (SFT), a pre-trained speech-language model (Qwen2.5-Omni-7B-thinker) is instruction-tuned on a unified dataset, UniSRM-Data, to learn to output multi-dimensional scores and reasoning in a structured format. In the second stage (GRPO), Group Relative Policy Optimization is used with a Reasoning Consistency Reward to further align model predictions with human preferences.
The input consists of task-related multimodal context, and the output has two parts: (1) dimension-level reasoning within <think> tags; (2) the final decision (binary preference or MOS-like scores) within <answer> tags.
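The two-part output described above can be checked mechanically. The following is a minimal sketch, assuming the response contains exactly one `<think>` block followed by exactly one `<answer>` block; the function name `parse_output` and the returned dictionary keys are hypothetical and not taken from the paper or its code release.

```python
import re
from typing import Optional

# Non-greedy matches so each pattern captures a single tag pair.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_output(text: str) -> Optional[dict]:
    """Split a response into reasoning and answer; return None if malformed.

    Assumption: a well-formed response has exactly one <think> block
    that appears before exactly one <answer> block.
    """
    think = THINK_RE.findall(text)
    answer = ANSWER_RE.findall(text)
    if len(think) != 1 or len(answer) != 1:
        return None
    if text.find("<think>") > text.find("<answer>"):
        return None  # reasoning must come before the final decision
    return {"reasoning": think[0].strip(), "answer": answer[0].strip()}


if __name__ == "__main__":
    sample = "<think>naturalness: A=4, B=3</think><answer>A</answer>"
    print(parse_output(sample))
    print(parse_output("A"))  # malformed, so None
```

A check of this kind is one plausible basis for a format reward, since a `None` result identifies outputs that violate the required structure.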
Key Designs
- Reasoning Consistency Reward (RCR):
  - Function: Explicitly supervises the model's dimension-level scoring behavior during reasoning, so that intermediate reasoning stays aligned with the final decision.
  - Mechanism: For paired tasks, the consistency reward is \(R_{\text{rc}}(o) = \frac{1}{D}\sum_{i=1}^{D}\mathbf{1}[\text{sign}(a_i - b_i) = \text{sign}(a_i^{\star} - b_i^{\star})]\), where \(D\) is the number of evaluation dimensions, \(a_i\) and \(b_i\) are the scores the model assigns to the two samples on dimension \(i\), and \(a_i^{\star}\), \(b_i^{\star}\) are the corresponding label scores. In other words, the relative order of the two samples in each dimension should match the labels. For quality scoring tasks, the reward is derived from a normalized multi-dimensional scoring error.
  - Design Motivation: Rewards based purely on final-answer accuracy can lead to "shortcuts" where the final prediction is correct but the dimension scores contradict it, undermining interpretability. RCR enforces dimension-level consistency by directly constraining the intermediate steps.
- Multi-task Unified Framework & Structured Output:
  - Function: Processes four complementary speech evaluation tasks in a single model: utterance-level paired preference, utterance-level quality scoring, scenario-aware style consistency (with text context), and multi-turn dialogue evaluation (with history).
  - Mechanism: All tasks are unified as conditional generation problems. System prompts enforce a two-part structure (reasoning + answer). While answer formats differ across tasks (binary decision vs. multi-dimensional score vector), the reasoning part always includes task-relevant dimension-level scores.
  - Design Motivation: A unified framework allows a single model to learn general evaluation capabilities, while structured output facilitates penalizing format violations during RL optimization.
- Componentized Reward Function:
  - Function: Combines format rewards, accuracy rewards, and reasoning consistency rewards, \(R(x,o) = \lambda_{\text{fmt}}R_{\text{fmt}}(o) + \lambda_{\text{acc}}R_{\text{acc}}(o) + \lambda_{\text{rc}}R_{\text{rc}}(o)\), to comprehensively constrain model behavior.
  - Mechanism: Format rewards apply a \(-1\) penalty for malformed outputs; accuracy rewards use \(\mathbf{1}[y^{(g)} = y^{\star}]\) for paired tasks and normalized distance rewards for quality scoring; and RCR follows the design above.
  - Design Motivation: The three components constrain the model from different angles, requiring correct final answers, well-formed outputs, and consistent dimension-level reasoning.
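The paired-task reward described in the designs above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the \(\lambda\) weights default to 1.0 because the summary does not state their values, a well-formed output is assumed to earn a format reward of 0, and a malformed output is assumed to receive only the \(-1\) format penalty.

```python
from typing import Sequence


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def pairwise_rcr(pred_a: Sequence[float], pred_b: Sequence[float],
                 ref_a: Sequence[float], ref_b: Sequence[float]) -> float:
    """Fraction of dimensions where the predicted ordering of samples
    A and B matches the labeled ordering (the R_rc formula above)."""
    if not ref_a or not (len(pred_a) == len(pred_b) == len(ref_a) == len(ref_b)):
        raise ValueError("all score vectors must share one non-zero length")
    agree = sum(
        sign(pa - pb) == sign(ra - rb)
        for pa, pb, ra, rb in zip(pred_a, pred_b, ref_a, ref_b)
    )
    return agree / len(ref_a)


def total_reward(well_formed: bool, correct: bool, r_rc: float,
                 lam_fmt: float = 1.0, lam_acc: float = 1.0,
                 lam_rc: float = 1.0) -> float:
    """Weighted sum of format, accuracy, and consistency rewards.

    Assumptions: weights of 1.0, R_fmt = 0 for well-formed outputs,
    and no accuracy or consistency credit for malformed outputs.
    """
    if not well_formed:
        return lam_fmt * -1.0
    r_acc = 1.0 if correct else 0.0
    return lam_fmt * 0.0 + lam_acc * r_acc + lam_rc * r_rc


if __name__ == "__main__":
    # Three dimensions; the prediction ties on the second one while the
    # label prefers sample A, so two of three orderings agree.
    rc = pairwise_rcr([4, 3, 5], [3, 3, 2], [5, 4, 4], [3, 2, 3])
    print(rc, total_reward(True, True, rc))
```

Note that a predicted tie counts as a disagreement whenever the labels express a strict preference, which follows directly from comparing signs.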
Loss & Training
The SFT stage uses the standard auto-regressive maximum likelihood objective. The GRPO stage employs intra-group relative policy optimization with KL regularization. For each input \(x\), \(G\) responses are sampled from the current policy, and advantages are normalized using the group mean and standard deviation: \(A^{(g)} = (R^{(g)} - \mu(x))/(\sigma(x) + \epsilon)\), followed by a clipped policy gradient objective.
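The group-normalized advantage above is simple to compute for one prompt. The sketch below assumes the population standard deviation and \(\epsilon = 10^{-6}\), neither of which is specified in this summary; the function name `group_advantages` is hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Compute A^(g) = (R^(g) - mu) / (sigma + eps) over one group of
    G sampled responses to the same input x.

    Assumption: sigma is the population standard deviation of the group.
    """
    mu = fmean(rewards)
    sigma = pstdev(rewards, mu)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses: two rewarded, two not.
    print(group_advantages([1.0, 0.0, 1.0, 0.0]))
    # Identical rewards carry no learning signal: all advantages are zero.
    print(group_advantages([0.5, 0.5, 0.5]))
```

The second call illustrates why \(\epsilon\) is needed: when every response in a group receives the same reward, \(\sigma(x) = 0\) and the division would otherwise be undefined.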
Key Experimental Results
Main Results
| Model | Task 1 (Paired) | Task 2 (Quality) | Task 3-En (Scenario) | Task 3-Zh | Task 4 (Dialogue) |
|---|---|---|---|---|---|
| WER / SIM / UTMOS / DNSMOS | 59.24–84.10 | 0.274–0.449 | 33.21–61.44 | 48.19–63.04 | 40.48–50.79 |
| GPT-4o-Audio | 61.04 | 0.060 | 64.02 | 64.82 | 71.96 |
| Gemini-2.5-Flash | 60.44 | 0.522 | 65.68 | 71.74 | 71.43 |
| UniSRM (Ours) | 65.06 | 0.551 | 85.61 | 91.30 | 88.89 |
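UniSRM outperforms both LALM baselines (GPT-4o-Audio and Gemini-2.5-Flash) on every task. The margin is largest on the tasks that require integrating text or dialogue context (Tasks 3 and 4), where it improves on the stronger of the two by roughly 17 to 20 percentage points (85.61 vs. 65.68, 91.30 vs. 71.74, and 88.89 vs. 71.96). On Task 1, the upper end of the traditional-metric range (84.10) remains above UniSRM's 65.06.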
Ablation Study
| Configuration | Task 1 | Task 2 | Task 3-En | Task 3-Zh | Task 4 |
|---|---|---|---|---|---|
| SFT Only (w/o GRPO) | 60.24 | 39.20 | 67.16 | 70.95 | 74.60 |
| GRPO w/o RCR | 60.44 | 37.58 | 80.81 | 81.42 | 82.54 |
| UniSRM Full | 65.06 | 39.74 | 85.61 | 91.30 | 88.89 |
Key Findings: (1) The full model (GRPO with RCR) improves over pure SFT on every task; (2) adding RCR to GRPO brings further gains on every task, the largest in this table being 9.88 percentage points on Task 3-Zh (81.42 → 91.30); (3) counter-intuitively, GRPO without RCR performs worse than SFT in some cases (e.g., Task 2: 37.58 vs. 39.20), suggesting that pure accuracy rewards can induce "shortcut" behavior, which RCR counters through dimension-level supervision.
Cross-dataset Generalization
| Dataset | Metric | DNSMOS | Gemini-2.5-Pro | UniSRM |
|---|---|---|---|---|
| BVCC | PCC | 0.299 | 0.339 | 0.498 |
| SOMOS-Clean | PCC | 0.048 | 0.250 | 0.261 |
| SOMOS-Full | PCC | 0.053 | 0.222 | 0.235 |
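UniSRM attains the highest PCC in all three rows on human-annotated external datasets (BVCC, SOMOS), with a clear margin on BVCC (0.498 vs. 0.339 for Gemini-2.5-Pro) and a narrow one on SOMOS (0.261 vs. 0.250 and 0.235 vs. 0.222). This suggests the model learns transferable evaluation abilities rather than merely overfitting to LLM-generated labels.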
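For readers who want to reproduce this kind of comparison, the sketch below computes a correlation between predicted and human scores. It assumes that PCC in the table denotes the Pearson correlation coefficient, as is common in MOS prediction work; the function name and the toy scores are illustrative and are not data from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between predicted scores and human scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    predicted = [3.1, 4.0, 2.5, 4.4]  # toy model scores
    human = [3.0, 4.2, 2.0, 4.5]      # toy human MOS labels
    print(round(pearson(predicted, human), 3))
```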
Highlights & Insights
- Ingenious RCR Design: RCR does not simply punish errors; it enforces "logical self-consistency" at the dimension level—ensuring consistency in relative comparisons. This constraint shifts the optimization goal from being "correct at the last step" to being "reasonable throughout," significantly reducing the model's room to "cheat."
- Synergy of Unified Data and Multi-task Learning: By carefully designing UniSRM-Data, four seemingly different tasks are unified into a "multi-dimensional reasoning + structured answer" format, enabling a single model to learn cross-task general evaluation capabilities.
- SFT and GRPO Collaboration: SFT teaches the model to mimic the rationales and decisions of annotators. With GRPO, the model improves final accuracy while increasing the diversity and interpretability of reasoning through varied sampling.
Limitations & Future Work
Limitations acknowledged by authors: (1) Current benchmarks have limited coverage of difficult scenarios such as heavy accents or overlapping speech; (2) High computational costs for training and inference limit scalability and low-latency deployment.
Limitations identified in this review: (1) Evaluation dimension definitions are relatively fixed and may not adapt to emerging application scenarios (e.g., multilingual mixing, specific character accents); (2) The data sources rely on LLM-generated labels. Although cross-dataset generalization is supported by the BVCC and SOMOS results, systematic LLM biases might still be implicitly inherited.
Future Work: (1) Explore distillation strategies to make UniSRM lightweight; (2) Introduce active learning to prioritize annotating high-uncertainty samples; (3) Research adaptive dimension selection.
Related Work & Insights
- vs WavReward / SageLM: These methods focus on single-turn or utterance-level evaluation and often use rule-based RL. UniSRM covers richer task scenarios and explicitly constrains dimension-level consistency via RCR.
- vs SpeechJudge: While it also generates evaluation rationales, it targets the utterance level with limited dimensions. UniSRM extends this to the dialogue level.
- vs QualiSpeech / AudioJudge: These focus on low-level quality features, whereas UniSRM incorporates high-level, context-aware evaluations.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The RCR design is a creative improvement for multi-step reasoning evaluation. While multi-task unified frameworks have precedents, the comprehensiveness and execution quality of this work are at the industry forefront.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Includes 4 complementary tasks, 3 levels of ablation, fine-grained dimension analysis, and cross-dataset generalization, providing comprehensive coverage.
- Writing Quality: ⭐⭐⭐⭐ The logic is clear and motivation is sufficient, though discussion on the trade-off between computational cost and practicality is slightly lacking.
- Value: ⭐⭐⭐⭐⭐ Provides a reusable paradigm and public dataset for reward modeling in speech generation, filling a gap in the tooling available for speech RLHF.