SDiaReward: Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness
Conference: ACL 2026
arXiv: 2603.14889
Code: https://github.com/MM-Speech/SDiaReward/
Area: Spoken Dialogue / Reward Model
Keywords: Spoken dialogue evaluation, preference learning, prosody naturalness, colloquialness, reward model
TL;DR
SDiaReward constructs a pairwise preference dataset and a benchmark, ESDR-Bench, for multi-turn spoken dialogue, and trains an end-to-end speech reward model on that data. Evaluation can then go beyond textual semantics and judge both the modality gap (prosody and emotion) and the colloquialness gap (natural spoken style).
Background & Motivation
Background: End-to-end spoken dialogue systems are evolving from cascaded ASR+LLM+TTS architectures toward unified models that directly perceive and generate speech. Reward models, RLHF, DPO, and preference learning are already widely used in textual dialogue and visual generation to optimize behavior.
Limitations of Prior Work: Preferences in spoken dialogue are determined not only by textual content but also by prosody, emotion, pauses, speaking style, and turn-level coherence. Textual reward models cannot perceive these signals, while traditional automatic metrics and single-turn TTS evaluations fail to cover multi-turn interactions.
Key Challenge: Speech output must simultaneously satisfy semantic correctness, naturalness, dialogue fluency, and colloquial expression. General-purpose audio LLMs often prioritize semantics in zero-shot evaluations; they can distinguish textual style differences but fail to stably differentiate between human speech and high-quality synthetic speech based on subtle prosody variations.
Goal: The authors aim to establish an episode-level reward modeling framework: given multi-turn speech episodes and pairwise preference supervision, learn a scalar reward that assesses modality-aware naturalness and colloquialness, and provide a reproducible benchmark alongside it.
Key Insight: The paper decomposes spoken dialogue evaluation into two distinct gaps: the modality gap, referring to paralinguistic information like prosody, emotion, and channel conditions, and the colloquialness gap, referring to the stylistic difference between written text and natural speech. Data construction is designed around these two gaps to create contrastive samples.
Core Idea: Use large-scale pairwise speech episode preference data to train an end-to-end reward model, allowing the model to directly "hear" the multi-turn context and candidate speech rather than evaluating discretized text.
Method
SDiaReward consists of three core components: the SDiaReward-Dataset, the stratified sampling benchmark ESDR-Bench, and an end-to-end reward model based on Qwen2.5-Omni. Instead of scoring a single utterance, it assigns a scalar reward to the complete multi-turn context and the candidate final speech output.
Overall Architecture
The input consists of a multi-turn speech dialogue context \(\mathcal{C}\) and a candidate final response \(y\), and the model outputs a scalar reward \(r_\theta(\mathcal{C}, y)\). The training data consists of preference pairs, each made of a preferred episode and a rejected episode. Data sources include in-the-wild multi-speaker YouTube dialogues (Wild), acted dialogues from MELD (Semi-wild), studio-recorded scripted speech from DailyTalk (Scripted), and LLM-rewritten dialogues in written versus colloquial styles.
Following data construction, the authors perform stratified sampling from the validation split based on source and metadata to form ESDR-Bench, preventing the benchmark from being dominated by a single distribution due to the large volume of "Wild" data. The final dataset includes 13,356 dialogue pairs, with 11,630 for training and 1,726 for validation.
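
The stratified sampling step can be illustrated with a minimal sketch. The `PreferencePair` fields, the source labels, and the equal per-stratum quota are all assumptions made for illustration; the text above only states that sampling is stratified by source and metadata, not how many pairs each stratum contributes.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference pair (hypothetical schema, not the released format)."""
    pair_id: str
    source: str           # e.g. "wild", "semi_wild", "colloquial"
    preferred_audio: str  # path to the preferred episode
    rejected_audio: str   # path to the rejected episode


def stratified_sample(pairs, per_stratum, seed=0):
    """Draw up to `per_stratum` pairs from each source, without replacement."""
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for pair in pairs:
        by_source[pair.source].append(pair)

    sample = []
    for source in sorted(by_source):  # fixed order keeps the draw reproducible
        bucket = by_source[source]
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample


def make_pairs(prefix, source, n):
    """Build toy pairs with placeholder file names."""
    return [
        PreferencePair(f"{prefix}{i}", source, f"{prefix}{i}_pos.wav", f"{prefix}{i}_neg.wav")
        for i in range(n)
    ]


# Toy pool that mimics the imbalance: far more "wild" pairs than anything else.
pool = make_pairs("w", "wild", 80) + make_pairs("s", "semi_wild", 10) + make_pairs("c", "colloquial", 10)

bench = stratified_sample(pool, per_stratum=10)
counts = {s: sum(p.source == s for p in bench) for s in ("wild", "semi_wild", "colloquial")}
print(counts)  # {'wild': 10, 'semi_wild': 10, 'colloquial': 10}
```

A uniform random draw of 30 pairs from this pool would be about 80% "wild" on average; sampling per stratum removes that skew.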
Key Designs
- Dual-gap Preference Data Construction:
  - Function: Provide explicit preference supervision for speech naturalness and colloquial style respectively.
  - Mechanism: The modality-aware subset pairs human speech with synthetic speech generated by SoulX-Podcast using the same text content, forcing the model to focus on prosody, emotion, and turn consistency rather than textual content. The colloquialness subset first generates written-style dialogues, then rewrites them into natural versions with fillers, fragmentation, and discourse markers, synthesized with the same TTS configuration to ensure preference signals stem from stylistic naturalness rather than audio quality differences.
  - Design Motivation: If data does not control for textual content or audio conditions, models easily learn shortcuts (e.g., "cleaner recordings are better"). Pairwise construction isolates the evaluation dimensions, resulting in clearer supervision signals.
- Episode-level End-to-end Reward Model:
  - Function: Directly assign a scalar score to a complete multi-turn speech episode.
  - Mechanism: The model uses a multimodal LLM backbone to project interleaved speech-text sequences into a joint representation space. It extracts hidden states \(\mathbf{H}=\{h_1,\ldots,h_L\}\) from the final layer and obtains the reward via pooling and an MLP. Comparing last-token, attention, and mean pooling, the authors found mean pooling to be the most stable.
  - Design Motivation: Multi-turn speech preference information is scattered across context, final responses, and acoustic details. One cannot rely on a single token or ASR text to carry all signals. Mean pooling is better suited for aggregating episode-level representations.
- Multi-criteria Conditioning and Center Regularization:
  - Function: Use a single reward model to handle both modality-aware and colloquialness evaluations while stabilizing the reward scale.
  - Mechanism: Model input includes criterion-specific instructions, making the reward \(r_\theta(\mathcal{C}, y, \mathrm{inst})\). Training utilizes the Bradley-Terry preference objective to ensure the preferred reward is higher than the rejected. To prevent unbounded score drift caused by pairwise loss, a center loss is added to anchor the mean reward within a reasonable range.
  - Design Motivation: Speech data varies significantly across Wild, Semi-wild, and Scripted distributions. Purely pairwise optimization might treat channel and domain differences as absolute score offsets. Center regularization improves calibration and training stability.
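
The three pooling strategies compared in the reward model design can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: hidden states are toy lists of floats, `mask` marks non-padding positions, the attention variant uses a single hypothetical query vector, and `score_head` is a one-layer linear stand-in for the reward head.

```python
import math


def _kept(hidden, mask):
    """Return the hidden vectors at unmasked (non-padding) positions."""
    kept = [h for h, m in zip(hidden, mask) if m]
    if not kept:
        raise ValueError("mask selects no positions")
    return kept


def mean_pool(hidden, mask):
    """Average all unmasked hidden vectors."""
    kept = _kept(hidden, mask)
    return [sum(h[d] for h in kept) / len(kept) for d in range(len(kept[0]))]


def last_token_pool(hidden, mask):
    """Use only the hidden vector at the last unmasked position."""
    return list(_kept(hidden, mask)[-1])


def attention_pool(hidden, mask, query):
    """Softmax-weighted average, scored by dot product with `query`."""
    kept = _kept(hidden, mask)
    scores = [sum(q * x for q, x in zip(query, h)) for h in kept]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * h[d] for w, h in zip(weights, kept)) / total for d in range(len(kept[0]))]


def score_head(pooled, weights, bias):
    """Map a pooled vector to a scalar reward (linear stand-in for the head)."""
    return sum(w * x for w, x in zip(weights, pooled)) + bias


# Four positions with two features each; the last position is padding.
hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

pooled_mean = mean_pool(hidden, mask)        # [3.0, 2.0]
pooled_last = last_token_pool(hidden, mask)  # [5.0, 4.0]
reward = score_head(pooled_mean, weights=[0.5, -0.25], bias=0.1)
```

Mean pooling uses every unmasked position, whereas last-token pooling depends on a single vector; that matches the stated motivation that preference cues are spread across the whole episode.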
Loss & Training
The primary loss is the Bradley-Terry preference loss, \(\mathcal{L}_{\mathrm{pref}}(\theta)=-\mathbb{E}\left[\log \sigma\left(r_\theta(\mathcal{C}^+,y^+)-r_\theta(\mathcal{C}^-,y^-)\right)\right]\), where \(\sigma\) is the sigmoid and \((\mathcal{C}^+,y^+)\) and \((\mathcal{C}^-,y^-)\) are the preferred and rejected episodes; minimizing it pushes the preferred reward above the rejected one. The authors also use center regularization to mitigate reward drift. The model is initialized from Qwen2.5-Omni, with scalar prediction performed by a linear score head. Audio is truncated or padded to 30 seconds.
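
A short sketch shows why the pairwise loss alone leaves the reward scale unanchored. The Bradley-Terry term follows the formula above. The center term is an assumption: the summary only says it anchors the mean reward, so the squared batch mean and the `center_weight` value used here are illustrative choices, not the paper's definition.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(pos, neg):
    """Mean of -log sigmoid(r_pos - r_neg) over a batch of pairs."""
    return -sum(log_sigmoid(p - n) for p, n in zip(pos, neg)) / len(pos)


def center_penalty(pos, neg):
    """Squared mean of all rewards in the batch (assumed form of the center loss)."""
    rewards = list(pos) + list(neg)
    return (sum(rewards) / len(rewards)) ** 2


def total_loss(pos, neg, center_weight=0.01):
    """Preference loss plus a weighted center term (the weight is an assumption)."""
    return bradley_terry_loss(pos, neg) + center_weight * center_penalty(pos, neg)


# Toy rewards for three preferred/rejected pairs.
pos = [1.2, 0.3, 2.0]
neg = [-0.5, 0.4, 1.0]

# Shifting every reward by the same constant keeps all differences the same,
# so the Bradley-Terry term cannot tell the two settings apart.
shift = 50.0
pos_shifted = [r + shift for r in pos]
neg_shifted = [r + shift for r in neg]

bt_plain = bradley_terry_loss(pos, neg)
bt_shifted = bradley_terry_loss(pos_shifted, neg_shifted)
print(math.isclose(bt_plain, bt_shifted, abs_tol=1e-9))               # True
print(center_penalty(pos_shifted, neg_shifted) > center_penalty(pos, neg))  # True
```

Because the preference term depends only on reward differences, any global offset is free to drift during training; a penalty on the batch mean is one simple way to remove that degree of freedom.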
Key Experimental Results
Main Results
The dataset covers four categories of preference sources. While Wild data represents the largest portion, ESDR-Bench avoids being dominated by it through stratified sampling.
| Category | Train | Val | Total |
|---|---|---|---|
| Wild modality | 6,879 | 824 | 7,703 |
| Semi-Wild modality | 309 | 186 | 495 |
| Scripted modality | 2,192 | 466 | 2,658 |
| Colloquialness | 2,250 | 250 | 2,500 |
| Total | 11,630 | 1,726 | 13,356 |
On ESDR-Bench, SDiaReward-7B significantly outperforms general audio LLMs, specialized speech evaluators, and cascaded systems in both modality and overall metrics.
| Model | Modality Micro | Modality Macro | Colloq. Acc | Overall Micro | Overall Macro |
|---|---|---|---|---|---|
| Gemini 2.5 Pro | 72.63 | 70.50 | 98.80 | 76.42 | 84.65 |
| GPT-4o Audio | 51.12 | 50.47 | 98.00 | 57.91 | 74.23 |
| Qwen 3 Omni 30B | 58.18 | 55.97 | 97.20 | 63.83 | 76.59 |
| SpeechJudge | 54.44 | 52.62 | 55.20 | 54.55 | 53.91 |
| AudioReasoner+Whisper+GPT-4o | 55.38 | 53.09 | 75.20 | 58.25 | 64.14 |
| SDiaReward-3B | 88.62 | 79.20 | 92.00 | 89.11 | 85.60 |
| SDiaReward-7B | 96.61 | 94.91 | 97.20 | 96.70 | 96.06 |
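
The relationship between the micro and macro columns can be made concrete with a small sketch. It assumes that micro accuracy weights each group by its number of pairs, that macro accuracy is the unweighted mean over groups, and that the benchmark group sizes equal the validation counts in the dataset table (1,476 modality pairs and 250 colloquialness pairs). These are assumptions about the aggregation, not statements taken from the paper.

```python
def micro_accuracy(acc_by_group, n_by_group):
    """Count-weighted average accuracy: every pair has equal weight."""
    total = sum(n_by_group[g] for g in acc_by_group)
    return sum(acc_by_group[g] * n_by_group[g] for g in acc_by_group) / total


def macro_accuracy(acc_by_group):
    """Unweighted average accuracy: every group has equal weight."""
    return sum(acc_by_group.values()) / len(acc_by_group)


# Assumed group sizes, taken from the validation column of the dataset table.
n_pairs = {"modality": 824 + 186 + 466, "colloquial": 250}

# SDiaReward-7B row: modality micro 96.61, modality macro 94.91, colloq. acc 97.20.
overall_micro = micro_accuracy({"modality": 96.61, "colloquial": 97.20}, n_pairs)
overall_macro = macro_accuracy({"modality": 94.91, "colloquial": 97.20})

print(f"overall micro: {overall_micro:.3f}")
print(f"overall macro: {overall_macro:.3f}")
```

Under these assumptions the two values land within rounding of the reported 96.70 and 96.06 for SDiaReward-7B, which suggests the overall columns are aggregated in roughly this way. The same weighting also explains why a model can be near-saturated on colloquialness yet have a much lower overall micro score: modality pairs outnumber colloquialness pairs by about six to one.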
Ablation Study
OOD TTS tests demonstrate that SDiaReward is not merely an artifact detector. Wav2Vec2-DF dropped to 38.6% on CosyVoice 2, whereas SDiaReward-7B maintained high accuracy across three unseen TTS engines.
| OOD Engine | Wav2Vec2-DF Acc | SDiaReward-3B Acc | SDiaReward-7B Acc | SDiaReward-7B rejected score |
|---|---|---|---|---|
| OpenAI TTS | 89.9% | 93.0% | 98.3% | -0.62 |
| CosyVoice 2 | 38.6% | 93.1% | 95.3% | -0.04 |
| FireRedTTS-2 | 94.5% | 72.7% | 90.9% | 0.29 |
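
The two readouts in the table above, pairwise accuracy and the score of the rejected sample, can be computed from per-pair rewards as in the sketch below. The scores are invented toy values, not results from the paper, and two choices are assumptions: ties count as errors, and the rejected score is summarized by its mean.

```python
def pairwise_accuracy(scored_pairs):
    """Fraction of pairs whose preferred episode gets the strictly higher reward."""
    wins = sum(1 for pos, neg in scored_pairs if pos > neg)
    return wins / len(scored_pairs)


def mean_rejected_score(scored_pairs):
    """Average reward given to the rejected (synthetic) episodes."""
    return sum(neg for _, neg in scored_pairs) / len(scored_pairs)


# Toy (preferred, rejected) reward pairs for two hypothetical TTS engines.
engine_a = [(1.0, -0.8), (0.9, -0.5), (1.2, -0.6), (0.7, -0.4)]
engine_b = [(1.0, 0.4), (0.9, 0.1), (0.3, 0.5), (1.2, 0.2)]

for name, pairs in (("engine_a", engine_a), ("engine_b", engine_b)):
    acc = pairwise_accuracy(pairs)
    rej = mean_rejected_score(pairs)
    print(f"{name}: accuracy={acc:.2f}, mean rejected score={rej:.2f}")
```

The two numbers answer different questions: accuracy says how often human speech is ranked first, while the rejected score says how close the synthetic speech sits to the human range on the reward scale.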
Ablations on pooling and center regularization show that mean pooling + center loss is the most stable configuration.
| Setup | Modality | Colloq. | Overall |
|---|---|---|---|
| 3B Last Hidden | 63.75 | 48.80 | 61.59 |
| 3B Attention | 87.94 | 93.60 | 88.76 |
| 3B Mean | 88.62 | 92.00 | 89.11 |
| 7B Last Hidden | 51.83 | 40.00 | 50.12 |
| 7B Attention | 70.60 | 55.20 | 68.37 |
| 7B Mean | 96.61 | 97.20 | 96.70 |
| 7B Mean w/o Center Loss | 95.05 | 97.20 | 95.37 |
| 7B Mean w/ Center Loss | 96.61 | 97.20 | 96.70 |
Key Findings
- General audio judges are nearly saturated on colloquialness (e.g., Gemini 2.5 Pro reaches 98.80%), yet Gemini 2.5 Pro's modality micro accuracy is only 72.63%, indicating that textual style is easy to judge while acoustic naturalness is much harder.
- SDiaReward-7B achieves 96.61% modality micro and 94.91% modality macro accuracy and is more stable than the 3B version, which scores only 55.38% on Semi-wild, exposing the smaller model's insufficient generalization to complex semi-acted speech.
- In human validation on 75 stratified samples, the overall weighted agreement was 83.5% ± 4.3%, with 88.3% agreement on high-confidence samples and 93.3% on hard negatives, suggesting the preference labels are largely reliable.
- FireRedTTS-2 has a higher rejected score, suggesting the model considers it closer to a real person rather than mechanically judging it as fake audio; this supports "relative expressiveness evaluation" over artifact shortcuts.
Highlights & Insights
- The paper accurately captures the core of spoken dialogue rewards: the target is not "text answer + audio quality" but a holistic experience involving turn-taking rhythm, emotion, pauses, and colloquial habits.
- The design of modality-aware pairing is clean. By comparing real versus synthetic speech with identical text, the model must learn paralinguistic naturalness rather than relying on semantic differences.
- Colloquialness pairing is also rigorously handled: written and colloquial versions use the same TTS configuration, preventing audio quality from being mistaken for a preference for colloquial style.
- The value of center loss goes beyond an overall gain of 1.33 points; more importantly, it ensures the reward scale does not drift uncontrollably across domains. This is critical for subsequent use of the reward in DPO/GRPO for speech generation.
Limitations & Future Work
- The authors acknowledge that the data is currently biased toward in-the-wild recordings; more high-quality acted speech and diverse synthesis engines are needed to improve cross-domain robustness.
- While human validation results are promising, the sample size of 75 is small and has limited coverage of fine-grained subjective preferences, cultural differences, and speaker style biases.
- The reward still exhibits domain-dependent offsets. For example, absolute scores for positive samples in the Scripted domain may be low, suggesting the model learns within-domain relative ranking rather than a globally unified quality scale.
- Caution is needed when applying this reward to RL for speech generation. Optimizers can exploit reward models, and here reward hacking may surface as unnatural acoustic channel characteristics, speaking rate, or emotional intensity.
Related Work & Insights
- vs SpeechJudge / SageLM: These specialized evaluators lean toward single-turn speech or TTS quality; SDiaReward evaluates episode-level multi-turn speech preferences, making it better suited for interactive spoken dialogue.
- vs WavReward / ParaS2S: Related works attempt to incorporate paralinguistic signals but often rely on hand-crafted acoustic features or rules; SDiaReward replaces brittle feature engineering with data-driven preference learning.
- vs cascaded evaluators: AudioReasoner+Whisper+GPT-4o shows some ability on colloquialness, but ASR discards prosody and emotion, leading to weak performance on modality tasks.
- Insight: Multimodal reward models should aim to control for irrelevant variables in preference pairs. Speech, video, and embodied interaction can all utilize "same semantics, different modal realization" hard pairs to train perception-level rewards.
Rating
- Novelty: ⭐⭐⭐⭐☆ Specifically decomposing spoken dialogue rewards into modality and colloquialness gaps for episode-level modeling is highly targeted.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Main results, OOD TTS, human validation, and ablations on pooling and center loss are quite comprehensive.
- Writing Quality: ⭐⭐⭐⭐☆ The structure is clear and the experimental analysis is detailed; certain terms like relative expressiveness could be more formally defined.
- Value: ⭐⭐⭐⭐⭐ Direct value for the evaluation and subsequent RL alignment of end-to-end spoken dialogue systems.