# Multi-Metric Preference Alignment for Generative Speech Restoration
Conference: AAAI 2026 arXiv: 2508.17229 Code: To be confirmed Area: Speech Generation Keywords: Preference alignment, DPO, speech restoration, multi-metric, generative models
## TL;DR
This paper proposes a Multi-Metric Preference Alignment strategy for generative speech restoration: a preference dataset, GenSR-Pref (~80K pairs), is constructed by requiring unanimous agreement across multiple complementary metrics, and DPO is then applied to post-train three generative paradigms (AR, MGM, FM), yielding substantial quality improvements while effectively mitigating reward hacking.
## Background & Motivation
Generative Speech Restoration (GenSR) has made significant progress in recent years, covering tasks such as denoising, dereverberation, declipping, and super-resolution. However, these models are typically trained with likelihood maximization objectives, which are misaligned with human perceptual preferences. Post-training alignment has proven effective in NLP and image generation, yet remains largely unexplored in speech restoration.
Applying preference alignment to GenSR presents three key challenges:
- Preference signal definition: how can automated proxies capture the multi-dimensional nature of human auditory perception (clarity, naturalness, freedom from artifacts)?
- High-quality preference data construction: how can preference pairs be built so that they robustly guide model optimization?
- Reward-hacking mitigation: how can alignment yield genuine holistic improvement rather than exploit the biases of any single metric?
## Method
### Mechanism: Multi-Metric Unanimous Agreement Criterion
The authors argue that the key to addressing reward hacking lies in making the preference signal itself multi-dimensional and comprehensive. To this end, a strict unanimous agreement criterion is proposed: a valid preference pair is formed only when one sample is superior to another across all complementary metrics simultaneously.
### GenSR-Pref Dataset Construction
Four complementary evaluation dimensions are selected to construct the preference signal:
| Dimension | Metric | Evaluation Scope |
|---|---|---|
| Perceptual quality | NISQA | Overall listening quality, naturalness, artifact level |
| Signal fidelity | DNSMOS (SIG/BAK/OVRL) | Signal distortion, background noise, overall quality |
| Content consistency | SpeechBERTScore | Semantic/content similarity to the reference speech |
| Timbre preservation | Speaker Similarity | Speaker identity preservation (cosine similarity) |
The dataset comprises approximately 80K preference pairs in total: 69,456 pairs in the MGM subset for large-scale validation, and approximately 3K pairs each for AR/FM/MGM for controlled ablation studies.
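A minimal sketch of the unanimous-agreement filter, assuming each candidate restoration has already been scored offline on the four dimensions above. All names, the `Candidate` structure, and the `margin` knob are hypothetical illustrations, not the authors' released code:

```python
from dataclasses import dataclass
from typing import Optional

# Higher-is-better proxies for the four dimensions above.
METRICS = ("nisqa", "dnsmos_ovrl", "speech_bertscore", "speaker_sim")

@dataclass
class Candidate:
    audio_path: str
    scores: dict[str, float]  # metric name -> precomputed score

def unanimous_pair(a: Candidate, b: Candidate,
                   margin: float = 0.0) -> Optional[tuple[Candidate, Candidate]]:
    """Return (winner, loser) only if one candidate beats the other on
    *all* metrics simultaneously; otherwise the pair is discarded."""
    a_wins = all(a.scores[m] > b.scores[m] + margin for m in METRICS)
    b_wins = all(b.scores[m] > a.scores[m] + margin for m in METRICS)
    if a_wins:
        return a, b
    if b_wins:
        return b, a
    return None  # disagreement between metrics -> no preference signal
```

Because any disagreement vetoes the pair, a sample can only "win" by being holistically better, which is exactly what removes the incentive to game a single metric.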
### DPO Adaptation for Three Generative Paradigms
The paper unifies DPO adaptation across three mainstream generative paradigms (a loss sketch follows this list):
- Autoregressive model (AR): A two-stage AR+Soundstorm pipeline that first predicts semantic tokens and then converts them to acoustic tokens. DPO directly contrasts the log-probability ratios of winning and losing sequences.
- Masked generative model (MGM): AnyEnhance is employed, predicting acoustic tokens from partially masked sequences. DPO is extended to the non-autoregressive setting by contrasting conditional probabilities.
- Flow matching model (FM): Flow-SR, based on a DiT architecture, learns a velocity field from noise to clean mel-spectrograms. DPO uses single-step L2 error differences as a likelihood proxy.
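A hedged sketch of two of these likelihood proxies in PyTorch. The AR form is the standard DPO loss over summed token log-probabilities; the FM form follows the paper's description of using single-step L2 (velocity-prediction) error differences as a likelihood proxy, where a lower error plays the role of a higher log-likelihood. Tensor shapes, function names, and the `beta` value are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def dpo_loss_ar(logp_w: torch.Tensor, logp_l: torch.Tensor,
                ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
                beta: float = 0.1) -> torch.Tensor:
    """Standard DPO: contrast policy-vs-reference log-probability ratios
    of the winning and losing token sequences (all inputs shape (batch,))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

def dpo_loss_fm(err_w: torch.Tensor, err_l: torch.Tensor,
                ref_err_w: torch.Tensor, ref_err_l: torch.Tensor,
                beta: float = 0.1) -> torch.Tensor:
    """FM variant: single-step L2 velocity errors stand in for negative
    log-likelihoods, so the error differences enter with a flipped sign."""
    margin = -beta * ((err_w - ref_err_w) - (err_l - ref_err_l))
    return -F.logsigmoid(margin).mean()
```

Per the paper's description, the MGM case contrasts conditional probabilities in the same form, with log-probabilities taken over the predicted (masked) positions rather than a full autoregressive sequence.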
## Experiments
### Experiment 1: Effectiveness of Multi-Metric Preference Alignment (Table 1)
Performance before and after alignment is evaluated on three GenSR benchmarks; the Voicefixer-GSR results are shown below:
| Dataset | Model | Aligned | SIG↑ | BAK↑ | OVRL↑ | NISQA↑ | SBERT↑ | SIM↑ |
|---|---|---|---|---|---|---|---|---|
| Voicefixer-GSR | AnyEnhance (MGM) | ✗ | 3.406 | 4.073 | 3.136 | 4.308 | 0.829 | 0.924 |
| Voicefixer-GSR | AnyEnhance (MGM) | ✓ | 3.532 | 4.091 | 3.267 | 4.639 | 0.834 | 0.935 |
| Voicefixer-GSR | AR+Soundstorm (AR) | ✗ | 3.550 | 4.097 | 3.294 | 4.556 | 0.788 | 0.894 |
| Voicefixer-GSR | AR+Soundstorm (AR) | ✓ | 3.564 | 4.144 | 3.331 | 4.850 | 0.803 | 0.904 |
| Voicefixer-GSR | Flow-SR (FM) | ✗ | 3.398 | 3.969 | 3.104 | 4.010 | 0.812 | 0.918 |
| Voicefixer-GSR | Flow-SR (FM) | ✓ | 3.483 | 4.092 | 3.230 | 4.672 | 0.830 | 0.924 |
All three paradigms achieve consistent and significant improvements across all metrics after alignment. On Librivox-GSR (not shown above), the MGM model achieves a NISQA gain of +0.519, while AR and FM obtain NISQA gains of +0.388 and +0.641, respectively, using only ~3K preference pairs each. In subjective A/B testing, the aligned model achieves a peak win rate of 54.5%.
### Experiment 2: Multi-Metric vs. Single-Metric Ablation (Table 3)
Different preference criteria are compared on the AR model:
| Criterion | SIG↑ | BAK↑ | OVRL↑ | NISQA↑ | SBERT↑ | SIM↑ |
|---|---|---|---|---|---|---|
| No alignment | 3.550 | 4.097 | 3.294 | 4.556 | 0.788 | 0.894 |
| Multi-Metric | 3.564 | 4.144 | 3.331 | 4.850 | 0.803 | 0.904 |
| NISQA only | 3.531 | 4.137 | 3.300 | 4.810 | 0.785 | 0.896 |
| OVRL only | 3.561 | 4.117 | 3.317 | 4.600 | 0.792 | 0.896 |
| SIM only | 3.537 | 4.101 | 3.285 | 4.577 | 0.792 | 0.901 |
| SBERT only | 3.540 | 4.109 | 3.291 | 4.612 | 0.804 | 0.901 |
Single-metric alignment improves only its target metric, while non-target metrics frequently stagnate or degrade (e.g., SIG and OVRL decline under SIM-only alignment). The multi-metric strategy achieves the best or near-best result on every metric (SBERT-only edges it out on SBERT by 0.001), effectively mitigating reward hacking.
## Key Findings
- DPO vs. SFT: DPO consistently outperforms both SFT baselines (SFT-GT and SFT-Winner), indicating that exposure to high-quality samples alone is insufficient for effective alignment.
- GT as fixed winner leads to model collapse: Using ground truth as a fixed winner causes the model to learn pathological shortcuts—extreme suppression of probabilities for all non-GT outputs—leading to inflated reward margins and saturated reward accuracy.
- In-Paradigm Alignment: Each model performs best when aligned with preference data from its own paradigm. Cosine-similarity analysis of preference vectors (see the sketch after this list) shows that in-paradigm data yields more consistent optimization directions.
- Pseudo-labeling application: The aligned model can serve as a data annotator, generating pseudo-labels to train discriminative models in data-scarce scenarios (e.g., singing voice restoration). Voicefixer fine-tuned with pseudo-labels achieves significant improvements across all metrics.
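One plausible reading of that cosine-similarity analysis, sketched below: treat each preference pair's per-metric score deltas (winner minus loser) as a "preference vector" and measure how mutually aligned these vectors are within a dataset. The delta-vector framing and all names here are assumptions for illustration, not the paper's released code:

```python
import numpy as np

def preference_vector(winner: dict[str, float],
                      loser: dict[str, float]) -> np.ndarray:
    """Per-metric score deltas, winner minus loser, in a fixed key order."""
    return np.array([winner[m] - loser[m] for m in sorted(winner)])

def mean_pairwise_cosine(vectors: list[np.ndarray]) -> float:
    """Average cosine similarity over all vector pairs; higher values
    suggest the pairs push the model in more consistent directions."""
    v = np.stack(vectors).astype(float)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    sims = v @ v.T
    iu = np.triu_indices(len(v), k=1)  # upper triangle, excluding diagonal
    return float(sims[iu].mean())
```

On this reading, in-paradigm preference sets that score higher under `mean_pairwise_cosine` would supply gradients that pull the model in more consistent directions, matching the paper's finding.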
## Highlights & Insights
- First systematic introduction of preference alignment to generative speech restoration, covering all three major paradigms: AR, MGM, and FM
- The multi-metric unanimous agreement criterion is elegantly designed to address reward hacking at its source
- Significant improvements are achieved with only ~3K preference pairs, demonstrating exceptional data efficiency
- The in-paradigm alignment principle is discovered and quantitatively explained via cosine similarity analysis of preference vectors
- The application of the aligned model as a pseudo-annotator demonstrates the bridging potential between generative and discriminative paradigms
## Limitations & Future Work
- Whether four automated metrics can fully proxy human auditory preferences warrants further validation
- The substantial size disparity across paradigm subsets in GenSR-Pref (MGM 69K vs. AR/FM 3K) limits fairness of comparison
- The strict unanimous agreement criterion may discard a large number of valuable preference pairs, limiting data utilization
- Only DPO is explored as the alignment algorithm; comparisons with PPO, KTO, and other methods are absent
- The pseudo-labeling application is validated only on singing voice restoration, leaving generalizability insufficiently demonstrated
## Related Work & Insights
- Generative speech restoration: SELM, GenSE, SpeechX (AR paradigm); MaskSR, AnyEnhance (MGM paradigm); SGMSE, FlowSE (continuous dynamics paradigm)
- Post-training alignment in audio: MetricGAN (adversarial training optimizing PESQ); NISQA+PPO-based alignment; UTMOS+DPO-based single-metric alignment
- Preference optimization: DPO, RLHF applied in NLP/vision/TTS; the INTP framework extending DPO to non-autoregressive settings
## Rating
⭐⭐⭐⭐ — The method is concise and effective; the multi-metric unanimous agreement criterion is elegantly designed, and the unified three-paradigm framework demonstrates strong generality. Ablation studies thoroughly validate the core hypothesis. The discovery and quantitative analysis of the in-paradigm alignment principle carry academic value, and the pseudo-labeling application broadens the practical impact of the work.