DP²O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution¶
Conference: NeurIPS 2025 arXiv: 2510.18851 Code: github.com/cswry/DP2O-SR Area: LLM Alignment Keywords: Image Super-Resolution, Preference Optimization, Diffusion Models, Perceptual Quality, DPO
TL;DR¶
This paper proposes DP²O-SR, a framework that exploits the inherent stochasticity of diffusion models to generate diverse super-resolution outputs, constructs preference pairs via a hybrid perceptual reward, and introduces a Hierarchical Preference Optimization (HPO) strategy to adaptively weight training pairs — significantly improving perceptual quality in real-world image super-resolution without any human annotations.
Background & Motivation¶
Real-world Image Super-Resolution (Real-ISR) aims to reconstruct high-resolution images from low-resolution inputs. Traditional methods focus on pixel-level fidelity (PSNR/SSIM) but tend to produce over-smoothed results. Recent methods based on pretrained Text-to-Image (T2I) diffusion models (e.g., Stable Diffusion, FLUX) demonstrate strong capability in synthesizing rich details, yet they share a fundamental issue: the inherent stochasticity of diffusion models — different noise inputs lead to outputs with substantially varying perceptual quality.
Existing methods typically treat this stochasticity as a defect and attempt to eliminate it by stabilizing the generation process or training single-step models. This paper takes a fundamentally different perspective: treating stochasticity as a source of high-quality supervision signals. Outputs from different noise seeds form a diverse distribution over perceptual quality, which can be leveraged to construct preference pairs for optimization.
Furthermore, existing DPO methods (e.g., Diff-DPO) construct only a single "best vs. worst" preference pair from different model outputs, yielding limited supervision. There is also a lack of carefully designed perceptual rewards and preference data curation strategies tailored to Real-ISR.
Method¶
Overall Architecture¶
The DP²O-SR framework consists of three core components:
- Diverse ISR Sample Generation: A frozen reference model \(\pi_{\text{ref}}\) samples \(M\) SR candidates from the same LR input using different noise seeds.
- Perceptual Reward Ranking and Preference Pair Construction: A hybrid IQA reward ranks candidates; top-\(N\) and bottom-\(N\) outputs are selected to form \(N^2\) preference pairs.
- Hierarchical Preference Optimization (HPO): Training pairs are adaptively weighted to focus on the most informative comparisons.
Key Designs¶
1. Hybrid Perceptual Reward Design¶
The reward signal combines full-reference (FR) and no-reference (NR) metrics:
- FR set \(\mathcal{FR}\): LPIPS, TOPIQ-FR, AFINE-FR — promoting structural fidelity and suppressing hallucinated content.
- NR set \(\mathcal{NR}\): MANIQA, MUSIQ, CLIPIQA+, TOPIQ-NR, AFINE-NR, Q-Align — encouraging realism and aesthetic consistency.
For each candidate \(I_m\) and metric \(\phi\), raw scores \(s_m^\phi\) are direction-aligned (so that higher always means better) and min-max normalized across the \(M\) candidates:

\[\hat{s}_m^\phi = \frac{s_m^\phi - \min_{j} s_j^\phi}{\max_{j} s_j^\phi - \min_{j} s_j^\phi}\]

The final reward equally weights the averaged FR and NR contributions:

\[R_m = \frac{1}{2}\left(\frac{1}{|\mathcal{FR}|}\sum_{\phi \in \mathcal{FR}} \hat{s}_m^\phi + \frac{1}{|\mathcal{NR}|}\sum_{\phi \in \mathcal{NR}} \hat{s}_m^\phi\right)\]
Key finding: using only FR rewards leads to over-smoothing; using only NR rewards introduces hallucinated details; the hybrid reward preserves structural consistency while enhancing realism.
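The reward computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name `hybrid_reward` and the dictionary-based interface are assumptions, and only LPIPS is treated as a lower-is-better metric here.

```python
import numpy as np

def hybrid_reward(scores_fr, scores_nr, lower_better=frozenset({"LPIPS"})):
    """Sketch of the hybrid perceptual reward over M candidates.

    scores_fr / scores_nr: dict mapping metric name -> array of M raw scores.
    lower_better: metric names whose scores are flipped for direction alignment.
    Returns an array of M rewards in [0, 1].
    """
    def normalize(s, flip):
        s = -s if flip else s                  # direction alignment
        rng = s.max() - s.min()
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

    # Average normalized scores within each metric set, then weight equally.
    fr = np.mean([normalize(np.asarray(v, float), k in lower_better)
                  for k, v in scores_fr.items()], axis=0)
    nr = np.mean([normalize(np.asarray(v, float), k in lower_better)
                  for k, v in scores_nr.items()], axis=0)
    return 0.5 * fr + 0.5 * nr
```

With one FR and one NR metric over four candidates, the best candidate receives reward 1 and the worst receives 0, which is what the subsequent top-\(N\)/bottom-\(N\) selection relies on.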
2. Preference Pair Curation Strategy¶
Unlike Diff-DPO, which constructs a single "best vs. worst" pair, this work samples \(M\) outputs from the same model and selects top-\(N\) and bottom-\(N\) to form \(N^2\) preference pairs. Two key control parameters are introduced:
- Sample count \(M\): Larger \(M\) improves perceptual diversity and training stability, with diminishing returns.
- Selection ratio \(N/M\): Small \(N/M\) yields stronger reward contrast; large \(N/M\) increases coverage and diversity.
Key finding (architecture sensitivity):
- Smaller model C-SD2 (0.8B): optimal at \(N/M = 1/4\), benefiting from more redundancy and smoother gradients.
- Larger model C-FLUX (12B): optimal at \(N/M = 1/16\), capable of learning effectively from high-contrast signals.
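The curation step itself is simple: rank the \(M\) candidates by reward, then cross the top-\(N\) winners with the bottom-\(N\) losers. A minimal sketch (the function name `build_preference_pairs` is an assumption, not the paper's API):

```python
import numpy as np

def build_preference_pairs(rewards, n):
    """Form N^2 (winner, loser) index pairs from M reward-ranked candidates.

    rewards: 1-D array of length M (one reward per SR candidate).
    n: number of candidates kept at each extreme (N in the paper's notation).
    """
    order = np.argsort(rewards)            # ascending by reward
    top, bottom = order[-n:], order[:n]    # top-N winners, bottom-N losers
    return [(int(w), int(l)) for w in top for l in bottom]
```

For \(M=8, N=2\) this yields exactly \(N^2 = 4\) pairs, each pairing a high-reward output against a low-reward one; the selection ratio \(N/M\) then controls the reward contrast between the two sides.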
3. Hierarchical Preference Optimization (HPO)¶
Adaptive weighting is applied at two levels:
Intra-group weight (focusing on pairs with larger reward gaps):

\[w_{\text{intra}}(\mathbf{x}_0^w, \mathbf{x}_0^l) = |R_w - R_l| + (1 - \mu_{\text{gap}})\]
where \(\mu_{\text{gap}}\) is the mean reward gap across all pairs within the group.
Inter-group weight (prioritizing LR input groups with higher perceptual diversity):

\[w_{\text{inter}}(g) = \sigma_g + (1 - \mu_\sigma)\]
where \(\sigma_g\) is the reward standard deviation within group \(g\), and \(\mu_\sigma\) is the mean standard deviation across all groups.
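The two formulas above can be combined into per-pair weights as follows. This is a sketch under the stated definitions only (the function name `hpo_weights` and the list-of-groups interface are assumptions; the population standard deviation is used for \(\sigma_g\)):

```python
import numpy as np

def hpo_weights(group_rewards, pairs_per_group):
    """Two-level HPO weighting: w = w_intra * w_inter per preference pair.

    group_rewards: list of 1-D reward arrays, one per LR input group.
    pairs_per_group: list of (winner_idx, loser_idx) pair lists, one per group.
    Returns one weight array per group.
    """
    sigmas = np.array([r.std() for r in group_rewards])   # sigma_g per group
    mu_sigma = sigmas.mean()                              # mean across groups
    weights = []
    for sigma_g, (r, pairs) in zip(sigmas, zip(group_rewards, pairs_per_group)):
        gaps = np.array([abs(r[w] - r[l]) for w, l in pairs])
        mu_gap = gaps.mean()                              # mean gap in group
        w_intra = gaps + (1 - mu_gap)                     # favor large gaps
        w_inter = sigma_g + (1 - mu_sigma)                # favor diverse groups
        weights.append(w_intra * w_inter)
    return weights
```

Note how a high-diversity group (rewards 1.0 vs. 0.0) ends up with a larger weight than a low-diversity group (0.6 vs. 0.4), which is exactly the prioritization the inter-group term is meant to encode.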
Loss & Training¶
The final loss extends the Diff-DPO objective with the hierarchical weights:

\[\mathcal{L}_{\text{HPO}} = \mathbb{E}_{(\mathbf{x}_0^w,\, \mathbf{x}_0^l)}\left[\, w \cdot \ell(\mathbf{x}_0^w, \mathbf{x}_0^l)\,\right]\]

where \(w = w_{\text{intra}} \cdot w_{\text{inter}}\) and \(\ell(\cdot)\) denotes the per-pair Diff-DPO loss.
Training configuration: batch size 1024, learning rate \(2 \times 10^{-5}\), \(\beta = 5000\), 1000 training steps, 8×A800 GPUs. C-SD2 uses \(N=8, M=32\); C-FLUX uses \(N=4, M=64\).
Key Experimental Results¶
Main Results¶
Evaluation is conducted on Syn-Test and RealSR benchmarks across 14 IQA metrics in 4 categories.
Key improvements on Syn-Test (DP²O-SR vs. baseline):
| Metric | C-SD2 | DP²O-SR(SD2) | C-FLUX | DP²O-SR(FLUX) |
|---|---|---|---|---|
| MANIQA↑ | 0.6684 | 0.7165 | 0.6857 | 0.7199 |
| CLIPIQA+↑ | 0.7595 | 0.8124 | 0.7473 | 0.7993 |
| QALIGN↑ | 4.2481 | 4.5526 | 4.4266 | 4.7060 |
| VQ-R1↑ | 4.43 | 4.57 | 4.53 | 4.65 |
Results on RealSR (out-of-domain generalization, with standard deviation):
| Metric | C-SD2 | DP²O-SR(SD2) | C-FLUX | DP²O-SR(FLUX) |
|---|---|---|---|---|
| MANIQA↑ | 0.664±0.019 | 0.705±0.012 | 0.665±0.025 | 0.694±0.013 |
| MUSIQ↑ | 70.34±1.79 | 73.24±0.81 | 69.70±2.15 | 72.78±0.93 |
| QALIGN↑ | 3.630±0.187 | 4.017±0.117 | 3.654±0.231 | 4.143±0.113 |
DP²O-SR not only improves perceptual quality but also substantially reduces output variance (reflected in markedly smaller standard deviations), indicating improved generation stability.
Ablation Study¶
Effect of sample count \(M\): Increasing \(M\) consistently improves performance with diminishing returns; saturation is approached at \(M=64\).
Effect of selection ratio \(N/M\):
- C-SD2 is optimal at \(N/M = 1/4\); excessively low ratios cause reward collapse.
- C-FLUX is optimal at \(N/M = 1/16\); performance degrades at \(N/M = 1/2\).
Effectiveness of HPO: Hierarchical weighting substantially improves training efficiency and final perceptual quality over uniform-weight Diff-DPO.
Key Findings¶
- Perception–distortion trade-off: Improvements in perceptual quality accompany slight decreases in PSNR/SSIM, consistent with classical theory.
- Strong generalization: Superior performance is also observed on metrics not used during training, such as VQ-R1 and NIMA.
- Architecture generality: The method is effective on both diffusion models (SD2, UNet) and flow-based models (FLUX, DiT).
- Fast convergence: 500 steps suffice to surpass strong baselines such as SeeSR and OSEDiff.
- Improved robustness: DP²O-SR significantly improves Worst@M, ensuring even the worst-case outputs reach high quality.
Highlights & Insights¶
- Turning stochasticity into an advantage: Reframing diffusion model randomness as a source of preference learning signals is a clever and non-obvious insight.
- Balanced hybrid reward: Equal weighting of FR and NR metrics avoids the biases introduced by relying on either alone.
- Two-level adaptive weighting in HPO: Intra-group weighting targets hard pairs; inter-group weighting prioritizes informative inputs — a well-motivated design.
- Architecture-aware hyperparameter selection: Reveals a systematic relationship between model capacity and optimal preference pair curation strategy.
- No human annotations required: Fully automated preference data construction offers strong scalability.
Limitations & Future Work¶
- High offline candidate generation cost: Sampling 64 outputs for 30K images requires 168–432 GPU hours; IQA annotation adds another 72 hours.
- Degradation in PSNR/SSIM: Although theoretically expected, this may be problematic in applications with strict fidelity requirements.
- Robustness of reward metric selection: Whether equal weighting of FR and NR is universally optimal remains an open question; different applications may require adjustment.
- Validation limited to ControlNet architectures: Generalizability to other Real-ISR architectures (e.g., end-to-end trained models) has not been verified.
- Potential reward hacking: Whether prolonged training causes over-optimization toward specific IQA metrics warrants investigation.
Related Work & Insights¶
- Diff-DPO: The direct baseline of this work; constructs only a single preference pair. This paper effectively extends it via multi-sampling and ranking.
- RLHF → DPO: Transfers preference alignment ideas from the LLM domain to visual generation — a productive cross-domain adaptation.
- Perception–distortion trade-off theory: Provides theoretical grounding for the observed trade-off between perceptual quality and pixel-level accuracy.
- Inspiration: The preference alignment paradigm is extendable to other visual generation tasks (video super-resolution, image restoration, image generation); the hybrid reward design is broadly applicable.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Converting diffusion stochasticity into preference supervision signals is innovative; HPO design is creative.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 14 metrics, two architectures, extensive ablations; very comprehensive.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with natural logical progression.
- Value: ⭐⭐⭐⭐ — Provides a systematic framework for preference alignment in visual generation.