OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution¶
Conference: CVPR 2026 arXiv: 2603.12811 Code: None Area: Image Generation / Image Super-Resolution Keywords: Real-World Super-Resolution, RLHF, reward model, Online RL, Flow Matching, MLLM, Image Quality Assessment
TL;DR¶
This paper proposes OARS, a framework that systematically addresses human preference alignment in generative real-world image super-resolution for the first time. It introduces COMPASS, an MLLM-based process-aware reward model, and a progressive online reinforcement learning pipeline (cold start → reference-guided RL → non-reference RL), significantly improving perceptual quality while preserving fidelity.
Background & Motivation¶
Problem Definition¶
Real-world image super-resolution (Real-ISR) aims to recover high-fidelity, perceptually pleasing high-resolution (HR) images from low-resolution (LR) inputs degraded by complex and unknown processes. Although diffusion models have brought substantial perceptual quality gains, standard supervised fine-tuning (SFT) suffers from two fundamental limitations: (1) poor generalization to unseen real-world degradations; and (2) lack of a direct optimization mechanism to align generated content with human aesthetic preferences, often resulting in hallucinations or over-smoothing.
Limitations of Prior Work¶
Applying RLHF to Real-ISR faces two critical bottlenecks:
Reward design dilemma: Full-reference (FR) metrics require unavailable ground-truth images; no-reference (NR) metrics lack the fine-grained sensitivity needed to distinguish subtle differences among generative SR outputs. Naively combining FR and NR metrics via static linear weighting ignores degradation severity—potentially under-enhancing high-quality inputs or over-sharpening low-quality ones.
Pseudo-diversity in offline RL: Offline methods such as DP2O-SR construct preference pairs by sampling from the same SFT model with different random seeds. Under the strong spatial constraints of SR, however, these noise variations degenerate into random texture hallucinations rather than genuine structural diversity, causing optimization to collapse over a narrow candidate pool.
Mechanism¶
The paper introduces two key innovations: (1) a process-aware, quality-adaptive reward model that evaluates the LR→SR transformation process rather than static outputs; and (2) an online exploration strategy that breaks the pseudo-diversity bottleneck.
Method¶
Overall Architecture¶
OARS comprises two main components: the COMPASS reward model and a progressive online RL framework.
┌─────────────────────────────────────────────────────────┐
│ OARS Overall Pipeline │
│ │
│ COMPASS-20K Dataset ──→ COMPASS Reward Model (MLLM) │
│ │ │ │
│ ▼ ▼ │
│ Stage 1: Cold Start → Stage 2: FR-RL → Stage 3: NR-RL │
│ (Flow Matching SFT) (with GT ref.) (no ref., COMPASS) │
│ │ │ │ │
│ └──────────────────────┴──────────────────┘ │
│ LoRA merging at inference │
└─────────────────────────────────────────────────────────┘
Key Design 1: COMPASS Reward Model¶
COMPASS-20K Dataset¶
- Data sources: 800 synthetic LR images from DIV2K (Real-ESRGAN-style degradation) + 1,600 real-world LQ images covering noise, compression artifacts, defocus blur, motion blur, etc.
- SR outputs: 12 mainstream enhancement algorithms (DiffBIR, OSEDiff, SeeSR, etc.) × 2,400 inputs → 28,800 LR–SR pairs
- Annotation dimensions: Fidelity, Perceptual Gain, and textual descriptions
Three-Stage Perceptual Annotation Pipeline¶
This is one of the most elegant designs in the paper, addressing the core challenge of obtaining quality labels that are both globally comparable and intra-group discriminative:
| Stage | Content | Output |
|---|---|---|
| Stage 1: Global Anchor Scoring | Q-Insight independently scores LR and SR: \(Q_{LR}, Q_{SR} \in [1,5]\) | Globally comparable quality anchors |
| Stage 2: Intra-Group Ranking | A pairwise comparison model (trained on DiffIQA data) performs exhaustive pairwise comparisons among all SR outputs for each LR | Intra-group relative ranking \(r \in [0,1]\) |
| Stage 3: Rank-Guided Calibration | Linear calibration per group: \(\hat{Q}_{SR} = \alpha^* \cdot r + \beta^*\), preserving rankings while aligning to the global scale | Calibrated SR quality scores |
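Stage 3's per-group line fit can be sketched as an ordinary least-squares fit with a positivity constraint on the slope (an illustrative sketch; the paper's exact procedure for choosing \(\alpha^*, \beta^*\) may differ):

```python
def calibrate_group(anchor_scores, rank_scores):
    """Rank-guided calibration for one LR group (illustrative sketch).

    anchor_scores: Stage-1 global quality anchors for each SR output (in [1, 5]).
    rank_scores:   Stage-2 intra-group ranking scores r (in [0, 1]).
    Fits Q_hat = alpha * r + beta by least squares; clamping alpha to be
    positive keeps the intra-group ordering while the fit aligns scores
    to the global anchor scale.
    """
    n = len(rank_scores)
    mean_r = sum(rank_scores) / n
    mean_q = sum(anchor_scores) / n
    var_r = sum((r - mean_r) ** 2 for r in rank_scores)
    cov_rq = sum((r - mean_r) * (q - mean_q)
                 for r, q in zip(rank_scores, anchor_scores))
    alpha = max(cov_rq / var_r, 1e-6)  # positive slope => ranking preserved
    beta = mean_q - alpha * mean_r
    return [alpha * r + beta for r in rank_scores]

# toy group: 4 SR outputs of the same LR image
q_hat = calibrate_group([3.2, 4.1, 2.8, 3.9], [0.4, 0.9, 0.1, 0.7])
```

Because the slope is positive, the ordering of `q_hat` matches the Stage-2 ranking, while its mean coincides with the mean of the Stage-1 anchors, which is exactly the "preserve rank, align scale" behavior the table describes.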
Input Quality-Adaptive Reward Mechanism¶
The final reward of COMPASS is \(R = F \cdot Q_{LR} + F^{Q_{LR}/\gamma} \cdot \Delta Q\), where \(\Delta Q = Q_{SR} - Q_{LR}\) and \(\gamma = 7\).
- First term \(F \cdot Q_{LR}\): measures preservation of the original quality of the input image.
- Second term \(F^{Q_{LR}/\gamma} \cdot \Delta Q\): perceptual gain, adaptively controlled by input quality.
- When input quality is high, the exponent \(Q_{LR}/\gamma\) is large, making the reward highly sensitive to fidelity degradation → encouraging conservative enhancement.
- When input quality is low, the fidelity constraint is relaxed → allowing more aggressive perceptual improvement.
- This dynamic gating ensures that perceptual enhancement is strictly constrained by content preservation.
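The two terms can be combined in a small sketch (ranges for \(F\), \(Q_{LR}\), and \(Q_{SR}\) are assumptions based on the annotation scales above; the paper may normalize differently):

```python
def compass_reward(F, q_lr, q_sr, gamma=7.0):
    """Quality-adaptive COMPASS reward (sketch assembled from the description).

    F:     fidelity of the LR->SR transformation, assumed in [0, 1]
    q_lr:  input quality anchor, assumed in [1, 5]
    q_sr:  calibrated output quality score, assumed in [1, 5]
    """
    delta_q = q_sr - q_lr
    preservation = F * q_lr                 # keep what the input already had
    gain = (F ** (q_lr / gamma)) * delta_q  # perceptual gain, gated by input quality
    return preservation + gain

# high-quality input: the gain is discounted sharply when fidelity drops
r_hi = compass_reward(F=0.6, q_lr=4.5, q_sr=5.0)
# low-quality input: the same fidelity passes through much more of the gain
r_lo = compass_reward(F=0.6, q_lr=1.5, q_sr=2.0)
```

With \(F = 0.6\), the gating factor \(F^{Q_{LR}/\gamma}\) is about 0.90 for \(Q_{LR} = 1.5\) but only about 0.72 for \(Q_{LR} = 4.5\), reproducing the conservative-vs-aggressive behavior described in the bullets above.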
Key Design 2: Progressive Online RL¶
Stage 1: Cold Start (Flow Matching SFT)¶
The model is trained on large-scale LR–HR paired data using a flow matching objective to learn basic SR capabilities.
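For reference, a standard conditional flow matching objective takes the form below (a sketch of the usual formulation; the paper's exact parameterization and conditioning may differ):

```latex
\mathcal{L}_{\mathrm{FM}}
  = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim \mathcal{N}(0, I),\; (x_{LR},\, x_{HR})}
    \left[ \left\| v_\theta\!\left(x_t,\, t,\, x_{LR}\right) - \left(x_{HR} - x_0\right) \right\|_2^2 \right],
\qquad x_t = (1 - t)\, x_0 + t\, x_{HR}
```

where \(v_\theta\) is the learned velocity field conditioned on the LR input, and the linear path \(x_t\) interpolates between Gaussian noise and the HR target.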
Stage 2: Full-Reference RL¶
Applying RL directly on the SFT model leads to training instability and reward hacking. This stage serves as a buffer between SFT and non-reference optimization:
- Fidelity supervision: DISTS is computed directly between the SR output and the GT, rather than relying on a learned reward model to predict fidelity.
- Shallow LoRA optimization: RL updates are applied via LoRA on the base model rather than on the SFT weights, motivated by three reasons:
- The base model's parameter distribution is close to that of the SFT model, enabling stable merging.
- The base model has higher sampling stochasticity, facilitating exploration.
- It is less susceptible to reward hacking.
Negative-Aware Objective¶
Implicit positive and negative policy directions are constructed from the sampled candidate group, and the final RL objective reinforces the positive direction while suppressing the negative one, weighted by \(r\), the optimality probability obtained after intra-group reward normalization and variance filtering. Groups with high variance and low mean are discarded to avoid ambiguous supervision.
Stage 3: Non-Reference RL¶
Training continues on real-world LQ data without GT images, with rewards provided entirely by COMPASS. At inference time, the final LoRA parameters \(\Delta_{NR}\) are merged into the SFT model.
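The inference-time merge amounts to a single weight addition, \(W = W_{SFT} + \Delta_{NR}\). A minimal sketch using plain Python lists and the standard LoRA \(\alpha/\mathrm{rank}\) scaling (the actual implementation would operate on model tensors; names here are hypothetical):

```python
def merge_lora(w_sft, lora_A, lora_B, alpha=64.0, rank=32):
    """Merge a LoRA update into an SFT weight matrix (illustrative sketch).

    w_sft:  SFT weight matrix, shape (out_dim, in_dim), as nested lists
    lora_A: down-projection, shape (r, in_dim)
    lora_B: up-projection, shape (out_dim, r)
    Returns w_sft + (alpha / rank) * B @ A without modifying the input.
    """
    scale = alpha / rank
    r = len(lora_A)
    merged = [row[:] for row in w_sft]  # copy so the SFT weights stay intact
    for i in range(len(merged)):
        for j in range(len(merged[i])):
            delta = sum(lora_B[i][k] * lora_A[k][j] for k in range(r))
            merged[i][j] += scale * delta
    return merged
```

After the merge, sampling runs at the cost of the plain SFT model; no LoRA branches remain at inference.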
Loss & Training¶
- Base model: Qwen-Image-Edit-2509
- LoRA rank=32, alpha=64
- 6-step sampling during training, 40 steps at inference
- \(K=24\) candidate samples per LR image
- Group filtering threshold: groups with mean > 0.9 and variance < 0.05 are discarded (near-identical samples provide no discriminative value)
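The group-filtering rule above can be sketched directly (thresholds taken from the bullet list; function and variable names are hypothetical):

```python
def filter_groups(groups, mean_thresh=0.9, var_thresh=0.05):
    """Drop candidate groups whose rewards are near-identical (sketch).

    groups: list of per-LR reward lists, one reward per candidate sample,
            assumed normalized to [0, 1].
    A group with mean > 0.9 and variance < 0.05 contains near-identical
    samples and provides no discriminative learning signal, so it is dropped.
    """
    kept = []
    for rewards in groups:
        n = len(rewards)
        mean = sum(rewards) / n
        var = sum((x - mean) ** 2 for x in rewards) / n
        if mean > mean_thresh and var < var_thresh:
            continue  # uninformative: all candidates look equally good
        kept.append(rewards)
    return kept
```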
Key Experimental Results¶
Main Results: SR Performance on Three Datasets (Table 2, RealSR subset)¶
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | DISTS↓ | LIQE↑ | MUSIQ↑ | MANIQA↑ | Q-Insight↑ | TOPIQ↑ |
|---|---|---|---|---|---|---|---|---|---|
| DiffBIR | 23.20 | 0.6346 | 0.3350 | 0.2162 | 3.553 | 65.25 | 0.462 | 3.530 | 0.603 |
| SeeSR | 24.34 | 0.7187 | 0.2754 | 0.2134 | 3.394 | 65.53 | 0.486 | 3.285 | 0.625 |
| OSEDiff | 23.07 | 0.6850 | 0.2941 | 0.2109 | 4.068 | 68.95 | 0.488 | 3.712 | 0.644 |
| UARE | 21.38 | 0.6464 | 0.3095 | 0.2344 | 4.066 | 69.67 | 0.526 | 3.664 | 0.680 |
| Qwen-SFT | 22.71 | 0.6462 | 0.3100 | 0.2203 | 3.815 | 68.57 | 0.490 | 3.545 | 0.640 |
| OARS | 22.36 | 0.6481 | 0.3095 | 0.2244 | 4.305 | 71.41 | 0.528 | 3.701 | 0.680 |
Key Findings: OARS consistently achieves the best or second-best performance across all NR metrics, while FR metrics (PSNR/SSIM/LPIPS/DISTS) show virtually no degradation relative to Qwen-SFT. Compared to perceptual-oriented methods (PURE, UARE), OARS achieves superior FR metrics, demonstrating that the reward design and RL strategy effectively balance fidelity and perceptual enhancement.
Ablation Study: Reward Model Components (Table 3)¶
| Case | Score Calibration | Explicit Fidelity | Quality-Adaptive γ | Accuracy |
|---|---|---|---|---|
| 1 | ✗ | ✗ | ✗ | 78.8% |
| 2 | ✓ | ✗ | ✗ | 81.5% |
| 3 | ✓ | ✓ | ✗ | 82.3% |
| 4 | ✓ | ✓ | γ=5 | 82.7% |
| 5 | ✓ | ✓ | γ=7 | 83.1% |
| 6 | ✓ | ✓ | γ=9 | 82.8% |
Three-stage calibration yields +2.7%; explicit fidelity modeling contributes +0.8%; the quality-adaptive \(\gamma\) provides a further +0.8%.
Ablation Study: RL Stages and Initialization Strategy (Table 5, RealSR)¶
| Method | RL Stage | Initialization | PSNR↑ | LIQE↑ | MUSIQ↑ | TOPIQ↑ |
|---|---|---|---|---|---|---|
| Qwen-SFT | - | - | 22.71 | 3.815 | 68.57 | 0.640 |
| Case 1 | stage1 | base | 22.52 | 4.235 | 71.02 | 0.674 |
| Case 2 | stage1+2 | base | 22.36 | 4.305 | 71.41 | 0.680 |
| Case 3 | stage1 | sft | 22.15 | 4.078 | 70.60 | 0.676 |
| Case 4 | stage1+2 | sft | 21.31 | 4.094 | 70.56 | 0.677 |
Key Findings: Applying RL directly on the SFT model (Cases 3–4) leads to continuous degradation of FR metrics, with PSNR dropping from 22.71 to 21.31 after stage1+2. In contrast, shallow LoRA optimization on the base model (Cases 1–2) is substantially more robust, with only a marginal PSNR decrease and more pronounced NR metric gains. This validates the superiority of the online exploration strategy via shallow LoRA on the base model.
Other Key Findings¶
- User study: Among 27 expert evaluators, OARS received 47.62% of votes, far exceeding the second-ranked method DP2O-SR (27.68%).
- SRIQA-Bench preference accuracy: COMPASS achieves 83.1% All-Acc, surpassing all GT-Ref and GT-Free baselines.
- Generalizability: Applying OARS to a Flux backbone is also effective, with MANIQA improving from 0.469 to 0.504.
- Comparison with Flow-GRPO: Forward-process RL (DiffusionNFT paradigm) is 5–10× more efficient than trajectory-level RL and more stable under the strong spatial constraints of SR.
Highlights & Insights¶
- Process-oriented evaluation paradigm: The paper shifts SR evaluation from an "output-centric" to a "process-aware" perspective, assessing the LR→SR transformation process rather than static outputs. This conceptual shift enables unified modeling of fidelity and perceptual gain.
- Three-stage annotation pipeline: The pipeline elegantly resolves the tension between global comparability and intra-group fine-grained discrimination: global anchors provide comparability, intra-group rankings provide granularity, and linear calibration achieves a unified scale.
- Dual role of shallow LoRA: Applying LoRA on the base model not only provides higher sampling stochasticity for online exploration but also reduces reward hacking risk by avoiding direct modification of SFT weights. Merging the LoRA back into the SFT model at inference time is a particularly clean design.
- Input quality-adaptive gating: The design of \(F^{Q_{LR}/\gamma}\) lets the reward function adjust the fidelity-perception trade-off automatically based on degradation severity, encouraging conservative enhancement for high-quality inputs while permitting greater perceptual improvement for low-quality ones. The mechanism is intuitive and well-motivated.
Limitations & Future Work¶
- Computational cost: Training requires 8×H20 GPUs for RL and 8×A100 GPUs for deploying the reward server, imposing extremely high resource requirements.
- Multi-stage training complexity: The three-stage progressive training pipeline (SFT→FR-RL→NR-RL) increases engineering complexity and the hyperparameter search space.
- Reward model generalizability: COMPASS is validated on SRIQA-Bench, which is relatively small (100 LR images); its generalization to more diverse distributions remains to be verified.
- Lack of explicit degradation-aware modeling: Although degradation severity is implicitly captured via \(Q_{LR}\), degradation type is not explicitly modeled, potentially leading to suboptimal handling of specific degradation patterns (e.g., severe compression artifacts).
- Only 4× SR validated: Applicability to other upscaling factors and different resolution ranges has not been investigated.
Related Work & Insights¶
- DiffusionNFT (forward-process RL) and Flow-GRPO (trajectory-level RL): A comparison of these two RL paradigms suggests that forward-process RL is more efficient and stable for generation tasks with strong conditional constraints.
- DP2O-SR (offline DPO): This paper clearly articulates the root cause of pseudo-diversity in offline methods when applied to SR tasks.
- Q-Insight: Used as global quality anchors and the base comparison model, demonstrating the potential of MLLM-based IQA.
- DiffIQA/SRIQA-Bench: These provide benchmarks for SR quality assessment, but their pairwise annotations are incomplete; the three-stage pipeline in this paper is an important complement.
- Implications: The concept of process-aware rewards can be generalized to other image enhancement tasks (denoising, dehazing, HDR, etc.), and the input quality-adaptive mechanism is applicable to any scenario requiring a balance between fidelity and enhancement.
Rating¶
| Dimension | Score (1–5) | Notes |
|---|---|---|
| Novelty | 4.5 | The combination of process-aware rewards and progressive online RL represents the first systematic attempt in this domain. |
| Technical Depth | 4.5 | The three-stage annotation, adaptive reward formula, and shallow LoRA exploration strategy are all well-motivated with clear theoretical grounding. |
| Experimental Thoroughness | 4.5 | Three datasets, multiple metrics, user study, extensive ablations, and multi-backbone validation. |
| Writing Quality | 4.0 | Structure is clear, though the intuitive explanation behind some formulas and design choices could be more thorough. |
| Value | 3.5 | The method is effective but resource-intensive, limiting its practical deployment scope. |
| Overall | 4.2 | Makes systematic contributions to RLHF for generative SR, with a high level of completeness. |