OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution¶
Conference: CVPR 2026 arXiv: 2603.12811 Code: Unavailable (as of March 2026) Area: Image Generation Keywords: Real-World Super-Resolution, Online RL, MLLM Reward, Process-Aware, Perception-Fidelity Trade-off
TL;DR¶
This paper proposes OARS, a framework that aligns generative real-world image super-resolution models with human visual preferences via COMPASS—an MLLM-based process-aware reward model—and a progressive online reinforcement learning pipeline, achieving adaptive balance between perceptual quality and fidelity.
Background & Motivation¶
- Background: Real-world image super-resolution (Real-ISR) must handle complex, unknown degradations. Early CNN-based methods trained with L1/L2 losses produce over-smoothed textures, while diffusion models improve perceptual quality but, under standard SFT, generalize poorly to unseen degradations and offer no mechanism for directly optimizing toward human aesthetic preferences.
- Limitations of Prior Work: Using existing IQA metrics as RL rewards has fundamental drawbacks: full-reference (FR) metrics are inapplicable in real-world scenarios without ground truth, while no-reference (NR) metrics lack the fine-grained sensitivity needed to distinguish subtle differences among generative SR outputs. Naively combining FR and NR metrics linearly ignores the varying degrees of input degradation. Offline RL methods (e.g., DPO) face a "pseudo-diversity" problem: under the strong spatial constraints of SR, outputs sampled with different noise seeds differ only in random texture hallucinations rather than genuine structural variation, which limits the value of the resulting preference pairs.
- Key Challenge: The absence of a process-aware, quality-adaptive reward model and an online exploration strategy capable of overcoming the pseudo-diversity bottleneck.
Core Problem¶
How to design a process-aware, quality-adaptive reward model and an online exploration strategy that breaks the pseudo-diversity bottleneck, enabling effective alignment of generative Real-ISR models with human visual preferences.
Method¶
Overall Architecture¶
OARS consists of two core components: (1) the COMPASS reward model—an MLLM-based process-aware scorer that evaluates fidelity preservation and perceptual gain during LR→SR conversion; and (2) a progressive online RL framework—comprising cold-start SFT, full-reference RL, and no-reference RL stages, with shallow LoRA optimization on the base model to enable on-policy exploration.
Key Designs¶
- COMPASS-20K Dataset and Three-Stage Annotation Pipeline:
- The dataset builds on 2,400 input images (800 synthetic DIV2K LR + 1,600 real LQ images), each processed by 12 SR algorithms to yield 28,800 LR-SR pairs.
- Fidelity annotation: On the synthetic subset, DISTS(SR, GT) distances are normalized to \([0, 1]\).
- Perceptual quality gain annotation (three stages):
- Stage 1 — Global anchor scoring: Q-Insight independently scores LR and SR images to obtain globally comparable quality scores.
- Stage 2 — Intra-group ranking: A dedicated pairwise comparison model (trained on DiffIQA) exhaustively compares all SR outputs for the same LR input and aggregates results into a continuous ranking score \(r \in [0, 1]\).
- Stage 3 — Ranking-guided calibration: Linear calibration (via least-squares estimation of \(\alpha^*\) and \(\beta^*\)) aligns intra-group rankings with the global scale; a minimal sketch follows this item.
- Qwen3-VL-32B generates explanatory text descriptions; manual inspection removes conflicting samples in the top/bottom 5%.
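To make Stage 3 concrete, here is a minimal NumPy sketch of ranking-guided calibration for a single LR group. The function name and the closed-form solve are illustrative assumptions; the paper states only that \(\alpha^*\) and \(\beta^*\) are estimated by least squares.

```python
import numpy as np

def calibrate_group(rank_scores: np.ndarray, anchor_scores: np.ndarray) -> np.ndarray:
    """Stage 3: align intra-group ranking scores r in [0, 1] with the
    globally comparable anchor scale via a least-squares linear map.

    rank_scores:   Stage-2 continuous ranking scores for one LR group.
    anchor_scores: Stage-1 global anchor (Q-Insight) scores of the same SR outputs.
    """
    # Solve min_{alpha, beta} || alpha * r + beta - q ||^2 in closed form.
    A = np.stack([rank_scores, np.ones_like(rank_scores)], axis=1)
    (alpha_star, beta_star), *_ = np.linalg.lstsq(A, anchor_scores, rcond=None)
    # Calibrated scores keep the fine-grained intra-group ordering while
    # landing on the globally comparable anchor scale.
    return alpha_star * rank_scores + beta_star
```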
- COMPASS Reward Function — Input Quality-Adaptive Mechanism:
- The reward model is obtained by full-parameter SFT of QwenVL-8B, which jointly predicts fidelity \(F\), input quality \(Q_{LR}\), and output quality \(Q_{SR}\).
- Reward formula: \(R = F \cdot Q_{LR} + F^{Q_{LR}/\gamma} \cdot \Delta Q\), where \(\Delta Q = Q_{SR} - Q_{LR}\) and \(\gamma = 7\).
- The first term \(F \cdot Q_{LR}\) measures preservation of original quality; the second term uses the exponent \(Q_{LR}/\gamma\) to realize adaptive control.
- High-quality input → large exponent → high sensitivity to fidelity degradation → conservative enhancement encouraged; low-quality input → small exponent → relaxed fidelity constraint → greater perceptual improvement permitted.
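The adaptive mechanism is easiest to see in code. The sketch below transcribes the reward formula with \(\gamma = 7\), assuming all predicted quantities lie in \([0, 1]\) as the annotation pipeline suggests.

```python
def compass_reward(f: float, q_lr: float, q_sr: float, gamma: float = 7.0) -> float:
    """COMPASS reward: R = F * Q_LR + F^(Q_LR / gamma) * (Q_SR - Q_LR).

    f    -- predicted fidelity of the LR->SR transformation, in [0, 1]
    q_lr -- predicted input (LR) quality, in [0, 1]
    q_sr -- predicted output (SR) quality, in [0, 1]
    """
    # The exponent q_lr / gamma gates the perceptual-gain term: the better
    # the input, the faster F^(q_lr / gamma) decays as fidelity slips.
    return f * q_lr + f ** (q_lr / gamma) * (q_sr - q_lr)

# High-quality input: the gain term is discounted more at the same fidelity.
print(compass_reward(f=0.6, q_lr=0.9, q_sr=0.95))
# Low-quality input: fidelity is barely penalized, so enhancement pays off.
print(compass_reward(f=0.6, q_lr=0.2, q_sr=0.7))
```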
- Progressive Online RL (Three Stages):
- Cold-start stage: Trains on LR-HR paired data with a Flow Matching objective to acquire basic SR capability.
- Full-reference RL stage: On data with available ground truth, consistency rewards are computed directly via DISTS (rather than via reward model prediction) to avoid reward hacking. A key design choice is applying LoRA to the base model (rather than the SFT model)—the base model's higher sampling stochasticity facilitates better exploration.
- No-reference RL stage: On real LQ data without ground truth, COMPASS provides the sole reward signal; LoRA fine-tuning continues on the base model.
- At inference time, the final LoRA parameters \(\Delta_{NR}\) are merged into the cold-start SFT model.
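Deployment then reduces to a weight merge. Below is a minimal PyTorch sketch assuming the standard LoRA parametrization \(W + (\alpha/r)BA\) per layer; the paper states only that \(\Delta_{NR}\), trained against the base model, is folded into the cold-start SFT model.

```python
import torch

@torch.no_grad()
def merge_lora_into_sft(sft_linear: torch.nn.Linear,
                        lora_A: torch.Tensor,  # (rank, in_features)
                        lora_B: torch.Tensor,  # (out_features, rank)
                        rank: int = 32,
                        alpha: int = 64) -> None:
    """Fold the no-reference-stage LoRA delta into the SFT weights in place.

    With the paper's configuration (rank=32, alpha=64), the standard LoRA
    scaling alpha / rank = 2 applies; the exact parametrization is an
    assumption, as the paper does not spell it out.
    """
    delta_nr = (alpha / rank) * (lora_B @ lora_A)  # Delta_NR for this layer
    sft_linear.weight.add_(delta_nr)
```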
- Negative Perceptual Objective and Group Filtering:
- For each LR input, \(K = 24\) candidates are sampled; after reward computation, saturated groups (reward mean above 0.9, variance below 0.05) are filtered out, as they offer almost no learning contrast (see the sketch after this item).
- Implicit positive/negative policies are defined as linear combinations of the old and current policies.
- The loss simultaneously trains the model to learn "what to do" from high-reward samples and "what to avoid" from low-reward samples.
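A minimal sketch of the group filter, under the assumption that the 0.9 / 0.05 thresholds apply to the reward mean and variance of each \(K = 24\) candidate group:

```python
import torch

def keep_group(rewards: torch.Tensor, mean_thresh: float = 0.9,
               var_thresh: float = 0.05) -> bool:
    """Decide whether a group of K=24 candidate rewards enters training.

    Groups whose rewards are uniformly high (mean > 0.9) and nearly
    identical (variance < 0.05) offer no contrast between positive and
    negative samples, so they are dropped.
    """
    return not (rewards.mean() > mean_thresh and rewards.var() < var_thresh)

print(keep_group(torch.full((24,), 0.95)))  # False: saturated group, filtered
print(keep_group(torch.rand(24)))           # True: diverse rewards survive
```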
Loss & Training¶
- Cold-start: Flow Matching loss \(\mathcal{L}_{SFT} = \mathbb{E}[\|v - v_\theta(x_t, t | x_{LR}, c)\|^2]\)
- RL stages: Negative perceptual objective \(\mathcal{L}_{RL} = \mathbb{E}[r\|v_\theta^+ - v\|^2 + (1-r)\|v_\theta^- - v\|^2]\), where \(r\) is the normalized and clipped optimal probability.
- LoRA configuration: rank=32, alpha=64; 6 sampling steps during training and 40 at inference; trained on 8×H20 GPUs.
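A hedged PyTorch sketch of the negative perceptual objective follows; \(v\) is the flow-matching velocity target from the cold-start stage, and the symmetric \(\pm\lambda\) mixing used to form the implicit positive/negative policies is an illustrative instantiation of the "linear combinations of the old and current policies" described above, not the paper's exact coefficients.

```python
import torch

def negative_perceptual_loss(v_theta: torch.Tensor,   # current policy, (B, C, H, W)
                             v_old: torch.Tensor,     # frozen old policy
                             v_target: torch.Tensor,  # flow-matching target v
                             r: torch.Tensor,         # clipped reward in [0, 1], (B,)
                             lam: float = 1.0) -> torch.Tensor:
    """L_RL = E[ r * ||v+ - v||^2 + (1 - r) * ||v- - v||^2 ].

    The +/- lam mixing below is an assumed instantiation of the implicit
    positive/negative policies as linear combinations of old and current.
    """
    v_pos = v_old + lam * (v_theta - v_old)  # implicit positive policy
    v_neg = v_old - lam * (v_theta - v_old)  # implicit negative policy
    sq_pos = (v_pos - v_target).pow(2).flatten(1).mean(dim=1)
    sq_neg = (v_neg - v_target).pow(2).flatten(1).mean(dim=1)
    # High-reward samples pull v_theta toward the target ("what to do");
    # low-reward samples push it away ("what to avoid").
    return (r * sq_pos + (1.0 - r) * sq_neg).mean()
```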
Key Experimental Results¶
| Dataset | Metric | OARS | Qwen-SFT | Best Baseline | Gain (vs. Qwen-SFT) |
|---|---|---|---|---|---|
| RealSR | LIQE↑ | 4.3045 | 3.8146 | UARE: 4.0658 | +0.49 |
| RealSR | MUSIQ↑ | 71.41 | 68.57 | UARE: 69.67 | +2.84 |
| DIV2K | LIQE↑ | 4.6668 | 4.3404 | UARE: 4.2627 | +0.33 |
| DIV2K | MUSIQ↑ | 74.07 | 72.35 | UARE: 70.45 | +1.72 |
| RealSet80 | LIQE↑ | 4.5465 | 4.1602 | SeeSR: 4.3317 | +0.39 |
| SRIQA-Bench (COMPASS) | All-Acc | 83.1% | — | A-FINE: 82.4% | Best GT-free metric |
- OARS achieves consistent top performance on all NR metrics without notable degradation in FR metrics (PSNR/SSIM) relative to Qwen-SFT.
- User study: OARS receives 47.62% of votes, the highest among all methods (runner-up DP2O-SR: 27.68%).
Ablation Study¶
- The three-stage annotation calibration improves accuracy from 78.8% to 81.5%; adding explicit fidelity modeling raises it to 82.3%; the quality-adaptive mechanism at \(\gamma = 7\) achieves the best 83.1%.
- Applying RL on the base model (vs. the SFT model) is critical: RL on the SFT model causes severe FR degradation (PSNR: 22.71→21.31), while the base model remains stable (22.71→22.36).
- Using only perceptual gain \(\Delta Q\) as reward leads to severe reward hacking (PSNR drops to 21.38, producing artifacts with spuriously high NR scores).
- General-purpose rewards (HPSv2, Qwen2.5-VL) fail to provide effective feedback for SR; RALI improves perceptual gain but causes severe fidelity degradation.
- The NFT-based OARS converges 5–10× faster than Flow-GRPO and achieves superior NR metric performance.
Highlights & Insights¶
- Paradigm shift to process-aware evaluation: Rather than scoring SR outputs as static results, OARS evaluates the LR→SR transformation process, decoupling fidelity preservation from perceptual gain—a fundamental innovation in the evaluation framework.
- Input quality-adaptive reward: The exponential gating mechanism \(F^{Q_{LR}/\gamma}\) dynamically modulates the perception-fidelity trade-off based on input degradation level, yielding an elegant and effective design.
- Shallow LoRA on the base model: Performing shallow LoRA optimization on the base model (rather than the SFT model) leverages the base model's higher stochasticity for better on-policy exploration while avoiding reward hacking.
- Three-stage annotation pipeline: Unifying global comparability with intra-group fine-grained discriminability addresses the key limitation of NR-IQA insensitivity to generative SR outputs.
Limitations & Future Work¶
- The MLLM reward model incurs high computational overhead, constraining online RL training efficiency—distillation into a lightweight scorer is a natural next step.
- The framework is limited to image SR and has not been extended to video SR, where temporal consistency poses additional challenges.
- Validation is restricted to 4× SR; generalization to other scale factors and tasks (denoising, deblurring) remains unexplored.
- The 12 SR methods in COMPASS-20K may not cover the full diversity of all generative paradigms.
Related Work & Insights¶
- vs. DP2O-SR (offline DPO): OARS overcomes the pseudo-diversity problem of offline sampling via online RL; applying OARS's no-reference RL stage on top of DP2O-SR yields further consistent improvements across all metrics.
- vs. Flow-GRPO: Both perform online RL alignment, but OARS employs forward-process RL (DiffusionNFT-style) rather than trajectory-level RL. As a strongly constrained generation task, SR does not require trajectory-level exploration; OARS is 5–10× more training-efficient.
- vs. Q-Insight / CLIP-IQA+ and other NR-IQA metrics: These metrics assess the output in isolation and cannot perceive the enhancement process from LR to SR. COMPASS surpasses all FR and NR baselines on SRIQA-Bench through process-aware evaluation combined with the adaptive mechanism.
Highlights & Insights (General)¶
- The concept of "process-aware" evaluation generalizes to other image enhancement and editing tasks: rather than asking only whether the result is good, one should ask what was improved and what was preserved relative to the input.
- The strategy of shallow LoRA on the base model for RL warrants validation in other conditional generation tasks.
- The three-stage annotation pipeline (global + intra-group + calibration) constitutes a general methodology for fine-grained preference annotation.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Process-aware reward and adaptive gating are core innovations; the progressive RL framework is well-designed, though individual components are not entirely novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three datasets, nine baselines, comprehensive ablations (reward formula / RL stages / backbone / RL method), and a user study; extremely thorough.
- Writing Quality: ⭐⭐⭐⭐ — Logically clear with well-motivated methodology and concise formulations, though the high information density requires careful re-reading.
- Value: ⭐⭐⭐⭐ — Provides a complete RLHF pipeline for generative SR post-training; COMPASS can be independently reused for other low-level vision tasks.