OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution¶
Conference: CVPR 2026 arXiv: 2603.12811 Code: Unavailable (as of March 2026) Area: Image Generation Keywords: Real-World Super-Resolution, Online RL, MLLM Reward, Process-Aware, Perception-Fidelity Trade-off
TL;DR¶
This paper proposes OARS, a framework that aligns generative real-world image super-resolution models with human visual preferences via COMPASS—an MLLM-based process-aware reward model—and a progressive online reinforcement learning pipeline, achieving adaptive balance between perceptual quality and fidelity.
Background & Motivation¶
- Background: Real-world image super-resolution (Real-ISR) must handle complex, unknown degradations. Early CNN-based methods trained with L1/L2 losses produce over-smoothed textures, while diffusion models improve perceptual quality but, under standard SFT, generalize poorly to unseen degradations and offer no mechanism for directly optimizing toward human aesthetic preferences.
- Limitations of Prior Work: Using existing IQA metrics as RL rewards has fundamental drawbacks: full-reference (FR) metrics are inapplicable in real-world scenarios without ground truth, while no-reference (NR) metrics lack the fine-grained sensitivity needed to distinguish subtle differences among generative SR outputs. Naively combining FR and NR metrics linearly ignores the varying degrees of input degradation. Offline RL methods (e.g., DPO) face a "pseudo-diversity" problem: under the strong spatial constraints of SR, outputs sampled with different noise seeds differ only in random texture hallucinations rather than genuine structural variation, which limits the value of the resulting preference pairs.
- Key Challenge: The absence of a process-aware, quality-adaptive reward model and an online exploration strategy capable of overcoming the pseudo-diversity bottleneck.
Core Problem¶
How to design a process-aware, quality-adaptive reward model and an online exploration strategy that breaks the pseudo-diversity bottleneck, enabling effective alignment of generative Real-ISR models with human visual preferences.
Method¶
Overall Architecture¶
OARS consists of two core components: (1) the COMPASS reward model—an MLLM-based process-aware scorer that evaluates fidelity preservation and perceptual gain during LR→SR conversion; and (2) a progressive online RL framework—comprising cold-start SFT, full-reference RL, and no-reference RL stages, with shallow LoRA optimization on the base model to enable on-policy exploration.
Key Designs¶
- COMPASS-20K Dataset and Three-Stage Annotation Pipeline:
- The dataset builds on 2,400 input images (800 synthetic DIV2K LR + 1,600 real LQ images), each processed by 12 SR algorithms to yield 28,800 LR-SR pairs.
- Fidelity annotation: On the synthetic subset, DISTS(SR, GT) distances are normalized to \([0, 1]\).
- Perceptual quality gain annotation (three stages):
- Stage 1 — Global anchor scoring: Q-Insight independently scores LR and SR images to obtain globally comparable quality scores.
- Stage 2 — Intra-group ranking: A dedicated pairwise comparison model (trained on DiffIQA) exhaustively compares all SR outputs for the same LR input and aggregates results into a continuous ranking score \(r \in [0, 1]\).
- Stage 3 — Ranking-guided calibration: Linear calibration (via least-squares estimation of \(\alpha^*\) and \(\beta^*\)) aligns intra-group rankings with the global scale; a minimal sketch follows this item.
- Qwen3-VL-32B generates explanatory text descriptions; manual inspection removes conflicting samples in the top/bottom 5%.
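To make Stage 3 concrete, here is a minimal NumPy sketch of ranking-guided calibration for a single LR group. The function name and the closed-form solve are illustrative assumptions; the paper states only that \(\alpha^*\) and \(\beta^*\) are estimated by least squares.

```python
import numpy as np

def calibrate_group(rank_scores: np.ndarray, anchor_scores: np.ndarray) -> np.ndarray:
    """Stage 3: align intra-group ranking scores r in [0, 1] with the
    globally comparable anchor scale via a least-squares linear map.

    rank_scores:   Stage-2 continuous ranking scores for one LR group.
    anchor_scores: Stage-1 global anchor (Q-Insight) scores of the same SR outputs.
    """
    # Solve min_{alpha, beta} || alpha * r + beta - q ||^2 in closed form.
    A = np.stack([rank_scores, np.ones_like(rank_scores)], axis=1)
    (alpha_star, beta_star), *_ = np.linalg.lstsq(A, anchor_scores, rcond=None)
    # Calibrated scores keep the fine-grained intra-group ordering while
    # landing on the globally comparable anchor scale.
    return alpha_star * rank_scores + beta_star
```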
- COMPASS Reward Function — Input Quality-Adaptive Mechanism:
- The reward model is obtained by full-parameter SFT of QwenVL-8B, which jointly predicts fidelity \(F\), input quality \(Q_{LR}\), and output quality \(Q_{SR}\).
- Reward formula: \(R = F \cdot Q_{LR} + F^{Q_{LR}/\gamma} \cdot \Delta Q\), where \(\Delta Q = Q_{SR} - Q_{LR}\) and \(\gamma = 7\).
- The first term \(F \cdot Q_{LR}\) measures preservation of original quality; the second term uses the exponent \(Q_{LR}/\gamma\) to realize adaptive control.
- High-quality input → large exponent → high sensitivity to fidelity degradation → conservative enhancement encouraged; low-quality input → small exponent → relaxed fidelity constraint → greater perceptual improvement permitted.
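The adaptive mechanism is easiest to see in code. The sketch below transcribes the reward formula with \(\gamma = 7\), assuming all predicted quantities lie in \([0, 1]\) as the annotation pipeline suggests.

```python
def compass_reward(f: float, q_lr: float, q_sr: float, gamma: float = 7.0) -> float:
    """COMPASS reward: R = F * Q_LR + F^(Q_LR / gamma) * (Q_SR - Q_LR).

    f    -- predicted fidelity of the LR->SR transformation, in [0, 1]
    q_lr -- predicted input (LR) quality, in [0, 1]
    q_sr -- predicted output (SR) quality, in [0, 1]
    """
    # The exponent q_lr / gamma gates the perceptual-gain term: the better
    # the input, the faster F^(q_lr / gamma) decays as fidelity slips.
    return f * q_lr + f ** (q_lr / gamma) * (q_sr - q_lr)

# High-quality input: the gain term is discounted more at the same fidelity.
print(compass_reward(f=0.6, q_lr=0.9, q_sr=0.95))
# Low-quality input: fidelity is barely penalized, so enhancement pays off.
print(compass_reward(f=0.6, q_lr=0.2, q_sr=0.7))
```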
- Progressive Online RL (Three Stages):
- Cold-start stage: Trains on LR-HR paired data with a Flow Matching objective to acquire basic SR capability.
- Full-reference RL stage: On data with available ground truth, consistency rewards are computed directly via DISTS (rather than via reward model prediction) to avoid reward hacking. A key design choice is applying LoRA to the base model (rather than the SFT model)—the base model's higher sampling stochasticity facilitates better exploration.
- No-reference RL stage: On real LQ data without ground truth, COMPASS provides the sole reward signal; LoRA fine-tuning continues on the base model.
- At inference time, the final LoRA parameters \(\Delta_{NR}\) are merged into the cold-start SFT model.
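Deployment then reduces to a weight merge. Below is a minimal PyTorch sketch assuming the standard LoRA parametrization \(W + (\alpha/r)BA\) per layer; the paper states only that \(\Delta_{NR}\), trained against the base model, is folded into the cold-start SFT model.

```python
import torch

@torch.no_grad()
def merge_lora_into_sft(sft_linear: torch.nn.Linear,
                        lora_A: torch.Tensor,  # (rank, in_features)
                        lora_B: torch.Tensor,  # (out_features, rank)
                        rank: int = 32,
                        alpha: int = 64) -> None:
    """Fold the no-reference-stage LoRA delta into the SFT weights in place.

    With the paper's configuration (rank=32, alpha=64), the standard LoRA
    scaling alpha / rank = 2 applies; the exact parametrization is an
    assumption, as the paper does not spell it out.
    """
    delta_nr = (alpha / rank) * (lora_B @ lora_A)  # Delta_NR for this layer
    sft_linear.weight.add_(delta_nr)
```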
- Negative Perceptual Objective and Group Filtering:
- For each LR input, \(K = 24\) candidates are sampled; after reward computation, saturated groups (reward mean above 0.9, variance below 0.05) are filtered out, as they offer almost no learning contrast (see the sketch after this item).
- Implicit positive/negative policies are defined as linear combinations of the old and current policies.
- The loss simultaneously trains the model to learn "what to do" from high-reward samples and "what to avoid" from low-reward samples.
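A minimal sketch of the group filter, under the assumption that the 0.9 / 0.05 thresholds apply to the reward mean and variance of each \(K = 24\) candidate group:

```python
import torch

def keep_group(rewards: torch.Tensor, mean_thresh: float = 0.9,
               var_thresh: float = 0.05) -> bool:
    """Decide whether a group of K=24 candidate rewards enters training.

    Groups whose rewards are uniformly high (mean > 0.9) and nearly
    identical (variance < 0.05) offer no contrast between positive and
    negative samples, so they are dropped.
    """
    return not (rewards.mean() > mean_thresh and rewards.var() < var_thresh)

print(keep_group(torch.full((24,), 0.95)))  # False: saturated group, filtered
print(keep_group(torch.rand(24)))           # True: diverse rewards survive
```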
Loss & Training¶
- Cold-start: Flow Matching loss \(\mathcal{L}_{SFT} = \mathbb{E}[\|v - v_\theta(x_t, t | x_{LR}, c)\|^2]\)
- RL stages: Negative perceptual objective \(\mathcal{L}_{RL} = \mathbb{E}[r\|v_\theta^+ - v\|^2 + (1-r)\|v_\theta^- - v\|^2]\), where \(r\) is the normalized and clipped optimal probability.
- LoRA configuration: rank=32, alpha=64; 6 sampling steps during training and 40 at inference; trained on 8×H20 GPUs.
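A hedged PyTorch sketch of the negative perceptual objective follows; \(v\) is the flow-matching velocity target from the cold-start stage, and the symmetric \(\pm\lambda\) mixing used to form the implicit positive/negative policies is an illustrative instantiation of the "linear combinations of the old and current policies" described above, not the paper's exact coefficients.

```python
import torch

def negative_perceptual_loss(v_theta: torch.Tensor,   # current policy, (B, C, H, W)
                             v_old: torch.Tensor,     # frozen old policy
                             v_target: torch.Tensor,  # flow-matching target v
                             r: torch.Tensor,         # clipped reward in [0, 1], (B,)
                             lam: float = 1.0) -> torch.Tensor:
    """L_RL = E[ r * ||v+ - v||^2 + (1 - r) * ||v- - v||^2 ].

    The +/- lam mixing below is an assumed instantiation of the implicit
    positive/negative policies as linear combinations of old and current.
    """
    v_pos = v_old + lam * (v_theta - v_old)  # implicit positive policy
    v_neg = v_old - lam * (v_theta - v_old)  # implicit negative policy
    sq_pos = (v_pos - v_target).pow(2).flatten(1).mean(dim=1)
    sq_neg = (v_neg - v_target).pow(2).flatten(1).mean(dim=1)
    # High-reward samples pull v_theta toward the target ("what to do");
    # low-reward samples push it away ("what to avoid").
    return (r * sq_pos + (1.0 - r) * sq_neg).mean()
```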
Key Experimental Results¶
| Dataset | Metric | OARS | Qwen-SFT | Best Baseline | Gain (vs. Qwen-SFT) |
|---|---|---|---|---|---|
| RealSR | LIQE↑ | 4.3045 | 3.8146 | UARE: 4.0658 | +0.49 |
| RealSR | MUSIQ↑ | 71.41 | 68.57 | UARE: 69.67 | +2.84 |
| DIV2K | LIQE↑ | 4.6668 | 4.3404 | UARE: 4.2627 | +0.33 |
| DIV2K | MUSIQ↑ | 74.07 | 72.35 | UARE: 70.45 | +1.72 |
| RealSet80 | LIQE↑ | 4.5465 | 4.1602 | SeeSR: 4.3317 | +0.39 |
| SRIQA-Bench (COMPASS) | All-Acc | 83.1% | — | A-FINE: 82.4% | Best GT-free metric |
- OARS achieves consistent top performance on all NR metrics without notable degradation in FR metrics (PSNR/SSIM) relative to Qwen-SFT.
- User study: OARS receives 47.62% of votes, the highest among all methods (runner-up DP2O-SR: 27.68%).
Ablation Study¶
- The three-stage annotation calibration improves accuracy from 78.8% to 81.5%; adding explicit fidelity modeling raises it to 82.3%; the quality-adaptive mechanism at \(\gamma = 7\) achieves the best 83.1%.
- Applying RL on the base model (vs. the SFT model) is critical: RL on the SFT model causes severe FR degradation (PSNR: 22.71→21.31), while the base model remains stable (22.71→22.36).
- Using only perceptual gain \(\Delta Q\) as reward leads to severe reward hacking (PSNR drops to 21.38, producing artifacts with spuriously high NR scores).
- General-purpose rewards (HPSv2, Qwen2.5-VL) fail to provide effective feedback for SR; RALI improves perceptual gain but causes severe fidelity degradation.
- The NFT-based OARS converges 5–10× faster than Flow-GRPO and achieves superior NR metric performance.
Highlights & Insights¶
- Paradigm shift to process-aware evaluation: Rather than scoring SR outputs as static results, OARS evaluates the LR→SR transformation process, decoupling fidelity preservation from perceptual gain—a fundamental innovation in the evaluation framework.
- Input quality-adaptive reward: The exponential gating mechanism \(F^{Q_{LR}/\gamma}\) dynamically modulates the perception-fidelity trade-off based on input degradation level, yielding an elegant and effective design.
- Shallow LoRA on the base model: Performing shallow LoRA optimization on the base model (rather than the SFT model) leverages the base model's higher stochasticity for better on-policy exploration while avoiding reward hacking.
- Three-stage annotation pipeline: Unifying global comparability with intra-group fine-grained discriminability addresses the key limitation of NR-IQA insensitivity to generative SR outputs.
Limitations & Future Work¶
- The MLLM reward model incurs high computational overhead, constraining online RL training efficiency—distillation into a lightweight scorer is a natural next step.
- The framework is limited to image SR and has not been extended to video SR, where temporal consistency poses additional challenges.
- Validation is restricted to 4× SR; generalization to other scale factors and tasks (denoising, deblurring) remains unexplored.
- The 12 SR methods in COMPASS-20K may not cover the full diversity of all generative paradigms.
Related Work & Insights¶
- vs. DP2O-SR (offline DPO): OARS overcomes the pseudo-diversity problem of offline sampling via online RL; applying OARS's no-reference RL stage on top of DP2O-SR yields further consistent improvements across all metrics.
- vs. Flow-GRPO: Both perform online RL alignment, but OARS employs forward-process RL (DiffusionNFT-style) rather than trajectory-level RL. As a strongly constrained generation task, SR does not require trajectory-level exploration; OARS is 5–10× more training-efficient.
- vs. Q-Insight / CLIP-IQA+ and other NR-IQA metrics: These metrics assess the output in isolation and cannot perceive the enhancement process from LR to SR. COMPASS surpasses all FR and NR baselines on SRIQA-Bench through process-aware evaluation combined with the adaptive mechanism.
Highlights & Insights (General)¶
- The concept of "process-aware" evaluation generalizes to other image enhancement and editing tasks: rather than asking only whether the result is good, one should ask what was improved and what was preserved relative to the input.
- The strategy of shallow LoRA on the base model for RL warrants validation in other conditional generation tasks.
- The three-stage annotation pipeline (global + intra-group + calibration) constitutes a general methodology for fine-grained preference annotation.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Process-aware reward and adaptive gating are core innovations; the progressive RL framework is well-designed, though individual components are not entirely novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three datasets, nine baselines, comprehensive ablations (reward formula / RL stages / backbone / RL method), and a user study; extremely thorough.
- Writing Quality: ⭐⭐⭐⭐ — Logically clear with well-motivated methodology and concise formulations, though the high information density requires careful re-reading.
- Value: ⭐⭐⭐⭐ — Provides a complete RLHF pipeline for generative SR post-training; COMPASS can be independently reused for other low-level vision tasks.