FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution

Conference: CVPR 2026 arXiv: 2512.22647 Code: https://github.com/lyd-2022/FinPercep-RM Area: Image Restoration / Super-Resolution Keywords: Reinforcement learning super-resolution, reward model, fine-grained perception, reward hacking, curriculum learning

TL;DR

This paper proposes FinPercep-RM, a fine-grained perceptual reward model that predicts both a global quality score and a perceptual degradation map to spatially localize artifacts. Combined with a co-evolutionary curriculum learning (CCL) strategy that balances training stability and reward robustness, the method effectively mitigates reward hacking in RL-based real-world super-resolution.

Background & Motivation

  1. Background: Diffusion-based Real-ISR methods leverage powerful generative priors to synthesize rich textures, and RLHF has been adopted to further optimize perceptual quality.
  2. Limitations of Prior Work: Typical IQA models (CLIP-IQA, MANIQA) output only global scores and are insensitive to local fine-grained distortions — subtle artifacts receive spuriously high rewards (reward hacking), causing local artifacts and unrealistic "painterly" appearances in generated results.
  3. Key Challenge: Simple global IQA rewards are stable but converge to suboptimal solutions (hacking); FinPercep-RM is robust but its spatially complex reward signal destabilizes policy learning — a dilemma between stability and robustness.
  4. Goal: Design a reward model that can diagnose where defects occur as well as assess overall quality, while resolving the training instability that a spatially complex reward introduces.
  5. Key Insight: An encoder-decoder architecture that jointly outputs a global score and a degradation heatmap, with curriculum learning to progressively introduce complex rewards.
  6. Core Idea: Couple the global score with the degradation map — the global score is computed via modulation by the degradation map, making it inherently sensitive to local defects.

Method

Overall Architecture

The generator produces a super-resolved image → FinPercep-RM evaluates it (global score + degradation map) → the reward signal guides policy updates of the generator. The CCL mechanism controls the progressive evolution of the reward model from simple to complex.
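The loop above can be sketched with stand-in components. This is a minimal illustration, not the authors' code: `DummyGenerator` and `dummy_reward_model` are hypothetical placeholders for the SR generator and FinPercep-RM.

```python
import numpy as np

# Minimal sketch of the reward-guided training loop described above.
# All names here (DummyGenerator, dummy_reward_model) are illustrative
# stand-ins, not the authors' implementation.

class DummyGenerator:
    """Stand-in generator: returns its input and records rewards."""
    def __init__(self):
        self.rewards = []

    def __call__(self, lr):
        return lr  # a real model would super-resolve here

    def update(self, reward):
        self.rewards.append(reward)  # a real model would take a policy-gradient step

def dummy_reward_model(sr):
    """Return a global score and a spatial degradation map.

    In FinPercep-RM the two are coupled, so the score alone can drive
    the policy update while the map localizes artifacts."""
    deg_map = np.zeros_like(sr)
    score = float(sr.mean())
    return score, deg_map

gen = DummyGenerator()
lr = np.full((8, 8), 0.5)
sr = gen(lr)                          # generator produces an SR image
score, deg_map = dummy_reward_model(sr)  # FinPercep-RM evaluates it
gen.update(score)                     # reward signal guides the policy update
```

Under CCL, `dummy_reward_model` would be swapped for progressively stricter reward-model versions over the course of training.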

Key Designs

  1. FinPercep-RM Encoder-Decoder Architecture:

    • Function: Simultaneously predict a global quality score and a spatial degradation map.
    • Mechanism: An encoder (IQA backbone such as CLIP-IQA) extracts multi-scale features \(\{f_i\}_{i=1}^N\); a decoder reconstructs the fine-grained perceptual degradation map \(M_{\text{fg-pdm}} \in [0,1]\) via upsampling and cross-layer fusion. The global score is computed by modulating the deepest feature with the degradation map: \(S_{\text{fgc-global}} = \text{MLP}(f_N \odot \text{interpolate}(M_{\text{fg-pdm}}))\).
    • Design Motivation: Coupling the global score with the degradation map ensures the score is sensitive to local defects. The degradation map endows the reward with spatial diagnostic capability.
  2. FGR-30k Dataset:

    • Function: Provide fine-grained degradation annotations for training FinPercep-RM.
    • Mechanism: Outputs \(I_{SR}\) from multiple Real-ISR models are collected; local defects are "implanted" between \(I_{GT}\) and \(I_{SR}\) via a region-swapping strategy using random masks and SAM semantic masks. Degradation map ground truth is generated by fusing pixel-level L1 differences and DINOv3 feature-level cosine distances: \(M_{gt} = \text{Normalize}(\alpha \cdot \text{Diff}_{\text{pixel}} + (1-\alpha) \cdot \text{Diff}_{\text{feat}})\).
    • Design Motivation: Existing IQA datasets lack spatial degradation annotations. Synthetic samples incorporate artifacts produced by real SR models, ensuring that the training signal is consistent with practical application scenarios.
  3. Co-evolutionary Curriculum Learning (CCL):

    • Function: Balance training stability and reward robustness.
    • Mechanism: Dual co-evolutionary paths: (1) Progressive reward model expansion — starting from a simple global IQA model \(RM_0\), decoder parameters are incrementally introduced, evolving into the full FinPercep-RM \(RM_N\); (2) Co-evolutionary generator curriculum — the generator initially uses global rewards for stable convergence, then progressively transitions to increasingly strict FinPercep-RM versions.
    • Design Motivation: Directly applying the full FinPercep-RM causes policy gradient oscillations and convergence failure. An easy-to-hard design ensures stable early convergence followed by fine-grained late-stage optimization.
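The score–map coupling in design 1 can be sketched as follows. This is a simplification under stated assumptions: a nearest-neighbor resize stands in for interpolation, and a single linear layer (`W`, `b`) stands in for the MLP head; neither is the authors' exact implementation.

```python
import numpy as np

def nearest_resize(m, out_h, out_w):
    """Nearest-neighbor resize of a 2-D map (stand-in for bilinear interpolation)."""
    h, w = m.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return m[rows][:, cols]

def global_score(f_N, deg_map, W, b):
    """Sketch of S_fgc-global = MLP(f_N ⊙ interpolate(M_fg-pdm)).

    f_N     : deepest encoder features, shape (C, H, W)
    deg_map : predicted degradation map in [0, 1], shape (h, w)
    W, b    : a single linear layer standing in for the MLP head
    """
    C, H, Wd = f_N.shape
    m = nearest_resize(deg_map, H, Wd)      # match feature resolution
    modulated = f_N * m[None, :, :]         # elementwise modulation by the map
    pooled = modulated.mean(axis=(1, 2))    # global average pool -> (C,)
    return float(W @ pooled + b)            # scalar quality score
```

Because the features are multiplied by the degradation map before pooling, a local artifact that changes the map necessarily changes the global score, which is the coupling the design motivation describes.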
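The ground-truth map fusion used to build FGR-30k (design 2) can be sketched as below. For brevity, grayscale images and precomputed per-pixel feature vectors are assumed; in the paper the features come from DINOv3, and the exact normalization is not specified, so min-max scaling is an assumption here.

```python
import numpy as np

def make_gt_map(I_gt, I_sr, feat_gt, feat_sr, alpha=0.5):
    """Sketch of M_gt = Normalize(alpha * Diff_pixel + (1 - alpha) * Diff_feat).

    I_gt, I_sr       : images, shape (H, W) (grayscale for brevity)
    feat_gt, feat_sr : per-pixel feature vectors, shape (H, W, D), standing in
                       for DINOv3 features upsampled to image resolution
    """
    diff_pixel = np.abs(I_gt - I_sr)  # pixel-level L1 difference
    # Feature-level cosine distance per spatial location: 1 - cosine similarity
    num = (feat_gt * feat_sr).sum(-1)
    den = np.linalg.norm(feat_gt, axis=-1) * np.linalg.norm(feat_sr, axis=-1) + 1e-8
    diff_feat = 1.0 - num / den
    fused = alpha * diff_pixel + (1 - alpha) * diff_feat
    lo, hi = fused.min(), fused.max()
    return (fused - lo) / (hi - lo + 1e-8)  # min-max normalize to [0, 1]
```

Regions that match the ground truth map to values near 0, while implanted defects map to values near 1, giving the dense supervision signal the decoder is trained against.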

Loss & Training

FinPercep-RM training: \(\mathcal{L}_{total} = \lambda_{map} \mathcal{L}_{map} + \lambda_{rank} \mathcal{L}_{rank} + \lambda_{align} \mathcal{L}_{align}\), comprising a heatmap loss (L1), a triplet ranking loss (hinge), and an anchor alignment loss (to prevent score drift).
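The three loss terms can be sketched as below. The margin and loss weights are illustrative defaults, not the paper's values, and the single-layer inputs (scores for one quality-ordered triplet and one anchor image) simplify what would be batched in practice.

```python
import numpy as np

def finpercep_losses(pred_map, gt_map, s_pos, s_mid, s_neg,
                     s_anchor_pred, s_anchor_ref,
                     margin=0.1, lam_map=1.0, lam_rank=1.0, lam_align=0.1):
    """Sketch of L_total = lam_map*L_map + lam_rank*L_rank + lam_align*L_align.

    pred_map, gt_map      : degradation maps (L1 heatmap loss)
    s_pos > s_mid > s_neg : scores of a quality-ordered triplet (hinge ranking)
    s_anchor_*            : score on an anchor image vs. its reference value,
                            kept close to prevent score drift
    Weights and margin are illustrative, not the paper's settings.
    """
    l_map = np.abs(pred_map - gt_map).mean()          # heatmap loss (L1)
    # Hinge ranking on both adjacent pairs of the triplet
    l_rank = (max(0.0, margin - (s_pos - s_mid))
              + max(0.0, margin - (s_mid - s_neg)))
    l_align = abs(s_anchor_pred - s_anchor_ref)       # anchor alignment loss
    return lam_map * l_map + lam_rank * l_rank + lam_align * l_align
```

When the predicted map matches the ground truth, the triplet is correctly ordered by at least the margin, and the anchor score has not drifted, the total loss is zero; any ranking violation contributes a positive hinge penalty.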

Key Experimental Results

Main Results

| Dataset / Method | LPIPS↓ | MUSIQ↑ | MANIQA↑ | ClipIQA↑ |
| --- | --- | --- | --- | --- |
| SUPIR baseline | 0.452 | 65.67 | 0.629 | 0.572 |
| SUPIR w/ IQA | 0.465 | 64.89 | 0.612 | 0.589 |
| SUPIR w/ Ours | 0.428 | 67.23 | 0.648 | 0.586 |

Ablation Study

| Configuration | Outcome | Notes |
| --- | --- | --- |
| Standard IQA reward | Fast convergence, but reward hacking | Global metrics improve while local artifacts are prominent |
| FinPercep-RM w/o CCL | Unstable oscillation | Robust reward, but training fails to converge |
| FinPercep-RM + CCL | Stable, optimal convergence | Best of both worlds |

Key Findings

  • Standard IQA rewards lead to pronounced reward hacking — global scores rise while visual quality degrades.
  • In a user study, FinPercep-RM's scores align closely with human judgment.
  • CCL is critical — training curves of FinPercep-RM without CCL exhibit severe oscillations.

Highlights & Insights

  • Diagnostic Reward Model: Evaluating not only "how good" but also diagnosing "where it fails" represents an important advance for RLHF in ISR.
  • Global-Local Coupling Design: Modulating the global score via the degradation map elegantly resolves the blind spot of purely global scoring.
  • Clever Data Construction: The synthesis strategy, combining region swapping with dual-level (pixel and feature) difference fusion, is simple and effective.

Limitations & Future Work

  • The cached content covers only partial experimental results; the full ablation study may be richer.
  • The encoder-decoder introduces additional inference overhead, potentially limiting real-time applicability.
  • The stage partitioning and transition timing of CCL require manual tuning.
  • The synthetic strategy of the FGR-30k dataset may not cover all artifact types.
  • Future work could explore extending the diagnostic reward model to tasks such as video super-resolution.
Comparisons

  • vs. Direct IQA Rewards: IQA-only global scoring leads to reward hacking; FinPercep-RM adds spatial diagnostic capability.
  • vs. Large-scale IQA Models: Large-scale IQA models (e.g., Q-Align) offer some fine-grained perception but incur computational costs incompatible with iterative RL training.

Rating

  • Novelty: ⭐⭐⭐⭐ — First systematic treatment of reward hacking in Real-ISR; the diagnostic reward model concept is novel.
  • Experimental Thoroughness: ⭐⭐⭐ — Cached content is limited, but core ablations are clear and validated across multiple ISR models.
  • Writing Quality: ⭐⭐⭐⭐ — Motivation is conveyed very intuitively through figures; training curve comparisons are persuasive.
  • Value: ⭐⭐⭐⭐ — Provides an important methodological contribution to RL-based image restoration; the CCL strategy is transferable to other RLHF scenarios.