FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution¶
- Conference: CVPR 2026
- arXiv: 2512.22647
- Code: https://github.com/lyd-2022/FinPercep-RM
- Area: Image Restoration / Super-Resolution
- Keywords: Reinforcement learning super-resolution, reward model, fine-grained perception, reward hacking, curriculum learning
TL;DR¶
This paper proposes FinPercep-RM, a fine-grained perceptual reward model that predicts both a global quality score and a perceptual degradation map to spatially localize artifacts. Combined with a co-evolutionary curriculum learning (CCL) strategy that balances training stability and reward robustness, the method effectively mitigates reward hacking in RL-based real-world super-resolution.
Background & Motivation¶
- Background: Diffusion-based Real-ISR methods leverage powerful generative priors to synthesize rich textures, and RLHF has been adopted to further optimize perceptual quality.
- Limitations of Prior Work: Typical IQA models (CLIP-IQA, MANIQA) output only a global score and are insensitive to local fine-grained distortions; images with subtle artifacts can still receive spuriously high rewards (reward hacking), producing local artifacts and unrealistic "painterly" appearances in generated results.
- Key Challenge: Simple global IQA rewards are stable to optimize but converge to suboptimal, hacked solutions; a fine-grained reward such as FinPercep-RM resists hacking, but its spatially complex signal destabilizes policy learning. This is a dilemma between stability and robustness.
- Goal: Design a reward model capable of diagnosing where defects occur as well as assessing how good the quality is, while resolving training instability.
- Key Insight: An encoder-decoder architecture that jointly outputs a global score and a degradation heatmap, with curriculum learning to progressively introduce complex rewards.
- Core Idea: Couple the global score to the degradation map. The global score is computed by modulating deep features with the degradation map, making it inherently sensitive to local defects.
Method¶
Overall Architecture¶
The generator produces a super-resolved image → FinPercep-RM evaluates it (global score + degradation map) → the reward signal guides policy updates of the generator. The CCL mechanism controls the progressive evolution of the reward model from simple to complex.
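Below is a minimal, hypothetical sketch of this loop, not the authors' code; `generator`, `reward_model`, and the reward shaping (subtracting the mean predicted degradation from the global score) are illustrative assumptions.

```python
import torch

def rl_step(generator, reward_model, lq_batch, optimizer):
    """One hypothetical policy-update step of the loop described above."""
    sr = generator(lq_batch)                      # super-resolved images
    score, deg_map = reward_model(sr)             # global score + degradation map
    # Assumed reward shaping: reward high global quality and penalize the
    # average predicted degradation, so localized artifacts lower the reward.
    reward = score - deg_map.mean(dim=(1, 2, 3))
    loss = -reward.mean()                         # gradient ascent on the reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.detach()
```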
Key Designs¶
- FinPercep-RM Encoder-Decoder Architecture:
  - Function: Simultaneously predict a global quality score and a spatial degradation map.
  - Mechanism: An encoder (an IQA backbone such as CLIP-IQA) extracts multi-scale features \(\{f_i\}_{i=1}^N\); a decoder reconstructs the fine-grained perceptual degradation map \(M_{\text{fg-pdm}} \in [0,1]\) via upsampling and cross-layer fusion. The global score is computed by modulating the deepest feature with the degradation map: \(S_{\text{fgc-global}} = \text{MLP}(f_N \odot \text{interpolate}(M_{\text{fg-pdm}}))\).
  - Design Motivation: Coupling the global score with the degradation map ensures the score is sensitive to local defects. The degradation map endows the reward with spatial diagnostic capability.
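A minimal PyTorch sketch of this global-local coupling, assuming a hypothetical `encoder` that returns multi-scale features and a `decoder` that fuses them into a one-channel map (neither is the released implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FinPercepRMSketch(nn.Module):
    """Illustrative sketch of the score/map coupling, not the released model.

    `encoder` is assumed to return multi-scale features [f_1, ..., f_N];
    `decoder` is assumed to fuse them into a 1-channel logit map.
    """
    def __init__(self, encoder, decoder, feat_dim):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU(),
                                 nn.Linear(feat_dim, 1))

    def forward(self, x):
        feats = self.encoder(x)                       # multi-scale features {f_i}
        deg_map = torch.sigmoid(self.decoder(feats))  # M_fg-pdm in [0,1], (B,1,H,W)
        f_n = feats[-1]                               # deepest feature map (B,C,h,w)
        m = F.interpolate(deg_map, size=f_n.shape[-2:],
                          mode="bilinear", align_corners=False)
        modulated = f_n * m                           # f_N ⊙ interpolate(M_fg-pdm)
        pooled = modulated.mean(dim=(2, 3))           # global average pooling
        score = self.mlp(pooled).squeeze(-1)          # S_fgc-global, one per image
        return score, deg_map
```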
- FGR-30k Dataset:
  - Function: Provide fine-grained degradation annotations for training FinPercep-RM.
  - Mechanism: Outputs \(I_{SR}\) from multiple Real-ISR models are collected; local defects are "implanted" between \(I_{GT}\) and \(I_{SR}\) via a region-swapping strategy using random masks and SAM semantic masks. Degradation map ground truth is generated by fusing pixel-level L1 differences and DINOv3 feature-level cosine distances: \(M_{gt} = \text{Normalize}(\alpha \cdot \text{Diff}_{\text{pixel}} + (1-\alpha) \cdot \text{Diff}_{\text{feat}})\).
  - Design Motivation: Existing IQA datasets lack spatial degradation annotations. Synthetic samples incorporate artifacts produced by real SR models, ensuring that the training signal is consistent with practical application scenarios.
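A sketch of the ground-truth fusion under stated assumptions: `feat_gt`/`feat_sr` are dense feature maps from a frozen backbone (the paper uses DINOv3), and `Normalize` is taken to be per-image min-max scaling, which the summary does not spell out:

```python
import torch
import torch.nn.functional as F

def degradation_map_gt(i_gt, i_sr, feat_gt, feat_sr, alpha=0.5):
    """M_gt = Normalize(alpha * Diff_pixel + (1 - alpha) * Diff_feat)."""
    # Pixel-level L1 difference, averaged over channels -> (B, 1, H, W)
    diff_pixel = (i_gt - i_sr).abs().mean(dim=1, keepdim=True)
    # Feature-level cosine distance -> (B, 1, h, w), upsampled to image size
    cos = F.cosine_similarity(feat_gt, feat_sr, dim=1, eps=1e-8).unsqueeze(1)
    diff_feat = F.interpolate(1.0 - cos, size=i_gt.shape[-2:],
                              mode="bilinear", align_corners=False)
    m = alpha * diff_pixel + (1.0 - alpha) * diff_feat
    # Assumed normalization: per-image min-max scaling into [0, 1]
    flat = m.flatten(1)
    lo = flat.min(dim=1).values.view(-1, 1, 1, 1)
    hi = flat.max(dim=1).values.view(-1, 1, 1, 1)
    return (m - lo) / (hi - lo + 1e-8)
```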
- Co-evolutionary Curriculum Learning (CCL):
  - Function: Balance training stability and reward robustness.
  - Mechanism: Dual co-evolutionary paths: (1) progressive reward model expansion: starting from a simple global IQA model \(RM_0\), decoder parameters are incrementally introduced, evolving into the full FinPercep-RM \(RM_N\); (2) co-evolutionary generator curriculum: the generator initially uses global rewards for stable convergence, then progressively transitions to increasingly strict FinPercep-RM versions.
  - Design Motivation: Directly applying the full FinPercep-RM causes policy-gradient oscillations and convergence failure. An easy-to-hard design ensures stable early convergence followed by fine-grained late-stage optimization.
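The schedule itself can be tiny. The sketch below expresses the easy-to-hard switching with assumed step milestones; the paper's exact transition rule and stage count are not specified here:

```python
def select_reward_model(step, milestones, reward_models):
    """Pick the reward model for the current training step.

    `reward_models` is assumed to be [RM_0, ..., RM_N], ordered from the
    simple global IQA model to the full FinPercep-RM; `milestones[i]` is the
    (assumed) step at which the (i+1)-th, stricter model takes over.
    """
    stage = sum(step >= m for m in milestones)  # number of milestones passed
    return reward_models[min(stage, len(reward_models) - 1)]

# Example: switch rewards at (hypothetical) steps 10k and 20k.
# rm = select_reward_model(step, [10_000, 20_000], [rm0, rm1, rm_full])
```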
Loss & Training¶
FinPercep-RM training: \(\mathcal{L}_{total} = \lambda_{map} \mathcal{L}_{map} + \lambda_{rank} \mathcal{L}_{rank} + \lambda_{align} \mathcal{L}_{align}\), comprising a heatmap loss (L1), a triplet ranking loss (hinge), and an anchor alignment loss (to prevent score drift).
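A sketch of this objective with assumed weights and margin (the source specifies only the three terms: an L1 heatmap loss, a hinge-style triplet ranking loss, and an anchor alignment loss against a reference score):

```python
import torch.nn.functional as F

def rm_loss(pred_map, gt_map, s_pos, s_neg, s_anchor, s_anchor_ref,
            lambda_map=1.0, lambda_rank=1.0, lambda_align=0.1, margin=0.5):
    """L_total = l_map * L_map + l_rank * L_rank + l_align * L_align."""
    l_map = F.l1_loss(pred_map, gt_map)               # heatmap L1 loss
    l_rank = F.relu(margin - (s_pos - s_neg)).mean()  # hinge ranking loss
    l_align = F.mse_loss(s_anchor, s_anchor_ref)      # prevents score drift
    return lambda_map * l_map + lambda_rank * l_rank + lambda_align * l_align
```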
Key Experimental Results¶
Main Results¶
| Method | LPIPS↓ | MUSIQ↑ | MANIQA↑ | CLIP-IQA↑ |
|---|---|---|---|---|
| SUPIR baseline | 0.452 | 65.67 | 0.629 | 0.572 |
| SUPIR w/ IQA reward | 0.465 | 64.89 | 0.612 | 0.589 |
| SUPIR w/ FinPercep-RM (ours) | 0.428 | 67.23 | 0.648 | 0.586 |
Ablation Study¶
| Configuration | Outcome | Notes |
|---|---|---|
| Standard IQA reward | Fast convergence but reward hacking | Global metrics improve while local artifacts are prominent |
| FinPercep-RM w/o CCL | Unstable oscillation | Robust but fails to converge |
| FinPercep-RM + CCL | Stable, optimal convergence | Best of both worlds |
Key Findings¶
- Standard IQA rewards lead to pronounced reward hacking — global scores rise while visual quality degrades.
- In the user study, FinPercep-RM's quality scores are highly consistent with human judgment.
- CCL is critical — training curves of FinPercep-RM without CCL exhibit severe oscillations.
Highlights & Insights¶
- Diagnostic Reward Model: Evaluating not only "how good" but also diagnosing "where it fails" represents an important advance for RLHF in ISR.
- Global-Local Coupling Design: Modulating the global score via the degradation map elegantly resolves the blind spot of purely global scoring.
- Clever Data Construction: The synthesis strategy, combining region swapping with dual-level (pixel and feature) difference fusion, is simple and effective.
Limitations & Future Work¶
- The cached content covers only part of the experimental results; the full ablation study may be richer.
- The encoder-decoder introduces additional inference overhead, potentially limiting real-time applicability.
- The stage partitioning and transition timing of CCL require manual tuning.
- The synthetic strategy of the FGR-30k dataset may not cover all artifact types.
- Future work could explore extending the diagnostic reward model to tasks such as video super-resolution.
Related Work & Insights¶
- vs. Direct IQA Rewards: IQA-only global scoring leads to reward hacking; FinPercep-RM provides spatial diagnostic capability.
- vs. Large-scale IQA Models: Large-scale IQA models (e.g., Q-Align) offer some fine-grained perception but incur computational costs incompatible with iterative training.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First systematic treatment of reward hacking in Real-ISR; the diagnostic reward model concept is novel.
- Experimental Thoroughness: ⭐⭐⭐ — Cached content is limited, but core ablations are clear and validated across multiple ISR models.
- Writing Quality: ⭐⭐⭐⭐ — Motivation is conveyed very intuitively through figures; training curve comparisons are persuasive.
- Value: ⭐⭐⭐⭐ — Provides an important methodological contribution to RL-based image restoration; the CCL strategy is transferable to other RLHF scenarios.