FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution¶
- Conference: CVPR 2026
- arXiv: 2512.22647
- Code: https://github.com/lyd-2022/FinPercep-RM
- Area: Image Restoration / Super-Resolution
- Keywords: Reinforcement learning super-resolution, reward model, fine-grained perception, reward hacking, curriculum learning
TL;DR¶
This paper proposes FinPercep-RM, a fine-grained perceptual reward model that predicts both a global quality score and a perceptual degradation map to spatially localize artifacts. Combined with a co-evolutionary curriculum learning (CCL) strategy that balances training stability and reward robustness, the method effectively mitigates reward hacking in RL-based real-world super-resolution.
Background & Motivation¶
- Background: Diffusion-based Real-ISR methods leverage powerful generative priors to synthesize rich textures, and RLHF has been adopted to further optimize perceptual quality.
- Limitations of Prior Work: Typical IQA models (CLIP-IQA, MANIQA) output only a global score and are insensitive to local fine-grained distortions; images with subtle artifacts can still receive spuriously high rewards (reward hacking), producing local artifacts and unrealistic "painterly" appearances in generated results.
- Key Challenge: Simple global IQA rewards are stable to optimize but converge to suboptimal, hacked solutions; a fine-grained reward such as FinPercep-RM resists hacking, but its spatially complex signal destabilizes policy learning. This is a dilemma between stability and robustness.
- Goal: Design a reward model capable of diagnosing where defects occur as well as assessing how good the quality is, while resolving training instability.
- Key Insight: An encoder-decoder architecture that jointly outputs a global score and a degradation heatmap, with curriculum learning to progressively introduce complex rewards.
- Core Idea: Couple the global score to the degradation map. The global score is computed by modulating deep features with the degradation map, making it inherently sensitive to local defects.
Method¶
Overall Architecture¶
The generator produces a super-resolved image → FinPercep-RM evaluates it (global score + degradation map) → the reward signal guides policy updates of the generator. The CCL mechanism controls the progressive evolution of the reward model from simple to complex.
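Below is a minimal, hypothetical sketch of this loop, not the authors' code; `generator`, `reward_model`, and the reward shaping (subtracting the mean predicted degradation from the global score) are illustrative assumptions.

```python
import torch

def rl_step(generator, reward_model, lq_batch, optimizer):
    """One hypothetical policy-update step of the loop described above."""
    sr = generator(lq_batch)                      # super-resolved images
    score, deg_map = reward_model(sr)             # global score + degradation map
    # Assumed reward shaping: reward high global quality and penalize the
    # average predicted degradation, so localized artifacts lower the reward.
    reward = score - deg_map.mean(dim=(1, 2, 3))
    loss = -reward.mean()                         # gradient ascent on the reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.detach()
```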
Key Designs¶
- FinPercep-RM Encoder-Decoder Architecture:
  - Function: Simultaneously predict a global quality score and a spatial degradation map.
  - Mechanism: An encoder (an IQA backbone such as CLIP-IQA) extracts multi-scale features \(\{f_i\}_{i=1}^N\); a decoder reconstructs the fine-grained perceptual degradation map \(M_{\text{fg-pdm}} \in [0,1]\) via upsampling and cross-layer fusion. The global score is computed by modulating the deepest feature with the degradation map: \(S_{\text{fgc-global}} = \text{MLP}(f_N \odot \text{interpolate}(M_{\text{fg-pdm}}))\).
  - Design Motivation: Coupling the global score with the degradation map ensures the score is sensitive to local defects. The degradation map endows the reward with spatial diagnostic capability.
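A minimal PyTorch sketch of this global-local coupling, assuming a hypothetical `encoder` that returns multi-scale features and a `decoder` that fuses them into a one-channel map (neither is the released implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FinPercepRMSketch(nn.Module):
    """Illustrative sketch of the score/map coupling, not the released model.

    `encoder` is assumed to return multi-scale features [f_1, ..., f_N];
    `decoder` is assumed to fuse them into a 1-channel logit map.
    """
    def __init__(self, encoder, decoder, feat_dim):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU(),
                                 nn.Linear(feat_dim, 1))

    def forward(self, x):
        feats = self.encoder(x)                       # multi-scale features {f_i}
        deg_map = torch.sigmoid(self.decoder(feats))  # M_fg-pdm in [0,1], (B,1,H,W)
        f_n = feats[-1]                               # deepest feature map (B,C,h,w)
        m = F.interpolate(deg_map, size=f_n.shape[-2:],
                          mode="bilinear", align_corners=False)
        modulated = f_n * m                           # f_N ⊙ interpolate(M_fg-pdm)
        pooled = modulated.mean(dim=(2, 3))           # global average pooling
        score = self.mlp(pooled).squeeze(-1)          # S_fgc-global, one per image
        return score, deg_map
```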
- FGR-30k Dataset:
  - Function: Provide fine-grained degradation annotations for training FinPercep-RM.
  - Mechanism: Outputs \(I_{SR}\) from multiple Real-ISR models are collected; local defects are "implanted" between \(I_{GT}\) and \(I_{SR}\) via a region-swapping strategy using random masks and SAM semantic masks. Degradation map ground truth is generated by fusing pixel-level L1 differences and DINOv3 feature-level cosine distances: \(M_{gt} = \text{Normalize}(\alpha \cdot \text{Diff}_{\text{pixel}} + (1-\alpha) \cdot \text{Diff}_{\text{feat}})\).
  - Design Motivation: Existing IQA datasets lack spatial degradation annotations. Synthetic samples incorporate artifacts produced by real SR models, ensuring that the training signal is consistent with practical application scenarios.
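A sketch of the ground-truth fusion under stated assumptions: `feat_gt`/`feat_sr` are dense feature maps from a frozen backbone (the paper uses DINOv3), and `Normalize` is taken to be per-image min-max scaling, which the summary does not spell out:

```python
import torch
import torch.nn.functional as F

def degradation_map_gt(i_gt, i_sr, feat_gt, feat_sr, alpha=0.5):
    """M_gt = Normalize(alpha * Diff_pixel + (1 - alpha) * Diff_feat)."""
    # Pixel-level L1 difference, averaged over channels -> (B, 1, H, W)
    diff_pixel = (i_gt - i_sr).abs().mean(dim=1, keepdim=True)
    # Feature-level cosine distance -> (B, 1, h, w), upsampled to image size
    cos = F.cosine_similarity(feat_gt, feat_sr, dim=1, eps=1e-8).unsqueeze(1)
    diff_feat = F.interpolate(1.0 - cos, size=i_gt.shape[-2:],
                              mode="bilinear", align_corners=False)
    m = alpha * diff_pixel + (1.0 - alpha) * diff_feat
    # Assumed normalization: per-image min-max scaling into [0, 1]
    flat = m.flatten(1)
    lo = flat.min(dim=1).values.view(-1, 1, 1, 1)
    hi = flat.max(dim=1).values.view(-1, 1, 1, 1)
    return (m - lo) / (hi - lo + 1e-8)
```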
- Co-evolutionary Curriculum Learning (CCL):
  - Function: Balance training stability and reward robustness.
  - Mechanism: Dual co-evolutionary paths: (1) progressive reward model expansion: starting from a simple global IQA model \(RM_0\), decoder parameters are incrementally introduced, evolving into the full FinPercep-RM \(RM_N\); (2) co-evolutionary generator curriculum: the generator initially uses global rewards for stable convergence, then progressively transitions to increasingly strict FinPercep-RM versions.
  - Design Motivation: Directly applying the full FinPercep-RM causes policy-gradient oscillations and convergence failure. An easy-to-hard design ensures stable early convergence followed by fine-grained late-stage optimization.
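The schedule itself can be tiny. The sketch below expresses the easy-to-hard switching with assumed step milestones; the paper's exact transition rule and stage count are not specified here:

```python
def select_reward_model(step, milestones, reward_models):
    """Pick the reward model for the current training step.

    `reward_models` is assumed to be [RM_0, ..., RM_N], ordered from the
    simple global IQA model to the full FinPercep-RM; `milestones[i]` is the
    (assumed) step at which the (i+1)-th, stricter model takes over.
    """
    stage = sum(step >= m for m in milestones)  # number of milestones passed
    return reward_models[min(stage, len(reward_models) - 1)]

# Example: switch rewards at (hypothetical) steps 10k and 20k.
# rm = select_reward_model(step, [10_000, 20_000], [rm0, rm1, rm_full])
```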
Loss & Training¶
FinPercep-RM training: \(\mathcal{L}_{total} = \lambda_{map} \mathcal{L}_{map} + \lambda_{rank} \mathcal{L}_{rank} + \lambda_{align} \mathcal{L}_{align}\), comprising a heatmap loss (L1), a triplet ranking loss (hinge), and an anchor alignment loss (to prevent score drift).
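A sketch of this objective with assumed weights and margin (the source specifies only the three terms: an L1 heatmap loss, a hinge-style triplet ranking loss, and an anchor alignment loss against a reference score):

```python
import torch.nn.functional as F

def rm_loss(pred_map, gt_map, s_pos, s_neg, s_anchor, s_anchor_ref,
            lambda_map=1.0, lambda_rank=1.0, lambda_align=0.1, margin=0.5):
    """L_total = l_map * L_map + l_rank * L_rank + l_align * L_align."""
    l_map = F.l1_loss(pred_map, gt_map)               # heatmap L1 loss
    l_rank = F.relu(margin - (s_pos - s_neg)).mean()  # hinge ranking loss
    l_align = F.mse_loss(s_anchor, s_anchor_ref)      # prevents score drift
    return lambda_map * l_map + lambda_rank * l_rank + lambda_align * l_align
```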
Key Experimental Results¶
Main Results¶
| Method | LPIPS↓ | MUSIQ↑ | MANIQA↑ | CLIP-IQA↑ |
|---|---|---|---|---|
| SUPIR baseline | 0.452 | 65.67 | 0.629 | 0.572 |
| SUPIR w/ IQA reward | 0.465 | 64.89 | 0.612 | 0.589 |
| SUPIR w/ FinPercep-RM (ours) | 0.428 | 67.23 | 0.648 | 0.586 |
Ablation Study¶
| Configuration | Outcome | Notes |
|---|---|---|
| Standard IQA reward | Fast convergence but reward hacking | Global metrics improve while local artifacts are prominent |
| FinPercep-RM w/o CCL | Unstable oscillation | Robust but fails to converge |
| FinPercep-RM + CCL | Stable, optimal convergence | Best of both worlds |
Key Findings¶
- Standard IQA rewards lead to pronounced reward hacking — global scores rise while visual quality degrades.
- In the user study, FinPercep-RM's quality scores are highly consistent with human judgment.
- CCL is critical — training curves of FinPercep-RM without CCL exhibit severe oscillations.
Highlights & Insights¶
- Diagnostic Reward Model: Evaluating not only "how good" but also diagnosing "where it fails" represents an important advance for RLHF in ISR.
- Global-Local Coupling Design: Modulating the global score via the degradation map elegantly resolves the blind spot of purely global scoring.
- Clever Data Construction: The synthesis strategy, combining region swapping with dual-level (pixel and feature) difference fusion, is simple and effective.
Limitations & Future Work¶
- The cached content covers only part of the experimental results; the full ablation study may be richer.
- The encoder-decoder introduces additional inference overhead, potentially limiting real-time applicability.
- The stage partitioning and transition timing of CCL require manual tuning.
- The synthetic strategy of the FGR-30k dataset may not cover all artifact types.
- Future work could explore extending the diagnostic reward model to tasks such as video super-resolution.
Related Work & Insights¶
- vs. Direct IQA Rewards: IQA-only global scoring leads to reward hacking; FinPercep-RM provides spatial diagnostic capability.
- vs. Large-scale IQA Models: Large-scale IQA models (e.g., Q-Align) offer some fine-grained perception but incur computational costs incompatible with iterative training.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First systematic treatment of reward hacking in Real-ISR; the diagnostic reward model concept is novel.
- Experimental Thoroughness: ⭐⭐⭐ — Cached content is limited, but core ablations are clear and validated across multiple ISR models.
- Writing Quality: ⭐⭐⭐⭐ — Motivation is conveyed very intuitively through figures; training curve comparisons are persuasive.
- Value: ⭐⭐⭐⭐ — Provides an important methodological contribution to RL-based image restoration; the CCL strategy is transferable to other RLHF scenarios.