Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment¶
Conference: ICCV 2025 arXiv: 2507.19002 Code: GitHub Area: Image Generation Evaluation / Reward Models Keywords: Reward Model, ICT Score, HP Score, Text-Image Alignment, Human Preference, Diffusion Model Optimization
TL;DR¶
This paper identifies a "scoring paradox" in CLIP/BLIP-based reward models when evaluating high-quality images — detail-rich, high-fidelity images are paradoxically assigned lower scores. The authors propose two new metrics: ICT Score (Image-Contained-Text, measuring the degree to which an image encodes the textual information) and HP Score (a purely image-modal human preference score). Training on the Pick-High dataset yields over 10% improvement in preference prediction accuracy and successfully guides SD3.5-Turbo toward generating higher-quality images.
Background & Motivation¶
Core Problem¶
Modern diffusion models (SD3.5, FLUX) can generate highly faithful and aesthetically rich images that far exceed basic text-image alignment requirements. However, existing evaluation frameworks (CLIP Score, PickScore, ImageReward) have not evolved accordingly.
The Scoring Paradox of Reward Models¶
Reward models fine-tuned from CLIP/BLIP exhibit a fundamental flaw: they assign lower scores to detail-rich, aesthetically superior images, diverging significantly from genuine human preferences.
Information-theoretic interpretation: Following the principle of information decomposition, the total information in an image is \(I(v) = I(v;t) + I(v|t)\), where \(I(v;t)\) denotes the mutual information between image and text (the alignment component) and \(I(v|t)\) denotes image-specific information (aesthetics, texture, atmosphere, etc.).
The CLIP scoring mechanism is based on the cosine similarity between the image and text embeddings: \(\text{CLIP}(v, t) = \frac{f(v) \cdot g(t)}{\|f(v)\|\,\|g(t)\|}\), where \(f\) and \(g\) denote the image and text encoders.
When a high-quality model generates a detail-rich image, although \(I(v;t)\) increases, \(I(v|t)\) grows faster, causing the denominator to increase more than the numerator — resulting in an overall score decrease.
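This effect can be reproduced numerically. In the toy embedding space below (an illustration, not the paper's code), adding text-orthogonal "detail" to an image embedding leaves the dot product with the text unchanged but inflates the image norm, so the cosine score drops:

```python
import numpy as np

# Toy embedding space (illustration only, not the paper's code).
t = np.array([1.0, 0.0, 0.0])        # text embedding
v_plain = np.array([1.0, 0.2, 0.0])  # plain image, mostly aligned with t
detail = np.array([0.0, 0.0, 1.0])   # visual detail orthogonal to the text

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Richer image: same aligned component I(v;t), extra text-independent I(v|t).
v_rich = v_plain + 2.0 * detail

print(cos(v_plain, t))  # denominator is small, score is high
print(cos(v_rich, t))   # denominator grew, numerator did not: score drops
```

The richer image scores strictly lower despite conveying the same textual content, which is exactly the paradox the paper formalizes.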
Practical Consequences¶
This means that using CLIP Score / PickScore / ImageReward as reward functions to optimize advanced models such as SD3.5 actually steers the model toward generating visually sparse, aesthetically impoverished images — a core dilemma in the field.
Method¶
Overall Architecture¶
The paper makes three core contributions:
1. Pick-High Dataset: A large-scale high-quality image preference dataset
2. ICT Score: A new evaluation objective beyond text-image alignment
3. HP Score: A purely image-modal human preference score
Key Design 1: Pick-High Dataset¶
- 360K text prompts curated from PickAPic_v2
- LLM Chain-of-Thought reasoning used to craft refined prompts that better reflect human aesthetic preferences
- 360K high-quality images generated by SOTA models from refined prompts
- Triplet preference ranking constructed: \(I_1\) (non-preferred) \(< I_2\) (preferred) \(< I_3\) (generated from refined prompt)
Key Design 2: ICT Score (Image-Contained-Text)¶
The core idea is to measure the degree to which an image encodes textual information, rather than enforcing bidirectional alignment. This avoids penalizing high-quality images that contain rich visual details beyond the text description.
Thresholding mechanism (to mitigate CLIP's bias against high-quality images):
ICT scores for the base prompt:
- \(E_3 = 1\) (full score for high-quality images)
- \(E_2 = \mathcal{C}(I_2, P_{\text{easy}})\)
- \(E_1 = \min(\mathcal{C}(I_1, P_{\text{easy}}), E_2)\) (to ensure ranking consistency)

ICT scores for the refined prompt, incorporating inter-text similarity:
- \(R_3 = 1\)
- \(R_2 = E_2 \times \text{CLIP}(P_{\text{easy}}, P_{\text{ref}})\)
- \(R_1 = E_1 \times \text{CLIP}(P_{\text{easy}}, P_{\text{ref}})\)
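The label construction above can be sketched as a small helper (the function name and signature are mine, not the authors'):

```python
def ict_labels(c1, c2, sim_easy_ref):
    """Construct ICT training labels for a triplet (I1, I2, I3).

    c1, c2       : CLIP similarities C(I1, P_easy), C(I2, P_easy) in [0, 1]
    sim_easy_ref : CLIP text-text similarity CLIP(P_easy, P_ref)

    Returns (E1, E2, E3) for the base prompt and (R1, R2, R3) for the
    refined prompt, following the thresholding rules described above.
    """
    E3 = 1.0              # full score for the high-quality image I3
    E2 = c2
    E1 = min(c1, E2)      # clamp to keep the ranking E1 <= E2
    R3 = 1.0
    R2 = E2 * sim_easy_ref  # discount by inter-prompt similarity
    R1 = E1 * sim_easy_ref
    return (E1, E2, E3), (R1, R2, R3)
```

For example, `ict_labels(0.9, 0.8, 0.5)` clamps \(E_1\) down to \(E_2 = 0.8\) because \(I_1\) must not outrank \(I_2\).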
ICT model training: CLIP is fine-tuned with an MSE loss that aligns predicted scores with the ICT labels, \(\mathcal{L}_{\text{ICT}} = \mathbb{E}\big[(\mathcal{C}_\theta(I, P) - S)^2\big]\), where \(S\) is the corresponding label (\(E_i\) or \(R_i\)).
Hard negative mining: A sigmoid-based weighting strategy down-weights potential false negatives in an auxiliary ranking term \(\mathcal{L}_{\text{neg}}\).
Total loss: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{ICT}} + \lambda \mathcal{L}_{\text{neg}}\)
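A minimal PyTorch sketch of this objective. The MSE term follows directly from the description; the exact form of the sigmoid weighting is an assumption on my part, since the paper's released implementation may differ:

```python
import torch
import torch.nn.functional as F

def ict_loss(pred, target):
    # MSE between the fine-tuned CLIP scores C_theta(I, P) and ICT labels
    return F.mse_loss(pred, target)

def neg_loss(pred_pos, pred_neg):
    # Hypothetical sigmoid-weighted hard-negative term: penalize pairs
    # ranked the wrong way, weighting each violation by a sigmoid of the
    # score gap so that suspected false negatives contribute less.
    gap = pred_pos - pred_neg
    weight = torch.sigmoid(gap).detach()
    return (weight * F.relu(-gap)).mean()

lam = 0.1  # hypothetical value of lambda
pred = torch.tensor([0.75, 0.90])
target = torch.tensor([0.80, 1.00])
total = ict_loss(pred, target) + lam * neg_loss(pred[1:], pred[:1])
```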
Key Design 3: HP Score (High-Preference)¶
Once the ICT score reaches its upper bound (i.e., the image fully conveys the textual semantics), quality must be further assessed from a purely image-modal perspective.
A margin ranking loss over the triplets \(\{I_1, I_2, I_3\}\) is used to fine-tune a CLIP image encoder with an MLP head, enforcing the score ordering \(s(I_1) < s(I_2) < s(I_3)\).
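This objective maps directly onto PyTorch's built-in margin ranking loss; the margin value below is illustrative, not the paper's setting:

```python
import torch
import torch.nn.functional as F

def hp_ranking_loss(s1, s2, s3, margin=0.1):
    """Margin ranking loss over a preference triplet I1 < I2 < I3.

    s1, s2, s3 : scalar HP scores from the CLIP image encoder + MLP head.
    Enforces s2 > s1 and s3 > s2 by at least `margin`.
    """
    one = torch.ones_like(s1)
    return (F.margin_ranking_loss(s2, s1, one, margin=margin)
            + F.margin_ranking_loss(s3, s2, one, margin=margin))
```

When the ordering is already satisfied with room to spare, e.g. scores 0.0 < 0.5 < 1.0, the loss is zero; reversing the ordering yields a positive penalty.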
Combined usage: ICT × HP = ICT-HP Score, jointly evaluating textual expressiveness and aesthetic quality.
Diffusion Model Optimization¶
The DRaFT-K method is employed to fine-tune SD3.5-Large-Turbo by directly maximizing the differentiable ICT/HP/ICT-HP reward functions.
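DRaFT-K's key trick, backpropagating the reward only through the last K sampling steps, can be illustrated with a toy denoiser. Everything here is a stand-in for the SD3.5 pipeline and reward model, not the authors' code:

```python
import torch

torch.manual_seed(0)

class ToyDenoiser(torch.nn.Module):
    """Stand-in for the diffusion model: one linear refinement per step."""
    def __init__(self):
        super().__init__()
        self.step = torch.nn.Linear(4, 4)

    def forward(self, x):
        return x + self.step(x)

def sample_with_draft_k(model, x, T=4, K=1):
    # Early steps run without gradients; only the last K steps are
    # differentiable, which is the truncation DRaFT-K performs.
    with torch.no_grad():
        for _ in range(T - K):
            x = model(x)
    for _ in range(K):
        x = model(x)
    return x

model = ToyDenoiser()
reward = lambda img: -img.pow(2).mean()  # stand-in differentiable reward
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x0 = torch.randn(1, 4)
img = sample_with_draft_k(model, x0, T=4, K=1)
loss = -reward(img)  # maximize the reward
loss.backward()
opt.step()
```

In the paper's setting the reward is the differentiable ICT/HP/ICT-HP score and the sampler is SD3.5-Large-Turbo; the gradient truncation keeps memory cost independent of the full sampling depth.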
Key Experimental Results¶
Preference Prediction Accuracy¶
Evaluated on the Pick-High + PickAPic_v2 test sets:
| Model | Mean Acc. ↑ | \(I_2 > I_1\) ↑ | \(I_3 > I_2\) ↑ | \(I_3 > I_1\) ↑ |
|---|---|---|---|---|
| Random | 50.00 | 50.00 | 50.00 | 50.00 |
| CLIP | 60.30 | 64.29 | 52.80 | 63.79 |
| ImageReward | 63.81 | 64.58 | 58.02 | 68.84 |
| PickScore | 79.04 | 74.80 | 75.37 | 86.94 |
| ICT | 87.58 | 64.65 | 100.00 | 100.00 |
| HP | 88.47 | 64.97 | 100.00 | 100.00 |
| ICT-HP | 88.84 | 66.42 | 100.00 | 100.00 |
Key findings:
- ICT-HP improves mean accuracy over PickScore by roughly 10 points (88.84 vs. 79.04)
- On high-quality image comparisons (\(I_3 > I_2\)), ICT/HP/ICT-HP achieve 100% accuracy (PickScore: 75.37%)
- On that same comparison, CLIP (52.80) and ImageReward (58.02) barely beat random chance, confirming that text-image alignment objectives are inadequate for evaluating images beyond basic alignment
Diffusion Model Optimization Results¶
Quantitative evaluation on the GenEval benchmark:
| Model | Mean ↑ | Single ↑ | Counting ↑ | Colors ↑ | Position ↑ |
|---|---|---|---|---|---|
| SD3.5-Turbo | 0.69 | 0.99 | 0.69 | 0.80 | 0.25 |
| + PickScore | 0.66 ↓ | 0.99 | 0.67 | 0.74 ↓ | 0.24 |
| + ImageReward | 0.70 | 0.99 | 0.68 | 0.80 | 0.28 |
| + ICT | 0.71 | 0.98 | 0.70 | 0.81 | 0.31 |
| + ICT-HP | 0.70 | 0.99 | 0.68 | 0.79 | 0.28 |
| + CLIP (crash) | 0.13 | 0.38 | 0.06 | 0.26 | 0.01 |
Key findings:
- Optimizing with PickScore degrades Colors (0.80 → 0.74) and the overall mean, validating the deficiencies of existing reward models
- Using CLIP directly as a reward function causes training collapse (Mean: 0.13), rendering it completely unusable
- ICT achieves the best results on Mean, Counting, and Position, demonstrating that the ICT objective effectively avoids penalizing high-quality images
JPEG Compression Rate and Aesthetic Score¶
| Model | JPEG Compression Rate ↑ | Aesthetic Score ↑ |
|---|---|---|
| SD3.5-Large | 374.80 | 6.307 |
| FLUX.1-dev | 270.58 | 6.436 |
| SD3.5-Turbo | 313.10 | 6.293 |
| + HP (Ours) | 334.86 | 6.448 |
| + ICT-HP (Ours) | 330.23 | 6.300 |
HP-optimized SD3.5-Turbo surpasses even the larger SD3.5-Large and FLUX.1-dev in aesthetic score.
Highlights & Insights¶
- Precise problem formulation: The scoring paradox of the CLIP alignment paradigm is rigorously derived from an information-theoretic perspective, grounded in theoretical analysis rather than empirical observation alone.
- Elegant design of ICT Score: Reformulating "bidirectional alignment" as a unidirectional "image contains text" evaluation eliminates the penalty imposed on rich visual information beyond the text description.
- Two-stage evaluation system: ICT ensures textual information is fully expressed → HP further assesses aesthetic quality on top of that, with the two metrics complementing each other.
- Practical impact: Reward functions used in existing RLHF-based diffusion model optimization may actively degrade the quality of advanced models; this paper provides a corrective solution.
- Transferability of the ICT text encoder: Direct transfer to SD2.1 significantly improves image quality.
Limitations & Future Work¶
- The Pick-High dataset relies on LLM-generated refined prompts, which may introduce systematic bias.
- The threshold parameter \(\theta\) in ICT Score requires tuning and may need different settings for different models.
- Experiments are primarily validated on SD3.5-Turbo; generalizability to other architectures (e.g., DiT, FLUX) remains to be verified.
- HP Score operates solely on the image modality and may fail to distinguish "high-quality but off-topic" images from genuinely preferred ones.
Related Work & Insights¶
- Reward models: CLIP Score → ImageReward (human preference fine-tuning) → PickScore (large-scale preference data) → HPSv2 → Ours (ICT+HP beyond alignment)
- Preference datasets: PickAPic_v2, ImageRewardDB, HPDv2 → Ours: Pick-High (high-quality triplet ranking)
- Diffusion model optimization: Reward-based fine-tuning methods such as DRaFT-K and ReFL
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Reveals the scoring paradox from an information-theoretic standpoint and proposes objectives beyond alignment; the insight is profound.
- Technical Depth: ⭐⭐⭐⭐ — ICT/HP Score designs are elegant, though the overall methodology is not particularly complex.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive coverage across preference prediction, GenEval, aesthetics, human evaluation, and transfer experiments.
- Value: ⭐⭐⭐⭐⭐ — Directly addresses the core pain point of RLHF-based optimization for state-of-the-art diffusion models.