
Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment

Conference: ICCV 2025 arXiv: 2507.19002 Code: GitHub Area: Image Generation Evaluation / Reward Models Keywords: Reward Model, ICT Score, HP Score, Text-Image Alignment, Human Preference, Diffusion Model Optimization

TL;DR

This paper identifies a "scoring paradox" in CLIP/BLIP-based reward models when evaluating high-quality images — detail-rich, high-fidelity images are paradoxically assigned lower scores. The authors propose two new metrics: ICT Score (Image-Contained-Text, measuring the degree to which an image encodes the textual information) and HP Score (a purely image-modal human preference score). Training on the Pick-High dataset yields over 10% improvement in preference prediction accuracy and successfully guides SD3.5-Turbo toward generating higher-quality images.

Background & Motivation

Core Problem

Modern diffusion models (SD3.5, FLUX) can generate highly faithful and aesthetically rich images that far exceed basic text-image alignment requirements. However, existing evaluation frameworks (CLIP Score, PickScore, ImageReward) have not evolved accordingly.

The Scoring Paradox of Reward Models

Reward models fine-tuned from CLIP/BLIP exhibit a fundamental flaw: they assign lower scores to detail-rich, aesthetically superior images, diverging significantly from genuine human preferences.

Information-theoretic interpretation: Following the principle of information decomposition, the total information in an image is \(I(v) = I(v;t) + I(v|t)\), where \(I(v;t)\) denotes the mutual information between image and text (the alignment component) and \(I(v|t)\) denotes image-specific information (aesthetics, texture, atmosphere, etc.).

The CLIP scoring mechanism is based on cosine similarity:

\[\text{CLIP}(v,t) \approx \frac{I(v;t)}{\sqrt{I(t) \cdot (I(v;t) + I(v|t))}}\]

When a high-quality model generates a detail-rich image, although \(I(v;t)\) increases, \(I(v|t)\) grows faster, causing the denominator to increase more than the numerator — resulting in an overall score decrease.
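Under the approximation above, the paradox can be reproduced with a few made-up information values (the numbers below are illustrative, not measurements from the paper):

```python
import math

def clip_proxy(i_vt: float, i_v_given_t: float, i_t: float = 1.0) -> float:
    """Proxy for CLIP(v, t) ≈ I(v;t) / sqrt(I(t) * (I(v;t) + I(v|t)))."""
    return i_vt / math.sqrt(i_t * (i_vt + i_v_given_t))

# Plain image: modest alignment, little image-specific detail.
plain = clip_proxy(i_vt=0.8, i_v_given_t=0.4)

# Richer image: alignment improves, but image-specific information
# (texture, atmosphere, fine detail) grows even faster.
rich = clip_proxy(i_vt=0.9, i_v_given_t=1.5)

assert rich < plain  # the detail-rich image is scored LOWER
```

Even though \(I(v;t)\) rises from 0.8 to 0.9, the faster growth of \(I(v|t)\) inflates the denominator more than the numerator, so the richer image receives the lower score.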

Practical Consequences

This means that using CLIP Score / PickScore / ImageReward as reward functions to optimize advanced models such as SD3.5 actually steers the model toward generating visually sparse, aesthetically impoverished images — a core dilemma in the field.

Method

Overall Architecture

The paper makes three core contributions:

  1. Pick-High Dataset: a large-scale, high-quality image preference dataset
  2. ICT Score: a new evaluation objective beyond text-image alignment
  3. HP Score: a purely image-modal human preference score

Key Design 1: Pick-High Dataset

  • 360K text prompts curated from PickAPic_v2
  • LLM Chain-of-Thought reasoning used to craft refined prompts that better reflect human aesthetic preferences
  • 360K high-quality images generated by SOTA models from refined prompts
  • Triplet preference ranking constructed: \(I_1\) (non-preferred) \(< I_2\) (preferred) \(< I_3\) (generated from refined prompt)

Key Design 2: ICT Score (Image-Contained-Text)

The core idea is to measure the degree to which an image encodes textual information, rather than enforcing bidirectional alignment. This avoids penalizing high-quality images that contain rich visual details beyond the text description.

Thresholding mechanism (to mitigate CLIP's bias against high-quality images):

\[\mathcal{C}(I, P) = \min\left(\frac{\text{CLIP}(I, P)}{\theta}, 1\right)\]
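A minimal sketch of the thresholding rule; the value \(\theta = 0.35\) below is an illustrative placeholder, not the paper's setting:

```python
def thresholded_ict(clip_score: float, theta: float = 0.35) -> float:
    """C(I, P) = min(CLIP(I, P) / theta, 1).

    Scores at or above the threshold theta saturate at 1, so a
    detail-rich image that already encodes the prompt is not
    penalized for carrying extra visual information."""
    return min(clip_score / theta, 1.0)

low = thresholded_ict(0.20)   # below theta: partial credit
high = thresholded_ict(0.40)  # above theta: saturates at 1.0
assert low < high == 1.0
```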

ICT scores for the base prompt:

  • \(E_3 = 1\) (full score for the high-quality image)
  • \(E_2 = \mathcal{C}(I_2, P_{\text{easy}})\)
  • \(E_1 = \min(\mathcal{C}(I_1, P_{\text{easy}}), E_2)\) (to ensure ranking consistency)

ICT scores for the refined prompt, incorporating inter-text similarity:

  • \(R_3 = 1\)
  • \(R_2 = E_2 \times \text{CLIP}(P_{\text{easy}}, P_{\text{ref}})\)
  • \(R_1 = E_1 \times \text{CLIP}(P_{\text{easy}}, P_{\text{ref}})\)
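Putting the two label sets together, the triplet label construction can be sketched as follows (the helper name `ict_labels` is mine, not from the paper's code):

```python
def ict_labels(c1: float, c2: float, sim_easy_ref: float):
    """Build ICT training labels for a triplet I1 < I2 < I3.

    c1, c2       : thresholded CLIP scores C(I1, P_easy), C(I2, P_easy)
    sim_easy_ref : CLIP(P_easy, P_ref), similarity between the base
                   prompt and the LLM-refined prompt
    """
    # Base-prompt labels: I3 gets full score; I1 is clamped by E2
    # so the ranking E1 <= E2 stays consistent.
    e3 = 1.0
    e2 = c2
    e1 = min(c1, e2)
    # Refined-prompt labels fold in the inter-prompt similarity.
    r3 = 1.0
    r2 = e2 * sim_easy_ref
    r1 = e1 * sim_easy_ref
    return (e1, e2, e3), (r1, r2, r3)

(e1, e2, e3), (r1, r2, r3) = ict_labels(c1=0.9, c2=0.8, sim_easy_ref=0.7)
assert e1 <= e2 <= e3 and r1 <= r2 <= r3
```

The clamp on \(E_1\) matters when the non-preferred image happens to get a higher raw CLIP score than the preferred one, as in the example above (0.9 vs. 0.8).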

ICT model training: CLIP is fine-tuned using MSE loss to align predicted scores with ICT labels:

\[\mathcal{L}_{\text{ICT}} = \sum_{i=1}^3 (E_i - y_{i,e})^2 + \sum_{i=1}^3 (R_i - y_{i,r})^2\]

Hard negative mining: A sigmoid-based weighting strategy is introduced to handle potential false negatives:

\[w(y) = \frac{1}{1 + e^{\alpha(|y| - \beta)}}\]

Total loss: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{ICT}} + \lambda \mathcal{L}_{\text{neg}}\)
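The weighting function is straightforward to sketch. Here \(\alpha = 10\) and \(\beta = 0.5\) are illustrative values, and the reading that pairs with a large score gap \(|y|\) are down-weighted as likely false negatives follows the description above:

```python
import math

def false_negative_weight(y: float, alpha: float = 10.0, beta: float = 0.5) -> float:
    """w(y) = 1 / (1 + exp(alpha * (|y| - beta))).

    The weight stays near 1 while the score gap |y| is below beta and
    falls off smoothly beyond it, softening the loss contribution of
    pairs that are likely mislabeled (false negatives)."""
    return 1.0 / (1.0 + math.exp(alpha * (abs(y) - beta)))

assert false_negative_weight(0.1) > 0.9   # small gap: near-full weight
assert false_negative_weight(0.9) < 0.1   # large gap: strongly down-weighted
```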

Key Design 3: HP Score (High-Preference)

Once the ICT score reaches its upper bound (i.e., the image fully conveys the textual semantics), quality must be further assessed from a purely image-modal perspective.

A margin ranking loss is applied to fine-tune a CLIP image encoder with an MLP head on triplets \(\{I_1, I_2, I_3\}\):

\[\mathcal{L}_{\text{margin}} = \sum\left[\max(0, -\Delta(I_2, I_1) + m) + \max(0, -\Delta(I_3, I_2) + m)\right]\]
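A scalar sketch of the margin term for a single triplet, taking \(\Delta(a, b)\) as the HP-score difference \(s_a - s_b\) (the margin \(m = 0.1\) is illustrative):

```python
def hp_margin_loss(s1: float, s2: float, s3: float, m: float = 0.1) -> float:
    """Margin ranking loss on HP scores of a triplet I1 < I2 < I3.

    Each hinge term is zero once the preferred image's score exceeds
    the less-preferred one's by at least the margin m."""
    return max(0.0, -(s2 - s1) + m) + max(0.0, -(s3 - s2) + m)

assert hp_margin_loss(0.2, 0.5, 0.9) == 0.0  # both margins satisfied
assert hp_margin_loss(0.5, 0.5, 0.9) > 0.0   # I2 fails to beat I1 by m
```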

Combined usage: ICT × HP = ICT-HP Score, jointly evaluating textual expressiveness and aesthetic quality.

Diffusion Model Optimization

The DRaFT-K method is employed to fine-tune SD3.5-Large-Turbo by directly maximizing the differentiable ICT/HP/ICT-HP reward functions.
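The DRaFT-K idea — backpropagating the differentiable reward through only the last \(K\) denoising steps — can be sketched with toy stand-ins. The linear "denoiser", the reward, and all hyperparameters below are placeholders, not the paper's models:

```python
import torch

# Toy stand-ins (assumptions): a linear "denoiser" and a
# differentiable reward on the final sample.
denoiser_net = torch.nn.Linear(8, 8)

def denoiser(x, t, prompts):
    return denoiser_net(x)

def reward_fn(x, prompts):
    return -x.pow(2).sum(dim=-1)  # higher reward = sample closer to 0

def draft_k_update(x_T, prompts, opt, K=1, num_steps=4):
    """One DRaFT-K update (sketch): run the full sampling loop, keep
    gradients only through the last K denoising steps, then take a
    gradient step that maximizes the reward."""
    x = x_T
    for t in range(num_steps):
        with torch.set_grad_enabled(t >= num_steps - K):
            x = denoiser(x, t, prompts)
    loss = -reward_fn(x, prompts).mean()  # maximizing reward
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

opt = torch.optim.SGD(denoiser_net.parameters(), lr=1e-2)
loss = draft_k_update(torch.randn(2, 8), prompts=None, opt=opt)
```

Truncating the backward pass to the last \(K\) steps is what keeps reward fine-tuning of a multi-step sampler memory-feasible.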

Key Experimental Results

Preference Prediction Accuracy

Evaluated on the Pick-High + PickAPic_v2 test sets:

Model Mean Acc. ↑ \(I_2 > I_1\) \(I_3 > I_2\) \(I_3 > I_1\)
Random 50.00 50.00 50.00 50.00
CLIP 60.30 64.29 52.80 63.79
ImageReward 63.81 64.58 58.02 68.84
PickScore 79.04 74.80 75.37 86.94
ICT 87.58 64.65 100.00 100.00
HP 88.47 64.97 100.00 100.00
ICT-HP 88.84 66.42 100.00 100.00

Key findings:

  • ICT-HP improves mean accuracy over PickScore by roughly 10 percentage points (88.84 vs. 79.04)
  • On high-quality image comparisons (\(I_3 > I_2\)), ICT/HP/ICT-HP achieve 100% accuracy (PickScore: 75.37%)
  • On that same comparison, CLIP/BLIP-based multimodal baselines only marginally outperform random chance, confirming that text-image alignment objectives are fundamentally inadequate for evaluating generation quality

Diffusion Model Optimization Results

Quantitative evaluation on the GenEval benchmark:

Model Mean ↑ Single ↑ Counting ↑ Colors ↑ Position ↑
SD3.5-Turbo 0.69 0.99 0.69 0.80 0.25
+ PickScore 0.66 ↓ 0.99 0.67 0.74 ↓ 0.24
+ ImageReward 0.70 0.99 0.68 0.80 0.28
+ ICT 0.71 0.98 0.70 0.81 0.31
+ ICT-HP 0.70 0.99 0.68 0.79 0.28
+ CLIP (crash) 0.13 0.38 0.06 0.26 0.01

Key findings:

  • Optimizing with PickScore causes Colors and other metrics to degrade, validating the deficiencies of existing reward models
  • Using CLIP directly as a reward function causes training collapse (Mean: 0.13), rendering it completely unusable
  • ICT achieves the best results on Mean, Counting, and Position, demonstrating that the ICT objective effectively avoids penalizing high-quality images

JPEG Compression Rate and Aesthetic Score

Model JPEG Compression Rate ↑ Aesthetic Score ↑
SD3.5-Large 374.80 6.307
FLUX.1-dev 270.58 6.436
SD3.5-Turbo 313.10 6.293
+ HP (Ours) 334.86 6.448
+ ICT-HP (Ours) 330.23 6.300

HP-optimized SD3.5-Turbo surpasses even the larger SD3.5-Large and FLUX.1-dev in aesthetic score.

Highlights & Insights

  1. Precise problem formulation: The scoring paradox of the CLIP alignment paradigm is rigorously derived from an information-theoretic perspective, grounded in theoretical analysis rather than empirical observation alone.
  2. Elegant design of ICT Score: Reformulating "bidirectional alignment" as a unidirectional "image contains text" evaluation eliminates the penalty imposed on rich visual information beyond the text description.
  3. Two-stage evaluation system: ICT ensures textual information is fully expressed → HP further assesses aesthetic quality on top of that, with the two metrics complementing each other.
  4. Practical impact: Reward functions used in existing RLHF-based diffusion model optimization may actively degrade the quality of advanced models; this paper provides a corrective solution.
  5. Transferability of the ICT text encoder: Direct transfer to SD2.1 significantly improves image quality.

Limitations & Future Work

  1. The Pick-High dataset relies on LLM-generated refined prompts, which may introduce systematic bias.
  2. The threshold parameter \(\theta\) in ICT Score requires tuning and may need different settings for different models.
  3. Experiments are primarily validated on SD3.5-Turbo; generalizability to other architectures (e.g., DiT, FLUX) remains to be verified.
  4. HP Score operates solely on the image modality and may fail to distinguish "high-quality but off-topic" images from genuinely preferred ones.
Related Work

  • Reward models: CLIP Score → ImageReward (human preference fine-tuning) → PickScore (large-scale preference data) → HPSv2 → Ours (ICT+HP beyond alignment)
  • Preference datasets: PickAPic_v2, ImageRewardDB, HPDv2 → Ours: Pick-High (high-quality triplet ranking)
  • Diffusion model optimization: Reward-based fine-tuning methods such as DRaFT-K and ReFL

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Reveals the scoring paradox from an information-theoretic standpoint and proposes objectives beyond alignment; the insight is profound.
  • Technical Depth: ⭐⭐⭐⭐ — ICT/HP Score designs are elegant, though the overall methodology is not particularly complex.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive coverage across preference prediction, GenEval, aesthetics, human evaluation, and transfer experiments.
  • Value: ⭐⭐⭐⭐⭐ — Directly addresses the core pain point of RLHF-based optimization for state-of-the-art diffusion models.