HPSv3: Towards Wide-Spectrum Human Preference Score¶
Conference: ICCV 2025
arXiv: 2508.03789
Area: Image Generation / Human Preference Assessment
Keywords: Human preference score, image quality assessment, uncertainty-aware ranking, VLM, image generation evaluation metric
TL;DR¶
HPSv3 constructs the first wide-spectrum human preference dataset HPDv3 (1.08M image-text pairs, 1.17M annotated pairs), trains a preference model using a VLM backbone (Qwen2-VL) with an uncertainty-aware ranking loss, and proposes a Chain-of-Human-Preference (CoHP) iterative generation method, significantly improving the accuracy and coverage of image generation evaluation.
Background & Motivation¶
- Insufficient data coverage: Existing human preference datasets (HPDv2, Pick-a-Pic, ImageReward) primarily contain outputs from Stable Diffusion series models, making them inadequate for evaluating more advanced diffusion Transformers (FLUX) and autoregressive models (Infinity). High-quality real photographs as quality upper-bound references are also absent.
- Insufficient feature extraction: HPSv2, PickScore, and similar methods rely on CLIP as the backbone, while ImageReward uses BLIP; however, these encoders are inferior to modern VLMs in multimodal feature extraction capability.
- Coarse training strategy: Directly applying KL divergence or simple ranking losses fails to account for uncertainty and inconsistency in annotations, which can introduce bias on hard samples.
- Lack of high-quality real image references: Prior datasets lack comparisons between real photographs and AI-generated images, preventing the establishment of a complete quality spectrum.
Method¶
Overall Architecture¶
HPSv3 consists of three components: (1) construction of the HPDv3 wide-spectrum human preference dataset; (2) a VLM-based uncertainty-aware preference model; and (3) CoHP chain-of-preference iterative image generation.
Key Designs¶
1. HPDv3 Dataset Construction
Data sourced from three parts:
- Extended HPDv2: Retains the original 103,700 text prompts and regenerates images using 10+ recent models (FLUX.1-dev, Infinity, Hunyuan, Kolors, SD3, etc.).
- Generation from real photo captions: High-quality photographs are collected from the internet → classified into 12 categories → distribution-aligned with JourneyDB prompts → filtered to the top 10% by aesthetic score → captioned by a VLM → and each model generates images from the resulting captions. This pipeline contributes 57,759 high-quality real photographs (a hypothetical sketch of the curation steps follows this list).
- Midjourney data: 331,955 user-generated images collected along with real user preference selections from the Discord platform.
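The paper describes the real-photo curation pipeline only at a high level; the sketch below is a hypothetical rendering of the filtering-and-captioning steps. The helpers `classify_category`, `aesthetic_score`, and `caption_with_vlm` are stand-ins for whatever classifier, aesthetic predictor, and captioning VLM were actually used, and the JourneyDB distribution-alignment step is omitted.

```python
# Hypothetical sketch of the real-photo curation pipeline (not the authors' code).
# classify_category / aesthetic_score / caption_with_vlm are placeholder callables
# for the unspecified classifier, aesthetic predictor, and captioning VLM.
from collections import defaultdict

def curate_real_photos(photos, classify_category, aesthetic_score, caption_with_vlm,
                       top_fraction=0.10):
    """Keep the top `top_fraction` of photos per category by aesthetic score,
    then caption the survivors so each caption can drive image generation."""
    by_category = defaultdict(list)
    for photo in photos:                                  # 1) classify into categories
        by_category[classify_category(photo)].append(photo)

    curated = []
    for category, items in by_category.items():
        items.sort(key=aesthetic_score, reverse=True)     # 2) rank by aesthetics
        keep = items[: max(1, int(len(items) * top_fraction))]  # 3) keep top ~10%
        for photo in keep:                                # 4) caption with a VLM
            curated.append({"category": category,
                            "image": photo,
                            "prompt": caption_with_vlm(photo)})
    return curated  # the prompts are then fed to each candidate generator
```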
Dataset scale: 1.08M image-text pairs + 1.17M annotated comparison pairs, covering GANs, diffusion models, autoregressive models, and high-/low-quality real photographs.
Annotation quality control:
- Annotators must pass a validation set of 600 pairs (labeled by 20 professional artists with an 80% convergence rate), correctly evaluating at least 16 out of 20 pairs.
- Each image pair is evaluated by 9–19 annotators, with an average convergence rate of 76.5% (vs. 59.9% in HPDv2).
- Only pairs exceeding 95% confidence are used for training.
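The paper (as summarized here) does not spell out how per-pair confidence is computed; one plausible reading, sketched below, treats annotator votes as Bernoulli trials and keeps a pair only if a binomial test on the majority direction clears the 95% threshold. The helper `pair_confidence` and the test itself are assumptions, not the authors' procedure.

```python
# Hypothetical reading of the ">=95% confidence" filter (not the authors' procedure):
# treat each annotator vote as a Bernoulli trial and ask how unlikely the observed
# majority would be if annotators were guessing at random (p = 0.5).
from math import comb

def pair_confidence(votes_a: int, votes_b: int) -> float:
    """Confidence that the majority preference is not a coin-flip artifact:
    1 - P(a majority at least this large under p = 0.5)."""
    n = votes_a + votes_b
    k = max(votes_a, votes_b)
    p_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return 1.0 - p_tail

# Example: 9 annotators split 8 vs 1 -> confidence ~0.98, kept;
# a 6 vs 3 split -> confidence ~0.75, dropped under a 0.95 threshold.
keep = pair_confidence(8, 1) >= 0.95
```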
2. HPSv3 Preference Model
Backbone: Qwen2-VL is adopted as the vision-language model for joint image-text feature extraction, replacing CLIP/BLIP.
Uncertainty-aware ranking loss: Conventional methods predict a deterministic score \(r\), with preference probability expressed as \(\text{sigmoid}(r_1 - r_2)\). HPSv3 models the score as a Gaussian distribution \(r \sim \mathcal{N}(\mu, \sigma)\), introducing predictive uncertainty. The final two layers of the MLP predict \(\mu\) and \(\sigma\) respectively, and the preference probability is obtained by integrating over the Gaussian distribution. This enables the model to distinguish between high-confidence preferences and hard samples with divergent annotations, avoiding overconfidence on the latter.
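Written out, and assuming the two scores are independent Gaussians (the standard Thurstone-style treatment; the paper's exact parameterization may differ in detail), the preference probability and training objective take the form

\[
r_i \sim \mathcal{N}(\mu_i, \sigma_i^2), \qquad
P(x_1 \succ x_2) = \Phi\!\left(\frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}}\right), \qquad
\mathcal{L} = -\log P(x_w \succ x_l),
\]

where \(\Phi\) is the standard normal CDF and \(x_w, x_l\) denote the annotated winner and loser.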
3. CoHP: Chain-of-Human-Preference Image Generation
Two-stage iterative generation pipeline (a hypothetical sketch follows this list):
- Model-wise Preference: Given a prompt, each of \(M\) candidate models generates images over \(N\) rounds; HPSv3 scores the outputs and the best-performing model is selected.
- Sample-wise Preference: The selected model generates \(B\) images → HPSv3 scores and selects the best → the best image is blended with noise as the conditioning input for the next round → iterated for \(S\) rounds → the globally highest-scoring image is selected.
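The CoHP loop can be summarized with the following hypothetical sketch; `generate`, `img2img`, and `hpsv3_score` stand in for the candidate generators, an image-to-image call that blends the previous best image with noise, and the HPSv3 reward model, none of which are given in code by the authors.

```python
# Hypothetical CoHP sketch (not the authors' implementation). Assumed callables:
#   generate(model, prompt)        -> a fresh image from `model`
#   img2img(model, prompt, image)  -> image blended with noise and re-denoised,
#                                     conditioned on the previous best image
#   hpsv3_score(prompt, image)     -> scalar HPSv3 preference score
def cohp(prompt, models, generate, img2img, hpsv3_score,
         n_rounds=3, batch=4, s_rounds=3):
    # Stage 1: model-wise preference -- pick the model whose outputs score best.
    def avg_score(model):
        imgs = [generate(model, prompt) for _ in range(n_rounds)]
        return sum(hpsv3_score(prompt, im) for im in imgs) / len(imgs)
    best_model = max(models, key=avg_score)

    # Stage 2: sample-wise preference -- best-of-batch selection, then iterative
    # image-to-image refinement seeded by the current best image.
    candidates = [generate(best_model, prompt) for _ in range(batch)]
    best_img = max(candidates, key=lambda im: hpsv3_score(prompt, im))
    best_score = hpsv3_score(prompt, best_img)
    for _ in range(s_rounds):
        refined = [img2img(best_model, prompt, best_img) for _ in range(batch)]
        top = max(refined, key=lambda im: hpsv3_score(prompt, im))
        top_score = hpsv3_score(prompt, top)
        if top_score > best_score:            # keep the globally best image
            best_img, best_score = top, top_score
    return best_img
```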
Loss & Training¶
HPSv3 is trained using a negative log-likelihood loss for uncertainty-aware ranking, with the final two MLP layers predicting the mean and standard deviation respectively.
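As a concrete rendering of this objective, the minimal PyTorch sketch below assumes two scalar heads per image predicting \(\mu\) and \(\log\sigma\) (a log-std parameterization is chosen here for numerical stability; the released code may differ), with the Gaussian preference probability from the formula above.

```python
# Minimal PyTorch sketch of the uncertainty-aware ranking loss (an assumption-level
# reconstruction, not the released HPSv3 code). Inputs are per-image mean and
# log-std predictions for the annotated winner (w) and loser (l).
import torch

def uncertainty_ranking_loss(mu_w, logsigma_w, mu_l, logsigma_l):
    """Negative log-likelihood that the winner outscores the loser when each
    score is Gaussian: P(win) = Phi((mu_w - mu_l) / sqrt(sigma_w^2 + sigma_l^2))."""
    var = torch.exp(2 * logsigma_w) + torch.exp(2 * logsigma_l)
    z = (mu_w - mu_l) / torch.sqrt(var + 1e-8)
    normal = torch.distributions.Normal(0.0, 1.0)
    log_p_win = normal.cdf(z).clamp_min(1e-8).log()
    return -log_p_win.mean()

# Usage: mu/logsigma would come from small heads on top of Qwen2-VL features.
mu_w, ls_w = torch.randn(8), torch.zeros(8)
mu_l, ls_l = torch.randn(8), torch.zeros(8)
loss = uncertainty_ranking_loss(mu_w, ls_w, mu_l, ls_l)
```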
Key Experimental Results¶
HPDv3 Benchmark Model Rankings (HPSv3 Score)¶
| Model | Overall Score |
|---|---|
| Kolors | 10.55 |
| FLUX-dev | 10.43 |
| Playground-v2.5 | 10.27 |
| Infinity | 10.26 |
| CogView4 | 9.61 |
| PixArt-Sigma | 9.37 |
| Gemini 2.0 Flash | 9.21 |
| SDXL | 8.20 |
| Hunyuan | 8.19 |
| SD3 | 5.31 |
| SD v2.0 | -0.24 |
Dataset Comparison¶
| Dataset | # Images | # Pairs | Model Types | Real Photos | Convergence Rate |
|---|---|---|---|---|---|
| HPDv2 | 458K | 798K | GAN+Diff+AR | No (HQI) | 59.9% |
| Pick-a-Pic | 638K | 584K | Diff | No | - |
| MHP | 608K | 918K | GAN+Diff+AR | No | - |
| HPDv3 | 1.08M | 1.17M | All | Yes | 76.5% |
Key Findings¶
- The VLM backbone in HPSv3 significantly outperforms CLIP/BLIP-based backbones.
- Uncertainty-aware ranking is more robust on hard samples with high annotation disagreement.
- CoHP iterative generation improves image quality without requiring additional training data.
- Kolors and FLUX-dev rank at the top in overall HPSv3 scores.
- Category-level evaluation reveals that different models excel in different categories (e.g., FLUX performs stronger in architecture and vehicle categories).
Highlights & Insights¶
- "Wide-spectrum" concept: For the first time, GANs, diffusion models, autoregressive models, and high-quality real photographs are systematically incorporated into a unified evaluation framework, establishing a complete quality spectrum from lowest to highest.
- VLM replacing CLIP: Adopting Qwen2-VL as the feature extractor is a natural yet effective upgrade, fully leveraging the superior multimodal understanding capability of modern VLMs.
- Uncertainty modeling: Extending preference scores from point estimates to Gaussian distributions offers a more principled treatment of the inherent subjectivity and inconsistency in human annotations.
- Training-free improvement via CoHP: HPSv3 serves as a reward model that guides iterative sampling; the core mechanism combines best-of-N selection with image-to-image iterative refinement.
- Engineering value of the dataset: HPDv3 itself, as a large-scale preference dataset with high-quality annotations, holds significant infrastructural value for the research community.
- Rigorous annotation quality control: Annotator qualification testing, multi-person cross-annotation, and 95% confidence filtering substantially surpass prior work.
Limitations & Future Work¶
- The VLM backbone (Qwen2-VL) has a large parameter count, leading to significantly higher inference latency and deployment cost compared to CLIP-based methods.
- CoHP requires multiple rounds of generation and scoring, resulting in lower inference efficiency.
- Cultural and individual differences in subjective preferences are not explicitly modeled.
- The quality of user preference labels in the Midjourney data may be lower than that of professional annotations.
- The dataset is predominantly composed of English prompts, and multilingual generalization has not been validated.
Related Work & Insights¶
- HPSv2 [Wu et al., 2023]: Predecessor work, using a CLIP backbone with the HPDv2 dataset.
- PickScore [Kirstain et al., 2023]: A preference model fine-tuned from CLIP.
- ImageReward [Xu et al., 2023]: A reward model with a BLIP backbone.
- MPS [Zhang et al., 2024]: CVPR 2024, a human preference score incorporating diversity.
- Qwen2-VL [Wang et al., 2024]: The vision-language model serving as the backbone of HPSv3.
Rating¶
| Dimension | Score |
|---|---|
| Novelty | 4/5 |
| Experimental Thoroughness | 4/5 |
| Value | 5/5 |
| Writing Quality | 4/5 |
| Overall | 4/5 |