
Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences

Conference: ICCV 2025 · arXiv: 2506.02095 · Code: https://cyclereward.github.io/ · Area: Image Generation · Keywords: cycle consistency, reward model, image-text alignment, preference learning, DPO, self-supervised

TL;DR

This paper proposes CycleReward, which leverages cycle consistency as a self-supervised signal to replace human preference annotations — captions are reconstructed into images via a T2I model and ranked by visual similarity, yielding the 866K preference-pair dataset CyclePrefDB. The trained reward model outperforms HPSv2/PickScore/ImageReward by 6%+ on detailed captioning, and DPO training with it improves VLM performance across multiple vision-language tasks, all without any human annotation.

Background & Motivation

Background: Image-text alignment measurement is a central challenge in multimodal learning. Existing reward models (ImageReward, HPSv2, PickScore) rely on large-scale human preference annotations, which are costly and difficult to scale. GPT-4V annotation is an alternative but is expensive, closed-source, and rate-limited.

Limitations of Prior Work: (1) Human preference data collection is costly and hard to scale; (2) existing preference data mainly targets short texts (~20 tokens), making it ineffective for evaluating long descriptive captions; (3) embedding-based methods such as CLIP are insensitive to long text.

Key Challenge: Long descriptive captions are increasingly important (e.g., detailed descriptions generated by ShareGPT4V and LLaVA), yet effective alignment metrics for evaluating them are lacking. Direct cross-modal comparison (image vs. text) is difficult, whereas comparing images to images in the same modality is considerably more tractable.

Goal: To construct a preference dataset and reward model without human annotation via cycle consistency, with a particular focus on alignment evaluation for long descriptive captions.

Key Insight: The classical cycle consistency idea — \(x \xrightarrow{F} y \xrightarrow{G} x'\) — states that the more accurate the caption \(y\), the closer the reconstructed image \(x' = G(y)\) is to the original \(x\). This similarity is used as a preference signal rather than directly as a metric.

Core Idea: Rank caption/image candidates using cycle consistency scores to construct a preference dataset for training a reward model, thereby enabling image-text alignment learning without human annotation.
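Concretely, the labeling rule can be stated as follows (notation anticipates the Method section; \(G\) is a text-to-image model and \(d_{\text{img}}\) an image similarity such as DreamSim):

\[
y_i \succ y_j \quad \Longleftrightarrow \quad d_{\text{img}}\big(x,\, G(y_i)\big) > d_{\text{img}}\big(x,\, G(y_j)\big),
\]

with the symmetric rule, using an image-to-text model \(F\) and a text similarity \(d_{\text{text}}\) (SBERT), labeling image preferences in the text-to-image direction.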

Method

Overall Architecture

(1) Cycle consistency scoring: Image → Captions (generated by multiple models) → T2I reconstruction → DreamSim similarity; Text → Images (generated by multiple models) → I2T reconstruction → SBERT similarity. (2) Preference ranking: the candidate with higher similarity is labeled as preferred. (3) Construction of the 866K preference-pair dataset CyclePrefDB. (4) Training of the reward model CycleReward (BLIP backbone + Bradley-Terry loss).
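The pipeline can be sketched compactly for the image-to-text direction. In the sketch below, `captioners`, `t2i_reconstruct`, and `image_similarity` are placeholders for the paper's components (the 11 I2T models, Stable Diffusion 3, DreamSim); the threshold values are illustrative defaults, not the paper's, and the function names are not the authors' released code.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Candidate:
    caption: str
    score: float  # cycle-consistency score d_img(x, G(y))

def score_captions(image, captioners, t2i_reconstruct, image_similarity):
    """Caption the image with every model, reconstruct each caption, and score it."""
    candidates = []
    for captioner in captioners:
        caption = captioner(image)                     # y_i = F_i(x)
        reconstruction = t2i_reconstruct(caption)      # G(y_i), e.g. Stable Diffusion 3
        sim = image_similarity(image, reconstruction)  # d_img(x, G(y_i)), e.g. DreamSim
        candidates.append(Candidate(caption, sim))
    return candidates

def make_preference_pairs(candidates, tau_sim=0.05, min_chosen=0.3):
    """Turn scored candidates into (chosen, rejected) pairs, dropping ambiguous ones."""
    pairs = []
    for a, b in combinations(candidates, 2):
        chosen, rejected = (a, b) if a.score > b.score else (b, a)
        if chosen.score - rejected.score < tau_sim:   # reward gap too small: skip
            continue
        if chosen.score < min_chosen:                 # even the preferred sample is poor: skip
            continue
        pairs.append((chosen.caption, rejected.caption))
    return pairs
```

The text-to-image direction is symmetric: swap the roles of captions and images, reconstruct with an I2T model, and score with SBERT instead of DreamSim.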

Key Designs

  1. Cycle Consistency as a Preference Signal:

    • Image-to-Text direction: Given an image \(x\), 11 I2T models (BLIP2, LLaVA series, InternVL2 series) generate captions \(\{y_i\}\) of varying quality. Each caption is reconstructed back into an image \(G(y_i)\) by Stable Diffusion 3, and DreamSim computes \(d_{img}(x, G(y_i))\); the caption whose reconstruction is most similar to \(x\) is preferred.
    • Text-to-Image direction: Given a caption \(y\), 4 T2I models (SD1.5, SDXL, SD3, FLUX) × 3 seeds generate candidate images \(\{x_i\}\). LLaVA-1.5-13B reverse-captions each image to \(F(x_i)\), and SBERT computes \(d_{text}(y, F(x_i))\); the image whose reverse caption is most similar to \(y\) is preferred.
    • Design Motivation: Same-modality comparison (image-to-image or text-to-text) is more reliable than cross-modal comparison. Cycle consistency elegantly transforms cross-modal alignment into a same-modality similarity problem.
  2. CyclePrefDB Dataset:

    • 866K preference pairs (398K I2T + 468K T2I), built from high-resolution images and dense captions in the 7.6K DCI dataset.
    • Average text length of 56 tokens, substantially longer than HPDv2 (19 tokens) and Pick-A-Pic (24 tokens), making the dataset particularly suitable for evaluating detailed captioning.
    • Data filtering: duplicates are removed; pairs with an insufficient reward gap (\(|r_i - r_j| < \tau_{sim}\)) or with a preferred reward that is too low are discarded.
    • Design Motivation: Multiple models spanning a wide capability range (from BLIP2 to InternVL2-40B) are used to generate candidates with a clear quality gradient, ensuring that preference pairs are informative.
  3. CycleReward Reward Model:

    • Architecture: BLIP backbone (ViT-L/16 + BERT-base + 5-layer MLP + scalar output head).
    • Three variants: I2T-only, T2I-only, and Combo (joint training with \(\mathcal{L} = \mathcal{L}_{text} + \lambda \mathcal{L}_{img}\)).
    • A key finding from training: a reward model distilled from cycle consistency outperforms using raw cycle consistency scores directly; even multi-seed-averaged raw scores cannot match CycleReward (+4% on DetailCaps), because the reward model learns richer alignment concepts beyond pixel-level reconstruction. (A minimal training-loss sketch follows this list.)
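A minimal PyTorch-style sketch of the Bradley-Terry preference loss and the Combo objective \(\mathcal{L} = \mathcal{L}_{text} + \lambda \mathcal{L}_{img}\) described above. The `reward_model(image, text)` interface is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def combo_loss(reward_model, i2t_batch, t2i_batch, lam: float = 1.0) -> torch.Tensor:
    """Joint objective L = L_text + lambda * L_img for the Combo variant.

    `reward_model(image, text)` is assumed to return a scalar alignment score per pair.
    `i2t_batch` holds (image, chosen_caption, rejected_caption);
    `t2i_batch` holds (caption, chosen_image, rejected_image).
    """
    img, cap_w, cap_l = i2t_batch
    l_text = bradley_terry_loss(reward_model(img, cap_w), reward_model(img, cap_l))

    txt, img_w, img_l = t2i_batch
    l_img = bradley_terry_loss(reward_model(img_w, txt), reward_model(img_l, txt))

    return l_text + lam * l_img
```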

DPO Application

  • I2T direction: CyclePrefDB-I2T is used to apply DPO to Qwen-VL-Chat → performance improves on detailed captioning and comprehensively across VL tasks including perception, reasoning, and hallucination, matching or surpassing VLFeedback (annotated by GPT-4V).
  • T2I direction: CyclePrefDB-T2I is used to apply Diffusion DPO to SD1.5 → performance on T2I-CompBench and PartiPrompts exceeds or matches Pick-A-Pic (851K human-annotated pairs). (The DPO objective both directions optimize is sketched below.)
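For reference, this is the standard DPO objective being optimized, written for the I2T case: \(x\) an image, \((y_w, y_l)\) a chosen/rejected caption pair from CyclePrefDB-I2T, \(\pi_\theta\) the policy VLM, and \(\pi_{\mathrm{ref}}\) the frozen base model:

\[
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\]

Diffusion DPO applies the same comparison, with the diffusion model's ELBO standing in for the sequence likelihoods.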

Key Experimental Results

Alignment Metric (Pairwise Accuracy)

| Method | DetailCaps-4870 | GenAI-Bench | Annotation Type |
|---|---|---|---|
| CLIPScore | 51.66 | 49.73 | None |
| VQAScore (11B) | 50.24 | 64.13 | None |
| HPSv2 | 54.34 | 56.13 | Human |
| PickScore | 51.01 | 57.05 | Human |
| ImageReward | 50.70 | 56.70 | Human |
| Raw Cycle Consistency | 56.46 | 52.52 | Self-supervised |
| CycleReward-Combo | 60.50 | 55.52 | Self-supervised |

CycleReward outperforms all human-annotated methods by 6%+ on detailed captioning, and surpasses VQAScore (11B, 24× larger) by 10.26%.

Best-of-N Sampling

  • CycleReward achieves the largest BoN gains on LLaVA-W and DeCapBench (substantially outperforming VQAScore and ImageReward); the selection procedure is sketched after this list.
  • On T2I-CompBench, performance is on par with ImageReward (human-annotated), and is even better on complex prompts.
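Best-of-N with a learned reward is simply an arg-max over sampled candidates. In the sketch below, `generate` and `reward` are placeholders for any sampler and the trained CycleReward scorer; the names and the default N are illustrative.

```python
def best_of_n(prompt, generate, reward, n: int = 16):
    """Sample n candidates and keep the one the reward model scores highest.

    `generate(prompt)` returns one candidate (a caption for I2T, an image for T2I);
    `reward(prompt, candidate)` returns a scalar alignment score (e.g. CycleReward).
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))
```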

DPO Results

I2T DPO (Qwen-VL-Chat):

| Model | DeCapBench | LLaVA-W | MMMU | MME-P | MMHal |
|---|---|---|---|---|---|
| Base | 26.47 | 61.67 | 73.10 | 1460.2 | 2.99 |
| DPO w/ VLFeedback (GPT-4V annotated) | 28.03 | 69.17 | 76.39 | 1551.5 | 3.32 |
| DPO w/ CyclePrefDB-I2T | 30.63 | 70.00 | 74.13 | 1485.7 | 3.11 |

Ablation Study

| Similarity Metric | DetailCaps | GenAI |
|---|---|---|
| DreamSim (ours) | 58.02 | 53.49 |
| LPIPS | 53.16 | 52.97 |
| CLIP | 57.90 | 53.30 |

Key Findings

  • Distilled reward model > raw cycle consistency score: CycleReward outperforms raw cycle consistency on all benchmarks (+4% on DetailCaps) because: (1) the reward model learns high-level alignment concepts beyond pixel-level reconstruction (e.g., the red bird vs. blue bird example in Figure 7); (2) the reward model is faster (single forward pass vs. T2I reconstruction); (3) it is differentiable.
  • Self-supervised annotation matches human annotation: IRDB-Cycle (ImageRewardDB data re-annotated with cycle consistency) achieves performance comparable to the original ImageReward, demonstrating that cycle consistency is an effective proxy for human preference.
  • Surprising DPO generalization: CyclePrefDB-I2T contains only captioning instructions, yet DPO training yields comprehensive improvements across perception, reasoning, and hallucination tasks, indicating that gains in detailed captioning ability positively transfer to general vision-language capabilities.
  • Stronger I2T decoder yields better results: Using InternVL2-26B as the I2T decoder for the text-to-image cycle significantly outperforms LLaVA-1.5-13B (+5.47% on DetailCaps), as a more capable LLM reconstructs captions more accurately.
  • DreamSim outperforms LPIPS and CLIP for image similarity: DreamSim is specifically trained to model human visual similarity judgments.

Highlights & Insights

  • Key insight: cycle consistency need not be computed online: Prior methods such as Image2Text2Image use cycle consistency directly as a metric (slow, non-differentiable). CycleReward instead uses it as a preference signal to train a reward model, achieving better performance, faster inference, and differentiability — a further demonstration of the principle that "offline distillation outperforms online computation."
  • Theoretical connection to PMI: The cycle consistency score is theoretically equivalent to \(\log p(x,y) + \text{PMI}(x,y)\), simultaneously measuring the likelihood of a pair and its pointwise mutual information. This provides a theoretical foundation for why cycle consistency is an effective alignment signal (a short derivation is sketched after this list).
  • "Training-free data annotation at scale" paradigm: Without human annotators or GPT-4V, millions of preference pairs can be automatically annotated using only the cycle consistency of open-source models. This has significant implications for reducing the data acquisition cost of RLHF/DPO.

Limitations & Future Work

  • The approach depends on the quality of T2I/I2T models — poor reconstruction quality can produce erroneous preference labels.
  • Stable Diffusion 3's 77-token limit constrains the evaluation of longer texts.
  • VQAScore (11B) remains stronger on text-to-image generation; CycleReward is better suited for captioning evaluation.
  • The method inherits biases from the underlying models (e.g., DreamSim's foreground bias and SD3's generation biases).
  • Cycle consistency for other modalities such as video and audio has not been explored.
  • vs. ImageReward: ImageReward uses 137K human preferences; CycleReward uses 866K automatic preferences. CycleReward is substantially stronger on detailed captioning (+9.8%), as human preference data is biased toward short text.
  • vs. VQAScore: VQAScore uses a model 24× larger (11B vs. 477M) and performs better on T2I tasks, but CycleReward surpasses it by 10.26% on detailed captioning. The two are complementary — VQAScore excels at compositional relations, CycleReward at detailed description.
  • vs. VisVM (covered in a previous batch of notes): VisVM also uses CLIP as a PRM to construct self-supervision for VLM improvement. CycleReward goes further, employing cross-modal cycle consistency rather than unimodal similarity and applying it to DPO training. Both works demonstrate the important direction of "annotation-free self-improvement of VLMs."

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Using cycle consistency as a large-scale preference signal is a novel and elegant idea, with a clear theoretical connection to PMI.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 5 alignment benchmarks + BoN + DPO (both I2T and T2I) + ablations (metric / decoder / data scale / filtering) + Winoground + human preference agreement rate — extremely comprehensive.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear and elegant; the overview in Figure 1 and the failure analysis in Figure 7 are both well executed.
  • Value: ⭐⭐⭐⭐⭐ The self-supervised preference learning paradigm as a replacement for human annotation has major implications for reducing the cost of VLM alignment.