
Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences

Conference: ICCV 2025 · arXiv: 2506.02095 · Code: https://cyclereward.github.io/ · Area: Image Generation · Keywords: cycle consistency, reward model, image-text alignment, preference learning, DPO, self-supervised

TL;DR

This paper proposes CycleReward, which leverages cycle consistency as a self-supervised signal to replace human preference annotations — captions are reconstructed into images via a T2I model and ranked by visual similarity, yielding the 866K preference-pair dataset CyclePrefDB. The trained reward model outperforms HPSv2/PickScore/ImageReward by 6%+ on detailed captioning, and DPO training with it improves VLM performance across multiple vision-language tasks, all without any human annotation.

Background & Motivation

Background: Image-text alignment measurement is a central challenge in multimodal learning. Existing reward models (ImageReward, HPSv2, PickScore) rely on large-scale human preference annotations, which are costly and difficult to scale. GPT-4V annotation is an alternative but is expensive, closed-source, and rate-limited.

Limitations of Prior Work: (1) Human preference data collection is costly and hard to scale; (2) existing preference data mainly targets short texts (~20 tokens), making it ineffective for evaluating long descriptive captions; (3) embedding-based methods such as CLIP are insensitive to long text.

Key Challenge: Long descriptive captions are increasingly important (e.g., detailed descriptions generated by ShareGPT4V and LLaVA), yet effective alignment metrics for evaluating them are lacking. Direct cross-modal comparison (image vs. text) is difficult, whereas comparing images to images in the same modality is considerably more tractable.

Goal: To construct a preference dataset and reward model without human annotation via cycle consistency, with a particular focus on alignment evaluation for long descriptive captions.

Key Insight: The classical cycle consistency idea — \(x \xrightarrow{F} y \xrightarrow{G} x'\) — states that the more accurate the caption \(y\), the closer the reconstructed image \(x' = G(y)\) is to the original \(x\). This similarity is used as a preference signal rather than directly as a metric.

Core Idea: Rank caption/image candidates using cycle consistency scores to construct a preference dataset for training a reward model, thereby enabling image-text alignment learning without human annotation.
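Concretely, the labeling rule can be stated as follows (notation anticipates the Method section; \(G\) is a text-to-image model and \(d_{\text{img}}\) an image similarity such as DreamSim):

\[
y_i \succ y_j \quad \Longleftrightarrow \quad d_{\text{img}}\big(x,\, G(y_i)\big) > d_{\text{img}}\big(x,\, G(y_j)\big),
\]

with the symmetric rule, using an image-to-text model \(F\) and a text similarity \(d_{\text{text}}\) (SBERT), labeling image preferences in the text-to-image direction.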

Method

Overall Architecture

(1) Cycle consistency scoring: Image → Captions (generated by multiple models) → T2I reconstruction → DreamSim similarity; Text → Images (generated by multiple models) → I2T reconstruction → SBERT similarity. (2) Preference ranking: the candidate with higher similarity is labeled as preferred. (3) Construction of the 866K preference-pair dataset CyclePrefDB. (4) Training of the reward model CycleReward (BLIP backbone + Bradley-Terry loss).
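The pipeline can be sketched compactly for the image-to-text direction. In the sketch below, `captioners`, `t2i_reconstruct`, and `image_similarity` are placeholders for the paper's components (the 11 I2T models, Stable Diffusion 3, DreamSim); the threshold values are illustrative defaults, not the paper's, and the function names are not the authors' released code.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Candidate:
    caption: str
    score: float  # cycle-consistency score d_img(x, G(y))

def score_captions(image, captioners, t2i_reconstruct, image_similarity):
    """Caption the image with every model, reconstruct each caption, and score it."""
    candidates = []
    for captioner in captioners:
        caption = captioner(image)                     # y_i = F_i(x)
        reconstruction = t2i_reconstruct(caption)      # G(y_i), e.g. Stable Diffusion 3
        sim = image_similarity(image, reconstruction)  # d_img(x, G(y_i)), e.g. DreamSim
        candidates.append(Candidate(caption, sim))
    return candidates

def make_preference_pairs(candidates, tau_sim=0.05, min_chosen=0.3):
    """Turn scored candidates into (chosen, rejected) pairs, dropping ambiguous ones."""
    pairs = []
    for a, b in combinations(candidates, 2):
        chosen, rejected = (a, b) if a.score > b.score else (b, a)
        if chosen.score - rejected.score < tau_sim:   # reward gap too small: skip
            continue
        if chosen.score < min_chosen:                 # even the preferred sample is poor: skip
            continue
        pairs.append((chosen.caption, rejected.caption))
    return pairs
```

The text-to-image direction is symmetric: swap the roles of captions and images, reconstruct with an I2T model, and score with SBERT instead of DreamSim.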

Key Designs

  1. Cycle Consistency as a Preference Signal:

    • Image-to-Text direction: Given an image \(x\), 11 I2T models (BLIP2, LLaVA series, InternVL2 series) generate captions \(\{y_i\}\) of varying quality. Each caption is reconstructed back into an image \(G(y_i)\) by Stable Diffusion 3, and DreamSim computes \(d_{img}(x, G(y_i))\); the caption whose reconstruction is most similar to \(x\) is preferred.
    • Text-to-Image direction: Given a caption \(y\), 4 T2I models (SD1.5, SDXL, SD3, FLUX) × 3 seeds generate candidate images \(\{x_i\}\). LLaVA-1.5-13B reverse-captions each image to \(F(x_i)\), and SBERT computes \(d_{text}(y, F(x_i))\); the image whose reverse caption is most similar to \(y\) is preferred.
    • Design Motivation: Same-modality comparison (image-to-image or text-to-text) is more reliable than cross-modal comparison. Cycle consistency elegantly transforms cross-modal alignment into a same-modality similarity problem.
  2. CyclePrefDB Dataset:

    • 866K preference pairs (398K I2T + 468K T2I), built from high-resolution images and dense captions in the 7.6K DCI dataset.
    • Average text length of 56 tokens, substantially longer than HPDv2 (19 tokens) and Pick-A-Pic (24 tokens), making the dataset particularly suitable for evaluating detailed captioning.
    • Data filtering: duplicates are removed; pairs with an insufficient reward gap (\(|r_i - r_j| < \tau_{sim}\)) or with a preferred reward that is too low are discarded.
    • Design Motivation: Multiple models spanning a wide capability range (from BLIP2 to InternVL2-40B) are used to generate candidates with a clear quality gradient, ensuring that preference pairs are informative.
  3. CycleReward Reward Model:

    • Architecture: BLIP backbone (ViT-L/16 + BERT-base + 5-layer MLP + scalar output head).
    • Three variants: I2T-only, T2I-only, and Combo (joint training with \(\mathcal{L} = \mathcal{L}_{text} + \lambda \mathcal{L}_{img}\)).
    • A key finding from training: a reward model distilled from cycle consistency outperforms using raw cycle consistency scores directly; even multi-seed-averaged raw scores cannot match CycleReward (+4% on DetailCaps), because the reward model learns richer alignment concepts beyond pixel-level reconstruction. (A minimal training-loss sketch follows this list.)
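A minimal PyTorch-style sketch of the Bradley-Terry preference loss and the Combo objective \(\mathcal{L} = \mathcal{L}_{text} + \lambda \mathcal{L}_{img}\) described above. The `reward_model(image, text)` interface is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def combo_loss(reward_model, i2t_batch, t2i_batch, lam: float = 1.0) -> torch.Tensor:
    """Joint objective L = L_text + lambda * L_img for the Combo variant.

    `reward_model(image, text)` is assumed to return a scalar alignment score per pair.
    `i2t_batch` holds (image, chosen_caption, rejected_caption);
    `t2i_batch` holds (caption, chosen_image, rejected_image).
    """
    img, cap_w, cap_l = i2t_batch
    l_text = bradley_terry_loss(reward_model(img, cap_w), reward_model(img, cap_l))

    txt, img_w, img_l = t2i_batch
    l_img = bradley_terry_loss(reward_model(img_w, txt), reward_model(img_l, txt))

    return l_text + lam * l_img
```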

DPO Application

  • I2T direction: CyclePrefDB-I2T is used to apply DPO to Qwen-VL-Chat → performance improves on detailed captioning and comprehensively across VL tasks including perception, reasoning, and hallucination, matching or surpassing VLFeedback (annotated by GPT-4V).
  • T2I direction: CyclePrefDB-T2I is used to apply Diffusion DPO to SD1.5 → performance on T2I-CompBench and PartiPrompts exceeds or matches Pick-A-Pic (851K human-annotated pairs). (The DPO objective both directions optimize is sketched below.)
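For reference, this is the standard DPO objective being optimized, written for the I2T case: \(x\) an image, \((y_w, y_l)\) a chosen/rejected caption pair from CyclePrefDB-I2T, \(\pi_\theta\) the policy VLM, and \(\pi_{\mathrm{ref}}\) the frozen base model:

\[
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\]

Diffusion DPO applies the same comparison, with the diffusion model's ELBO standing in for the sequence likelihoods.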

Key Experimental Results

Alignment Metric (Pairwise Accuracy)

| Method | DetailCaps-4870 | GenAI-Bench | Annotation Type |
|---|---|---|---|
| CLIPScore | 51.66 | 49.73 | None |
| VQAScore (11B) | 50.24 | 64.13 | None |
| HPSv2 | 54.34 | 56.13 | Human |
| PickScore | 51.01 | 57.05 | Human |
| ImageReward | 50.70 | 56.70 | Human |
| Raw Cycle Consistency | 56.46 | 52.52 | Self-supervised |
| CycleReward-Combo | 60.50 | 55.52 | Self-supervised |

CycleReward outperforms all human-annotated methods by 6%+ on detailed captioning, and surpasses VQAScore (11B, 24× larger) by 10.26%.

Best-of-N Sampling

  • CycleReward achieves the largest BoN gains on LLaVA-W and DeCapBench (substantially outperforming VQAScore and ImageReward); the selection procedure is sketched after this list.
  • On T2I-CompBench, performance is on par with ImageReward (human-annotated), and is even better on complex prompts.
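Best-of-N with a learned reward is simply an arg-max over sampled candidates. In the sketch below, `generate` and `reward` are placeholders for any sampler and the trained CycleReward scorer; the names and the default N are illustrative.

```python
def best_of_n(prompt, generate, reward, n: int = 16):
    """Sample n candidates and keep the one the reward model scores highest.

    `generate(prompt)` returns one candidate (a caption for I2T, an image for T2I);
    `reward(prompt, candidate)` returns a scalar alignment score (e.g. CycleReward).
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))
```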

DPO Results

I2T DPO (Qwen-VL-Chat):

| Model | DeCapBench | LLaVA-W | MMMU | MME-P | MMHal |
|---|---|---|---|---|---|
| Base | 26.47 | 61.67 | 73.10 | 1460.2 | 2.99 |
| DPO w/ VLFeedback (GPT-4V annotated) | 28.03 | 69.17 | 76.39 | 1551.5 | 3.32 |
| DPO w/ CyclePrefDB-I2T | 30.63 | 70.00 | 74.13 | 1485.7 | 3.11 |

Ablation Study

| Similarity Metric | DetailCaps | GenAI |
|---|---|---|
| DreamSim (ours) | 58.02 | 53.49 |
| LPIPS | 53.16 | 52.97 |
| CLIP | 57.90 | 53.30 |

Key Findings

  • Distilled reward model > raw cycle consistency score: CycleReward outperforms raw cycle consistency on all benchmarks (+4% on DetailCaps) because: (1) the reward model learns high-level alignment concepts beyond pixel-level reconstruction (e.g., the red bird vs. blue bird example in Figure 7); (2) the reward model is faster (single forward pass vs. T2I reconstruction); (3) it is differentiable.
  • Self-supervised annotation matches human annotation: IRDB-Cycle (ImageRewardDB data re-annotated with cycle consistency) achieves performance comparable to the original ImageReward, demonstrating that cycle consistency is an effective proxy for human preference.
  • Surprising DPO generalization: CyclePrefDB-I2T contains only captioning instructions, yet DPO training yields comprehensive improvements across perception, reasoning, and hallucination tasks, indicating that gains in detailed captioning ability positively transfer to general vision-language capabilities.
  • Stronger I2T decoder yields better results: Using InternVL2-26B as the I2T decoder for the text-to-image cycle significantly outperforms LLaVA-1.5-13B (+5.47% on DetailCaps), as a more capable LLM reconstructs captions more accurately.
  • DreamSim outperforms LPIPS and CLIP for image similarity: DreamSim is specifically trained to model human visual similarity judgments.

Highlights & Insights

  • Key insight: cycle consistency need not be computed online: Prior methods such as Image2Text2Image use cycle consistency directly as a metric (slow, non-differentiable). CycleReward instead uses it as a preference signal to train a reward model, achieving better performance, faster inference, and differentiability — a further demonstration of the principle that "offline distillation outperforms online computation."
  • Theoretical connection to PMI: The cycle consistency score is theoretically equivalent to \(\log p(x,y) + \text{PMI}(x,y)\), simultaneously measuring the likelihood of a pair and its pointwise mutual information. This provides a theoretical foundation for why cycle consistency is an effective alignment signal (a short derivation is sketched after this list).
  • "Training-free data annotation at scale" paradigm: Without human annotators or GPT-4V, millions of preference pairs can be automatically annotated using only the cycle consistency of open-source models. This has significant implications for reducing the data acquisition cost of RLHF/DPO.

Limitations & Future Work

  • The approach depends on the quality of T2I/I2T models — poor reconstruction quality can produce erroneous preference labels.
  • Stable Diffusion 3's 77-token limit constrains the evaluation of longer texts.
  • VQAScore (11B) remains stronger on text-to-image generation; CycleReward is better suited for captioning evaluation.
  • The method inherits biases from the underlying models (e.g., DreamSim's foreground bias and SD3's generation biases).
  • Cycle consistency for other modalities such as video and audio has not been explored.
  • vs. ImageReward: ImageReward uses 137K human preferences; CycleReward uses 866K automatic preferences. CycleReward is substantially stronger on detailed captioning (+9.8%), as human preference data is biased toward short text.
  • vs. VQAScore: VQAScore uses a model 24× larger (11B vs. 477M) and performs better on T2I tasks, but CycleReward surpasses it by 10.26% on detailed captioning. The two are complementary — VQAScore excels at compositional relations, CycleReward at detailed description.
  • vs. VisVM (covered in a previous batch of notes): VisVM also uses CLIP as a PRM to construct self-supervision for VLM improvement. CycleReward goes further, employing cross-modal cycle consistency rather than unimodal similarity and applying it to DPO training. Both works demonstrate the important direction of "annotation-free self-improvement of VLMs."

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Using cycle consistency as a large-scale preference signal is a novel and elegant idea, with a clear theoretical connection to PMI.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 5 alignment benchmarks + BoN + DPO (both I2T and T2I) + ablations (metric / decoder / data scale / filtering) + Winoground + human preference agreement rate — extremely comprehensive.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear and elegant; the overview in Figure 1 and the failure analysis in Figure 7 are both well executed.
  • Value: ⭐⭐⭐⭐⭐ The self-supervised preference learning paradigm as a replacement for human annotation has major implications for reducing the cost of VLM alignment.