
Enhancing Spatial Understanding in Image Generation via Reward Modeling

Conference: CVPR 2026
arXiv: 2602.24233
Code: None
Area: Text-to-Image Generation / Reinforcement Learning
Keywords: Spatial Understanding, Reward Model, GRPO, Diffusion Models, FLUX

TL;DR

The authors construct the SpatialReward-Dataset, an 80K adversarial preference dataset, to train SpatialScore—a reward model specifically for evaluating spatial relationship accuracy (outperforming GPT-5). By integrating a top-k filtering strategy with GRPO online RL, they significantly enhance the spatial generation capabilities of FLUX.1-dev.

Background & Motivation

Despite significant progress in visual quality for text-to-image generation, accurately depicting complex spatial relationships remains difficult, particularly in long-prompt scenarios involving multiple objects. Enhancing spatial understanding via Reinforcement Learning (RL) is a natural direction, but the core bottleneck is the lack of reliable reward models:

Human Preference Reward Models (HPSv2, PickScore, etc.): Focus on overall aesthetics and text-image alignment; unable to accurately evaluate complex spatial relations.

VQA Alignment Models (VQAScore, etc.): Similarly perform poorly on multi-object spatial reasoning.

Proprietary Large VLMs (GPT-5, Gemini): High cost; unsuitable for frequent RL queries.

Open-source VLMs (Qwen2.5-VL 72B): Exhibit severe hallucinations; spatial reasoning is unreliable.

Rule-based GenEval: Only covers simple two-object template prompts; fails to generalize to long-prompt scenarios, and object detectors are sensitive to occlusions.

Method

Overall Architecture

A three-stage pipeline: (1) Construction of the SpatialReward-Dataset preference pair dataset → (2) Training the SpatialScore reward model → (3) Optimizing FLUX.1-dev via GRPO online RL using SpatialScore as the reward signal.

Key Designs

  1. SpatialReward-Dataset (80K Adversarial Preference Pairs):

    • Use GPT-5 to generate initial prompts containing complex multi-object spatial relationships.
    • GPT-5 performs spatial relationship perturbations on the original prompts (e.g., left → right, swapping relative positions) while keeping other relations unchanged.
    • A "perfect image" is generated for the original prompt, and a "perturbed image" for the perturbed prompt.
    • Images are generated using strong text-to-image models such as Qwen-Image, HunyuanImage-2.1, and Seedream-4.0.
    • Samples that do not satisfy spatial constraints are filtered out via manual review to ensure high data quality.
  2. SpatialScore Reward Model:

    • Backbone: Qwen2.5-VL-7B + LoRA fine-tuning.
    • Models the reward score as a Gaussian distribution \(s \sim \mathcal{N}(\mu, \sigma^2)\) rather than a deterministic value for better robustness.
    • A special <reward> token is appended at the end of the prompt; its final-layer embedding is mapped to \(\mu, \sigma\) via an MLP.
    • The preference loss follows the Bradley-Terry model (a minimal reward-head sketch is given after this list):

    \(\mathcal{L}_{\text{Reward}}(\phi) = \mathbb{E}_{c, y_w, y_l}\left[-\log \sigma\big(R_\phi(H_\phi(y_w, c)) - R_\phi(H_\phi(y_l, c))\big)\right]\)
    where \(\sigma(\cdot)\) is the sigmoid, \(c\) is the prompt, and \(y_w, y_l\) are the preferred and dispreferred images.

  3. Top-k Filtered GRPO: Addresses the advantage bias issue caused by prompts of varying difficulty:

    • Simple prompts generate many high-reward samples → some high-quality samples receive negative advantage.
    • Difficult prompts generally yield low rewards → also leading to advantage bias.
    • For each group of \(G\) samples, rewards are ranked, and only the top-\(k\) and bottom-\(k\) are selected to compute advantage values for training (see the filtering sketch after this list).
    • A choice of \(k=6\) (group size \(G=24\)) achieves the best trade-off between diversity and balance.
    • Also halves the number of function evaluations (NFE) per training step, from \(24 \times 6\) to \(12 \times 6\).
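
To make the reward-model design concrete, here is a minimal PyTorch sketch of the Gaussian reward head and Bradley-Terry loss described above. It assumes the Qwen2.5-VL backbone already provides the final-layer embedding of the <reward> token; the class and function names are my own, and the paper's exact head architecture may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianRewardHead(nn.Module):
    """Maps the final-layer embedding of the <reward> token to (mu, sigma)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 2),  # predicts [mu, raw_sigma]
        )

    def forward(self, reward_token_emb: torch.Tensor):
        mu, raw_sigma = self.mlp(reward_token_emb).chunk(2, dim=-1)
        sigma = F.softplus(raw_sigma) + 1e-4  # keep sigma strictly positive
        return mu.squeeze(-1), sigma.squeeze(-1)

def bradley_terry_loss(mu_win: torch.Tensor, mu_lose: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r(y_w, c) - r(y_l, c)), using the predicted mean as the score."""
    return -F.logsigmoid(mu_win - mu_lose).mean()
```

At RL time, the predicted mean \(\mu\) would serve as the scalar reward for each generated image.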

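Below is a small sketch of the top-k/bottom-k filtering step for one prompt group. The function name and the group-normalization details (reward minus subset mean, divided by subset standard deviation) are assumptions for illustration, not the paper's code.

```python
import torch

def topk_filtered_advantages(rewards: torch.Tensor, k: int = 6):
    """For one prompt group of G rewards, keep only the top-k and bottom-k
    samples and compute group-normalized advantages over that subset."""
    order = torch.argsort(rewards, descending=True)
    kept = torch.cat([order[:k], order[-k:]])               # top-k and bottom-k indices
    r_kept = rewards[kept]
    adv = (r_kept - r_kept.mean()) / (r_kept.std() + 1e-6)  # GRPO-style normalization
    return kept, adv

# Example: group size G = 24 with k = 6 keeps 12 of 24 trajectories per prompt.
kept, adv = topk_filtered_advantages(torch.randn(24), k=6)
```

Only the retained \(2k\) trajectories then enter the policy-gradient update, which is where the NFE reduction comes from.
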
Loss & Training

Reward Model Training: Qwen2.5-VL-7B + LoRA, learning rate \(2 \times 10^{-6}\), batch size 32; completed in 1 day on 8×H20 GPUs.

RL Training: Base model FLUX.1-dev + LoRA (rank = 32); GRPO with learning rate \(3 \times 10^{-4}\), clip range \(1 \times 10^{-4}\), and KL penalty 0.01; policy exploration obtained by converting the deterministic ODE sampler to a stochastic SDE (Euler-Maruyama discretization); trained on 32×H20 GPUs.
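
For reference, a generic Euler-Maruyama update is sketched below. The drift and diffusion functions are left abstract: the specific SDE that the paper (following the Flow-GRPO line of work) derives from the flow-matching ODE is not reproduced here, so this only illustrates the discretization scheme itself.

```python
import torch

def euler_maruyama_step(x, t, dt, drift_fn, diffusion_fn):
    """One Euler-Maruyama step: x_{t+dt} = x_t + f(x_t, t) * dt + g(t) * sqrt(dt) * eps.

    The injected Gaussian noise turns the deterministic sampler into a
    stochastic policy with tractable per-step log-probabilities for GRPO."""
    eps = torch.randn_like(x)
    return x + drift_fn(x, t) * dt + diffusion_fn(t) * (dt ** 0.5) * eps
```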

Key Experimental Results

Main Results

Method        SpatialScore   DPG-Bench Overall   TIIF-short BR   TIIF-long BR   UniBench-short Lay-2D   UniBench-long Lay-2D
FLUX.1-dev    2.18           82.91               0.769           0.758          0.766                   0.819
Flow-GRPO*    3.01           57.02               0.851           0.577          0.726                   0.445
Ours          7.81           85.03               0.875           0.845          0.875                   0.891

Internal evaluation of SpatialScore increased from 2.18 to 7.81 (+258%), and the overall DPG-Bench score approached GPT-Image-1 (85.03 vs 85.15).

Reward Model Evaluation

Model               Overall Accuracy
PickScore           0.509
HPSv3               0.605
Qwen2.5-VL-72B      0.764
GPT-5               0.890
Gemini-2.5 Pro      0.951
SpatialScore (7B)   0.958

SpatialScore with 7B parameters outperforms GPT-5 and Gemini-2.5 Pro in spatial understanding evaluation.

Ablation Study

Configuration   SpatialScore   DPG-Bench Rel   UniBench Lay-3D (long)   NFE/step
w/o top-k       7.73           0.919           0.793                    24×6
top-k (k=4)     7.71           0.916           0.796                    8×6
top-k (k=6)     7.81           0.932           0.801                    12×6

Key Findings

  • Flow-GRPO trained with the GenEval reward improves short-prompt performance but severely degrades on long prompts, even losing the base model's long-prompt following ability.
  • Scaling SpatialScore from 3B to 7B increased accuracy from 89.1% to 95.8%, demonstrating a significant scaling effect.
  • Improvement in spatial understanding shows positive transfer; all five dimensions of DPG-Bench improved.

Highlights & Insights

  • Adversarial Data Construction: Building preference pairs by perturbing only the spatial relations keeps non-spatial factors matched, so the reward signal isolates spatial accuracy.
  • 7B Model Outperforms Proprietary Models: Small models trained for specific tasks can surpass general large-scale models in specialized domains.
  • Simple and Effective Top-k Filtering: Resolves advantage bias in GRPO caused by uneven prompt difficulty while reducing computation by 2x.
  • Converting the deterministic ODE sampler into an SDE to enable policy exploration has become a relatively mature technique.

Limitations & Future Work

  • Focused solely on spatial relationships; does not cover other compositional generation dimensions (e.g., attribute binding, numerical accuracy).
  • SpatialReward-Dataset depends on strong generative models (Qwen-Image, etc.); evaluation of weaker models might be biased.
  • High computational cost for RL training (32×H20 GPUs).
  • Impact of improved spatial understanding on aesthetic quality was not discussed.
  • Key difference from Flow-GRPO: Specialized reward model vs. rule-based GenEval reward; the latter is unreliable in complex scenarios.
  • The RLHF recipe that succeeded in LLMs is being systematically transplanted to image generation; this work is a representative example for the spatial dimension.
  • Insight: Similar specialized reward models and RL frameworks could be built for other dimensions (e.g., attribute binding, action consistency).

Rating

  • Novelty: ⭐⭐⭐⭐ First reward model specifically for spatial understanding; top-k filtering strategy is valuable.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across multiple benchmarks, detailed ablations, and comparisons with various baselines and proprietary models.
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation analysis and experimental presentation with rich visualizations.
  • Value: ⭐⭐⭐⭐ Successful validation of the reward model + RL paradigm for improving generation quality in the spatial dimension; holds methodological significance.