Enhancing Spatial Understanding in Image Generation via Reward Modeling¶

Conference: CVPR 2026
arXiv: 2602.24233
Code: None
Area: Text-to-Image Generation / Reinforcement Learning
Keywords: Spatial Understanding, Reward Model, GRPO, Diffusion Models, FLUX

TL;DR¶

The authors construct SpatialReward-Dataset, an adversarial preference dataset of 80K pairs, and use it to train SpatialScore, a reward model specialized for judging spatial-relationship accuracy that outperforms GPT-5 on this task. Combining GRPO online RL with a top-k filtering strategy, they substantially improve the spatial generation ability of FLUX.1-dev (SpatialScore 2.18 → 7.81).

Background & Motivation¶

Despite significant progress in visual quality for text-to-image generation, accurately depicting complex spatial relationships remains difficult, particularly in long-prompt scenarios involving multiple objects. Enhancing spatial understanding via Reinforcement Learning (RL) is a natural direction, but the core bottleneck is the lack of reliable reward models:

Human Preference Reward Models (HPSv2, PickScore, etc.): Focus on overall aesthetics and text-image alignment; unable to accurately evaluate complex spatial relations.

VQA Alignment Models (VQAScore, etc.): Similarly perform poorly on multi-object spatial reasoning.

Proprietary Large VLMs (GPT-5, Gemini): High cost; unsuitable for frequent RL queries.

Open-source VLMs (Qwen2.5-VL 72B): Exhibit severe hallucinations; spatial reasoning is unreliable.

Rule-based GenEval: Only covers simple two-object template prompts; fails to generalize to long-prompt scenarios, and object detectors are sensitive to occlusions.

Method¶

Overall Architecture¶

A three-stage pipeline: (1) Construction of the SpatialReward-Dataset preference pair dataset → (2) Training the SpatialScore reward model → (3) Optimizing FLUX.1-dev via GRPO online RL using SpatialScore as the reward signal.

Key Designs¶

SpatialReward-Dataset (80K Adversarial Preference Pairs):
- Use GPT-5 to generate initial prompts containing complex multi-object spatial relationships.
- GPT-5 performs spatial relationship perturbations on the original prompts (e.g., left → right, swapping relative positions) while keeping other relations unchanged.
- A "perfect image" is generated for the original prompt, and a "perturbed image" for the perturbed prompt.
- Images are generated using strong text-to-image models such as Qwen-Image, HunyuanImage-2.1, and Seedream-4.0.
- Samples that do not satisfy spatial constraints are filtered out via manual review to ensure high data quality.
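
The perturbation step above can be illustrated with a small sketch. The notes state that GPT-5 performs the perturbation; the rule-based swap below is only a stand-in to show the idea of flipping one relation while leaving the rest of the prompt untouched. The phrase table `OPPOSITES` and the function names are hypothetical.

```python
import re

# Hypothetical phrase table: each spatial phrase maps to its opposite.
# The paper uses GPT-5 for this step; this rule-based swap is only a stand-in.
OPPOSITES = {
    "left of": "right of",
    "right of": "left of",
    "above": "below",
    "below": "above",
    "in front of": "behind",
    "behind": "in front of",
}

# Longer phrases come first so they win when alternatives start at the same position.
_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, OPPOSITES), key=len, reverse=True)) + r")\b"
)


def perturb_first_relation(prompt: str) -> str:
    """Flip only the first spatial relation found; all other text stays unchanged."""
    return _PATTERN.sub(lambda m: OPPOSITES[m.group(1)], prompt, count=1)


def make_pair(prompt: str) -> dict:
    """Build a record holding the original prompt and its perturbed variant."""
    perturbed = perturb_first_relation(prompt)
    if perturbed == prompt:
        raise ValueError("no spatial relation found to perturb")
    return {"prompt": prompt, "perturbed_prompt": perturbed}
```

An image generated from `prompt` would then serve as the preferred sample and an image generated from `perturbed_prompt` as the rejected sample, both judged against the original prompt.
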
SpatialScore Reward Model:
- Backbone: Qwen2.5-VL-7B + LoRA fine-tuning.
- Models the reward score as a Gaussian distribution \(s \sim \mathcal{N}(\mu, \sigma^2)\) rather than a deterministic value for better robustness.
- Inserts a special <reward> token at the end of the prompt; the final layer embedding is mapped to \(\mu, \sigma\) via an MLP.
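- Trained with a Bradley-Terry preference loss, where \(y_w\) and \(y_l\) are the preferred and rejected images for prompt \(c\), and \(\sigma(\cdot)\) in this formula denotes the logistic sigmoid rather than the standard deviation above:
\(\mathcal{L}_{\text{Reward}}(\phi) = \mathbb{E}_{c, y_w, y_l}[-\log \sigma(R_\phi(H_\phi(y_w, c)) - R_\phi(H_\phi(y_l, c)))]\)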
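
A minimal sketch of the preference loss, in plain Python, may help make the formula concrete. `bt_loss` is the Bradley-Terry term for a single pair. The notes do not say how the Gaussian score is combined with this loss, so `sampled_bt_loss` assumes one plausible option: draw scores with the reparameterization \(s = \mu + \sigma\,\epsilon\) and average the loss over the draws. All function names are hypothetical.

```python
import math
import random


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)), computed as -softplus(-x)."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))


def bt_loss(r_w: float, r_l: float) -> float:
    """Bradley-Terry loss for one pair: -log sigmoid(r_w - r_l)."""
    return -log_sigmoid(r_w - r_l)


def sampled_bt_loss(mu_w, sigma_w, mu_l, sigma_l, n_samples=64, rng=None):
    """Assumed Monte Carlo variant: sample s = mu + sigma * eps for each image
    and average the Bradley-Terry loss over the samples."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        s_w = mu_w + sigma_w * rng.gauss(0.0, 1.0)
        s_l = mu_l + sigma_l * rng.gauss(0.0, 1.0)
        total += bt_loss(s_w, s_l)
    return total / n_samples
```

The loss is small when the preferred image scores well above the rejected one and equals \(\log 2\) when the two scores tie.
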
Top-k Filtered GRPO: Addresses the advantage bias caused by prompts of varying difficulty, which arises because GRPO normalizes each reward against the statistics of its own group:
- Easy prompts yield many high-reward samples → some high-quality samples fall below the group mean and receive negative advantage.
- Hard prompts yield mostly low rewards → some poor samples sit above the group mean and receive positive advantage.
- For each group of \(G\) samples, rewards are ranked, and only the top-\(k\) and bottom-\(k\) samples are kept to compute advantages for training.
- \(k=6\) with group size \(G=24\) achieves the best trade-off between diversity and balance.
- Halves the number of function evaluations (NFE) per step: from \(24 \times 6\) to \(12 \times 6\).
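
The filtering step can be sketched in a few lines of plain Python. This is one reading of the description above, not the authors' code: it assumes the advantage is the reward standardized by the mean and standard deviation of the 2k kept samples. The function name and the `eps` stabilizer are hypothetical.

```python
import statistics


def topk_filtered_advantages(rewards, k, eps=1e-6):
    """Return {sample_index: advantage} for the k worst and k best samples of one group.

    Assumption: advantages are standardized over the kept 2k samples only.
    """
    if k < 1 or 2 * k > len(rewards):
        raise ValueError("need 1 <= k and 2*k <= group size")
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    selected = order[:k] + order[-k:]          # bottom-k, then top-k
    kept = [rewards[i] for i in selected]
    mean = statistics.fmean(kept)
    std = statistics.pstdev(kept)
    return {i: (rewards[i] - mean) / (std + eps) for i in selected}
```

With \(G=24\) and \(k=6\), only 12 of the 24 samples enter the policy update, which matches the halving from \(24 \times 6\) to \(12 \times 6\) reported above.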

Loss & Training¶

Reward Model Training:
- Qwen2.5-VL-7B + LoRA, learning rate \(2 \times 10^{-6}\), batch size 32.
- Completed in 1 day on 8×H20 GPUs.

RL Training:
- Base model: FLUX.1-dev + LoRA (rank=32).
- GRPO: learning rate \(3 \times 10^{-4}\), clip range \(1 \times 10^{-4}\), KL penalty 0.01.
- Policy exploration is achieved by converting the deterministic ODE sampler into a stochastic SDE sampler (Euler-Maruyama discretization).
- Trained on 32×H20 GPUs.
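
To illustrate the Euler-Maruyama discretization mentioned above, here is a generic sketch for an SDE of the form \(dx = f(x, t)\,dt + g(t)\,dW\). It does not reproduce the specific drift correction or noise schedule used for FLUX.1-dev, which these notes do not give; `toy_drift`, the constant diffusion, and the time grid from 0 to 1 are assumptions made only so the example runs.

```python
import math
import random


def euler_maruyama_step(x, t, dt, drift, diffusion, rng):
    """One Euler-Maruyama step of dx = drift(x, t) dt + diffusion(t) dW."""
    f = drift(x, t)
    scale = diffusion(t) * math.sqrt(abs(dt))   # std of the Brownian increment
    return [xi + fi * dt + scale * rng.gauss(0.0, 1.0) for xi, fi in zip(x, f)]


def sample(x0, drift, diffusion, n_steps, seed=0):
    """Integrate from t=0 to t=1 on a uniform grid (assumed time convention)."""
    rng = random.Random(seed)
    x, dt = list(x0), 1.0 / n_steps
    for step in range(n_steps):
        x = euler_maruyama_step(x, step * dt, dt, drift, diffusion, rng)
    return x


def toy_drift(x, t):
    """Hypothetical drift standing in for the model's velocity prediction."""
    return [-xi for xi in x]
```

Setting the diffusion to zero recovers the deterministic Euler ODE sampler, while a positive diffusion makes repeated rollouts from the same start differ, which is what gives the policy something to explore.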

Key Experimental Results¶

Main Results¶

Method	SpatialScore	DPG-Bench Overall	TIIF-short BR	TIIF-long BR	UniBench-short Lay-2D	UniBench-long Lay-2D
FLUX.1-dev	2.18	82.91	0.769	0.758	0.766	0.819
Flow-GRPO*	3.01	57.02	0.851	0.577	0.726	0.445
Ours	7.81	85.03	0.875	0.845	0.875	0.891

On the in-domain SpatialScore evaluation, the score rises from 2.18 to 7.81 (+258%), and the DPG-Bench Overall score approaches that of GPT-Image-1 (85.03 vs. 85.15).
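
The relative changes can be recomputed directly from the table. The snippet below copies the three rows verbatim and reports each method's percentage change against FLUX.1-dev; the dictionary and function names are hypothetical.

```python
METRICS = [
    "SpatialScore", "DPG-Bench Overall", "TIIF-short BR",
    "TIIF-long BR", "UniBench-short Lay-2D", "UniBench-long Lay-2D",
]
# Rows copied from the Main Results table.
BASE = dict(zip(METRICS, [2.18, 82.91, 0.769, 0.758, 0.766, 0.819]))
FLOW_GRPO = dict(zip(METRICS, [3.01, 57.02, 0.851, 0.577, 0.726, 0.445]))
OURS = dict(zip(METRICS, [7.81, 85.03, 0.875, 0.845, 0.875, 0.891]))


def relative_change(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    for name in METRICS:
        ours = relative_change(OURS[name], BASE[name])
        flow = relative_change(FLOW_GRPO[name], BASE[name])
        print(f"{name:24s} ours {ours:+7.1f}%   Flow-GRPO {flow:+7.1f}%")
```

For SpatialScore this gives \((7.81 - 2.18) / 2.18 \approx 2.58\), i.e. the +258% quoted above.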

Reward Model Evaluation¶

Model	Overall Accuracy
PickScore	0.509
HPSv3	0.605
Qwen2.5-VL-72B	0.764
GPT-5	0.890
Gemini-2.5 Pro	0.951
SpatialScore (7B)	0.958

SpatialScore with 7B parameters outperforms GPT-5 and Gemini-2.5 Pro in spatial understanding evaluation.
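
The notes do not define how "Overall Accuracy" is computed. A common choice for preference benchmarks, assumed here, is pairwise ranking accuracy: the fraction of pairs in which the model scores the preferred image above the rejected one. The sketch below uses a hypothetical `score(prompt, image)` callable in place of a real reward model.

```python
def pairwise_accuracy(pairs, score):
    """Fraction of (prompt, preferred, rejected) triples ranked correctly.

    `score` is any callable score(prompt, image) -> float; ties count as errors.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(score(p, w) > score(p, l) for p, w, l in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    # Toy scores keyed by image id, standing in for reward-model outputs.
    toy_scores = {"img_a": 8.1, "img_b": 2.4, "img_c": 5.0, "img_d": 5.0}
    toy_pairs = [("p1", "img_a", "img_b"), ("p2", "img_c", "img_d")]
    print(pairwise_accuracy(toy_pairs, lambda prompt, image: toy_scores[image]))
```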

Ablation Study¶

Configuration	SpatialScore	DPG-bench Rel	UniBench Lay-3D(long)	NFE/step
w/o top-k	7.73	0.919	0.793	24×6
top-k (k=4)	7.71	0.916	0.796	8×6
top-k (k=6)	7.81	0.932	0.801	12×6

Key Findings¶

Flow-GRPO trained with the GenEval reward improves TIIF-short BR (0.769 → 0.851) but degrades severely on long prompts (TIIF-long BR 0.758 → 0.577; UniBench-long Lay-2D 0.819 → 0.445), falling below the base model's own long-prompt following ability.
Scaling SpatialScore from 3B to 7B increased accuracy from 89.1% to 95.8%, demonstrating a significant scaling effect.
Improvement in spatial understanding shows positive transfer; all five dimensions of DPG-Bench improved.

Highlights & Insights¶

Adversarial Data Construction: Generation of preference pairs via spatial relation perturbation precisely eliminates interference from non-spatial factors.
7B Model Outperforms Proprietary Models: Small models trained for specific tasks can surpass general large-scale models in specialized domains.
Simple and Effective Top-k Filtering: Resolves advantage bias in GRPO caused by uneven prompt difficulty while halving the NFE per step (24×6 → 12×6).
Converting ODE sampling to SDE sampling for policy exploration has become a relatively mature technical route.

Limitations & Future Work¶

Focused solely on spatial relationships; does not cover other compositional generation dimensions (e.g., attribute binding, numerical accuracy).
SpatialReward-Dataset depends on strong generative models (Qwen-Image, etc.); evaluation of weaker models might be biased.
High computational cost for RL training (32×H20 GPUs).
Impact of improved spatial understanding on aesthetic quality was not discussed.

Key difference from Flow-GRPO: Specialized reward model vs. rule-based GenEval reward; the latter is unreliable in complex scenarios.
The RLHF recipe that succeeded in LLMs is being systematically carried over to image generation; this work is a representative example for the spatial dimension.
Insight: Similar specialized reward models and RL frameworks could be built for other dimensions (e.g., attribute binding, action consistency).

Rating¶

Novelty: ⭐⭐⭐⭐ First reward model specifically for spatial understanding; top-k filtering strategy is valuable.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across multiple benchmarks, detailed ablations, and comparisons with various baselines and proprietary models.
Writing Quality: ⭐⭐⭐⭐ Clear motivation analysis and experimental presentation with rich visualizations.
Value: ⭐⭐⭐⭐ Successful validation of the reward model + RL paradigm for improving generation quality in the spatial dimension; holds methodological significance.