Synthetic Curriculum Reinforces Compositional Text-to-Image Generation

Conference: CVPR 2026
Paper: CVF Open Access
Code: None
Area: Diffusion Models / Text-to-Image Generation / Reinforcement Learning
Keywords: Compositional Generation, Curriculum Learning, Scene Graphs, MCMC Sampling, GRPO

TL;DR

CompGen defines "compositional difficulty" through the structural complexity of scene graphs, uses adaptive MCMC to sample scene graphs within specified difficulty intervals to construct training prompts, and integrates "easy-to-hard" curriculum weights into the rewards of Group Relative Policy Optimization (GRPO). Without requiring any ground-truth images, this approach raises the five-benchmark average by 11.72 points for the diffusion-based SD-1.5 and by 7.61 points for the autoregressive SimpleAR.

Background & Motivation

Background: Text-to-Image (T2I) generation has achieved high image quality, but "compositional generation"—the simultaneous appearance of multiple objects, each with distinct attributes and spatial/semantic relations (e.g., "a brown dog standing to the right of a white kitten")—remains a well-recognized challenge. Mainstream improvement routes include attention map modification (e.g., DenseDiffusion, CONFORM), introducing intermediate structures (layouts, skeletons), and fine-tuning (vision-language supervision or RL).

Limitations of Prior Work: Attention-based methods only work during inference, with limited scalability; planning-based methods require extra layout/VQA modules, increasing inference costs and risking attribute binding errors; training-based methods often require synthetic ground-truth images or intermediate skeletons, incurring high data preparation costs. Crucially, large-scale RL for compositional T2I is unstable because "compositional ability" is heterogeneous, involving object existence, attribute binding, relation understanding, and counting—mixing these during training leads to oscillation.

Key Challenge: Compositional difficulty lacks a quantifiable and controllable metric. Without the ability to precisely sequence "simple samples before complex samples," RL blindly optimizes on mixed-difficulty data, leading to instability and sub-optimal performance.

Goal: (1) Define a grounded metric for compositional difficulty; (2) Efficiently generate training data by difficulty; (3) Integrate a difficulty curriculum into RL without relying on ground-truth images.

Key Insight: Drawing from human cognitive development—learning single objects/attributes before complex multi-object relations—the "easy-to-hard" curriculum can be characterized by scene graphs. The structural density of objects, attributes, and relations naturally reflects compositional complexity.

Core Idea: Structural complexity of scene graphs serves as the difficulty yardstick. Adaptive MCMC samples scene graphs in target difficulty ranges to "synthesize a curriculum," and curriculum weights reshape GRPO rewards, coupling curriculum learning with ground-truth-free RL.

Method

Overall Architecture

CompGen is a two-stage "Synthetic Curriculum + Curriculated RL" framework. Stage 1 defines difficulty and uses adaptive MCMC to sample scene graphs within the \([Diff_{min}, Diff_{max}]\) interval. Stage 2 instantiates each scene graph into a text prompt for the T2I model and employs programmatically generated binary Question-Answering (QA) pairs + Multimodal LLM (MLLM) scoring as rewards. Finally, Curriculated GRPO (C-GRPO) updates the T2I model. The input is "text only," and the output is a "compositionally enhanced T2I model," completely bypassing the need for reference images.

graph TD
    A["Input: Target Difficulty Range<br/>[Diffmin, Diffmax]"] --> B["Scene Graph Difficulty Metric<br/>Multiplicative Structural Complexity"]
    B --> C["Adaptive MCMC Sampler<br/>Energy Function + Simulated Annealing"]
    C --> D["Scene Graph Instantiation<br/>LLM Asset Library → Prompt"]
    D --> E["Scene Graph-Driven Binary QA Rewards<br/>Object/Count/Attr/Relation × MLLM Scoring"]
    E --> F["Curriculated GRPO (C-GRPO)<br/>Reward Re-weighting by Progress"]
    F -->|Policy Update| G["Output: Compositionally Enhanced T2I Model"]
    F -.Next Difficulty Batch.-> C

Key Designs

1. Scene Graph Difficulty Metric: Quantifying Complexity Multiplicatively

Compositional difficulty previously lacked a standard metric to drive curricula. This paper formalizes a scene graph as \(G=(O,A,R)\) (Object set \(O\), Attribute set \(A\), Relation set \(R\)) and defines difficulty as:

\[\mathrm{Diff}(G) = \lVert O\rVert \cdot \max\!\left(1, \frac{\lVert A\rVert}{\lVert O\rVert}\right)\cdot \max\!\left(1, \frac{\lVert R\rVert}{\lVert O\rVert}\right)\]

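Here \(\lVert\cdot\rVert\) denotes the number of elements in a set, so the three factors are the total object count, the average attribute density per object, and the average relational connectivity per object (each density is floored at 1). The multiplicative form (rather than additive or mean) makes difficulty grow with the product of these factors, reflecting how the number of possible bindings multiplies as objects, attributes, and relations are added. In the ablation, it beats the additive metric \(\lVert O\rVert+\lVert A\rVert+\lVert R\rVert\) by 4.56 points in average score (66.62% vs. 62.06%).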
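The metric can be illustrated with a small worked example. The sketch below is a minimal reading of the formula, not the authors' code: the `SceneGraph` container and its field layout are assumptions made for illustration, and `additive_difficulty` is included only for contrast with the additive baseline.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph G = (O, A, R); this field layout is an assumption."""
    objects: list = field(default_factory=list)     # object names
    attributes: list = field(default_factory=list)  # (object, attribute) pairs
    relations: list = field(default_factory=list)   # (subject, predicate, object)


def difficulty(g: SceneGraph) -> float:
    """Diff(G) = |O| * max(1, |A|/|O|) * max(1, |R|/|O|)."""
    n_obj = len(g.objects)
    if n_obj == 0:
        raise ValueError("a scene graph needs at least one object")
    attr_density = max(1.0, len(g.attributes) / n_obj)
    rel_density = max(1.0, len(g.relations) / n_obj)
    return n_obj * attr_density * rel_density


def additive_difficulty(g: SceneGraph) -> int:
    """Additive baseline |O| + |A| + |R|, shown only for comparison."""
    return len(g.objects) + len(g.attributes) + len(g.relations)


# "a brown dog standing to the right of a white kitten"
simple = SceneGraph(
    objects=["dog", "kitten"],
    attributes=[("dog", "brown"), ("kitten", "white")],
    relations=[("dog", "right of", "kitten")],
)

# A denser, made-up graph: 3 objects, 6 attributes, 6 relations.
dense = SceneGraph(
    objects=["a", "b", "c"],
    attributes=[(o, f"attr{k}") for o in "abc" for k in range(2)],
    relations=[(s, "rel", t) for s in "abc" for t in "abc" if s != t],
)

print(difficulty(simple), additive_difficulty(simple))  # 2.0 5
print(difficulty(dense), additive_difficulty(dense))    # 12.0 15
```

The dog-and-kitten prompt scores 2.0 because both densities are floored at 1, while the denser graph scores 3 × 2 × 2 = 12: the multiplicative form separates the two graphs by a factor of six, whereas the additive count separates them only by a factor of three.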

2. Adaptive MCMC Scene Graph Sampling: Efficient Data Generation

To sample graphs where difficulty falls within \([\mathrm{Diff}_{min},\mathrm{Diff}_{max}]\), the authors use an iterative sampling approach starting from a minimal graph \(G_0\). Two reversible transformations, \(T_{add}\) (adding a node/edge) and \(T_{delete}\) (removing a node/edge), propose candidate graphs \(G'\). The proposal distribution \(q(G'|G)\) is designed to be symmetric. To target specific difficulties, an energy function measures the deviation:

\[\mathrm{Energy}(G) = \mathrm{Dist}\big(\mathrm{Diff}(G),\,[\mathrm{Diff}_{min},\mathrm{Diff}_{max}]\big)\]

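Candidates are accepted or rejected with the Metropolis-Hastings rule. Because the proposal is symmetric, the proposal ratio cancels and the acceptance probability depends only on the energy change: \(\mathrm{Acc}(G'|G)=\min\!\big(1,\exp(\tfrac{\mathrm{Energy}(G)-\mathrm{Energy}(G')}{\tau})\big)\). The temperature \(\tau\) follows a simulated annealing schedule: a high initial \(\tau\) allows broad exploration, and lowering it concentrates samples inside the target interval, which balances constraint satisfaction with diversity.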
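The sampling loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's sampler: a graph is reduced to its three counts `(|O|, |A|, |R|)`, the add/delete moves change one count by one, and the geometric cooling schedule (`tau0`, `decay`) and all function names are hypothetical.

```python
import math
import random


def difficulty(n_obj, n_attr, n_rel):
    """Diff(G) computed from counts only (a simplification of a real graph)."""
    return n_obj * max(1.0, n_attr / n_obj) * max(1.0, n_rel / n_obj)


def energy(diff, lo, hi):
    """Distance from diff to the interval [lo, hi]; zero inside the interval."""
    if diff < lo:
        return lo - diff
    if diff > hi:
        return diff - hi
    return 0.0


def propose(state, rng):
    """Symmetric proposal: pick one count and add or delete one element.

    Invalid moves (no objects left, negative counts) return the current
    state unchanged, which keeps q(G'|G) equal to q(G|G').
    """
    counts = list(state)
    idx = rng.randrange(3)
    counts[idx] += rng.choice((-1, 1))
    if counts[0] < 1 or counts[1] < 0 or counts[2] < 0:
        return state
    return tuple(counts)


def sample_graphs(lo, hi, steps=2000, tau0=5.0, decay=0.995, seed=0):
    """Collect visited states whose difficulty lies inside [lo, hi]."""
    rng = random.Random(seed)
    state = (1, 0, 0)  # minimal graph G_0: a single object
    tau = tau0
    in_range = []
    for _ in range(steps):
        cand = propose(state, rng)
        delta = energy(difficulty(*cand), lo, hi) - energy(difficulty(*state), lo, hi)
        # Equivalent to accepting with probability min(1, exp(-delta / tau)).
        if delta <= 0 or rng.random() < math.exp(-delta / tau):
            state = cand
        if energy(difficulty(*state), lo, hi) == 0.0:
            in_range.append(state)
        tau = max(tau * decay, 1e-2)  # assumed geometric annealing schedule
    return in_range


samples = sample_graphs(lo=4, hi=8)
```

Writing the acceptance test as `delta <= 0 or ...` avoids evaluating `exp` on a large positive argument once the temperature is low. A full implementation would apply the add/delete moves to actual nodes and edges rather than to counts.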

3. Scene Graph-Driven Binary QA Rewards: Fine-grained Feedback without Ground Truth

Rewards are generated by using the same scene graph to produce both the prompt and the evaluation questions. A constrained LLM (DeepSeek-V3) converts the graph to a prompt, ensuring all elements are included. Four types of binary questions are programmatically generated from the graph—Object Existence \(Q_{object}\), Counting \(Q_{count}\), Attribute \(Q_{attribute}\), and Relation \(Q_{relation}\). An MLLM (LLaVA-v1.6-13B) calculates the probability of a "yes" answer: \(r_j^{(i)}=p_{reward}(\text{answer}_j\mid I^{(i)},\text{question}_j)\). Averaging these provides a structural reward signal more granular than holistic scoring.
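A rough sketch of how one scene graph can yield both the questions and the reward is shown below. The dictionary layout, the question templates, and the `yes_probability` callback are all assumptions for illustration; in the paper the probability comes from the MLLM, which is replaced here by a stub.

```python
from statistics import mean


def build_questions(graph):
    """Generate binary questions from a scene graph (illustrative templates)."""
    questions = []
    for name, count in graph["objects"].items():
        questions.append(("object", f"Is there a {name} in the image?"))
        questions.append(("count", f"Are there exactly {count} {name}(s) in the image?"))
    for name, attr in graph["attributes"]:
        questions.append(("attribute", f"Is the {name} {attr}?"))
    for subj, pred, obj in graph["relations"]:
        questions.append(("relation", f"Is the {subj} {pred} the {obj}?"))
    return questions


def image_reward(image, questions, yes_probability):
    """Average P('yes') over all questions for one generated image."""
    return mean(yes_probability(image, text) for _, text in questions)


graph = {
    "objects": {"dog": 1, "kitten": 1},
    "attributes": [("dog", "brown"), ("kitten", "white")],
    "relations": [("dog", "to the right of", "kitten")],
}
questions = build_questions(graph)


def stub_scorer(image, text):
    """Stand-in for the MLLM: less confident on counting questions."""
    return 0.5 if "exactly" in text else 0.9


reward = image_reward(image=None, questions=questions, yes_probability=stub_scorer)
```

For this graph the sketch produces seven questions (two existence, two counting, two attribute, one relation), so a failure on a single attribute lowers the reward without zeroing it out.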

4. Curriculated GRPO (C-GRPO): Sequencing Policy Optimization

C-GRPO re-weights rewards across different difficulty levels based on training progress. The curriculated reward at step \(t\) is \(\hat r_j^{(i)}(t)=\sum_{j'}\hat p(t,j')\cdot r_j^{(i)}\), where \(\hat p(t,j')\) is the sampling probability of difficulty level \(j'\) at step \(t\) (controlled by Easy-to-Hard or Gaussian scheduling). The overall reward \(\hat r^{(i)}(t)\) for image \(i\) is the mean over all sampled questions. Rewards are then normalized within each group of \(G\) images generated for the same prompt (here \(G\) denotes the group size, not the scene graph) to obtain the advantage \(A_i(t)\), which enters the GRPO objective with clipping and KL regularization:

\[J_{\text{C-GRPO}}(\theta)=\mathbb{E}_T\Big[\tfrac1G\sum_i \min\big(\tfrac{\pi_\theta}{\pi_{\theta_{old}}}A_i(t),\ \mathrm{clip}(\tfrac{\pi_\theta}{\pi_{\theta_{old}}},1-\epsilon,1+\epsilon)A_i(t)\big)-\beta\,\mathrm{KL}(p_\theta\Vert p_{ref})\Big]\]

This encourages the model to master simple concepts before tackling complex combinations.
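The pieces of C-GRPO can be sketched in a few functions. Everything here is a hedged illustration: the Gaussian schedule parameterisation (`sigma`, a centre that slides linearly with progress) is an assumption, and `curriculated_reward` encodes one plausible reading of the re-weighting, in which each question's reward is scaled by the schedule weight of its own difficulty level. The KL term is omitted.

```python
import math
from statistics import mean, pstdev


def gaussian_schedule(step, total_steps, num_levels, sigma=0.75):
    """Probabilities over difficulty levels 0..num_levels-1 at a given step.

    The Gaussian centre slides from the easiest to the hardest level as
    training progresses (assumed parameterisation).
    """
    centre = (num_levels - 1) * step / total_steps
    weights = [math.exp(-((lvl - centre) ** 2) / (2 * sigma ** 2))
               for lvl in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]


def curriculated_reward(question_rewards, question_levels, probs):
    """Mean of per-question rewards, each scaled by its level's weight.

    This is an assumed interpretation of the curriculated reward.
    """
    return mean(probs[lvl] * r for r, lvl in zip(question_rewards, question_levels))


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardise rewards within one prompt's group."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate min(ratio * A, clip(ratio) * A)."""
    clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


early = gaussian_schedule(step=0, total_steps=500, num_levels=3)
late = gaussian_schedule(step=500, total_steps=500, num_levels=3)
advantages = group_advantages([0.2, 0.4, 0.6, 0.8])
```

Early in training the schedule puts most of its mass on the easiest level and late in training on the hardest, so under this reading the same per-question rewards contribute differently to the advantage at different steps.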

Key Experimental Results

Main Results

On GenEval, DPG, TIFA, T2I-CompBench, and DSG, CompGen provides significant gains for both diffusion and autoregressive backbones:

Model	Params	GenEval	DPG	TIFA	T2I-CompBench	DSG	Average
Stable-Diffusion-1.5 (Baseline)	0.9B	42.08%	62.24%	78.67%	29.94%	61.57%	54.90%
Stable-Diffusion-2.1	0.9B	50.00%	65.47%	82.00%	32.01%	68.09%	59.51%
Playground-V2	2.6B	59.00%	74.54%	86.20%	36.13%	74.54%	66.08%
SimpleAR-SFT (Baseline)	0.5B	53.00%	78.48%	81.06%	33.76%	71.98%	63.66%
Emu3	14B	54.00%	74.19%	81.86%	31.20%	70.31%	62.31%
SD-1.5 w/ CompGen (Ours)	0.9B	53.88%	78.67%	85.71%	37.68%	77.16%	66.62% (↑11.72)
SimpleAR w/ CompGen (Ours)	0.5B	63.24%	81.20%	85.53%	40.27%	86.11%	71.27% (↑7.61)

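On the five-benchmark average, the 0.9B SD-1.5 with CompGen (66.62%) edges out the 2.6B Playground-V2 (66.08%), although Playground-V2 remains ahead on GenEval and TIFA. The 0.5B SimpleAR with CompGen reaches the highest average of all evaluated models (71.27%), including the 14B Emu3, and leads on every benchmark except TIFA.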
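As a quick arithmetic check, the Average column is consistent with an unweighted mean of the five benchmark scores, and the reported gains follow from it. The snippet below recomputes both from the table values; the dictionary keys are shortened labels introduced here.

```python
# Scores in table order: GenEval, DPG, TIFA, T2I-CompBench, DSG.
scores = {
    "SD-1.5": [42.08, 62.24, 78.67, 29.94, 61.57],
    "Playground-V2": [59.00, 74.54, 86.20, 36.13, 74.54],
    "SimpleAR-SFT": [53.00, 78.48, 81.06, 33.76, 71.98],
    "SD-1.5 + CompGen": [53.88, 78.67, 85.71, 37.68, 77.16],
    "SimpleAR + CompGen": [63.24, 81.20, 85.53, 40.27, 86.11],
}

# Unweighted mean per model, rounded to two decimals as in the table.
averages = {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}

# Gains of the CompGen-trained models over their own baselines.
gain_sd = round(averages["SD-1.5 + CompGen"] - averages["SD-1.5"], 2)
gain_ar = round(averages["SimpleAR + CompGen"] - averages["SimpleAR-SFT"], 2)

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
print(f"gains: {gain_sd:.2f}, {gain_ar:.2f}")
```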

Ablation Study

Reward Model Impact (SD-1.5 backbone, Average Score):

Reward Model	Average
InstructBLIP	57.22%
CLIP-FlanT5-XXL	60.63%
LLaVA-v1.5-13B	64.40%
LLaVA-v1.6-13B (Selected)	66.62%

The average score rises with the capability of the reward model, from 57.22% with InstructBLIP to 66.62% with LLaVA-v1.6-13B.

Difficulty Metric Impact:

Difficulty Metric	Average
\(\lVert O\rVert+\lVert A\rVert+\lVert R\rVert\) (Additive)	62.06%
\((\lVert O\rVert+\lVert R\rVert)/2\) (Mean)	61.34%
Ours (Multiplicative)	66.62%

Key Findings

Reward Model is the Ceiling: Performance improves with MLLM power, providing a clear path for future scaling.
Multiplicative Metric is Key: It characterises the combinatorial explosion better than additive metrics.
Curriculum Scheduling Strategy: On GenEval, Gaussian scheduling reached 54.6% within 500 steps (a 30% relative gain). Curriculum learning also keeps the model improving for longer before training saturates.
Difficulty Balance: Focusing exclusively on easy or hard samples degrades generalization; a balanced progression is essential.

Highlights & Insights

Dual Role of Scene Graphs: The scene graph acts as both "questioner" (prompt gen) and "grader" (QA gen), grounding RL without needing reference images.
Difficulty as a Controllable Sampling Variable: Using MCMC to hit precise difficulty targets allows for "data on demand."
Curriculum Integrated into Rewards: C-GRPO re-weights rewards rather than changing architecture, making it easy to integrate into existing pipelines.

Limitations & Future Work

Complexity is currently structural; semantic complexity (e.g., visual realism requirements) is not yet factored into the difficulty metric.
Curriculum scheduling is currently fixed (Easy-to-Hard/Gaussian) rather than adaptive to real-time model performance.
Reliance on MLLM binary scoring: Systemic biases in the MLLM (e.g., specific relation errors) directly affect reward quality.
Structural counts may not always align with human perception of difficulty (e.g., semantically unusual but structurally simple prompts).

Vs Attention Methods (DenseDiffusion): CompGen bakes the improvement into the model weights, so inference incurs no extra cost, whereas attention-based methods must intervene at inference time on every generation.
Vs Planning Methods: CompGen avoids the need for external layout modules during inference.
Vs SFT with GT Images: CompGen only requires text and RL, bypassing expensive ground-truth image generation.
Vs Standard GRPO: Standard GRPO oscillates on mixed-difficulty data; C-GRPO stabilizes training by progressing from easy to hard samples.

Rating

Novelty: ⭐⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐
Value: ⭐⭐⭐⭐⭐