Synthetic Curriculum Reinforces Compositional Text-to-Image Generation
Conference: CVPR 2026
Paper: CVF Open Access
Code: None
Area: Diffusion Models / Text-to-Image Generation / Reinforcement Learning
Keywords: Compositional Generation, Curriculum Learning, Scene Graphs, MCMC Sampling, GRPO
TL;DR
CompGen defines "compositional difficulty" through the structural complexity of scene graphs, uses adaptive MCMC to sample scene graphs within specified difficulty intervals to construct training prompts, and integrates "easy-to-hard" curriculum weights into the rewards of Group Relative Policy Optimization (GRPO). Without requiring any ground-truth images, this approach raises the five-benchmark average of a diffusion backbone (SD-1.5) by 11.72 points and of an autoregressive backbone (SimpleAR) by 7.61 points.
Background & Motivation
Background: Text-to-Image (T2I) generation has achieved high image quality, but "compositional generation"—the simultaneous appearance of multiple objects, each with distinct attributes and spatial/semantic relations (e.g., "a brown dog standing to the right of a white kitten")—remains a well-recognized challenge. Mainstream improvement routes include attention map modification (e.g., DenseDiffusion, CONFORM), introducing intermediate structures (layouts, skeletons), and fine-tuning (vision-language supervision or RL).
Limitations of Prior Work: Attention-based methods only work during inference, with limited scalability; planning-based methods require extra layout/VQA modules, increasing inference costs and risking attribute binding errors; training-based methods often require synthetic ground-truth images or intermediate skeletons, incurring high data preparation costs. Crucially, large-scale RL for compositional T2I is unstable because "compositional ability" is heterogeneous, involving object existence, attribute binding, relation understanding, and counting—mixing these during training leads to oscillation.
Key Challenge: Compositional difficulty lacks a quantifiable and controllable metric. Without the ability to precisely sequence "simple samples before complex samples," RL blindly optimizes on mixed-difficulty data, leading to instability and sub-optimal performance.
Goal: (1) Define a grounded metric for compositional difficulty; (2) Efficiently generate training data by difficulty; (3) Integrate a difficulty curriculum into RL without relying on ground-truth images.
Key Insight: Drawing from human cognitive development—learning single objects/attributes before complex multi-object relations—the "easy-to-hard" curriculum can be characterized by scene graphs. The structural density of objects, attributes, and relations naturally reflects compositional complexity.
Core Idea: Structural complexity of scene graphs serves as the difficulty yardstick. Adaptive MCMC samples scene graphs in target difficulty ranges to "synthesize a curriculum," and curriculum weights reshape GRPO rewards, coupling curriculum learning with ground-truth-free RL.
Method
Overall Architecture
CompGen is a two-stage "Synthetic Curriculum + Curriculated RL" framework. Stage 1 defines difficulty and uses adaptive MCMC to sample scene graphs within the \([\mathrm{Diff}_{min},\mathrm{Diff}_{max}]\) interval. Stage 2 instantiates each scene graph into a text prompt for the T2I model and uses programmatically generated binary Question-Answering (QA) pairs, scored by a Multimodal LLM (MLLM), as rewards. Finally, Curriculated GRPO (C-GRPO) updates the T2I model. The input is "text only," and the output is a "compositionally enhanced T2I model," so no reference images are needed at any point.
graph TD
A["Input: Target Difficulty Range<br/>[Diffmin, Diffmax]"] --> B["Scene Graph Difficulty Metric<br/>Multiplicative Structural Complexity"]
B --> C["Adaptive MCMC Sampler<br/>Energy Function + Simulated Annealing"]
C --> D["Scene Graph Instantiation<br/>LLM Asset Library → Prompt"]
D --> E["Scene Graph-Driven Binary QA Rewards<br/>Object/Count/Attr/Relation × MLLM Scoring"]
E --> F["Curriculated GRPO (C-GRPO)<br/>Reward Re-weighting by Progress"]
F -->|Policy Update| G["Output: Compositionally Enhanced T2I Model"]
F -.Next Difficulty Batch.-> C
Key Designs
1. Scene Graph Difficulty Metric: Quantifying Complexity Multiplicatively
Compositional difficulty previously lacked a standard metric to drive curricula. This paper formalizes a scene graph as \(G=(O,A,R)\) (object set \(O\), attribute set \(A\), relation set \(R\)) and defines difficulty as the product of three factors: total object count, average attribute density, and average relational connectivity.
The multiplicative form (rather than an additive or averaged one) is meant to capture how combinatorial complexity compounds as components are added. In the ablations, it outperforms the additive baseline \(\lVert O\rVert+\lVert A\rVert+\lVert R\rVert\) by 4.56 points on average (66.62% vs. 62.06%).
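The exact formula is not shown in this summary, so the following is only a minimal sketch of a metric with the three factors described above. The `SceneGraph` class, the function names, and in particular the `1 +` smoothing on the two density terms (so that a lone object without attributes or relations still has non-zero difficulty) are assumptions made for illustration, not the paper's definition.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical container for G = (O, A, R)."""
    objects: list = field(default_factory=list)
    # (object, attribute) pairs, e.g. ("dog", "brown")
    attributes: list = field(default_factory=list)
    # (subject, predicate, object) triples
    relations: list = field(default_factory=list)


def multiplicative_difficulty(g: SceneGraph) -> float:
    """Assumed form: |O| * (1 + |A|/|O|) * (1 + |R|/|O|)."""
    n = len(g.objects)
    if n == 0:
        return 0.0
    attr_density = len(g.attributes) / n       # average attributes per object
    rel_connectivity = len(g.relations) / n    # average relations per object
    return n * (1 + attr_density) * (1 + rel_connectivity)


def additive_difficulty(g: SceneGraph) -> int:
    """Additive baseline from the ablation: |O| + |A| + |R|."""
    return len(g.objects) + len(g.attributes) + len(g.relations)


simple = SceneGraph(objects=["dog"], attributes=[("dog", "brown")])
composed = SceneGraph(
    objects=["dog", "kitten"],
    attributes=[("dog", "brown"), ("kitten", "white")],
    relations=[("dog", "right of", "kitten")],
)
print(multiplicative_difficulty(simple), additive_difficulty(simple))
print(multiplicative_difficulty(composed), additive_difficulty(composed))
```

Under these assumptions the two-object prompt scores three times the single-object prompt multiplicatively (6.0 vs. 2.0), while the additive count only grows from 2 to 5, which illustrates why the product separates difficulty levels more sharply.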
2. Adaptive MCMC Scene Graph Sampling: Efficient Data Generation
To sample graphs whose difficulty falls within \([\mathrm{Diff}_{min},\mathrm{Diff}_{max}]\), the authors run an iterative sampler starting from a minimal graph \(G_0\). Two reversible transformations, \(T_{add}\) (adding a node/edge) and \(T_{delete}\) (removing a node/edge), propose candidate graphs \(G'\). The proposal distribution \(q(G'|G)\) is designed to be symmetric. To target a specific difficulty range, an energy function \(\mathrm{Energy}(G)\) measures how far the graph's difficulty deviates from the target interval.
Metropolis-Hastings is used for decisions, with the acceptance probability \(\mathrm{Acc}(G'|G)=\min\!\big(1,\exp(\tfrac{\mathrm{Energy}(G)-\mathrm{Energy}(G')}{\tau})\big)\). A simulated annealing strategy for temperature \(\tau\) allows broad exploration initially, later converging to the constraint, ensuring both accuracy and diversity.
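The sampling loop can be sketched as follows. This is a simplified illustration, not the authors' implementation: a scene graph is reduced to a tuple of counts `(objects, attributes, relations)`, the difficulty uses an assumed multiplicative form, the energy is assumed to be the distance from the difficulty to the target interval, and the geometric annealing schedule and all parameter values are hypothetical. Only the acceptance rule is taken from the text above.

```python
import math
import random


def difficulty(state):
    """Assumed multiplicative metric on (objects, attributes, relations) counts."""
    n_obj, n_attr, n_rel = state
    return n_obj * (1 + n_attr / n_obj) * (1 + n_rel / n_obj)


def energy(state, diff_min, diff_max):
    """Assumed energy: distance from Diff(G) to the interval (0 inside it)."""
    d = difficulty(state)
    return max(0.0, diff_min - d) + max(0.0, d - diff_max)


def propose(state, rng):
    """Symmetric proposal: pick one component, then add or delete one element."""
    counts = list(state)
    counts[rng.randrange(3)] += rng.choice((-1, 1))
    n_obj, n_attr, n_rel = counts
    # Invalid graphs return None and are treated as a rejected move.
    if n_obj < 1 or n_attr < 0 or n_rel < 0 or n_rel > n_obj * (n_obj - 1):
        return None
    return tuple(counts)


def sample_scene_graphs(diff_min, diff_max, steps=4000,
                        tau_start=5.0, tau_end=0.05, seed=0):
    rng = random.Random(seed)
    state = (1, 0, 0)  # minimal graph G_0: a single bare object
    kept = []
    for step in range(steps):
        # Geometric annealing: explore broadly first, then tighten.
        tau = tau_start * (tau_end / tau_start) ** (step / (steps - 1))
        candidate = propose(state, rng)
        if candidate is not None:
            delta = (energy(state, diff_min, diff_max)
                     - energy(candidate, diff_min, diff_max))
            # Same rule as min(1, exp(delta / tau)), without overflow.
            if delta >= 0 or rng.random() < math.exp(delta / tau):
                state = candidate
        # After burn-in, keep every visited state that meets the constraint.
        if step >= steps // 2 and energy(state, diff_min, diff_max) == 0.0:
            kept.append(state)
    return kept


samples = sample_scene_graphs(6.0, 10.0)
print(len(samples), sorted(set(samples))[:5])
```

Keeping only in-range states after a burn-in period is one simple way to get both accuracy (every kept graph meets the constraint) and diversity (the chain keeps moving among zero-energy graphs).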
3. Scene Graph-Driven Binary QA Rewards: Fine-grained Feedback without Ground Truth
Rewards are generated by using the same scene graph to produce both the prompt and the evaluation questions. A constrained LLM (DeepSeek-V3) converts the graph to a prompt, ensuring all elements are included. Four types of binary questions are programmatically generated from the graph—Object Existence \(Q_{object}\), Counting \(Q_{count}\), Attribute \(Q_{attribute}\), and Relation \(Q_{relation}\). An MLLM (LLaVA-v1.6-13B) calculates the probability of a "yes" answer: \(r_j^{(i)}=p_{reward}(\text{answer}_j\mid I^{(i)},\text{question}_j)\). Averaging these provides a structural reward signal more granular than holistic scoring.
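As a rough illustration of how one scene graph can yield both the questions and the reward, the sketch below generates the four question types from a small graph and averages the "yes" probabilities. The question templates, the function names, and the `yes_probability` callable (a stand-in for the MLLM scorer) are hypothetical; the paper's actual templates and scoring interface are not given in this summary.

```python
from collections import Counter


def build_questions(objects, attributes, relations):
    """Generate (type, text) binary questions from one scene graph."""
    questions = []
    for name, n in Counter(objects).items():
        questions.append(("object", f"Is there a {name} in the image?"))
        questions.append(("count", f"Are there exactly {n} {name}(s) in the image?"))
    for name, attr in attributes:
        questions.append(("attribute", f"Is the {name} {attr}?"))
    for subj, pred, obj in relations:
        questions.append(("relation", f"Is the {subj} {pred} the {obj}?"))
    return questions


def image_reward(image, questions, yes_probability):
    """Mean probability of a 'yes' answer over all questions for one image."""
    scores = [yes_probability(image, text) for _, text in questions]
    return sum(scores) / len(scores)


questions = build_questions(
    objects=["dog", "kitten"],
    attributes=[("dog", "brown"), ("kitten", "white")],
    relations=[("dog", "standing to the right of", "kitten")],
)


def dummy_scorer(image, question):
    """Placeholder scorer: a real setup would query an MLLM here."""
    return 0.75


print(len(questions), image_reward(None, questions, dummy_scorer))
```

Because every question is derived from the graph that produced the prompt, a low score on one question type points at a specific failure (a missing object, a wrong count, a mis-bound attribute, or a wrong relation).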
4. Curriculated GRPO (C-GRPO): Sequencing Policy Optimization
C-GRPO re-weights rewards across different difficulty levels based on training progress. The curriculated reward at step \(t\) is \(\hat r_j^{(i)}(t)=\sum_{j'}\hat p(t,j')\cdot r_j^{(i)}\), where \(\hat p(t,j')\) is the sampling probability of difficulty level \(j'\) at step \(t\) (controlled by Easy-to-Hard or Gaussian scheduling). The overall reward \(\hat r^{(i)}(t)\) for an image is the mean over all sampled questions. Rewards are normalized within each group of \(G\) images to obtain the advantage \(A_i(t)\), which is then plugged into the GRPO objective with clipping and KL regularization.
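For reference, the standard GRPO objective with the curriculated reward substituted in takes the form below. This is a sketch in the usual GRPO notation rather than the paper's exact expression: the prompt symbol \(c\), the clip range \(\epsilon\), the KL weight \(\beta\), the reference policy \(\pi_{ref}\), and the ratio \(\rho_i\) are assumed.

\[
A_i(t)=\frac{\hat r^{(i)}(t)-\operatorname{mean}\big(\{\hat r^{(k)}(t)\}_{k=1}^{G}\big)}{\operatorname{std}\big(\{\hat r^{(k)}(t)\}_{k=1}^{G}\big)},\qquad
\rho_i(\theta)=\frac{\pi_\theta\big(I^{(i)}\mid c\big)}{\pi_{\theta_{old}}\big(I^{(i)}\mid c\big)}
\]

\[
\mathcal{J}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\Big(\rho_i(\theta)\,A_i(t),\ \operatorname{clip}\big(\rho_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,A_i(t)\Big)-\beta\,D_{KL}\big(\pi_\theta\,\Vert\,\pi_{ref}\big)\right]
\]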
This encourages the model to master simple concepts before tackling complex combinations.
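The effect of the re-weighting on advantages can be illustrated with a small sketch. It assumes one reading of the curriculated reward: each question carries a difficulty level and its reward is scaled by the schedule probability of that level before averaging. The function names, the two-level setup, and all numbers are hypothetical.

```python
import statistics


def curriculated_image_reward(question_rewards, question_levels, level_probs):
    """Mean over questions of p_hat(t, level) * r_j for one image."""
    weighted = [level_probs[lvl] * r
                for r, lvl in zip(question_rewards, question_levels)]
    return sum(weighted) / len(weighted)


def group_advantages(image_rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of images."""
    mean = statistics.fmean(image_rewards)
    std = statistics.pstdev(image_rewards)
    return [(r - mean) / (std + eps) for r in image_rewards]


# Three images for one prompt, each scored on an easy (0) and a hard (1) question.
levels = [0, 1]
group = [[0.9, 0.2], [0.7, 0.6], [0.4, 0.4]]
early = {0: 0.8, 1: 0.2}  # early training: easy questions dominate
late = {0: 0.2, 1: 0.8}   # late training: hard questions dominate

adv_early = group_advantages(
    [curriculated_image_reward(r, levels, early) for r in group])
adv_late = group_advantages(
    [curriculated_image_reward(r, levels, late) for r in group])
print(adv_early)
print(adv_late)
```

With the early weights the first image (best on the easy question) gets the largest advantage; with the late weights the second image (best on the hard question) does. Note that weighting per question matters here: scaling every reward in a group by the same constant would cancel out in the normalization.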
Key Experimental Results
Main Results
On GenEval, DPG, TIFA, T2I-CompBench, and DSG, CompGen provides significant gains for both diffusion and autoregressive backbones:
| Model | Params | GenEval | DPG | TIFA | T2I-CompBench | DSG | Average |
|---|---|---|---|---|---|---|---|
| Stable-Diffusion-1.5 (Baseline) | 0.9B | 42.08% | 62.24% | 78.67% | 29.94% | 61.57% | 54.90% |
| Stable-Diffusion-2.1 | 0.9B | 50.00% | 65.47% | 82.00% | 32.01% | 68.09% | 59.51% |
| Playground-V2 | 2.6B | 59.00% | 74.54% | 86.20% | 36.13% | 74.54% | 66.08% |
| SimpleAR-SFT (Baseline) | 0.5B | 53.00% | 78.48% | 81.06% | 33.76% | 71.98% | 63.66% |
| Emu3 | 14B | 54.00% | 74.19% | 81.86% | 31.20% | 70.31% | 62.31% |
| SD-1.5 w/ CompGen (Ours) | 0.9B | 53.88% | 78.67% | 85.71% | 37.68% | 77.16% | 66.62% (↑11.72) |
| SimpleAR w/ CompGen (Ours) | 0.5B | 63.24% | 81.20% | 85.53% | 40.27% | 86.11% | 71.27% (↑7.61) |
The 0.9B SD-1.5 with CompGen outperforms the 2.6B Playground-V2. The 0.5B SimpleAR with CompGen achieves 71.27%, surpassing all evaluated models, including the 14B Emu3.
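The Average column and the reported gains can be recomputed directly from the per-benchmark scores in the table. The snippet below is only an arithmetic check on the numbers listed above; the variable names are arbitrary.

```python
# Per-benchmark scores: GenEval, DPG, TIFA, T2I-CompBench, DSG (in %).
scores = {
    "SD-1.5": [42.08, 62.24, 78.67, 29.94, 61.57],
    "SD-1.5 w/ CompGen": [53.88, 78.67, 85.71, 37.68, 77.16],
    "SimpleAR-SFT": [53.00, 78.48, 81.06, 33.76, 71.98],
    "SimpleAR w/ CompGen": [63.24, 81.20, 85.53, 40.27, 86.11],
    "Playground-V2": [59.00, 74.54, 86.20, 36.13, 74.54],
}

averages = {name: round(sum(vals) / len(vals), 2) for name, vals in scores.items()}
sd_gain = round(averages["SD-1.5 w/ CompGen"] - averages["SD-1.5"], 2)
ar_gain = round(averages["SimpleAR w/ CompGen"] - averages["SimpleAR-SFT"], 2)

print(averages)
print(sd_gain, ar_gain)
```

The recomputed averages match the table, and the two gains come out to 11.72 and 7.61 points, consistent with the arrows in the Average column.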
Ablation Study
Reward Model Impact (SD-1.5 backbone, Average Score):
| Reward Model | Average |
|---|---|
| InstructBLIP | 57.22% |
| CLIP-FlanT5-XXL | 60.63% |
| LLaVA-v1.5-13B | 64.40% |
| LLaVA-v1.6-13B (Selected) | 66.62% |
In this ablation, the average score rises with each stronger reward model, from 57.22% (InstructBLIP) to 66.62% (LLaVA-v1.6-13B).
Difficulty Metric Impact:
| Difficulty Metric | Average |
|---|---|
| \(\lVert O\rVert+\lVert A\rVert+\lVert R\rVert\) (Additive) | 62.06% |
| \((\lVert O\rVert+\lVert R\rVert)/2\) (Mean) | 61.34% |
| Ours (Multiplicative) | 66.62% |
Key Findings
- Reward Model is the Ceiling: Performance improves with MLLM power, providing a clear path for future scaling.
- Multiplicative Metric is Key: It characterises the combinatorial explosion better than additive metrics.
- Curriculum Scheduling Strategy: On GenEval, Gaussian scheduling reached 54.6% within 500 steps (a 30% relative gain). Curriculum learning extends the effective training duration for continuous improvement.
- Difficulty Balance: Focusing exclusively on easy or hard samples degrades generalization; a balanced progression is essential.
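The two schedules named under Curriculum Scheduling Strategy can be sketched as probability distributions over difficulty levels that shift with training progress. The exact schedules are not specified in this summary, so both functions below, including the `sigma` width and the use of a normalized `progress` value in [0, 1], are assumptions for illustration.

```python
import math


def easy_to_hard_probs(progress, num_levels):
    """Hard staging: all mass on the level matching progress in [0, 1]."""
    active = min(int(progress * num_levels), num_levels - 1)
    return [1.0 if k == active else 0.0 for k in range(num_levels)]


def gaussian_probs(progress, num_levels, sigma=1.0):
    """Soft staging: a Gaussian bump whose center moves from easy to hard."""
    center = progress * (num_levels - 1)
    weights = [math.exp(-((k - center) ** 2) / (2 * sigma ** 2))
               for k in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]


for progress in (0.0, 0.5, 1.0):
    print(progress,
          easy_to_hard_probs(progress, 4),
          [round(p, 3) for p in gaussian_probs(progress, 4)])
```

The Gaussian variant keeps some probability on neighboring levels at every step, which is one way to realize the balanced progression described in the last finding instead of training on a single difficulty at a time.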
Highlights & Insights
- Dual Role of Scene Graphs: The scene graph acts as both "questioner" (prompt gen) and "grader" (QA gen), grounding RL without needing reference images.
- Difficulty as a Controllable Sampling Variable: Using MCMC to hit precise difficulty targets allows for "data on demand."
- Curriculum Integrated into Rewards: C-GRPO re-weights rewards rather than changing architecture, making it easy to integrate into existing pipelines.
Limitations & Future Work
- Complexity is currently structural; semantic complexity (e.g., visual realism requirements) is not yet factored into the difficulty metric.
- Curriculum scheduling is currently fixed (Easy-to-Hard/Gaussian) rather than adaptive to real-time model performance.
- Reliance on MLLM binary scoring: Systemic biases in the MLLM (e.g., specific relation errors) directly affect reward quality.
- Structural counts may not always align with human perception of difficulty (e.g., semantically unusual but structurally simple prompts).
Related Work & Insights
- Vs Attention Methods (DenseDiffusion): CompGen updates the model weights and therefore adds no inference-time cost, whereas attention methods intervene only at inference time.
- Vs Planning Methods: CompGen avoids the need for external layout modules during inference.
- Vs SFT with GT Images: CompGen only requires text and RL, bypassing expensive ground-truth image generation.
- Vs Standard GRPO: Standard GRPO oscillates on mixed-difficulty data; C-GRPO stabilizes training by staging difficulty from easy to hard.
Rating
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐