MIRO: Multi-Reward Conditioned Pretraining Simultaneously Enhances T2I Quality and Efficiency
Conference: ICML 2026
arXiv: 2510.25897
Code: Yes (Paper states "Code and weights available here")
Area: Diffusion Models / Text-to-Image
Keywords: Multi-reward conditioning, Flow Matching pretraining, Classifier-Free Guidance, Reward-guided sampling, Inference-time scaling
TL;DR
MIRO moves alignment out of the RLHF post-training stage and into pretraining: each training image is labeled with 7 reward scores (covering aesthetics, user preference, text-image alignment, visual reasoning, and scientific correctness), so that a Flow Matching model learns \(p(x|c, s)\). At inference, multi-reward CFG steers sampling toward high-reward regions. A 0.36B-parameter model surpasses the 12B FLUX-dev on GenEval with 370× less training compute, and on Aesthetic and HPSv2 a single MIRO sample exceeds the baseline's best-of-128 result.
Background & Motivation
Background: Modern T2I systems follow a three-stage pipeline of "pretraining → SFT → RLHF" (the Stable Diffusion 3 / FLUX route). Pretraining learns the distribution of web images, SFT refines the model on curated data, and RLHF pulls the distribution toward a specific scalar reward (usually PickScore or HPSv2).
Limitations of Prior Work: Each stage incurs a cost—pretraining only optimizes likelihood regardless of user preferences; SFT discards "low-quality" data, losing signals that help the model learn natural image structures; and RLHF collapses the distribution onto a single scalar reward, causing mode collapse, harming semantic fidelity, and fixing the trade-offs between rewards during training, preventing users from adjusting them during inference.
Key Challenge: The three stages narrow the distribution in sequence: each stage further restricts what the previous one produced, ultimately locking users into an operating point chosen by the trainer. Meanwhile, the "single reward + data filtering" recipe both discards low-score data that is rich in structural signal and invites reward hacking (e.g., aesthetic scores surge while text-image alignment collapses).
Goal: Integrate multi-reward alignment directly into pretraining to achieve three objectives: (i) retain all training samples; (ii) allow users to slide freely across multiple reward dimensions during inference; and (iii) use the reward signals themselves as dense supervision to accelerate convergence.
Key Insight: Instead of pulling the distribution toward \(\arg\max r\) in a post-training phase, it is better to treat the reward scores \(s\) as additional conditions for the generator. This allows the model to learn "at reward level \(s\), given caption \(c\), what the image looks like." Consequently, low-score images find a place (as samples of \(p(x|c, s_\text{low})\)) just as high-score images do, modeling the entire reward spectrum.
Core Idea: Modify the generative model from \(p_\theta(x|c)\) to \(p_\theta(x|c, s)\), where \(s=[s_1,\dots,s_N]\) represents discretized binned scores from N reward models. Use a simple multi-reward CFG to guide inference toward high-reward regions, achieving "alignment through training" that surpasses even RLHF.
Method
Overall Architecture
MIRO is a three-step pretraining and inference framework that completely replaces the SFT and RLHF stages:
- Data Augmentation: For each training image \((x^{(i)}, c^{(i)})\), scores are generated using N=7 off-the-shelf reward models to obtain \(s^{(i)} \in \mathbb{R}^N\), which are then discretized into B bins using uniform binning (ensuring balanced sample sizes across score ranges).
- Multi-Reward Conditioned Flow Matching Training: The reward vector \(\hat{s}\) is treated as a condition equivalent to the text \(c\) for the Flow Matching velocity network. The training objective is \(\mathcal{L} = \mathbb{E}\big[\|v_\theta(x_t, c, \hat{s}) - (\epsilon - x)\|_2^2\big]\), where \(x_t = (1-t)x + t\epsilon\).
- Reward-Guided Sampling: During inference, \(\hat{s}\) is set to the maximum value \(\hat{s}_\text{max}=[B-1,\dots,B-1]\) to generate high-quality images. Simultaneously, CFG is extended to perform contrastive inference with positive and negative reward targets \(\hat{s}^+, \hat{s}^-\): \(\hat{v}_\theta(x_t, c) = (1+\omega) v_\theta(x_t, c, \hat{s}^+) - \omega\, v_\theta(x_t, c, \hat{s}^-)\).
The seven rewards cover five major dimensions: AestheticScore (aesthetics), PickScore / HPSv2 / ImageReward (user preference), OpenAI CLIP / JINA CLIP (text-image alignment), VQAScore (visual reasoning), and SciScore (scientific correctness). The backbone is a 0.36B-parameter DiT variant of CAD (Coherence-Aware Diffusion), trained on 16M images (CC12M + LA6). It is compared against a baseline with the same backbone but no reward conditioning, and against various single-reward conditioned versions.
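The annotation and sampling steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: `quantile_bins`, `multi_reward_cfg`, and the `velocity_fn(x_t, caption, s)` signature are hypothetical names, and rank-based binning is one plausible way to realize "uniform" bins with balanced counts.

```python
import numpy as np

def quantile_bins(scores: np.ndarray, num_bins: int) -> np.ndarray:
    """Map raw scores of shape (num_images, num_rewards) to integer bins in [0, num_bins - 1].

    Each reward column is ranked independently, so every bin receives roughly
    the same number of images regardless of the reward's original scale.
    """
    num_images = scores.shape[0]
    # A double argsort turns each column into ranks 0 .. num_images - 1.
    ranks = scores.argsort(axis=0).argsort(axis=0)
    return (ranks * num_bins) // num_images

def multi_reward_cfg(velocity_fn, x_t, caption, s_pos, s_neg, omega):
    """Contrast a positive and a negative reward target: (1 + w) v(s+) - w v(s-).

    velocity_fn is a stand-in (assumption) for the trained network v_theta.
    """
    v_pos = velocity_fn(x_t, caption, s_pos)
    v_neg = velocity_fn(x_t, caption, s_neg)
    return (1.0 + omega) * v_pos - omega * v_neg
```

With `omega = 0` the guided velocity reduces to plain conditioning on the positive target; larger values push further away from the negative target.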
Key Designs
- Reward Vector Conditioning + Binning Discretization:
    - Function: Unify continuous reward scores with very different scales (e.g., Aesthetic 0-10, CLIP 0-1) into discrete conditioning signals the model can consume.
    - Mechanism: The entire dataset is first scored with each reward model. Uniform binning (splitting into B bins based on quantiles) is then used instead of equal-width binning, so that sparse high-score regions receive enough samples for the model to learn the \(s = B-1\) tail. Conditioning follows the CAD approach: \(\hat{s}\) is encoded into tokens that are concatenated with the caption tokens.
    - Design Motivation: Feeding raw scores directly would bias the model toward the high-density middle range (average quality), so it would fail to learn the high-quality tail. Quantile-based binning amounts to converting reward scores into ranks, which makes it scale-invariant.
- Multi-Reward Classifier-Free Guidance (Inference Controller):
    - Function: Allow users to push generation toward a chosen reward vector at inference time, without the reward weights being locked in at training time.
    - Mechanism: Extend single-reward CFG to a reward vector. The velocity field is evaluated at \(\hat{s}^+\) (default: all 1s, i.e., every reward at its top value, with targets expressed on a [0, 1] scale) and at \(\hat{s}^-\) (default: all 0s), and the two are contrasted. Theorem 2.1 in the paper proves this is equivalent to sampling from a reward-tilted distribution \(p_\omega(x|c) \propto p(x|c, s^+) \big[\frac{p(s^+|x,c)}{p(s^-|x,c)}\big]^\omega\), where the velocity difference approximates the log-odds gradient \(\nabla_{x_t}\log\frac{p(s^+|x_t,c)}{p(s^-|x_t,c)}\).
    - Design Motivation: Whereas RLHF fixes the reward weights at training time, the CFG formulation lets users adjust the target value of each reward independently at every inference call. For instance, setting the aesthetic target to 0.625 while keeping the others at 1 yields +7 points on GenEval (68 → 75). In theory, it can also steer toward the multi-reward Pareto front, avoiding collapse onto a single dimension.
- Full-Spectrum Reward Supervision for Accelerated Convergence:
    - Function: Turn reward scores into dense supervision signals so the model does not rely solely on the diffusion reconstruction loss to discover "what makes a good image."
    - Mechanism: Theorem 2.2 proves that the MIRO objective preserves the complete data distribution: the marginal \(\sum_s p(s|c) p_\theta(x|c, s)\) equals \(p_\text{data}(x|c)\), with no loss of entropy, \(H(p_\text{marginal}) = H(p_\text{data})\). At the same time, because the model observes "how good this image is across 7 dimensions" at every training step, it receives 7-way dense labels, a much higher supervision density than a baseline trained with the denoising loss alone.
    - Design Motivation: This explains why MIRO converges up to 19× faster than the baseline (on Aesthetic): the gain comes from denser supervision, not from model scaling. Preserving the full spectrum also guards against mode collapse and reward hacking: to fit samples in low-score bins, the model must retain the ability to generate "ugly images," which in turn constrains overfitting to a single high-score mode.
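A short derivation makes the link between the guided velocity and the reward-tilted distribution of the second design explicit. It is a sketch under idealized assumptions that the summary above does not spell out: exact conditional velocities, and the linear path \(x_t = (1-t)x + t\epsilon\) with Gaussian \(\epsilon\). By Bayes' rule, \(p(x_t|c,s) = p(s|x_t,c)\,p(x_t|c)/p(s|c)\), so

\[
\nabla_{x_t}\log p(x_t|c,s) = \nabla_{x_t}\log p(s|x_t,c) + \nabla_{x_t}\log p(x_t|c).
\]

Combining the two conditional scores with the CFG weights cancels the shared term \(\nabla_{x_t}\log p(x_t|c)\) in the difference:

\[
(1+\omega)\,\nabla_{x_t}\log p(x_t|c,s^+) - \omega\,\nabla_{x_t}\log p(x_t|c,s^-)
= \nabla_{x_t}\log p(x_t|c,s^+) + \omega\,\nabla_{x_t}\log\frac{p(s^+|x_t,c)}{p(s^-|x_t,c)},
\]

which is the score of \(p(x_t|c,s^+)\big[\tfrac{p(s^+|x_t,c)}{p(s^-|x_t,c)}\big]^\omega\) up to normalization. For the linear path, the velocity is affine in the score,

\[
v(x_t) = \frac{-t\,\nabla_{x_t}\log p_t(x_t) - x_t}{1-t},
\]

and since the weights \((1+\omega)\) and \(-\omega\) sum to 1, applying the same combination to the velocities corresponds to applying it to the scores at each noise level \(t\).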
Loss & Training
The training objective is the multi-conditional Flow Matching loss given above (Eq. 2). To enable CFG, conditions are randomly dropped during training with some probability (standard practice). The overall process requires no post-training RL stage; alignment is achieved in a single training stage, which avoids the instability of reward-model gradient estimation and PPO ratio clipping found in RLHF.
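One training step of this objective can be sketched as follows. The sketch rests on several assumptions that the summary does not specify: the whole reward vector is dropped at once, a dedicated `null_bin` index marks a dropped condition, the drop probability is 0.1, and the network also receives the time `t`. All function names are hypothetical.

```python
import numpy as np

def flow_matching_batch(x, rng):
    """Sample t and noise, then build x_t = (1 - t) x + t eps and the target eps - x."""
    batch = x.shape[0]
    # One time value per sample, broadcastable over the remaining dimensions.
    t = rng.uniform(size=(batch,) + (1,) * (x.ndim - 1))
    eps = rng.standard_normal(x.shape)
    x_t = (1.0 - t) * x + t * eps
    target = eps - x
    return x_t, t, target

def drop_rewards(s_hat, null_bin, drop_prob, rng):
    """Replace a sample's whole reward vector by null_bin with probability drop_prob."""
    drop = rng.uniform(size=(s_hat.shape[0], 1)) < drop_prob
    return np.where(drop, null_bin, s_hat)

def fm_loss(velocity_fn, x, caption, s_hat, rng, null_bin, drop_prob=0.1):
    """Squared error between the predicted velocity and eps - x, averaged over the batch."""
    x_t, t, target = flow_matching_batch(x, rng)
    s_in = drop_rewards(s_hat, null_bin, drop_prob, rng)
    pred = velocity_fn(x_t, t, caption, s_in)  # stand-in for v_theta
    per_sample = np.sum((pred - target) ** 2, axis=tuple(range(1, x.ndim)))
    return float(per_sample.mean())
```

Dropping per reward rather than per vector would be an equally plausible choice; it would let the model learn each reward's marginal effect separately.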
Key Experimental Results
Main Results (GenEval + PartiPrompts, Selection from Table 1)
| Model | Params | Inference TFLOPs | GenEval | Aesthetic | ImageReward | HPSv2 | PickScore |
|---|---|---|---|---|---|---|---|
| SDXL | 2.6B | – | 55 | 5.94 | 0.46 | 0.25 | 0.220 |
| SD3-medium | 2.0B | – | 62 | 6.18 | 1.15 | 0.30 | 0.225 |
| Sana-1.6B | 1.6B | – | 66 | 6.36 | 1.23 | 0.30 | 0.228 |
| FLUX-dev | 12.0B | 1540 | 67 | 6.56 | 1.19 | 0.30 | 0.229 |
| Baseline (real cap.) | 0.36B | 4.16 | 52 | 5.18 | 0.52 | 0.25 | 0.212 |
| MIRO (real cap.) | 0.36B | 4.16 | 57 | 6.28 | 1.06 | 0.29 | 0.220 |
| MIRO (50% synth.) | 0.36B | 4.16 | 68 | 6.28 | 1.11 | 0.29 | 0.220 |
| MIRO† (synth. + \(\hat{s}^+_\text{aes}=0.625\)) | 0.36B | 4.16 | 75 | 5.24 | 1.18 | 0.29 | 0.220 |
| ImageReward-Scaled MIRO (128 samples) | 0.36B | 532 | 75 | 6.28 | 1.61 | 0.30 | 0.223 |
Key comparison: 0.36B MIRO with 50% synthetic captions scores 68 on GenEval, surpassing 12B FLUX-dev (67) with 370× less training compute. Even with 128-sample best-of-N, MIRO's inference compute (532 vs. 1540 TFLOPs) is still about 2.9× lower.
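The inference-compute comparison follows directly from the table: 128 samples at 4.16 TFLOPs each give the 532 TFLOPs in the last row, and FLUX-dev's 1540 TFLOPs is a little under three times that. The quick check below uses only the table's numbers and assumes the cost of scoring candidates with the reward model is not counted.

```python
single_sample_tflops = 4.16  # MIRO, one sample (table above)
num_samples = 128            # best-of-N budget
flux_tflops = 1540.0         # FLUX-dev, one sample (table above)

best_of_n_tflops = single_sample_tflops * num_samples
ratio = flux_tflops / best_of_n_tflops
print(f"{best_of_n_tflops:.2f} TFLOPs, ratio {ratio:.2f}x")
# prints "532.48 TFLOPs, ratio 2.89x"
```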
Ablation Study (Figure 3 Training Curves)
| Configuration | Aesthetic | ImageReward | PickScore | HPSv2 |
|---|---|---|---|---|
| Baseline convergence steps | ~500k | ~500k | ~500k | ~500k |
| MIRO steps to reach baseline final state | 26k | 135k | 143k | 79k |
| Speedup | 19.1× | 3.7× | 3.5× | 6.3× |
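
The speedup row is the ratio of the two step counts. A small check, taking the approximate 500k baseline figure at face value, is shown here. With that nominal value the Aesthetic ratio comes out near 19.2 rather than the reported 19.1, a gap that is presumably due to rounding of the step counts.

```python
baseline_steps = 500_000  # approximate; the table reports "~500k"
miro_steps = {
    "Aesthetic": 26_000,
    "ImageReward": 135_000,
    "PickScore": 143_000,
    "HPSv2": 79_000,
}

# Speedup = baseline steps / MIRO steps needed to match the baseline's final score.
speedups = {name: baseline_steps / steps for name, steps in miro_steps.items()}
for name, value in speedups.items():
    print(f"{name}: {value:.1f}x")
```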
Synthetic caption breakdown (Figure 5 + Table 1): The baseline improves GenEval from 52 to 57 with synthetic captions. MIRO improves from 57 to 68, indicating that multi-reward conditioning and synthetic captions are synergistic rather than redundant. The largest single-category improvements: Position 30→46 (+53%), Counting 44→61 (+39%).
Key Findings
- Reward supervision density directly translates to training speed: The 19× Aesthetic speedup far exceeds the other dimensions (HPSv2 6.3×, ImageReward 3.7×, PickScore 3.5×), suggesting that the denser and easier to learn a reward signal is, the larger the speedup. Even for the hardest reward (PickScore) there is a 3.5× speedup, indicating that full-spectrum dense supervision helps across the board.
- Single-reward training = Explicit reward hacking experiment: GenEval for single Aesthetic conditioning is only 33 (19 points lower than baseline); while aesthetics surged to 6.65, semantic alignment collapsed. MIRO achieved Aesthetic 6.28 and GenEval 57, verifying that mutual constraints between multiple rewards are necessary to avoid collapse.
- Test-time best-of-N efficiency comparison is striking: On the ImageReward dimension, MIRO with 8 samples ≈ Baseline with 128 samples (16×). For PickScore, MIRO 4 samples ≈ Baseline 128 samples (32×). In Aesthetic and HPSv2, MIRO single-sample results already exceed the baseline's 128-sample upper bound.
- Trading off via inference reward targets: Reducing \(\hat{s}^+_\text{aesthetic}\) from 1 to 0.625 increased GenEval from 68 to 75; setting it to 0 collapsed aesthetics while semantic scores peaked. This indicates that MIRO models the trade-off surface in multi-reward space rather than being fixed to a single operating point chosen at training time.
- Cross-metric generalization: MIRO using HPSv2 for best-of-N selection achieved 1.35 on ImageReward, surpassing models trained specifically for ImageReward (1.04), which suggests that multi-reward training learns a more general notion of "good images."
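The best-of-N procedure behind the test-time scaling comparisons can be sketched as below. The sampler and reward here are toy stand-ins (hypothetical, random arrays scored by their mean); in practice they would be the T2I model and a reward model such as ImageReward or HPSv2.

```python
import numpy as np

def best_of_n(sample_fn, reward_fn, prompt, n, rng):
    """Draw n candidates for one prompt and keep the one the reward scores highest."""
    candidates = [sample_fn(prompt, rng) for _ in range(n)]
    scores = np.array([reward_fn(image, prompt) for image in candidates])
    best = int(scores.argmax())
    return candidates[best], float(scores[best])

# Toy stand-ins (assumptions, not the paper's models).
def toy_sampler(prompt, rng):
    return rng.standard_normal((8, 8))

def toy_reward(image, prompt):
    return float(image.mean())

# Same seed for both runs, so the 128-sample pool contains the single-sample draw.
_, score_1 = best_of_n(toy_sampler, toy_reward, "a red cube", 1, np.random.default_rng(0))
_, score_128 = best_of_n(toy_sampler, toy_reward, "a red cube", 128, np.random.default_rng(0))
print(score_1, score_128)
```

Inference cost grows linearly with `n`, which is why a model that matches the baseline's 128-sample score with 4 or 8 samples is 32× or 16× cheaper at the same quality.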
Highlights & Insights
- Treating alignment as a condition rather than a target mirrors the step from classifier guidance to classifier-free guidance: instead of optimizing against reward gradients in a separate stage, the reward is encoded into the conditional distribution, which sidesteps the instabilities of RLHF. In principle, the approach could transfer to video generation, 3D generation, code generation, or any field with multiple available reward models.
- Correspondence between theory and engineering: Theorem 2.1 interprets multi-reward CFG as a log-odds gradient of reward-tilted sampling. Theorem 2.2 uses entropy preservation to prove the full spectrum doesn't lose distribution. Theory is not just decoration; it predicts that the two knobs—adjusting \(\omega\) for alignment strength and \(\hat{s}\) for direction—must be effective.
- Real training efficiency stems from supervision density, not parameters: MIRO 0.36B beating FLUX 12B does not rely on new architectures or more data, but on feeding rewards back into the training loop, which removes the redundant phase where the model must re-learn what a good image is from reconstruction alone. This offers a path for researchers who want small models to catch up with large ones: finding denser supervision can be more cost-effective than scaling parameters.
Limitations & Future Work
- Blind spots in off-the-shelf reward models: MIRO's ceiling is capped by the coverage of the 7 reward models. If a dimension (e.g., copyright risk, cultural adaptation) lacks an available model, MIRO cannot learn it. The reliability of newer rewards like SciScore is also not fully verified.
- Correlations between rewards are not explicitly modeled: Aesthetic and PickScore are highly correlated, as are VQAScore and CLIP. The paper uses 7 rewards without providing an independence analysis; theory assumes conditional independence, which may not hold, potentially making the effective degrees of freedom much lower than 7.
- Data scale remains small: The setting of 16M images and 0.36B parameters is orders of magnitude smaller than FLUX. Whether the 19× speedup holds when scaling to 100M images or 1B+ parameter models, or whether noise in the reward signals will instead degrade it, remains unverified.
- Lack of automation in inference reward tuning: The "Aesthetic=0.625" setting in MIRO† was manually searched. How a user finds the desired trade-off point when faced with 7 knobs and B bins is an open question, perhaps requiring a meta-learning or preference inference layer.
Related Work & Insights
- vs. RLHF / DPO (Fan 2023, Rafailov 2023): These optimize \(\mathbb{E}_x[r(x)]\) or preference pairs in post-training, requiring separate RL stages and reward model gradient estimates. MIRO feeds rewards as conditions, completing alignment in a single stage without mode collapse, PPO hyperparameters, or post-training phases.
- vs. Coherence-Aware Diffusion (CAD, Dufour 2024): CAD is MIRO's predecessor but only conditions on a single CLIP score to avoid data filtering. MIRO extends this to 7 rewards and proposes multi-reward CFG. Table 1 shows that a single CLIP condition version ("CAD-like CLIP" row) scores 57 on GenEval—matching MIRO real-caption—but on synthetic captions, MIRO gains another 11 points, proving multi-reward scalability far exceeds single-reward.
- vs. Parrot (Lee 2024, ECCV): Parrot also performs multi-reward T2I but follows a Pareto-optimal RL route requiring complex multi-objective PPO. MIRO bypasses RL entirely, using only conditioning + CFG to reach similar multi-objective balancing with an order of magnitude simpler engineering.
- vs. Synthetic Captioning (DALL·E 3 route): Synthetic captioning improves the baseline from 52 to 57 as a standalone method. MIRO is orthogonal to synthetic captioning; when combined, they reach 68, proving the two routes solve different problems (captions solve text-side noise, MIRO solves reward-side supervision density).
- Insight: Any scenario with "multiple available evaluators but only a single loss used in training" can adopt the MIRO paradigm—treating evaluators as conditions rather than optimization targets. This applies to unit test pass rate, execution correctness, and style scores in code generation; or motion smoothness, physical plausibility, and temporal consistency in video generation.
Rating
- Novelty: ⭐⭐⭐⭐ "Reward as condition rather than target" is a paradigm shift in the RLHF era, but since CAD already proved this for single rewards, MIRO is a solid extension rather than a complete subversion.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 7 single-reward baselines + synthetic caption breakdown + training curves + best-of-N + cross-metric generalization + weight sweeps. Every argument is backed by targeted experiments without cherry-picking.
- Writing Quality: ⭐⭐⭐⭐ The three-stage pain point analysis is very clear. Method equations + theorems + intuitive descriptions are well-balanced. The 8-in-1 radar chart in Figure 2 is dense but information-rich.
- Value: ⭐⭐⭐⭐⭐ The result of 0.36B beating 12B with 370× less training compute, if reproducible, implies that SFT + RLHF stages could be entirely removed from industrial T2I pipelines, which has high long-term impact.