Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning
Conference: NeurIPS 2025 · arXiv: 2506.01480 · Project page: https://janus-pro-r1.github.io · Area: Multimodal Large Language Models / Image Generation · Keywords: MLLM, visual generation, reinforcement learning, Chain-of-Thought, Aha Moment
TL;DR
This paper proposes Janus-Pro-R1, which achieves synergistic advancement in visual understanding and generation through a two-stage training pipeline (SFT + RL). The approach enables MLLMs to form genuine Chain-of-Thought reasoning and trigger Aha Moments during text-to-image generation, surpassing GPT-4o on GenEval while extending naturally to image editing tasks.
Background & Motivation
Although contemporary MLLMs unify visual understanding and generation under a shared next-token prediction framework, the two capabilities in practice remain largely independent — visual understanding does not enhance visual generation, and the powerful reasoning mechanisms of LLMs are not fully integrated into image generation. State-of-the-art models such as Janus-Pro still produce unsatisfactory results in text-to-image generation and support only pure-text inputs for generation.
The root cause lies in the absence of synergy between visual understanding and generation. The paper's starting point is to make understanding and generation collaborate naturally within a single MLLM, turning image generation into an iterative introspective process: the model generates an image, evaluates it, identifies problems, and regenerates, thereby forming a genuine CoT reasoning chain and triggering an "Aha Moment."
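To make the loop concrete, here is a minimal sketch of the generate, evaluate, regenerate cycle, assuming a hypothetical model interface: `generate_image` and `self_evaluate` stand in for the MLLM's image-token decoding and its textual self-judgment, which the paper does not prescribe in this exact form.

```python
def introspective_generate(model, prompt: str, max_rounds: int = 3):
    """Sketch of the introspective CoT loop: generate, self-evaluate,
    and regenerate until the model judges the image consistent."""
    context = [prompt]
    image = model.generate_image(context)      # initial attempt
    for _ in range(max_rounds - 1):
        verdict, critique = model.self_evaluate(prompt, image)
        if verdict == "consistent":            # no correction needed
            break
        context += [image, critique]           # keep the failed attempt in context
        image = model.generate_image(context)  # regenerate from the critique
    return image
```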
Method
Overall Architecture
Janus-Pro-R1 adopts a two-stage training pipeline:

1. **Supervised Fine-Tuning (SFT) stage**: mixed sub-task training teaches the MLLM the foundational ability to construct a visual generation CoT.
2. **Reinforcement Learning (RL) stage**: the GRPO algorithm unlocks the model's full potential through exploration–exploitation trade-offs, elevating behavior from imitation to genuine reasoning.
Key Designs
- **Mixed Sub-Task Training (SFT stage)**: the visual generation CoT is decomposed into three sub-tasks that are mixed during training:
- Task-I Text-to-Image Generation: Text-image pairs with semantic consistency score \(S \geq 0.8\) are selected for standard T2I training.
- Task-II Text-Image Consistency Self-Evaluation: The model is trained to judge whether a generated image is semantically consistent with the text and to provide justification.
- Task-III Image Regeneration: Given the context of a previously incorrect generation, the model is trained to correct errors and regenerate a more accurate image.
- Training data consists of 200K prompts; for each prompt, 18 images are generated by FLUX and Janus-Pro, and semantic consistency scores are assigned by InternVL2.5-26B (a sketch of this pipeline follows this item).
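As referenced above, a sketch of what this data pipeline and mixing scheme might look like; `score` is a hypothetical stand-in for the InternVL2.5-26B consistency scorer, and the 0.2:0.3:0.5 ratio is the one reported in the Loss & Training section below.

```python
import random

def build_task1_pairs(prompts, candidates, score, threshold=0.8):
    """Task-I data: keep only text-image pairs with consistency S >= 0.8."""
    return [(p, img) for p in prompts
            for img in candidates[p] if score(p, img) >= threshold]

def sample_sft_batch(task1, task2, task3, batch_size=32):
    """Draw a mixed SFT batch over the three sub-tasks at 0.2:0.3:0.5."""
    pools = random.choices([task1, task2, task3],
                           weights=[0.2, 0.3, 0.5], k=batch_size)
    return [random.choice(pool) for pool in pools]
```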
- **Dual-Layer QA Reward for RL (RL stage)**:
- GRPO is employed, treating image generation as a long token-level Markov decision process.
- A dual-layer reward is designed: a generation reward \(R^{Gen}\) (evaluating the quality of the generated image) and a comprehension reward \(R^{Comp}\) (evaluating the accuracy of self-assessment).
- The generation reward assigns higher weight to the final output image; the comprehension reward measures the alignment between the model's self-evaluation and an external evaluator.
- InternVL2.5-26B serves as the reward model, requiring no ground-truth image annotations (see the reward sketch below).
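A hedged sketch of the dual-layer reward; the weighting of the final image, the 0.8 consistency threshold, and the simple sum of the two rewards are assumptions for illustration, with `qa_score` standing in for the InternVL2.5-26B reward model (scores in [0, 1]).

```python
def dual_layer_reward(prompt, images, self_verdicts, qa_score, final_weight=2.0):
    """R^Gen rewards image quality across the CoT (final image weighted higher);
    R^Comp rewards agreement between self-evaluation and the external evaluator."""
    # R^Gen: weighted consistency of every generated image in the trajectory
    weights = [1.0] * (len(images) - 1) + [final_weight]
    r_gen = sum(w * qa_score(prompt, img) for w, img in zip(weights, images))
    r_gen /= sum(weights)

    # R^Comp: did the model's own verdicts match the evaluator's judgment?
    matches = [v == (qa_score(prompt, img) >= 0.8)
               for v, img in zip(self_verdicts, images)]
    r_comp = sum(matches) / max(len(matches), 1)
    return r_gen + r_comp
```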
- **Extension from Text-to-Image Generation to Image Editing**:
- Image editing inherently requires understanding editing instructions and generating new images, sharing the same foundation as introspective generation.
- The same two-stage SFT + RL pipeline is applied.
- RL rewards include a following score \(R^{flw}\) (whether editing instructions are accurately executed) and a preservation score \(R^{psv}\) (whether unedited regions remain unchanged); a combined-reward sketch follows this list.
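A minimal sketch of the combined editing reward under stated assumptions: `follow_score` is a hypothetical evaluator call for instruction compliance, preservation is approximated by pixel agreement outside the edit mask, and the equal weighting is illustrative (the paper's exact formulation may differ).

```python
import numpy as np

def editing_reward(src, edited, instruction, edit_mask, follow_score, alpha=0.5):
    """R = alpha * R^flw + (1 - alpha) * R^psv, for images in [0, 1]."""
    r_flw = follow_score(src, edited, instruction)         # instruction followed?
    keep = ~edit_mask                                      # unedited region
    r_psv = 1.0 - np.abs(src[keep] - edited[keep]).mean()  # unchanged pixels
    return alpha * r_flw + (1 - alpha) * r_psv
```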
Loss & Training
- SFT stage: Three sub-tasks are mixed at a 0.2:0.3:0.5 ratio, trained for 50K steps with a learning rate of 2e-5.
- RL stage: GRPO objective with a KL-divergence constraint (\(\beta = 0.05\)), group size 7, trained for 3K steps (see the sketch after this list).
- Training stability techniques: linear + cosine learning rate scheduler; when the reward curve declines, the learning rate is reduced or the reference model is updated.
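As referenced above, a minimal sketch of the GRPO update with the reported settings (group size 7, \(\beta = 0.05\)); the PPO-style clipping threshold and the k3 KL estimator are common GRPO choices assumed here, not details confirmed by the paper.

```python
import torch

def grpo_loss(logp, logp_old, logp_ref, rewards, beta=0.05, clip_eps=0.2):
    """Group-relative policy optimization over a group of rollouts (here 7):
    advantages are rewards normalized within the group, plus a KL penalty
    that keeps the policy close to the reference model."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # group-relative advantage
    ratio = torch.exp(logp - logp_old)                          # importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy = -torch.min(ratio * adv, clipped * adv).mean()      # clipped surrogate
    kl = (torch.exp(logp_ref - logp) - (logp_ref - logp) - 1).mean()  # k3 estimator
    return policy + beta * kl
```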
Key Experimental Results
Main Results (Text-to-Image Generation)
| Benchmark | Metric | Janus-Pro-R1 (Aha) | Janus-Pro-7B | GPT-4o | Gain vs. Janus-Pro-7B |
|---|---|---|---|---|---|
| GenEval | Overall↑ | 0.86 | 0.80 | 0.85 | +7.5% |
| T2I-CompBench | Avg↑ | 72.7 | 49.4 | - | +47.2% |
| DPG-Bench | Score↑ | 85.57 | 84.17 | - | +1.7% |
| GenEval | Counting↑ | 0.66 | 0.59 | 0.85 | +11.9% |
| GenEval | Position↑ | 0.87 | 0.79 | 0.75 | +10.1% |
| GenEval | ColorAttr↑ | 0.78 | 0.66 | 0.66 | +18.2% |
Ablation Study
| Configuration | GenEval Overall | Note |
|---|---|---|
| Janus-Pro-7B baseline | 0.80 | Original model |
| SFT (w/o aha) | 0.81 | SFT yields marginal improvement |
| SFT (with aha) | 0.81 | Introspective behavior emerges but does not yet improve the score |
| R1 (w/o aha) | 0.83 | RL significantly improves initial generation quality |
| R1 (with aha) | 0.86 | Introspection + RL achieves best performance |
| R1-1B (with aha) | 0.71 | Small models cannot effectively activate Aha |
| Task-I SFT only | 0.79 | Single-task underperforms mixed training |
| Task-II+III SFT only | 0.76 | Lacking foundational T2I capability |
Key Findings
- Janus-Pro-R1 surpasses GPT-4o on GenEval Overall (0.86 vs. 0.85).
- SFT tends toward imitative memorization, whereas RL enables genuine generalization (verified through counterfactual generation, e.g., "a square apple").
- The 1B-parameter model cannot effectively trigger the Aha Moment, reflecting a scaling law effect.
- Data quality matters more than quantity: high-threshold filtering reduces data volume but significantly improves performance.
- As a semantic image evaluator, Janus-Pro-R1 achieves 81.1% agreement with GenEval standards, outperforming InternVL2.5-8B.
Highlights & Insights
- The "Aha Moment" concept is elegantly formulated: it reframes image generation, whether by diffusion or AR models, as an iterative introspective process in which the model detects and corrects its own errors.
- The dual-layer reward design is well-motivated: Simultaneously rewarding generation quality and self-evaluation accuracy drives genuine synergy between understanding and generation.
- RL serves as the critical catalyst: SFT provides only a cold start; RL is what transforms imitation into genuine reasoning.
- A principled path toward unified generation: The paper demonstrates that the understanding–generation synergy can naturally extend to advanced tasks such as image editing.
- Counterfactual generation cases (e.g., "a square apple") effectively illustrate the reasoning generalization capacity acquired through RL.
Limitations & Future Work
- Training data for counting-related tasks is scarce, leaving counting performance behind GPT-4o.
- The image editing dataset is of limited aesthetic quality, affecting the visual appeal of editing results.
- The understanding–generation synergy is currently validated only on simple T2I tasks; more complex interleaved image-text generation remains unexplored.
- The 1B model does not benefit from this training paradigm, limiting applicability at smaller scales.
- Training requires 32 A800 GPUs, imposing considerable computational demands.
Related Work & Insights
- Compared to contemporaneous works that introduce CoT — such as T2I-R1, MINT, and GOT — this paper emphasizes that CoT should emerge organically from deep model reasoning rather than being enforced through explicit textual planning.
- The GRPO algorithm from DeepSeek-R1 is adapted to apply RL to visual generation reasoning.
- Implication for unified multimodal models: RL may be the key training paradigm for achieving genuine understanding–generation synergy.
Rating
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐