# MMaDA: Multimodal Large Diffusion Language Models
**Conference:** NeurIPS 2025 | **arXiv:** 2505.15809 | **Code:** GitHub | **Area:** Diffusion Models / Multimodal Foundation Models | **Keywords:** diffusion language models, unified multimodal architecture, UniGRPO, mixed long chain-of-thought, discrete diffusion
## TL;DR
This paper presents MMaDA, the first multimodal foundation model that simultaneously achieves text reasoning, multimodal understanding, and text-to-image generation within a unified discrete diffusion architecture. MMaDA bridges the gap between diffusion model pre-training and post-training through mixed long chain-of-thought (CoT) fine-tuning and the UniGRPO reinforcement learning algorithm.
## Background & Motivation
Multimodal large models are undergoing an architectural paradigm shift:
Pure Autoregressive (AR): Models such as Emu3 and Janus unify all modalities via next-token prediction. The architecture is simple, but visual generation quality is limited.
AR + Diffusion (two models): Models such as DreamLLM use AR for text and continuous diffusion for vision, requiring two separate models.
AR + Diffusion (one model): Models such as Show-o and Transfusion mix AR and diffusion objectives within a single model, introducing complex hybrid mechanisms.
Nevertheless, existing unified multimodal models severely lack exploration of post-training: they are deployed directly after pre-training, without CoT fine-tuning or reinforcement learning. The gap is especially pronounced in non-autoregressive architectures, because directly adapting AR-paradigm GRPO to diffusion models faces three technical obstacles:

1. Token-level log-likelihoods are defined only over masked positions.
2. The policy distribution is conditioned on the sampled mask ratio.
3. Sequence-level likelihood cannot be accumulated via the autoregressive chain rule.
MMaDA's core positioning: unify all modalities with pure (discrete) diffusion, and for the first time systematically design a complete post-training pipeline for a diffusion foundation model.
## Method

### Overall Architecture
MMaDA's training pipeline consists of three stages:

- **Stage 1 (Pre-training):** unified diffusion objective with joint training on text generation, class-conditional image generation, and image–text understanding (600K steps).
- **Stage 2 (Mixed Long-CoT Fine-tuning):** fine-tuning on cross-modal CoT data to establish reasoning capabilities (50K steps).
- **Stage 3 (UniGRPO Reinforcement Learning):** RL training with diverse reward signals (50K steps).
### Key Designs
- **Unified Discrete Diffusion Architecture:** Built on the LLaDA-8B backbone, MMaDA treats both text and images as masked prediction tasks over discrete tokens. Images are quantized into \(32 \times 32 = 1024\) discrete tokens by MAGVIT-v2 (codebook size 8192). The unified training objective is

  $$
  \mathcal{L}_{\text{unify}}(\theta) = -\mathbb{E}_{t,\,x_0,\,x_t}\left[\frac{1}{t}\sum_{i=1}^{L} \mathbf{I}\!\left[x_t^i = \texttt{[MASK]}\right] \log p_\theta\!\left(x_0^i \mid x_t\right)\right]
  $$

  The core advantage is that language and vision share exactly the same probabilistic formulation and architecture, with no modality-specific components (see the loss sketch after this list).
- **Mixed Long-CoT Fine-tuning:** A unified CoT format, `|<special_token>|<reasoning_process>|<special_token>|<result>`, aligns reasoning processes across tasks. Data sources include:
  - Text reasoning: ReasonFlux, LIMO, OpenThoughts, etc.
  - Multimodal reasoning: correct responses from LMM-R1 on GeoQA/CLEVR.
  - Knowledge-aware image generation: scientific, cultural, and landmark description pairs synthesized by GPT-4.1.

  A key design choice is transferring text reasoning capability to visual generation: the model first performs textual reasoning (analyzing object attributes and spatial relationships) and then generates the image.
- **UniGRPO:** A policy-gradient RL algorithm designed specifically for diffusion models, addressing the three obstacles above:
  - **Structured noise policy:** For each response \(o_i\), a mask ratio \(p_i \in [0,1]\) is uniformly sampled, with random seeds varied across gradient steps. This exposes the model to a wide range of denoising stages, from nearly fully masked to nearly fully unmasked.
  - **Efficient log-likelihood approximation:** The sequence-level log-likelihood is approximated by averaging the token log-probs over the set of masked positions \(\mathcal{M}\):

    $$
    \log \pi'_\theta(o_i \mid q) = \frac{1}{|\mathcal{M}|}\sum_{o_{i,t} \in \mathcal{M}} \log p_\theta\!\left(o_{i,t} \mid q\right)
    $$

    This avoids the 128-sample Monte Carlo estimation used in LLaDA (see the estimator sketch after this list).
  - **Uniform random masking:** Starting steps are sampled randomly, and the remaining denoising steps are distributed uniformly (rather than fully at random), approximating the Monte Carlo average and yielding more stable training.
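To make the unified objective concrete, here is a minimal PyTorch sketch of the masked-token diffusion loss from the first design item; `model`, the mask-token id, and all shapes are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336  # placeholder [MASK] token id; an assumption, not from the paper

def unified_diffusion_loss(model, x0, mask_id=MASK_ID):
    """L_unify: masked-token prediction over a shared discrete vocabulary.

    x0: (B, L) clean tokens -- text tokens or MAGVIT-v2 image codes alike.
    """
    B, L = x0.shape
    # Forward process: sample noise level t ~ U(0, 1] per sequence, then mask
    # each token independently with probability t.
    t = torch.rand(B, 1).clamp_min(1e-3)              # (B, 1)
    is_masked = torch.rand(B, L) < t                  # (B, L) boolean
    x_t = torch.where(is_masked, torch.full_like(x0, mask_id), x0)

    logits = model(x_t)                               # (B, L, V)
    nll = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (B, L)

    # Only masked positions contribute, reweighted by 1/t as in L_unify.
    return ((nll * is_masked).sum(dim=1) / t.squeeze(1)).mean()
```

Because text tokens and image codes live in one shared vocabulary, the same loss covers all three pre-training tasks; only the data changes.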
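Likewise, a schematic of UniGRPO's log-likelihood approximation under a sampled mask ratio: the question stays clean, the response is partially masked, and the log-probs of the masked positions are averaged. Function and variable names are again assumptions for illustration:

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336  # same placeholder [MASK] id as in the loss sketch above

def unigrpo_logprob(model, question, response, mask_id=MASK_ID):
    """Approximate log pi'_theta(o | q) as the mean log-prob over masked positions."""
    # Structured noise: one mask ratio p ~ U(0, 1] per response, re-sampled
    # (with fresh seeds) at every gradient step.
    p = torch.rand(()).clamp_min(1e-3)
    masked = torch.rand(response.shape) < p           # (Lo,) boolean
    if not bool(masked.any()):                        # guarantee a non-empty mask set
        masked[torch.randint(len(response), (1,))] = True

    noisy = torch.where(masked, torch.full_like(response, mask_id), response)
    x_t = torch.cat([question, noisy]).unsqueeze(0)   # (1, Lq+Lo); question kept clean

    logits = model(x_t)[0, len(question):]            # (Lo, V): response positions only
    token_logp = F.log_softmax(logits, dim=-1).gather(
        -1, response.unsqueeze(-1)).squeeze(-1)       # (Lo,)

    # Average over the masked set M -- the single-pass surrogate that replaces
    # LLaDA's 128-sample Monte Carlo estimate.
    return token_logp[masked].mean()
```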
### Loss & Training
Diverse reward modeling:
- Text reasoning: correctness reward 2.0 + format reward 0.5 (for following the `<think>...</think>` format).
- Multimodal reasoning: correctness + format + a CLIP alignment reward \(0.1 \cdot \text{CLIP}(\text{image}, \text{text})\).
- Image generation: CLIP reward + ImageReward (human preference score), both scaled by 0.1.
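The scalarization is simple enough to write down; a hedged sketch of how the listed rewards might combine, where the weights come from the list above and helper names such as `clip_score` and `image_reward` are hypothetical stand-ins for the actual scorers:

```python
def text_reasoning_reward(correct: bool, well_formatted: bool) -> float:
    # Correctness dominates (2.0); format compliance adds a smaller bonus (0.5).
    return 2.0 * correct + 0.5 * well_formatted

def multimodal_reasoning_reward(correct, well_formatted, image, text, clip_score) -> float:
    # Adds a scaled CLIP image-text alignment term on top of the text rewards.
    return text_reasoning_reward(correct, well_formatted) + 0.1 * clip_score(image, text)

def image_generation_reward(image, prompt, clip_score, image_reward) -> float:
    # Alignment (CLIP) plus human preference (ImageReward), each scaled by 0.1.
    return 0.1 * clip_score(image, prompt) + 0.1 * image_reward(image, prompt)
```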
Sampling strategies:

- Text generation: semi-autoregressive denoising over 64-token blocks, unmasking 2 tokens per step, selected by confidence (low-confidence predictions are remasked).
- Image generation: parallel non-autoregressive decoding with a cosine mask schedule, 50 denoising steps, CFG = 3.5.
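A minimal sketch of one block of the text sampler, assuming (per my reading of the setup above) that each step commits the most confident predictions among still-masked positions and keeps the rest masked; the released sampler may differ in details:

```python
import torch

def denoise_block(model, seq, start, end, mask_id, tokens_per_step=2):
    """Decode one 64-token block of a (1, L) sequence; [start, end) begins all-[MASK]."""
    while bool((seq[0, start:end] == mask_id).any()):
        logits = model(seq)[0, start:end]               # (block, V)
        conf, pred = torch.softmax(logits, dim=-1).max(dim=-1)
        still_masked = seq[0, start:end] == mask_id
        conf = conf.masked_fill(~still_masked, -1.0)    # never revisit committed tokens
        k = min(tokens_per_step, int(still_masked.sum()))
        top = conf.topk(k).indices                      # most confident masked slots
        seq[0, start + top] = pred[top]                 # commit; the rest stay masked
    return seq
```

Blocks are decoded left to right, so text generation stays autoregressive at the block level while parallel within each block; image tokens instead get fully parallel passes over all 1024 positions under the cosine schedule.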
## Key Experimental Results

### Main Results

**Multimodal Understanding Benchmarks**
| Model | POPE | MME | VQAv2 | GQA | MMMU | MMB | SEED |
|---|---|---|---|---|---|---|---|
| LLaVA-v1.5 | 85.9 | 1510.7 | 78.5 | 62.0 | 35.4 | 64.3 | 58.6 |
| Show-o | 80.0 | 1097.2 | 69.4 | 58.0 | 26.7 | - | - |
| MMaDA | 86.1 | 1410.7 | 76.7 | 61.3 | 30.2 | 68.5 | 64.2 |
**Text-to-Image Generation**
| Model | WISE Cultural↑ | ImageReward↑ | CLIP Score↑ | GenEval Overall↑ |
|---|---|---|---|---|
| SDXL | 0.43 | 1.13 | 32.12 | 0.55 |
| Janus | 0.16 | 1.03 | 29.45 | 0.61 |
| Show-o | 0.28 | 0.92 | 28.94 | 0.53 |
| MMaDA | 0.67 | 1.15 | 32.46 | 0.63 |
**Text Reasoning**
| Model | Architecture | MMLU | GSM8K | MATH | ARC-C |
|---|---|---|---|---|---|
| LLaMA-3-8B | AR | 64.5 | 53.1 | 15.1 | 53.1 |
| Qwen2-7B | AR | 70.3 | 80.2 | 43.5 | 60.6 |
| LLaDA-8B | Diffusion | 65.9 | 70.7 | 27.3 | 47.9 |
| MMaDA-8B | Diffusion | 68.4 | 73.4 | 36.0 | 57.4 |
### Ablation Study
| Stage | GSM8K | MATH500 | GeoQA | CLEVR | CLIP Score | ImageReward |
|---|---|---|---|---|---|---|
| Stage 1 (Pre-training) | 17.4 | 4.2 | 8.3 | 10.3 | 23.1 | 0.69 |
| + Mixed Long-CoT | 65.2 | 26.5 | 15.9 | 27.5 | 29.4 | 0.84 |
| + UniGRPO | 73.4 | 36.0 | 21.0 | 34.5 | 32.5 | 1.15 |
| Masking Strategy | Performance | Notes |
|---|---|---|
| d1 (question masked + answer fully masked) | Slow convergence, low reward | Ignores the multi-step nature of diffusion models |
| Fully random mask rate | Unstable training | High reward variance |
| UniGRPO (uniform random) | Fast convergence, high reward | Approximates Monte Carlo average |
### Key Findings
- Cross-modal synergy: All metrics across the three tasks (text / understanding / generation) improve simultaneously during Stage 2 training; text reasoning capability directly improves the semantic accuracy of image generation.
- Sampling efficiency: Image generation maintains strong performance with as few as 15 denoising steps (CLIP 31.7 vs. 32.8 with the full 1024 steps); text generation converges at 256 steps.
- Native support for inpainting tasks: The diffusion model can be directly applied to text span prediction, VQA answer completion, and image inpainting without additional fine-tuning.
## Highlights & Insights
- The first fully diffusion-based multimodal foundation model to undergo systematic post-training, demonstrating that diffusion models can perform not only generation but also understanding and reasoning.
- UniGRPO's uniform random masking strategy is both efficient and stable, resolving the key technical obstacles in adapting GRPO to diffusion models.
- A substantial lead on the WISE Cultural benchmark (0.67 vs. 0.43 for SDXL) demonstrates that reasoning enhancement genuinely benefits knowledge-intensive image generation.
- The inherent parallel decoding and inpainting capabilities of diffusion models constitute structural advantages over AR models.
## Limitations & Future Work
- A performance gap relative to leading AR models on pure-text tasks remains (MMLU 68.4 vs. 70.3 for Qwen2-7B; GSM8K 73.4 vs. 80.2).
- Image resolution is limited to 512×512; high-resolution generation has not been explored.
- Multimodal understanding falls short of LLaVA-v1.5 on certain benchmarks (e.g., MME 1410.7 vs. 1510.7).
- Training cost is high (64 A100 GPUs), limiting scalability.
- Video generation and the unification of additional modalities remain unexplored.
## Related Work & Insights
- MMaDA forms a direct contrast with Show-o (AR + diffusion hybrid), demonstrating that a pure diffusion approach can match or surpass hybrid architectures.
- LLaDA lays the foundation for text diffusion models; MMaDA extends it to the multimodal domain and adds a complete post-training pipeline for the first time.
- UniGRPO complements the GRPO approach of RLVR-World (another paper in this batch)—the former targets discrete diffusion models, while the latter targets autoregressive video models.
- This work may inspire exploration of post-training paradigms for larger-scale diffusion foundation models.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ The first to achieve a three-in-one system (text + understanding + generation) under a pure diffusion architecture with systematic post-training; a pioneering contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple benchmarks across three task categories with thorough ablations, though comparisons with larger models (e.g., 13B/70B) are absent.
- Writing Quality: ⭐⭐⭐⭐ The methodology is presented clearly and systematically, though the overall paper structure is somewhat verbose.
- Value: ⭐⭐⭐⭐⭐ Opens a new direction for diffusion models as general-purpose foundation models; UniGRPO has broad applicability.