One Diffusion to Generate Them All¶

Conference: CVPR 2025
arXiv: 2411.16318
Code: GitHub
Area: 3D Vision/Unified Generation
Keywords: Unified Diffusion Models, Multi-task Generation, Flow Matching, Multi-view Generation, Conditional Image Generation

TL;DR¶

This work proposes OneDiffusion, a 2.8B-parameter unified diffusion model that treats all conditional and target images as one sequence of frames, each with its own noise scale. A single model supports multiple tasks, including text-to-image, conditional generation, depth estimation, segmentation, multi-view generation, and ID customization.

Background & Motivation¶

Current diffusion models are usually trained independently for single tasks, lacking the generalizability of LLMs.
Controllable generation relies on external modules (e.g., ControlNet requires dedicated conditional encoders, and personalization models need face recognition networks and auxiliary losses).
Input requirements vary dramatically across tasks: multi-view generation needs to handle arbitrary combinations of input/output views and camera poses, while understanding tasks require outputting depth, poses, or segmentation maps.
Existing training schemes are highly optimized for specific tasks and fail to generalize across tasks.
Large language models (e.g., GPT-4) have demonstrated the value of general-purpose models, inspiring the pursuit of similar unity in diffusion models.
A unified framework is needed that supports bidirectional image synthesis and understanding without specialized architectures or external losses.

Method¶

Overall Architecture¶

OneDiffusion casts every task as frame sequence modeling. Each sample is a set of "views" $\{\mathbf{x}_i\}_{i=1}^N$. During training, a separate noise timestep $t_i$ is sampled independently for each view, and the model learns a joint velocity field $v_\theta(t_1,...,t_N, \mathbf{x}_1,...,\mathbf{x}_N)$ over the noised views. During inference, conditional views are held clean, with their timestep fixed at the noise-free value (written $t=0$, i.e., zero noise level), while target views are generated by integrating the learned velocity field starting from Gaussian noise. Tasks are distinguished via task tags in the prompt (e.g., [[text2image]], [[multiview]]).
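The inference procedure described above can be sketched as a plain Euler integration. This is a toy illustration under stated assumptions, not the authors' sampler: `sample_targets` and `velocity_fn` are hypothetical names, `velocity_fn` stands in for the trained network, and the code assumes a linear interpolation path in which $t=1$ is clean data and $t=0$ is pure noise, so "zero noise" for a condition view corresponds to pinning it at the clean endpoint.

```python
import numpy as np

def sample_targets(velocity_fn, views, is_condition, num_steps=50, seed=0):
    """Euler sampler: condition views stay clean, target views start from noise.

    views: array (N, D); rows flagged in is_condition hold clean data.
    velocity_fn(t, x) -> array (N, D) is a hypothetical stand-in for v_theta.
    """
    rng = np.random.default_rng(seed)
    is_condition = np.asarray(is_condition, dtype=bool)
    x = np.array(views, dtype=float)  # copy, so the caller's array is untouched
    # Target views are initialised with Gaussian noise (t = 0 in this convention).
    x[~is_condition] = rng.standard_normal(x[~is_condition].shape)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        # Per-view timesteps: conditions pinned at the clean endpoint, targets at t.
        t = np.where(is_condition, 1.0, step * dt)
        v = velocity_fn(t, x)
        # Only target views are updated; condition views are never modified.
        x[~is_condition] += dt * v[~is_condition]
    return x

# Toy check: an oracle velocity that points straight at a known target.
target = np.array([[1.0, 2.0], [3.0, 4.0]])

def oracle_velocity(t, x):
    # Velocity of the straight path toward `target`; guarded where t = 1.
    return (target - x) / np.maximum(1.0 - t[:, None], 1e-8)

result = sample_targets(oracle_velocity, target, is_condition=[True, False])
```

Because the per-view timestep vector is the only thing that separates conditions from targets, switching tasks in this sketch only requires changing `is_condition`.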

Key Designs¶

1. Frame Sequence Modeling with Varying Noise Scales

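Function: Unifies all conditional generation and prediction tasks under a single training objective.
Mechanism: Conditional and target images are treated as different "views" of one sequence. During training, a timestep is sampled independently for each view, $t_i \sim \text{LogNorm}(0,1)$ (presumably the logit-normal distribution, since $t_i$ must lie in $[0,1]$), and the forward process is defined as $\mathbf{x}_i^{t_i} = t_i\mathbf{x}_i + (1-t_i)\epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, I)$. During inference, condition views are held clean ($t_{\setminus K}=0$, to be read as zero noise level; under the interpolation above the noise-free endpoint is $t_i=1$), while all target views share one timestep $t_K=t$ that is integrated from noise to data. This yields conditional sampling.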
Design Motivation: Varying noise scales naturally distinguish conditions from targets, eliminating the need to design distinct conditioning mechanisms (such as zero convolutions in ControlNet or adapters in IP-Adapter) for different tasks.
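The per-view noising step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: `sample_timesteps` and `noise_views` are hypothetical helper names, and $\text{LogNorm}(0,1)$ is interpreted as a logit-normal distribution (the sigmoid of a standard normal draw), which keeps every $t_i$ inside $(0,1)$.

```python
import numpy as np

def sample_timesteps(num_views, rng):
    """Draw one timestep per view; assumes LogNorm(0, 1) means logit-normal."""
    z = rng.standard_normal(num_views)
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid maps each draw into (0, 1)

def noise_views(views, t, rng):
    """Apply x_i^{t_i} = t_i * x_i + (1 - t_i) * eps_i independently per view."""
    views = np.asarray(views, dtype=float)
    eps = rng.standard_normal(views.shape)
    # Broadcast each view's scalar timestep over its remaining dimensions.
    t_b = np.reshape(t, (-1,) + (1,) * (views.ndim - 1))
    return t_b * views + (1.0 - t_b) * eps, eps

rng = np.random.default_rng(0)
views = np.ones((3, 4, 4))               # three toy "views" of shape 4x4
t = sample_timesteps(len(views), rng)    # a different noise level for each view
noisy, eps = noise_views(views, t, rng)
```

In this convention a view with `t` close to 1 is almost clean and can act as a condition, while a view with `t` close to 0 is almost pure noise and acts as a target.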

2. Multi-View Scheme Unified by Plücker Ray Encoding

Function: Supports multi-view generation and camera pose estimation.
Mechanism: Camera rays are represented using Plücker coordinates $\bm{r}=(\bm{o}\times\bm{d}, \bm{d})$. Ray embeddings are appended as independent "views" following the image latents to form a sequence (rather than being concatenated along channels). Consequently, rays can either serve as conditions to generate multi-view images, or be treated as noisy signals to predict camera poses.
Design Motivation: Treating ray embeddings as independent views instead of channel-wise concatenations allows the number of views $N$ to remain flexible, naturally supporting pose estimation (the inverse problem).
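The Plücker representation $\bm{r}=(\bm{o}\times\bm{d}, \bm{d})$ is easy to compute directly. The sketch below is illustrative only: `plucker_rays` is a hypothetical helper, and normalising the directions to unit length is an assumption made here rather than something stated in the text.

```python
import numpy as np

def plucker_rays(origin, directions):
    """Return (o x d, d) for each ray; directions are normalised first (assumption)."""
    d = np.asarray(directions, dtype=float)
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)
    # One camera origin is shared by every ray of that camera.
    o = np.broadcast_to(np.asarray(origin, dtype=float), d.shape)
    moment = np.cross(o, d)
    return np.concatenate([moment, d], axis=-1)  # shape (..., 6)

# Two rays leaving a camera placed at (0, 0, 1).
rays = plucker_rays([0.0, 0.0, 1.0], [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
```

The moment $\bm{o}\times\bm{d}$ does not change when the origin slides along the ray, so each 6-vector identifies the ray itself rather than a particular point on it.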

3. Task Tags + Rich Textual Conditioning

Function: Specifies task types and specific conditions through text.
Mechanism: Task tags (e.g., [[semantic2image]]) are predefined for each task, along with descriptive text. For semantic segmentation, color codes and categories (e.g., <#FFFF00 yellow mask: mouse>) are embedded in the prompt to achieve flexible conditioning.
Design Motivation: Leverages the flexibility of text to unify condition descriptions across tasks, removing the need to design a specialized encoder for each type of condition.
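A prompt of this kind can be assembled with simple string formatting. The sketch below is hypothetical: `build_prompt` is not part of any released code, and the ordering of tag, mask descriptors, and caption is an assumption made for illustration. Only the tag syntax and the `<#FFFF00 yellow mask: mouse>` pattern come from the description above.

```python
def build_prompt(task, caption, mask_classes=None):
    """Compose a task-tagged prompt; the exact layout here is illustrative."""
    parts = [f"[[{task}]]"]
    # Each mask class is described by its colour code, colour name, and label.
    for hex_color, color_name, label in (mask_classes or []):
        parts.append(f"<{hex_color} {color_name} mask: {label}>")
    parts.append(caption)
    return " ".join(parts)

prompt = build_prompt(
    "semantic2image",
    "a mouse on a wooden desk",
    mask_classes=[("#FFFF00", "yellow", "mouse")],
)
```

Keeping the class-to-colour mapping in the text means new categories need no change to the model inputs, only a different prompt.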

Loss & Training¶

Joint flow matching objective: $\mathcal{L}(\theta) = \mathbb{E}[\|v_\theta - u\|^2]$, where $u = (\mathbf{x}_1 - \epsilon_1, ..., \mathbf{x}_N - \epsilon_N)$.
Trained from scratch using a three-stage strategy: (1) T2I pre-training at $256^2$ and $512^2$ for 500K steps each; (2) mixed-task training at $512^2$ for 1M steps; (3) T2I high-resolution fine-tuning at $1024^2$.
Next-DiT architecture is employed, with 3D RoPE positional encoding supporting multi-resolution.
Tasks are sampled within batches with equal probability; the AdamW optimizer is used with learning rate $\eta=0.0005$.
Training hardware: TPU v3-256 + 64×H100.
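The joint flow matching objective listed above can be sketched for a single sample as follows. This is a toy NumPy illustration, not the training code: `joint_flow_matching_loss` and `velocity_fn` are hypothetical names, timesteps are drawn from a logit-normal distribution (an assumption about what LogNorm denotes), and the squared norm is averaged over elements, which differs from the written loss only by a constant factor.

```python
import numpy as np

def joint_flow_matching_loss(velocity_fn, views, rng):
    """Monte Carlo estimate of E[||v_theta - u||^2] for one set of views."""
    views = np.asarray(views, dtype=float)
    n = views.shape[0]
    # Independent timestep per view (logit-normal draw, an assumption).
    t = 1.0 / (1.0 + np.exp(-rng.standard_normal(n)))
    eps = rng.standard_normal(views.shape)
    t_b = t.reshape((-1,) + (1,) * (views.ndim - 1))
    x_t = t_b * views + (1.0 - t_b) * eps   # noised views
    u = views - eps                         # target velocity u_i = x_i - eps_i
    v = velocity_fn(t, x_t)                 # stand-in for the network output
    return float(np.mean((v - u) ** 2))

views = np.arange(12.0).reshape(3, 4)
# A predictor that always outputs zero leaves the full target as error.
baseline = joint_flow_matching_loss(
    lambda t, x: np.zeros_like(x), views, np.random.default_rng(0)
)
```

Because every view contributes the same kind of term regardless of whether it later plays the role of condition or target, no task-specific loss is needed in this formulation.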

Key Experimental Results¶

Main Results¶

GenEval text-to-image benchmark ($1024 \times 1024$):

Method	Parameters (B)	Data Volume (M)	GenEval↑
SDXL	2.6	-	0.55
SD3-medium	2.0	1000	0.62
FLUX-dev	12.0	-	0.67
FLUX-schnell	12.0	-	0.71
OneDiffusion	2.8	75	0.65

Ablation Study¶

Influence of different components in the multi-view generation task:

Configuration	PSNR↑	SSIM↑	LPIPS↓
w/o Plücker Rays	18.2	0.72	0.28
w/o Multi-task Training	20.1	0.78	0.22
Full OneDiffusion	22.5	0.84	0.16

Key Findings¶

OneDiffusion reaches a GenEval score of 0.65 with only 75M training samples, which places it between SD3-medium (0.62, trained on about 1000M samples) and FLUX-dev (0.67, training data size not reported).
Performance on multi-view generation is comparable to methods dedicated to this task, demonstrating that unified training does not compromise single-task capabilities.
The model generalizes zero-shot to high resolutions unseen during training.
ID customization also handles non-human faces (e.g., anime characters), where it outperforms InstantID, which relies on face detectors.
A single model simultaneously supports both forward (condition $\rightarrow$ image) and backward (image $\rightarrow$ condition) tasks.

Highlights & Insights¶

The framework design is extremely simple: varying noise sequences unify all tasks without requiring dedicated modules, external losses, or adapters.
Bidirectional capability: the same model can both generate images from depth maps and predict depth maps from images, making the roles of condition and target fully interchangeable.
The design of treating ray embeddings as independent views in the multi-view scheme supports flexible input-output combinations.
Training data is used efficiently: with about 75M samples, the model is competitive with models trained on roughly a billion samples (e.g., SD3-medium at about 1000M).

Limitations & Future Work¶

At 2.8B parameters, the model still trails the 12B FLUX series in text-to-image generation (GenEval 0.65 vs. 0.67 for FLUX-dev and 0.71 for FLUX-schnell).
Data balancing strategies for different tasks in multi-task training still require further in-depth research.
Single-task accuracy in certain scenarios may fall short of specialized models.
Future work can extend to more tasks (such as video generation, 3D reconstruction, etc.).
Scaling up the model size and incorporating more high-quality training data are expected to further improve performance.

Related Work & Insights¶

ControlNet / T2I-Adapter: Conditioning is supported via external modules; OneDiffusion shows that a unified architecture can replace these specialized modules.
Marigold: Finetunes diffusion models for depth estimation; OneDiffusion unifies it as one of many tasks.
Stable Video Diffusion: The sequence modeling concept in video diffusion inspired the frame sequence design of OneDiffusion.
Insight: Diffusion models are evolving towards generalizability similar to LLMs, and a unified training framework serves as the critical path.

Rating¶

Novelty: ⭐⭐⭐⭐ — The approach of unifying multiple tasks via varying noise sequences is simple and effective.
Experimental Thoroughness: ⭐⭐⭐⭐ — Evaluated across multiple tasks covering both generation and prediction.
Writing Quality: ⭐⭐⭐⭐ — Clear description of the framework and comprehensive task coverage.
Value: ⭐⭐⭐⭐⭐ — Provides an important reference path for unified visual generative models.