# Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
- Conference: ICCV 2025
- arXiv: 2504.02160
- Code: https://github.com/bytedance/UNO
- Area: Diffusion Models / Image Generation
- Keywords: Subject-driven Generation, Multi-subject Generation, DiT, Model-Data Co-evolution, Positional Encoding
## TL;DR
This paper proposes UNO, a universal DiT-based customized generation model. Through a "model-data co-evolution" paradigm—wherein synthetic data generated by weaker models progressively trains stronger models—combined with progressive cross-modal alignment and Universal RoPE, UNO achieves state-of-the-art performance on both single- and multi-subject-driven image generation (DreamBench DINO 0.760, CLIP-I 0.835).
## Background & Motivation
Background: Subject-driven generation aims to synthesize new images based on reference subjects and text descriptions, with broad applications in content creation and industrial design. Existing methods fall into two categories: (a) few-data fine-tuning methods (DreamBooth, Textual Inversion, LoRA), which require per-subject fine-tuning and are therefore costly to deploy at test time; and (b) large-scale training methods (IP-Adapter, BLIP-Diffusion), which leverage auxiliary image encoders for zero-shot generation but depend on large amounts of paired data.
Limitations of Prior Work:

- Data bottleneck: Acquiring paired subject images with diverse viewpoints and poses is extremely difficult, and scaling from single-subject to multi-subject datasets compounds this challenge further.
- Limited synthetic data quality: Existing synthetic data typically suffers from low resolution (≤512×512) and narrow domain coverage.
- Multi-subject dilemma: Most methods support only single-subject generation; when confronted with multi-subject scenarios, they tend to exhibit attribute confusion or copy-paste artifacts.
- Trade-off between subject fidelity and text controllability: Improving subject consistency often sacrifices text-based editing capability.
Key Challenge: Existing methods adopt a "data-first, then model" paradigm, in which the data bottleneck directly caps model capability. The central question is how to break through this bottleneck to achieve scalable customized generation.
Goal: (a) How to establish a sustainably evolving synthetic data pipeline that scales from single-subject to multi-subject settings? (b) How to design a universal customization model that seamlessly generalizes from single-subject to multi-subject generation?
Key Insight: Inspired by synthetic-data-driven self-improvement in LLMs (weak-to-strong generalization), this work proposes a "model-data co-evolution" paradigm: a weak T2I model first generates single-subject paired data to train an S2I model, which in turn generates multi-subject paired data to train a stronger variant, enabling continuous co-evolution.
Core Idea: The intrinsic in-context generation capability of DiT is exploited to synthesize high-consistency paired data. Combined with progressive cross-modal alignment and Universal RoPE, the T2I model is iteratively upgraded into a multi-subject S2I model with minimal architectural modification.
## Method
### Overall Architecture
UNO is built upon FLUX.1 dev (MM-DiT architecture) and consists of two core systems: (1) a synthetic data pipeline—progressive data generation from single-subject to multi-subject with multi-stage filtering; and (2) a model training framework—iteratively training the T2I model into an S2I model via progressive cross-modal alignment (Stage I: single-subject → Stage II: multi-subject) and UnoPE positional encoding.
### Key Designs
- Synthetic Data Curation Framework:
    - Function: Systematically generates high-resolution (1024×1024), high-consistency subject-paired data by leveraging the in-context generation capability of DiT models.
    - Mechanism:
        - Single-subject stage: A taxonomy of 365 broad categories and fine-grained subcategories is constructed. An LLM generates diverse subject and scene descriptions, and carefully designed text templates prompt the T2I model to produce subject-consistent image pairs. DINOv2 performs preliminary filtering (removing low-consistency pairs), followed by VLM scoring across appearance, details, and attributes: \(score = \text{Average}(\text{VLM}(I_{ref}, I_{tgt}, c_y))\)
        - Multi-subject stage: An open-vocabulary detector (OVD) detects and crops additional subjects from the target images, and the Stage I S2I model then regenerates a consistent second reference image \(I_{ref}^2\) from each crop. Crucially, directly using the cropped images as \(I_{ref}^2\) leads to copy-paste artifacts, so regeneration via the S2I model is mandatory.
    - Design Motivation: To resolve the acquisition bottleneck for multi-subject paired data. Experiments confirm that data filtered at higher quality scores consistently improves model performance (both DINO and CLIP-I scores increase with data quality). A sketch of the filtering step follows this item.
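A minimal sketch of the two-stage filtering described above. The DINOv2 checkpoint size, both thresholds, and the `vlm_score` rater are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# DINOv2 backbone for the cheap consistency filter (checkpoint size assumed).
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dino = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def dino_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity between DINOv2 CLS embeddings of two images."""
    inputs = processor(images=[img_a, img_b], return_tensors="pt")
    cls = dino(**inputs).last_hidden_state[:, 0]  # (2, D) CLS tokens
    return F.cosine_similarity(cls[0:1], cls[1:2]).item()

def vlm_score(ref: Image.Image, tgt: Image.Image, caption: str) -> float:
    """Hypothetical VLM rater: the paper averages VLM scores over
    appearance, details, and attributes; plug in any instruction-tuned VLM."""
    raise NotImplementedError

def keep_pair(ref, tgt, caption,
              dino_thresh: float = 0.6, vlm_thresh: float = 4.0) -> bool:
    """Stage 1: discard low-consistency pairs via DINOv2 similarity.
    Stage 2: keep only pairs whose averaged VLM score clears the bar.
    Both thresholds are illustrative, not the paper's values."""
    if dino_similarity(ref, tgt) < dino_thresh:
        return False
    return vlm_score(ref, tgt, caption) >= vlm_thresh
```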
- Progressive Cross-Modal Alignment:
    - Function: A two-stage curriculum that gradually trains the DiT to condition on reference images.
    - Mechanism:
        - Stage I: Single reference image input. The reference image is VAE-encoded and concatenated with the text tokens and the noisy latent: \(z_1 = \text{Concatenate}(c, z_t, \mathcal{E}(I_{ref}^1))\). All tokens participate jointly in FLUX's MM-DiT attention.
        - Stage II: Extended to multiple reference images: \(z_2 = \text{Concatenate}(c, z_t, z_{ref}^1, z_{ref}^2, \ldots, z_{ref}^N)\), with \(N=2\).
    - Design Motivation: Directly training with multi-reference image inputs degrades performance (DINO drops from 0.542 to 0.511), because noise-free reference tokens disrupt the original convergence distribution. A simple-to-complex curriculum proves substantially more effective. A concatenation sketch follows this item.
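A minimal sketch of the Stage I/II token concatenation, assuming pre-computed text tokens and VAE-encoded reference latents already projected to the transformer width; names are illustrative, not the authors' code:

```python
import torch

def build_sequence(c: torch.Tensor,
                   z_t: torch.Tensor,
                   z_refs: list[torch.Tensor]) -> torch.Tensor:
    """Concatenate text tokens, noisy latent tokens, and (noise-free)
    reference tokens along the sequence axis, so that FLUX's MM-DiT
    attention attends jointly over all of them.

    c:      (B, L_txt, D) text tokens
    z_t:    (B, L_img, D) noisy image latent tokens
    z_refs: list of (B, L_ref, D) reference latent tokens
    """
    return torch.cat([c, z_t, *z_refs], dim=1)

# Stage I uses a single reference; Stage II extends to N = 2:
# z1 = build_sequence(c, z_t, [z_ref1])
# z2 = build_sequence(c, z_t, [z_ref1, z_ref2])
```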
- Universal Rotary Position Embedding (UnoPE):
    - Function: Provides appropriate positional encoding for multiple reference images to prevent attribute confusion.
    - Mechanism: FLUX employs RoPE, assigning text tokens position (0,0) and noisy image tokens positions \((i, j)\) with \(i \in [0, w-1]\), \(j \in [0, h-1]\). Reference image tokens are placed starting from a diagonal offset, \((i', j') = (i + w^{(N-1)}, j + h^{(N-1)})\), ensuring sufficient positional separation between different reference images.
    - Design Motivation: Directly replicating the target image's positional indices (without offset) prevents the model from distinguishing reference images from the target image, causing a dramatic DINO drop (from 0.730 to 0.470 for single-subject; from 0.542 to 0.386 for multi-subject). The diagonal offset in UnoPE breaks the spatial correlation between reference images, compelling the model to derive layout information from text rather than position. An index-construction sketch follows this item.
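A minimal sketch of the UnoPE index construction, under the assumption that the diagonal offsets accumulate across references (each reference grid starts past all preceding images); FLUX's actual RoPE also carries a text axis, omitted here for brevity:

```python
import torch

def grid_ids(h: int, w: int, off_i: int = 0, off_j: int = 0) -> torch.Tensor:
    """(h*w, 2) tensor of (i, j) RoPE indices for an h-by-w token grid,
    with i in [off_i, off_i + w - 1] and j in [off_j, off_j + h - 1]."""
    j = torch.arange(h).repeat_interleave(w)  # row index
    i = torch.arange(w).repeat(h)             # column index
    return torch.stack([i + off_i, j + off_j], dim=-1)

def unope_ids(target_hw: tuple[int, int],
              ref_hws: list[tuple[int, int]]) -> list[torch.Tensor]:
    """Target tokens start at (0, 0); each reference grid is offset
    diagonally past all preceding images, so no two image grids ever
    share positional indices."""
    h0, w0 = target_hw
    ids = [grid_ids(h0, w0)]
    off_i, off_j = w0, h0  # first reference starts at (w, h) of the target
    for h, w in ref_hws:
        ids.append(grid_ids(h, w, off_i, off_j))
        off_i, off_j = off_i + w, off_j + h
    return ids

# Example: a 64x64-token target with two 64x64-token references.
# ids = unope_ids((64, 64), [(64, 64), (64, 64)])
```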
### Loss & Training
Standard flow-matching loss is applied. Training configuration: learning rate \(10^{-5}\), batch size 16, Stage I for 5,000 steps (230K single-subject samples), Stage II for 5,000 steps (15K multi-subject samples), LoRA rank 512, on 8×A100 GPUs.
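For concreteness, a minimal sketch of the flow-matching objective in the rectified-flow form used by FLUX-family models; `model` stands in for the LoRA-tuned MM-DiT, and its call signature is an assumption:

```python
import torch

def flow_matching_loss(model, x0: torch.Tensor, cond: dict) -> torch.Tensor:
    """Linear-interpolant flow matching: x_t = (1 - t) * x0 + t * noise,
    and the network regresses the constant velocity v = noise - x0."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast over latent dims
    x_t = (1.0 - t_) * x0 + t_ * noise
    v_target = noise - x0
    v_pred = model(x_t, t, **cond)  # cond: text tokens, reference tokens, ...
    return torch.mean((v_pred - v_target) ** 2)
```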
## Key Experimental Results
### Main Results
Single-subject generation (DreamBench):
| Method | DINO ↑ | CLIP-I ↑ | CLIP-T ↑ |
|---|---|---|---|
| DreamBooth (fine-tuning) | 0.668 | 0.803 | 0.305 |
| RealCustom++ | 0.702 | 0.794 | 0.318 |
| OmniGen | 0.693 | 0.801 | 0.315 |
| OminiControl | 0.684 | 0.799 | 0.312 |
| FLUX IP-Adapter | 0.582 | 0.820 | 0.288 |
| UNO (Ours) | 0.760 | 0.835 | 0.304 |
| Oracle (similarity between real reference images; upper bound) | 0.774 | 0.885 | - |
Multi-subject generation:
| Method | DINO ↑ | CLIP-I ↑ | CLIP-T ↑ |
|---|---|---|---|
| MS-Diffusion | 0.525 | 0.726 | 0.319 |
| OmniGen | 0.511 | 0.722 | 0.331 |
| MIP-Adapter | 0.482 | 0.726 | 0.311 |
| UNO (Ours) | 0.542 | 0.733 | 0.322 |
### Ablation Study
| Configuration | DINO ↑ | CLIP-I ↑ | CLIP-T ↑ |
|---|---|---|---|
| UNO (Full) | 0.542 | 0.733 | 0.322 |
| w/o generated \(I_{ref}^2\) (replaced by cropped image) | 0.529 | 0.730 | 0.308 |
| w/o cross-modal alignment (direct multi-subject training) | 0.511 | 0.721 | 0.322 |
| w/o UnoPE (cloned positional indices) | 0.386 | 0.674 | 0.323 |
| w/o offset (no positional offset) | 0.470 / 0.386 | 0.722 / 0.674 | 0.308 / 0.323 |
| w/ width-offset only | 0.717 / 0.508 | 0.813 / 0.724 | 0.304 / 0.321 |
| w/ height-offset only | 0.678 / 0.501 | 0.797 / 0.719 | 0.308 / 0.306 |
| w/ diagonal-offset (UnoPE) | 0.730 / 0.542 | 0.821 / 0.733 | 0.309 / 0.322 |

Rows with paired values report single-subject / multi-subject scores.
### Key Findings
- UnoPE is the most critical component: Its removal causes DINO to plummet from 0.542 to 0.386 (−28.8%), demonstrating that positional encoding design is essential for multi-subject generation. Diagonal offset outperforms width-only or height-only offsets.
- Progressive training yields substantial gains: Training directly on multi-subject data without the alignment curriculum reduces DINO by 5.7%. Notably, Stage II training reciprocally improves single-subject performance (DINO 0.730→0.760), indicating positive transfer between multi-subject and single-subject capabilities.
- Synthetic data quality is critical: Generated \(I_{ref}^2\) outperforms directly cropped counterparts (DINO +2.5%), and VLM-scored, high-quality filtered data consistently boosts model performance.
- UNO's DINO score of 0.760 approaches the Oracle value of 0.774, indicating that subject consistency is near the upper bound.
## Highlights & Insights
- Novel "model-data co-evolution" paradigm: Rather than passively waiting for data, the model actively leverages its own generative capability to produce training data for a stronger successor. This self-improvement strategy is inspired by weak-to-strong generalization in LLMs and represents the first systematic realization of this concept in visual generation.
- Exploiting the intrinsic in-context capability of DiT: The observation that DiT models inherently possess the ability to generate subject-consistent images—activated through carefully constructed text templates—eliminates the need for complex data collection pipelines.
- Elegant and effective diagonal-offset design in UnoPE: Spatial isolation via positional encoding resolves multi-subject attribute confusion, and the underlying principle is transferable to other DiT applications that handle multi-condition inputs.
## Limitations & Future Work
- The current framework supports only \(N=2\) reference subjects; the effectiveness and efficiency of scaling to a larger number of subjects remain to be validated.
- The synthetic data pipeline depends on the in-context generation capability of the base T2I model and is inapplicable to models that lack this property.
- The CLIP-T score is not the highest (0.304 vs. OmniGen's 0.315), indicating room for improvement in text controllability.
- LoRA rank 512 entails a large parameter count, and fine-tuning efficiency warrants further optimization.
## Related Work & Insights
- vs. IP-Adapter/BLIP-Diffusion: These methods inject reference information via auxiliary image encoders, whereas UNO directly leverages the DiT's VAE and attention mechanism without any additional encoder.
- vs. OminiControl/IC-LoRA: These methods also exploit DiT's intrinsic reference-image capability, but are limited to single-subject generation at low resolution (512×512). UNO extends to high-resolution multi-subject generation through a complete data pipeline and principled positional encoding design.
- vs. DreamBooth: DreamBooth requires per-subject fine-tuning, whereas UNO enables zero-shot inference with superior DINO scores.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — The model-data co-evolution paradigm and UnoPE are both original contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive single- and multi-subject evaluation with detailed ablations, though evaluation beyond two subjects is absent.
- Writing Quality: ⭐⭐⭐⭐ — Method presentation is clear, though the data pipeline section is somewhat verbose.
- Value: ⭐⭐⭐⭐⭐ — Open-source codebase, strong practical utility, and a production-grade solution from ByteDance.