# Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
- Conference: ICCV 2025
- arXiv: 2504.02160
- Code: https://github.com/bytedance/UNO
- Area: Diffusion Models / Image Generation
- Keywords: Subject-driven Generation, Multi-subject Generation, DiT, Model-Data Co-evolution, Positional Encoding
## TL;DR
This paper proposes UNO, a universal DiT-based customized generation model. Through a "model-data co-evolution" paradigm—wherein synthetic data generated by weaker models progressively trains stronger models—combined with progressive cross-modal alignment and Universal RoPE, UNO achieves state-of-the-art performance on both single- and multi-subject-driven image generation (DreamBench DINO 0.760, CLIP-I 0.835).
## Background & Motivation
Background: Subject-driven generation aims to synthesize new images based on reference subjects and text descriptions, with broad applications in content creation and industrial design. Existing methods fall into two categories: (a) few-data fine-tuning methods (DreamBooth, Textual Inversion, LoRA), which require per-subject fine-tuning and are therefore costly to deploy at test time; and (b) large-scale training methods (IP-Adapter, BLIP-Diffusion), which leverage auxiliary image encoders for zero-shot generation but depend on large amounts of paired data.
Limitations of Prior Work:

- Data bottleneck: Acquiring paired subject images with diverse viewpoints and poses is extremely difficult, and scaling from single-subject to multi-subject datasets compounds this challenge further.
- Limited synthetic data quality: Existing synthetic data typically suffers from low resolution (≤512×512) and narrow domain coverage.
- Multi-subject dilemma: Most methods support only single-subject generation; when confronted with multi-subject scenarios, they tend to exhibit attribute confusion or copy-paste artifacts.
- Trade-off between subject fidelity and text controllability: Improving subject consistency often sacrifices text-based editing capability.
Key Challenge: Existing methods adopt a "data-first, then model" paradigm, in which the data bottleneck directly caps model capability. The central question is how to break through this bottleneck to achieve scalable customized generation.
Goal: (a) How to establish a sustainably evolving synthetic data pipeline that scales from single-subject to multi-subject settings? (b) How to design a universal customization model that seamlessly generalizes from single-subject to multi-subject generation?
Key Insight: Inspired by synthetic-data-driven self-improvement in LLMs (weak-to-strong generalization), this work proposes a "model-data co-evolution" paradigm: a weak T2I model first generates single-subject paired data to train an S2I model, which in turn generates multi-subject paired data to train a stronger variant, enabling continuous co-evolution.
Core Idea: The intrinsic in-context generation capability of DiT is exploited to synthesize high-consistency paired data. Combined with progressive cross-modal alignment and Universal RoPE, the T2I model is iteratively upgraded into a multi-subject S2I model with minimal architectural modification.
## Method
### Overall Architecture
UNO is built upon FLUX.1 dev (MM-DiT architecture) and consists of two core systems: (1) a synthetic data pipeline—progressive data generation from single-subject to multi-subject with multi-stage filtering; and (2) a model training framework—iteratively training the T2I model into an S2I model via progressive cross-modal alignment (Stage I: single-subject → Stage II: multi-subject) and UnoPE positional encoding.
### Key Designs
- Synthetic Data Curation Framework:
    - Function: Systematically generates high-resolution (1024×1024), high-consistency subject-paired data by leveraging the in-context generation capability of DiT models.
    - Mechanism:
        - Single-subject stage: A taxonomy of 365 broad categories and fine-grained subcategories is constructed. An LLM generates diverse subject and scene descriptions, and carefully designed text templates prompt the T2I model to produce subject-consistent image pairs. DINOv2 performs preliminary filtering (removing low-consistency pairs), followed by VLM scoring across appearance, details, and attributes: \(score = \text{Average}(\text{VLM}(I_{ref}, I_{tgt}, c_y))\)
        - Multi-subject stage: An open-vocabulary detector (OVD) detects and crops additional subjects from the target images, and the Stage I S2I model then regenerates a consistent second reference image \(I_{ref}^2\) from each crop. Crucially, directly using the cropped images as \(I_{ref}^2\) leads to copy-paste artifacts, so regeneration via the S2I model is mandatory.
    - Design Motivation: To resolve the acquisition bottleneck for multi-subject paired data. Experiments confirm that data filtered at higher quality scores consistently improves model performance (both DINO and CLIP-I scores increase with data quality). A sketch of the filtering step follows this item.
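A minimal sketch of the two-stage filtering described above. The DINOv2 checkpoint size, both thresholds, and the `vlm_score` rater are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# DINOv2 backbone for the cheap consistency filter (checkpoint size assumed).
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dino = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def dino_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity between DINOv2 CLS embeddings of two images."""
    inputs = processor(images=[img_a, img_b], return_tensors="pt")
    cls = dino(**inputs).last_hidden_state[:, 0]  # (2, D) CLS tokens
    return F.cosine_similarity(cls[0:1], cls[1:2]).item()

def vlm_score(ref: Image.Image, tgt: Image.Image, caption: str) -> float:
    """Hypothetical VLM rater: the paper averages VLM scores over
    appearance, details, and attributes; plug in any instruction-tuned VLM."""
    raise NotImplementedError

def keep_pair(ref, tgt, caption,
              dino_thresh: float = 0.6, vlm_thresh: float = 4.0) -> bool:
    """Stage 1: discard low-consistency pairs via DINOv2 similarity.
    Stage 2: keep only pairs whose averaged VLM score clears the bar.
    Both thresholds are illustrative, not the paper's values."""
    if dino_similarity(ref, tgt) < dino_thresh:
        return False
    return vlm_score(ref, tgt, caption) >= vlm_thresh
```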
- Progressive Cross-Modal Alignment:
    - Function: A two-stage curriculum that gradually trains the DiT to condition on reference images.
    - Mechanism:
        - Stage I: Single reference image input. The reference image is VAE-encoded and concatenated with the text tokens and the noisy latent: \(z_1 = \text{Concatenate}(c, z_t, \mathcal{E}(I_{ref}^1))\). All tokens participate jointly in FLUX's MM-DiT attention.
        - Stage II: Extended to multiple reference images: \(z_2 = \text{Concatenate}(c, z_t, z_{ref}^1, z_{ref}^2, \ldots, z_{ref}^N)\), with \(N=2\).
    - Design Motivation: Directly training with multi-reference image inputs degrades performance (DINO drops from 0.542 to 0.511), because noise-free reference tokens disrupt the original convergence distribution. A simple-to-complex curriculum proves substantially more effective. A concatenation sketch follows this item.
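A minimal sketch of the Stage I/II token concatenation, assuming pre-computed text tokens and VAE-encoded reference latents already projected to the transformer width; names are illustrative, not the authors' code:

```python
import torch

def build_sequence(c: torch.Tensor,
                   z_t: torch.Tensor,
                   z_refs: list[torch.Tensor]) -> torch.Tensor:
    """Concatenate text tokens, noisy latent tokens, and (noise-free)
    reference tokens along the sequence axis, so that FLUX's MM-DiT
    attention attends jointly over all of them.

    c:      (B, L_txt, D) text tokens
    z_t:    (B, L_img, D) noisy image latent tokens
    z_refs: list of (B, L_ref, D) reference latent tokens
    """
    return torch.cat([c, z_t, *z_refs], dim=1)

# Stage I uses a single reference; Stage II extends to N = 2:
# z1 = build_sequence(c, z_t, [z_ref1])
# z2 = build_sequence(c, z_t, [z_ref1, z_ref2])
```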
- Universal Rotary Position Embedding (UnoPE):
    - Function: Provides appropriate positional encoding for multiple reference images to prevent attribute confusion.
    - Mechanism: FLUX employs RoPE, assigning text tokens position (0,0) and noisy image tokens positions \((i, j)\) with \(i \in [0, w-1]\), \(j \in [0, h-1]\). Reference image tokens are placed starting from a diagonal offset, \((i', j') = (i + w^{(N-1)}, j + h^{(N-1)})\), ensuring sufficient positional separation between different reference images.
    - Design Motivation: Directly replicating the target image's positional indices (without offset) prevents the model from distinguishing reference images from the target image, causing a dramatic DINO drop (from 0.730 to 0.470 for single-subject; from 0.542 to 0.386 for multi-subject). The diagonal offset in UnoPE breaks the spatial correlation between reference images, compelling the model to derive layout information from text rather than position. An index-construction sketch follows this item.
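A minimal sketch of the UnoPE index construction, under the assumption that the diagonal offsets accumulate across references (each reference grid starts past all preceding images); FLUX's actual RoPE also carries a text axis, omitted here for brevity:

```python
import torch

def grid_ids(h: int, w: int, off_i: int = 0, off_j: int = 0) -> torch.Tensor:
    """(h*w, 2) tensor of (i, j) RoPE indices for an h-by-w token grid,
    with i in [off_i, off_i + w - 1] and j in [off_j, off_j + h - 1]."""
    j = torch.arange(h).repeat_interleave(w)  # row index
    i = torch.arange(w).repeat(h)             # column index
    return torch.stack([i + off_i, j + off_j], dim=-1)

def unope_ids(target_hw: tuple[int, int],
              ref_hws: list[tuple[int, int]]) -> list[torch.Tensor]:
    """Target tokens start at (0, 0); each reference grid is offset
    diagonally past all preceding images, so no two image grids ever
    share positional indices."""
    h0, w0 = target_hw
    ids = [grid_ids(h0, w0)]
    off_i, off_j = w0, h0  # first reference starts at (w, h) of the target
    for h, w in ref_hws:
        ids.append(grid_ids(h, w, off_i, off_j))
        off_i, off_j = off_i + w, off_j + h
    return ids

# Example: a 64x64-token target with two 64x64-token references.
# ids = unope_ids((64, 64), [(64, 64), (64, 64)])
```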
### Loss & Training
Standard flow-matching loss is applied. Training configuration: learning rate \(10^{-5}\), batch size 16, Stage I for 5,000 steps (230K single-subject samples), Stage II for 5,000 steps (15K multi-subject samples), LoRA rank 512, on 8×A100 GPUs.
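For concreteness, a minimal sketch of the flow-matching objective in the rectified-flow form used by FLUX-family models; `model` stands in for the LoRA-tuned MM-DiT, and its call signature is an assumption:

```python
import torch

def flow_matching_loss(model, x0: torch.Tensor, cond: dict) -> torch.Tensor:
    """Linear-interpolant flow matching: x_t = (1 - t) * x0 + t * noise,
    and the network regresses the constant velocity v = noise - x0."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast over latent dims
    x_t = (1.0 - t_) * x0 + t_ * noise
    v_target = noise - x0
    v_pred = model(x_t, t, **cond)  # cond: text tokens, reference tokens, ...
    return torch.mean((v_pred - v_target) ** 2)
```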
## Key Experimental Results
### Main Results
Single-subject generation (DreamBench):
| Method | DINO ↑ | CLIP-I ↑ | CLIP-T ↑ |
|---|---|---|---|
| DreamBooth (fine-tuning) | 0.668 | 0.803 | 0.305 |
| RealCustom++ | 0.702 | 0.794 | 0.318 |
| OmniGen | 0.693 | 0.801 | 0.315 |
| OminiControl | 0.684 | 0.799 | 0.312 |
| FLUX IP-Adapter | 0.582 | 0.820 | 0.288 |
| UNO (Ours) | 0.760 | 0.835 | 0.304 |
| Oracle (similarity between real reference images; upper bound) | 0.774 | 0.885 | - |
Multi-subject generation:
| Method | DINO ↑ | CLIP-I ↑ | CLIP-T ↑ |
|---|---|---|---|
| MS-Diffusion | 0.525 | 0.726 | 0.319 |
| OmniGen | 0.511 | 0.722 | 0.331 |
| MIP-Adapter | 0.482 | 0.726 | 0.311 |
| UNO (Ours) | 0.542 | 0.733 | 0.322 |
### Ablation Study
| Configuration | DINO ↑ | CLIP-I ↑ | CLIP-T ↑ |
|---|---|---|---|
| UNO (Full) | 0.542 | 0.733 | 0.322 |
| w/o generated \(I_{ref}^2\) (replaced by cropped image) | 0.529 | 0.730 | 0.308 |
| w/o cross-modal alignment (direct multi-subject training) | 0.511 | 0.721 | 0.322 |
| w/o UnoPE (cloned positional indices) | 0.386 | 0.674 | 0.323 |
| w/o offset (no positional offset) | 0.470 / 0.386 | 0.722 / 0.674 | 0.308 / 0.323 |
| w/ width-offset only | 0.717 / 0.508 | 0.813 / 0.724 | 0.304 / 0.321 |
| w/ height-offset only | 0.678 / 0.501 | 0.797 / 0.719 | 0.308 / 0.306 |
| w/ diagonal-offset (UnoPE) | 0.730 / 0.542 | 0.821 / 0.733 | 0.309 / 0.322 |

Rows with paired values report single-subject / multi-subject scores.
### Key Findings
- UnoPE is the most critical component: Its removal causes DINO to plummet from 0.542 to 0.386 (−28.8%), demonstrating that positional encoding design is essential for multi-subject generation. Diagonal offset outperforms width-only or height-only offsets.
- Progressive training yields substantial gains: Training directly on multi-subject data without the alignment curriculum reduces DINO by 5.7%. Notably, Stage II training reciprocally improves single-subject performance (DINO 0.730→0.760), indicating positive transfer between multi-subject and single-subject capabilities.
- Synthetic data quality is critical: Generated \(I_{ref}^2\) outperforms directly cropped counterparts (DINO +2.5%), and VLM-scored, high-quality filtered data consistently boosts model performance.
- UNO's DINO score of 0.760 approaches the Oracle value of 0.774, indicating that subject consistency is near the upper bound.
## Highlights & Insights
- Novel "model-data co-evolution" paradigm: Rather than passively waiting for data, the model actively leverages its own generative capability to produce training data for a stronger successor. This self-improvement strategy is inspired by weak-to-strong generalization in LLMs and represents the first systematic realization of this concept in visual generation.
- Exploiting the intrinsic in-context capability of DiT: The observation that DiT models inherently possess the ability to generate subject-consistent images—activated through carefully constructed text templates—eliminates the need for complex data collection pipelines.
- Elegant and effective diagonal-offset design in UnoPE: Spatial isolation via positional encoding resolves multi-subject attribute confusion, and the underlying principle is transferable to other DiT applications that handle multi-condition inputs.
## Limitations & Future Work
- The current framework supports only \(N=2\) reference subjects; the effectiveness and efficiency of scaling to a larger number of subjects remain to be validated.
- The synthetic data pipeline depends on the in-context generation capability of the base T2I model and is inapplicable to models that lack this property.
- The CLIP-T score is not the highest (0.304 vs. OmniGen's 0.315), indicating room for improvement in text controllability.
- LoRA rank 512 entails a large parameter count, and fine-tuning efficiency warrants further optimization.
## Related Work & Insights
- vs. IP-Adapter/BLIP-Diffusion: These methods inject reference information via auxiliary image encoders, whereas UNO directly leverages the DiT's VAE and attention mechanism without any additional encoder.
- vs. OminiControl/IC-LoRA: These methods also exploit DiT's intrinsic reference-image capability, but are limited to single-subject generation at low resolution (512×512). UNO extends to high-resolution multi-subject generation through a complete data pipeline and principled positional encoding design.
- vs. DreamBooth: DreamBooth requires per-subject fine-tuning, whereas UNO enables zero-shot inference with superior DINO scores.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — The model-data co-evolution paradigm and UnoPE are both original contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive single- and multi-subject evaluation with detailed ablations, though evaluation beyond two subjects is absent.
- Writing Quality: ⭐⭐⭐⭐ — Method presentation is clear, though the data pipeline section is somewhat verbose.
- Value: ⭐⭐⭐⭐⭐ — Open-source codebase, strong practical utility, and a production-grade solution from ByteDance.