Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

Conference: ICCV 2025
arXiv: 2504.02160
Code: https://github.com/bytedance/UNO
Area: Diffusion Models / Image Generation
Keywords: Subject-driven Generation, Multi-subject Generation, DiT, Model-Data Co-evolution, Positional Encoding

TL;DR

This paper proposes UNO, a universal DiT-based customized generation model. Through a "model-data co-evolution" paradigm—wherein synthetic data generated by weaker models progressively trains stronger models—combined with progressive cross-modal alignment and Universal RoPE, UNO achieves state-of-the-art performance on both single- and multi-subject-driven image generation (DreamBench DINO 0.760, CLIP-I 0.835).

Background & Motivation

Background: Subject-driven generation aims to synthesize new images based on reference subjects and text descriptions, with broad applications in content creation and industrial design. Existing methods fall into two categories: (a) few-data fine-tuning methods (DreamBooth, Textual Inversion, LoRA), which require per-subject fine-tuning and therefore incur high test-time customization cost; and (b) large-scale training methods (IP-Adapter, BLIP-Diffusion), which leverage auxiliary image encoders for zero-shot generation but depend on large amounts of paired data.

Limitations of Prior Work:

  • Data bottleneck: Acquiring paired subject images with diverse viewpoints and poses is extremely difficult, and scaling from single-subject to multi-subject datasets compounds this challenge further.
  • Limited synthetic data quality: Existing synthetic data typically suffers from low resolution (≤512×512) and narrow domain coverage.
  • Multi-subject dilemma: Most methods support only single-subject generation; when confronted with multi-subject scenarios, they tend to exhibit attribute confusion or copy-paste artifacts.
  • Trade-off between subject fidelity and text controllability: Improving subject consistency often sacrifices text-based editing capability.

Key Challenge: Existing methods adopt a "data-first, then model" paradigm, in which the data bottleneck directly caps model capability. The central question is how to break through this bottleneck to achieve scalable customized generation.

Goal: (a) How to establish a sustainably evolving synthetic data pipeline that scales from single-subject to multi-subject settings? (b) How to design a universal customization model that seamlessly generalizes from single-subject to multi-subject generation?

Key Insight: Inspired by synthetic-data-driven self-improvement in LLMs (weak-to-strong generalization), this work proposes a "model-data co-evolution" paradigm: a weak T2I model first generates single-subject paired data to train an S2I model, which in turn generates multi-subject paired data to train a stronger variant, enabling continuous co-evolution.

Core Idea: The intrinsic in-context generation capability of DiT is exploited to synthesize high-consistency paired data. Combined with progressive cross-modal alignment and Universal RoPE, the T2I model is iteratively upgraded into a multi-subject S2I model with minimal architectural modification.

Method

Overall Architecture

UNO is built upon FLUX.1 dev (MM-DiT architecture) and consists of two core systems: (1) a synthetic data pipeline—progressive data generation from single-subject to multi-subject with multi-stage filtering; and (2) a model training framework—iteratively training the T2I model into an S2I model via progressive cross-modal alignment (Stage I: single-subject → Stage II: multi-subject) and UnoPE positional encoding.

Key Designs

  1. Synthetic Data Curation Framework:

    • Function: Systematically generates high-resolution (1024×1024), high-consistency subject-paired data by leveraging the in-context generation capability of DiT models.
    • Mechanism:
      • Single-subject stage: A taxonomy of 365 broad categories and fine-grained subcategories is constructed. An LLM generates diverse subject and scene descriptions, and carefully designed text templates prompt the T2I model to produce subject-consistent image pairs. DINOv2 performs preliminary filtering (removing low-consistency pairs), followed by VLM scoring across appearance, details, and attributes: \(score = \text{Average}(\text{VLM}(I_{ref}, I_{tgt}, c_y))\) (a filtering sketch follows this list).
      • Multi-subject stage: An open-vocabulary detector (OVD) detects and crops additional subjects from the target images, and the Stage I S2I model then generates a consistent second reference image \(I_{ref}^2\). Crucially, directly using the cropped images as \(I_{ref}^2\) leads to copy-paste artifacts; regeneration via the S2I model is therefore mandatory.
    • Design Motivation: To resolve the acquisition bottleneck for multi-subject paired data. Experiments confirm that data filtered at higher quality scores consistently improves model performance (both DINO and CLIP-I scores increase with data quality).
  2. Progressive Cross-Modal Alignment:

    • Function: A two-stage curriculum that gradually trains the DiT to condition on reference images.
    • Mechanism:
      • Stage I: Single reference image input. The reference image is VAE-encoded and concatenated with text tokens and the noisy latent: \(z_1 = \text{Concatenate}(c, z_t, \mathcal{E}(I_{ref}^1))\). All tokens participate jointly in FLUX's MM-DiT attention.
      • Stage II: Extended to multiple reference images: \(z_2 = \text{Concatenate}(c, z_t, z_{ref}^1, z_{ref}^2, \ldots, z_{ref}^N)\), with \(N=2\) in practice (see the concatenation sketch after this list).
    • Design Motivation: Directly training with multi-reference image inputs degrades performance (DINO drops from 0.542 to 0.511), because noise-free reference tokens disrupt the original convergence distribution. A simple-to-complex curriculum proves substantially more effective.
  3. Universal Rotary Position Embedding (UnoPE):

    • Function: Provides appropriate positional encoding for multiple reference images to prevent attribute confusion.
    • Mechanism: FLUX employs RoPE, assigning text tokens position (0,0) and noisy image tokens positions (i,j) with \(i \in [0,w-1], j \in [0,h-1]\). Each reference image is placed starting from a diagonal offset, \((i', j') = (i + w^{(N-1)}, j + h^{(N-1)})\), where \(w^{(N-1)}\) and \(h^{(N-1)}\) denote the width and height (in latent tokens) of the preceding image; the offsets therefore accumulate along the diagonal, ensuring sufficient positional separation between different reference images (see the position-index sketch after this list).
    • Design Motivation: Directly replicating the target image's positional indices (without offset) prevents the model from distinguishing reference images from the target image, causing a dramatic DINO drop (from 0.730 to 0.470 for single-subject; from 0.542 to 0.386 for multi-subject). The diagonal offset in UnoPE breaks the spatial correlation between reference images, compelling the model to derive layout information from text rather than position.
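
To make the two-stage data filter concrete, here is a minimal sketch under stated assumptions: `dino_model` is any DINOv2 feature extractor returning pooled embeddings, `vlm_judge` is a hypothetical callable wrapping the VLM rating prompt, and both thresholds are placeholders (the paper does not publish exact cutoffs at this granularity).

```python
import torch
import torch.nn.functional as F

# Placeholder thresholds; assumed values, not taken from the paper.
DINO_THRESHOLD = 0.6   # assumed minimum DINOv2 pair consistency
VLM_THRESHOLD = 4.0    # assumed minimum average VLM rating

def dino_consistency(dino_model, ref_img: torch.Tensor, tgt_img: torch.Tensor) -> float:
    """Cosine similarity between pooled DINOv2 features of the image pair."""
    with torch.no_grad():
        f_ref = dino_model(ref_img.unsqueeze(0))  # (1, D)
        f_tgt = dino_model(tgt_img.unsqueeze(0))  # (1, D)
    return F.cosine_similarity(f_ref, f_tgt).item()

def vlm_pair_score(vlm_judge, ref_img, tgt_img, caption: str) -> float:
    """Mirror of score = Average(VLM(I_ref, I_tgt, c_y)): one VLM rating
    per aspect (appearance, details, attributes), then the mean."""
    aspects = ("appearance", "details", "attributes")
    ratings = [vlm_judge(ref_img, tgt_img, caption, aspect) for aspect in aspects]
    return sum(ratings) / len(ratings)

def keep_pair(dino_model, vlm_judge, ref_img, tgt_img, caption) -> bool:
    """Two-stage filter: cheap DINOv2 check first, VLM scoring second."""
    if dino_consistency(dino_model, ref_img, tgt_img) < DINO_THRESHOLD:
        return False
    return vlm_pair_score(vlm_judge, ref_img, tgt_img, caption) >= VLM_THRESHOLD
```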
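
The progressive alignment itself reduces to sequence concatenation inside MM-DiT attention. A minimal sketch, assuming the latents are already VAE-encoded, patchified, and projected to the transformer width:

```python
import torch

def build_mmdit_sequence(text_tokens: torch.Tensor,
                         noisy_latent: torch.Tensor,
                         ref_latents: list[torch.Tensor]) -> torch.Tensor:
    """Concatenate conditioning and latent tokens into one sequence,
    i.e. z = Concatenate(c, z_t, z_ref^1, ..., z_ref^N), so that all
    tokens attend jointly inside FLUX's MM-DiT attention.

    Assumed shapes: text_tokens (B, L_txt, D); noisy_latent (B, L_img, D);
    each reference latent (B, L_ref, D).
    """
    return torch.cat([text_tokens, noisy_latent, *ref_latents], dim=1)

# Stage I trains on a single reference; Stage II extends to N = 2:
# z1 = build_mmdit_sequence(c, z_t, [z_ref1])           # Stage I
# z2 = build_mmdit_sequence(c, z_t, [z_ref1, z_ref2])   # Stage II
```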
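
One reading of the UnoPE offset rule, sketched below: the target latent keeps its native \((i, j)\) grid, and each reference image's indices are shifted diagonally past the extent of everything placed before it. `unope_position_ids` is a hypothetical helper, not the official implementation (which lives in the linked repository).

```python
import torch

def unope_position_ids(w: int, h: int,
                       ref_sizes: list[tuple[int, int]]) -> list[torch.Tensor]:
    """2D RoPE index grids for the noisy target latent and each reference.

    The target keeps its native grid (i, j), i in [0, w-1], j in [0, h-1];
    reference N is shifted diagonally past the extent of every image placed
    before it, so different references never share position indices.
    """
    grids = []
    i = torch.arange(w).view(-1, 1).expand(w, h)
    j = torch.arange(h).view(1, -1).expand(w, h)
    grids.append(torch.stack([i, j], dim=-1))  # target grid: (w, h, 2)

    off_w, off_h = w, h  # first reference starts just past the target
    for rw, rh in ref_sizes:
        ri = torch.arange(rw).view(-1, 1).expand(rw, rh) + off_w
        rj = torch.arange(rh).view(1, -1).expand(rw, rh) + off_h
        grids.append(torch.stack([ri, rj], dim=-1))
        off_w, off_h = off_w + rw, off_h + rh  # keep moving down the diagonal
    return grids
```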

Loss & Training

Standard flow-matching loss is applied. Training configuration: learning rate \(10^{-5}\), batch size 16, Stage I for 5,000 steps (230K single-subject samples), Stage II for 5,000 steps (15K multi-subject samples), LoRA rank 512, on 8×A100 GPUs.
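As a reference point, here is a minimal flow-matching (rectified-flow) loss sketch consistent with FLUX-style training; `model` and `cond` are hypothetical stand-ins for the conditioned MM-DiT forward pass, and time/sign conventions vary across implementations:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1: torch.Tensor, cond: dict) -> torch.Tensor:
    """Rectified-flow objective: sample t, move along the straight path
    between noise x0 and data x1, and regress the constant velocity x1 - x0."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                        # Gaussian noise endpoint
    t = torch.rand(b, device=x1.device)
    t_b = t.view(b, *([1] * (x1.dim() - 1)))         # broadcastable timestep
    xt = (1.0 - t_b) * x0 + t_b * x1                 # linear interpolation path
    v_pred = model(xt, t, **cond)                    # conditioned MM-DiT pass
    return F.mse_loss(v_pred, x1 - x0)
```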

Key Experimental Results

Main Results

Single-subject generation (DreamBench):

| Method | DINO ↑ | CLIP-I ↑ | CLIP-T ↑ |
|---|---|---|---|
| DreamBooth (fine-tuning) | 0.668 | 0.803 | 0.305 |
| RealCustom++ | 0.702 | 0.794 | 0.318 |
| OmniGen | 0.693 | 0.801 | 0.315 |
| OminiControl | 0.684 | 0.799 | 0.312 |
| FLUX IP-Adapter | 0.582 | 0.820 | 0.288 |
| UNO (Ours) | 0.760 | 0.835 | 0.304 |
| Oracle (between reference images) | 0.774 | 0.885 | - |

Multi-subject generation:

| Method | DINO ↑ | CLIP-I ↑ | CLIP-T ↑ |
|---|---|---|---|
| MS-Diffusion | 0.525 | 0.726 | 0.319 |
| OmniGen | 0.511 | 0.722 | 0.331 |
| MIP-Adapter | 0.482 | 0.726 | 0.311 |
| UNO (Ours) | 0.542 | 0.733 | 0.322 |

Ablation Study

| Configuration | DINO ↑ | CLIP-I ↑ | CLIP-T ↑ |
|---|---|---|---|
| UNO (Full) | 0.542 | 0.733 | 0.322 |
| w/o generated \(I_{ref}^2\) (replaced by cropped image) | 0.529 | 0.730 | 0.308 |
| w/o cross-modal alignment (direct multi-subject training) | 0.511 | 0.721 | 0.322 |
| w/o UnoPE (cloned positional indices) | 0.386 | 0.674 | 0.323 |
| w/o offset (no positional offset) | 0.470 / 0.386 | 0.722 / 0.674 | 0.308 / 0.323 |
| w/ width-offset only | 0.717 / 0.508 | 0.813 / 0.724 | 0.304 / 0.321 |
| w/ height-offset only | 0.678 / 0.501 | 0.797 / 0.719 | 0.308 / 0.306 |
| w/ diagonal-offset (UnoPE) | 0.730 / 0.542 | 0.821 / 0.733 | 0.309 / 0.322 |

Rows with paired values report single-subject / multi-subject results.

Key Findings

  • UnoPE is the most critical component: Its removal causes DINO to plummet from 0.542 to 0.386 (−28.8%), demonstrating that positional encoding design is essential for multi-subject generation. Diagonal offset outperforms width-only or height-only offsets.
  • Progressive training yields substantial gains: Training directly on multi-subject data without progressive alignment reduces DINO by 5.7%. Notably, Stage II training reciprocally improves single-subject performance (DINO 0.730→0.760), indicating positive transfer between multi-subject and single-subject capabilities.
  • Synthetic data quality is critical: Generated \(I_{ref}^2\) outperforms directly cropped counterparts (DINO +2.5%), and VLM-scored, high-quality filtered data consistently boosts model performance.
  • UNO's DINO score of 0.760 approaches the Oracle value of 0.774, indicating that subject consistency is near the upper bound.

Highlights & Insights

  • Novel "model-data co-evolution" paradigm: Rather than passively waiting for data, the model actively leverages its own generative capability to produce training data for a stronger successor. This self-improvement strategy is inspired by weak-to-strong generalization in LLMs and represents the first systematic realization of this concept in visual generation.
  • Exploiting the intrinsic in-context capability of DiT: The observation that DiT models inherently possess the ability to generate subject-consistent images—activated through carefully constructed text templates—eliminates the need for complex data collection pipelines.
  • Elegant and effective diagonal-offset design in UnoPE: Spatial isolation via positional encoding resolves multi-subject attribute confusion, and the underlying principle is transferable to other DiT applications that handle multi-condition inputs.

Limitations & Future Work

  • The current framework supports only \(N=2\) reference subjects; the effectiveness and efficiency of scaling to a larger number of subjects remain to be validated.
  • The synthetic data pipeline depends on the in-context generation capability of the base T2I model and is inapplicable to models that lack this property.
  • The CLIP-T score is not the highest (0.304 vs. OmniGen's 0.315), indicating room for improvement in text controllability.
  • LoRA rank 512 entails a large parameter count, and fine-tuning efficiency warrants further optimization.

Comparison with Related Methods

  • vs. IP-Adapter/BLIP-Diffusion: These methods inject reference information via auxiliary image encoders, whereas UNO directly leverages the DiT's VAE and attention mechanism without any additional encoder.
  • vs. OminiControl/IC-LoRA: These methods also exploit DiT's intrinsic reference-image capability, but are limited to single-subject generation at low resolution (512×512). UNO extends to high-resolution multi-subject generation through a complete data pipeline and principled positional encoding design.
  • vs. DreamBooth: DreamBooth requires per-subject fine-tuning, whereas UNO enables zero-shot inference with superior DINO scores.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The model-data co-evolution paradigm and UnoPE are both original contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive single- and multi-subject evaluation with detailed ablations, though evaluation beyond two subjects is absent.
  • Writing Quality: ⭐⭐⭐⭐ — Method presentation is clear, though the data pipeline section is somewhat verbose.
  • Value: ⭐⭐⭐⭐⭐ — Open-source codebase, strong practical utility, and a production-grade solution from ByteDance.