BootComp: Controllable Human Image Generation with Personalized Multi-Garments
Conference: CVPR 2025
arXiv: 2411.16801
Code: https://omnious.github.io/BootComp
Area: Controllable Image Generation / Virtual Try-On
Keywords: Multi-Garment Human Generation, Synthetic Data Pipeline, Decomposition Network, Diffusion Model Composition, Virtual Try-On
TL;DR
BootComp trains a decomposition network that extracts product-view garment images from human images, and uses it to build a large-scale synthetic paired dataset. A dual-path diffusion model trained on this data then generates controllable human images conditioned on multiple reference garments, lowering MP-LPIPS by about 30% relative to the strongest baseline, Parts2Whole (0.187 vs. 0.267).
Background & Motivation
Background: Controllable human image generation with T2I diffusion models has key applications in fashion, such as garment recommendation, model image generation, and virtual try-on. The task is to generate a human image wearing multiple reference garments, conditioned on images of those garments.
Limitations of Prior Work: The core bottleneck is training data acquisition: collecting photos of the same person wearing every combination of garments is extremely difficult. Existing workarounds each fall short: (1) segmenting garments from the human image leads to "copy-paste" results (the output reproduces the reference without changing pose); (2) pairing different frames of a video yields data that is limited in scale and low in quality; (3) most training data contains only single garment-human pairs, so models fail to generalize to multi-garment compositions at inference.
Key Challenge: A large amount of high-quality "multi-garment to human" paired data is required, but collecting such paired data in practice is almost impossible.
Goal: To design a data generation pipeline to address the paired data dilemma and train a controllable generation model for human image generation with multi-garment compositions.
Key Insight: Train a "decomposition network" to map worn garments from human images to product-view images, thereby extracting reference garments from any human image to construct a large-scale synthetic paired dataset.
Core Idea: A two-stage framework: (1) a decomposition network plus quality filtering generates synthetic paired data; (2) a composition module built from two diffusion networks (a trainable encoder and a frozen generator) is trained on that data.
Method
Overall Architecture
Stage 1: Train a decomposition network \(f_\phi\) that extracts the product view of a single garment from a human image. Applying it to 240K human images produces a multi-garment paired dataset; after quality filtering, 54K high-quality pairs remain. Stage 2: Two SDXL diffusion networks are combined: a trainable encoder extracts features from the garment images, and a frozen generator uses these features to generate the human image.
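The quality-filtering step in Stage 1 can be sketched as a simple threshold filter. This is a minimal illustration, not the authors' code: `filter_pairs`, `similarity_fn`, and `toy_similarity` are hypothetical names, and the scoring function is a toy stand-in rather than a real perceptual metric. The default threshold follows the \(\tau = 0.4\) stated under Key Designs, and the comparison follows the wording there (low similarity is discarded); if the metric is returned as a distance instead, the comparison would flip.

```python
from typing import Callable, List, Tuple, TypeVar

Img = TypeVar("Img")


def filter_pairs(
    pairs: List[Tuple[Img, Img]],
    similarity_fn: Callable[[Img, Img], float],
    tau: float = 0.4,
) -> List[Tuple[Img, Img]]:
    """Keep (generated, segmented) pairs whose similarity score is at least tau."""
    return [(g, s) for g, s in pairs if similarity_fn(g, s) >= tau]


# Toy stand-in (assumption): "images" are plain floats and
# similarity is 1 - |a - b|. A real pipeline would plug in a
# perceptual metric here.
def toy_similarity(a: float, b: float) -> float:
    return 1.0 - abs(a - b)


toy_pairs = [(0.9, 1.0), (0.1, 1.0), (0.5, 0.6)]
kept = filter_pairs(toy_pairs, toy_similarity)
print(len(kept), "of", len(toy_pairs), "pairs kept")
```

Passing the metric in as a callable keeps the filter independent of any particular similarity model, so the threshold can be tuned without touching the rest of the pipeline.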
Key Designs
- Decomposition Module:
  - Function: Maps the worn garments of specific categories in a human image to product-view images.
  - Mechanism: Formulated as an image-to-image translation problem. The network is initialized from a pre-trained T2I diffusion model, and its input is the garment region \(x^s = S(y, m)\) segmented by a human parsing model. The keys/values of the segmented garment are concatenated into the generation path via extended self-attention. The text prompt is set to "A product photo of {category}" to leverage the T2I prior.
  - Design Motivation: The decomposition network is trained on single garment-human pairs, which are easy to collect. Once trained, it can extract every garment from arbitrary human images, which greatly increases the amount of paired data.
- Synthetic Data Quality Filtering:
  - Function: Removes low-quality garment images generated by the decomposition network.
  - Mechanism: Calculates the perceptual similarity (using DreamSim) between the generated product view \(\tilde{x}\) and the segmented region \(x^s\). Pairs with similarity below the threshold \(\tau=0.4\) are discarded. Ultimately, 54K pairs are retained out of 240K.
  - Design Motivation: The decomposition network may generate poor-quality garment images when human parsing results are inaccurate. Low-quality data would severely degrade the training of the composition module.
- Composition Module:
  - Function: Generates human images conditioned on multiple garments.
  - Mechanism: Utilizes two SDXL networks: a trainable encoder \(g_\theta\) and a frozen generator \(g_{\theta^-}\). Each garment \(\tilde{x}_i\) is processed by the encoder to extract hidden states, which are used to condition the generator's self-attention layers through key/value concatenation: queries come from the human image path, while keys/values concatenate features of all garments.
  - Design Motivation: Freezing the generator allows BootComp to seamlessly integrate with other adapter modules (e.g., ControlNet, IP-Adapter) for downstream tasks like pose control and stylization without additional fine-tuning.
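The key/value concatenation used by both the decomposition and composition modules can be illustrated with a small single-head attention sketch in plain Python. It rests on several simplifying assumptions that are not from the paper: tokens are short lists of floats, the learned query/key/value projections are replaced by identity maps, and the function names (`softmax`, `attention`, `extended_self_attention`) are hypothetical. It only shows the data flow: queries come from the human path, while keys and values are the human tokens followed by the tokens of every garment.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def extended_self_attention(human_tokens, garment_token_sets):
    """Queries from the human path; keys/values from human + all garment tokens."""
    kv = list(human_tokens)
    for tokens in garment_token_sets:
        kv.extend(tokens)
    # Identity projections (assumption): real layers apply learned W_q, W_k, W_v.
    return attention(human_tokens, kv, kv)


human = [[1.0, 0.0], [0.0, 1.0]]   # two tokens on the human image path
top = [[1.0, 1.0]]                 # one token from a first garment
pants = [[0.5, -0.5], [2.0, 0.0]]  # two tokens from a second garment
out = extended_self_attention(human, [top, pants])
```

The output keeps the shape of the human tokens regardless of how many garments are supplied, since extra garments only lengthen the key/value sequence.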
Loss & Training
Training uses the standard diffusion \(\epsilon\)-prediction loss. The decomposition network is trained for 140K iterations on 4 H100 GPUs, and the composition module for 115K iterations on 8 H100 GPUs. Inference uses DDPM sampling with 50 steps and a CFG scale of 2.0.
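As a reference for the two quantities named above, the \(\epsilon\)-prediction objective and classifier-free guidance are commonly written as follows. The notation is an assumption for illustration and is not taken from the paper: \(x_t\) is the noised latent at timestep \(t\), \(c\) stands for all conditioning (text prompt and garment features), \(\hat{\epsilon}\) is the network's noise prediction, and \(s\) is the guidance scale.

\[
\mathcal{L} = \mathbb{E}_{x_0,\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\left\lVert \epsilon - \hat{\epsilon}(x_t, t, c) \right\rVert_2^2\right]
\]

\[
\tilde{\epsilon}(x_t, t, c) = \hat{\epsilon}(x_t, t, \varnothing) + s\,\bigl(\hat{\epsilon}(x_t, t, c) - \hat{\epsilon}(x_t, t, \varnothing)\bigr), \qquad s = 2.0
\]

Which conditions are dropped to form the unconditional branch \(\varnothing\) is not specified here, so the second formula should be read as the generic form only.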
Key Experimental Results
Main Results
| Method | MP-LPIPS↓ | DINO↑ | M-DINO↑ | FID↓ |
|---|---|---|---|---|
| MIP-Adapter | 0.276 | 0.308 | 0.025 | 59.99 |
| Parts2Whole | 0.267 | 0.362 | 0.036 | 28.39 |
| BootComp (Ours) | 0.187 | 0.379 | 0.046 | 27.63 |
Ablation Study
| Configuration | MP-LPIPS↓ | FID↓ | Description |
|---|---|---|---|
| Training with segmented paired data | 0.374 | 59.27 | Severe copy-paste issue |
| Training with synthetic paired data | 0.197 | 29.41 | Significant improvement |
| + 54K filtered data | Best | Best | Filtering improves quality |
Data scalability experiment: as the training set grows from 5K to 15K, 30K, and 50K pairs, FID drops consistently from 34.15 to 25.88, indicating that the approach benefits from more synthetic data.
Key Findings
- BootComp lowers MP-LPIPS by about 30% relative to Parts2Whole (0.267 to 0.187), indicating better preservation of garment details.
- Synthetic data vs. segmented data: MP-LPIPS drops from 0.374 to 0.197 and FID from 59.27 to 29.41, suggesting that product views generated by the decomposition network are far better training references than simple segmentation.
- Because the generator is frozen, BootComp gains pose control, style transfer, and virtual try-on capabilities without additional fine-tuning.
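The relative improvements quoted in these findings can be checked with a few lines of arithmetic. The helper name `relative_reduction` is hypothetical; the numbers are taken from the tables and the scalability experiment above.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fractional reduction of a lower-is-better metric relative to a baseline."""
    return (baseline - ours) / baseline


# MP-LPIPS: Parts2Whole vs. BootComp (main results table)
mp_lpips_gain = relative_reduction(0.267, 0.187)
# FID: segmented vs. synthetic paired data (ablation table)
fid_ablation_gain = relative_reduction(59.27, 29.41)
# FID: 5K vs. 50K training pairs (scalability experiment)
fid_scaling_gain = relative_reduction(34.15, 25.88)

for name, gain in [
    ("MP-LPIPS vs. Parts2Whole", mp_lpips_gain),
    ("FID, synthetic vs. segmented", fid_ablation_gain),
    ("FID, 50K vs. 5K pairs", fid_scaling_gain),
]:
    print(f"{name}: {gain:.1%}")
```

The first value comes out just under 30%, which matches the headline claim; the ablation corresponds to roughly a halving of FID.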
Highlights & Insights
- The data pipeline is the core contribution: It solves the fundamental bottleneck of obtaining multi-garment paired data. The decomposition network + quality filtering pipeline is reusable.
- Design philosophy of the frozen generator: Only training the encoder endows the system with strong compositionality—changing styles or control modes requires no retraining.
- Wide application range: Virtual try-on, pose control, cartoonization, and personalized generation are all achieved within a single framework.
Limitations & Future Work
- The decomposition network relies on the quality of the human parsing model, meaning parsing errors will propagate to subsequent stages.
- Resolution is limited to 512×384, and the effectiveness at higher resolutions has not been verified.
- The capability to handle complex garment combinations (e.g., layering, accessories) needs further improvement.
Related Work & Insights
- vs. Parts2Whole: The most direct competitor; the proposed method has a clear advantage in detail preservation.
- vs. MIP-Adapter: General multi-condition generation, not optimized for garment details.
- The concept of the data generation pipeline can be transferred to other controllable generation tasks that require paired data.
Rating
- Novelty: ⭐⭐⭐⭐ The idea of using a decomposition network to construct synthetic paired data is highly ingenious.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive comparison + ablations + scalability + multi-application demonstrations.
- Writing Quality: ⭐⭐⭐⭐ Clear framework and strong motivation.
- Value: ⭐⭐⭐⭐⭐ Directly applicable to the fashion AI industry; the data pipeline scheme is highly reusable.