Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories¶
Conference: CVPR 2026 | arXiv: 2603.14153 | Code: GitHub | Area: Virtual Try-On / Dataset | Keywords: virtual try-on, multi-reference images, outfit-level, dataset construction, image generation
TL;DR¶
This paper introduces Garments2Look, the first large-scale multimodal outfit-level virtual try-on dataset (80K pairs, 40 categories, 300+ subcategories). Each sample contains 3–12 reference garment images, a model outfit image, and detailed textual annotations. The dataset exposes significant shortcomings of existing methods in multi-layer outfit composition and accessory consistency.
Background & Motivation¶
Virtual try-on (VTON) has achieved notable progress in single-garment visualization, yet real-world fashion scenarios demand much more — users expect full outfit previews involving multiple garments, accessories, fine-grained categories, layered wearing styles, and diverse styling techniques.
Structural deficiencies of existing datasets:

- VITON-HD and DressCode support only single-garment try-on with limited categories (1–3)
- M&M VTO and BootComp support multi-reference inputs but lack category diversity
- No existing dataset simultaneously provides layering order, styling technique annotations, and multi-accessory support
New challenges posed by outfit-level VTON:

- Complex layering and occlusion relationships among garments (e.g., a knit cardigan can be worn as an outer layer or tucked inside)
- Diverse styling techniques (regular wear, draped over shoulders, tied at the waist, rolled-up sleeves, etc.)
- Reference counts ranging from 3 to 12, imposing high demands on multi-reference consistency
Method¶
Overall Architecture¶
The dataset construction follows a four-stage pipeline: Data Collection → Data Synthesis → Data Filtering → Data Evaluation. The pipeline combines real data (the Gold Standard subset) with synthetic data, with quality ensured through rigorous filtering and expert review.
Key Designs¶
- Data Sources and Classification Strategy
Samples are divided into three categories based on data completeness (a routing sketch follows below):

- Gold Standard (50.2%): complete garment image + model outfit image pairs
- Outfit plan available but no look image (24.0%): the look image must be synthesized
- Garment images only, no outfit plan (25.8%): both the outfit plan and the look image must be synthesized
Sources include: outfit compatibility learning datasets (PolyVore), open-source fashion datasets, publicly available web images (strictly compliant), and synthetic data.
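The paper does not publish routing code for these completeness categories; a minimal Python sketch of the implied logic (all names are hypothetical) could look like this:

```python
from enum import Enum, auto

# Hypothetical sketch of the completeness-based routing described above;
# the paper does not publish this logic.
class SampleKind(Enum):
    GOLD = auto()            # garment images + model look image (50.2%)
    PLAN_NO_LOOK = auto()    # outfit plan present, look image missing (24.0%)
    GARMENTS_ONLY = auto()   # garment images only (25.8%)

def synthesis_steps(kind: SampleKind) -> list[str]:
    if kind is SampleKind.GOLD:
        return []                                  # usable as-is
    if kind is SampleKind.PLAN_NO_LOOK:
        return ["synthesize_look"]                 # render the look image only
    return ["synthesize_plan", "synthesize_look"]  # build the plan, then render

print(synthesis_steps(SampleKind.GARMENTS_ONLY))
```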
- Outfit Synthesis Pipeline
A RAG-inspired heuristic outfit construction process:

- Step 1: Construct a knowledge base of 65 fashion styles (35 female / 30 male), each generated by an LLM and reviewed by fashion experts
- Step 2: Randomly select a style → the LLM generates a user profile and wearing scenario (including occasion, color palette, theme, and garment categories)
- Step 3: The LLM generates a 3–9 item outfit list under style-knowledge constraints, ordered "top-to-bottom, inner-to-outer, garments-to-accessories"
- Step 4: Retrieve the top-128 candidates per item → inverse-frequency weighted sampling to prevent overrepresentation of popular items (a sketch follows this list)
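The inverse-frequency weighting in Step 4 is easy to make concrete. A minimal sketch, assuming each candidate carries a running count of how often it has already been selected (the function and data structures are hypothetical, not the authors' code):

```python
import random
from collections import defaultdict

# Hypothetical sketch: sample one item from the top-128 retrieved candidates,
# weighting each candidate by the inverse of how often it has been picked so far.
usage_counts = defaultdict(int)  # item_id -> times already sampled

def sample_candidate(candidate_ids: list[str]) -> str:
    # Inverse-frequency weight: unseen items get weight 1, frequently picked
    # items get progressively smaller weights, mitigating popularity bias.
    weights = [1.0 / (1 + usage_counts[cid]) for cid in candidate_ids]
    chosen = random.choices(candidate_ids, weights=weights, k=1)[0]
    usage_counts[chosen] += 1
    return chosen

# Usage: pick one garment per outfit slot from its retrieved candidates.
top_128 = [f"item_{i}" for i in range(128)]  # placeholder retrieval result
selected = sample_candidate(top_128)
```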
- Look Image Synthesis
All items in the outfit list are arranged into a single OOTD grid image, which serves as the unified input to Nano Banana (Gemini-2.5-Flash-Image). Compared with multiple separate inputs, the grid image better preserves inter-item consistency. Prompt engineering injects the layering order and styling techniques (e.g., "tuck the top into the trousers," "roll up the sleeves"), drawn from 5 defined technique categories.
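The paper does not describe the grid layout in code; a minimal Pillow sketch of assembling item images into one OOTD grid (the tile size and column count are assumptions) might look like:

```python
from PIL import Image

# Minimal sketch (assumptions: square tiles, white background, row-major layout).
# The actual grid layout used by the paper is not specified.
TILE = 512      # side length of each item tile, hypothetical
COLS = 3        # items per row, hypothetical

def make_ootd_grid(item_paths: list[str]) -> Image.Image:
    rows = (len(item_paths) + COLS - 1) // COLS
    grid = Image.new("RGB", (COLS * TILE, rows * TILE), "white")
    for i, path in enumerate(item_paths):
        tile = Image.open(path).convert("RGB")
        # Preserve aspect ratio inside the tile, then center it in its grid slot.
        tile.thumbnail((TILE, TILE))
        x = (i % COLS) * TILE + (TILE - tile.width) // 2
        y = (i // COLS) * TILE + (TILE - tile.height) // 2
        grid.paste(tile, (x, y))
    return grid

# Items ordered "top-to-bottom, inner-to-outer, garments-to-accessories",
# matching the outfit-list ordering described above.
grid = make_ootd_grid(["shirt.jpg", "cardigan.jpg", "trousers.jpg", "belt.jpg"])
grid.save("ootd_grid.png")
```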
- Three-Level Data Filtering

Filtering operates at three granularities:

- Item level: a standardized taxonomy of 40 major categories and 300+ fine-grained subcategories
- Outfit level: rule-based rationality validation grounded in fashion domain knowledge (e.g., no outfit should contain two dresses simultaneously; see the sketch after this list)
- Pair level: automatic screening by Gemini-2.5-Flash + DWPose-based pose classification + manual review by 10 fashion students and 3 domain experts

Only approximately 40% of synthesized look images pass the final review.
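The full rule set for outfit-level validation is not published; a minimal sketch of how such rules could be encoded (the category names and the extra exclusivity rule are assumptions beyond the paper's two-dresses example):

```python
from collections import Counter

# Hypothetical sketch of outfit-level rationality rules; the paper's actual
# rule set is not published. An outfit is represented as a list of categories.

# Categories that may appear at most once per outfit (assumed examples).
MAX_ONE = {"dress", "trousers", "skirt", "coat"}
# Mutually exclusive category groups (the dress/trousers pairing is an
# assumed illustration; only the two-dresses rule comes from the paper).
EXCLUSIVE = [({"dress"}, {"trousers", "skirt"})]

def outfit_is_rational(categories: list[str]) -> bool:
    counts = Counter(categories)
    # Rule 1: no duplicated single-instance category (e.g., two dresses).
    if any(counts[c] > 1 for c in MAX_ONE):
        return False
    # Rule 2: no mutually exclusive category combination.
    present = set(counts)
    for group_a, group_b in EXCLUSIVE:
        if present & group_a and present & group_b:
            return False
    return True

assert not outfit_is_rational(["dress", "dress", "heels"])  # two dresses
assert outfit_is_rational(["shirt", "trousers", "belt"])    # plausible outfit
```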
Loss & Training¶
This paper is a dataset contribution and does not involve model training. The evaluation protocol includes:

- Standard VTON metrics: FID, KID, SSIM, LPIPS
- VLM-based evaluation metrics (Gemini-3-Flash): garment consistency, layering accuracy, styling technique accuracy
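For the standard metrics, one off-the-shelf route (not necessarily what the authors used) is the torchmetrics package, which implements all four; a minimal sketch:

```python
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# Random uint8 tensors stand in for real and generated look images.
real_imgs = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)
fake_imgs = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)

# FID/KID compare feature distributions over the two image sets.
fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=4)  # must not exceed the set size
for metric in (fid, kid):
    metric.update(real_imgs, real=True)
    metric.update(fake_imgs, real=False)

# SSIM/LPIPS compare aligned image pairs; both expect float inputs here.
real_f, fake_f = real_imgs.float() / 255.0, fake_imgs.float() / 255.0
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)

print("FID  :", fid.compute().item())
print("KID  :", kid.compute()[0].item())   # compute() returns (mean, std)
print("SSIM :", ssim(fake_f, real_f).item())
print("LPIPS:", lpips(fake_f, real_f).item())
```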
Key Experimental Results¶
Main Results¶
Method comparison on the Garments2Look test set:
| Method Type | Model | FID↓ | SSIM↑ | Garment Consistency↑ | Layering Accuracy↑ | Styling Accuracy↑ |
|---|---|---|---|---|---|---|
| VTON | FastFit | 3.59 | 0.855 | 0.624 | 0.131 | 0.340 |
| VTON | OmniTry | 6.56 | 0.724 | 0.461 | 0.167 | 0.261 |
| Editing | GPT-4o (2 Ref) | 2.15 | 0.758 | 0.892 | 0.849 | 0.694 |
| Editing | NB (2 Ref) | 1.04 | 0.858 | 0.925 | 0.885 | 0.739 |
| Editing | NBP (N Ref) | 1.32 | 0.817 | 0.984 | 0.936 | 0.736 |
Ablation Study¶
| Configuration | Key Metric | Notes |
|---|---|---|
| N Ref (individual items) vs. 2 Ref (OOTD grid) | 2 Ref generally superior | Grid image preserves richer outfit context |
| ≤4 references vs. >4 references | Consistency degrades for all methods when >4 | Particularly severe for VTON models |
| VTON models vs. general editing models | Editing models consistently outperform VTON | VTON lacks flexible multi-garment handling |
| Synthetic vs. real data quality | Expert scores of 4.35–4.74/5 | Synthetic data quality is controllable given rigorous filtering |
Key Findings¶
- Dedicated VTON models fail across the board on outfit-level tasks: layering accuracy is only 13–17%, and styling technique accuracy is only 26–34%
- General-purpose editing models (GPT-4o, Nano Banana) substantially outperform dedicated VTON models on outfit-level VTON
- As the number of reference items increases, consistency degrades significantly across all methods — shape distortion, texture alteration, color deviation, and item merging are the primary failure modes
- OOTD grid input (2 Ref strategy) generally outperforms multiple separate inputs (N Ref), as the holistic reference implicitly encodes outfit relationships
- Even state-of-the-art editing models cannot precisely control non-standard styling techniques (e.g., half-buttoned outerwear, untucked mid-layers)
Highlights & Insights¶
- First truly outfit-level VTON dataset: 40 major categories, 300+ subcategories, with layering and styling technique annotations — filling a critical gap in the field
- The data synthesis pipeline's fashion knowledge base + RAG-style retrieval + inverse-frequency sampling is elegantly designed to ensure diversity while mitigating popularity bias
- Experiments are thorough and purposeful: four progressively structured research questions (reference count limits, consistency, overall quality, value of structured annotations) systematically expose performance bottlenecks
- In-depth analysis of commercial editing models (Nano Banana vs. GPT-4o vs. Seedream) provides valuable industrial perspective
Limitations & Future Work¶
- Synthesized look images rely on Nano Banana, whose pose control and inpainting capabilities are limited, introducing unavoidable synthesis artifacts
- Only approximately 40% of synthesized images pass review, resulting in relatively low data construction efficiency
- Layering and styling annotations are generated automatically by VLMs, limiting annotation precision
- A video try-on dimension is absent (dynamic outfit effects better reflect real-world needs)
- Evaluation still relies on VLM-based scoring; no outfit-level automated metrics have been established
Related Work & Insights¶
- VITON-HD / DressCode established the foundation for high-resolution VTON datasets but are limited to single garments
- BootComp first proposed a try-off synthesis pipeline and data filtering strategy; this work substantially extends that approach
- The inverse-frequency sampling mechanism is generalizable to other retrieval-augmented generation scenarios requiring bias mitigation
- The finding that general editing models outperform dedicated VTON models suggests the field may need to shift from "specialized pipelines" toward a new paradigm of "general editing with domain constraints"
Rating¶
- Novelty: ⭐⭐⭐⭐ First large-scale outfit-level VTON dataset; both the task definition and annotation schema are novel
- Experimental Thoroughness: ⭐⭐⭐⭐ 7 model baselines (VTON + general editing), 4 progressively structured analysis questions, quantitative + qualitative + human evaluation
- Writing Quality: ⭐⭐⭐⭐ Data construction process is described in thorough detail; problem-driven experimental analysis is logically structured
- Value: ⭐⭐⭐⭐⭐ Dataset and code are open-sourced, filling an important gap and providing sustained impetus for the VTON research direction