Garments2Look: A Multi-Reference Dataset for High-Fidelity Outfit-Level Virtual Try-On with Clothing and Accessories¶
Conference: CVPR 2026 | arXiv: 2603.14153 | Code: GitHub | Area: Virtual Try-On / Dataset | Keywords: virtual try-on, multi-reference images, outfit-level, dataset construction, image generation
TL;DR¶
This paper introduces Garments2Look, the first large-scale multimodal outfit-level virtual try-on dataset (80K pairs, 40 categories, 300+ subcategories). Each sample contains 3–12 reference garment images, a model outfit image, and detailed textual annotations. The dataset exposes significant shortcomings of existing methods in multi-layer outfit composition and accessory consistency.
Background & Motivation¶
Virtual try-on (VTON) has achieved notable progress in single-garment visualization, yet real-world fashion scenarios demand much more — users expect full outfit previews involving multiple garments, accessories, fine-grained categories, layered wearing styles, and diverse styling techniques.
Structural deficiencies of existing datasets:

- VITON-HD and DressCode support only single-garment try-on with limited categories (1–3)
- M&M VTO and BootComp support multi-reference inputs but lack category diversity
- No existing dataset simultaneously provides layering order, styling technique annotations, and multi-accessory support
New challenges posed by outfit-level VTON:

- Complex layering and occlusion relationships among garments (e.g., a knit cardigan can be worn as an outer layer or tucked inside)
- Diverse styling techniques (regular wear, draped over shoulders, tied at the waist, rolled-up sleeves, etc.)
- Reference counts ranging from 3 to 12, imposing high demands on multi-reference consistency
Method¶
Overall Architecture¶
The dataset construction follows a four-stage pipeline: Data Collection → Data Synthesis → Data Filtering → Data Evaluation. The pipeline combines real data (the Gold Standard subset) with synthetic data, with quality ensured through rigorous filtering and expert review.
Key Designs¶
- Data Sources and Classification Strategy
Samples are divided into three categories based on data completeness (a routing sketch follows below):

- Gold Standard (50.2%): complete garment image + model outfit image pairs
- Outfit plan available but no look image (24.0%): the look image must be synthesized
- Garment images only, no outfit plan (25.8%): both the outfit plan and the look image must be synthesized
Sources include: outfit compatibility learning datasets (PolyVore), open-source fashion datasets, publicly available web images (strictly compliant), and synthetic data.
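The paper does not publish routing code for these completeness categories; a minimal Python sketch of the implied logic (all names are hypothetical) could look like this:

```python
from enum import Enum, auto

# Hypothetical sketch of the completeness-based routing described above;
# the paper does not publish this logic.
class SampleKind(Enum):
    GOLD = auto()            # garment images + model look image (50.2%)
    PLAN_NO_LOOK = auto()    # outfit plan present, look image missing (24.0%)
    GARMENTS_ONLY = auto()   # garment images only (25.8%)

def synthesis_steps(kind: SampleKind) -> list[str]:
    if kind is SampleKind.GOLD:
        return []                                  # usable as-is
    if kind is SampleKind.PLAN_NO_LOOK:
        return ["synthesize_look"]                 # render the look image only
    return ["synthesize_plan", "synthesize_look"]  # build the plan, then render

print(synthesis_steps(SampleKind.GARMENTS_ONLY))
```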
- Outfit Synthesis Pipeline
A RAG-inspired heuristic outfit construction process:

- Step 1: Construct a knowledge base of 65 fashion styles (35 female / 30 male), each generated by an LLM and reviewed by fashion experts
- Step 2: Randomly select a style → the LLM generates a user profile and wearing scenario (including occasion, color palette, theme, and garment categories)
- Step 3: The LLM generates a 3–9 item outfit list under style-knowledge constraints, ordered "top-to-bottom, inner-to-outer, garments-to-accessories"
- Step 4: Retrieve the top-128 candidates per item → inverse-frequency weighted sampling to prevent overrepresentation of popular items (a sketch follows this list)
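The inverse-frequency weighting in Step 4 is easy to make concrete. A minimal sketch, assuming each candidate carries a running count of how often it has already been selected (the function and data structures are hypothetical, not the authors' code):

```python
import random
from collections import defaultdict

# Hypothetical sketch: sample one item from the top-128 retrieved candidates,
# weighting each candidate by the inverse of how often it has been picked so far.
usage_counts = defaultdict(int)  # item_id -> times already sampled

def sample_candidate(candidate_ids: list[str]) -> str:
    # Inverse-frequency weight: unseen items get weight 1, frequently picked
    # items get progressively smaller weights, mitigating popularity bias.
    weights = [1.0 / (1 + usage_counts[cid]) for cid in candidate_ids]
    chosen = random.choices(candidate_ids, weights=weights, k=1)[0]
    usage_counts[chosen] += 1
    return chosen

# Usage: pick one garment per outfit slot from its retrieved candidates.
top_128 = [f"item_{i}" for i in range(128)]  # placeholder retrieval result
selected = sample_candidate(top_128)
```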
- Look Image Synthesis
All items in the outfit list are arranged into a single OOTD grid image, which serves as the unified input to Nano Banana (Gemini-2.5-Flash-Image). Compared with multiple separate inputs, the grid image better preserves inter-item consistency. Prompt engineering injects the layering order and styling techniques (e.g., "tuck the top into the trousers," "roll up the sleeves"), drawn from 5 defined technique categories.
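The paper does not describe the grid layout in code; a minimal Pillow sketch of assembling item images into one OOTD grid (the tile size and column count are assumptions) might look like:

```python
from PIL import Image

# Minimal sketch (assumptions: square tiles, white background, row-major layout).
# The actual grid layout used by the paper is not specified.
TILE = 512      # side length of each item tile, hypothetical
COLS = 3        # items per row, hypothetical

def make_ootd_grid(item_paths: list[str]) -> Image.Image:
    rows = (len(item_paths) + COLS - 1) // COLS
    grid = Image.new("RGB", (COLS * TILE, rows * TILE), "white")
    for i, path in enumerate(item_paths):
        tile = Image.open(path).convert("RGB")
        # Preserve aspect ratio inside the tile, then center it in its grid slot.
        tile.thumbnail((TILE, TILE))
        x = (i % COLS) * TILE + (TILE - tile.width) // 2
        y = (i // COLS) * TILE + (TILE - tile.height) // 2
        grid.paste(tile, (x, y))
    return grid

# Items ordered "top-to-bottom, inner-to-outer, garments-to-accessories",
# matching the outfit-list ordering described above.
grid = make_ootd_grid(["shirt.jpg", "cardigan.jpg", "trousers.jpg", "belt.jpg"])
grid.save("ootd_grid.png")
```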
- Three-Level Data Filtering

Filtering operates at three granularities:

- Item level: a standardized taxonomy of 40 major categories and 300+ fine-grained subcategories
- Outfit level: rule-based rationality validation grounded in fashion domain knowledge (e.g., no outfit should contain two dresses simultaneously; see the sketch after this list)
- Pair level: automatic screening by Gemini-2.5-Flash + DWPose-based pose classification + manual review by 10 fashion students and 3 domain experts

Only approximately 40% of synthesized look images pass the final review.
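The full rule set for outfit-level validation is not published; a minimal sketch of how such rules could be encoded (the category names and the extra exclusivity rule are assumptions beyond the paper's two-dresses example):

```python
from collections import Counter

# Hypothetical sketch of outfit-level rationality rules; the paper's actual
# rule set is not published. An outfit is represented as a list of categories.

# Categories that may appear at most once per outfit (assumed examples).
MAX_ONE = {"dress", "trousers", "skirt", "coat"}
# Mutually exclusive category groups (the dress/trousers pairing is an
# assumed illustration; only the two-dresses rule comes from the paper).
EXCLUSIVE = [({"dress"}, {"trousers", "skirt"})]

def outfit_is_rational(categories: list[str]) -> bool:
    counts = Counter(categories)
    # Rule 1: no duplicated single-instance category (e.g., two dresses).
    if any(counts[c] > 1 for c in MAX_ONE):
        return False
    # Rule 2: no mutually exclusive category combination.
    present = set(counts)
    for group_a, group_b in EXCLUSIVE:
        if present & group_a and present & group_b:
            return False
    return True

assert not outfit_is_rational(["dress", "dress", "heels"])  # two dresses
assert outfit_is_rational(["shirt", "trousers", "belt"])    # plausible outfit
```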
Loss & Training¶
This paper is a dataset contribution and does not involve model training. The evaluation protocol includes:

- Standard VTON metrics: FID, KID, SSIM, LPIPS
- VLM-based evaluation metrics (Gemini-3-Flash): garment consistency, layering accuracy, styling technique accuracy
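For the standard metrics, one off-the-shelf route (not necessarily what the authors used) is the torchmetrics package, which implements all four; a minimal sketch:

```python
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# Random uint8 tensors stand in for real and generated look images.
real_imgs = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)
fake_imgs = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)

# FID/KID compare feature distributions over the two image sets.
fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=4)  # must not exceed the set size
for metric in (fid, kid):
    metric.update(real_imgs, real=True)
    metric.update(fake_imgs, real=False)

# SSIM/LPIPS compare aligned image pairs; both expect float inputs here.
real_f, fake_f = real_imgs.float() / 255.0, fake_imgs.float() / 255.0
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)

print("FID  :", fid.compute().item())
print("KID  :", kid.compute()[0].item())   # compute() returns (mean, std)
print("SSIM :", ssim(fake_f, real_f).item())
print("LPIPS:", lpips(fake_f, real_f).item())
```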
Key Experimental Results¶
Main Results¶
Method comparison on the Garments2Look test set:
| Method Type | Model | FID↓ | SSIM↑ | Garment Consistency↑ | Layering Accuracy↑ | Styling Accuracy↑ |
|---|---|---|---|---|---|---|
| VTON | FastFit | 3.59 | 0.855 | 0.624 | 0.131 | 0.340 |
| VTON | OmniTry | 6.56 | 0.724 | 0.461 | 0.167 | 0.261 |
| Editing | GPT-4o (2 Ref) | 2.15 | 0.758 | 0.892 | 0.849 | 0.694 |
| Editing | NB (2 Ref) | 1.04 | 0.858 | 0.925 | 0.885 | 0.739 |
| Editing | NBP (N Ref) | 1.32 | 0.817 | 0.984 | 0.936 | 0.736 |
Ablation Study¶
| Configuration | Key Metric | Notes |
|---|---|---|
| N Ref (individual items) vs. 2 Ref (OOTD grid) | 2 Ref generally superior | Grid image preserves richer outfit context |
| ≤4 references vs. >4 references | Consistency degrades for all methods when >4 | Particularly severe for VTON models |
| VTON models vs. general editing models | Editing models consistently outperform VTON | VTON lacks flexible multi-garment handling |
| Synthetic vs. real data quality | Expert scores of 4.35–4.74/5 | Synthetic data quality is controllable given rigorous filtering |
Key Findings¶
- Dedicated VTON models fail across the board on outfit-level tasks: layering accuracy is only 13–17%, and styling technique accuracy is only 26–34%
- General-purpose editing models (GPT-4o, Nano Banana) substantially outperform dedicated VTON models on outfit-level VTON
- As the number of reference items increases, consistency degrades significantly across all methods — shape distortion, texture alteration, color deviation, and item merging are the primary failure modes
- OOTD grid input (2 Ref strategy) generally outperforms multiple separate inputs (N Ref), as the holistic reference implicitly encodes outfit relationships
- Even state-of-the-art editing models cannot precisely control non-standard styling techniques (e.g., half-buttoned outerwear, untucked mid-layers)
Highlights & Insights¶
- First truly outfit-level VTON dataset: 40 major categories, 300+ subcategories, with layering and styling technique annotations — filling a critical gap in the field
- The data synthesis pipeline's fashion knowledge base + RAG-style retrieval + inverse-frequency sampling is elegantly designed to ensure diversity while mitigating popularity bias
- Experiments are thorough and purposeful: four progressively structured research questions (reference count limits, consistency, overall quality, value of structured annotations) systematically expose performance bottlenecks
- In-depth analysis of commercial editing models (Nano Banana vs. GPT-4o vs. Seedream) provides valuable industrial perspective
Limitations & Future Work¶
- Synthesized look images rely on Nano Banana, whose pose control and inpainting capabilities are limited, introducing unavoidable synthesis artifacts
- Only approximately 40% of synthesized images pass review, resulting in relatively low data construction efficiency
- Layering and styling annotations are generated automatically by VLMs, limiting annotation precision
- A video try-on dimension is absent (dynamic outfit effects better reflect real-world needs)
- Evaluation still relies on VLM-based scoring; no outfit-level automated metrics have been established
Related Work & Insights¶
- VITON-HD / DressCode established the foundation for high-resolution VTON datasets but are limited to single garments
- BootComp first proposed a try-off synthesis pipeline and data filtering strategy; this work substantially extends that approach
- The inverse-frequency sampling mechanism is generalizable to other retrieval-augmented generation scenarios requiring bias mitigation
- The finding that general editing models outperform dedicated VTON models suggests the field may need to shift from "specialized pipelines" toward a new paradigm of "general editing with domain constraints"
Rating¶
- Novelty: ⭐⭐⭐⭐ First large-scale outfit-level VTON dataset; both the task definition and annotation schema are novel
- Experimental Thoroughness: ⭐⭐⭐⭐ 7 model baselines (VTON + general editing), 4 progressively structured analysis questions, quantitative + qualitative + human evaluation
- Writing Quality: ⭐⭐⭐⭐ Data construction process is described in thorough detail; problem-driven experimental analysis is logically structured
- Value: ⭐⭐⭐⭐⭐ Dataset and code are open-sourced, filling an important gap and providing sustained impetus for the VTON research direction