Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback¶
Conference: CVPR 2026 · arXiv: 2603.13057 · Code: GitHub · Area: Human Understanding · Keywords: Virtual try-on, reference-free quality assessment, human feedback alignment, interleaved cross-attention, VTON-QBench
TL;DR¶
This work constructs VTON-QBench (62,688 try-on images, 13,838 qualified annotators, 431,800 annotations) and proposes VTON-IQA, a reference-free image quality assessment framework that jointly models garment fidelity and person preservation via an asymmetric Interleaved Cross-Attention (ICA) module, achieving image-level quality prediction highly aligned with human perception.
Background & Motivation¶
Background: Virtual try-on (VTON) is increasingly important in fashion e-commerce, with the synthesis of wearing results from a person image and a garment image as the core task. Try-on quality has continuously improved from GANs to U-Net diffusion models and DiT architectures.
Limitations of Prior Work:
- Ground-truth images are unavailable in real-world scenarios (it is infeasible to photograph the same person wearing the target garment), rendering reference-based metrics such as SSIM/LPIPS inapplicable.
- FID/KID measure only dataset-level distributional similarity and cannot reflect single-image quality.
- Existing VTON-specific evaluation methods (VTONQA, VTBench, VTON-VLLM) either operate on small datasets (748 pairs), lack publicly available implementations, or are not validated at scale with human annotations.
Key Challenge: Single-image-level, reference-free quality assessment aligned with human perception is required, yet no existing tool satisfies all three criteria simultaneously.
Goal: Establish a large-scale human-annotated benchmark and train a reference-free quality prediction model.
Key Insight: Try-on quality fundamentally involves verifying two aspects—(1) whether the garment is faithfully transferred, and (2) whether non-target regions are preserved—naturally necessitating cross-image interaction modeling.
Core Idea: Construct a large-scale human-annotated dataset and explicitly model the consistency between the try-on result and the garment/person images via asymmetric cross-attention.
Method¶
Overall Architecture¶
A three-branch Transformer architecture (based on DINOv3 ViT-L/16) processes the garment \(I_G\), person \(I_P\), and try-on result \(I_V\) independently. The first \(L/2\) layers extract features independently; the latter \(L/2\) layers incorporate ICA modules for cross-image interaction. Three [CLS] tokens are extracted, fused via learnable weighted cosine similarity, and mapped through \(\tanh\) to \([-1, 1]\) as the quality score.
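A minimal PyTorch sketch of this forward pass follows. The function and argument names (`embed`, `blocks`, `ica_blocks`, `score_head`) are hypothetical, and whether the three branches share backbone weights is an assumption, not a confirmed detail of the paper:

```python
from typing import Callable, Sequence
import torch

def vton_iqa_forward(
    embed: Callable,                 # patch-embedding stem (e.g., from DINOv3)
    blocks: Sequence[Callable],      # first L/2 plain ViT blocks
    ica_blocks: Sequence[Callable],  # latter L/2 ICA-augmented blocks
    score_head: Callable,            # [CLS]-fusion head (see Scoring Mechanism)
    i_g: torch.Tensor, i_p: torch.Tensor, i_v: torch.Tensor,
) -> torch.Tensor:
    # Embed garment, person, and try-on result independently.
    x_g, x_p, x_v = embed(i_g), embed(i_p), embed(i_v)
    for blk in blocks:               # first half: independent per-branch encoding
        x_g, x_p, x_v = blk(x_g), blk(x_p), blk(x_v)
    for blk in ica_blocks:           # second half: cross-image interaction via ICA
        x_g, x_p, x_v = blk(x_g, x_p, x_v)
    # Extract the three [CLS] tokens and fuse them into a score in [-1, 1].
    c_g, c_p, c_v = x_g[:, 0], x_p[:, 0], x_v[:, 0]
    return score_head(c_g, c_p, c_v)
```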
Key Designs¶
- **VTON-QBench Dataset Construction**
    - 62,688 try-on images generated by 14 representative VTON models (covering GAN, U-Net diffusion, DiT, and commercial API approaches).
    - 13,153 garment–person pairs (6,981 original pairs + 6,172 synthetic pairs augmented via FLUX.1-dev LoRA, covering casual, streetwear, formal, minimalist, and vintage styles).
    - 13,838 qualified annotators providing 431,800 three-level quality annotations (natural / slightly unnatural / unnatural).
    - Two-stage annotation cleaning (see the sketch after this list): (1) five unambiguous test questions combined with behavioral filtering (annotators with >80% identical responses or >60% disagreement with majority voting are excluded), raising Krippendorff's α from 0.286 to 0.550; (2) removal of questionnaires with α ≤ 0.4.
    - Pseudo triplets constructed using Nano Banana Pro as a strong model to generate reference images, enabling comparison with reference-based metrics.
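A hedged sketch of the behavioral filter and agreement check above, using the third-party `krippendorff` package (`pip install krippendorff`); the annotator-by-item matrix layout and the {0, 1, 2} rating codes are assumptions:

```python
import numpy as np
import krippendorff

def clean_annotations(ratings: np.ndarray):
    """ratings: (n_annotators, n_items), values in {0, 1, 2}, NaN = unanswered."""
    # Majority vote per item, ignoring unanswered cells.
    majority = np.array([
        np.bincount(col[~np.isnan(col)].astype(int), minlength=3).argmax()
        for col in ratings.T
    ])
    kept = []
    for row in ratings:
        mask = ~np.isnan(row)
        counts = np.bincount(row[mask].astype(int), minlength=3)
        same_frac = counts.max() / mask.sum()            # share of identical responses
        disagree = np.mean(row[mask] != majority[mask])  # disagreement with majority
        if same_frac <= 0.80 and disagree <= 0.60:       # thresholds from the text
            kept.append(row)
    cleaned = np.stack(kept)
    # Krippendorff's alpha on the cleaned matrix (ordinal three-level scale).
    alpha = krippendorff.alpha(reliability_data=cleaned,
                               level_of_measurement="ordinal")
    return cleaned, alpha
```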
- **Interleaved Cross-Attention (ICA) Module**
    - A cross-attention layer is inserted between the self-attention and MLP layers of a standard Transformer block.
    - Asymmetric interaction design: the try-on image \(V\) interacts bidirectionally with both garment \(G\) and person \(P\): \(\hat{X}_V^{(\ell)} = \tilde{X}_V^{(\ell)} + C_{V \leftarrow G}^{(\ell)} + C_{V \leftarrow P}^{(\ell)}\)
    - \(G\) and \(P\) do not interact directly; each attends only to \(V\): \(\hat{X}_G^{(\ell)} = \tilde{X}_G^{(\ell)} + C_{G \leftarrow V}^{(\ell)}\), and symmetrically \(\hat{X}_P^{(\ell)} = \tilde{X}_P^{(\ell)} + C_{P \leftarrow V}^{(\ell)}\).
    - This design reflects the try-on-image-centric nature of quality assessment: verifying whether \(V\) preserves the garment attributes of \(G\) and the non-target elements of \(P\).
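A hedged PyTorch sketch of one ICA-augmented block implementing the asymmetric updates above; the layer dimensions, pre-norm placement, and whether cross-attention weights are shared across directions are assumptions:

```python
import torch.nn as nn

class ICABlock(nn.Module):
    """Self-attention -> interleaved cross-attention -> MLP, per the updates above."""
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def _sa(self, x):  # pre-norm self-attention with residual (the tilde-X terms)
        h = self.norm1(x)
        return x + self.self_attn(h, h, h, need_weights=False)[0]

    def _ca(self, q, kv):  # one cross-attention term C_{q <- kv}
        return self.cross_attn(self.norm2(q), kv, kv, need_weights=False)[0]

    def forward(self, x_g, x_p, x_v):
        x_g, x_p, x_v = self._sa(x_g), self._sa(x_p), self._sa(x_v)
        # Asymmetric interaction: V attends to both G and P ...
        y_v = x_v + self._ca(x_v, x_g) + self._ca(x_v, x_p)
        # ... while G and P attend only to V (no direct G <-> P path).
        y_g = x_g + self._ca(x_g, x_v)
        y_p = x_p + self._ca(x_p, x_v)
        # Standard MLP with residual on each stream.
        mlp = lambda x: x + self.mlp(self.norm3(x))
        return mlp(y_g), mlp(y_p), mlp(y_v)
```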
- **Scoring Mechanism**
    - Three [CLS] tokens \(c_G, c_P, c_V\) are fused: \(\tilde{s} = \alpha \frac{c_G^\top c_V}{\|c_G\|\|c_V\|} + (1-\alpha) \frac{c_P^\top c_V}{\|c_P\|\|c_V\|}\)
    - \(\alpha\) is a learnable scalar that adaptively balances garment consistency and person preservation.
    - The final score is \(\hat{s} = \tanh(a\tilde{s} + b)\), constrained to \([-1, 1]\).
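A direct transcription of the two formulas into a PyTorch head; the initial values of \(\alpha\), \(a\), and \(b\) are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreHead(nn.Module):
    """Fuses the three [CLS] tokens into a quality score in [-1, 1]."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable balance (alpha)
        self.a = nn.Parameter(torch.tensor(1.0))      # affine scale (a)
        self.b = nn.Parameter(torch.tensor(0.0))      # affine shift (b)

    def forward(self, c_g, c_p, c_v):
        sim_g = F.cosine_similarity(c_g, c_v, dim=-1)  # garment consistency
        sim_p = F.cosine_similarity(c_p, c_v, dim=-1)  # person preservation
        s_tilde = self.alpha * sim_g + (1 - self.alpha) * sim_p
        return torch.tanh(self.a * s_tilde + self.b)
```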
Loss & Training¶
The model jointly optimizes a Bradley–Terry preference objective and a score-regression objective (one plausible form is sketched below).
AdamW optimizer, lr=1e-4, batch size 16, early stopping (training halted if validation loss shows no improvement for 3 epochs), single A100 40GB GPU, bfloat16 mixed precision.
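The paper's exact loss weighting is not reproduced here; this is a minimal sketch of one plausible joint objective, assuming the standard Bradley–Terry formulation over preference pairs and a hypothetical balancing weight `lam`:

```python
import torch
import torch.nn.functional as F

def joint_loss(s_win: torch.Tensor, s_lose: torch.Tensor,
               s_pred: torch.Tensor, s_human: torch.Tensor,
               lam: float = 1.0) -> torch.Tensor:
    """s_win/s_lose: scores of the preferred / less-preferred image in a pair;
    s_pred/s_human: predicted vs. human mean opinion scores, both in [-1, 1]."""
    # Bradley–Terry: maximize the probability that the preferred image scores higher.
    bt = -F.logsigmoid(s_win - s_lose).mean()
    # Score regression: anchor absolute scores to the human ratings.
    reg = F.mse_loss(s_pred, s_human)
    return bt + lam * reg  # lam is a hypothetical balancing weight
```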
Key Experimental Results¶
Main Results¶
| Method | ρ_SRCC | ρ_PLCC | R² | A_macro | A_micro | Reference-Free |
|---|---|---|---|---|---|---|
| SSIM | — | 0.135 | — | 0.596 | 0.593 | ✗ |
| LPIPS | — | 0.387 | — | 0.701 | 0.695 | ✗ |
| DINOv3 (zero-shot) | — | 0.261 | — | 0.637 | 0.641 | ✓ |
| VTON-IQA w/o ICA | 0.617 | 0.615 | 0.372 | 0.722 | 0.747 | ✓ |
| VTON-IQA | 0.750 | 0.751 | 0.553 | 0.781 | 0.790 | ✓ |
| Human | 0.760 | 0.762 | 0.536 | 0.782 | 0.791 | — |
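For reference, the ρ_SRCC and ρ_PLCC columns are the standard Spearman and Pearson correlations between predicted scores and per-image human mean opinion scores; the pairing convention is assumed here:

```python
from scipy.stats import pearsonr, spearmanr

def correlations(pred, human):
    """pred: model scores; human: per-image human mean opinion scores."""
    srcc, _ = spearmanr(pred, human)  # rank-order agreement (rho_SRCC)
    plcc, _ = pearsonr(pred, human)   # linear agreement (rho_PLCC)
    return srcc, plcc
```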
Rankings of 14 VTON Models (Dress Code Dataset, VTON-IQA Score)¶
| Rank | Model | Score |
|---|---|---|
| 1 | Nano Banana Pro | 0.305 |
| 2 | GPT-Image-1.5 | 0.237 |
| 3 | FitDiT | 0.219 |
| 4 | IDM-VTON | 0.141 |
| ... | ... | ... |
| 13 | HR-VITON (GAN) | -0.835 |
| 14 | VITON-HD (GAN) | -0.933 |
Ablation Study¶
| Configuration | ρ_SRCC | A_micro |
|---|---|---|
| DINOv3 zero-shot | — | 0.641 |
| + Fine-tuning (w/o ICA) | 0.617 | 0.747 |
| + ICA | 0.750 | 0.790 |
The ICA module yields a relative gain of 21.6% in SRCC (0.617 → 0.750) and 5.8% in micro-accuracy (0.747 → 0.790).
Key Findings¶
- SSIM/LPIPS diverge substantially from human judgments under pose/scale variation; VTON-IQA is robust to global transformations.
- GPT-Image-1.5 is underestimated by traditional metrics (since zero-shot models frequently alter pose/scale); VTON-IQA correctly reflects its high quality.
- Human vs. model: pairwise accuracy is nearly identical (0.782 vs. 0.781), though a gap remains in correlation metrics.
- DiT-based models consistently outperform U-Net diffusion models, while GAN-based methods lag substantially behind.
Highlights & Insights¶
- Remarkable dataset scale: 62K images, 13K annotators, and 431K annotations constitute the largest human-annotated dataset in the VTON evaluation field to date.
- The rigorous annotation quality control pipeline (Krippendorff's α thresholding, multi-stage cleaning) is directly transferable to other crowdsourced annotation projects.
- The asymmetric design of ICA elegantly encodes the semantic structure of try-on quality assessment, centering the evaluation on the try-on image.
- Comprehensive benchmarking across 14 VTON models reveals systematic discrepancies between traditional metrics and perceptual quality.
Limitations & Future Work¶
- Coverage is limited to studio-setting standard try-on scenarios; in-the-wild settings with complex backgrounds and diverse poses are not addressed.
- The model outputs only a scalar score, lacking interpretable attribute-level feedback (e.g., "collar mismatch").
- The three-level annotation scale may be too coarse; fine-grained 5–7 level ratings could provide greater utility.
- The framework has not been extended to video try-on or 3D virtual try-on scenarios.
Related Work & Insights¶
- vs. VTONQA: VTONQA's dataset is substantially smaller (748 pairs vs. 13,153 here) and its implementation is not publicly available; the present work will open-source both dataset and code.
- vs. VTBench: Provides a multi-dimensional diagnostic framework but does not learn a unified quality assessment model.
- vs. VTON-VLLM: Focuses on textual critique rather than quantitative prediction.
- Inspiration: The reference-free quality assessment paradigm is transferable to the evaluation of other conditional generation tasks (e.g., image editing, style transfer); the asymmetric interaction design of ICA is applicable to any scenario requiring verification of whether generated outputs preserve input conditions.
Rating¶
- Novelty: ⭐⭐⭐⭐ Systematic contribution in task formulation and dataset construction; the ICA design is novel but not a breakthrough.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive, covering 14 VTON models, cross-group generalization, human comparison, and category-wise evaluation.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with detailed description of the dataset construction pipeline.
- Value: ⭐⭐⭐⭐⭐ Fills the gap of standardized evaluation benchmarks in the VTON field and has strong potential to become a community standard tool.