Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback
Conference: CVPR2025
arXiv: 2603.13057
Code: GitHub
Area: Human Understanding
Keywords: virtual try-on, image quality assessment, human feedback, cross-attention, benchmark
TL;DR
VTON-IQA is a reference-free image quality assessment framework for virtual try-on. It predicts image-level quality scores aligned with human perception by combining a large-scale human-annotated benchmark, VTON-QBench (62,688 try-on images and 431,800 annotations), with an Interleaved Cross-Attention module.
Background & Motivation
Dilemma of virtual try-on evaluation: In real-world scenarios, ground-truth images of the same person wearing the target clothing are usually unavailable, rendering reference-based metrics (SSIM, LPIPS) inapplicable.
Limitations of distribution-level metrics: Metrics such as FID and KID measure dataset-level statistical properties and cannot reflect the perceptual quality of individual generated images.
Limitations of Prior Work: The VTONQA dataset is limited in scale (748 pairs vs. 13,153 pairs in VTON-QBench), VTON-VLLM focuses on textual critique rather than quantitative prediction, and VTBench does not directly learn a unified quality model from large-scale human annotations.
Specificity of quality assessment: Virtual try-on quality assessment differs from traditional single-image IQA because it requires simultaneously verifying clothing fidelity and the preservation of non-target regions (human identity, background, etc.).
Need for cross-image interaction: Quality evaluation requires modeling the cross-relationships between the generated try-on image, the input clothing image, and the person image, which traditional IQA methods cannot achieve.
Lack of reproducible benchmarks: Existing methods lack open-source implementations and standardized evaluation benchmarks, making reproducible evaluation difficult.
Method
Overall Architecture
A three-branch Transformer architecture processes the clothing image \(I_G\), the person image \(I_P\), and the generated try-on image \(I_V\). The first half of the layers perform independent feature extraction, while the second half introduces the Interleaved Cross-Attention (ICA) module to model cross-image interactions.
VTON-QBench Dataset Construction
- Data Augmentation: Synthetic clothing-person pairs are generated based on FLUX.1-dev, covering casual/street/formal/minimal/vintage styles, expanding the number of pairs from 6,981 to 13,153 (approximately 1.9\(\times\)).
- Pseudo-triplet Construction: A strong model, Nano Banana Pro, is used to generate pseudo ground-truths to construct \((I_G, I_P, I_R)\) triplets, supporting comparisons with reference-based metrics.
- Generation by 14 VTON Models: Covering GAN-based (VITON-HD, HR-VITON, SD-VITON), U-Net diffusion (IDM-VTON, CatVTON, OOTDiffusion), DiT diffusion (FitDiT, CatVTON-FLUX), and proprietary editing models (Nano Banana Pro, GPT-Image-1.5).
- Annotation Protocol: A three-level ordinal scale (Unnatural / Slightly unnatural / Completely natural) is adopted; 13,838 qualified annotators provided 431,800 annotations.
- Data Cleaning: A two-stage filtering process is applied: a dummy-task consistency check, followed by Krippendorff's \(\alpha\) threshold filtering in which questionnaires with \(\alpha \le 0.4\) are discarded. This raises \(\alpha\) from 0.286 to 0.550.
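The agreement-based filtering step can be illustrated with a small, self-contained computation of Krippendorff's \(\alpha\). This is only a sketch: the squared-difference (interval) distance, the function names `krippendorff_alpha` and `keep_questionnaire`, and the encoding of the three levels as integers are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha(units, distance=lambda c, k: (c - k) ** 2):
    """Krippendorff's alpha for a list of units, each a list of numeric ratings.

    The default interval distance is an assumption; the paper's metric
    (e.g. ordinal) may differ.
    """
    coincidences = Counter()  # o[c, k]: weighted count of ordered rating pairs
    for ratings in units:
        m = len(ratings)
        if m < 2:  # a unit with one rating carries no agreement information
            continue
        for c, k in permutations(ratings, 2):
            coincidences[(c, k)] += 1.0 / (m - 1)

    totals = Counter()  # n[c]: marginal total of each rating value
    for (c, _), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more ratings")

    observed = sum(w * distance(c, k) for (c, k), w in coincidences.items()) / n
    expected = sum(
        totals[c] * totals[k] * distance(c, k) for c in totals for k in totals
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - observed / expected


def keep_questionnaire(units, threshold=0.4):
    """Mirror the rule in the text: discard when alpha <= threshold."""
    return krippendorff_alpha(units) > threshold
```

For example, `keep_questionnaire([[1, 1], [2, 2]])` keeps a questionnaire whose raters agree on every item, while `keep_questionnaire([[1, 2], [1, 2]])` discards one whose raters disagree on every item.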
Interleaved Cross-Attention (ICA) Module
- In the last \(L/2\) Transformer blocks, a cross-attention layer is inserted between the self-attention and MLP sublayers.
- Asymmetric Interaction Design: The try-on image representation aggregates contributions from both the clothing and person branches, \(\hat{X}_V = \tilde{X}_V + C_{V \leftarrow G} + C_{V \leftarrow P}\), while the clothing and person branches only retrieve information from the try-on branch.
- This design avoids unnecessary \(G \leftrightarrow P\) coupling and emphasizes quality judgment centered on the generated try-on image.
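The asymmetric update described above can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: single-head attention, no layer normalization, separate projection matrices per direction, and hypothetical names (`cross_attend`, `interleaved_cross_attention`, `init_params`). The actual module in the paper may differ in all of these respects.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(x_query, x_context, w_q, w_k, w_v):
    """Single-head cross-attention: tokens of x_query read from x_context."""
    q, k, v = x_query @ w_q, x_context @ w_k, x_context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v


def interleaved_cross_attention(x_g, x_p, x_v, params):
    """V reads from both G and P; G and P read only from V (no G <-> P path)."""
    new_v = (
        x_v
        + cross_attend(x_v, x_g, *params["v_from_g"])
        + cross_attend(x_v, x_p, *params["v_from_p"])
    )
    new_g = x_g + cross_attend(x_g, x_v, *params["g_from_v"])
    new_p = x_p + cross_attend(x_p, x_v, *params["p_from_v"])
    return new_g, new_p, new_v


def init_params(dim, rng):
    """Random (w_q, w_k, w_v) triples, one per attention direction (illustrative)."""
    def triple():
        return tuple(rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
    return {name: triple() for name in ("v_from_g", "v_from_p", "g_from_v", "p_from_v")}
```

Because the clothing branch never attends to the person branch, perturbing the person tokens leaves the updated clothing tokens unchanged within a single layer, while the try-on tokens do change.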
Scoring Module
- \([CLS]\) tokens from each branch are extracted as compact global representations \(c_G, c_P, c_V\).
- Intermediate Relation Score: A learnable weight \(\alpha\) balances the cosine similarities of clothing consistency and non-target region preservation.
- \(\tilde{s} = \alpha \frac{c_G^\top c_V}{\|c_G\|\|c_V\|} + (1-\alpha) \frac{c_P^\top c_V}{\|c_P\|\|c_V\|}\)
- Final Score: \(\hat{s} = \tanh(a\tilde{s}+b)\), where a learnable affine transformation and a \(\tanh\) activation restrict the score to \([-1,1]\).
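The two scoring formulas above translate directly into a few lines of plain Python. The function names and the use of Python lists for the \([CLS]\) vectors are illustrative choices; in the model, \(\alpha\), \(a\), and \(b\) are learned, whereas here they are passed in as fixed numbers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def quality_score(c_g, c_p, c_v, alpha, a, b):
    """Weighted cosine relation score followed by an affine map and tanh."""
    relation = alpha * cosine(c_g, c_v) + (1.0 - alpha) * cosine(c_p, c_v)
    return math.tanh(a * relation + b)
```

When all three vectors coincide, both cosine terms equal 1, so the relation score is 1 for any \(\alpha\) and the final score reduces to \(\tanh(a + b)\).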
Loss & Training
- Pairwise Preference Term: The Bradley-Terry model is used to model pairwise preferences, with soft-label cross-entropy aligning predictions with the distribution of human preferences.
- Score Regression Term: \(L_2\) loss enforces consistency between the predicted scores and human ratings.
- Joint optimization balances both relative ranking and absolute score alignment.
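A minimal per-pair sketch of this joint objective is given below. It assumes the standard Bradley-Terry form, in which the preference probability is the sigmoid of the score difference, a soft target `pref_ij` equal to the fraction of annotators preferring image \(i\), and a hypothetical weight `lam` between the two terms. The paper's exact weighting and any temperature parameter are not stated in this summary.

```python
import math


def bradley_terry_prob(s_i, s_j):
    """P(i preferred over j) = exp(s_i) / (exp(s_i) + exp(s_j))."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))


def joint_loss(s_i, s_j, pref_ij, y_i, y_j, lam=1.0):
    """Soft-label cross-entropy on the pair plus squared error on each score.

    pref_ij: human preference for i over j, in [0, 1] (assumed soft label).
    y_i, y_j: human ratings on the same scale as the predicted scores (assumed).
    lam: weight of the regression term (hypothetical hyperparameter).
    """
    eps = 1e-12  # guards against log(0)
    p = bradley_terry_prob(s_i, s_j)
    preference = -(
        pref_ij * math.log(p + eps) + (1.0 - pref_ij) * math.log(1.0 - p + eps)
    )
    regression = (s_i - y_i) ** 2 + (s_j - y_j) ** 2
    return preference + lam * regression
```

The preference term depends only on the score difference, so it constrains ranking, while the regression term anchors each score to its absolute human rating.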
Key Experimental Results
| Method | \(\rho_{\text{SRCC}}\uparrow\) | \(\rho_{\text{PLCC}}\uparrow\) | \(R^2\uparrow\) | \(A_{\text{macro}}\uparrow\) | \(A_{\text{micro}}\uparrow\) |
|---|---|---|---|---|---|
| SSIM | – | 0.135 | – | 0.596 | 0.593 |
| LPIPS | – | 0.387 | – | 0.701 | 0.695 |
| DINOv3 (zero-shot) | – | 0.261 | – | 0.637 | 0.641 |
| VTON-IQA w/o ICA | 0.617 | 0.615 | 0.372 | 0.722 | 0.747 |
| VTON-IQA | 0.750 | 0.751 | 0.553 | 0.781 | 0.790 |
| Human | 0.760 | 0.762 | 0.536 | 0.782 | 0.791 |
Key Findings:
- The ICA module brings significant improvement (SRCC: 0.617 to 0.750), validating the effectiveness of modeling cross-image interactions.
- In terms of pairwise accuracy (\(A_{\text{macro}}/A_{\text{micro}}\)), the model is close to human-level performance (0.781 vs. 0.782).
- There is still room for improvement in correlation metrics (0.750 vs. 0.760), indicating that fine-grained perceptual alignment needs further enhancement.
- In the benchmark of 14 VTON models, Nano Banana Pro achieves the highest overall scores on Dress Code and VITON-HD.
- GAN-based methods (VITON-HD, HR-VITON, SD-VITON) score significantly lower than diffusion-based methods under VTON-IQA.
- Qualitative analysis shows that VTON-IQA is robust to changes in pose and scale, whereas SSIM/LPIPS excessively penalize global transformations.
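For readers less familiar with the correlation columns in the table, the sketch below computes PLCC (Pearson) and SRCC (Spearman, as Pearson on average ranks) in plain Python. These are the standard definitions; the helper names are illustrative, and the paper's evaluation code may use a library implementation instead.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for idx in order[start:end + 1]:
            ranks[idx] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC)."""
    return pearson(average_ranks(x), average_ranks(y))
```

SRCC rewards any monotonic relationship between predicted and human scores, whereas PLCC additionally requires the relationship to be linear, which is why the two can differ for the same predictor.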
Highlights
- Large-Scale Human-Annotated Benchmark: VTON-QBench is presented as the largest dataset to date for subjective human evaluation of virtual try-on; its annotator count (13,838) and annotation count (431,800) far exceed those of prior work.
- Asymmetric ICA Design: It models \(V \leftrightarrow G\) and \(V \leftrightarrow P\) interactions while avoiding redundant \(G \leftrightarrow P\) coupling, which matches the try-on-centric nature of the quality assessment task.
- Human-Like Pairwise Accuracy: Pairwise accuracy is on par with human annotators (\(A_{\text{macro}}\) 0.781 vs. 0.782; \(A_{\text{micro}}\) 0.790 vs. 0.791).
- Comprehensive Benchmark: Providing a systematic evaluation covering 14 representative VTON models, offering a standardized reference for the virtual try-on community.
Limitations
- There is still a gap in correlation metrics compared to human performance (SRCC 0.750 vs. 0.760), indicating that fine-grained perceptual alignment needs improvement.
- The annotation scale has only three levels, which is relatively coarse-grained and might limit the discriminability of quality scores.
- Both training and evaluation are based on VTON-QBench, and the generalization ability to completely new VTON models or extreme scenarios remains to be verified.
- It only evaluates upper-body/full-body try-on, and does not cover sub-scenarios like accessories or footwear.
- The backbone is based on DINOv3 ViT-L/16, which has high inference costs (requiring three-branch forward propagation for three images).
Related Work
- Virtual Try-On: Evolving from GAN-based two-stage pipelines (VITON-HD, HR-VITON) \(\rightarrow\) U-Net diffusion (IDM-VTON, CatVTON) \(\rightarrow\) DiT (FitDiT, Any2AnyTryon) \(\rightarrow\) proprietary editing models (Nano Banana Pro, GPT-Image-1.5).
- Quality Assessment: VTONQA (an evaluator trained on limited-scale annotations), VTON-VLLM (textual critique-oriented), and VTBench (hierarchical benchmark without a unified quality model).
- General IQA: Q-Align, CLIP-IQA (single-image IQA without modeling cross-image interactions).
Rating
- Novelty: ⭐⭐⭐⭐ — The asymmetric ICA interaction design is highly focused, and the large-scale crowdsourced annotation dataset construction process is mature.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive benchmark of 14 models, ablation studies, and comparisons with humans.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure and well-defined problem.
- Value: ⭐⭐⭐⭐ — Provides the virtual try-on community with much-needed standardized evaluation tools and large-scale benchmarks.