
Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback

Conference: CVPR 2026 · arXiv: 2603.13057 · Code: GitHub · Area: Human Understanding · Keywords: Virtual try-on, reference-free quality assessment, human feedback alignment, interleaved cross-attention, VTON-QBench

TL;DR

This work constructs VTON-QBench (62,688 try-on images, 13,838 qualified annotators, 431,800 annotations) and proposes VTON-IQA, a reference-free image quality assessment framework that jointly models garment fidelity and person preservation via an asymmetric Interleaved Cross-Attention (ICA) module, producing image-level quality predictions that align closely with human perception.

Background & Motivation

Background: Virtual try-on (VTON), whose core task is to synthesize an image of a person wearing a target garment from a person image and a garment image, is increasingly important in fashion e-commerce. Try-on quality has improved steadily as architectures have progressed from GANs to U-Net diffusion models to DiT.

Limitations of Prior Work:

  1. Ground-truth images are unavailable in real-world scenarios (it is infeasible to photograph the same person wearing the target garment), rendering reference-based metrics such as SSIM/LPIPS inapplicable.
  2. FID/KID measure only dataset-level distributional similarity and cannot reflect single-image quality.
  3. Existing VTON-specific evaluation methods (VTONQA, VTBench, VTON-VLLM) either operate on small datasets (748 pairs), lack publicly available implementations, or are not validated at scale with human annotations.

Key Challenge: Single-image-level, reference-free quality assessment aligned with human perception is required, yet no existing tool satisfies all three criteria simultaneously.

Goal: Establish a large-scale human-annotated benchmark and train a reference-free quality prediction model.

Key Insight: Try-on quality fundamentally involves verifying two aspects—(1) whether the garment is faithfully transferred, and (2) whether non-target regions are preserved—naturally necessitating cross-image interaction modeling.

Core Idea: Construct a large-scale human-annotated dataset and explicitly model the consistency between the try-on result and the garment/person images via asymmetric cross-attention.

Method

Overall Architecture

A three-branch Transformer (based on DINOv3 ViT-L/16) encodes the garment \(I_G\), person \(I_P\), and try-on result \(I_V\). The first \(L/2\) layers process each image independently; the latter \(L/2\) layers insert ICA modules for cross-image interaction. The three [CLS] tokens are fused via a learnable weighted cosine similarity and mapped through \(\tanh\) to \([-1, 1]\) as the quality score, as sketched below.
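
A minimal sketch of this forward pass in PyTorch (a reconstruction from the description above, not the authors' released code: the shared-weight branches, the `VTONIQA`/`ICABlock` names, and the exact placement of the learnable scalars \(\alpha, a, b\) are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VTONIQA(nn.Module):
    """Three-branch scorer sketch: garment G, person P, try-on result V."""

    def __init__(self, blocks: nn.ModuleList, ica_blocks: nn.ModuleList):
        super().__init__()
        self.early = blocks              # first L/2 ViT blocks, no interaction
        self.ica_blocks = ica_blocks     # last L/2 blocks with ICA inserted
        self.alpha = nn.Parameter(torch.tensor(0.5))  # garment/person balance
        self.a = nn.Parameter(torch.tensor(1.0))      # affine before tanh
        self.b = nn.Parameter(torch.tensor(0.0))

    def forward(self, tok_g, tok_p, tok_v):
        # Stage 1: the first L/2 layers encode each image independently
        # (backbone weights shared across the three branches in this sketch).
        for blk in self.early:
            tok_g, tok_p, tok_v = blk(tok_g), blk(tok_p), blk(tok_v)
        # Stage 2: the last L/2 layers interleave cross-image attention.
        for blk in self.ica_blocks:
            tok_g, tok_p, tok_v = blk(tok_g, tok_p, tok_v)
        # Fuse the three [CLS] tokens by learnable weighted cosine similarity.
        c_g, c_p, c_v = tok_g[:, 0], tok_p[:, 0], tok_v[:, 0]
        s = (self.alpha * F.cosine_similarity(c_g, c_v, dim=-1)
             + (1 - self.alpha) * F.cosine_similarity(c_p, c_v, dim=-1))
        return torch.tanh(self.a * s + self.b)  # quality score in [-1, 1]
```

Sharing one backbone across the three branches keeps the parameter count at a single ViT-L; the paper's description of three branches is equally compatible with separate weights.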

Key Designs

  1. VTON-QBench Dataset Construction

    • 62,688 try-on images generated by 14 representative VTON models (covering GAN, U-Net diffusion, DiT, and commercial API approaches).
    • 13,153 garment–person pairs (6,981 original pairs + 6,172 synthetic pairs augmented via FLUX.1-dev LoRA, covering casual, streetwear, formal, minimalist, and vintage styles).
    • 13,838 qualified annotators providing 431,800 three-level quality annotations (natural / slightly unnatural / unnatural).
    • Two-stage annotation cleaning (see the filtering sketch after this list): (1) five unambiguous test questions combined with behavioral filtering (annotators with >80% identical responses or >60% disagreement with the majority vote are excluded), raising Krippendorff's α from 0.286 to 0.550; (2) removal of questionnaires with α ≤ 0.4.
    • Pseudo triplets constructed using Nano Banana Pro as a strong model to generate reference images, enabling comparison with reference-based metrics.
  2. Interleaved Cross-Attention (ICA) Module

    • A cross-attention layer is inserted between each standard Transformer block's self-attention and MLP sub-layers (see the module sketch after this list).
    • Asymmetric interaction design: the try-on image \(V\) interacts bidirectionally with both garment \(G\) and person \(P\): \(\hat{X}_V^{(\ell)} = \tilde{X}_V^{(\ell)} + C_{V \leftarrow G}^{(\ell)} + C_{V \leftarrow P}^{(\ell)}\)
    • However, \(G\) and \(P\) do not interact directly; they are connected only through \(V\): \(\hat{X}_G^{(\ell)} = \tilde{X}_G^{(\ell)} + C_{G \leftarrow V}^{(\ell)}\), and symmetrically \(\hat{X}_P^{(\ell)} = \tilde{X}_P^{(\ell)} + C_{P \leftarrow V}^{(\ell)}\).
    • This design reflects the try-on-image-centric nature of quality assessment—verifying whether \(V\) preserves the garment attributes of \(G\) and the non-target elements of \(P\).
  3. Scoring Mechanism

    • Three [CLS] tokens \(c_G, c_P, c_V\) are fused: \(\tilde{s} = \alpha \frac{c_G^\top c_V}{\|c_G\|\|c_V\|} + (1-\alpha) \frac{c_P^\top c_V}{\|c_P\|\|c_V\|}\)
    • \(\alpha\) is a learnable scalar that adaptively balances garment consistency and person preservation.
    • The final score is \(\hat{s} = \tanh(a\tilde{s} + b)\), constrained to \([-1, 1]\).
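
For the annotation cleaning in design 1, the behavioral filter could be sketched as follows (the data layout and the `filter_annotators` helper are hypothetical; `krippendorff.alpha` is the agreement function from the `krippendorff` PyPI package):

```python
import numpy as np
import krippendorff  # pip install krippendorff

def filter_annotators(responses: np.ndarray) -> np.ndarray:
    """responses: (num_annotators, num_items) array with labels in
    {0, 1, 2} (natural / slightly unnatural / unnatural) and np.nan
    for items an annotator did not see. Returns a boolean keep-mask."""
    # Majority-vote label per item, ignoring missing answers.
    majority = np.array([
        np.bincount(col[~np.isnan(col)].astype(int), minlength=3).argmax()
        for col in responses.T
    ])
    keep = np.ones(len(responses), dtype=bool)
    for i, row in enumerate(responses):
        seen = ~np.isnan(row)
        labels = row[seen].astype(int)
        same = np.bincount(labels, minlength=3).max() / max(len(labels), 1)
        disagree = (labels != majority[seen]).mean()
        # Exclude button-mashers (>80% identical answers) and outliers
        # (>60% disagreement with the per-item majority vote).
        keep[i] = same <= 0.8 and disagree <= 0.6
    return keep

# Agreement before/after filtering on the 3-level scale, e.g.:
# krippendorff.alpha(reliability_data=responses[keep],
#                    level_of_measurement="ordinal")
```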
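
A minimal sketch of the ICA block in design 2, matching the update equations above (using `nn.MultiheadAttention` and one shared cross-attention module per layer is a simplification; the paper may use per-direction parameters and a different norm placement):

```python
import torch.nn as nn

class ICABlock(nn.Module):
    """One interleaved layer: self-attention -> cross-attention -> MLP,
    with the asymmetric interaction pattern of the equations above."""

    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # One shared cross-attention module for all directed interactions
        # (a simplification; each direction may have its own weights).
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def _sa(self, x):
        # Standard self-attention with residual: the tilde-X terms.
        h = self.norm1(x)
        return x + self.self_attn(h, h, h, need_weights=False)[0]

    def _ca(self, q, kv):
        # C_{q <- kv}: tokens of one branch attend to another branch.
        return self.cross_attn(self.norm2(q), kv, kv, need_weights=False)[0]

    def forward(self, x_g, x_p, x_v):
        x_g, x_p, x_v = self._sa(x_g), self._sa(x_p), self._sa(x_v)
        # Asymmetric interaction: V attends to both G and P ...
        h_v = x_v + self._ca(x_v, x_g) + self._ca(x_v, x_p)
        # ... while G and P attend only to V, never to each other.
        h_g = x_g + self._ca(x_g, x_v)
        h_p = x_p + self._ca(x_p, x_v)
        # Per-branch MLP with residual closes the block.
        out_g = h_g + self.mlp(self.norm3(h_g))
        out_p = h_p + self.mlp(self.norm3(h_p))
        out_v = h_v + self.mlp(self.norm3(h_v))
        return out_g, out_p, out_v
```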

Loss & Training

Joint optimization of a Bradley–Terry preference loss and a score regression term, where \(p_\theta\) denotes the predicted probability that result \(i\) is preferred over result \(j\) and \(q_{ij}\) the corresponding human preference label:

\[\mathcal{L}_\theta = -q_{ij} \log p_\theta - (1-q_{ij}) \log(1-p_\theta) + \sum_{k \in \{i,j\}} \|\Psi_\theta(I_G, I_P, I_{V_k}) - S_k\|_2^2\]
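
A minimal sketch of this objective, assuming the standard Bradley–Terry parameterization \(p_\theta = \sigma(\hat{s}_i - \hat{s}_j)\) (the paper's exact form may differ; the helper name and the mean-reduced regression term are illustrative):

```python
import torch
import torch.nn.functional as F

def vton_iqa_loss(s_i, s_j, q_ij, mos_i, mos_j):
    """s_i, s_j: predicted scores for two try-on results of the same
    (garment, person) pair; q_ij: human preference label for i over j
    in [0, 1]; mos_i, mos_j: human ratings mapped to [-1, 1]."""
    # Bradley-Terry term: preference probability from the score gap.
    p = torch.sigmoid(s_i - s_j)
    pref = F.binary_cross_entropy(p, q_ij)
    # Regression term: anchors absolute scores to the human ratings.
    reg = F.mse_loss(s_i, mos_i) + F.mse_loss(s_j, mos_j)
    return pref + reg
```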

AdamW optimizer, lr=1e-4, batch size 16, early stopping (training halted if validation loss shows no improvement for 3 epochs), single A100 40GB GPU, bfloat16 mixed precision.

Key Experimental Results

Main Results

| Method | ρ_SRCC | ρ_PLCC | – | A_macro | A_micro | Reference-Free |
|---|---|---|---|---|---|---|
| SSIM | 0.135 | – | – | 0.596 | 0.593 | ✗ |
| LPIPS | 0.387 | – | – | 0.701 | 0.695 | ✗ |
| DINOv3 (zero-shot) | 0.261 | – | – | 0.637 | 0.641 | ✓ |
| VTON-IQA w/o ICA | 0.617 | 0.615 | 0.372 | 0.722 | 0.747 | ✓ |
| VTON-IQA | 0.750 | 0.751 | 0.553 | 0.781 | 0.790 | ✓ |
| Human | 0.760 | 0.762 | 0.536 | 0.782 | 0.791 | – |

Rankings of 14 VTON Models (Dress Code Dataset, VTON-IQA Score)

| Rank | Model | Score |
|---|---|---|
| 1 | Nano Banana Pro | 0.305 |
| 2 | GPT-Image-1.5 | 0.237 |
| 3 | FitDiT | 0.219 |
| 4 | IDM-VTON | 0.141 |
| … | … | … |
| 13 | HR-VITON (GAN) | -0.835 |
| 14 | VITON-HD (GAN) | -0.933 |

Ablation Study

| Configuration | ρ_SRCC | A_micro |
|---|---|---|
| DINOv3 zero-shot | – | 0.641 |
| + Fine-tuning (w/o ICA) | 0.617 | 0.747 |
| + ICA | 0.750 | 0.790 |

Relative to fine-tuning without ICA, the ICA module contributes a +21.6% relative gain in SRCC (0.617 → 0.750) and a +5.8% relative improvement in micro accuracy (0.747 → 0.790).

Key Findings

  • SSIM/LPIPS diverge substantially from human judgments under pose/scale variation; VTON-IQA is robust to global transformations.
  • GPT-Image-1.5 is underestimated by traditional metrics (since zero-shot models frequently alter pose/scale); VTON-IQA correctly reflects its high quality.
  • Human vs. model: pairwise accuracy is nearly identical (0.782 vs. 0.781), though a gap remains in correlation metrics.
  • DiT-based models consistently outperform U-Net diffusion models, while GAN-based methods lag substantially behind.

Highlights & Insights

  • Remarkable dataset scale: 62K images, 13K annotators, and 431K annotations constitute the largest human-annotated dataset in the VTON evaluation field to date.
  • The rigorous annotation quality control pipeline (Krippendorff's α thresholding, multi-stage cleaning) is directly transferable to other crowdsourced annotation projects.
  • The asymmetric design of ICA elegantly encodes the semantic structure of try-on quality assessment, centering the evaluation on the try-on image.
  • Comprehensive benchmarking across 14 VTON models reveals systematic discrepancies between traditional metrics and perceptual quality.

Limitations & Future Work

  • Coverage is limited to studio-setting standard try-on scenarios; in-the-wild settings with complex backgrounds and diverse poses are not addressed.
  • The model outputs only a scalar score, lacking interpretable attribute-level feedback (e.g., "collar mismatch").
  • The three-level annotation scale may be too coarse; fine-grained 5–7 level ratings could provide greater utility.
  • The framework has not been extended to video try-on or 3D virtual try-on scenarios.
  • vs. VTONQA: VTONQA's dataset is substantially smaller (748 pairs vs. 13,153 pairs here) and its implementation is not publicly available; the present dataset and code will be open-sourced.
  • vs. VTBench: VTBench provides a multi-dimensional diagnostic framework but does not learn a unified quality assessment model.
  • vs. VTON-VLLM: VTON-VLLM focuses on textual critique rather than quantitative score prediction.
  • Inspiration: The reference-free quality assessment paradigm is transferable to the evaluation of other conditional generation tasks (e.g., image editing, style transfer); the asymmetric interaction design of ICA is applicable to any scenario requiring verification of whether generated outputs preserve input conditions.

Rating

  • Novelty: ⭐⭐⭐⭐ Systematic contribution in task formulation and dataset construction; the ICA design is novel but not a breakthrough.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive, covering 14 VTON models, cross-group generalization, human comparison, and category-wise evaluation.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with detailed description of the dataset construction pipeline.
  • Value: ⭐⭐⭐⭐⭐ Fills the gap of standardized evaluation benchmarks in the VTON field and has strong potential to become a community standard tool.