Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback¶
Conference: CVPR 2026 · arXiv: 2603.13057 · Code: GitHub · Area: Human Understanding · Keywords: Virtual try-on, reference-free quality assessment, human feedback alignment, interleaved cross-attention, VTON-QBench
TL;DR¶
This work constructs VTON-QBench (62,688 try-on images, 13,838 qualified annotators, 431,800 annotations) and proposes VTON-IQA, a reference-free image quality assessment framework that jointly models garment fidelity and person preservation via an asymmetric Interleaved Cross-Attention (ICA) module, achieving image-level quality prediction highly aligned with human perception.
Background & Motivation¶
Background: Virtual try-on (VTON) is increasingly important in fashion e-commerce, with the synthesis of wearing results from a person image and a garment image as the core task. Try-on quality has continuously improved from GANs to U-Net diffusion models and DiT architectures.
Limitations of Prior Work:
- Ground-truth images are unavailable in real-world scenarios (it is infeasible to photograph the same person wearing the target garment), rendering reference-based metrics such as SSIM/LPIPS inapplicable.
- FID/KID measure only dataset-level distributional similarity and cannot reflect single-image quality.
- Existing VTON-specific evaluation methods (VTONQA, VTBench, VTON-VLLM) either operate on small datasets (748 pairs), lack publicly available implementations, or are not validated at scale with human annotations.
Key Challenge: Single-image-level, reference-free quality assessment aligned with human perception is required, yet no existing tool satisfies all three criteria simultaneously.
Goal: Establish a large-scale human-annotated benchmark and train a reference-free quality prediction model.
Key Insight: Try-on quality fundamentally involves verifying two aspects—(1) whether the garment is faithfully transferred, and (2) whether non-target regions are preserved—naturally necessitating cross-image interaction modeling.
Core Idea: Construct a large-scale human-annotated dataset and explicitly model the consistency between the try-on result and the garment/person images via asymmetric cross-attention.
Method¶
Overall Architecture¶
A three-branch Transformer architecture (based on DINOv3 ViT-L/16) processes the garment \(I_G\), person \(I_P\), and try-on result \(I_V\) independently. The first \(L/2\) layers extract features independently; the latter \(L/2\) layers incorporate ICA modules for cross-image interaction. Three [CLS] tokens are extracted, fused via learnable weighted cosine similarity, and mapped through \(\tanh\) to \([-1, 1]\) as the quality score.
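A minimal PyTorch sketch of this forward pass follows. The function and argument names (`embed`, `blocks`, `ica_blocks`, `score_head`) are hypothetical, and whether the three branches share backbone weights is an assumption, not a confirmed detail of the paper:

```python
from typing import Callable, Sequence
import torch

def vton_iqa_forward(
    embed: Callable,                 # patch-embedding stem (e.g., from DINOv3)
    blocks: Sequence[Callable],      # first L/2 plain ViT blocks
    ica_blocks: Sequence[Callable],  # latter L/2 ICA-augmented blocks
    score_head: Callable,            # [CLS]-fusion head (see Scoring Mechanism)
    i_g: torch.Tensor, i_p: torch.Tensor, i_v: torch.Tensor,
) -> torch.Tensor:
    # Embed garment, person, and try-on result independently.
    x_g, x_p, x_v = embed(i_g), embed(i_p), embed(i_v)
    for blk in blocks:               # first half: independent per-branch encoding
        x_g, x_p, x_v = blk(x_g), blk(x_p), blk(x_v)
    for blk in ica_blocks:           # second half: cross-image interaction via ICA
        x_g, x_p, x_v = blk(x_g, x_p, x_v)
    # Extract the three [CLS] tokens and fuse them into a score in [-1, 1].
    c_g, c_p, c_v = x_g[:, 0], x_p[:, 0], x_v[:, 0]
    return score_head(c_g, c_p, c_v)
```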
Key Designs¶
- **VTON-QBench Dataset Construction**
    - 62,688 try-on images generated by 14 representative VTON models (covering GAN, U-Net diffusion, DiT, and commercial API approaches).
    - 13,153 garment–person pairs (6,981 original pairs + 6,172 synthetic pairs augmented via FLUX.1-dev LoRA, covering casual, streetwear, formal, minimalist, and vintage styles).
    - 13,838 qualified annotators providing 431,800 three-level quality annotations (natural / slightly unnatural / unnatural).
    - Two-stage annotation cleaning (see the sketch after this list): (1) five unambiguous test questions combined with behavioral filtering (annotators with >80% identical responses or >60% disagreement with majority voting are excluded), raising Krippendorff's α from 0.286 to 0.550; (2) removal of questionnaires with α ≤ 0.4.
    - Pseudo triplets constructed using Nano Banana Pro as a strong model to generate reference images, enabling comparison with reference-based metrics.
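A hedged sketch of the behavioral filter and agreement check above, using the third-party `krippendorff` package (`pip install krippendorff`); the annotator-by-item matrix layout and the {0, 1, 2} rating codes are assumptions:

```python
import numpy as np
import krippendorff

def clean_annotations(ratings: np.ndarray):
    """ratings: (n_annotators, n_items), values in {0, 1, 2}, NaN = unanswered."""
    # Majority vote per item, ignoring unanswered cells.
    majority = np.array([
        np.bincount(col[~np.isnan(col)].astype(int), minlength=3).argmax()
        for col in ratings.T
    ])
    kept = []
    for row in ratings:
        mask = ~np.isnan(row)
        counts = np.bincount(row[mask].astype(int), minlength=3)
        same_frac = counts.max() / mask.sum()            # share of identical responses
        disagree = np.mean(row[mask] != majority[mask])  # disagreement with majority
        if same_frac <= 0.80 and disagree <= 0.60:       # thresholds from the text
            kept.append(row)
    cleaned = np.stack(kept)
    # Krippendorff's alpha on the cleaned matrix (ordinal three-level scale).
    alpha = krippendorff.alpha(reliability_data=cleaned,
                               level_of_measurement="ordinal")
    return cleaned, alpha
```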
- **Interleaved Cross-Attention (ICA) Module**
    - A cross-attention layer is inserted between the self-attention and MLP layers of a standard Transformer block.
    - Asymmetric interaction design: the try-on image \(V\) interacts bidirectionally with both garment \(G\) and person \(P\): \(\hat{X}_V^{(\ell)} = \tilde{X}_V^{(\ell)} + C_{V \leftarrow G}^{(\ell)} + C_{V \leftarrow P}^{(\ell)}\)
    - \(G\) and \(P\) do not interact directly; each attends only to \(V\): \(\hat{X}_G^{(\ell)} = \tilde{X}_G^{(\ell)} + C_{G \leftarrow V}^{(\ell)}\), and symmetrically \(\hat{X}_P^{(\ell)} = \tilde{X}_P^{(\ell)} + C_{P \leftarrow V}^{(\ell)}\).
    - This design reflects the try-on-image-centric nature of quality assessment: verifying whether \(V\) preserves the garment attributes of \(G\) and the non-target elements of \(P\).
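A hedged PyTorch sketch of one ICA-augmented block implementing the asymmetric updates above; the layer dimensions, pre-norm placement, and whether cross-attention weights are shared across directions are assumptions:

```python
import torch.nn as nn

class ICABlock(nn.Module):
    """Self-attention -> interleaved cross-attention -> MLP, per the updates above."""
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def _sa(self, x):  # pre-norm self-attention with residual (the tilde-X terms)
        h = self.norm1(x)
        return x + self.self_attn(h, h, h, need_weights=False)[0]

    def _ca(self, q, kv):  # one cross-attention term C_{q <- kv}
        return self.cross_attn(self.norm2(q), kv, kv, need_weights=False)[0]

    def forward(self, x_g, x_p, x_v):
        x_g, x_p, x_v = self._sa(x_g), self._sa(x_p), self._sa(x_v)
        # Asymmetric interaction: V attends to both G and P ...
        y_v = x_v + self._ca(x_v, x_g) + self._ca(x_v, x_p)
        # ... while G and P attend only to V (no direct G <-> P path).
        y_g = x_g + self._ca(x_g, x_v)
        y_p = x_p + self._ca(x_p, x_v)
        # Standard MLP with residual on each stream.
        mlp = lambda x: x + self.mlp(self.norm3(x))
        return mlp(y_g), mlp(y_p), mlp(y_v)
```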
- **Scoring Mechanism**
    - Three [CLS] tokens \(c_G, c_P, c_V\) are fused: \(\tilde{s} = \alpha \frac{c_G^\top c_V}{\|c_G\|\|c_V\|} + (1-\alpha) \frac{c_P^\top c_V}{\|c_P\|\|c_V\|}\)
    - \(\alpha\) is a learnable scalar that adaptively balances garment consistency and person preservation.
    - The final score is \(\hat{s} = \tanh(a\tilde{s} + b)\), constrained to \([-1, 1]\).
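A direct transcription of the two formulas into a PyTorch head; the initial values of \(\alpha\), \(a\), and \(b\) are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreHead(nn.Module):
    """Fuses the three [CLS] tokens into a quality score in [-1, 1]."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable balance (alpha)
        self.a = nn.Parameter(torch.tensor(1.0))      # affine scale (a)
        self.b = nn.Parameter(torch.tensor(0.0))      # affine shift (b)

    def forward(self, c_g, c_p, c_v):
        sim_g = F.cosine_similarity(c_g, c_v, dim=-1)  # garment consistency
        sim_p = F.cosine_similarity(c_p, c_v, dim=-1)  # person preservation
        s_tilde = self.alpha * sim_g + (1 - self.alpha) * sim_p
        return torch.tanh(self.a * s_tilde + self.b)
```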
Loss & Training¶
The model jointly optimizes a Bradley–Terry preference objective and a score-regression objective (one plausible form is sketched below).
AdamW optimizer, lr=1e-4, batch size 16, early stopping (training halted if validation loss shows no improvement for 3 epochs), single A100 40GB GPU, bfloat16 mixed precision.
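The paper's exact loss weighting is not reproduced here; this is a minimal sketch of one plausible joint objective, assuming the standard Bradley–Terry formulation over preference pairs and a hypothetical balancing weight `lam`:

```python
import torch
import torch.nn.functional as F

def joint_loss(s_win: torch.Tensor, s_lose: torch.Tensor,
               s_pred: torch.Tensor, s_human: torch.Tensor,
               lam: float = 1.0) -> torch.Tensor:
    """s_win/s_lose: scores of the preferred / less-preferred image in a pair;
    s_pred/s_human: predicted vs. human mean opinion scores, both in [-1, 1]."""
    # Bradley–Terry: maximize the probability that the preferred image scores higher.
    bt = -F.logsigmoid(s_win - s_lose).mean()
    # Score regression: anchor absolute scores to the human ratings.
    reg = F.mse_loss(s_pred, s_human)
    return bt + lam * reg  # lam is a hypothetical balancing weight
```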
Key Experimental Results¶
Main Results¶
| Method | ρ_SRCC | ρ_PLCC | R² | A_macro | A_micro | Reference-Free |
|---|---|---|---|---|---|---|
| SSIM | — | 0.135 | — | 0.596 | 0.593 | ✗ |
| LPIPS | — | 0.387 | — | 0.701 | 0.695 | ✗ |
| DINOv3 (zero-shot) | — | 0.261 | — | 0.637 | 0.641 | ✓ |
| VTON-IQA w/o ICA | 0.617 | 0.615 | 0.372 | 0.722 | 0.747 | ✓ |
| VTON-IQA | 0.750 | 0.751 | 0.553 | 0.781 | 0.790 | ✓ |
| Human | 0.760 | 0.762 | 0.536 | 0.782 | 0.791 | — |
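For reference, the ρ_SRCC and ρ_PLCC columns are the standard Spearman and Pearson correlations between predicted scores and per-image human mean opinion scores; the pairing convention is assumed here:

```python
from scipy.stats import pearsonr, spearmanr

def correlations(pred, human):
    """pred: model scores; human: per-image human mean opinion scores."""
    srcc, _ = spearmanr(pred, human)  # rank-order agreement (rho_SRCC)
    plcc, _ = pearsonr(pred, human)   # linear agreement (rho_PLCC)
    return srcc, plcc
```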
Rankings of 14 VTON Models (Dress Code Dataset, VTON-IQA Score)¶
| Rank | Model | Score |
|---|---|---|
| 1 | Nano Banana Pro | 0.305 |
| 2 | GPT-Image-1.5 | 0.237 |
| 3 | FitDiT | 0.219 |
| 4 | IDM-VTON | 0.141 |
| ... | ... | ... |
| 13 | HR-VITON (GAN) | -0.835 |
| 14 | VITON-HD (GAN) | -0.933 |
Ablation Study¶
| Configuration | ρ_SRCC | A_micro |
|---|---|---|
| DINOv3 zero-shot | — | 0.641 |
| + Fine-tuning (w/o ICA) | 0.617 | 0.747 |
| + ICA | 0.750 | 0.790 |
The ICA module yields a relative gain of 21.6% in SRCC (0.617 → 0.750) and 5.8% in micro-accuracy (0.747 → 0.790).
Key Findings¶
- SSIM/LPIPS diverge substantially from human judgments under pose/scale variation; VTON-IQA is robust to global transformations.
- GPT-Image-1.5 is underestimated by traditional metrics (since zero-shot models frequently alter pose/scale); VTON-IQA correctly reflects its high quality.
- Human vs. model: pairwise accuracy is nearly identical (0.782 vs. 0.781), though a gap remains in correlation metrics.
- DiT-based models consistently outperform U-Net diffusion models, while GAN-based methods lag substantially behind.
Highlights & Insights¶
- Remarkable dataset scale: 62K images, 13K annotators, and 431K annotations constitute the largest human-annotated dataset in the VTON evaluation field to date.
- The rigorous annotation quality control pipeline (Krippendorff's α thresholding, multi-stage cleaning) is directly transferable to other crowdsourced annotation projects.
- The asymmetric design of ICA elegantly encodes the semantic structure of try-on quality assessment, centering the evaluation on the try-on image.
- Comprehensive benchmarking across 14 VTON models reveals systematic discrepancies between traditional metrics and perceptual quality.
Limitations & Future Work¶
- Coverage is limited to studio-setting standard try-on scenarios; in-the-wild settings with complex backgrounds and diverse poses are not addressed.
- The model outputs only a scalar score, lacking interpretable attribute-level feedback (e.g., "collar mismatch").
- The three-level annotation scale may be too coarse; fine-grained 5–7 level ratings could provide greater utility.
- The framework has not been extended to video try-on or 3D virtual try-on scenarios.
Related Work & Insights¶
- vs. VTONQA: VTONQA's dataset is substantially smaller (748 pairs vs. 13,153 here) and its implementation is not publicly available; the present work will open-source both dataset and code.
- vs. VTBench: Provides a multi-dimensional diagnostic framework but does not learn a unified quality assessment model.
- vs. VTON-VLLM: Focuses on textual critique rather than quantitative prediction.
- Inspiration: The reference-free quality assessment paradigm is transferable to the evaluation of other conditional generation tasks (e.g., image editing, style transfer); the asymmetric interaction design of ICA is applicable to any scenario requiring verification of whether generated outputs preserve input conditions.
Rating¶
- Novelty: ⭐⭐⭐⭐ Systematic contribution in task formulation and dataset construction; the ICA design is novel but not a breakthrough.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive, covering 14 VTON models, cross-group generalization, human comparison, and category-wise evaluation.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with detailed description of the dataset construction pipeline.
- Value: ⭐⭐⭐⭐⭐ Fills the gap of standardized evaluation benchmarks in the VTON field and has strong potential to become a community standard tool.