
Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback

Conference: CVPR 2026 · arXiv: 2603.13057 · Code: GitHub · Area: Human-Centric Understanding / Virtual Try-On Quality Assessment · Keywords: virtual try-on, image quality assessment, reference-free evaluation, human feedback alignment, cross-attention, large-scale annotation benchmark

TL;DR

This paper presents VTON-IQA, a reference-free image quality assessment framework for virtual try-on. It introduces VTON-QBench, a large-scale benchmark comprising 62,688 try-on images annotated with 431,800 human judgments, and proposes an Interleaved Cross-Attention (ICA) module to model interactions among garment, person, and try-on images, achieving image-level quality predictions that are closely aligned with human perception.

Background & Motivation

  1. Absence of ground-truth references in real-world deployment: In practical e-commerce settings, ground-truth images of the same person wearing the target garment are typically unavailable, rendering full-reference metrics such as SSIM and LPIPS inapplicable.
  2. Distribution-level metrics fail to reflect per-image quality: FID and KID measure dataset-level statistical similarity and cannot assess the perceptual quality of individual generated images.
  3. Existing VTON evaluation methods lack large-scale human validation: VTON-VLLM focuses on textual critique rather than quantitative scoring; VTBench employs LLM-based judgment without learning from large-scale human annotations; VTONQA is limited in scale (only 748 pairs annotated by 40 workers).
  4. Lack of publicly reproducible evaluation benchmarks: Existing methods have not released implementations or standardized benchmarks, hindering reproducible evaluation.
  5. Try-on quality assessment differs from single-image IQA: It requires simultaneous verification of garment fidelity and person identity preservation, necessitating cross-image interaction modeling by nature.
  6. Traditional metrics over-penalize global transformations: SSIM and LPIPS are sensitive to pose and scale variations, leading to misalignment with human perception.

Method

Overall Architecture

VTON-IQA adopts a three-branch Transformer architecture that takes a garment image \(I_G\), a person image \(I_P\), and a generated try-on image \(I_V\) as inputs, and produces a continuous quality score \(\hat{s} \in [-1, 1]\). The backbone is DINOv3 ViT-L/16.

Pipeline: Each of the three images undergoes patch embedding with a [CLS] token → the first \(L/2\) layers apply independent self-attention for feature extraction → the latter \(L/2\) layers introduce ICA modules for cross-image interaction → [CLS] representations are extracted → weighted cosine similarity scoring is performed.
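To make the pipeline concrete, here is a minimal PyTorch sketch of the three-branch flow. All class and parameter names are illustrative (the actual backbone is DINOv3 ViT-L/16, which this skeleton only approximates, with positional embeddings omitted for brevity), and plain encoder layers stand in for the ICA blocks sketched in the next section so the snippet runs on its own.

```python
import torch
import torch.nn as nn

class ThreeBranchPipeline(nn.Module):
    """Illustrative skeleton of the VTON-IQA pipeline, not the authors' code."""

    def __init__(self, dim=1024, depth=24, heads=16):
        super().__init__()
        # Shared patch embedding (16x16 patches) and per-branch [CLS] tokens.
        self.patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.cls = nn.ParameterDict(
            {b: nn.Parameter(torch.zeros(1, 1, dim)) for b in ("G", "P", "V")})
        half = depth // 2
        # First L/2 layers: independent self-attention per branch.
        self.indep = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(half))
        # Last L/2 layers: stand-ins for the ICA blocks (see next section).
        self.inter = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(half))

    def embed(self, img, branch):
        x = self.patch(img).flatten(2).transpose(1, 2)              # (B, N, D)
        return torch.cat([self.cls[branch].expand(len(x), -1, -1), x], dim=1)

    def forward(self, I_G, I_P, I_V):
        xg, xp, xv = (self.embed(i, b)
                      for i, b in zip((I_G, I_P, I_V), ("G", "P", "V")))
        for blk in self.indep:
            xg, xp, xv = blk(xg), blk(xp), blk(xv)
        for blk in self.inter:    # real model: xg, xp, xv = ica_block(xg, xp, xv)
            xg, xp, xv = blk(xg), blk(xp), blk(xv)
        return xg[:, 0], xp[:, 0], xv[:, 0]    # [CLS] features c_G, c_P, c_V
```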

Interleaved Cross-Attention (ICA) Module

The core design of ICA is asymmetric interaction: cross-attention layers are inserted between the self-attention and MLP components of standard Transformer blocks.

  • The try-on branch aggregates information from both the garment and person branches: \(\hat{X}_V = \tilde{X}_V + C_{V \leftarrow G} + C_{V \leftarrow P}\)
  • The garment/person branches receive information only from the try-on branch: \(\hat{X}_G = \tilde{X}_G + C_{G \leftarrow V}\), \(\hat{X}_P = \tilde{X}_P + C_{P \leftarrow V}\)
  • Bidirectional interactions between \((V, G)\) and \((V, P)\) are explicitly modeled, while direct \(G \leftrightarrow P\) coupling is deliberately avoided, since quality judgment is inherently try-on-centric.
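A minimal PyTorch sketch of one ICA block under the equations above; the post-norm residual layout, head count, and module names are assumptions made for brevity rather than the authors' implementation. It slots into the `inter` stage of the skeleton above.

```python
import torch
import torch.nn as nn

class ICABlock(nn.Module):
    """Illustrative Interleaved Cross-Attention block: cross-attention sits
    between self-attention and the MLP; V attends to G and P, while G and P
    attend only back to V (no direct G<->P coupling)."""

    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.sa = nn.ModuleDict(
            {b: nn.MultiheadAttention(dim, heads, batch_first=True)
             for b in ("G", "P", "V")})
        # One cross-attention module per directed edge: V<-G, V<-P, G<-V, P<-V.
        self.ca = nn.ModuleDict(
            {e: nn.MultiheadAttention(dim, heads, batch_first=True)
             for e in ("VG", "VP", "GV", "PV")})
        self.norm = nn.ModuleDict({b: nn.LayerNorm(dim) for b in ("G", "P", "V")})
        self.mlp = nn.ModuleDict(
            {b: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                              nn.Linear(4 * dim, dim))
             for b in ("G", "P", "V")})

    def att(self, mod, q, kv):
        return mod(q, kv, kv, need_weights=False)[0]

    def forward(self, xg, xp, xv):
        # 1) Independent self-attention per branch -> \tilde{X}.
        tg = xg + self.att(self.sa["G"], xg, xg)
        tp = xp + self.att(self.sa["P"], xp, xp)
        tv = xv + self.att(self.sa["V"], xv, xv)
        # 2) Asymmetric cross-attention -> \hat{X}:
        #    the try-on branch aggregates from both garment and person ...
        hv = tv + self.att(self.ca["VG"], tv, tg) + self.att(self.ca["VP"], tv, tp)
        #    ... while garment and person read only from the try-on branch.
        hg = tg + self.att(self.ca["GV"], tg, tv)
        hp = tp + self.att(self.ca["PV"], tp, tv)
        # 3) Per-branch MLP with residual.
        return (hg + self.mlp["G"](self.norm["G"](hg)),
                hp + self.mlp["P"](self.norm["P"](hp)),
                hv + self.mlp["V"](self.norm["V"](hv)))
```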

Scoring Module

The [CLS] tokens \(c_G, c_P, c_V\) from the three branches are extracted, and two cosine similarity terms are combined via a learnable weight \(\alpha\):

\[\tilde{s} = \alpha \cdot \cos(c_G, c_V) + (1-\alpha) \cdot \cos(c_P, c_V)\]

The final score is mapped to \([-1, 1]\) via a learnable affine transformation followed by a tanh activation.
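A sketch of the scoring head under these definitions; keeping \(\alpha\) in \((0, 1)\) via a sigmoid is an assumption (the paper only states that \(\alpha\) is learnable), as is the scalar parameterization of the affine map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoringHead(nn.Module):
    """Illustrative scoring module: weighted cosine similarity + affine + tanh."""

    def __init__(self):
        super().__init__()
        self.alpha_logit = nn.Parameter(torch.zeros(1))  # sigmoid -> alpha in (0, 1)
        self.scale = nn.Parameter(torch.ones(1))         # learnable affine a*s + b
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, c_G, c_P, c_V):                    # (B, D) [CLS] features
        alpha = torch.sigmoid(self.alpha_logit)
        s = (alpha * F.cosine_similarity(c_G, c_V, dim=-1)
             + (1 - alpha) * F.cosine_similarity(c_P, c_V, dim=-1))
        return torch.tanh(self.scale * s + self.bias)    # final score in [-1, 1]
```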

Loss & Training

Two objectives are jointly optimized:

  1. Bradley–Terry preference learning: Pairwise preferences between two try-on results generated for the same person–garment pair are modeled, and soft-label cross-entropy is used to align predicted preferences with human judgments.
  2. Score regression: An L2 loss encourages consistency between the predicted scores and the mean human ratings.

\[\mathcal{L} = -\,q_{ij}\log p_\theta(i \succ j) - (1-q_{ij})\log\bigl(1-p_\theta(i \succ j)\bigr) + \sum_{k}\bigl(\hat{s}_k - s_k\bigr)^2\]

where \(p_\theta(i \succ j)\) is the predicted probability that result \(i\) is preferred over \(j\), \(q_{ij}\) the soft human preference label, \(\hat{s}_k\) the predicted score of image \(k\), and \(s_k\) its mean human rating.
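A sketch of the joint objective; mapping the score gap to a Bradley–Terry preference probability through a sigmoid, and the unit weight between the two terms, are assumptions not spelled out above.

```python
import torch
import torch.nn.functional as F

def vton_iqa_loss(s_i, s_j, q_ij, s_pred, s_human, reg_weight=1.0):
    """Illustrative joint loss.

    s_i, s_j : predicted scores for two try-on results of the same
               person-garment pair
    q_ij     : soft human preference label in [0, 1]
    s_pred   : predicted scores (hat{s}_k) for the regression term
    s_human  : mean human ratings (s_k) for the same images
    """
    p_ij = torch.sigmoid(s_i - s_j)              # Bradley-Terry P(i preferred)
    bt = F.binary_cross_entropy(p_ij, q_ij)      # soft-label cross-entropy
    reg = F.mse_loss(s_pred, s_human)            # L2 score regression
    return bt + reg_weight * reg
```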

VTON-QBench Dataset Construction

| Dimension | Scale |
| --- | --- |
| Garment–person pairs | 13,153 (with synthetic augmentation at 1.9×) |
| Try-on images | 62,688 |
| VTON models | 14 (covering GAN / U-Net diffusion / DiT / commercial models) |
| Qualified annotators | 13,838 |
| Quality annotations | 431,800 |

Three-level annotation: Unnatural (1) / Slightly unnatural but not obvious (2) / Completely natural (3); the final score is the mean across multiple annotators.

Data cleaning: A two-stage filtering process is applied — (1) dummy question screening and anomalous behavior detection; (2) complete questionnaire removal when Krippendorff's \(\alpha \leq 0.4\) — raising \(\alpha\) from 0.286 to 0.550.
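The second stage can be reproduced with the `krippendorff` package; grouping each questionnaire's ratings as a raters × items matrix (with NaN for skipped items) is an assumption about the data layout.

```python
import numpy as np
import krippendorff  # pip install krippendorff

def filter_questionnaires(ratings, threshold=0.4):
    """Drop whole questionnaires whose ordinal Krippendorff alpha <= threshold.

    ratings: dict mapping questionnaire id -> (raters x items) float matrix
    of 1-3 labels, with np.nan where a rater skipped an item.
    """
    kept = {}
    for qid, mat in ratings.items():
        alpha = krippendorff.alpha(reliability_data=mat,
                                   level_of_measurement="ordinal")
        if alpha > threshold:
            kept[qid] = mat
    return kept

def image_score(labels):
    """Final per-image score: mean of the retained annotators' 1-3 labels."""
    return float(np.nanmean(labels))
```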

Key Experimental Results

Main Results: Comparison with Baselines (VTON-QBench Test Set)

| Method | SRCC↑ | PLCC↑ | R²↑ | A_macro↑ | A_micro↑ |
| --- | --- | --- | --- | --- | --- |
| SSIM | 0.135 | – | – | 0.596 | 0.593 |
| LPIPS | 0.387 | – | – | 0.701 | 0.695 |
| DINOv3 (zero-shot) | 0.261 | – | – | 0.637 | 0.641 |
| VTON-IQA w/o ICA | 0.617 | 0.615 | 0.372 | 0.722 | 0.747 |
| VTON-IQA (full) | 0.750 | 0.751 | 0.553 | 0.781 | 0.790 |
  • The ICA module yields substantial gains of SRCC +0.133 and PLCC +0.136.
  • Pairwise accuracy approaches human-level performance (human A_macro = 0.782 vs. model 0.771).
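For reference, the correlation metrics in the table follow their standard definitions; this is a generic sketch, not the authors' evaluation script.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlation_metrics(pred, human):
    """SRCC, PLCC, and R^2 between predicted and mean human scores."""
    pred, human = np.asarray(pred, float), np.asarray(human, float)
    srcc = spearmanr(pred, human)[0]             # rank (monotonic) agreement
    plcc = pearsonr(pred, human)[0]              # linear correlation
    ss_res = np.sum((human - pred) ** 2)         # R^2 of raw score predictions
    ss_tot = np.sum((human - human.mean()) ** 2)
    return srcc, plcc, 1.0 - ss_res / ss_tot
```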

Benchmark across 14 VTON Models (VITON-HD, Unpaired)

| Model | VTON-IQA↑ | FID↓ |
| --- | --- | --- |
| Nano Banana Pro | 0.315 | 10.309 |
| GPT-Image-1.5 | 0.234 | 12.801 |
| FitDiT | 0.189 | 9.893 |
| Qwen-Image-Edit | 0.087 | 10.706 |
| IDM-VTON | 0.039 | 9.093 |
| OOTDiffusion | -0.142 | 9.064 |
| LADI-VTON | -0.864 | 21.515 |

Commercial models lead substantially on the human-aligned score, while FID/KID rankings do not consistently correspond to human perception: OOTDiffusion attains the best (lowest) FID in the table, 9.064, yet receives a negative VTON-IQA score.

Ablation Study

  • ICA vs. w/o ICA: The ICA module yields significant improvements across all metrics, validating the necessity of cross-image interaction modeling.
  • Asymmetric vs. symmetric interaction: The asymmetric design (avoiding \(G \leftrightarrow P\) coupling) better reflects the semantic structure of try-on quality assessment.
  • Task-specific training vs. zero-shot: Fine-tuning on VTON-QBench yields PLCC +0.354 over DINOv3 zero-shot, demonstrating the critical importance of in-domain training.

Highlights & Insights

  1. Unprecedented dataset scale: VTON-QBench is, to the best of the authors' knowledge, the largest human subjective evaluation dataset for virtual try-on, with a planned open-source release.
  2. Elegant asymmetric ICA design: The try-on-centric interaction structure aligns with the semantics of quality assessment and avoids irrelevant coupling.
  3. Pairwise accuracy reaches human level: The model's A_macro falls only 0.011 short of human performance, indicating practical utility in preference ranking.
  4. First unified benchmark across 14 VTON models: Covering GAN, U-Net diffusion, DiT, and commercial models, the results reveal a systematic discrepancy between traditional metrics and human perception.
  5. Complete synthetic data augmentation pipeline: LoRA + FLUX.1-dev generation, GPT-based filtering, and human review expand garment–person pairs by 1.9×.

Limitations & Future Work

  1. Coarse three-level annotation granularity: Only three levels may fail to capture fine-grained quality differences; continuous ratings or multi-dimensional scoring may be preferable.
  2. Correlation metrics still lag behind humans: SRCC = 0.750 vs. human 0.760; R² = 0.489 vs. 0.536, leaving room for improvement in absolute score prediction.
  3. Single overall score without sub-dimension diagnosis: The framework outputs only a holistic quality score, lacking diagnostic capability for sub-dimensions such as garment texture, color, shape, and length.
  4. Heavy backbone: Three-branch inference with DINOv3 ViT-L/16 incurs substantial computational cost, requiring efficiency considerations for deployment.
  5. Synthetic augmentation relies on a commercial model: The pseudo-triplet construction uses Nano Banana Pro, which raises dataset construction cost and limits reproducibility.
  6. Quality assessment for dynamic scenarios such as video try-on remains unexplored.

Comparison with Prior Evaluation Methods

| Method | Data Scale | Reference-Free | Image-Level Score | Human Annotation | Open-Source |
| --- | --- | --- | --- | --- | --- |
| SSIM/LPIPS | N/A | ✗ | ✓ | ✗ | – |
| FID/KID | N/A | ✓ | ✗ (distribution-level) | ✗ | – |
| VTONQA | 748 pairs / 8,132 images / 40 workers | ✓ | ✓ | ✓ | ✗ |
| VTON-VLLM | – | ✓ | ✗ (textual) | ✗ | ✗ |
| VTBench | – | ✓ | ✓ | Indirect | ✗ |
| VTON-IQA | 13K pairs / 63K images / 14K workers | ✓ | ✓ | ✓ | ✓ |

VTON-IQA comprehensively surpasses prior work in data scale, evaluation completeness, and open-source commitment.

Rating

  • Novelty: ⭐⭐⭐⭐ — The asymmetric cross-image interaction design of ICA is original, and the large-scale VTON human annotation benchmark fills a notable gap in the field.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Benchmark across 14 models, human-level comparison, ablation studies, and qualitative analysis constitute an exceptionally comprehensive evaluation.
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured presentation with detailed dataset construction descriptions and rigorous mathematical formulations.
  • Value: ⭐⭐⭐⭐ — Provides the virtual try-on community with a standardized evaluation benchmark and tool, offering both engineering and academic value.