# Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback

- Conference: CVPR 2026
- arXiv: 2603.13057
- Code: GitHub
- Area: Human-Centric Understanding / Virtual Try-On Quality Assessment
- Keywords: virtual try-on, image quality assessment, reference-free evaluation, human feedback alignment, cross-attention, large-scale annotation benchmark

## TL;DR
This paper presents VTON-IQA, a reference-free image quality assessment framework for virtual try-on. It introduces VTON-QBench, a large-scale benchmark comprising 62,688 try-on images annotated with 431,800 human judgments, and proposes an Interleaved Cross-Attention (ICA) module to model interactions among garment, person, and try-on images, achieving image-level quality predictions that are closely aligned with human perception.
## Background & Motivation
- Absence of ground-truth references in real-world deployment: In practical e-commerce settings, ground-truth images of the same person wearing the target garment are typically unavailable, rendering full-reference metrics such as SSIM and LPIPS inapplicable.
- Distribution-level metrics fail to reflect per-image quality: FID and KID measure dataset-level statistical similarity and cannot assess the perceptual quality of individual generated images.
- Existing VTON evaluation methods lack large-scale human validation: VTON-VLLM focuses on textual critique rather than quantitative scoring; VTBench employs LLM-based judgment without learning from large-scale human annotations; VTONQA is limited in scale (only 748 pairs annotated by 40 workers).
- Lack of publicly reproducible evaluation benchmarks: Existing methods have not released implementations or standardized benchmarks, hindering reproducible evaluation.
- Try-on quality assessment differs from single-image IQA: It requires simultaneous verification of garment fidelity and person identity preservation, necessitating cross-image interaction modeling by nature.
- Traditional metrics over-penalize global transformations: SSIM and LPIPS are sensitive to pose and scale variations, leading to misalignment with human perception.
## Method

### Overall Architecture
VTON-IQA adopts a three-branch Transformer architecture that takes a garment image \(I_G\), a person image \(I_P\), and a generated try-on image \(I_V\) as inputs, and produces a continuous quality score \(\hat{s} \in [-1, 1]\). The backbone is DINOv3 ViT-L/16.
Pipeline: Each of the three images undergoes patch embedding with a [CLS] token → the first \(L/2\) layers apply independent self-attention for feature extraction → the latter \(L/2\) layers introduce ICA modules for cross-image interaction → [CLS] representations are extracted → weighted cosine similarity scoring is performed.
### Interleaved Cross-Attention (ICA) Module
The core design of ICA is asymmetric interaction: cross-attention layers are inserted between the self-attention and MLP components of standard Transformer blocks.
- The try-on branch aggregates information from both the garment and person branches: \(\hat{X}_V = \tilde{X}_V + C_{V \leftarrow G} + C_{V \leftarrow P}\)
- The garment/person branches receive information only from the try-on branch: \(\hat{X}_G = \tilde{X}_G + C_{G \leftarrow V}\), \(\hat{X}_P = \tilde{X}_P + C_{P \leftarrow V}\)
- Bidirectional interactions between \((V, G)\) and \((V, P)\) are explicitly modeled, while direct \(G \leftrightarrow P\) coupling is deliberately avoided, since quality judgment is inherently try-on-centric (see the sketch after this list).
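A minimal PyTorch sketch of one ICA-augmented block implementing the equations above. Details the summary leaves open (normalization placement, per-branch vs. shared weights, MLP width) are assumptions, and names such as `ICABlock` are illustrative, not from the paper's code.

```python
import torch
import torch.nn as nn

class ICABlock(nn.Module):
    """One Transformer block with Interleaved Cross-Attention (sketch).

    Cross-attention sits between self-attention and the MLP. Routing is
    asymmetric: the try-on branch V attends to both G and P, while the
    garment (G) and person (P) branches attend only to V.
    """

    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.norm1 = nn.ModuleDict({k: nn.LayerNorm(dim) for k in "GPV"})
        self.self_attn = nn.ModuleDict(
            {k: nn.MultiheadAttention(dim, heads, batch_first=True) for k in "GPV"}
        )
        # One cross-attention per directed edge; key "VG" means V queries G.
        self.cross_attn = nn.ModuleDict(
            {k: nn.MultiheadAttention(dim, heads, batch_first=True)
             for k in ("VG", "VP", "GV", "PV")}
        )
        self.norm2 = nn.ModuleDict({k: nn.LayerNorm(dim) for k in "GPV"})
        self.mlp = nn.ModuleDict(
            {k: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                              nn.Linear(4 * dim, dim)) for k in "GPV"}
        )

    def _ca(self, edge: str, q: torch.Tensor, kv: torch.Tensor) -> torch.Tensor:
        out, _ = self.cross_attn[edge](q, kv, kv)  # C_{q <- kv}
        return out

    def forward(self, x_g, x_p, x_v):
        # Self-attention with residual: X-tilde = X + SA(LN(X))
        t = {k: x + self.self_attn[k](self.norm1[k](x), self.norm1[k](x),
                                      self.norm1[k](x))[0]
             for k, x in zip("GPV", (x_g, x_p, x_v))}
        # Asymmetric cross-attention, matching the equations above.
        h_v = t["V"] + self._ca("VG", t["V"], t["G"]) + self._ca("VP", t["V"], t["P"])
        h_g = t["G"] + self._ca("GV", t["G"], t["V"])
        h_p = t["P"] + self._ca("PV", t["P"], t["V"])
        # MLP with residual closes the block.
        out = {k: h + self.mlp[k](self.norm2[k](h))
               for k, h in zip("GPV", (h_g, h_p, h_v))}
        return out["G"], out["P"], out["V"]
```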
### Scoring Module
The [CLS] tokens \(c_G, c_P, c_V\) are extracted from the three branches, and the two cosine similarity terms \(\cos(c_V, c_G)\) and \(\cos(c_V, c_P)\) are combined via a learnable weight \(\alpha\).
The final score is mapped to \([-1, 1]\) via a learnable affine transformation followed by a tanh activation.
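A sketch of this head, assuming \(\alpha\) interpolates the two similarity terms (the summary does not pin down the exact combination); the class name and the sigmoid reparameterization of \(\alpha\) are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedCosineScorer(nn.Module):
    """Scoring-head sketch: mix two cosine similarities with a learnable
    weight, then map the result to [-1, 1] via learnable affine + tanh."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.0))  # learnable mixing weight
        self.scale = nn.Parameter(torch.tensor(1.0))  # affine: scale * s + bias
        self.bias = nn.Parameter(torch.tensor(0.0))

    def forward(self, c_g, c_p, c_v):
        sim_vg = F.cosine_similarity(c_v, c_g, dim=-1)  # garment fidelity cue
        sim_vp = F.cosine_similarity(c_v, c_p, dim=-1)  # identity preservation cue
        a = torch.sigmoid(self.alpha)  # keep the mix in (0, 1) -- an assumption
        s = a * sim_vg + (1 - a) * sim_vp
        return torch.tanh(self.scale * s + self.bias)  # final score in [-1, 1]
```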
### Loss & Training
Two objectives are jointly optimized:
- Bradley–Terry preference learning: Pairwise preferences between two try-on results generated for the same person–garment pair are modeled, and soft-label cross-entropy is used to align predicted preferences with human judgments.
- Score regression: An L2 loss constrains the consistency between predicted scores and human ratings (see the sketch below).
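A compact sketch of the joint objective; the loss weighting `lam` and the mapping of 1-3 human ratings into the model's score range are assumptions not specified in the summary.

```python
import torch
import torch.nn.functional as F

def vton_iqa_loss(s_a, s_b, p_human, y_a, y_b, lam=1.0):
    """Joint training loss sketch.

    s_a, s_b : predicted scores for two try-on results of the same
               person-garment pair.
    p_human  : soft label, fraction of annotators preferring result a.
    y_a, y_b : human mean ratings mapped into the model's score range.
    lam      : regression weight (assumed; not given in the summary).
    """
    # Bradley-Terry: model P(a preferred over b) from the score gap,
    # aligned to human preferences with soft-label cross-entropy.
    p_model = torch.sigmoid(s_a - s_b)
    l_pref = F.binary_cross_entropy(p_model, p_human)
    # L2 regression ties absolute scores to human ratings.
    l_reg = F.mse_loss(s_a, y_a) + F.mse_loss(s_b, y_b)
    return l_pref + lam * l_reg
```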
## VTON-QBench Dataset Construction
| Dimension | Scale |
|---|---|
| Garment–person pairs | 13,153 (with synthetic augmentation at 1.9×) |
| Try-on images | 62,688 |
| VTON models | 14 (covering GAN / U-Net Diffusion / DiT / commercial models) |
| Qualified annotators | 13,838 |
| Quality annotations | 431,800 |
Three-level annotation: Unnatural (1) / Slightly unnatural but not obvious (2) / Completely natural (3); the final score is the mean across multiple annotators.
Data cleaning: a two-stage filter is applied. Stage 1 screens dummy questions and detects anomalous annotator behavior; stage 2 removes entire questionnaires whose Krippendorff's \(\alpha \leq 0.4\). Together these raise \(\alpha\) from 0.286 to 0.550 (a stage-2 sketch follows).
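A sketch of the stage-2 filter using the third-party `krippendorff` package; the per-questionnaire data layout (annotators by items, `np.nan` for missing answers) is an assumption.

```python
import numpy as np
import krippendorff  # pip install krippendorff

def filter_questionnaires(questionnaires, threshold=0.4):
    """Drop questionnaires whose inter-annotator agreement is too low.

    `questionnaires` maps an id to a (num_annotators, num_items) array of
    ordinal 1-3 ratings with np.nan for missing answers (assumed layout).
    """
    kept = {}
    for qid, ratings in questionnaires.items():
        alpha = krippendorff.alpha(reliability_data=ratings,
                                   level_of_measurement="ordinal")
        if alpha > threshold:  # remove the whole questionnaire when alpha <= 0.4
            kept[qid] = ratings
    return kept
```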
## Key Experimental Results

### Main Results: Comparison with Baselines (VTON-QBench Test Set)
| Method | SRCC↑ | PLCC↑ | R²↑ | A_macro↑ | A_micro↑ |
|---|---|---|---|---|---|
| SSIM | – | 0.135 | – | 0.596 | 0.593 |
| LPIPS | – | 0.387 | – | 0.701 | 0.695 |
| DINOv3 (zero-shot) | – | 0.261 | – | 0.637 | 0.641 |
| VTON-IQA w/o ICA | 0.617 | 0.615 | 0.372 | 0.722 | 0.747 |
| VTON-IQA (full) | 0.750 | 0.751 | 0.553 | 0.781 | 0.790 |
- The ICA module yields substantial gains of SRCC +0.133 and PLCC +0.136.
- Pairwise accuracy approaches human-level performance (human A_macro = 0.782 vs. model 0.771); the table's metrics are sketched below.
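A sketch of how these metrics can be computed from per-image scores with NumPy/SciPy; the exact macro/micro averaging for pairwise accuracy is an assumption, since the summary does not define it.

```python
import numpy as np
from scipy import stats

def correlation_metrics(pred, human):
    """SRCC, PLCC, and R^2 between predicted and human per-image scores."""
    srcc, _ = stats.spearmanr(pred, human)  # rank correlation
    plcc, _ = stats.pearsonr(pred, human)   # linear correlation
    ss_res = np.sum((human - pred) ** 2)
    ss_tot = np.sum((human - human.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot              # one common R^2 definition
    return srcc, plcc, r2

def pairwise_accuracy(pred_pairs, human_pairs):
    """Fraction of pairs where the model prefers the same try-on result as
    the human majority; pred_pairs / human_pairs are (N, 2) score arrays.
    The macro/micro split (e.g. per garment-person pair vs. overall
    averaging) is not specified, so this is the plain micro version."""
    pred_pref = pred_pairs[:, 0] > pred_pairs[:, 1]
    human_pref = human_pairs[:, 0] > human_pairs[:, 1]
    return float(np.mean(pred_pref == human_pref))
```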
### Benchmark across 14 VTON Models (VITON-HD, Unpaired)
| Model | VTON-IQA↑ | FID↓ |
|---|---|---|
| Nano Banana Pro | 0.315 | 10.309 |
| GPT-Image-1.5 | 0.234 | 12.801 |
| FitDiT | 0.189 | 9.893 |
| Qwen-Image-Edit | 0.087 | 10.706 |
| IDM-VTON | 0.039 | 9.093 |
| OOTDiffusion | -0.142 | 9.064 |
| LADI-VTON | -0.864 | 21.515 |
Commercial models lead substantially on human-aligned scores; FID/KID rankings do not consistently correspond to human perception.
### Ablation Study
- ICA vs. w/o ICA: The ICA module yields significant improvements across all metrics, validating the necessity of cross-image interaction modeling.
- Asymmetric vs. symmetric interaction: The asymmetric design (avoiding \(G \leftrightarrow P\) coupling) better reflects the semantic structure of try-on quality assessment.
- Task-specific training vs. zero-shot: Fine-tuning on VTON-QBench yields PLCC +0.354 over DINOv3 zero-shot (the w/o-ICA variant's 0.615 vs. 0.261), demonstrating the critical importance of in-domain training.
## Highlights & Insights
- Unprecedented dataset scale: VTON-QBench is, to the best of the authors' knowledge, the largest human subjective evaluation dataset for virtual try-on, with a planned open-source release.
- Elegant asymmetric ICA design: The try-on-centric interaction structure aligns with the semantics of quality assessment and avoids irrelevant coupling.
- Pairwise accuracy reaches human level: The model's A_macro falls only 0.011 short of human performance, indicating practical utility in preference ranking.
- First unified benchmark across 14 VTON models: Covering GAN, UNet-Diffusion, DiT, and commercial models, the results reveal a systematic discrepancy between traditional metrics and human perception.
- Complete synthetic data augmentation pipeline: LoRA + FLUX.1-dev generation, GPT-based filtering, and human review expand garment–person pairs by 1.9×.
## Limitations & Future Work
- Coarse three-level annotation granularity: Only three levels may fail to capture fine-grained quality differences; continuous ratings or multi-dimensional scoring may be preferable.
- Correlation metrics still lag behind humans: SRCC = 0.750 vs. human 0.760; R² = 0.489 vs. 0.536, leaving room for improvement in absolute score prediction.
- Single overall score without sub-dimension diagnosis: The framework outputs only a holistic quality score, lacking diagnostic capability for sub-dimensions such as garment texture, color, shape, and length.
- Heavy backbone: Three-branch inference with DINOv3 ViT-L/16 incurs substantial computational cost, requiring efficiency considerations for deployment.
- Synthetic augmentation relies on a commercial model: The pseudo-triplet construction uses Nano Banana Pro, which adds dataset construction cost and limits reproducibility.
- Video try-on remains unexplored: Quality assessment for dynamic scenarios such as video try-on is not addressed.
## Related Work & Insights
| Method | Data Scale | Reference-Free | Image-Level Score | Human Annotation | Open-Source |
|---|---|---|---|---|---|
| SSIM/LPIPS | N/A | ✗ | ✓ | ✗ | ✓ |
| FID/KID | N/A | ✓ | ✗ (distribution-level) | ✗ | ✓ |
| VTONQA | 748 pairs / 8,132 images / 40 workers | ✓ | ✓ | ✓ | ✗ |
| VTON-VLLM | – | ✓ | ✗ (textual) | ✓ | ✗ |
| VTBench | – | ✓ | ✓ | Indirect | – |
| VTON-IQA | 13K pairs / 63K images / 14K workers | ✓ | ✓ | ✓ | ✓ |
VTON-IQA comprehensively surpasses prior work in data scale, evaluation completeness, and open-source commitment.
## Rating
- Novelty: ⭐⭐⭐⭐ — The asymmetric cross-image interaction design of ICA is original, and the large-scale VTON human annotation benchmark fills a notable gap in the field.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Benchmark across 14 models, human-level comparison, ablation studies, and qualitative analysis constitute an exceptionally comprehensive evaluation.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured presentation with detailed dataset construction descriptions and rigorous mathematical formulations.
- Value: ⭐⭐⭐⭐ — Provides the virtual try-on community with a standardized evaluation benchmark and tool, offering both engineering and academic value.