PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On

Conference: CVPR2026 arXiv: 2603.11675 Code: None Area: Image Generation Keywords: virtual try-on, diffusion transformer, flow matching, multi-condition generation, promptable editing

TL;DR

A virtual try-on framework built on Flow Matching DiT that significantly reduces inference overhead while maintaining high fidelity, achieved through latent multimodal condition concatenation, a temporal self-reference caching mechanism, and 3D-RoPE grouped condition injection. The framework supports multi-garment try-on and text-prompt-controlled outfit styling.

Background & Motivation

Virtual try-on (VTON) is a core capability in e-commerce, enabling consumers to preview outfits without physical fitting and reducing return rates. The field has evolved from warping-based → GAN-based → diffusion-based approaches, with diffusion models achieving remarkable breakthroughs in visual fidelity. However, existing methods suffer from three core limitations:

High architectural complexity: Methods such as IDM-VTON and FitDiT require additional reference networks to encode garment features, doubling parameter counts and complicating initialization and interaction logic.

Low inference efficiency: Dual-network architectures result in slow sampling speeds, making it difficult to balance quality and speed.

Uncontrollable outfit style: Some methods replace text encoders with image encoders, sacrificing text-driven controllability. PromptDresser introduces prompt-based control, but it relies on a closed-source VLM (GPT-4o) for style-description extraction, incurring high cost while offering limited accuracy.

The authors reframe VTON as a structured image editing problem that must simultaneously satisfy three key requirements: subject preservation, faithful texture transfer, and seamless integration. From this perspective, the paper proposes the PROMO framework: no reference network required, prompt-controllable outfit styling, and substantially improved inference speed.

Method

Overall Architecture

PROMO is built on FLUX.1-dev (Flow Matching DiT) and fine-tuned via LoRA (rank=128, 580M trainable parameters) across all linear layers. Inputs include:

  • Person image \(I_P\) (masked version \(\tilde{I}_P\))
  • Garment image set \(\{I_{G_i}\}_{i=1}^N\)
  • Optional outfit style description \(T_{style}\)
  • Pose and segmentation mask information

All image conditions are mapped to latent space via a unified encoder \(\mathcal{E}\), concatenated with the noisy latent \(\mathbf{z}_T\), and fed into the Flow Matching model. The final try-on result is generated through a VAE decoder.
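To make the joint-sequence design concrete, here is a minimal PyTorch sketch; the function name and tensor shapes are illustrative, not taken from the paper:

```python
import torch

def build_joint_sequence(z_t, cond_latents, text_emb):
    """Concatenate style-text tokens, noisy latent tokens, and encoded
    condition tokens into one sequence for the DiT's joint attention.

    z_t:          (B, N_z, D) noisy latent tokens at timestep t
    cond_latents: list of (B, N_i, D) token tensors, one per image condition
    text_emb:     (B, N_txt, D) style-prompt tokens (may be empty)
    """
    # All streams share the token width D, so a single concat along the
    # sequence dimension yields the model input; the 3D-RoPE ids described
    # later tell the model which tokens belong to which condition group.
    return torch.cat([text_emb, z_t, *cond_latents], dim=1)
```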

Key Designs

1. Spatial Condition Merging

Unlike IC-LoRA and similar methods that concatenate conditions at the pixel level, PROMO observes that mask and pose conditions carry substantial redundancy and need not match the output resolution. Accordingly (see the sketch after this list):

  • Mask and pose conditions are downsampled by 2× in pixel space (height and width), reducing the token count to 25%.
  • The pose condition is directly overlaid onto the agnostic mask, further merging the two.
  • The final token count is only 12.5% of the original dual-condition representation (87.5% compression), substantially improving training and inference efficiency.
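A minimal sketch of the merging step, assuming a 1-channel agnostic mask and a 3-channel pose rendering (both tensor layouts are assumptions, not the paper's spec):

```python
import torch
import torch.nn.functional as F

def merge_spatial_conditions(agnostic_mask, pose_map):
    """Downsample mask and pose 2x in H and W, then overlay the pose
    rendering onto the agnostic mask to form one merged condition map.

    agnostic_mask: (B, 1, H, W), pose_map: (B, 3, H, W), pixel space.
    """
    mask_small = F.interpolate(agnostic_mask, scale_factor=0.5, mode="nearest")
    pose_small = F.interpolate(pose_map, scale_factor=0.5, mode="bilinear",
                               align_corners=False)
    # Overlay: wherever the pose rendering is non-zero, draw it on the mask.
    pose_px = pose_small.abs().sum(dim=1, keepdim=True) > 0
    return torch.where(pose_px, pose_small, mask_small)
```

The arithmetic behind the 12.5% figure: two full-resolution maps contribute 2N tokens; one merged map at half height and width contributes 0.25N tokens, i.e. 12.5% of the original dual-condition total.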

2. Temporal Self-Reference Mechanism

Conventional reference networks require a full replica of the backbone's parameters. PROMO draws inspiration from FastFit and extends the idea to the DiT framework (a sketch follows the list):

  • Each condition \(C_i\) attends only to itself; text \(T_{style}\) and latent \(z_t\) have global visibility.
  • At inference: Key-Value pairs for all conditions \(C_i\) are cached at the first timestep; subsequent steps project queries only for the \(T_{style}\) and \(z_t\) streams, which attend to the cached KVs.
  • Effect: No parameter doubling, near-lossless quality, and inference time reduced from 22.24s to 9.18s (approximately 2.4× speedup).
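A schematic of the caching pattern in PyTorch. This is a sketch of the idea, not the paper's implementation; the module layout (shared KV projection, per-layer cache attribute) is an assumption:

```python
import torch
import torch.nn.functional as F

class CachedCondAttention(torch.nn.Module):
    """Joint attention where condition K/V are computed once per sample.

    First denoising step: condition tokens are projected to K/V and
    cached. Later steps project only the query stream (z_t + style
    text), which attends to its own K/V plus the cached condition K/V.
    """

    def __init__(self, dim, heads=8):
        super().__init__()
        self.q_proj = torch.nn.Linear(dim, dim)
        self.kv_proj = torch.nn.Linear(dim, 2 * dim)
        self.heads = heads
        self.cache = None  # (K, V) of the condition tokens

    def forward(self, query_tokens, cond_tokens=None):
        if cond_tokens is not None:  # first timestep only
            self.cache = self.kv_proj(cond_tokens).chunk(2, dim=-1)
        qk, qv = self.kv_proj(query_tokens).chunk(2, dim=-1)
        k = torch.cat([qk, self.cache[0]], dim=1)  # query stream sees itself
        v = torch.cat([qv, self.cache[1]], dim=1)  # plus cached conditions
        q = self.q_proj(query_tokens)

        B, Nq, D = q.shape
        h, dh = self.heads, D // self.heads
        q = q.view(B, Nq, h, dh).transpose(1, 2)
        k = k.view(B, -1, h, dh).transpose(1, 2)
        v = v.view(B, -1, h, dh).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(B, Nq, D)
```

Because the conditions stay constant throughout denoising, the cache remains valid for the whole sampling trajectory; only the query stream is re-projected at each step.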

3. 3D-RoPE Grouped Condition Encoding

The temporal dimension of RoPE positional encoding is used as a condition group identifier, distinguishing spatial conditions from garment conditions without any additional parameters:

  • Spatial conditions: \((t,x,y)_{\mathcal{C}_i} = (i, x, y)_{Z_t}\)
  • Garment conditions: \((t,x,y)_{\mathcal{C}_i} = (i, x, y + \Delta)_{Z_t}\)

This is a parameter-free approach that enables generalization from single-garment training to multi-garment inference in a single forward pass, avoiding iterative error accumulation. Ablation studies show that removing 3D-RoPE causes FID to degrade significantly from 3.31 to 6.73.
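In code, the grouping amounts to how per-token (t, x, y) ids are laid out. The helper below is a hypothetical sketch; the grid construction and the concrete choice of \(\Delta\) are assumptions:

```python
import torch

def cond_position_ids(group_id, h, w, delta=0):
    """(t, x, y) RoPE ids for one condition over an h x w token grid.

    t serves purely as a group identifier. Spatial condition i reuses
    the latent's (x, y) grid with t = i; garment condition i is
    additionally shifted by delta along y, matching the paper's
    (i, x, y) and (i, x, y + delta) assignments.
    """
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    t = torch.full_like(xs, group_id)
    return torch.stack([t, xs, ys + delta], dim=-1).reshape(-1, 3)

# latent z_t:       cond_position_ids(0, h, w)
# spatial cond i:   cond_position_ids(i, h, w)
# garment cond i:   cond_position_ids(i, h, w, delta=h)  # illustrative offset
```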

4. Style Prompt System

To remove PromptDresser's dependence on GPT-4o, the authors design a schema-constrained distillation pipeline (a schema sketch follows the list):

  1. A multi-garment JSON schema is designed, with its OpenAPI specification generated via Pydantic (a structured format that LLMs parse and follow more reliably).
  2. Qwen2.5-VL-72B first annotates a small dataset → strict filtering → fine-tuning of Qwen2.5-VL-7B.
  3. The fine-tuned 7B model is faster and surpasses the 72B annotation model in accuracy (as all training data passed strict quality control).
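A sketch of what the schema step might look like with Pydantic; all field names here are illustrative, not the paper's actual schema:

```python
from typing import List, Optional
from pydantic import BaseModel, Field

class GarmentStyle(BaseModel):
    """Styling attributes for one garment (fields are hypothetical)."""
    category: str = Field(description="e.g. 'upper', 'lower', 'dress'")
    tucked_in: Optional[bool] = None
    sleeves_rolled: Optional[bool] = None
    fit: Optional[str] = Field(None, description="e.g. 'loose', 'fitted'")

class OutfitStyle(BaseModel):
    """Multi-garment outfit description the VLM must emit as JSON."""
    garments: List[GarmentStyle]

# A machine-readable schema constrains the VLM's output so annotations
# can be parsed and strictly filtered before fine-tuning the 7B model.
schema = OutfitStyle.model_json_schema()
```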

5. Accurate Human Body Shape Estimation

Standard DensePose produces distorted estimates on loose garments (e.g., long skirts), leaking the original garment's silhouette into the conditioning. The paper instead combines EOMT with iterative image-generation training to build a pose and shape estimator that is robust to garment occlusion, closing this leakage path.

Loss & Training

Region-Aware Loss Weighting: Human parsing results are used to distinguish body and background regions, downsampled by 16× to match latent resolution. Body regions are weighted \(1+\lambda\) and background regions \(1-\lambda\) (\(\lambda=0.5\)):

\[\mathcal{L}_{\text{weighted}} = \mathbb{E}_{t, \mathbf{z}_0, \boldsymbol{\epsilon}} \left[ \mathbf{W} \odot \| \boldsymbol{v} - \boldsymbol{v}_\theta(\mathbf{z}_t, t, \mathbf{c}) \|^2 \right]\]
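A minimal sketch of this loss in PyTorch, assuming a binary human-parsing mask in pixel space and a 16× pixel-to-latent downsampling (function name is illustrative):

```python
import torch
import torch.nn.functional as F

def region_weighted_fm_loss(v_pred, v_target, body_mask, lam=0.5):
    """Flow-matching MSE with region-aware weights.

    v_pred, v_target: (B, C, h, w) predicted / target velocity, latent space
    body_mask:        (B, 1, 16h, 16w) binary mask from human parsing
    """
    m = F.interpolate(body_mask.float(), scale_factor=1 / 16, mode="nearest")
    w = m * (1 + lam) + (1 - m) * (1 - lam)  # 1.5 on body, 0.5 on background
    return (w * (v_pred - v_target) ** 2).mean()
```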

Training configuration: 16 × H800 GPUs, batch size 16, 90K steps, Prodigy optimizer (default learning rate 1), trained on the VITON-HD and DressCode training sets at resolution 1024×768.
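Prodigy's "learning rate 1" looks unusual but is simply that optimizer's convention: it estimates the effective step size adaptively, so the nominal rate is left at its default. A minimal setup sketch, with a placeholder module standing in for the LoRA-injected FLUX backbone:

```python
import torch
from prodigyopt import Prodigy  # pip install prodigyopt

model = torch.nn.Linear(8, 8)  # placeholder for the LoRA-wrapped DiT
# Prodigy adapts the step size on the fly; lr stays at its default of 1.
optimizer = Prodigy(model.parameters(), lr=1.0)
```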

Key Experimental Results

Main Results: Comparison with VTON Methods (DressCode + VITON-HD, Paired Setting)

| Method | DC-SSIM↑ | DC-LPIPS↓ | DC-FID↓ | DC-KID↓ | VH-SSIM↑ | VH-LPIPS↓ | VH-FID↓ | VH-KID↓ |
|---|---|---|---|---|---|---|---|---|
| LaDI-VTON | 0.756 | 0.380 | 5.47 | 1.93 | 0.872 | 0.153 | 6.85 | 1.38 |
| CatVTON | 0.894 | 0.160 | 6.54 | 3.96 | 0.867 | 0.188 | 9.44 | 4.74 |
| OOTDiffusion | 0.888 | 0.080 | 3.66 | 0.86 | 0.792 | 0.191 | 32.89 | 20.08 |
| Any2AnyTryon | 0.911 | 0.121 | 3.08 | 1.06 | 0.871 | 0.157 | 7.12 | 2.18 |
| PROMO | 0.891 | 0.089 | 3.31 | 0.49 | 0.862 | 0.111 | 6.89 | 1.49 |

PROMO achieves the best paired KID on DressCode (0.49 vs. runner-up 0.86) and is competitive on VITON-HD (1.49, just behind LaDI-VTON's 1.38), while delivering strong LPIPS throughout. Its KID advantage is more pronounced in the unpaired setting (DC: 0.50 vs. runner-up 1.53; VH: 1.92 vs. runner-up 2.05).

Ablation Study (DressCode Dataset)

| Variant | SSIM↑ | LPIPS↓ | FID↓ | KID↓ | unp-FID↓ | unp-KID↓ | Inference Time (s) |
|---|---|---|---|---|---|---|---|
| w/o parsing area loss | 0.890 | 0.087 | 3.28 | 0.51 | 4.64 | 0.95 | – |
| w/o style prompt | 0.890 | 0.093 | 3.72 | 0.89 | 5.35 | 0.62 | – |
| w/o 3D-RoPE | 0.870 | 0.130 | 6.73 | 1.72 | 7.82 | 2.28 | – |
| w/o Temporal Self-Reference | – | – | 3.31 | 0.80 | 4.74 | 0.53 | 22.24 |
| w/o Spatial Token Merging | – | – | 3.49 | 0.51 | 4.85 | 0.49 | 11.10 |
| PROMO (full) | 0.891 | 0.089 | 3.31 | 0.49 | 4.74 | 0.50 | 9.18 |

(– indicates values not reported for that variant.)

Key Findings

  1. 3D-RoPE is the most critical component: Its removal causes significant degradation across all metrics (FID 3.31→6.73), indicating the model can no longer correctly distinguish the semantic roles of different conditions.
  2. Temporal Self-Reference yields 2.4× speedup: Inference time reduced from 22.24s to 9.18s with near-lossless quality.
  3. Spatial condition merging reduces condition tokens by 87.5%: Inference time decreases from 11.10s to 9.18s without quality degradation.
  4. PROMO achieves an overall preference rate of 84.42% in user studies, surpassing commercial systems including Huihua (78.85%), Douyin (61.54%), and Kling (60.19%).
  5. Substantial superiority over general image editing models: Seedream 4.0, Qwen Image Edit, and Gemini 2.5-Flash exhibit pronounced color inconsistency and artifacts on the VTON task.

Highlights & Insights

  1. "Conditions need not match output resolution": This straightforward insight enables mask and pose conditions to be compressed to 12.5%, demonstrating that information density differences should be considered in condition injection design.
  2. Parameter-free 3D-RoPE grouped encoding: Distinguishing condition groups solely through positional encoding elegantly enables generalization from single-garment training to multi-garment inference without introducing any new parameters.
  3. KV cache reuse across timesteps: Since conditions remain constant during denoising, their KVs need only be computed once — a design principle with broad implications for all conditional generation tasks.
  4. Distillation outperforms direct use of large models: The fine-tuned 7B model surpasses the 72B annotation model in accuracy, as training data underwent rigorous filtering — data quality outweighs model scale.
  5. Reframing VTON as structured editing: This perspective makes the framework naturally transferable to a broader range of image editing tasks.

Limitations & Future Work

  1. SSIM is not optimal: In the paired setting, SSIM does not rank highest (Any2AnyTryon 0.911 vs. PROMO 0.891), indicating room for improvement in pixel-level reconstruction fidelity.
  2. Strong dependence on human pose estimation: DensePose and DWPose preprocessing are required, and these estimators may still fail under extreme poses.
  3. Limited gain from Parsing Area Loss on clean backgrounds: The improvement is marginal on benchmark datasets and primarily manifests in in-the-wild scenarios, suggesting insufficient complexity in public datasets.
  4. No extension to video try-on: The framework supports only single-frame generation; extending to video try-on with temporal consistency constraints is a natural future direction.
  5. Style Prompt System requires additional training: The 72B→7B distillation pipeline still entails a non-trivial deployment overhead.
Related Work

  • FastFit: The Temporal Self-Reference mechanism is directly inspired by FastFit's KV caching design for UNet, which PROMO extends to the DiT architecture.
  • OminiControl / OminiControl2: The spatial condition downsampling idea originates from these works; PROMO further applies pose-mask merging for additional compression.
  • FLUX Kontext: The primary inspiration for 3D-RoPE grouped conditioning; however, Kontext supports only single-image conditions, whereas PROMO extends this to multi-garment scenarios.
  • PromptDresser: A comparable prompt-controllable VTON approach, but one that relies on GPT-4o; PROMO's distillation-based solution is more practical and achieves higher accuracy.

Rating

| Dimension | Score (1–10) | Notes |
|---|---|---|
| Novelty | 7 | Individual components are not entirely new, but their combination is elegant; the application of 3D-RoPE and KV caching to VTON is innovative |
| Technical Depth | 8 | Covers condition injection, positional encoding, inference acceleration, and VLM distillation, with thorough analysis of each |
| Experimental Thoroughness | 8 | Dual-dataset evaluation + in-the-wild testing + commercial system comparison + user study + comprehensive ablation |
| Practical Value | 9 | Directly targets e-commerce deployment; significant inference speedup; enriched by engineering experience from Xiaohongshu's in-house system |
| Writing Quality | 7 | Well-structured with good overall readability |