OmniVTON: Training-Free Universal Virtual Try-On¶
Conference: ICCV2025 arXiv: 2507.15037 Code: GitHub Area: Image Generation Keywords: virtual try-on, training-free, diffusion model, garment warping, pose alignment
TL;DR¶
OmniVTON proposes the first training-free universal virtual try-on framework. By decoupling garment texture and pose conditions, the method employs three core modules—Structured Garment Morphing (SGM), Continuous Boundary Stitching (CBS), and Spectral Pose Injection (SPI)—to achieve high-fidelity try-on in both in-shop and in-the-wild settings, while also supporting multi-person try-on for the first time.
Background & Motivation¶
- Problem Definition: Image-based virtual try-on (VTON) requires seamlessly transferring a garment image onto a target human body while preserving texture consistency and pose fidelity.
- Limitations of Prior Work:
- Supervised in-shop methods (GP-VTON, IDM-VTON, etc.) rely on paired training data and generalize poorly across domains.
- Unsupervised in-the-wild methods (StreetTryOn) are constrained by data distribution bias and lack general applicability.
- Both paradigms require training dedicated models for specific conditions; constructing large-scale datasets spanning diverse categories and poses is impractical.
- Key Challenge:
- Fine-grained texture consistency: Without a training phase, establishing garment–body alignment while preserving texture details is inherently difficult.
- Human pose alignment: Existing methods rely on keypoint or DensePose conditioning and require retraining for cross-modal feature fusion.
- Goal: To develop a unified, training-free VTON framework capable of generalizing across different domains and scenarios.
Method¶
Overall Architecture¶
OmniVTON adopts a two-stage pipeline built on off-the-shelf diffusion models without any additional training:

1. Stage 1: Warp the target garment to align with the human body via SGM to produce a garment prior.
2. Stage 2: Progressively inpaint and complete the final try-on image using the garment prior and pose-encoded noise through the CBS mechanism.
Key Design 1: Structured Garment Morphing (SGM)¶
SGM leverages skeleton information and parsing maps to constrain garment deformation without retraining, making it applicable across different domains.
Pseudo-person image generation: For Shop-to-X scenarios (garment image only), a pseudo-person image is generated via attention modulation. Specifically, garment-conditioned noise and person-conditioned noise are denoised in parallel, with the K and V of the latter injected into the self-attention layers of the former, as sketched below.
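A minimal sketch of this KV injection, assuming single-head scaled dot-product attention over flattened latent tokens (head splitting and the surrounding diffusion loop are omitted; `kv_injected_attention` is an illustrative name, not the authors' API):

```python
import torch

def kv_injected_attention(q_garment, k_person, v_person):
    """Self-attention on the garment-conditioned path whose keys and values
    are taken from the person-conditioned path, so garment tokens are
    rendered in the person context (pseudo-person generation)."""
    scale = q_garment.shape[-1] ** -0.5
    attn = torch.softmax(q_garment @ k_person.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_person

# Toy usage: batch of 1, 64 latent tokens, 320-dim features.
q = torch.randn(1, 64, 320)
k, v = torch.randn(1, 64, 320), torch.randn(1, 64, 320)
out = kv_injected_attention(q, k, v)  # shape (1, 64, 320)
```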
Multi-part semantic correspondence: Using the 25 keypoints from OpenPose, the human body is divided into \(N\) semantic regions (e.g., for upper-body garments: torso, left/right upper arm, and left/right forearm—5 regions in total), establishing multi-part semantic correspondences between the target garment and the source person image. TAPPS-generated segmentation maps isolate the pixels belonging to each region.
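For concreteness, a hypothetical grouping of the OpenPose BODY_25 keypoints into these five regions (indices follow the standard BODY_25 layout; the paper's exact partition may differ):

```python
# Hypothetical BODY_25 keypoint grouping for the five upper-body regions.
# BODY_25 indices: 1 neck, 2/5 R/L shoulder, 3/6 R/L elbow,
# 4/7 R/L wrist, 8 mid-hip, 9/12 R/L hip.
UPPER_BODY_REGIONS = {
    "torso":           [1, 2, 5, 8, 9, 12],
    "right_upper_arm": [2, 3],
    "right_forearm":   [3, 4],
    "left_upper_arm":  [5, 6],
    "left_forearm":    [6, 7],
}
```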
Localized transformation: For each semantic region, a homography matrix \(\mathcal{H}_{o \to p}^i \in \mathbb{R}^{3 \times 3}\) is optimized from its bounding-box corner pairs using the Levenberg–Marquardt algorithm, followed by a piecewise perspective transformation of the garment part.
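A minimal sketch of the localized transformation, assuming the per-part corner correspondences and target-frame masks have already been extracted (the keypoint-to-corner mapping and TAPPS segmentation are omitted). With exactly four corner pairs the homography is fully determined, so OpenCV's direct solver stands in here for the paper's Levenberg–Marquardt optimization:

```python
import cv2
import numpy as np

def piecewise_warp(garment, corners_src, corners_dst, part_masks, out_hw):
    """Warp each garment part with its own homography and composite the
    results. corners_src / corners_dst: lists of (4, 2) float32 corner
    arrays per part; part_masks: binary (H, W) masks in the target frame."""
    h, w = out_hw
    out = np.zeros((h, w, garment.shape[2]), dtype=garment.dtype)
    for src, dst, mask in zip(corners_src, corners_dst, part_masks):
        H = cv2.getPerspectiveTransform(src, dst)       # 3x3 homography
        warped = cv2.warpPerspective(garment, H, (w, h))
        out[mask > 0] = warped[mask > 0]                # paste this part only
    return out
```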
Key Design 2: Spectral Pose Injection (SPI)¶
DDIM inversion preserves structural information from the source person but introduces source garment texture contamination. SPI addresses this via frequency-domain analysis:
- Apply FFT and center-shift to the inverted noise \(z_T^{inv}\) and random noise \(z_T\).
- Perform frequency-domain weighted fusion using a Gaussian low-pass mask \(G_\tau\): \(\hat{f}_T = G_\tau \odot f_T^{inv} + (1 - G_\tau) \odot f_T\)
- Apply inverse FFT to obtain the blended initial noise \(\hat{z}_T\).
Core Idea: Low frequencies retain the pose and structural information from the inverted noise, while high frequencies are replaced by random noise to eliminate texture residuals and enhance generation flexibility.
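A minimal sketch of this spectral blending, following the fusion formula above; parameterizing the Gaussian bandwidth \(\tau\) as a fraction of the spatial size is an assumption:

```python
import torch

def spectral_pose_injection(z_inv, z_rand, tau=0.25):
    """Blend DDIM-inverted noise (low frequencies: pose/structure) with
    random noise (high frequencies: texture-free flexibility) in the
    Fourier domain. z_inv, z_rand: (B, C, H, W) latents."""
    B, C, H, W = z_inv.shape
    f_inv = torch.fft.fftshift(torch.fft.fft2(z_inv), dim=(-2, -1))
    f_rand = torch.fft.fftshift(torch.fft.fft2(z_rand), dim=(-2, -1))
    # Gaussian low-pass mask centered on the zero-frequency bin.
    yy = torch.arange(H).view(-1, 1) - H // 2
    xx = torch.arange(W).view(1, -1) - W // 2
    g = torch.exp(-(xx**2 + yy**2) / (2 * (tau * min(H, W)) ** 2)).to(z_inv.device)
    f_hat = g * f_inv + (1 - g) * f_rand
    return torch.fft.ifft2(torch.fft.ifftshift(f_hat, dim=(-2, -1))).real
```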
Key Design 3: Continuous Boundary Stitching (CBS)¶
Multi-region stitching in SGM introduces texture discontinuities at boundaries. CBS improves boundary continuity through bidirectional semantic context:
- From the \(I_c\) path to the \(I_p'\) path: the inpainting query \(Q_p'\) retrieves matching target garment textures from the garment path, bridging the discontinuities.
- From the \(I_p'\) path to the \(I_c\) path: attention values that are consistent between the two paths are enhanced, while dissimilar values are suppressed.
Loss & Training¶
OmniVTON is a training-free method and involves no loss function design. All components are realized through the inference pipeline of pretrained diffusion models.
Key Experimental Results¶
Main Results¶
Quantitative comparison on VITON-HD (all VTON methods use DressCode-pretrained models to evaluate cross-dataset generalization):
| Method | Year | FID_u ↓ | FID_p ↓ | SSIM_p ↑ | LPIPS_p ↓ |
|---|---|---|---|---|---|
| PBE | 2023 | 19.230 | 17.649 | 0.784 | 0.227 |
| AnyDoor | 2024 | 14.830 | 9.922 | 0.796 | 0.164 |
| GP-VTON | 2023 | 51.566 | 49.196 | 0.810 | 0.249 |
| IDM-VTON | 2024 | 23.035 | 20.460 | 0.812 | 0.147 |
| OmniVTON | — | 9.621 | 7.758 | 0.832 | 0.145 |
Quantitative comparison on DressCode (cross-garment-category adaptability):
| Method | FID_u ↓ | FID_p ↓ | SSIM_p ↑ | LPIPS_p ↓ |
|---|---|---|---|---|
| CAT-DM | 13.678 | 12.028 | 0.858 | 0.125 |
| IDM-VTON | 9.685 | 8.377 | 0.842 | 0.138 |
| OmniVTON | 6.450 | 5.335 | 0.865 | 0.119 |
Ablation Study¶
| Variant | SGM | CBS | SPI | FID_u ↓ | FID_p ↓ | SSIM_p ↑ | LPIPS_p ↓ |
|---|---|---|---|---|---|---|---|
| Base | — | — | — | 18.445 | 16.878 | 0.773 | 0.222 |
| (A) | ✓ | — | — | 13.303 | 11.475 | 0.809 | 0.177 |
| (B) | ✓ | ✓ | — | 9.799 | 7.993 | 0.824 | 0.158 |
| (C) | ✓ | — | ✓ | 13.148 | 10.767 | 0.813 | 0.180 |
| OmniVTON | ✓ | ✓ | ✓ | 9.621 | 7.758 | 0.832 | 0.145 |
Key Findings¶
- SGM alone reduces FID_u from 18.445 to 13.303, validating the effectiveness of training-free garment alignment.
- CBS further improves LPIPS by 0.019 over SGM, enhancing perceptual quality.
- Adding SPI on top of SGM+CBS lifts SSIM from 0.824 to 0.832 and LPIPS from 0.158 to 0.145, effectively suppressing source-texture contamination while maintaining structural consistency.
- FID_u on DressCode improves by 33.4% over the strongest baseline, demonstrating cross-garment-category adaptability.
- OmniVTON achieves leading performance across all four cross-scenario settings on the StreetTryOn benchmark, even surpassing StreetTryOn trained on in-domain data.
Highlights & Insights¶
- First training-free universal VTON framework: Unifies in-shop and in-the-wild scenarios, eliminating the need to train dedicated models for specific conditions.
- Elegant decoupling strategy: Garment texture preservation and pose alignment are decoupled into independent modules, avoiding the bias that arises when diffusion models simultaneously handle multiple conditions.
- Novel application of frequency-domain analysis: SPI exploits the spectral properties of the latent space—low frequencies retain pose structure while high frequencies enhance generation flexibility—resulting in a conceptually elegant design.
- Multi-person try-on: Enables try-on in multi-person scenes for the first time by concatenating multiple garments along the spatial dimension to generate pseudo-person images simultaneously.
- Strong cross-domain generalization: Evaluated on VITON-HD using a DressCode-pretrained model, OmniVTON reduces FID_u by 5.209 relative to the strongest baseline, demonstrating that the method is not tied to any specific training domain.
Limitations & Future Work¶
- Performance degrades in extreme cases, such as densely crowded scenes or scenarios where the target body region is very small, leading to garment alignment failures.
- Multi-region stitching may still introduce artifacts at boundaries; CBS does not fully eliminate all discontinuities.
- The method depends on multiple pretrained models (OpenPose, TAPPS, diffusion models), resulting in a relatively long inference chain.
- The training-free inference paradigm may incur slower inference speeds due to iterative denoising and repeated attention modulation.
Related Work & Insights¶
- Garment warping: The trend progresses from TPS (VITON) → optical flow (GP-VTON) → training-free skeleton-guided warping (this work), reflecting a shift toward reduced reliance on paired data.
- Implicit warping VTON: IDM-VTON and StableVITON model deformation implicitly via attention mechanisms but lack explicit geometric constraints.
- Exemplar-guided inpainting: PBE and AnyDoor offer generality but lack try-on-specific designs.
- Inspiration from frequency-domain methods: The frequency-domain modulation concept in SPI is potentially transferable to other image generation tasks requiring the disentanglement of structural and texture information.
Rating ⭐⭐⭐⭐¶
The paper presents a highly innovative approach as the first training-free universal VTON framework, with a well-motivated design in which each module is validated through clear ablation studies. The decoupling strategy and frequency-domain analysis are conceptually elegant. Experiments span multiple datasets and scenarios, with strong quantitative and qualitative results. Multi-person try-on represents a meaningful extension. Limitations are primarily confined to extreme scenarios, and the overall work is of high quality.