iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

Conference: ICML 2026
arXiv: 2605.21431
Code: To be confirmed
Area: Video Generation / Virtual Try-On
Keywords: Interactive Virtual Try-On, Video Generation, Diffusion Models, Multimodal Conditions

TL;DR

iTryOn defines the "Interactive Video Virtual Try-On" task for the first time: individuals in videos actively manipulate garments (zipping, lifting hemlines, stretching fabric) rather than merely displaying them passively. It resolves spatial ambiguity with 3D hand priors, strictly aligns timestamped action captions with their corresponding frames via Action-Aware RoPE (A-RoPE), and amplifies the learning signal for sparse interaction frames through an Action-Aware Constraint Loss (AC Loss). On the self-constructed VVT-Interact dataset, this raises ISR (Interaction Success Rate) from 0.397 for the ViViD baseline to 0.610, a relative gain of about 54%.

Background & Motivation

Background: Virtual try-on has evolved from static images to Video Virtual Try-On (VVT). Recent methods based on Diffusion Transformers (DiT) achieve high-fidelity spatio-temporal consistency, preserving garment textures and the natural flow of movement.

Limitations of Prior Work: Existing VVT methods only handle passive try-on scenarios (individuals standing still or walking naturally to display clothes), completely ignoring real-world interactive scenarios common in e-commerce live streaming—such as actively pulling zippers, lifting garment corners, or stretching fabric to show elasticity. These interactions carry critical consumer information but cannot be generated by current models.

Key Challenge: There are two main contradictions that are difficult to reconcile:

  • Spatial Contradiction: 2D skeletal poses lack Z-axis depth, making it impossible to distinguish "hand moving toward the chest to button up" (interaction) vs. "hand placed on the chest" (non-interaction). Information regarding hand shape and orientation is also lost.
  • Learning Contradiction: Interaction frames are extremely sparse (typically only 5-10%). Gradients from simple non-interaction frames easily overwhelm learning signals for complex actions, leading the model to ignore physical deformations.

Goal: Define and solve the Interactive VVT task, enabling the model to understand "what interaction to perform," "when to interact," and "how to make physical contact."

Key Insight: Existing VVT lacks spatial precision (no explicit hand geometry) and semantic precision (no clear action intent and temporal boundaries). 3D hand priors resolve "spatial ambiguity," timestamped action captions resolve "semantic ambiguity," and AC Loss amplifies the weight of interaction frames.

Core Idea: Multi-level Interaction Injection—injecting 3D hand geometry at the spatial level, synchronous action captions at the semantic level, and amplifying learning for sparse interaction frames at the loss level.

Method

Overall Architecture

Inputs: Source video \(V_{\text{src}}\), target garment \(G\), skeletal pose \(V_{\text{pose}}\), garment-agnostic representation \(V_{\text{agn}}\), and interaction guidance \(\mathcal{C}\).

Output: Try-on video \(\hat{V}\).

Process:

  1. Source video and condition information are encoded into latent space using a frozen Wan encoder.
  2. During diffusion denoising, the DiT backbone receives three categories of conditions in parallel: Context Blocks handle the overall body and skeleton; the Interaction Guider processes fine 3D hand contact; Semantic Guidance injects global and action captions.
  3. After denoising, the latents are decoded back to video space.

Key Designs

  1. Spatial Guidance with 3D Hand Priors:

    • Function: Provides fine-grained hand geometry (finger shape, orientation, distance from garment) to solve the depth deficiency of 2D skeletons.
    • Mechanism: Extracts 3D hand meshes \(V_{\text{hand}}\) (point clouds / mesh vertices) using HaMeR. These are projected into feature space and processed by a lightweight Interaction Guider (convolution + self-attention), then fused additively with DiT tokens.
    • Design Motivation: 2D keypoint projections lose information about "pinching vs. pressing" (different hand shapes) and "approaching from afar vs. already in contact" (motion direction); 3D priors are completely garment-agnostic, avoiding the introduction of source clothing textures.
  2. Action-Aware RoPE (A-RoPE) for Semantic Synchronization:

    • Function: Strictly aligns timestamped action captions \((\text{action}, [\text{start}, \text{end}])\) with corresponding video segments, preventing action descriptions from "leaking" into non-interactive frames.
    • Mechanism: In temporal cross-attention, a scaled 1D RoPE is applied to the query \(Q_i\) for each video segment \(i\) (applied to all segments to maintain global temporal order). However, RoPE is only applied to the key \(K_i\) of action captions corresponding to interactive segments (non-interactive segments use empty captions without position encoding). \(\hat{Q}_i = \text{1D-RoPE}(Q_i, i \cdot k)\) and \(\hat{K}_i = \text{1D-RoPE}(K_i, i \cdot k)\), where \(k = 4\). Higher attention weights are only produced for \((i, i)\) pairs with matching position encodings, forcing alignment.
    • Design Motivation: Global captions are too vague; timestamped captions precisely locate the frame range where interaction occurs. A-RoPE distinguishes "virtual temporal channels" via position encoding rotation, making each action caption visible only to its corresponding video segment.
  3. Action-Aware Constraint Loss (AC Loss):

    • Function: Reweights the diffusion loss to amplify supervision gradients for interaction frames, preventing underfitting of sparse events.
    • Mechanism: A binary mask \(\mathbb{M}_{\text{action}}\) is constructed (1 for interaction frames, 0 otherwise). Total loss: \(\mathcal{L} = \mathcal{L}_{\text{std}} + \lambda \mathbb{E}[\|\mathbb{M}_{\text{action}} \odot (\hat{v}_\theta - v)\|_2^2]\), with \(\lambda = 0.5\). The second term penalizes only interaction frames.
    • Design Motivation: Roughly 90% or more of frames are non-interaction frames, so the optimizer is easily dominated by their stable, easy gradients, which drown out the rare gradients from complex wrinkle deformations; AC Loss explicitly tells the model that the remaining 5-10% of frames are vital.
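
The AC Loss formula in key design 3 can be illustrated with a small pure-Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name `ac_loss`, the nested-list frame layout, and the choice to average both terms over all elements are assumptions made here; only the mask, the squared-error form, and \(\lambda = 0.5\) come from the text above.

```python
def ac_loss(v_pred, v_true, action_mask, lam=0.5):
    """Standard squared-error loss plus a masked term on interaction frames.

    v_pred, v_true: list of frames, each frame a list of floats (hypothetical layout).
    action_mask:    one 0/1 flag per frame (1 = interaction frame).
    lam:            weight of the masked term (0.5 in the summary above).
    Assumption: both terms are averaged over all elements of the clip.
    """
    n_elems = sum(len(frame) for frame in v_pred)
    std_sum = 0.0  # squared error over every frame
    act_sum = 0.0  # squared error over interaction frames only
    for pred_f, true_f, m in zip(v_pred, v_true, action_mask):
        frame_sq = sum((p - t) ** 2 for p, t in zip(pred_f, true_f))
        std_sum += frame_sq
        act_sum += m * frame_sq
    l_std = std_sum / n_elems
    l_action = act_sum / n_elems
    return l_std + lam * l_action


# Toy clip: 4 frames of 2 values each, every element off by exactly 1.0.
v_true = [[0.0, 0.0] for _ in range(4)]
v_pred = [[1.0, 1.0] for _ in range(4)]
mask = [0, 0, 1, 0]  # only the third frame is an interaction frame

loss_plain = ac_loss(v_pred, v_true, [0, 0, 0, 0])
loss_ac = ac_loss(v_pred, v_true, mask)
print(loss_plain, loss_ac)
```

Under this normalization, an error on an interaction frame is weighted \(1 + \lambda = 1.5\) while the same error on any other frame is weighted 1, which is the sense in which the second term "amplifies" sparse frames.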

Key Experimental Results

Main Results (VVT-Interact: 5292 videos; 5160 for training, 132 for testing)

| Method | VFID\(_I^p\) ↓ | VFID\(_R^p\) ↓ | SSIM ↑ | LPIPS ↓ | FVD\(^p\) ↓ | ISR\(^p\) ↑ |
|---|---|---|---|---|---|---|
| ViViD | 29.83 | 1.27 | 0.726 | 0.164 | 468.5 | 0.397 |
| CatV2TON | 26.99 | 2.27 | 0.776 | 0.143 | 533.2 | 0.484 |
| MagicTryOn | 27.67 | 2.60 | 0.765 | 0.170 | 431.8 | 0.435 |
| iTryOn | 22.46 | 0.60 | 0.785 | 0.122 | 380.6 | 0.610 |

iTryOn leads on every reported metric; its ISR of 0.610 is about 26% higher (relative) than the best baseline, CatV2TON at 0.484.
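
The relative ISR gains quoted in this summary can be re-derived from the table values. The snippet below is a small arithmetic check; the dictionary and the helper `relative_gain` are introduced here purely for illustration.

```python
# ISR values copied from the main results table.
isr = {"ViViD": 0.397, "CatV2TON": 0.484, "MagicTryOn": 0.435, "iTryOn": 0.610}


def relative_gain(new, old):
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old


# Strongest baseline by ISR (everything except iTryOn itself).
best_baseline = max((m for m in isr if m != "iTryOn"), key=isr.get)
gain_vs_best = relative_gain(isr["iTryOn"], isr[best_baseline])
gain_vs_vivid = relative_gain(isr["iTryOn"], isr["ViViD"])

print(f"vs. {best_baseline}: +{gain_vs_best:.1%}")  # vs. CatV2TON: +26.0%
print(f"vs. ViViD: +{gain_vs_vivid:.1%}")           # vs. ViViD: +53.7%
```

The 26% figure is therefore relative to CatV2TON, while the roughly 54% figure in the TL;DR is relative to ViViD.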

Ablation Study

| Configuration | VFID\(_I^p\) | ISR\(^p\) | Key Observation |
|---|---|---|---|
| (a) Baseline | 27.12 | 0.477 | Fails to generate interactions |
| (b) = (a) + Data | 26.65 | 0.478 | Data alone is insufficient |
| (c) = (b) + Spatial Guidance | 24.85 | 0.517 | 3D hand guidance |
| (d) = (c) + Semantic Guidance | 22.76 | 0.599 | A-RoPE action captions |
| (e) = (d) + AC Loss | 22.46 | 0.610 | Complete model |

Key Findings

  • All three modules contribute, and data alone is not enough: adding interaction data without any guidance leaves ISR essentially unchanged (0.477 to 0.478), while the complete model reaches 0.610.
  • The ISR metric uses a VLM for "semantic verification" rather than just visual quality, acknowledging the dual requirements of "physical correctness vs. semantic correctness."
  • An ISR of 0.610 means that 61% of interactions are judged to be correctly executed, versus 39.7% for ViViD in the main comparison and 47.7% for the ablation baseline.
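
To see how much each ablation step contributes, the ISR increments can be computed directly from the ablation table. This is plain arithmetic on the reported numbers; the list layout and variable names are illustrative only.

```python
# (configuration, ISR) pairs copied from the ablation table, in order.
ablation = [
    ("(a) Baseline", 0.477),
    ("(b) + Data", 0.478),
    ("(c) + Spatial Guidance", 0.517),
    ("(d) + Semantic Guidance", 0.599),
    ("(e) + AC Loss", 0.610),
]

# ISR gained by each step relative to the previous configuration.
deltas = [
    (name, round(cur - prev, 3))
    for (_, prev), (name, cur) in zip(ablation, ablation[1:])
]
largest_step = max(deltas, key=lambda item: item[1])

for name, delta in deltas:
    print(f"{name}: +{delta:.3f}")
print("largest step:", largest_step[0])
```

On these numbers, semantic guidance (A-RoPE action captions) accounts for the largest single increment, spatial guidance for the second largest, and AC Loss for a smaller final gain.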

Highlights & Insights

  • A-RoPE Synchronization Mechanism: Utilizes position encoding rotation to differentiate interactive and non-interactive segments, maintaining global temporal coherence while isolating local action descriptions; this is transferable to other tasks requiring precise alignment of temporal labels.
  • Geometry-Semantic Separation in 3D Hand Priors: Uses 3D meshes to avoid depth maps leaking source garment geometry. This separation design philosophy can be generalized to hand-object manipulation, human-tool interaction, etc.
  • General Framework for Sparse Event Learning: AC Loss treats learning from sparse, imbalanced events as a loss reweighting problem (a mask-weighted extra term on the rare frames), a general approach applicable to any task with sparse keyframes.
  • Innovation of the ISR Metric: For the first time, a VLM is used for semantic verification rather than just visual quality, serving as a model for upgrading evaluation standards in virtual try-on.
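
The A-RoPE alignment idea from the first highlight can be made concrete with a toy example. The sketch below is a simplified illustration, not the paper's implementation: it assumes the common RoPE frequency schedule with base 10000, uses plain dot products instead of a full cross-attention layer, and the names `rope_1d` and `a_rope_scores` are hypothetical. The only details taken from the summary are the positions \(i \cdot k\) with \(k = 4\) and the rule that keys of non-interactive segments receive no position encoding.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (1D RoPE)."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even feature dimension"
    out = []
    for p in range(d // 2):
        theta = pos * base ** (-2.0 * p / d)  # assumed standard frequency schedule
        x, y = vec[2 * p], vec[2 * p + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def a_rope_scores(queries, keys, is_interactive, k=4):
    """Attention logits between segment queries and per-segment caption keys.

    queries[i]: query vector of video segment i (always rotated by position i*k).
    keys[j]:    caption key of segment j (rotated only if segment j is interactive).
    """
    scores = []
    for i, q in enumerate(queries):
        q_rot = rope_1d(q, i * k)
        row = []
        for j, key in enumerate(keys):
            key_rot = rope_1d(key, j * k) if is_interactive[j] else key
            row.append(dot(q_rot, key_rot))
        scores.append(row)
    return scores


# Three segments with identical content vectors, so only position matters.
vec = [1.0, 0.0, 1.0, 0.0]
scores = a_rope_scores([vec] * 3, [vec] * 3, [True, True, True])
for row in scores:
    print([round(s, 3) for s in row])
```

Because the rotated dot product depends only on the position offset \((i - j) \cdot k\), each query scores highest against the caption key of its own segment in this toy setup, which is the matching-position effect the mechanism relies on.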

Limitations & Future Work

  • The model lacks semantic understanding of garments (e.g., "this garment has a zipper"), so it sometimes generates infeasible "pantomime" interactions.
  • ISR can evaluate interaction semantic success rates but struggles to quantify fine-grained physical accuracy (wrinkle physics, deformation angles).
  • Limited interaction categories in the dataset (only 6 types); generalization to unseen interaction types remains unknown.
  • 3D hand priors depend on extraction via HaMeR, which is sensitive to occlusion; direct learning of implicit hand representations from video could be considered.

Comparison with Related Work

  • vs. ViViD / CatV2TON / MagicTryOn: These methods optimize for non-interactive VVT (spatio-temporal consistency, garment detail) but fail to capture interaction intent; iTryOn bridges the fundamental gap in interaction understanding through multimodal condition fusion and sparse supervision reweighting.
  • vs. Video Editing (e.g., ControlNet): Editing methods use coarse-grained spatial conditions (bounding boxes, skeletons) to manipulate content but lack temporal information and physical constraints; iTryOn’s A-RoPE + AC Loss can inspire future editing frameworks.
  • vs. Human-Object Interaction (HOI) Recognition: Current HOI work focuses mostly on recognition. This paper reframes the problem as a generative problem for the first time, opening new directions for interaction synthesis.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First to define the Interactive VVT task; A-RoPE and AC Loss are targeted innovations.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Large-scale VVT-Interact dataset + 3 baseline comparisons + complete ablation + parallel quantitative/qualitative analysis + new ISR metric.
  • Writing Quality: ⭐⭐⭐⭐ Clear logic; Figure 3 comparisons are persuasive. ISR evaluation depends on VLM, so its credibility requires further validation.
  • Value: ⭐⭐⭐⭐⭐ Broad prospects in e-commerce live streaming and content creation; open-source data + benchmarks + technical components (A-RoPE, AC Loss) are highly transferable.