
Mask2IV: Interaction-Centric Video Generation via Mask Trajectories

Conference: AAAI 2026
arXiv: 2510.03135
Code: Project Page
Area: Video Understanding / Video Generation
Keywords: Interaction video generation, mask trajectories, human-object interaction, robot manipulation, two-stage diffusion

TL;DR

This paper proposes Mask2IV, a two-stage decoupled framework that first predicts mask motion trajectories of the interactor and object, then generates video conditioned on these trajectories. The approach enables controllable, interaction-centric video generation without dense mask annotations, supporting both human-object interaction and robot manipulation scenarios.

Background & Motivation

State of the Field

Diffusion models have achieved remarkable progress in video generation, producing high-quality videos from text or image prompts. In the context of embodied AI, generating realistic human-object or robot-object interaction video sequences holds significant value, providing visual priors for downstream tasks such as imitation learning and affordance learning.

Limitations of Prior Work

Imprecise text-conditioned control: Existing text-conditioned methods (e.g., EgoVid, LEGO) lack fine-grained control over interaction details — they cannot specify which object to interact with or where the hand should be positioned.

Mask-conditioned methods suffer from two critical drawbacks:
  • Poor practicality: Methods such as InterDyn require users to provide dense, per-frame hand mask sequences as control signals. Obtaining these masks requires recording or synthesizing the very interaction video one wishes to control — creating a chicken-and-egg paradox.
  • Hand-only focus: Relying solely on hand masks limits the scope of interaction modeling, making it impossible to precisely specify target objects or capture fine-grained hand-object contact information.

Lack of a unified framework: Human-object interaction and robot manipulation are typically studied as separate problems, with no unified solution.

Root Cause

Masks are effective interaction control signals — geometrically explicit and motion-trackable — yet the cost of obtaining dense mask annotations is as high as that of generating the video itself. The core technical challenge lies in leveraging the advantages of masks while eliminating their annotation dependency.

Starting Point

Decoupling trajectory prediction from video generation: The first stage automatically predicts interaction trajectories (mask sequences), and the second stage generates video conditioned on these predicted trajectories. Users need only provide an initial image, a target object mask, and text or position conditions — no dense annotations are required.

Method

Overall Architecture

Mask2IV decomposes interaction video generation into two stages:

Stage 1: Interaction Trajectory Generation
  • Input: initial frame \(I\), object mask \(M\), conditioning signal (text \(T\) or target position mask \(P\))
  • Output: mask trajectory sequence \(S \in \mathbb{R}^{N \times H \times W \times 3}\)

Stage 2: Trajectory-conditioned Video Generation
  • Input: initial frame \(I\), predicted mask trajectory \(S\)
  • Output: interaction video \(V \in \mathbb{R}^{N \times H \times W \times 3}\)
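To make the data flow concrete, below is a minimal inference sketch of the two-stage pipeline. The function and argument names (`trajectory_model`, `video_model`) are hypothetical stand-ins for the two fine-tuned diffusion models, not the official API.

```python
def mask2iv_pipeline(image, object_mask, condition, trajectory_model, video_model):
    """Two-stage inference sketch (hypothetical interface, not the released code).

    image:       initial frame, shape (3, H, W)
    object_mask: binary mask of the target object, shape (1, H, W)
    condition:   a text prompt (str) or a target position mask (1, H, W)
    """
    # Stage 1: predict the joint interactor/object mask trajectory,
    # an RGB-encoded sequence of N frames.
    trajectory = trajectory_model(image=image, object_mask=object_mask,
                                  condition=condition)        # (N, 3, H, W)

    # Stage 2: synthesize the interaction video conditioned on the
    # initial frame and the predicted (or user-edited) trajectory.
    video = video_model(image=image, trajectory=trajectory)   # (N, 3, H, W)
    return trajectory, video
```

Because the trajectory is an explicit intermediate result, it can be inspected or edited before running Stage 2, which is where the framework's controllability comes from.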

Key Designs

1. Interaction Trajectory Generation (Stage 1)

Function: Predict the joint motion trajectories of the interactor (hand/robotic arm) and the object.

Mechanism:
  • The initial frame \(I\) and object mask \(M\) are encoded into latent-space features via a VAE encoder.
  • The object mask is first color-encoded into RGB format, since the VAE expects three-channel input (see the sketch below).
  • If the initial frame already contains an interactor (hand/robotic arm), GroundedSAM is used to segment it and assign it a distinct color, enabling the model to distinguish between roles.
  • The encoded latent features are concatenated with the noisy latent variable and fed into a video diffusion model.
  • Temporal attention layers are frozen to preserve motion priors, while the remaining parameters are fine-tuned.
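A minimal sketch of the mask color-encoding step, assuming a simple fixed palette; the exact colors used by the authors are not specified here and the function name is hypothetical.

```python
import numpy as np

# Hypothetical palette: the paper assigns distinct colors per role,
# but the exact values are an assumption here.
OBJECT_COLOR = np.array([0, 255, 0], dtype=np.uint8)       # object   -> green
INTERACTOR_COLOR = np.array([255, 0, 0], dtype=np.uint8)   # hand/arm -> red

def color_encode_masks(object_mask, interactor_mask=None):
    """Turn binary masks into a 3-channel RGB image so they can be fed
    through the three-channel VAE encoder.

    object_mask / interactor_mask: boolean arrays of shape (H, W).
    Returns an (H, W, 3) uint8 image with one color per role.
    """
    h, w = object_mask.shape
    canvas = np.zeros((h, w, 3), dtype=np.uint8)
    canvas[object_mask] = OBJECT_COLOR
    if interactor_mask is not None:   # e.g. segmented with GroundedSAM
        canvas[interactor_mask] = INTERACTOR_COLOR
    return canvas
```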

Two conditioning variants:

(a) Text-conditioned Trajectory Generation (TT-Gen):
  • Text prompts are encoded via CLIP and injected into the model through cross-attention.
  • This variant can distinguish subtle interaction intents, e.g., "pick up" vs. "put down," "push" vs. "pull."

(b) Position-conditioned Trajectory Generation (PT-Gen):
  • The target position mask \(P\) is encoded and inserted into the slot of the last frame.
  • The initial object mask latent is assigned to the first frame.
  • Intermediate frames are filled with zeros, and the model interpolates them to produce a coherent trajectory (see the sketch below).
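A minimal sketch of how the PT-Gen conditioning sequence could be assembled, assuming per-frame latents of shape (C, h, w); the function name and tensor layout are assumptions, not the released implementation.

```python
import torch

def build_ptgen_condition(first_mask_latent, target_mask_latent, num_frames=16):
    """Assemble the PT-Gen conditioning sequence (sketch).

    first_mask_latent:  (C, h, w) VAE latent of the initial object mask
    target_mask_latent: (C, h, w) VAE latent of the target position mask P
    Returns a (num_frames, C, h, w) tensor whose first and last slots are
    filled and whose intermediate slots are zeros, leaving the model to
    interpolate a coherent trajectory in between.
    """
    c, h, w = first_mask_latent.shape
    cond = torch.zeros(num_frames, c, h, w)
    cond[0] = first_mask_latent      # initial object mask -> first frame slot
    cond[-1] = target_mask_latent    # target position mask -> last frame slot
    return cond
```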

Design Motivation: Predicting mask motion first simplifies the problem: the model can focus solely on motion dynamics without handling appearance details, substantially reducing the difficulty compared with directly generating complex interaction videos.

2. Trajectory-conditioned Video Generation (Stage 2)

Function: Synthesize the final video conditioned on predicted mask trajectories.

Mechanism:
  • The trajectory \(S\) is encoded into a feature tensor \(f_s\) via the VAE.
  • \(f_s\) is concatenated with the noisy latent variable and the first-frame features before being fed into the diffusion model.
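A sketch of that channel-wise concatenation, assuming all three inputs share the same latent spatial resolution; the extra input channels would be absorbed by the additional convolutional channels mentioned in the training details below. Names and shapes are assumptions.

```python
import torch

def build_stage2_input(noisy_latent, first_frame_latent, traj_latent):
    """Channel-wise conditioning for Stage 2 (sketch).

    noisy_latent:       (N, C, h, w) noisy video latents z_t
    first_frame_latent: (C, h, w)    VAE latent of the initial frame I
    traj_latent:        (N, C, h, w) VAE latents f_s of the mask trajectory S
    """
    n = noisy_latent.shape[0]
    first = first_frame_latent.unsqueeze(0).expand(n, -1, -1, -1)
    # The diffusion model's input convolution must accept 3*C channels.
    return torch.cat([noisy_latent, first, traj_latent], dim=1)   # (N, 3C, h, w)
```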

Two targeted designs:

(a) Random perturbation for robustness:
  • During training, masks are randomly dilated or eroded with probability \(p = 0.2\), with the kernel size sampled from {3, 5, 7}.
  • This prevents the model from over-relying on the precise shape of the masks, improving generalization (a sketch follows below).
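A minimal sketch of the described augmentation using OpenCV morphology; the 50/50 choice between dilation and erosion is an assumption.

```python
import random
import numpy as np
import cv2

def perturb_mask(mask, p=0.2, kernel_sizes=(3, 5, 7)):
    """Randomly dilate or erode a binary mask during training (sketch).

    mask: (H, W) binary array. With probability p the mask is dilated or
    eroded with a square kernel whose size is drawn from kernel_sizes.
    """
    if random.random() >= p:
        return mask
    k = random.choice(kernel_sizes)
    kernel = np.ones((k, k), dtype=np.uint8)
    m = mask.astype(np.uint8)
    if random.random() < 0.5:                 # dilation vs. erosion: assumed 50/50
        return cv2.dilate(m, kernel, iterations=1)
    return cv2.erode(m, kernel, iterations=1)
```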

(b) Contact region-weighted loss:
  • A contact map is defined as \(m_c = (\delta(m_h) \cap m_o) \cup (m_h \cap \delta(m_o))\), where \(\delta(\cdot)\) denotes dilation.
  • The contact map is used to re-weight the diffusion objective:

\[w = (1 - m_c) + \lambda \cdot m_c\]
\[\mathcal{L} = \mathbb{E}_{z,S,\epsilon,t}[\|w \odot (\epsilon - \epsilon_\theta(z, f_\psi(S), t))\|_2^2]\]

where \(\lambda=5\), assigning five times the loss weight to contact regions relative to non-contact regions.
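A minimal sketch of the contact map and the re-weighted loss, implementing dilation as max-pooling, treating intersection/union as clamped products/sums, and applying \(w\) so that contact pixels receive \(\lambda\) times the loss weight as described above. In practice the contact map would also need to be resized to the latent resolution; all names here are assumptions.

```python
import torch
import torch.nn.functional as F

def contact_map(hand_mask, object_mask, kernel=5):
    """m_c = (δ(m_h) ∩ m_o) ∪ (m_h ∩ δ(m_o)), with δ(·) as max-pool dilation.

    hand_mask, object_mask: (B, 1, H, W) float tensors in {0, 1}.
    """
    dilate = lambda m: F.max_pool2d(m, kernel_size=kernel, stride=1,
                                    padding=kernel // 2)
    m_c = dilate(hand_mask) * object_mask + hand_mask * dilate(object_mask)
    return m_c.clamp(max=1.0)

def contact_weighted_loss(eps, eps_pred, m_c, lam=5.0):
    """Diffusion loss in which contact pixels are weighted lam times more
    heavily than non-contact pixels."""
    w = (1.0 - m_c) + lam * m_c
    return (w * (eps - eps_pred) ** 2).mean()
```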

Design Motivation:
  • Random perturbation addresses the distribution gap between training (ground-truth masks) and inference (predicted masks).
  • The contact-weighted loss tackles the synthesis difficulty at hand-object boundary regions — precisely the most critical area for interaction modeling.

3. Benchmark Construction

Function: Construct training/evaluation benchmarks with per-frame segmentation annotations.

  • HOI4D (human-object interaction): Video clips are trimmed using timestamps; low-dynamic videos are filtered based on a motion score derived from hand and object displacement (bottom 5% removed); text annotations follow the template "a hand {verbing} an {object}."
  • BridgeData V2 (robot manipulation): GroundingDINO is used for object detection and SAM2 for video segmentation to extract robotic arm and object masks; manipulated objects are identified by low inter-frame mIoU over time (indicating changing position and shape), requiring no additional annotation.
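A sketch of the mIoU heuristic used above to pick the manipulated object from the SAM2 tracks on BridgeData V2; the threshold value and function names are illustrative assumptions.

```python
import numpy as np

def mean_iou_over_time(mask_sequence):
    """Average IoU of one tracked object's mask between consecutive frames.

    mask_sequence: list of (H, W) boolean masks for a single object track.
    A manipulated object keeps changing position/shape, so its temporal
    mIoU is low; static distractors have mIoU close to 1.
    """
    ious = []
    for prev, curr in zip(mask_sequence[:-1], mask_sequence[1:]):
        inter = np.logical_and(prev, curr).sum()
        union = np.logical_or(prev, curr).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious)) if ious else 1.0

def pick_manipulated_object(tracks, threshold=0.8):
    """Return the track id with the lowest temporal mIoU.
    tracks: {object_id: list of per-frame masks}. The threshold is a
    hypothetical cut-off, not a value taken from the paper."""
    scores = {obj_id: mean_iou_over_time(seq) for obj_id, seq in tracks.items()}
    obj_id, score = min(scores.items(), key=lambda kv: kv[1])
    return obj_id if score < threshold else None
```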

Loss & Training

  • Built upon DynamiCrafter with additional convolutional channels to accommodate mask latents.
  • 16 frames at 320×512 resolution.
  • AdamW optimizer, learning rate 1e-5, batch size 8.
  • DDIM sampler with 50 denoising steps at inference.
  • Contact weight \(\lambda = 5\).

Key Experimental Results

Main Results

| Method | Conference | FVD↓ | LPIPS↓ | PSNR↑ | SSIM↑ | V2V-Sim↑ | T2V-Sim↑ |
|---|---|---|---|---|---|---|---|
| DynamiCrafter | ECCV24 | 554 / 861 | 0.516 / 0.375 | 13.48 / 14.21 | 0.553 / 0.571 | 0.473 / 0.867 | 0.146 / 0.215 |
| DynamiCrafter-ft | ECCV24 | 169 / 198 | 0.206 / 0.166 | 20.49 / 19.80 | 0.721 / 0.775 | 0.814 / 0.957 | 0.199 / 0.223 |
| CosHand | ECCV24 | 163 / 175 | 0.209 / 0.123 | 20.67 / 21.81 | 0.725 / 0.809 | 0.837 / 0.969 | 0.191 / 0.220 |
| InterDyn | CVPR25 | 172 / 208 | 0.207 / 0.145 | 20.71 / 21.16 | 0.730 / 0.802 | 0.794 / 0.955 | 0.172 / 0.219 |
| Mask2IV | Ours | 150 / 156 | 0.178 / 0.111 | 21.48 / 22.30 | 0.741 / 0.815 | 0.847 / 0.971 | 0.200 / 0.220 |

(Each cell reports HOI4D / BridgeData V2.)
  • Relative to the strongest baseline (CosHand), FVD is reduced by 8.0% on HOI4D (150 vs. 163) and by 10.9% on BridgeData V2 (156 vs. 175).
  • Mask2IV outperforms all baselines across all metrics.

Ablation Study

| Configuration | FVD↓ | LPIPS↓ | PSNR↑ | SSIM↑ | Note |
|---|---|---|---|---|---|
| ControlNet | 157.38 | 0.182 | 21.49 | 0.747 | Auxiliary network approach |
| MaskLatent | 130.07 | 0.157 | 22.33 | 0.760 | Direct latent concatenation, superior |
| +object mask | 115.14 | 0.132 | 23.85 | 0.802 | Adding object trajectory, large gain |
| +random d/e | 108.80 | 0.124 | 24.16 | 0.802 | Random dilation/erosion augmentation |
| +contact loss | 104.61 | 0.126 | 24.37 | 0.804 | Contact-weighted loss |

(Ablation conducted on HOI4D using ground-truth mask trajectories.)

Key Findings

  1. Direct mask latent concatenation outperforms ControlNet: Training is more stable with faster early convergence.
  2. Adding object trajectories yields the largest gain: FVD drops from 130 to 115 (−11.5%), demonstrating that modeling hand motion alone is insufficient — joint hand-object motion modeling is necessary.
  3. Random perturbation genuinely improves robustness: FVD drops from 115 to 109.
  4. Contact-weighted loss further improves quality: FVD drops from 109 to 105.
  5. Flexible object specification: Different interactions with different objects can be generated within the same scene by modifying the mask.
  6. Text and position conditions are complementary: Text is better suited for describing action types, while position conditions enable precise spatial control.

Highlights & Insights

  1. Dual advantage of decoupled design: Reduces generation difficulty while providing more flexible control (users can modify predicted trajectories).
  2. Innovation of contact region-weighted loss: Precisely focuses on the most critical region for interaction (hand-object contact boundary), using the geometric information of masks to define the weighting map.
  3. Unified framework covering both humans and robots: The same method handles two interaction scenarios, differing only in condition type (text vs. position).
  4. Clever object identification strategy: In BridgeData V2, manipulated objects are identified by low inter-frame mIoU over time, requiring no additional annotation.
  5. Color-encoding design detail: Masks are converted into differently colored RGB images for VAE input, allowing the model to distinguish between interactor and object roles.

Limitations & Future Work

  1. Resolution limitation: 320×512 is relatively low; computational cost would increase substantially at higher resolutions.
  2. Only 16 frames: Adequate for short interactions, but insufficient to cover long-horizon manipulation (e.g., multi-step assembly).
  3. Error accumulation in two-stage inference: The quality of trajectories predicted in Stage 1 directly affects final video quality.
  4. No support for multi-step interactions: Sequential actions such as "pick up then put down" are not modeled.
  5. Absence of human evaluation: The method relies entirely on automatic metrics, with no perceptual user study on video quality.
  6. Dataset scale: HOI4D covers a limited range of action categories (primarily grasping); more diverse interaction types remain to be explored.
Takeaways

  • Masks as intermediate representations: More semantically meaningful than optical flow and more precise than bounding boxes, making them an ideal control signal for interaction generation.
  • Decoupling trajectory prediction from video generation: This paradigm is transferable to other control signals (e.g., skeletons, keypoints).
  • Contact map definition: The dilation-intersection approach offers a simple yet effective definition of contact regions applicable to other hand-object interaction tasks.
  • Integration with robotics: Generated interaction videos can be directly used to train visuomotor policies, offering an alternative data acquisition pathway beyond sim-to-real transfer.

Rating

  • Novelty: ⭐⭐⭐⭐ — The two-stage decoupled design and contact-weighted loss are novel contributions, though the overall framework builds on existing diffusion models.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Two datasets and complete ablation studies, but human evaluation is absent.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Motivation and methodology are articulated with exceptional clarity; figures and tables are well-designed.
  • Value: ⭐⭐⭐⭐ — Offers practical value for data generation in embodied AI.