MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling

Conference: ICLR 2026
arXiv: 2508.17404
Code: None (to be released)
Area: Video Generation
Keywords: Human video generation, structure-appearance decoupling, 3D motion generation, DiT, dense tracking loss

TL;DR

MoSA decomposes human video generation into a structure generation stage (a 3D Transformer generates physically plausible motion skeletons) and an appearance generation stage (a DiT synthesizes video conditioned on the skeletons). A Human-Aware Dynamic Control (HADC) module propagates sparse skeleton signals across the entire motion region. Combined with a dense tracking loss and contact constraints, MoSA outperforms SOTA models such as HunyuanVideo and Wan 2.1 on FVD, CLIPSIM, and every reported VBench metric; among all baselines, only LaVie scores higher on Dynamic Degree.

Background & Motivation

Background: Current general-purpose video generation models (HunyuanVideo, CogVideoX, Wan 2.1, etc.) achieve high visual quality on natural scenes but frequently produce structural artifacts—limb distortion and unnatural motion—when generating human videos. Methods specialized for human video (e.g., the AnimateAnyone series) are mostly limited to faces/upper bodies or require additional pose-driving inputs, making them ill-suited for complex full-body motion.

Limitations of Prior Work: First, reconstruction objectives based on pure noise denoising inherently favor appearance fidelity over structural consistency—models tend to "look good" while producing physically implausible motion. Second, some methods attempt to generate 2D skeleton sequences directly as guidance, but 2D representations lack depth information under occlusion, leading to structural errors such as interpenetrating legs. Third, skeletons are sparse keypoint representations; even when correctly generated, they provide limited control over subsequent pixel-level appearance generation.

Key Challenge: Human appearance and motion carry fundamentally different signals—appearance requires pixel-level texture detail, while motion must satisfy physical and anatomical constraints. Existing methods couple both within the same generation process, forcing an inherent trade-off.

Goal: (1) How to generate physically plausible complex human motion? (2) How to make sparse skeleton signals effectively guide dense pixel generation? (3) How to model human–environment contact interactions?

Key Insight: The authors observe that human motion has strong priors in 3D space (learnable from large-scale MoCap datasets), while appearance is well-suited to pretrained DiT-based generation. The problem is therefore decomposed into two stages: first leveraging 3D priors to generate structurally sound motion sequences, then generating appearance conditioned on the skeletons. Motion plausibility is the responsibility of the 3D Transformer; visual quality is the responsibility of the DiT.

Core Idea: Generate physically plausible skeleton sequences in 3D space using motion priors, then propagate the sparse skeleton guidance to the full motion region via the HADC module to guide the DiT in generating high-fidelity appearance.

Method

Overall Architecture

MoSA decouples human video generation into two branches. The structure generation branch receives motion semantics from the text prompt and generates a 3D human keypoint sequence via a pretrained 3D Structure Transformer, which is then projected into a 2D skeleton sequence. The appearance generation branch conditions on the full text prompt and skeleton structural features, performing iterative denoising via a DiT backbone. Structural information is passed between the two branches through the HADC module. During training, GT skeletons are used as conditions; during inference, skeletons are automatically generated by the structure branch.
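
The training/inference switch described above can be sketched as a small piece of control flow. This is only an illustration of where the skeleton condition comes from in each mode: the function name `skeleton_condition`, the type aliases, and the three callables are hypothetical placeholders, not the authors' API.

```python
from typing import Callable, List, Optional

Skeleton2D = List[List[float]]  # per-frame flattened 2D keypoints (illustrative type)
Motion3D = List[List[float]]    # per-frame flattened 3D keypoints (illustrative type)


def skeleton_condition(
    prompt: str,
    training: bool,
    gt_skeleton: Optional[Skeleton2D],
    extract_motion_prompt: Callable[[str], str],
    generate_3d_motion: Callable[[str], Motion3D],
    project_to_2d: Callable[[Motion3D], Skeleton2D],
) -> Skeleton2D:
    """Pick the skeleton sequence that conditions the appearance branch."""
    if training:
        # Training: skeletons extracted from the GT video are used directly.
        if gt_skeleton is None:
            raise ValueError("training requires a GT skeleton sequence")
        return gt_skeleton
    # Inference: the structure branch produces the skeleton from the prompt.
    motion_prompt = extract_motion_prompt(prompt)
    return project_to_2d(generate_3d_motion(motion_prompt))


# Stub components, only to exercise the control flow.
demo = skeleton_condition(
    prompt="a skater spins on a frozen lake at sunset",
    training=False,
    gt_skeleton=None,
    extract_motion_prompt=lambda p: "a skater spins",
    generate_3d_motion=lambda p: [[0.0, 1.0, 2.0]],
    project_to_2d=lambda m: [[f[0] / f[2], f[1] / f[2]] for f in m],
)
print(demo)
```

Passing the components as callables keeps the sketch independent of any particular LLM, motion generator, or camera model.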

Key Designs

3D Structure Transformer (Structure Generation Branch):
- Function: Generates physically plausible 3D human motion keypoint sequences from the text prompt, then projects them to 2D skeletons.
- Mechanism: An LLM first extracts the motion-relevant part \(p'\) of the full prompt (filtering out background descriptions and other irrelevant content). The 3D Structure Transformer \(\mathcal{G}_s^m\) then generates 3D keypoint sequences from Gaussian noise \(z_T^s\) conditioned on \(p'\), which are rendered into 2D skeletons \(g_s\) via a Projection operation. The Transformer adopts an autoregressive architecture pretrained on million-scale MoCap datasets.
- Design Motivation: Compared to directly generating 2D skeletons, 3D generation offers two key advantages: (a) anatomical plausibility is ensured by 3D human body priors, and (b) depth information maintains structural correctness under limb occlusion. Experiments confirm that direct 2D skeleton generation causes interpenetrating legs in occluded regions.
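
The Projection operation is not specified in these notes. A minimal sketch, assuming a standard pinhole camera with hypothetical intrinsics (`focal`, `cx`, `cy`) and keypoints already expressed in camera coordinates, shows why retaining depth helps: two joints that land on the same pixel can still be ordered by their `z` value.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project_keypoints(
    keypoints: List[Point3D],
    focal: float = 500.0,  # hypothetical focal length in pixels
    cx: float = 256.0,     # hypothetical principal point (x)
    cy: float = 256.0,     # hypothetical principal point (y)
) -> List[Point2D]:
    """Pinhole projection of camera-space 3D keypoints to pixel coordinates."""
    projected = []
    for x, y, z in keypoints:
        if z <= 0:
            raise ValueError("keypoint must lie in front of the camera (z > 0)")
        projected.append((focal * x / z + cx, focal * y / z + cy))
    return projected


def draw_order(keypoints: List[Point3D]) -> List[int]:
    """Indices sorted far-to-near, so nearer joints are drawn last (on top)."""
    return sorted(range(len(keypoints)), key=lambda i: keypoints[i][2], reverse=True)


# Two knees that project to nearly the same pixel but sit at different depths.
knees = [(0.1, 0.5, 2.0), (0.15, 0.75, 3.0)]
print(project_keypoints(knees))
print(draw_order(knees))
```

A purely 2D generator only ever sees the projected coordinates, so it has no signal to decide which leg is in front; the 3D representation keeps that information until rendering.
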
Human-Aware Dynamic Control (HADC) Module:
- Function: Expands sparse skeleton "point guidance" into "dense guidance" covering the entire motion region, addressing the limited control capacity of sparse skeletons.
- Mechanism: HADC modules are inserted between adjacent DiT blocks in the appearance branch. The \(k\)-th module receives skeleton features \(s^k\) and video latent \(a_i^k\), and uses a learnable weight predictor \(\mathcal{P}^k\) to generate a spatially varying dynamic weight map \(w^k = \mathcal{P}^k(s^k, a_i^k)\), producing \(a_o^k = a_i^k \oplus (w^k \odot s^k)\). To ensure the weight map covers the human body region, a learnable network \(\mathcal{U}^k\) converts \(w^k\) into a mask latent and applies an L2 constraint \(\mathcal{L}_m\) against the GT mask.
- Design Motivation: Skeletons consist of only \(K\) keypoints and are too sparse for effective direct injection into the DiT. HADC learns spatial weights to diffuse the skeleton signal across the entire body region, effectively upgrading from "skeleton guidance" to "human-region guidance."
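
The fusion rule \(a_o^k = a_i^k \oplus (w^k \odot s^k)\) and the mask constraint can be illustrated on a handful of spatial positions. The sketch below makes several assumptions that are not stated in the source: \(\oplus\) is treated as element-wise addition, the weight predictor \(\mathcal{P}^k\) is replaced by a toy linear layer with a sigmoid, and \(\mathcal{U}^k\) is taken to be the identity so the weight map is compared to the mask directly.

```python
import math
from typing import List, Tuple

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict_weight(s: Vector, a: Vector, theta_s: Vector, theta_a: Vector, bias: float) -> float:
    """Toy stand-in for the weight predictor: one scalar weight in (0, 1) per position."""
    logit = bias
    logit += sum(t * v for t, v in zip(theta_s, s))
    logit += sum(t * v for t, v in zip(theta_a, a))
    return sigmoid(logit)


def hadc_fuse(
    skeleton: List[Vector],
    latent: List[Vector],
    theta_s: Vector,
    theta_a: Vector,
    bias: float,
) -> Tuple[List[Vector], List[float]]:
    """Return fused latents a_o = a_i + w * s and the per-position weight map w."""
    weights = [predict_weight(s, a, theta_s, theta_a, bias) for s, a in zip(skeleton, latent)]
    fused = [
        [a_c + w * s_c for a_c, s_c in zip(a, s)]
        for s, a, w in zip(skeleton, latent, weights)
    ]
    return fused, weights


def mask_loss(weights: List[float], gt_mask: List[float]) -> float:
    """Mean squared error between the weight map and a binary body mask."""
    return sum((w - m) ** 2 for w, m in zip(weights, gt_mask)) / len(weights)


# Three positions, two channels; zero parameters give a uniform weight of 0.5.
skeleton = [[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
latent = [[0.5, 0.5], [0.1, 0.2], [0.3, 0.4]]
fused, weights = hadc_fuse(skeleton, latent, theta_s=[0.0, 0.0], theta_a=[0.0, 0.0], bias=0.0)
print(weights, mask_loss(weights, [1.0, 0.0, 1.0]))
```

Because the weight depends on both the skeleton feature and the latent, a trained predictor can assign non-zero weight to positions near a joint even where the raw skeleton map is empty; the mask loss pushes that weight map toward the body silhouette.
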
Dense Tracking Loss and Contact Constraints:
- Function: Enhances temporal motion consistency and models human–environment interaction.
- Mechanism: The dense tracking loss \(\mathcal{L}_{track}\) uses CoTracker3 to extract 2D trajectory points from generated and GT videos, computing a weighted L1 distance. The weight \(e^{|t_v - t_v'|/2}\) assigns higher weight to frame pairs with larger temporal intervals, encouraging the model to learn long-range motion dependencies. The contact constraint \(\mathcal{L}_{cont}\) models human–ground and human–object contact relationships in 3D space.
- Design Motivation: Pure noise reconstruction objectives favor appearance over motion; the tracking loss introduces explicit supervision for motion consistency. The contact constraint addresses physically implausible artifacts such as feet sinking into the ground or floating.
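
A minimal sketch of the temporally weighted tracking loss, under stated assumptions: trajectories are given as plain lists of `(x, y)` positions per frame (the CoTracker3 extraction step is not shown), the quantity compared for each frame pair is the tracked point's displacement in the generated versus the GT video, and the result is normalized by the sum of weights. The pairing and the normalization are illustrative choices, not confirmed details of the paper.

```python
import math
from typing import List, Tuple

Track = List[Tuple[float, float]]  # one point's (x, y) position per frame


def temporal_weight(t: int, t_prime: int) -> float:
    """Weight e^{|t - t'| / 2}: grows with the temporal gap between two frames."""
    return math.exp(abs(t - t_prime) / 2.0)


def dense_tracking_loss(gen_tracks: List[Track], gt_tracks: List[Track]) -> float:
    """Weighted L1 gap between generated and GT point displacements over all frame pairs."""
    total, weight_sum = 0.0, 0.0
    for gen, gt in zip(gen_tracks, gt_tracks):
        n = len(gt)
        for t in range(n):
            for t_prime in range(t + 1, n):
                w = temporal_weight(t, t_prime)
                # Displacement of the tracked point between the two frames.
                gen_dx = gen[t_prime][0] - gen[t][0]
                gen_dy = gen[t_prime][1] - gen[t][1]
                gt_dx = gt[t_prime][0] - gt[t][0]
                gt_dy = gt[t_prime][1] - gt[t][1]
                total += w * (abs(gen_dx - gt_dx) + abs(gen_dy - gt_dy))
                weight_sum += w
    return total / weight_sum if weight_sum else 0.0


gt = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]]
gen = [[(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]]  # drifts by one pixel in the last frame
print(dense_tracking_loss(gen, gt))
print([round(temporal_weight(0, gap), 2) for gap in (1, 2, 4)])
```

The second print shows how quickly the weight grows with the frame gap, which is what shifts the gradient toward long-range consistency.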

Loss & Training

The total loss is \(\mathcal{L} = \mathcal{L}_d + \lambda_m \mathcal{L}_m + \lambda_{track} \mathcal{L}_{track} + \lambda_{cont} \mathcal{L}_{cont}\). During training, the pretrained 3D Structure Transformer \(\mathcal{G}_s^m\) is frozen, and skeleton sequences extracted from GT videos serve as structural conditions. The appearance generation branch uses CogVideoX-5B as its backbone. The authors also construct the MoVid dataset (30K human motion videos) covering diverse complex full-body actions including walking, running, jumping, and skating—substantially exceeding the motion diversity of existing human video datasets.

Key Experimental Results

Main Results

Quantitative comparison with general-purpose video generation models on 300+ text prompts:

| Method | FVD↓ | CLIPSIM↑ | Subject Consistency↑ | Background Consistency↑ | Motion Smoothness↑ | Dynamic Degree↑ | Quality↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ModelScope | 1945 | 0.2739 | 90.87% | 93.41% | 96.22% | 48.57% | 60.12% |
| VideoCrafter2 | 1959 | 0.2801 | 93.43% | 97.01% | 97.31% | 35.71% | 60.32% |
| LaVie | 1778 | 0.2895 | 93.80% | 95.51% | 97.21% | 53.73% | 62.57% |
| Mochi 1 | 1207 | 0.2903 | 94.67% | 95.32% | 97.75% | 51.14% | 54.65% |
| CogVideoX | 1360 | 0.2899 | 93.75% | 94.02% | 97.78% | 51.42% | 62.98% |
| HunyuanVideo | 1235 | 0.2948 | 94.41% | 95.17% | 98.95% | 50.42% | 58.13% |
| Wan 2.1 | 1251 | 0.2951 | 94.43% | 95.55% | 98.36% | 51.71% | 65.21% |
| MoSA | 1093 | 0.3035 | 96.83% | 97.43% | 99.25% | 52.86% | 65.43% |

Ablation Study

Contribution of each module (FVD↓ / CLIPSIM↑):

| Configuration | FVD↓ | CLIPSIM↑ | Note |
| --- | --- | --- | --- |
| Full MoSA | 1093 | 0.3035 | All components |
| w/o structure branch | 1262 | 0.2971 | Direct fine-tuning of base model; FVD +169 |
| 2D skeleton generation instead of 3D | 1230 | 0.2998 | Structural collapse under occlusion |
| w/o HADC module | 1188 | 0.2973 | Insufficient control from sparse skeletons |
| HADC w/o mask loss | 1112 | 0.3009 | Weight map lacks body-region constraint |
| w/o dense tracking loss | 1172 | 0.3009 | Degraded motion consistency |
| Static weight instead of temporal weighting | 1114 | 0.3016 | Insufficient long-range dependency learning |
| w/o contact constraint | 1108 | 0.3021 | Unnatural human–environment interaction |
| HumanVid dataset | 1217 | 0.2949 | Insufficient motion diversity |
| w/o additional human data | 1360 | 0.2899 | Degrades to base model |

Transferring the MoSA framework to Wan 2.1 also yields significant gains: Wan 2.1 baseline FVD=1251 / CLIPSIM=0.2951 → with MoSA FVD=1108 / CLIPSIM=0.3044, validating the generalizability of the framework.

Key Findings

Structure–appearance decoupling is the dominant contributor: Removing the structure branch increases FVD from 1093 to 1262 (+15.5%), demonstrating that explicit structural guidance is critical for motion quality.
3D outperforms 2D: Generating in 3D and projecting to 2D lowers FVD by 137 relative to direct 2D generation (1230 → 1093), primarily because depth information maintains structural correctness under limb occlusion (visualizations show leg interpenetration with the 2D approach).
HADC module matters: Removing HADC raises FVD by 95 (1093 → 1188), and dropping only its mask loss raises FVD by 19 (1093 → 1112), indicating that the spatial weight constraint helps extend the guidance signal across the human body region.
Temporal weighting in the dense tracking loss is important: Static weights vs. exponential temporal weighting differ by 21 FVD, indicating that learning long-range motion dependencies requires explicit encouragement.
MoVid dataset is indispensable: Compared to HumanVid, MoVid contributes a 124 FVD improvement due to its coverage of more complex and diverse full-body motions.
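
The FVD gaps quoted in these findings can be recomputed from the ablation table. The snippet below only re-derives numbers already reported there; the dictionary keys are shortened labels for the table rows.

```python
from typing import Dict, Tuple

FULL_FVD = 1093  # Full MoSA

ablation_fvd: Dict[str, int] = {
    "w/o structure branch": 1262,
    "2D skeleton instead of 3D": 1230,
    "w/o HADC module": 1188,
    "HADC w/o mask loss": 1112,
    "w/o dense tracking loss": 1172,
    "static instead of temporal weighting": 1114,
    "w/o contact constraint": 1108,
    "HumanVid dataset": 1217,
}


def fvd_deltas(full: int, ablations: Dict[str, int]) -> Dict[str, Tuple[int, float]]:
    """Absolute and relative (percent) FVD increase of each ablation over the full model."""
    return {
        name: (fvd - full, 100.0 * (fvd - full) / full)
        for name, fvd in ablations.items()
    }


deltas = fvd_deltas(FULL_FVD, ablation_fvd)
for name, (delta, pct) in sorted(deltas.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:38s} +{delta:3d} FVD ({pct:4.1f}%)")
```

Sorting by the absolute increase puts the structure branch first, followed by 3D generation and the choice of dataset, matching the ordering of the findings above.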

Highlights & Insights

Systematic decoupling paradigm: Motion is a structural signal that must satisfy physical constraints, while appearance is a texture signal that must meet visual-quality demands, so it is natural to generate the two with different models. The two-stage "generate structure first, then fill in appearance" design is principled and lets each stage be optimized separately, which is harder to do in end-to-end approaches.
HADC's sparse-to-dense design: Skeletons are extremely sparse representations (only \(K\) points), yet HADC propagates the guidance signal from keypoints to the entire body region via a learnable weight predictor, with a mask loss constraining coverage. This sparse-to-dense signal propagation paradigm is transferable to any scenario requiring sparse control signals to guide dense generation.
Temporal weighting in the tracking loss is elegant: The weight \(e^{|t_v - t_v'|/2}\) allocates more gradient to frame pairs with larger temporal spans, compelling the model to learn long-range motion consistency rather than focusing only on adjacent frames. This trick is directly applicable to any video generation task requiring temporal consistency.

Limitations & Future Work

Hand motion remains a bottleneck: The 3D Structure Transformer is trained on SMPL body joints only, without finger keypoints, so fine-grained hand motions still exhibit artifacts. The authors identify incorporating hand 3D annotations as a straightforward improvement direction.
Single-person limitation: Although some multi-person interaction results are presented, the overall framework is designed primarily for single-person scenarios and lacks systematic interaction modeling for multi-person cases.
Limited scale of MoVid: At 30K videos, MoVid remains orders of magnitude smaller than general-purpose video datasets (millions of clips), potentially limiting generalization to broader scenarios.
Computational overhead: The structure branch and HADC modules add cost at both training and inference time, and the tracking loss (which requires a CoTracker3 forward pass) adds further training cost.

vs. AnimateAnyone2: AnimateAnyone2 also uses skeleton guidance but requires user-provided driving pose sequences and only supports simple scenarios such as dancing. MoSA's core advantage lies in automatically generating structurally plausible skeletons from text and supporting complex full-body motions such as running and skating.
vs. VideoJAM: VideoJAM also addresses joint motion–appearance representation but performs joint learning within a single model. MoSA goes further and fully decouples the two branches, allowing the structure branch to focus on physical plausibility and the appearance branch to focus on visual quality.
vs. direct 2D skeleton methods (MotionMaster/DreamDance, etc.): These methods generate or consume skeletons in 2D space and are prone to collapse under occlusion. MoSA addresses this by generating motion in 3D and projecting it to 2D.

Rating

Novelty: ⭐⭐⭐⭐ The structure–appearance decoupling idea is intuitively sound and represents the first systematic implementation in human video generation, though the two-stage "generate structure then generate appearance" paradigm has precedents in other domains.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Quantitative comparisons against 7 general-purpose video models, multi-dimensional VBench evaluation, detailed per-module ablations, cross-backbone transfer validation, and qualitative visualizations make the evaluation comprehensive throughout.
Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated, the logic is clear, and the architecture diagram and figures are informative, though some formula notation definitions are redundant.
Value: ⭐⭐⭐⭐ Provides a systematic decoupling paradigm for human video generation and represents a significant advance in motion plausibility; the MoVid dataset also offers community value. However, the 30K dataset scale and single-person limitation require further extension for broader deployment.