Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling¶
- Conference: ICCV 2025
- arXiv: 2507.14915
- Code: Project Page
- Area: image_generation (3D Dance Generation)
- Keywords: dance generation, music alignment, whole-body motion, hierarchical residual vector quantization, cross-modal retrieval
TL;DR¶
This paper introduces the SoulDance dataset (the first high-quality 3D dance dataset encompassing body, hand, and facial motion) and the SoulNet framework (hierarchical residual vector quantization + music-aligned generative model + cross-modal retrieval), achieving the first whole-body 3D dance generation with coordinated facial expressions, body, and hand movements aligned to musical rhythm and emotion.
Background & Motivation¶
Core Problem: Existing methods generate only body or body+hand dance motions and lack facial expression generation, resulting in incomplete emotional expressiveness. Additionally, insufficient cross-modal alignment between music and dance remains a persistent issue.
Three Major Challenges:
Dataset Gap: Existing dance datasets either contain only body motion (AIST++) or lack facial expressions (FineDance), and most are derived from video-based pose estimation with insufficient accuracy.
Modeling Difficulty: Complex hierarchical dependencies exist among body, hands, and face, and existing methods cannot effectively capture coordinated cross-part motion.
Poor Cross-Modal Alignment: Existing methods directly feed low-level audio features (e.g., Librosa, Jukebox features) into motion generators without explicit music-dance alignment mechanisms.
Motivation: To construct the first whole-body motion dataset using professional motion capture systems, while designing a hierarchical quantization and cross-modal alignment framework to address coordinated generation and music synchronization.
Method¶
Overall Architecture¶
SoulNet consists of three core components:
- HRVQ (Hierarchical Residual Vector Quantization): Models cross-part dependencies among body, hands, and face via hierarchical residual quantization.
- MAGM (Music-Aligned Generative Model): A Transformer-based model that generates dance token sequences aligned to music.
- MMR (Music-Motion Retrieval Module): A pretrained cross-modal alignment prior for music-motion retrieval.
Key Design 1: HRVQ¶
Conventional VQ-VAE quantizes whole-body motion uniformly, leading to significant information loss and neglect of inter-part relationships. The core innovation of HRVQ is a multi-level body–hand–face chained design:
- Whole-body motion is decomposed into body \(m^b\), hand \(m^h\), and face \(m^f\) components.
- At each RVQ level \(v\), the body residual is quantized first to obtain \(b^v\); \(b^v\) is then passed as a hint to guide hand quantization yielding \(h^v\); finally \(h^v\) is passed as a hint to guide facial quantization yielding \(f^v\).
- A transformation function \(\mathcal{T}\) fuses upstream part information with the current part residual, enabling hierarchical dependency modeling.
\(V=5\) residual quantization levels are employed, balancing reconstruction accuracy and generation efficiency.
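Below is a minimal PyTorch sketch of one HRVQ level with the body→hand→face chain. The fusion \(\mathcal{T}\) is modeled here as a learned linear layer over the concatenated residual and upstream code; the codebook sizes, dimensions, and exact form of \(\mathcal{T}\) are assumptions, not the paper's verified configuration.

```python
import torch
import torch.nn as nn

class ChainedQuantizerLevel(nn.Module):
    """One HRVQ level: quantize the body residual first, then hand
    (hinted by the body code), then face (hinted by the hand code).
    A simplified sketch; dims and the fusion T are assumed."""
    def __init__(self, dim=512, codebook_size=512):
        super().__init__()
        self.cb_body = nn.Embedding(codebook_size, dim)
        self.cb_hand = nn.Embedding(codebook_size, dim)
        self.cb_face = nn.Embedding(codebook_size, dim)
        # T: fuses the upstream part's quantized code with the
        # current part's residual before codebook lookup.
        self.t_hand = nn.Linear(2 * dim, dim)
        self.t_face = nn.Linear(2 * dim, dim)

    @staticmethod
    def _nearest(residual, codebook):
        # Nearest-neighbor lookup with a straight-through estimator.
        dists = torch.cdist(residual, codebook.weight)   # (T, K)
        code = codebook(dists.argmin(dim=-1))
        return residual + (code - residual).detach()

    def forward(self, r_body, r_hand, r_face):
        b = self._nearest(r_body, self.cb_body)
        # Body code hints the hand quantizer.
        h = self._nearest(self.t_hand(torch.cat([r_hand, b], -1)),
                          self.cb_hand)
        # Hand code hints the face quantizer.
        f = self._nearest(self.t_face(torch.cat([r_face, h], -1)),
                          self.cb_face)
        return b, h, f
```

Across the \(V=5\) levels, the input residual for level \(v+1\) is the unquantized remainder left over from level \(v\), as in standard RVQ.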
Key Design 2: MAGM Two-Stage Generation¶
- Stage 1: A masked Transformer performs mask-and-predict on the base layer \(T_0\) (concatenated body+hand+face tokens), conditioning on music features to generate the primary motion tokens.
- Stage 2: \(V-1\) residual layers progressively predict the residual tokens \(T_{1:V-1}\) to refine motion details.
- At each stage, MMR alignment constraints are introduced via the \(\mathcal{L}_{\text{Align-body}}\) and \(\mathcal{L}_{\text{Align-whole}}\) losses.
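A schematic of the two-stage decoding loop is sketched below, assuming MaskGIT/MoMask-style confidence-based remasking with a linear unmasking schedule; the model interfaces, iteration count, and mask-token id are all assumptions.

```python
import torch

@torch.no_grad()
def generate(base_model, res_models, music_feats, seq_len,
             n_iters=10, mask_id=8192):
    """Stage 1: iterative mask-and-predict on base-layer tokens T_0.
    Stage 2: one forward pass per residual layer. A sketch only."""
    tokens = torch.full((1, seq_len), mask_id)            # all masked
    for it in range(n_iters):
        logits = base_model(tokens, music_feats)          # (1, T, K)
        conf, pred = logits.softmax(-1).max(-1)
        masked = tokens == mask_id
        tokens = torch.where(masked, pred, tokens)
        # Previously fixed tokens are never re-masked.
        conf = torch.where(masked, conf, torch.ones_like(conf))
        # Linearly shrink the number of re-masked positions.
        n_mask = int(seq_len * (1 - (it + 1) / n_iters))
        if n_mask > 0:
            low_conf = conf.argsort(dim=-1)[:, :n_mask]
            tokens.scatter_(1, low_conf, mask_id)
    layers = [tokens]
    # Stage 2: each residual layer conditions on the layers below it.
    for res_model in res_models:
        layers.append(res_model(layers, music_feats).argmax(-1))
    return layers                                         # T_0 .. T_{V-1}
```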
Key Design 3: MMR Cross-Modal Retrieval¶
MMR is a CLIP-inspired pretrained music-motion alignment model:
- Motion Encoder: Compresses motion sequences into latent codes \(\mathbf{z}\).
- Music Encoder: Extracts audio features using the Jukebox encoder and maps them to the aligned space \(\mathbf{c}\).
- Contrastive loss \(\mathcal{L}_{\text{Align}}\) is used to pull matched music-dance pairs closer.
- Two MMR modules are trained: one supplies \(\mathcal{L}_{\text{Align-body}}\) to align body motion with the musical beat, and one supplies \(\mathcal{L}_{\text{Align-whole}}\) to align whole-body dynamics with musical emotion.
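Since MMR follows CLIP's contrastive recipe, the alignment objective is presumably a symmetric InfoNCE loss over in-batch music-motion pairs; a minimal sketch, with the temperature value being an assumption:

```python
import torch
import torch.nn.functional as F

def mmr_contrastive_loss(z_motion, c_music, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired clips.
    z_motion, c_music: (B, D) outputs of the two encoders."""
    z = F.normalize(z_motion, dim=-1)
    c = F.normalize(c_music, dim=-1)
    logits = z @ c.t() / temperature        # (B, B) similarities
    targets = torch.arange(z.size(0), device=z.device)
    # Matched pairs lie on the diagonal; all others are negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```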
Loss & Training¶
HRVQ Training: each part's tokenizer is trained with the standard VQ objectives, i.e., motion reconstruction plus a commitment term at every residual level (generic form shown; the paper's exact weighting may differ, and \(\operatorname{sg}[\cdot]\) denotes stop-gradient):

\[
\mathcal{L}_{\text{HRVQ}} = \mathcal{L}_{\text{recon}} + \beta \sum_{v=1}^{V} \left\| r^v - \operatorname{sg}\big[q^v\big] \right\|_2^2
\]

MAGM Training (two stages): each stage combines masked-token cross-entropy with the two MMR alignment terms:

\[
\mathcal{L}_{\text{MAGM}} = \mathcal{L}_{\text{CE}} + \lambda_b \, \mathcal{L}_{\text{Align-body}} + \lambda_w \, \mathcal{L}_{\text{Align-whole}},
\]

where \(\lambda_b = \lambda_w = 0.5\).
Key Experimental Results¶
SoulDance Dataset¶
| Property | Value |
|---|---|
| Total Duration | 12.5 hours |
| Frame Rate | 60 FPS |
| Dance Styles | 15 |
| Music Clips | 284 |
| Dancers | 5 professional dancers |
| Body/Hand/Face Included | ✓/✓/✓ (only holistic dataset) |
| Motion Capture Cameras | 15 optical cameras |
| Facial Parameters | ARKit 52-dim blendshapes |
Main Results: Motion Reconstruction (MPJPE, mm↓)¶
| Method | Whole Body | Body | Hand | Face |
|---|---|---|---|---|
| Vanilla VQ (512) | 137.130 | 97.660 | 129.518 | 4.013 |
| RVQ-5 | 108.983 | 71.831 | 106.836 | 2.379 |
| HRVQ-5 (Ours) | 83.679 | 47.895 | 85.085 | 1.153 |
HRVQ-5 reduces whole-body reconstruction error by 23.2% relative to RVQ-5 (108.983 → 83.679 mm).
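For reference, MPJPE is the mean Euclidean distance between corresponding reconstructed and ground-truth joints; a minimal sketch (the joint layout is assumed):

```python
import torch

def mpjpe(pred, gt):
    """Mean per-joint position error.
    pred, gt: (T, J, 3) joint positions in mm."""
    return (pred - gt).norm(dim=-1).mean()
```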
Main Results: Dance Generation (SoulDance)¶
| Method | FID↓ | Div↑ | MM↑ | MMR-MS↓ | BAS↑ | EAS↑ |
|---|---|---|---|---|---|---|
| FACT | 1.008 | 0.646 | 0.656 | 0.685 | 0.221 | 0.358 |
| Bailando | 1.379 | 1.307 | 1.117 | 0.585 | 0.236 | 0.401 |
| EDGE | 2.619 | 0.723 | 0.745 | 0.716 | 0.241 | 0.246 |
| FineNet | 1.463 | 1.262 | 0.832 | 0.694 | 0.213 | 0.263 |
| SoulNet | 0.029 | 1.312 | 1.310 | 0.369 | 0.244 | 0.594 |
SoulNet achieves an FID of only 0.029 (vs. 1.008 for the second-best FACT), reduces MMR-MS by 36.9% relative to Bailando (0.585 → 0.369), and improves EAS by 48.1% over Bailando (0.401 → 0.594).
Ablation Study: HRVQ + MMR¶
| Method | FID↓ | MMR-MS↓ |
|---|---|---|
| VQ-512 | 1.610 | 0.703 |
| RVQ-512 + MMR | 0.067 | 0.540 |
| HRVQ-512 | 0.048 | 0.418 |
| HRVQ-512 + MMR | 0.029 | 0.369 |
Ablation Study: Alignment Losses¶
| \(\mathcal{L}_{\text{Align-body}}\) | \(\mathcal{L}_{\text{Align-whole}}\) | FID↓ | BAS↑ | MMR-MS↓ |
|---|---|---|---|---|
| ✓ | ✓ | 0.029 | 0.244 | 0.369 |
| ✓ | ✗ | 0.031 | 0.242 | 0.387 |
| ✗ | ✓ | 0.042 | 0.237 | 0.372 |
\(\mathcal{L}_{\text{Align-body}}\) primarily improves local feature alignment (FID, BAS), while \(\mathcal{L}_{\text{Align-whole}}\) primarily enhances global structural alignment (MMR-MS).
User Study (1–10 scale)¶
| Method | Overall | Body | Hand | Emotion | Alignment |
|---|---|---|---|---|---|
| FineNet | 6.44 | 5.89 | 6.11 | 5.33 | 6.56 |
| SoulNet | 7.33 | 8.45 | 7.56 | 7.67 | 7.78 |
Key Findings¶
- Increasing residual levels \(V\) from 1 to 5 consistently improves performance; generation quality degrades beyond 5–6 levels.
- The MMR module has a limited effect on FID improvement but significantly enhances MMR-MS (music alignment).
- Inference takes 0.086 s per sequence (vs. 1.782 s for FACT and 1.521 s for EDGE), roughly 18–21× faster.
Highlights & Insights¶
- First Holistic Dance Dataset: SoulDance is the first high-quality motion-capture dance dataset simultaneously containing body, hand, and facial motion, filling a critical gap in the field.
- Chained Hierarchical Quantization: The body→hand→face chained information flow in HRVQ is an elegant design—using body motion as a hint to guide hand quantization, which in turn guides face quantization, reflecting the physical coordination principles inherent in dance.
- Dual-Granularity Alignment: \(\mathcal{L}_{\text{Align-body}}\) handles local beat alignment while \(\mathcal{L}_{\text{Align-whole}}\) handles global emotional alignment, forming complementary supervision signals.
- Novel Evaluation Metrics: the EmotionAlign Score (EAS) and MMR-Matching Score (MMR-MS) address the inadequacy of existing metrics in evaluating facial expression generation and fine-grained music-dance alignment (one plausible MMR-MS computation is sketched after this list).
- High Inference Efficiency: The mask-and-predict generation paradigm achieves an order-of-magnitude speedup over diffusion-based models.
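If MMR-MS is read as the embedding distance between a generated dance and its conditioning music in the learned MMR space, it could be computed as below; this formula is a hypothetical reading of the metric's name, not the paper's verified definition.

```python
import torch.nn.functional as F

def mmr_matching_score(motion_enc, music_enc, dances, musics):
    """Mean cosine distance between paired dance/music embeddings;
    lower = tighter music-motion correspondence. (Hypothetical.)"""
    z = F.normalize(motion_enc(dances), dim=-1)   # (N, D)
    c = F.normalize(music_enc(musics), dim=-1)    # (N, D)
    return (1.0 - (z * c).sum(dim=-1)).mean()
```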
Limitations & Future Work¶
- Dataset Scale: SoulDance contains only 12.5 hours of data across 15 dance styles, providing less coverage than FineDance's 22 styles.
- Limited Diversity on AIST++: The paper acknowledges that SoulNet's Diversity metric is lower than some competing methods on the smaller AIST++ benchmark.
- Limited Facial Representation: The use of ARKit 52-dim blendshapes rather than finer-grained facial models may result in information loss for subtle expressions.
- Unidirectional Chained Quantization: The unidirectional body→hand→face chain cannot model the reverse influence of facial expressions on body motion.
- Evaluation Metric Dependency: EAS relies on the accuracy of expression recognition algorithms, and MMR-MS depends on the quality of its own pretraining, introducing a risk of circular evaluation.
Related Work & Insights¶
- Bailando (VQ-VAE + RL): The first work to introduce VQ-VAE into dance generation, but training is complex and whole-body modeling is absent.
- EDGE (Diffusion-based): Generates high-quality body dance via diffusion models, but ignores hands and face.
- FineNet (FineDance): The first method to jointly generate body and hand motion, but without facial expressions.
- MoMask (RVQ for motion): Introduced RVQ into human motion generation; this work extends it to hierarchical RVQ.
- CLIP (Contrastive Learning): The MMR module directly draws on CLIP's cross-modal alignment paradigm.
- Insights: The chained hierarchical quantization design is generalizable to other multi-part coordinated generation tasks, such as whole-body sign language generation and character animation.
Rating¶
| Dimension | Score (1–10) | Remarks |
|---|---|---|
| Novelty | 8 | First holistic dance dataset + novel HRVQ chained quantization design |
| Technical Depth | 8 | Complete three-module co-design with clear theoretical derivation of HRVQ |
| Experimental Thoroughness | 9 | Three datasets, multiple ablations, user study, and two new metrics—very comprehensive |
| Writing Quality | 8 | Clear paper structure with highly informative figures and tables |
| Value | 7 | Dataset and framework have direct application value for whole-body animation |
| Overall | 8.0 | Solid ICCV-level contribution combining dataset, methodology, and thorough evaluation |