Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling¶
- Conference: ICCV 2025
- arXiv: 2507.14915
- Code: Project Page
- Area: image_generation (3D Dance Generation)
- Keywords: dance generation, music alignment, whole-body motion, hierarchical residual vector quantization, cross-modal retrieval
TL;DR¶
This paper introduces the SoulDance dataset (the first high-quality 3D dance dataset encompassing body, hand, and facial motion) and the SoulNet framework (hierarchical residual vector quantization + music-aligned generative model + cross-modal retrieval), achieving the first whole-body 3D dance generation with coordinated facial expressions, body, and hand movements aligned to musical rhythm and emotion.
Background & Motivation¶
Core Problem: Existing methods generate only body or body+hand dance motions and lack facial expression generation, resulting in incomplete emotional expressiveness. Additionally, insufficient cross-modal alignment between music and dance remains a persistent issue.
Three Major Challenges:
Dataset Gap: Existing dance datasets either contain only body motion (AIST++) or lack facial expressions (FineDance), and most are derived from video-based pose estimation with insufficient accuracy.
Modeling Difficulty: Complex hierarchical dependencies exist among body, hands, and face, and existing methods cannot effectively capture coordinated cross-part motion.
Poor Cross-Modal Alignment: Existing methods directly feed low-level audio features (e.g., Librosa, Jukebox features) into motion generators without explicit music-dance alignment mechanisms.
Motivation: To construct the first whole-body motion dataset using professional motion capture systems, while designing a hierarchical quantization and cross-modal alignment framework to address coordinated generation and music synchronization.
Method¶
Overall Architecture¶
SoulNet consists of three core components:
- HRVQ (Hierarchical Residual Vector Quantization): Models cross-part dependencies among body, hands, and face via hierarchical residual quantization.
- MAGM (Music-Aligned Generative Model): A Transformer-based model that generates dance token sequences aligned to music.
- MMR (Music-Motion Retrieval Module): A pretrained cross-modal alignment prior for music-motion retrieval.
Key Design 1: HRVQ¶
Conventional VQ-VAE quantizes whole-body motion uniformly, leading to significant information loss and neglect of inter-part relationships. The core innovation of HRVQ is a multi-level body–hand–face chained design:
- Whole-body motion is decomposed into body \(m^b\), hand \(m^h\), and face \(m^f\) components.
- At each RVQ level \(v\), the body residual is quantized first to obtain \(b^v\); \(b^v\) is then passed as a hint to guide hand quantization yielding \(h^v\); finally \(h^v\) is passed as a hint to guide facial quantization yielding \(f^v\).
- A transformation function \(\mathcal{T}\) fuses upstream part information with the current part residual, enabling hierarchical dependency modeling.
\(V=5\) residual quantization levels are employed, balancing reconstruction accuracy and generation efficiency.
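Below is a minimal PyTorch sketch of one HRVQ level with the body→hand→face chain. The fusion \(\mathcal{T}\) is modeled here as a learned linear layer over the concatenated residual and upstream code; the codebook sizes, dimensions, and exact form of \(\mathcal{T}\) are assumptions, not the paper's verified configuration.

```python
import torch
import torch.nn as nn

class ChainedQuantizerLevel(nn.Module):
    """One HRVQ level: quantize the body residual first, then hand
    (hinted by the body code), then face (hinted by the hand code).
    A simplified sketch; dims and the fusion T are assumed."""
    def __init__(self, dim=512, codebook_size=512):
        super().__init__()
        self.cb_body = nn.Embedding(codebook_size, dim)
        self.cb_hand = nn.Embedding(codebook_size, dim)
        self.cb_face = nn.Embedding(codebook_size, dim)
        # T: fuses the upstream part's quantized code with the
        # current part's residual before codebook lookup.
        self.t_hand = nn.Linear(2 * dim, dim)
        self.t_face = nn.Linear(2 * dim, dim)

    @staticmethod
    def _nearest(residual, codebook):
        # Nearest-neighbor lookup with a straight-through estimator.
        dists = torch.cdist(residual, codebook.weight)   # (T, K)
        code = codebook(dists.argmin(dim=-1))
        return residual + (code - residual).detach()

    def forward(self, r_body, r_hand, r_face):
        b = self._nearest(r_body, self.cb_body)
        # Body code hints the hand quantizer.
        h = self._nearest(self.t_hand(torch.cat([r_hand, b], -1)),
                          self.cb_hand)
        # Hand code hints the face quantizer.
        f = self._nearest(self.t_face(torch.cat([r_face, h], -1)),
                          self.cb_face)
        return b, h, f
```

Across the \(V=5\) levels, the input residual for level \(v+1\) is the unquantized remainder left over from level \(v\), as in standard RVQ.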
Key Design 2: MAGM Two-Stage Generation¶
- Stage 1: A masked Transformer performs mask-and-predict on the base layer \(T_0\) (concatenated body+hand+face tokens), conditioning on music features to generate the primary motion tokens.
- Stage 2: \(V-1\) residual layers progressively predict the residual tokens \(T_{1:V-1}\) to refine motion details.
- At each stage, MMR alignment constraints are introduced via the \(\mathcal{L}_{\text{Align-body}}\) and \(\mathcal{L}_{\text{Align-whole}}\) losses.
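A schematic of the two-stage decoding loop is sketched below, assuming MaskGIT/MoMask-style confidence-based remasking with a linear unmasking schedule; the model interfaces, iteration count, and mask-token id are all assumptions.

```python
import torch

@torch.no_grad()
def generate(base_model, res_models, music_feats, seq_len,
             n_iters=10, mask_id=8192):
    """Stage 1: iterative mask-and-predict on base-layer tokens T_0.
    Stage 2: one forward pass per residual layer. A sketch only."""
    tokens = torch.full((1, seq_len), mask_id)            # all masked
    for it in range(n_iters):
        logits = base_model(tokens, music_feats)          # (1, T, K)
        conf, pred = logits.softmax(-1).max(-1)
        masked = tokens == mask_id
        tokens = torch.where(masked, pred, tokens)
        # Previously fixed tokens are never re-masked.
        conf = torch.where(masked, conf, torch.ones_like(conf))
        # Linearly shrink the number of re-masked positions.
        n_mask = int(seq_len * (1 - (it + 1) / n_iters))
        if n_mask > 0:
            low_conf = conf.argsort(dim=-1)[:, :n_mask]
            tokens.scatter_(1, low_conf, mask_id)
    layers = [tokens]
    # Stage 2: each residual layer conditions on the layers below it.
    for res_model in res_models:
        layers.append(res_model(layers, music_feats).argmax(-1))
    return layers                                         # T_0 .. T_{V-1}
```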
Key Design 3: MMR Cross-Modal Retrieval¶
MMR is a CLIP-inspired pretrained music-motion alignment model:
- Motion Encoder: Compresses motion sequences into latent codes \(\mathbf{z}\).
- Music Encoder: Extracts audio features using the Jukebox encoder and maps them to the aligned space \(\mathbf{c}\).
- Contrastive loss \(\mathcal{L}_{\text{Align}}\) is used to pull matched music-dance pairs closer.
- Two MMR modules are trained: one supplies \(\mathcal{L}_{\text{Align-body}}\) to align body motion with the musical beat, and one supplies \(\mathcal{L}_{\text{Align-whole}}\) to align whole-body dynamics with musical emotion.
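Since MMR follows CLIP's contrastive recipe, the alignment objective is presumably a symmetric InfoNCE loss over in-batch music-motion pairs; a minimal sketch, with the temperature value being an assumption:

```python
import torch
import torch.nn.functional as F

def mmr_contrastive_loss(z_motion, c_music, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired clips.
    z_motion, c_music: (B, D) outputs of the two encoders."""
    z = F.normalize(z_motion, dim=-1)
    c = F.normalize(c_music, dim=-1)
    logits = z @ c.t() / temperature        # (B, B) similarities
    targets = torch.arange(z.size(0), device=z.device)
    # Matched pairs lie on the diagonal; all others are negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```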
Loss & Training¶
HRVQ Training: each part's tokenizer is trained with the standard VQ objectives, i.e., motion reconstruction plus a commitment term at every residual level (generic form shown; the paper's exact weighting may differ, and \(\operatorname{sg}[\cdot]\) denotes stop-gradient):

\[
\mathcal{L}_{\text{HRVQ}} = \mathcal{L}_{\text{recon}} + \beta \sum_{v=1}^{V} \left\| r^v - \operatorname{sg}\big[q^v\big] \right\|_2^2
\]

MAGM Training (two stages): each stage combines masked-token cross-entropy with the two MMR alignment terms:

\[
\mathcal{L}_{\text{MAGM}} = \mathcal{L}_{\text{CE}} + \lambda_b \, \mathcal{L}_{\text{Align-body}} + \lambda_w \, \mathcal{L}_{\text{Align-whole}},
\]

where \(\lambda_b = \lambda_w = 0.5\).
Key Experimental Results¶
SoulDance Dataset¶
| Property | Value |
|---|---|
| Total Duration | 12.5 hours |
| Frame Rate | 60 FPS |
| Dance Styles | 15 |
| Music Clips | 284 |
| Dancers | 5 professional dancers |
| Body/Hand/Face Included | ✓/✓/✓ (only holistic dataset) |
| Motion Capture Cameras | 15 optical cameras |
| Facial Parameters | ARKit 52-dim blendshapes |
Main Results: Motion Reconstruction (MPJPE, mm↓)¶
| Method | Whole Body | Body | Hand | Face |
|---|---|---|---|---|
| Vanilla VQ (512) | 137.130 | 97.660 | 129.518 | 4.013 |
| RVQ-5 | 108.983 | 71.831 | 106.836 | 2.379 |
| HRVQ-5 (Ours) | 83.679 | 47.895 | 85.085 | 1.153 |
HRVQ-5 reduces whole-body reconstruction error by 23.2% relative to RVQ-5 (108.983 → 83.679 mm).
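For reference, MPJPE is the mean Euclidean distance between corresponding reconstructed and ground-truth joints; a minimal sketch (the joint layout is assumed):

```python
import torch

def mpjpe(pred, gt):
    """Mean per-joint position error.
    pred, gt: (T, J, 3) joint positions in mm."""
    return (pred - gt).norm(dim=-1).mean()
```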
Main Results: Dance Generation (SoulDance)¶
| Method | FID↓ | Div↑ | MM↑ | MMR-MS↓ | BAS↑ | EAS↑ |
|---|---|---|---|---|---|---|
| FACT | 1.008 | 0.646 | 0.656 | 0.685 | 0.221 | 0.358 |
| Bailando | 1.379 | 1.307 | 1.117 | 0.585 | 0.236 | 0.401 |
| EDGE | 2.619 | 0.723 | 0.745 | 0.716 | 0.241 | 0.246 |
| FineNet | 1.463 | 1.262 | 0.832 | 0.694 | 0.213 | 0.263 |
| SoulNet | 0.029 | 1.312 | 1.310 | 0.369 | 0.244 | 0.594 |
SoulNet achieves an FID of only 0.029 (vs. 1.008 for the second-best FACT), reduces MMR-MS by 36.9% relative to Bailando (0.585 → 0.369), and improves EAS by 48.1% over Bailando (0.401 → 0.594).
Ablation Study: HRVQ + MMR¶
| Method | FID↓ | MMR-MS↓ |
|---|---|---|
| VQ-512 | 1.610 | 0.703 |
| RVQ-512 + MMR | 0.067 | 0.540 |
| HRVQ-512 | 0.048 | 0.418 |
| HRVQ-512 + MMR | 0.029 | 0.369 |
Ablation Study: Alignment Losses¶
| \(\mathcal{L}_{\text{Align-body}}\) | \(\mathcal{L}_{\text{Align-whole}}\) | FID↓ | BAS↑ | MMR-MS↓ |
|---|---|---|---|---|
| ✓ | ✓ | 0.029 | 0.244 | 0.369 |
| ✓ | ✗ | 0.031 | 0.242 | 0.387 |
| ✗ | ✓ | 0.042 | 0.237 | 0.372 |
\(\mathcal{L}_{\text{Align-body}}\) primarily improves local feature alignment (FID, BAS), while \(\mathcal{L}_{\text{Align-whole}}\) primarily enhances global structural alignment (MMR-MS).
User Study (1–10 scale)¶
| Method | Overall | Body | Hand | Emotion | Alignment |
|---|---|---|---|---|---|
| FineNet | 6.44 | 5.89 | 6.11 | 5.33 | 6.56 |
| SoulNet | 7.33 | 8.45 | 7.56 | 7.67 | 7.78 |
Key Findings¶
- Increasing residual levels \(V\) from 1 to 5 consistently improves performance; generation quality degrades beyond 5–6 levels.
- The MMR module has a limited effect on FID improvement but significantly enhances MMR-MS (music alignment).
- Inference takes 0.086 s per sequence (vs. 1.782 s for FACT and 1.521 s for EDGE), roughly 18–21× faster.
Highlights & Insights¶
- First Holistic Dance Dataset: SoulDance is the first high-quality motion-capture dance dataset simultaneously containing body, hand, and facial motion, filling a critical gap in the field.
- Chained Hierarchical Quantization: The body→hand→face chained information flow in HRVQ is an elegant design—using body motion as a hint to guide hand quantization, which in turn guides face quantization, reflecting the physical coordination principles inherent in dance.
- Dual-Granularity Alignment: \(\mathcal{L}_{\text{Align-body}}\) handles local beat alignment while \(\mathcal{L}_{\text{Align-whole}}\) handles global emotional alignment, forming complementary supervision signals.
- Novel Evaluation Metrics: the EmotionAlign Score (EAS) and MMR-Matching Score (MMR-MS) address the inadequacy of existing metrics in evaluating facial expression generation and fine-grained music-dance alignment (one plausible MMR-MS computation is sketched after this list).
- High Inference Efficiency: The mask-and-predict generation paradigm achieves an order-of-magnitude speedup over diffusion-based models.
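If MMR-MS is read as the embedding distance between a generated dance and its conditioning music in the learned MMR space, it could be computed as below; this formula is a hypothetical reading of the metric's name, not the paper's verified definition.

```python
import torch.nn.functional as F

def mmr_matching_score(motion_enc, music_enc, dances, musics):
    """Mean cosine distance between paired dance/music embeddings;
    lower = tighter music-motion correspondence. (Hypothetical.)"""
    z = F.normalize(motion_enc(dances), dim=-1)   # (N, D)
    c = F.normalize(music_enc(musics), dim=-1)    # (N, D)
    return (1.0 - (z * c).sum(dim=-1)).mean()
```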
Limitations & Future Work¶
- Dataset Scale: SoulDance contains only 12.5 hours of data across 15 dance styles, providing less coverage than FineDance's 22 styles.
- Limited Diversity on AIST++: The paper acknowledges that SoulNet's Diversity metric is lower than some competing methods on the smaller AIST++ benchmark.
- Limited Facial Representation: The use of ARKit 52-dim blendshapes rather than finer-grained facial models may result in information loss for subtle expressions.
- Unidirectional Chained Quantization: The unidirectional body→hand→face chain cannot model the reverse influence of facial expressions on body motion.
- Evaluation Metric Dependency: EAS relies on the accuracy of expression recognition algorithms, and MMR-MS depends on the quality of its own pretraining, introducing a risk of circular evaluation.
Related Work & Insights¶
- Bailando (VQ-VAE + RL): The first work to introduce VQ-VAE into dance generation, but training is complex and whole-body modeling is absent.
- EDGE (Diffusion-based): Generates high-quality body dance via diffusion models, but ignores hands and face.
- FineNet (FineDance): The first method to jointly generate body and hand motion, but without facial expressions.
- MoMask (RVQ for motion): Introduced RVQ into human motion generation; this work extends it to hierarchical RVQ.
- CLIP (Contrastive Learning): The MMR module directly draws on CLIP's cross-modal alignment paradigm.
- Insights: The chained hierarchical quantization design is generalizable to other multi-part coordinated generation tasks, such as whole-body sign language generation and character animation.
Rating¶
| Dimension | Score (1–10) | Remarks |
|---|---|---|
| Novelty | 8 | First holistic dance dataset + novel HRVQ chained quantization design |
| Technical Depth | 8 | Complete three-module co-design with clear theoretical derivation of HRVQ |
| Experimental Thoroughness | 9 | Three datasets, multiple ablations, user study, and two new metrics—very comprehensive |
| Writing Quality | 8 | Clear paper structure with highly informative figures and tables |
| Value | 7 | Dataset and framework have direct application value for whole-body animation |
| Overall | 8.0 | Solid ICCV-level contribution combining dataset, methodology, and thorough evaluation |