MagicArticulate: Make Your 3D Models Articulation-Ready
Conference: CVPR 2025
arXiv: 2502.12135
Code: GitHub
Area: 3D Vision
Keywords: 3D articulation, skeleton generation, skinning weight prediction, auto-regressive transformer, functional diffusion, Articulation-XL
TL;DR
This paper proposes MagicArticulate, a two-stage framework. The first stage models skeleton generation as a sequence prediction task using an auto-regressive Transformer. The second stage predicts skinning weights via a functional diffusion process combined with a volumetric geodesic distance prior. Together with the large-scale Articulation-XL dataset (33K+ models), it automatically converts static 3D models into animatable assets.
Background & Motivation
Background: The demand for animatable 3D models has grown rapidly in fields such as gaming, VR/AR, and robotic simulation. However, converting static models into forms that support animation (skeleton + skinning weights) traditionally relies on manual annotation by professional artists, which is time-consuming and labor-intensive.
Limitations of Prior Work:
1. Template-based methods (e.g., Pinocchio) rely on predefined skeleton templates, which only apply to specific categories like humans and fail to generalize to diverse structures.
2. Template-free methods (e.g., curve skeleton extraction) often generate excessively dense joints that are unsuitable for animation.
3. Learning-based methods (e.g., RigNet) rely on hand-crafted features and assumptions about shape orientation, limiting their generalization across categories.
4. The lack of large-scale benchmark datasets hinders the development of general-purpose solutions.
Key Challenge: Skeleton structures of different 3D objects vary drastically (with bone counts ranging from 2 to 100+), requiring flexible handling of variable-length structures, while skinning weights must transition smoothly across complex mesh topologies.
Key Insight: Constructing a large-scale dataset + utilizing auto-regressive sequence modeling for variable-length skeletons + leveraging functional diffusion for smooth, continuous skinning weights.
Method
Overall Architecture
A two-stage pipeline:
1. Skeleton Generation Stage: Input 3D mesh → sample point cloud → extract shape tokens via a pre-trained shape encoder → auto-regressive Transformer generates bone tokens sequentially → detokenization yields skeleton coordinates and joint connectivity.
2. Skinning Weight Prediction Stage: Input mesh + generated skeleton → functional diffusion framework predicts the vertex-to-joint skinning weight matrix → export to standard formats (FBX/GLB).
Key Designs
1. Articulation-XL Large-Scale Dataset
    - Function: Curates 33K+ 3D models from Objaverse-XL, annotated with high-quality skeletons and skinning weights.
    - Mechanism: A three-stage pipeline: (a) initial filtering (deduplication, excluding single-joint shapes and models with more than 100 bones, resulting in 38.8K models); (b) VLM filtering (GPT-4o assesses skeleton quality from four rendered camera views); (c) automatic category label annotation using VLMs.
    - Design Motivation: Addresses the field's fundamental bottleneck, the lack of large-scale datasets. VLM filtering removes poorly defined skeletons; the ablation shows that training without it raises CD-J2J by roughly 15% (2.982 vs. 2.586).
2. Auto-Regressive Skeleton Generation (Sequence Modeling)
    - Function: Represents a skeleton as a sequence of bones (each bone defined by 6 coordinates representing two joint endpoints) and generates them auto-regressively using an OPT-350M decoder-only Transformer.
    - Mechanism:
        - Skeleton tokenization: normalize to \([-0.5, 0.5]^3\) → discretize into a \(128^3\) grid → 6 tokens per bone.
        - Two ordering strategies: spatial ordering (z-y-x ascending order) and hierarchical ordering (layer-by-layer based on the skeletal hierarchy).
        - Shape conditioning: sample 8192 points → extract features with a pre-trained encoder → prepend 257 shape tokens to the sequence.
        - Train using cross-entropy for next-token prediction.
    - Design Motivation: Auto-regressive modeling naturally handles variable-length sequences (bone counts 2-100 across models) and captures bone-to-bone dependencies. VQ-VAE is skipped as the sequence length is relatively short (\(\le 600\) tokens, i.e., at most 100 bones \(\times\) 6 tokens).
3. Functional Diffusion Skinning Weight Prediction
    - Function: Treats skinning weights as a continuous function \(\mathbb{R}^3 \to \mathbb{R}^n\) over the mesh surface, employing a DDPM-based functional diffusion for denoising.
    - Mechanism:
        - Introduces a volumetric geodesic distance prior \(\mathcal{G}\), so that the model learns the residual \(f: \mathcal{P} \to (\mathcal{W} - \mathcal{G})\).
        - The diffusion process adds noise to the skinning weight function, and the denoising network recovers the original weights.
        - Conditioning signals: joint coordinates + global shape features (pre-trained encoder).
        - Normalize skinning weights and geodesic distances to \([-1, 1]\) before adding noise.
    - Design Motivation: Functional diffusion naturally models continuous, high-dimensional weight distributions. The geodesic distance prior provides physically meaningful guidance: on ModelsResource, removing it lowers precision by 0.6 percentage points (82.1% → 81.5%) and recall by 3.9 percentage points (81.6% → 77.7%).
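The skeleton tokenization in design 2 (normalize, discretize into 128 bins per axis, six tokens per bone, z-y-x spatial ordering) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the normalization here uses the skeleton's own bounding box for simplicity, and the rule for ordering the two endpoints inside a bone is an assumption.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]
Bone = Tuple[Joint, Joint]

GRID = 128  # bins per axis, as stated in the text


def normalize(bones: List[Bone]) -> List[Bone]:
    """Centre the bounding box at the origin and scale its longest side to 1."""
    joints = [j for bone in bones for j in bone]
    lo = [min(j[a] for j in joints) for a in range(3)]
    hi = [max(j[a] for j in joints) for a in range(3)]
    centre = [(l + h) / 2 for l, h in zip(lo, hi)]
    scale = max(h - l for l, h in zip(lo, hi)) or 1.0  # avoid division by zero
    return [
        tuple(tuple((j[a] - centre[a]) / scale for a in range(3)) for j in bone)
        for bone in bones
    ]


def quantize(value: float) -> int:
    """Map a coordinate in [-0.5, 0.5] to a bin index in [0, GRID - 1]."""
    return min(max(int((value + 0.5) * GRID), 0), GRID - 1)


def dequantize(token: int) -> float:
    """Return the centre of a bin as a coordinate in [-0.5, 0.5]."""
    return (token + 0.5) / GRID - 0.5


def zyx(joint):
    """Sort key for spatial ordering: z first, then y, then x."""
    return (joint[2], joint[1], joint[0])


def tokenize(bones: List[Bone]) -> List[int]:
    """Flatten a skeleton into 6 integer tokens per bone, in spatial order."""
    quantized = []
    for bone in normalize(bones):
        a, b = (tuple(quantize(c) for c in joint) for joint in bone)
        # Assumed convention: the endpoint that is lower in z-y-x comes first.
        first, second = sorted((a, b), key=zyx)
        quantized.append((first, second))
    quantized.sort(key=lambda bone: (zyx(bone[0]), zyx(bone[1])))
    return [c for bone in quantized for joint in bone for c in joint]


def detokenize(tokens: List[int]) -> List[Bone]:
    """Invert tokenize up to quantization error (in normalized coordinates)."""
    coords = [dequantize(t) for t in tokens]
    joints = [tuple(coords[i:i + 3]) for i in range(0, len(coords), 3)]
    return [(joints[i], joints[i + 1]) for i in range(0, len(joints), 2)]
```

With this layout a 100-bone skeleton produces 600 tokens, which matches the sequence-length bound quoted above. Joint connectivity can be recovered at detokenization time by merging endpoints that fall into the same grid cell.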
Loss & Training
- Skeleton generation: Cross-entropy loss \(\mathcal{L}_{pred} = \text{CE}(\mathbf{T}, \hat{\mathbf{T}})\)
- Skinning weight: \(x_0\)-prediction MSE loss \(\mathcal{L}_{denoise} = \|D_\theta(\{x, f_t(x)\}, t) - f_0(x)\|_2^2\)
- DDPM scheduler, 1000 timesteps, linear beta schedule.
- Data augmentation: scaling, translation, rotation.
- Hardware: 8\(\times\)A100 GPUs; skeleton training takes ~2 days and skinning training ~1 day.
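The skinning-weight objective above can be sketched for a single training sample as follows. This is a simplified illustration under stated assumptions, not the paper's code: the beta endpoints `BETA_START` and `BETA_END` are common DDPM defaults rather than values from the source, the noise is drawn independently per sampled point (functional diffusion uses noise functions), and `f0` stands for whatever function values are being diffused at the sampled points.

```python
import math
import random

T = 1000  # number of timesteps, as stated above
BETA_START, BETA_END = 1e-4, 2e-2  # assumed endpoints, not given in the source

# Linear beta schedule and the cumulative products abar_t = prod_s (1 - beta_s).
betas = [BETA_START + (BETA_END - BETA_START) * t / (T - 1) for t in range(T)]
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)


def add_noise(f0, t, rng):
    """Forward process: f_t = sqrt(abar_t) * f_0 + sqrt(1 - abar_t) * eps."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in f0]


def x0_loss(pred_f0, f0):
    """Squared L2 distance between the predicted and the clean function values."""
    return sum((p - q) ** 2 for p, q in zip(pred_f0, f0))


def training_loss(denoiser, f0, rng):
    """One stochastic loss evaluation: sample t, noise f_0, predict f_0 back."""
    t = rng.randrange(T)
    f_t = add_noise(f0, t, rng)
    return x0_loss(denoiser(f_t, t), f0)
```

Here `denoiser` is a placeholder for the network \(D_\theta\); in the paper it additionally receives the point coordinates and the conditioning signals.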
Key Experimental Results
Main Results: Skeleton Generation (metrics \(\times 10^{-2}\), lower is better)
| Method | Dataset | CD-J2J | CD-J2B | CD-B2B |
|---|---|---|---|---|
| Pinocchio | Arti-XL | 8.360 | 6.677 | 5.689 |
| RigNet | Arti-XL | 7.478 | 5.892 | 4.932 |
| Ours-spatial | Arti-XL | 2.586 | 1.959 | 1.661 |
| RigNet | ModelsRes. | 4.143 | 2.961 | 2.675 |
| Ours-spatial | ModelsRes. | 3.343 | 2.455 | 2.140 |
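For readers unfamiliar with the metrics in this table, the sketch below shows one common way to compute a joint-to-joint and a joint-to-bone Chamfer distance. The exact definitions used in the paper may differ; the symmetric averaging and the function names here are assumptions for illustration.

```python
import math


def one_sided(src, dst):
    """Mean distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def cd_j2j(pred_joints, ref_joints):
    """Symmetric Chamfer distance between two joint sets (assumed averaging)."""
    return 0.5 * (one_sided(pred_joints, ref_joints) + one_sided(ref_joints, pred_joints))


def point_to_segment(p, a, b):
    """Euclidean distance from point p to the segment with endpoints a and b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(c * c for c in ab)
    # Project p onto the line through a and b, then clamp to the segment.
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)


def cd_j2b(pred_joints, pred_bones, ref_joints, ref_bones):
    """Symmetric joint-to-bone distance: joints of one skeleton vs. bones of the other."""
    def side(joints, bones):
        return sum(min(point_to_segment(j, a, b) for a, b in bones) for j in joints) / len(joints)
    return 0.5 * (side(pred_joints, ref_bones) + side(ref_joints, pred_bones))
```

A bone-to-bone variant (CD-B2B) would replace the point-to-segment distance with a segment-to-segment distance in the same pattern.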
Main Results: Skinning Weights (Precision/Recall: higher is better; L1: lower is better)
| Method | Dataset | Precision | Recall | avg L1 |
|---|---|---|---|---|
| GVB | Arti-XL | 75.7% | 68.3% | 0.724 |
| RigNet | Arti-XL | 72.4% | 71.1% | 0.698 |
| Ours | Arti-XL | 80.7% | 77.2% | 0.337 |
| GVB | ModelsRes. | 69.3% | 79.2% | 0.687 |
| RigNet | ModelsRes. | 77.1% | 83.5% | 0.464 |
| Ours | ModelsRes. | 82.1% | 81.6% | 0.398 |
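The three columns above can be read through the following sketch, which assumes a commonly used definition: a joint counts as influencing a vertex when its weight exceeds a small threshold, precision and recall compare the predicted and reference sets of influencing joints, and avg L1 is the per-vertex L1 difference between weight vectors. The threshold value and the pooling over all vertices are assumptions, not details confirmed by the source.

```python
THRESHOLD = 1e-4  # assumed cut-off for calling a joint "influential"


def skinning_metrics(pred, ref, threshold=THRESHOLD):
    """Return (precision, recall, avg_l1) for two vertex-by-joint weight matrices.

    pred and ref are lists of rows, one row of joint weights per vertex.
    """
    true_pos = pred_pos = ref_pos = 0
    l1_total = 0.0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            p_on, r_on = p > threshold, r > threshold
            true_pos += int(p_on and r_on)
            pred_pos += int(p_on)
            ref_pos += int(r_on)
            l1_total += abs(p - r)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / ref_pos if ref_pos else 0.0
    return precision, recall, l1_total / len(pred)
```

Under this reading, a method can score well on precision and recall while still having a large avg L1, because the first two only check which joints influence a vertex, not how strongly.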
Ablation Study
Skeleton Generation Ablation (Arti-XL, spatial ordering):
| Configuration | CD-J2J | CD-J2B | CD-B2B |
|---|---|---|---|
| w/o data filtering | 2.982 | 2.327 | 2.015 |
| 4096 points | 2.635 | 2.024 | 1.727 |
| 12288 points | 2.685 | 2.048 | 1.760 |
| Ours (8192) | 2.586 | 1.959 | 1.661 |
Skinning Weight Ablation (ModelsResource):
| Configuration | Precision | Recall | avg L1 |
|---|---|---|---|
| w/o geodesic dist. | 81.5% | 77.7% | 0.444 |
| w/o weights norm | 82.0% | 77.9% | 0.436 |
| w/o shape features | 81.4% | 81.3% | 0.412 |
| Ours | 82.1% | 81.6% | 0.398 |
Key Findings
- Cross-Dataset Generalization: Trained on Arti-XL and tested on ModelsResource, the proposed method remains competitive (CD-J2J 4.103), while RigNet degrades significantly across domains (7.132).
- Applicability to AI-Generated Models: On 3D meshes generated by Tripo 2.0, the proposed method produces reasonable skeletons, whereas both RigNet and Pinocchio fail.
- VLM Data Filtering is Crucial: Without the filtering process, all three skeleton errors rise by roughly 15-21% (CD-J2J 2.586 → 2.982, CD-J2B 1.959 → 2.327, CD-B2B 1.661 → 2.015).
- Spatial Ordering Outperforms Hierarchical Ordering: Spatial ordering allows the model to focus on positional accuracy, while hierarchical ordering requires the model to additionally learn the skeletal hierarchy.
Highlights & Insights
- Reformulating skeleton generation as sequence prediction is an elegant design that lets an auto-regressive Transformer handle variable-length structures.
- The combination of functional diffusion and geodesic distance residual learning is natural, effectively fusing physical priors with data-driven methods.
- The Articulation-XL dataset (33K+ models) fills a crucial void in this domain, and VLM-assisted quality filtering presents a highly practical data curation strategy.
- The complete pipeline outputs standard formats (FBX/GLB) that can be directly imported into Blender/Maya, demonstrating strong practical utility for industrial applications.
Limitations & Future Work
- The maximum number of joints for skinning weights is restricted to 55, and models exceeding this limit are excluded.
- Skeleton generation and skinning weight prediction are partitioned into two independent stages, which might lead to error accumulation.
- Skeletal semantics can be ambiguous for highly symmetric shapes or geometries without clear functionality (e.g., abstract artworks).
- Humanoid models constitute the largest portion of the dataset; generalization to rarer categories (e.g., mechanical structures) remains to be validated.
- Inference relies on sequential auto-regression, which limits the speed of generating large-scale skeletons.
Related Work & Insights
- RigNet pioneered the framework of learning both skeleton and skinning weights, but its graph neural networks are sensitive to shape orientation; this work circumvents the issue via auto-regression.
- The concept of auto-regressive mesh generation from MeshGPT/MeshAnythingV2 is successfully transferred to skeleton generation, a clear example of carrying a method across tasks.
- The functional diffusion framework is adapted from Functional Diffusion (Zhang and Wonka, CVPR 2024) and applied to skinning weight prediction for the first time.
- Insight: Large-scale annotated data + VLM-driven quality assurance + sequence modeling = a viable path toward general-purpose 3D automation.
Rating
⭐⭐⭐⭐