MagicArticulate: Make Your 3D Models Articulation-Ready¶

Conference: CVPR 2025
arXiv: 2502.12135
Code: GitHub
Area: 3D Vision
Keywords: 3D articulation, skeleton generation, skinning weight prediction, auto-regressive transformer, functional diffusion, Articulation-XL

TL;DR¶

This paper proposes MagicArticulate, a two-stage framework. The first stage models skeleton generation as a sequence prediction task using an auto-regressive Transformer. The second stage predicts skinning weights via a functional diffusion process combined with a volumetric geodesic distance prior. Together with the large-scale Articulation-XL dataset (33K+ models), it automatically converts static 3D models into animatable assets.

Background & Motivation¶

Background: Demand for animatable 3D models has grown rapidly in fields such as gaming, VR/AR, and robotic simulation. However, converting a static model into an animation-ready one (skeleton + skinning weights) traditionally relies on manual work by professional artists, which is time-consuming and labor-intensive.

Limitations of Prior Work:
1. Template-based methods (e.g., Pinocchio) rely on predefined skeleton templates, which apply only to specific categories such as humans and fail to generalize to diverse structures.
2. Template-free methods (e.g., curve skeleton extraction) often generate excessively dense joints that are unsuitable for animation.
3. Learning-based methods (e.g., RigNet) rely on hand-crafted features and assumptions about shape orientation, limiting their generalization across categories.
4. The lack of large-scale benchmark datasets hinders the development of general-purpose solutions.

Key Challenge: Skeleton structures of different 3D objects vary drastically (with bone counts ranging from 2 to 100+), requiring flexible handling of variable-length structures, while skinning weights must transition smoothly across complex mesh topologies.

Key Insight: Constructing a large-scale dataset + utilizing auto-regressive sequence modeling for variable-length skeletons + leveraging functional diffusion for smooth, continuous skinning weights.

Method¶

Overall Architecture¶

A two-stage pipeline:
1. Skeleton Generation Stage: Input 3D mesh → sample point cloud → extract shape tokens via a pre-trained shape encoder → auto-regressive Transformer generates bone tokens sequentially → detokenization yields skeleton coordinates and joint connectivity.
2. Skinning Weight Prediction Stage: Input mesh + generated skeleton → functional diffusion framework predicts the vertex-to-joint skinning weight matrix → export to standard formats (FBX/GLB).

Key Designs¶

1. Articulation-XL Large-Scale Dataset
- Function: Curation of 33K+ 3D models from Objaverse-XL, annotated with high-quality skeletons and skinning weights.
- Mechanism: A three-stage pipeline: (a) initial filtering (deduplication, plus exclusion of single-joint shapes and models with more than 100 bones, leaving 38.8K models); (b) VLM filtering (GPT-4o assesses skeleton quality from four rendered camera views); (c) automatic category labeling using VLMs.
- Design Motivation: Addresses the fundamental bottleneck of this field, the lack of large-scale datasets. VLM filtering removes poorly defined skeletons, which ablation studies show improves CD-J2J by approximately 15%.

2. Auto-Regressive Skeleton Generation (Sequence Modeling)
- Function: Represents a skeleton as a sequence of bones (each bone defined by 6 coordinates representing two joint endpoints) and generates them auto-regressively using an OPT-350M decoder-only Transformer.
- Mechanism:
  - Skeleton tokenization: normalize to \([-0.5, 0.5]^3\) → discretize into a \(128^3\) grid → 6 tokens per bone.
  - Two ordering strategies: spatial ordering (z-y-x ascending order) and hierarchical ordering (layer-by-layer based on the skeletal hierarchy).
  - Shape conditioning: sample 8192 points → extract features with a pre-trained encoder → prepend 257 shape tokens to the sequence.
  - Training: cross-entropy loss for next-token prediction.
- Design Motivation: Auto-regressive modeling naturally handles variable-length sequences (2 to 100 bones per model) and captures bone-to-bone dependencies. VQ-VAE is skipped because the sequence is relatively short (\(\le 600\) tokens, i.e., at most 100 bones × 6 tokens).
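The tokenization step above can be illustrated with a minimal sketch. It assumes the bones are already normalized to \([-0.5, 0.5]^3\); the function names `tokenize_skeleton` and `detokenize_skeleton` are hypothetical, and sorting bones by their first endpoint is an assumption about how the z-y-x ordering is applied, not a detail confirmed by the text.

```python
import numpy as np

GRID = 128  # bins per axis, as stated in the text


def tokenize_skeleton(bones: np.ndarray) -> np.ndarray:
    """Map bones of shape (B, 2, 3), coordinates in [-0.5, 0.5], to 6*B integer tokens."""
    # Quantize each coordinate into one of GRID bins (0 .. GRID-1).
    q = np.clip(np.floor((bones + 0.5) * GRID), 0, GRID - 1).astype(np.int64)
    # Spatial ordering (assumed key: first endpoint, z then y then x, ascending).
    # np.lexsort treats the LAST key as the primary one.
    first = q[:, 0, :]
    order = np.lexsort((first[:, 0], first[:, 1], first[:, 2]))
    return q[order].reshape(-1)


def detokenize_skeleton(tokens: np.ndarray) -> np.ndarray:
    """Invert tokenization up to quantization error by returning bin centers."""
    q = np.asarray(tokens, dtype=np.float64).reshape(-1, 2, 3)
    return (q + 0.5) / GRID - 0.5


# Two toy bones; the second has the lower z and is therefore emitted first.
bones = np.array([
    [[0.0, 0.0, 0.25], [0.0, 0.1, 0.3]],
    [[0.0, 0.0, -0.25], [0.1, 0.0, 0.0]],
])
tokens = tokenize_skeleton(bones)
print(tokens.shape)  # (12,)
```

With 128 bins per axis, the round trip through `detokenize_skeleton` loses at most half a bin width (0.5/128) per coordinate, which is why a learned VQ-VAE codebook is not strictly needed at this sequence length.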

3. Functional Diffusion Skinning Weight Prediction
- Function: Treats skinning weights as a continuous function \(\mathbb{R}^3 \to \mathbb{R}^n\) over the mesh surface and denoises it with a DDPM-based functional diffusion model.
- Mechanism:
  - Introduces a volumetric geodesic distance prior \(\mathcal{G}\), so that the model learns the residual \(f: \mathcal{P} \to (\mathcal{W} - \mathcal{G})\).
  - The diffusion process adds noise to the skinning weight function, and the denoising network recovers the original weights.
  - Conditioning signals: joint coordinates + global shape features (pre-trained encoder).
  - Skinning weights and geodesic distances are normalized to \([-1, 1]\) before noise is added.
- Design Motivation: Functional diffusion naturally models continuous, high-dimensional weight distributions. The geodesic distance prior provides physically meaningful guidance: in the ModelsResource ablation, removing it lowers precision by 0.6 percentage points (82.1% → 81.5%) and recall by 3.9 percentage points (81.6% → 77.7%).
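The residual formulation can be sketched as below. This is only an illustration under stated assumptions: the min-max mapping to \([-1, 1]\), the use of the largest distance as the upper bound, and the final row renormalization are guesses at reasonable choices, and the helper names are hypothetical rather than taken from the paper's code.

```python
import numpy as np


def to_signed_unit(a, lo, hi):
    """Affinely map values from [lo, hi] to [-1, 1]."""
    return 2.0 * (a - lo) / (hi - lo) - 1.0


def from_signed_unit(a, lo, hi):
    """Inverse of to_signed_unit."""
    return (a + 1.0) * 0.5 * (hi - lo) + lo


def residual_target(W, G_dist):
    """W: (V, J) weights in [0, 1]; G_dist: (V, J) non-negative geodesic distances."""
    W_n = to_signed_unit(W, 0.0, 1.0)
    G_n = to_signed_unit(G_dist, 0.0, G_dist.max())  # assumed normalization
    return W_n - G_n, G_n  # the network is trained on the residual W - G


def recover_weights(residual_pred, G_n):
    """Add the prior back, undo the normalization, and renormalize each vertex."""
    W = np.clip(from_signed_unit(residual_pred + G_n, 0.0, 1.0), 0.0, None)
    return W / np.maximum(W.sum(axis=1, keepdims=True), 1e-8)
```

If the denoiser predicted the residual perfectly, `recover_weights(*residual_target(W, G_dist))` would return `W` unchanged; the prior only shifts what the network has to model.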

Loss & Training¶

Skeleton generation: Cross-entropy loss \(\mathcal{L}_{pred} = \text{CE}(\mathbf{T}, \hat{\mathbf{T}})\)
Skinning weight: \(x_0\)-prediction MSE loss \(\mathcal{L}_{denoise} = \|D_\theta(\{x, f_t(x)\}, t) - f_0(x)\|_2^2\)
DDPM scheduler, 1000 timesteps, linear beta schedule.
Data augmentation: scaling, translation, rotation.
Hardware: 8\(\times\)A100 GPUs; skeleton training takes ~2 days and skinning training takes ~1 day.
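The skinning objective listed above (DDPM forward noising with a linear beta schedule, followed by an \(x_0\)-prediction MSE) can be sketched as follows. The schedule endpoints `1e-4` and `0.02` are common DDPM defaults assumed here, not values stated in the text, and `denoiser` is a hypothetical stand-in for \(D_\theta\); averaging the squared norm over sampled points is likewise an assumption.

```python
import numpy as np

T = 1000                                 # timesteps, as stated in the text
betas = np.linspace(1e-4, 0.02, T)       # linear schedule; endpoints are assumed
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal retention


def q_sample(f0, t, rng):
    """Forward process: noise the clean function values f0 to timestep t."""
    eps = rng.standard_normal(f0.shape)
    return np.sqrt(alpha_bar[t]) * f0 + np.sqrt(1.0 - alpha_bar[t]) * eps


def x0_prediction_loss(denoiser, x, f0, t, rng):
    """MSE between the denoiser's estimate of f0 and the true f0.

    x:  (N, 3) query points on the surface
    f0: (N, J) clean function values at those points
    """
    f_t = q_sample(f0, t, rng)
    pred = denoiser(x, f_t, t)
    return float(np.mean(np.sum((pred - f0) ** 2, axis=-1)))
```

A denoiser that always returns the clean values yields zero loss, while one that returns zeros is penalized by the mean squared norm of `f0`, which is a quick sanity check for an implementation.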

Key Experimental Results¶

Main Results — Skeleton Generation (metrics \(\times 10^{-2}\), lower is better)¶

Method	Dataset	CD-J2J	CD-J2B	CD-B2B
Pinocchio	Arti-XL	8.360	6.677	5.689
RigNet	Arti-XL	7.478	5.892	4.932
Ours-spatial	Arti-XL	2.586	1.959	1.661
RigNet	ModelsRes.	4.143	2.961	2.675
Ours-spatial	ModelsRes.	3.343	2.455	2.140
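For readers unfamiliar with the metrics in the table, the sketch below shows one plausible way to compute a joint-to-joint Chamfer distance (CD-J2J) and the point-to-segment distance that a joint-to-bone variant (CD-J2B) would build on. The symmetric averaging and the function names are assumptions; the exact definitions used for the reported numbers are not given in the text.

```python
import numpy as np


def chamfer_j2j(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between joint sets a (N, 3) and b (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return float(0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean()))


def point_to_segment(p: np.ndarray, s0: np.ndarray, s1: np.ndarray) -> np.ndarray:
    """Distances from points p (N, 3) to segments s0 -> s1 (each (M, 3)); returns (N, M)."""
    seg = s1 - s0                                         # (M, 3)
    rel = p[:, None, :] - s0[None, :, :]                  # (N, M, 3)
    denom = np.maximum(np.sum(seg * seg, axis=-1), 1e-12)
    # Projection parameter, clamped so the closest point stays on the segment.
    t = np.clip(np.sum(rel * seg[None], axis=-1) / denom, 0.0, 1.0)
    closest = s0[None] + t[..., None] * seg[None]
    return np.linalg.norm(p[:, None, :] - closest, axis=-1)


def chamfer_j2b(joints_a, bones_a, joints_b, bones_b) -> float:
    """Joint-to-bone Chamfer distance; bones are arrays of shape (K, 2, 3)."""
    d_ab = point_to_segment(joints_a, bones_b[:, 0], bones_b[:, 1]).min(axis=1).mean()
    d_ba = point_to_segment(joints_b, bones_a[:, 0], bones_a[:, 1]).min(axis=1).mean()
    return float(0.5 * (d_ab + d_ba))
```

CD-B2B would follow the same pattern with points sampled along the bones of both skeletons instead of the joints alone.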

Main Results — Skinning Weights (Precision/Recall: higher is better; L1: lower is better)¶

Method	Dataset	Precision	Recall	avg L1
GVB	Arti-XL	75.7%	68.3%	0.724
RigNet	Arti-XL	72.4%	71.1%	0.698
Ours	Arti-XL	80.7%	77.2%	0.337
GVB	ModelsRes.	69.3%	79.2%	0.687
RigNet	ModelsRes.	77.1%	83.5%	0.464
Ours	ModelsRes.	82.1%	81.6%	0.398
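The skinning metrics can be illustrated with a short sketch. It assumes, following a convention common in the rigging literature, that a joint counts as "influential" for a vertex when its weight exceeds a small threshold, and that precision, recall, and L1 error are averaged per vertex; the threshold value and the function name `skinning_metrics` are assumptions, not details given in the text.

```python
import numpy as np


def skinning_metrics(W_pred: np.ndarray, W_gt: np.ndarray, thresh: float = 1e-4):
    """Per-vertex precision/recall of influential joints plus average L1 error.

    W_pred, W_gt: (V, J) skinning weight matrices.
    """
    pred_inf = W_pred > thresh           # joints the prediction marks as influential
    gt_inf = W_gt > thresh               # joints the ground truth marks as influential
    tp = (pred_inf & gt_inf).sum(axis=1)
    precision = tp / np.maximum(pred_inf.sum(axis=1), 1)
    recall = tp / np.maximum(gt_inf.sum(axis=1), 1)
    l1 = np.abs(W_pred - W_gt).sum(axis=1)
    return float(precision.mean()), float(recall.mean()), float(l1.mean())
```

Under this reading, high recall with lower precision means the prediction spreads weight onto extra joints, whereas the L1 term measures how far the weight values themselves are from the reference.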

Ablation Study¶

Skeleton Generation Ablation (Arti-XL, spatial ordering):

Configuration	CD-J2J	CD-J2B	CD-B2B
w/o data filtering	2.982	2.327	2.015
4096 points	2.635	2.024	1.727
12288 points	2.685	2.048	1.760
Ours (8192)	2.586	1.959	1.661
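The effect of data filtering can be read directly off the table above. The short calculation below expresses the unfiltered errors relative to the full configuration; it uses only the numbers in the table, and the helper name `relative_increase` is hypothetical.

```python
# Values copied from the ablation table (x 10^-2, lower is better).
filtered = {"CD-J2J": 2.586, "CD-J2B": 1.959, "CD-B2B": 1.661}
unfiltered = {"CD-J2J": 2.982, "CD-J2B": 2.327, "CD-B2B": 2.015}


def relative_increase(base: float, other: float) -> float:
    """Percentage by which `other` exceeds `base`."""
    return (other - base) / base * 100.0


increase = {k: round(relative_increase(filtered[k], unfiltered[k]), 1) for k in filtered}
print(increase)  # {'CD-J2J': 15.3, 'CD-J2B': 18.8, 'CD-B2B': 21.3}
```

The point-count rows differ from the 8192-point setting by only a few percent, so data filtering is the largest single factor in this ablation.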

Skinning Weight Ablation (ModelsResource):

Configuration	Precision	Recall	avg L1
w/o geodesic dist.	81.5%	77.7%	0.444
w/o weights norm	82.0%	77.9%	0.436
w/o shape features	81.4%	81.3%	0.412
Ours	82.1%	81.6%	0.398

Key Findings¶

Cross-Dataset Generalization: When trained on Arti-XL and tested on ModelsResource, the proposed method remains competitive (CD-J2J 4.103), while RigNet degrades significantly across domains (CD-J2J 7.132).
Applicability to AI-Generated Models: On 3D meshes generated by Tripo 2.0, the proposed method produces reasonable skeletons, whereas both RigNet and Pinocchio fail.
VLM Data Filtering is Crucial: Without filtering, all three skeleton errors worsen: CD-J2J, CD-J2B, and CD-B2B rise by roughly 15%, 19%, and 21%, respectively (2.586 → 2.982, 1.959 → 2.327, 1.661 → 2.015).
Spatial Ordering Outperforms Hierarchical Ordering: Spatial ordering allows the model to focus on positional accuracy, while hierarchical ordering requires the model to additionally learn the skeletal hierarchy.

Highlights & Insights¶

Reformulating skeleton generation as sequence prediction is an elegant design that lets an auto-regressive Transformer handle variable-length structures directly.
The combination of functional diffusion and geodesic distance residual learning is natural, effectively fusing physical priors with data-driven methods.
The Articulation-XL dataset (33K+ models) fills a crucial void in this domain, and VLM-assisted quality filtering presents a highly practical data curation strategy.
The complete pipeline outputs standard formats (FBX/GLB) that can be directly imported into Blender/Maya, demonstrating strong practical utility for industrial applications.

Limitations & Future Work¶

The maximum number of joints for skinning weights is restricted to 55, and models exceeding this limit are excluded.
Skeleton generation and skinning weight prediction are partitioned into two independent stages, which might lead to error accumulation.
Skeletal semantics can be ambiguous for highly symmetric shapes or geometries without clear functionality (e.g., abstract artworks).
Humanoid models constitute the largest portion of the dataset; generalization to rarer categories (e.g., mechanical structures) remains to be validated.
Inference relies on sequential auto-regressive decoding, which limits generation speed for large skeletons.

Related Work & Insights¶

RigNet pioneered the framework of learning both skeletons and skinning weights, but its graph neural networks are sensitive to shape orientation; this work circumvents the issue through auto-regressive sequence modeling.
The concept of auto-regressive mesh generation from MeshGPT/MeshAnythingV2 is successfully transferred to skeleton generation, showing that the approach carries over across tasks.
The functional diffusion framework builds on prior work on functional diffusion (Zhang and Wonka, CVPR 2024) and is applied to skinning weight prediction for the first time.
Insight: Large-scale annotated data + VLM-driven quality assurance + sequence modeling = a viable path toward general-purpose 3D automation.

Rating¶

⭐⭐⭐⭐