MagicArticulate: Make Your 3D Models Articulation-Ready

Conference: CVPR 2025
arXiv: 2502.12135
Code: GitHub
Area: 3D Vision
Keywords: 3D articulation, skeleton generation, skinning weight prediction, auto-regressive transformer, functional diffusion, Articulation-XL

TL;DR

This paper proposes MagicArticulate, a two-stage framework that automatically converts static 3D models into animatable assets. The first stage formulates skeleton generation as a sequence prediction task solved with an auto-regressive Transformer. The second stage predicts skinning weights via a functional diffusion process combined with a volumetric geodesic distance prior. The work also contributes Articulation-XL, a large-scale dataset of 33K+ models.

Background & Motivation

Background: The demand for animatable 3D models has grown rapidly in fields such as gaming, VR/AR, and robotic simulation. However, converting a static model into an animatable one (skeleton + skinning weights) traditionally relies on manual rigging by professional artists, which is time-consuming and labor-intensive.

Limitations of Prior Work:

  1. Template-based methods (e.g., Pinocchio) rely on predefined skeleton templates, which only apply to specific categories like humans and fail to generalize to diverse structures.
  2. Template-free methods (e.g., curve skeleton extraction) often generate excessively dense joints that are unsuitable for animation.
  3. Learning-based methods (e.g., RigNet) rely on hand-crafted features and assumptions about shape orientation, limiting their generalization across categories.
  4. The lack of large-scale benchmark datasets hinders the development of general-purpose solutions.

Key Challenge: Skeleton structures of different 3D objects vary drastically (with bone counts ranging from 2 to 100+), requiring flexible handling of variable-length structures, while skinning weights must transition smoothly across complex mesh topologies.

Key Insight: Constructing a large-scale dataset + utilizing auto-regressive sequence modeling for variable-length skeletons + leveraging functional diffusion for smooth, continuous skinning weights.

Method

Overall Architecture

A two-stage pipeline:

  1. Skeleton Generation Stage: Input 3D mesh → sample point cloud → extract shape tokens via a pre-trained shape encoder → auto-regressive Transformer generates bone tokens sequentially → detokenization yields skeleton coordinates and joint connectivity.
  2. Skinning Weight Prediction Stage: Input mesh + generated skeleton → functional diffusion framework predicts the vertex-to-joint skinning weight matrix → export to standard formats (FBX/GLB).

Key Designs

1. Articulation-XL Large-Scale Dataset

  • Function: Curates 33K+ 3D models from Objaverse-XL, annotated with high-quality skeletons and skinning weights.
  • Mechanism: A three-stage pipeline:
    - (a) Initial filtering: deduplication, plus exclusion of single-joint shapes and of models with more than 100 bones, leaving 38.8K models.
    - (b) VLM filtering: GPT-4o assesses skeleton quality from four rendered camera views.
    - (c) Category labeling: VLMs automatically annotate category labels.
  • Design Motivation: Addresses the field's fundamental bottleneck, the lack of large-scale datasets. VLM filtering removes poorly defined skeletons, which the ablation study shows lowers CD-J2J from 2.982 to 2.586 (a reduction of roughly 13%).

2. Auto-Regressive Skeleton Generation (Sequence Modeling)

  • Function: Represents a skeleton as a sequence of bones (each bone defined by 6 coordinates, i.e., its two joint endpoints) and generates them auto-regressively using an OPT-350M decoder-only Transformer.
  • Mechanism:
    - Skeleton tokenization: normalize to \([-0.5, 0.5]^3\) → discretize into a \(128^3\) grid → 6 tokens per bone.
    - Two ordering strategies: spatial ordering (ascending z-y-x order) and hierarchical ordering (layer by layer, following the skeletal hierarchy).
    - Shape conditioning: sample 8192 points → extract features with the pre-trained shape encoder → prepend 257 shape tokens to the sequence.
    - Training: cross-entropy loss for next-token prediction.
  • Design Motivation: Auto-regressive modeling naturally handles variable-length sequences (bone counts range from 2 to 100 across models) and captures bone-to-bone dependencies. VQ-VAE compression is skipped because the sequences are relatively short (\(\le 600\) tokens, i.e., at most 100 bones × 6 tokens).
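The tokenization and spatial ordering steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names (`normalize`, `quantize`, `tokenize_skeleton`, `detokenize_skeleton`) are hypothetical, bounding-box normalization is assumed, and "z-y-x ascending" is read here as sorting bones by the z, then y, then x bin of their first endpoint.

```python
import numpy as np

GRID = 128  # bins per axis, matching the 128^3 grid described above

def normalize(points, lo, hi):
    """Uniformly scale points into [-0.5, 0.5]^3 using a bounding box (assumed scheme)."""
    center = (lo + hi) / 2.0
    scale = max(float((hi - lo).max()), 1e-8)  # longest box edge; guard against zero
    return (points - center) / scale

def quantize(coords):
    """Map coordinates in [-0.5, 0.5] to integer bins 0..GRID-1."""
    bins = np.floor((coords + 0.5) * GRID).astype(np.int64)
    return np.clip(bins, 0, GRID - 1)

def dequantize(tokens):
    """Map integer bins back to bin-center coordinates."""
    return (np.asarray(tokens, dtype=np.float64) + 0.5) / GRID - 0.5

def tokenize_skeleton(joints, bones):
    """joints: (J, 3) normalized floats; bones: list of (joint_a, joint_b) index pairs.
    Returns a flat array with 6 tokens per bone, bones in spatial order."""
    q = quantize(joints)
    rows = np.array([np.concatenate([q[a], q[b]]) for a, b in bones])
    # np.lexsort treats the LAST key as primary: z, y, x of the first endpoint,
    # then z, y, x of the second endpoint as tie-breakers.
    order = np.lexsort((rows[:, 3], rows[:, 4], rows[:, 5],
                        rows[:, 0], rows[:, 1], rows[:, 2]))
    return rows[order].reshape(-1)

def detokenize_skeleton(tokens):
    """Inverse mapping: (B, 2, 3) array of endpoint coordinates per bone."""
    return dequantize(np.asarray(tokens).reshape(-1, 2, 3))

# Toy skeleton with three joints and two bones.
joints_raw = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 1.0]])
bones = [(1, 2), (0, 1)]
joints = normalize(joints_raw, joints_raw.min(axis=0), joints_raw.max(axis=0))
tokens = tokenize_skeleton(joints, bones)
```

With a 128-bin grid the round-trip error per coordinate is at most half a bin (1/256 of the normalized extent), and a 100-bone skeleton produces exactly 600 tokens, which is where the sequence-length bound above comes from.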

3. Functional Diffusion Skinning Weight Prediction

  • Function: Treats skinning weights as a continuous function \(\mathbb{R}^3 \to \mathbb{R}^n\) over the mesh surface (with \(n\) the number of joints) and denoises it with DDPM-based functional diffusion.
  • Mechanism:
    - Introduces a volumetric geodesic distance prior \(\mathcal{G}\), so that the model only has to learn the residual \(f: \mathcal{P} \to (\mathcal{W} - \mathcal{G})\).
    - The diffusion process adds noise to the skinning weight function, and the denoising network recovers the original weights.
    - Conditioning signals: joint coordinates + global shape features (from the pre-trained encoder).
    - Skinning weights and geodesic distances are normalized to \([-1, 1]\) before noise is added.
  • Design Motivation: Functional diffusion naturally models continuous, high-dimensional weight distributions. The geodesic distance prior provides physically meaningful guidance: in the ModelsResource ablation, removing it lowers precision by 0.6 percentage points (82.1% → 81.5%) and recall by 3.9 percentage points (81.6% → 77.7%).
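The residual formulation \(\mathcal{W} - \mathcal{G}\) and the \([-1, 1]\) normalization can be sketched as follows. This is an illustration under assumptions, not the paper's code: the prior is taken as a precomputed per-vertex, per-joint array (how it is derived from volumetric geodesic distances is not specified in this summary), min-max scaling is assumed for the prior, and the final clip-and-renormalize step is an added convenience. All function names are hypothetical.

```python
import numpy as np

def to_unit_range(a, lo, hi):
    """Affinely map values from [lo, hi] to [-1, 1]."""
    return 2.0 * (a - lo) / (hi - lo) - 1.0

def from_unit_range(a, lo, hi):
    """Inverse of to_unit_range."""
    return (a + 1.0) / 2.0 * (hi - lo) + lo

def residual_target(weights, prior):
    """weights: (N, J) skinning weights in [0, 1]; prior: (N, J) geodesic-based prior.
    Returns the residual the network is trained on, plus the normalized prior."""
    w = to_unit_range(weights, 0.0, 1.0)
    g = to_unit_range(prior, prior.min(), prior.max())  # assumed min-max scaling
    return w - g, g

def recover_weights(residual, g):
    """Add the prior back, undo the normalization, and renormalize each vertex row."""
    w = from_unit_range(residual + g, 0.0, 1.0)
    w = np.clip(w, 0.0, None)  # assumption: negative weights are clipped away
    return w / np.maximum(w.sum(axis=1, keepdims=True), 1e-8)

# Toy example: 2 vertices, 2 joints.
weights = np.array([[0.7, 0.3], [0.1, 0.9]])
prior = np.array([[0.2, 1.0], [1.5, 0.4]])  # stand-in values, not real geodesic data
residual, g = residual_target(weights, prior)
restored = recover_weights(residual, g)
```

Because the prior is added back after denoising, a perfect residual prediction reproduces the original weights exactly, which is what `restored` shows for the toy input.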

Loss & Training

  • Skeleton generation: Cross-entropy loss \(\mathcal{L}_{pred} = \text{CE}(\mathbf{T}, \hat{\mathbf{T}})\)
  • Skinning weight: \(x_0\)-prediction MSE loss \(\mathcal{L}_{denoise} = \|D_\theta(\{x, f_t(x)\}, t) - f_0(x)\|_2^2\)
  • DDPM scheduler, 1000 timesteps, linear beta schedule.
  • Data augmentation: scaling, translation, rotation.
  • Hardware: 8\(\times\)A100 GPUs; skeleton training takes ~2 days and skinning training ~1 day.
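The denoising objective above can be made concrete with a small NumPy sketch of the forward noising step and the \(x_0\)-prediction loss. The 1000 timesteps and linear schedule come from the list above; the schedule endpoints `1e-4` and `0.02` are the common DDPM defaults and are an assumption here, as is the `denoiser(points, noisy_values, t)` call signature.

```python
import numpy as np

T = 1000  # number of diffusion timesteps, as listed above
# Linear beta schedule; the endpoints are assumed, not given in this summary.
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)  # cumulative signal retention per timestep

def add_noise(f0, t, rng):
    """Sample f_t from q(f_t | f_0) for function values f0 of shape (N, J)."""
    eps = rng.standard_normal(f0.shape)
    ft = np.sqrt(alphas_bar[t]) * f0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return ft, eps

def x0_prediction_loss(denoiser, points, f0, t, rng):
    """Squared L2 error between the denoiser output and the clean values f0,
    averaged over the sampled points."""
    ft, _ = add_noise(f0, t, rng)
    pred = denoiser(points, ft, t)  # hypothetical signature
    return float(np.mean(np.sum((pred - f0) ** 2, axis=-1)))

# Toy check with an oracle "denoiser" that always returns the clean values.
rng = np.random.default_rng(0)
points = rng.uniform(-0.5, 0.5, size=(5, 3))
f0 = rng.uniform(-1.0, 1.0, size=(5, 4))
oracle_loss = x0_prediction_loss(lambda p, ft, t: f0, points, f0, 500, rng)
```

Predicting \(f_0\) directly, rather than the noise, means the network output lives in the same normalized range as the training targets.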

Key Experimental Results

Main Results — Skeleton Generation (metrics \(\times 10^{-2}\), lower is better)

| Method | Dataset | CD-J2J | CD-J2B | CD-B2B |
|---|---|---|---|---|
| Pinocchio | Arti-XL | 8.360 | 6.677 | 5.689 |
| RigNet | Arti-XL | 7.478 | 5.892 | 4.932 |
| Ours-spatial | Arti-XL | 2.586 | 1.959 | 1.661 |
| RigNet | ModelsResource | 4.143 | 2.961 | 2.675 |
| Ours-spatial | ModelsResource | 3.343 | 2.455 | 2.140 |
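For readers unfamiliar with the CD-J2J column, the sketch below computes a joint-to-joint Chamfer distance between a predicted and a reference joint set. The exact definition used in the paper is not given in this summary; a symmetric average of mean nearest-neighbor distances is assumed, and `chamfer_j2j` is a hypothetical name.

```python
import numpy as np

def chamfer_j2j(pred, gt):
    """Symmetric Chamfer distance between joint sets pred (P, 3) and gt (G, 3).
    Assumed form: average of the two directed mean nearest-neighbor distances."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (P, G) pairwise
    pred_to_gt = d.min(axis=1).mean()  # each predicted joint to its nearest reference joint
    gt_to_pred = d.min(axis=0).mean()  # each reference joint to its nearest predicted joint
    return 0.5 * (pred_to_gt + gt_to_pred)

pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.0]])
cd = chamfer_j2j(pred, gt)
```

CD-J2B and CD-B2B follow the same pattern but measure joint-to-bone and bone-to-bone distances, which requires point-to-segment distances instead of point-to-point ones.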

Main Results — Skinning Weights (Precision/Recall: higher is better; L1: lower is better)

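| Method | Dataset | Precision | Recall | avg L1 |
|---|---|---|---|---|
| GVB | Arti-XL | 75.7% | 68.3% | 0.724 |
| RigNet | Arti-XL | 72.4% | 71.1% | 0.698 |
| Ours | Arti-XL | 80.7% | 77.2% | 0.337 |
| GVB | ModelsResource | 69.3% | 79.2% | 0.687 |
| RigNet | ModelsResource | 77.1% | 83.5% | 0.464 |
| Ours | ModelsResource | 82.1% | 81.6% | 0.398 |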
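The three skinning metrics can be sketched as below. The definitions are an assumption borrowed from common practice in prior rigging work rather than taken from this summary: a joint counts as influencing a vertex when its weight exceeds a small threshold, precision and recall compare the predicted and reference sets of influencing joints per vertex, and avg L1 is the per-vertex L1 difference of weight vectors averaged over vertices. The threshold value and the name `skinning_metrics` are hypothetical.

```python
import numpy as np

def skinning_metrics(pred, gt, thresh=1e-4):
    """pred, gt: (N, J) skinning weight matrices. Returns (precision, recall, avg_l1)."""
    pred_on, gt_on = pred > thresh, gt > thresh  # joints that influence each vertex
    hits = (pred_on & gt_on).sum(axis=1)         # correctly predicted influences
    precision = np.mean(hits / np.maximum(pred_on.sum(axis=1), 1))
    recall = np.mean(hits / np.maximum(gt_on.sum(axis=1), 1))
    avg_l1 = np.mean(np.abs(pred - gt).sum(axis=1))
    return float(precision), float(recall), float(avg_l1)

pred = np.array([[0.6, 0.4, 0.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.5, 0.0, 0.5], [1.0, 0.0, 0.0]])
precision, recall, avg_l1 = skinning_metrics(pred, gt)
```

Under this reading, precision penalizes weights assigned to joints that should not influence a vertex, while recall penalizes missing influences, which explains why a method can lead on one and trail on the other.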

Ablation Study

Skeleton Generation Ablation (Arti-XL, spatial ordering):

| Configuration | CD-J2J | CD-J2B | CD-B2B |
|---|---|---|---|
| w/o data filtering | 2.982 | 2.327 | 2.015 |
| 4096 points | 2.635 | 2.024 | 1.727 |
| 12288 points | 2.685 | 2.048 | 1.760 |
| Ours (8192) | 2.586 | 1.959 | 1.661 |
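As a quick arithmetic check on the filtering ablation, the snippet below computes how much each error grows when data filtering is removed, relative to the full configuration. The numbers are copied from the table above; the variable names are illustrative only.

```python
# Values copied from the skeleton-generation ablation (x 10^-2, lower is better).
full = {"CD-J2J": 2.586, "CD-J2B": 1.959, "CD-B2B": 1.661}
no_filtering = {"CD-J2J": 2.982, "CD-J2B": 2.327, "CD-B2B": 2.015}

# Relative error increase when the filtering step is dropped.
relative_increase = {
    name: (no_filtering[name] - full[name]) / full[name] for name in full
}

for name, value in relative_increase.items():
    print(f"{name}: +{100 * value:.1f}%")
```

Measured this way, the errors grow by roughly 15% to 21%, with the joint-to-joint metric the least affected.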

Skinning Weight Ablation (ModelsResource):

| Configuration | Precision | Recall | avg L1 |
|---|---|---|---|
| w/o geodesic dist. | 81.5% | 77.7% | 0.444 |
| w/o weights norm | 82.0% | 77.9% | 0.436 |
| w/o shape features | 81.4% | 81.3% | 0.412 |
| Ours | 82.1% | 81.6% | 0.398 |

Key Findings

  1. Cross-Dataset Generalization: Trained on Arti-XL and tested on ModelsResource, the proposed method remains competitive (CD-J2J 4.103), while RigNet degrades significantly across domains (CD-J2J 7.132).
  2. Applicability to AI-Generated Models: On 3D meshes generated by Tripo 2.0, the proposed method produces reasonable skeletons, whereas both RigNet and Pinocchio fail.
  3. VLM Data Filtering is Crucial: Without the filtering process, all three Chamfer distances get worse (for example, CD-J2J rises from 2.586 to 2.982, an increase of about 15%).
  4. Spatial Ordering Outperforms Hierarchical Ordering: Spatial ordering allows the model to focus on positional accuracy, while hierarchical ordering requires the model to additionally learn the skeletal hierarchy.

Highlights & Insights

  • Reformulating skeleton generation as sequence prediction elegantly leverages auto-regressive Transformers to handle variable-length structures.
  • The combination of functional diffusion and geodesic distance residual learning is natural, effectively fusing physical priors with data-driven methods.
  • The Articulation-XL dataset (33K+ models) fills a crucial void in this domain, and VLM-assisted quality filtering presents a highly practical data curation strategy.
  • The complete pipeline outputs standard formats (FBX/GLB) that can be directly imported into Blender/Maya, demonstrating strong practical utility for industrial applications.

Limitations & Future Work

  • The maximum number of joints for skinning weights is restricted to 55, and models exceeding this limit are excluded.
  • Skeleton generation and skinning weight prediction are partitioned into two independent stages, which might lead to error accumulation.
  • Skeletal semantics can be ambiguous for highly symmetric shapes or geometries without clear functionality (e.g., abstract artworks).
  • Humanoid models constitute the largest portion of the dataset; generalization to rarer categories (e.g., mechanical structures) remains to be validated.
  • Inference relies on sequential auto-regressive decoding, which limits the speed of generating large-scale skeletons.

Related Work & Insights

  • RigNet pioneered the framework of learning both skeleton and skinning weights, but its graph neural networks are sensitive to shape orientation; this work circumvents the issue via auto-regressive generation.
  • The concept of auto-regressive mesh generation from MeshGPT/MeshAnythingV2 is successfully transferred to skeleton generation, showing that the approach carries over across tasks.
  • The functional diffusion framework is adapted from Functional Diffusion (Zhang and Wonka) and applied to skinning weight prediction for the first time.
  • Insight: Large-scale annotated data + VLM-driven quality assurance + sequence modeling = a viable path toward general-purpose 3D automation.

Rating

⭐⭐⭐⭐