ArtFormer: Controllable Generation of Diverse 3D Articulated Objects¶

Conference: CVPR 2025
arXiv: 2412.07237
Code: https://github.com/ShuYuMo2003/ArtFormer
Area: 3D Generation / Articulated Object Modeling
Keywords: Articulated Objects, Tree Structure Parameterization, Shape Prior, Controllable Generation, Text/Image Guidance

TL;DR¶

This work proposes the ArtFormer framework, which generates high-quality, diverse, and kinematically accurate 3D articulated objects from text/image descriptions via tree structure parameterization and a conditional diffusion shape prior, significantly outperforming existing methods in generation quality and diversity.

Background & Motivation¶

Background¶

Background: Research on generating articulated objects (multiple rigid bodies connected by joints) remains limited. NAP is restricted by pre-defined graph structures, while CAGE/SINGAPO rely on retrieval-based methods, which limits diversity.

Key Challenge: Two tensions must be balanced: quality vs. flexibility (fixed structures restrict diversity), and geometric quality vs. kinematic accuracy (the two constraints can conflict).

Key Insight: An articulated object is essentially a tree of parts connected by joints. ArtFormer builds on this with three components: Tree Position Encoding captures the hierarchical relationships, the Shape Prior ensures geometric quality, and Gumbel-Softmax sampling expands diversity.

Proposed Approach¶

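Goal: Generate high-quality, diverse, and kinematically accurate 3D articulated objects under text or image guidance.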

Method¶

Overall Architecture¶

Phase 1: Shape Prior Pre-training (using VAE + SDF + conditional diffusion model to learn geometric priors)
Phase 2: Articulation Transformer (using Tree Position Encoding + iterative decoding to generate nodes layer by layer)

Key Designs¶

Tree Structure Parameterization: Each node stores geometric attributes (bbox \(b_i \in \mathbb{R}^6\), latent code \(z_i \in \mathbb{R}^{768}\)) + kinematic attributes (joint axis \(j_i \in \mathbb{R}^6\), motion range \(l_i \in \mathbb{R}^4\)), with a total dimension of \(D=785\).
Shape Prior (Gumbel-Softmax Sampling): Decomposes the geometric code into 4 components, each selected from its own independent codebook of \(N\) entries. The 4 codebooks hold only \(4N\) entries in total but span \(N^4\) combinations, which significantly increases diversity without increasing computational cost.
Tree Position Encoding (TPE): Encodes the root-to-node path using a bidirectional GRU for absolute position, and concatenates path node encodings for relative position. This supports arbitrary tree structures.
Iterative Decoding: Predicts child nodes for existing nodes round by round until all outputs are termination tokens, effectively capturing inter-part dependencies.
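The node layout and the round-by-round decoding loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Node`, `random_node`, `decode_tree`, and `toy_propose` are hypothetical names, and the predictor is a toy stand-in for the Articulation Transformer. Note that the four listed attributes sum to 6 + 768 + 6 + 4 = 784; the sketch assumes the remaining slot of \(D=785\) is a scalar termination flag, which the summary does not state.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional
import random

BBOX_DIM, LATENT_DIM, JOINT_DIM, LIMIT_DIM = 6, 768, 6, 4  # sizes from the text
D = 785  # total from the text; the four attributes alone sum to 784


@dataclass
class Node:
    bbox: List[float]
    latent: List[float]
    joint: List[float]
    limit: List[float]
    depth: int = 0
    children: List["Node"] = field(default_factory=list)

    def to_vector(self, is_end: float = 0.0) -> List[float]:
        # ASSUMPTION: the one remaining slot is a termination flag.
        return self.bbox + self.latent + self.joint + self.limit + [is_end]


def random_node(rng: random.Random, depth: int) -> Node:
    """Hypothetical helper: fills a node with placeholder values."""
    draw = lambda n: [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return Node(draw(BBOX_DIM), draw(LATENT_DIM), draw(JOINT_DIM), draw(LIMIT_DIM), depth)


def decode_tree(
    root: Node,
    propose_child: Callable[[Node], Optional[Node]],
    max_rounds: int = 16,
) -> List[Node]:
    """Grow the tree round by round; None plays the role of a termination token."""
    nodes = [root]
    for _ in range(max_rounds):
        # One round: every existing node is asked for one more child.
        proposals = [(parent, propose_child(parent)) for parent in nodes]
        accepted = [(p, c) for p, c in proposals if c is not None]
        if not accepted:  # all outputs were termination tokens
            break
        for parent, child in accepted:
            parent.children.append(child)
            nodes.append(child)
    return nodes


rng = random.Random(0)


def toy_propose(parent: Node) -> Optional[Node]:
    # Stand-in for the transformer: the root gets two children, other nodes none.
    if parent.depth == 0 and len(parent.children) < 2:
        return random_node(rng, parent.depth + 1)
    return None


nodes = decode_tree(random_node(rng, 0), toy_propose)
print(len(nodes), len(nodes[0].to_vector()))  # 3 785
```

With the toy predictor, the root receives one child in each of the first two rounds and the third round yields only termination outputs, so decoding stops with three nodes.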

Loss & Training¶

\(L_{trans} = \beta_o L_o + \beta_P L_P + L_a\), where \(L_o\) is the termination classification loss, \(L_P\) the codebook KL term, and \(L_a\) the attribute MSE.
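As a rough sketch of how the three terms combine, the snippet below computes the weighted sum for a single node. The concrete forms are assumptions made for illustration only: binary cross-entropy for the termination term, a discrete KL divergence for the codebook term (the direction is a guess), and mean squared error for the attributes. The function names and the default weights of 1.0 are hypothetical.

```python
import math
from typing import Sequence


def bce(p: float, y: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one termination probability (assumed form of L_o)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def kl(q: Sequence[float], p: Sequence[float], eps: float = 1e-12) -> float:
    """Discrete KL(q || p) over codebook entries (assumed form of L_P)."""
    return sum(qi * math.log((qi + eps) / (pi + eps)) for qi, pi in zip(q, p) if qi > 0)


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error over node attributes (L_a)."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)


def transformer_loss(p_end, y_end, q_code, p_code, attr_pred, attr_true,
                     beta_o: float = 1.0, beta_p: float = 1.0) -> float:
    # L_trans = beta_o * L_o + beta_P * L_P + L_a
    return beta_o * bce(p_end, y_end) + beta_p * kl(q_code, p_code) + mse(attr_pred, attr_true)


# Matching codebook distributions and attributes leave only the termination term.
loss = transformer_loss(0.5, 1.0, [0.25, 0.75], [0.25, 0.75], [1.0, 2.0], [1.0, 2.0])
print(abs(loss - math.log(2)) < 1e-9)  # True
```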

Key Experimental Results¶

Main Results¶

Method	POR↓	MMD↓	COV↑	DS↑
NAP-128	0.805	0.3085	0.7021	0.13
CAGE	0.251	0.6064	0.5319	0.07
Ours	0.709	0.5213	0.5266	0.67

Ablation Study¶

Configuration	POR↓	MMD↓	COV↑
Full Model	0.709	0.5213	0.5266
w/o TPE	1.170	0.5000	0.5053
w/o Shape Prior	2.502	0.4574	0.7606

Key Findings¶

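The diversity metric of the full model (DS = 0.67) is far higher than that of existing methods (0.13 for NAP-128, 0.07 for CAGE).
Removing TPE raises POR from 0.709 to 1.170, an increase of about 65%, demonstrating the necessity of positional information.
Removing the Shape Prior degrades POR most severely (0.709 to 2.502); the MMD and COV columns of the ablation table do not show a matching drop.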
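The relative POR changes can be checked directly from the ablation table, assuming all three POR entries are on the same scale:

\[
\frac{1.170 - 0.709}{0.709} \approx 0.650, \qquad \frac{2.502 - 0.709}{0.709} \approx 2.529
\]

That is roughly a 65% increase without TPE and a 253% increase without the Shape Prior.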

Highlights & Insights¶

Tree structure parameterization elegantly simplifies the representation of articulated objects.
Gumbel-Softmax sampling over 4 independent codebooks expands the diversity space (\(4N\) stored entries, \(N^4\) combinations) without increasing computational overhead.
Tree Position Encoding (TPE) is a key innovation, as clearly demonstrated by the ablation study.
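To make the \(4N\) versus \(N^4\) counting concrete, here is a small sketch of Gumbel-Softmax sampling over four codebooks. It is an illustration under stated assumptions rather than the paper's implementation: the codebook size `N = 8` is arbitrary, the 768-dimensional latent is assumed to be the concatenation of four equal 192-dimensional chunks, and the random logits stand in for network outputs.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        noisy.append((logit - math.log(-math.log(u))) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


N = 8                        # entries per codebook (hypothetical size)
NUM_BOOKS = 4                # number of codebooks, from the text
SUB_DIM = 768 // NUM_BOOKS   # ASSUMPTION: equal chunks, concatenated

rng = random.Random(0)
codebooks = [[[rng.gauss(0, 1) for _ in range(SUB_DIM)] for _ in range(N)]
             for _ in range(NUM_BOOKS)]
logits = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(NUM_BOOKS)]  # stand-in


def sample_latent(logits, codebooks, tau, rng):
    """Draw one soft selection per codebook and concatenate the mixed entries."""
    latent = []
    for book_logits, book in zip(logits, codebooks):
        weights = gumbel_softmax(book_logits, tau, rng)
        for d in range(len(book[0])):
            latent.append(sum(w * entry[d] for w, entry in zip(weights, book)))
    return latent


z = sample_latent(logits, codebooks, tau=0.5, rng=rng)
stored_entries = NUM_BOOKS * N    # what has to be stored: 4N
combinations = N ** NUM_BOOKS     # distinct hard selections: N^4
print(len(z), stored_entries, combinations)  # 768 32 4096
```

As the temperature `tau` decreases, the weights approach a one-hot vector, so each chunk approaches a single codebook entry and the sample approaches one of the \(N^4\) discrete combinations.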

Limitations & Future Work¶

Training was conducted on only 6 object categories, with limited sample sizes for some categories.
Quantitative conditional control (e.g., rotation angles) is challenging.
Multi-category training using SDF exhibits a drop in generalization performance.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ Integration of Tree Position Encoding, Shape Prior, and Gumbel-Softmax
Experimental Thoroughness: ⭐⭐⭐⭐ Detailed ablation study; lacks cross-dataset validation
Writing Quality: ⭐⭐⭐⭐⭐ Clear figures and tables, logically rigorous
Value: ⭐⭐⭐⭐ Strong potential for applications in robotics and digital twins
