ArtFormer: Controllable Generation of Diverse 3D Articulated Objects¶
Conference: CVPR 2025
arXiv: 2412.07237
Code: https://github.com/ShuYuMo2003/ArtFormer
Area: 3D Generation / Articulated Object Modeling
Keywords: Articulated Objects, Tree Structure Parameterization, Shape Prior, Controllable Generation, Text/Image Guidance
TL;DR¶
This work proposes the ArtFormer framework, which generates high-quality, diverse, and kinematically accurate 3D articulated objects from text/image descriptions via tree structure parameterization and a conditional diffusion shape prior, significantly outperforming existing methods in generation quality and diversity.
Background & Motivation¶
Background¶
Background: Research on generating articulated objects (multiple rigid bodies connected by joints) remains limited. NAP is restricted by pre-defined graph structures, while CAGE/SINGAPO rely on retrieval-based methods, which limits diversity.
Key Challenge: Quality vs. flexibility (fixed structures restrict diversity) and geometric quality vs. kinematic accuracy (conflicts between the two constraints).
Key Insight: Articulated objects are essentially tree-structured \(\rightarrow\) Tree Position Encoding captures hierarchical relationships + Shape Prior ensures geometric quality + Gumbel-Softmax expands diversity.
Proposed Approach¶
Goal:
Overall Architecture¶
Phase 1: Shape Prior Pre-training (using VAE + SDF + conditional diffusion model to learn geometric priors)
Phase 2: Articulation Transformer (using Tree Position Encoding + iterative decoding to generate nodes layer by layer)
Key Designs¶
- Tree Structure Parameterization: Each node stores geometric attributes (bbox \(b_i \in \mathbb{R}^6\), latent code \(z_i \in \mathbb{R}^{768}\)).
Method¶
Overall Architecture¶
Phase 1: Shape Prior Pre-training (using VAE + SDF + conditional diffusion model to learn geometric priors)
Phase 2: Articulation Transformer (using Tree Position Encoding + iterative decoding to generate nodes layer by layer)
Key Designs¶
-
Tree Structure Parameterization: Each node stores geometric attributes (bbox \(b_i \in \mathbb{R}^6\), latent code \(z_i \in \mathbb{R}^{768}\)) + kinematic attributes (joint axis \(j_i \in \mathbb{R}^6\), motion range \(l_i \in \mathbb{R}^4\)), with a total dimension of \(D=785\).
-
Shape Prior (Gumbel-Softmax Sampling): Decomposes the geometric code into 4 components, expanding the latent space from \(4N\) to \(N^4\) via 4 independent codebooks. This significantly increases diversity without increasing computational costs.
-
Tree Position Encoding (TPE): Encodes the root-to-node path using a bidirectional GRU for absolute position, and concatenates path node encodings for relative position. This supports arbitrary tree structures.
-
Iterative Decoding: Predicts child nodes for existing nodes round by round until all outputs are termination tokens, effectively capturing inter-part dependencies.
Loss & Training¶
\(L_{trans} = \beta_o L_o + \beta_P L_P + L_a\) (termination classification + codebook KL + attribute MSE)
Key Experimental Results¶
Main Results¶
| Method | POR↓ | MMD↓ | COV↑ | DS↑ |
|---|---|---|---|---|
| NAP-128 | 0.805 | 0.3085 | 0.7021 | 0.13 |
| CAGE | 0.251 | 0.6064 | 0.5319 | 0.07 |
| Ours | 0.709 | 0.5213 | 0.5266 | 0.67 |
Ablation Study¶
| Configuration | POR↓ | MMD↓ | COV↑ |
|---|---|---|---|
| Full Model | 0.709 | 0.5213 | 0.5266 |
| w/o TPE | 1.170% | 0.5000 | 0.5053 |
| w/o Shape Prior | 2.502% | 0.4574 | 0.7606 |
Key Findings¶
- The diversity metric (DS=0.67) is significantly higher than existing methods (CAGE is only 0.07).
- After removing TPE, POR increases by 65%, demonstrating the necessity of positional information.
- All metrics severely degrade when the Shape Prior is removed.
Highlights & Insights¶
- Tree structure parameterization elegantly simplifies the representation of articulated objects.
- Gumbel-Softmax expands the diversity space (\(4N \rightarrow N^4\)) without increasing computational overhead.
- Tree Position Encoding (TPE) is a key innovation, as clearly demonstrated by the ablation study.
Limitations & Future Work¶
- Training was conducted on only 6 object categories, with limited sample sizes for some categories.
- Quantitative conditional control (e.g., rotation angles) is challenging.
- Multi-category training using SDF exhibits a drop in generalization performance.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Integration of Tree Position Encoding, Shape Prior, and Gumbel-Softmax
- Experimental Thoroughness: ⭐⭐⭐⭐ Detailed ablation study; lacks cross-dataset validation
- Writing Quality: ⭐⭐⭐⭐⭐ Clear figures and tables, logically rigorous
- Value: ⭐⭐⭐⭐ Strong potential for applications in robotics and digital twins