AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation
Conference: ECCV 2024
arXiv: 2312.03795
Code: https://zz7379.github.io/AnimatableDreamer/
Area: Model Compression
Keywords: Text-to-4D, Non-rigid 3D, Score Distillation, Skeleton Animation, Canonical Space
TL;DR
AnimatableDreamer extracts a skeleton and motion from a monocular video and generates text-guided, animatable non-rigid 3D models via Canonical Score Distillation (CSD), outperforming the compared baselines in both generation quality and temporal consistency.
Background & Motivation
Background: Text-to-3D generation (SDS/DreamFusion) is capable of generating high-quality static 3D objects, but generating deformable/non-rigid objects remains challenging. Existing methods are either limited to specific categories (e.g., humans) or fail to guarantee morphological consistency across different poses.
Limitations of Prior Work:
- Applying vanilla SDS directly to animatable objects disrupts motion consistency, leading to incoherent surfaces generated under different poses.
- Reconstructing deformable objects from monocular videos yields poor geometric quality in unobserved regions (e.g., the other side of an animal).
- Existing methods (e.g., BANMo) require multiple video sequences to obtain high-quality 3D reconstructions.
Key Challenge: Generating animatable non-rigid 3D models requires SDS supervision, but SDS is applied to posed (observation-space) renderings, and nothing ties that supervision back to the canonical space in a consistent way.
Goal: (a) How to extract reusable skeletons/skinning from monocular videos; (b) How to generate new animatable 3D models guided by text under skeletal constraints.
Key Insight: The warping field is utilized as a bridge between the canonical space and the observation space, allowing gradients from the diffusion model to propagate back to the canonical model through warping.
Core Idea: Canonical Score Distillation: calculating diffusion prior gradients on observed deformed poses and backpropagating them to the canonical model via differentiable warping, ensuring consistency across all poses.
Method
Overall Architecture
Two stages: (1) Extraction stage: learn an implicit articulate model (NeuS + skeletal warping field) from a monocular video and enhance unobserved regions with CSD; (2) Generation stage: generate a new text-guided 3D model on the extracted skeleton using CSD + MVDream.
Key Designs
- Implicit Articulate Model:
  - Function: Representing the canonical model using NeuS (SDF + color + feature descriptor), and performing deformation via linear blend skinning (LBS).
  - Mechanism: Skinning weights for \(B\) bones are modeled using Gaussian distributions and calculated via Mahalanobis distance. Bone transformations are learned using an MLP with Fourier temporal embeddings.
  - Design Motivation: The canonical space guarantees temporal consistency, while skeletal skinning provides controllable animation capabilities.
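The skinning step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the softmax normalization over negative squared Mahalanobis distances is an assumption (the summary only says the weights are Gaussian and use Mahalanobis distance), and the bone rotations and translations are passed in directly rather than predicted by an MLP with Fourier temporal embeddings.

```python
import numpy as np

def gaussian_skinning_weights(points, centers, precisions):
    """Skinning weights from squared Mahalanobis distances to each bone.

    points: (N, 3); centers: (B, 3); precisions: (B, 3, 3) inverse covariances.
    Returns (N, B) weights; each row sums to 1.
    ASSUMPTION: weights are normalized with a softmax over -distance^2.
    """
    diff = points[:, None, :] - centers[None, :, :]             # (N, B, 3)
    d2 = np.einsum("nbi,bij,nbj->nb", diff, precisions, diff)   # squared Mahalanobis
    logits = -d2
    logits = logits - logits.max(axis=1, keepdims=True)         # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def linear_blend_skinning(points, weights, rotations, translations):
    """Blend per-bone rigid transforms: x' = sum_b w_b * (R_b x + t_b).

    rotations: (B, 3, 3); translations: (B, 3).
    """
    per_bone = np.einsum("bij,nj->nbi", rotations, points) + translations[None]
    return np.einsum("nb,nbi->ni", weights, per_bone)           # (N, 3)
```

With two bones centered at (0, 0, 0) and (4, 0, 0), identity precisions, and only the second bone translated, a point at the midpoint receives equal weight from both bones and moves by half the translation, while points at the bone centers follow their own bone almost exactly.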
- Skeleton Construction:
  - Function: Constructing a structured skeleton from the learned bone model to constrain the generation process.
  - Mechanism: DINOv2 features are used to compute semantic relationships between bones, and skinning weights are used to compute morphological relationships, yielding a joint connectivity strength \(\mathcal{T}_{j,k}\). Translation and angular constraints are applied to prevent implausible deformations.
  - Design Motivation: Skeleton constraints ensure that the generated new objects maintain plausible motion patterns.
- Canonical Score Distillation (CSD), the core contribution:
  - Function: Allowing supervision signals from the diffusion model to propagate back to the canonical model via the warping mechanism.
  - Mechanism:
    \[\nabla_\phi \mathcal{L}_{CSD} = \mathbb{E}[\underbrace{(\epsilon_\theta - \epsilon)}_{\text{diffusion prior}} \cdot \underbrace{\frac{\partial \mathcal{R}(\mathbf{X}_*)}{\partial \mathbf{X}_*}}_{\text{canonical rendering}} \cdot \underbrace{\frac{\partial W(\mathbf{X}^t)}{\partial \phi_w}}_{\text{warp refinement}}]\]
  - Three gradient terms: the diffusion prior provides appearance guidance \(\rightarrow\) canonical rendering guarantees consistency \(\rightarrow\) warp refinement optimizes the deformation parameters.
  - MVDream (a multi-view diffusion model) supervises 4 orthogonal rendered views simultaneously, which encourages 3D consistency.
  - Design Motivation: Standard SDS only computes gradients in the observation space, failing to guarantee canonical space consistency. CSD ensures gradient propagation to the canonical model via the chain rule through the warping field.
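The chain rule behind CSD can be illustrated with a deliberately small toy. Everything here is an assumption made for illustration: the warp is a single per-frame translation, the "renderer" is a linear projection, and the residual `image - target` is a stand-in for \((\epsilon_\theta - \epsilon)\), not a diffusion model. The point is only that a residual computed in observation space, under several poses, accumulates on one shared set of canonical parameters.

```python
import numpy as np

def warp(x_canon, t):
    """Toy warping field W: one rigid translation per frame (hypothetical)."""
    return x_canon + t

def render(x_obs, proj):
    """Toy differentiable 'renderer' R: a linear projection of the points."""
    return x_obs @ proj

def csd_grads(x_canon, t, proj, residual_fn):
    """Chain rule: residual -> back through R -> back through W."""
    x_obs = warp(x_canon, t)
    image = render(x_obs, proj)
    residual = residual_fn(image)      # stand-in for (eps_theta - eps)
    g_obs = residual @ proj.T          # gradient w.r.t. observation-space points
    g_canon = g_obs                    # dW/dx_canon is the identity for a translation
    g_t = g_obs.sum(axis=0)            # the translation is shared by all points
    return g_canon, g_t

rng = np.random.default_rng(0)
x_true = rng.normal(size=(5, 2))       # shape the stand-in "prior" prefers
x_canon = np.zeros((5, 2))             # canonical model being optimized
proj = np.eye(2)
poses = [np.array([0.0, 0.0]), np.array([1.0, -2.0]), np.array([-3.0, 0.5])]

for step in range(200):
    g_total = np.zeros_like(x_canon)
    for t in poses:
        target = render(warp(x_true, t), proj)
        g_canon, _ = csd_grads(x_canon, t, proj, lambda img: img - target)
        g_total += g_canon             # all poses update the same canonical points
    x_canon -= 0.1 * g_total / len(poses)
```

In this sketch the warp parameters are held fixed and only the canonical points are updated; `g_t` is returned to show where a warp-refinement gradient would come from.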
Loss & Training
- Extraction stage: \(\mathcal{L}_{Ext} = \mathcal{L}_{recon}(\text{RGB + silhouette + optical flow}) + \mathcal{L}_{CSD} + \mathcal{L}_{reg}(\text{feature matching + cycle consistency})\)
- Generation stage: \(\mathcal{L}_{Gen} = \mathcal{L}_{skel} + \mathcal{L}_{bone} + \mathcal{L}_{CSD} + \mathcal{L}_{reg}\)
- Rendering uses a resolution of 200×200 with 4 orthogonal views.
- Training takes approximately 5 hours for 12,000 iterations on a single A800 GPU.
Key Experimental Results
Main Results
| Method | CLIP↑ | CLIP-T↑ | R-Precision@10↑ | GPT Eval3D↑ |
|---|---|---|---|---|
| ProlificDreamer | 33.1 | 95.9 | 56.3 | 959 |
| MVDream | 34.8 | 94.4 | 31.2 | 979 |
| Ours | 38.2 | 96.6 | 87.5 | 1098 |
Monocular Reconstruction Task (Chamfer Distance ↓ / F-score@2% ↑)
| Method | Videos | Cat-Coco | Cat-Pikachu | Penguin | Shiba |
|---|---|---|---|---|---|
| BANMo | 1 | 10.7/15.3 | 3.71/57.3 | 6.47/43.9 | 6.81/36.6 |
| RAC | 1 | 6.25/42.2 | 3.60/60.2 | 4.68/53.7 | 7.94/30.1 |
| Ours | 1 | 3.65/63.3 | 2.0/88.9 | 3.7/64.0 | 4.54/53.9 |
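For readers unfamiliar with the two metrics in this table, the following is a minimal NumPy sketch of a symmetric Chamfer distance and an F-score between two point clouds. It is an assumption-laden illustration: definitions vary across papers (squared vs. unsquared distances, sum vs. mean), and the summary does not say what the 2% threshold is measured against, so `threshold` is passed here as an absolute distance. The function names are hypothetical and this is not the paper's evaluation code.

```python
import numpy as np

def nearest_distances(a, b):
    """For each point in a (N, 3), the Euclidean distance to its nearest point in b (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1)

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance: mean nearest-neighbour distance in both directions."""
    return 0.5 * (nearest_distances(pred, gt).mean() + nearest_distances(gt, pred).mean())

def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a distance threshold."""
    precision = (nearest_distances(pred, gt) < threshold).mean()
    recall = (nearest_distances(gt, pred) < threshold).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Lower Chamfer distance means the predicted surface lies closer to the ground truth on average; the F-score rewards predictions where most points on both surfaces have a counterpart within the threshold.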
Ablation Study
| Configuration | CLIP↑ | R-Precision↑ | Description |
|---|---|---|---|
| w/o bone+skel | 27.1 | 35.6 | Neither bone nor skeleton constraints, complete failure |
| w/o skel | 28.4 | 40.1 | No skeleton constraints, motion collapse |
| w/o bone | 37.8 | 81.7 | No bone surface constraints, slight degradation in quality |
| Full model | 38.2 | 87.5 | Full model is optimal |
Key Findings
- CSD contributes significantly to reconstruction: Without CSD on Cat-Coco, CD increases from 3.65 to 8.34 (+128%), and F-score drops from 63.3 to 32.6 (-48%).
- R-Precision improves the most: 87.5 vs. 31.2 for MVDream, showing that skeleton constraints greatly improve text-model consistency.
- Monocular reconstruction outperforms multi-video methods: From a single video, it achieves a Chamfer distance of 3.65 on Cat-Coco, better than BANMo trained on 4 videos (4.66), which suggests that CSD effectively compensates for the information missing in the single-video setup.
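The percentages quoted for the CSD ablation on Cat-Coco follow directly from the reported numbers. A short check, using a hypothetical helper `relative_change`:

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before

cd_change = relative_change(3.65, 8.34)   # Chamfer distance, with CSD -> without CSD
f_change = relative_change(63.3, 32.6)    # F-score, with CSD -> without CSD
print(f"CD: {cd_change:+.1f}%  F-score: {f_change:+.1f}%")
```

Both values agree with the +128% and -48% stated above once rounded to whole percentages.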
Highlights & Insights
- Key insight of CSD: Utilizing warping as a differentiable bridge between the canonical space and the observation space. This concept can be generalized to any task that requires performing SDS on a transformed space, such as cloth simulation and fluids.
- Two-level design of skeleton constraints: Semantic correlation (DINOv2 feature similarity) + morphological correlation (skinning weight overlap) jointly determine the strength of joint connections, which is more robust than purely geometric or semantic alternatives.
- Unified framework for generation and reconstruction: The same CSD framework can be used for both reconstruction from videos (enhancing unobserved regions) and generation from text (ensuring animation consistency).
Limitations & Future Work
- High GPU memory consumption: CSD requires a long gradient chain from the camera space to the canonical space, and MVDream processes 4 views simultaneously, leading to high VRAM demands.
- Limited resolution: Rendering is limited to 200×200, which restricts the level of detail.
- Restricted to skeleton-driven deformation: It cannot handle topological changes (e.g., object splitting) or non-skeleton-driven deformations such as cloth.
- Reliance on video quality: Skeleton extraction depends on the quality of HOI detection and optical flow, which might fail under severe video occlusions.
Related Work & Insights
- vs. DreamFusion/ProlificDreamer: While they generate static 3D models, this work generates animatable 3D models. CSD is a natural extension of SDS.
- vs. BANMo/RAC: These pure reconstruction methods rely on multiple video sequences. In contrast, this work utilizes the diffusion prior of CSD to compensate for information missing from a single video.
- vs. 4D generation (such as MAV3D): MAV3D directly performs SDS on 4D NeRF with no canonical space to guarantee consistency.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The concept of CSD is novel, presenting a significant contribution by using warping as a differentiable bridge for SDS.
- Experimental Thoroughness: ⭐⭐⭐⭐ Validated on both generation and reconstruction tasks with detailed ablation studies, though the dataset scale is relatively small.
- Writing Quality: ⭐⭐⭐⭐ Clear method illustrations and well-derived CSD equations.
- Value: ⭐⭐⭐⭐⭐ Opens a new paradigm for animatable 3D generation, and the core concept of CSD is widely reusable.