AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation¶

Conference: ECCV 2024
arXiv: 2312.03795
Code: https://zz7379.github.io/AnimatableDreamer/
Area: 3D Vision
Keywords: Text-to-4D, Non-rigid 3D, Score Distillation, Skeleton Animation, Canonical Space

TL;DR¶

This work proposes AnimatableDreamer, which extracts a skeleton and motion from a monocular video and uses Canonical Score Distillation (CSD) to generate text-guided, animatable, non-rigid 3D models. It outperforms ProlificDreamer and MVDream on every reported generation metric, including temporal consistency, and outperforms BANMo and RAC on single-video reconstruction.

Background & Motivation¶

Background: Text-to-3D generation (SDS/DreamFusion) is capable of generating high-quality static 3D objects, but generating deformable/non-rigid objects remains challenging. Existing methods are either limited to specific categories (e.g., humans) or fail to guarantee morphological consistency across different poses.

Limitations of Prior Work:
- Applying vanilla SDS directly to animatable objects disrupts motion consistency, leading to incoherent surfaces generated under different poses.
- Reconstructing deformable objects from monocular videos yields poor geometric quality in unobserved regions (e.g., the other side of an animal).
- Existing methods (e.g., BANMo) require multiple video sequences to obtain high-quality 3D reconstructions.

Key Challenge: Generating animatable non-rigid 3D models requires SDS supervision, but nothing ties the canonical space to the observation space, so supervision computed on individual poses is not guaranteed to agree on a single canonical shape.

Goal: (a) How to extract reusable skeletons/skinning from monocular videos; (b) How to generate new animatable 3D models guided by text under skeletal constraints.

Key Insight: The warping field is utilized as a bridge between the canonical space and the observation space, allowing gradients from the diffusion model to propagate back to the canonical model through warping.

Core Idea: Canonical Score Distillation: calculating diffusion prior gradients on observed deformed poses and backpropagating them to the canonical model via differentiable warping, ensuring consistency across all poses.

Method¶

Overall Architecture¶

Two stages: (1) Extraction stage: learn an implicit articulate model (NeuS + skeletal warping field) from a monocular video and enhance unobserved regions with CSD; (2) Generation stage: generate a new text-guided 3D model on the extracted skeleton using CSD + MVDream.

Key Designs¶

Implicit Articulate Model:
- Function: Representing the canonical model using NeuS (SDF + color + feature descriptor), and performing deformation via linear blend skinning (LBS).
- Mechanism: Skinning weights for $B$ bones are modeled using Gaussian distributions and calculated via Mahalanobis distance. Bone transformations are learned using an MLP with Fourier temporal embeddings.
- Design Motivation: The canonical space guarantees temporal consistency, while skeletal skinning provides controllable animation capabilities.
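The skinning step described above can be illustrated with a minimal NumPy sketch. It assumes each bone is a Gaussian with a center and a precision (inverse covariance) matrix, that weights are a softmax over negative squared Mahalanobis distances, and that LBS blends per-bone rigid transforms of each point. The function names, the softmax normalization, and the toy two-bone setup are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gaussian_skinning_weights(points, centers, precisions):
    """Soft-assign N points to B bones from squared Mahalanobis distances.

    points: (N, 3), centers: (B, 3), precisions: (B, 3, 3).
    Returns (N, B) weights; each row sums to 1.
    """
    diff = points[:, None, :] - centers[None, :, :]              # (N, B, 3)
    d2 = np.einsum("nbi,bij,nbj->nb", diff, precisions, diff)    # squared Mahalanobis
    logits = -d2                                                 # assumption: softmax of -d^2
    logits = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def linear_blend_skinning(points, weights, rotations, translations):
    """Blend the per-bone rigid transforms of each point by its skinning weights.

    rotations: (B, 3, 3), translations: (B, 3). Returns warped points (N, 3).
    """
    per_bone = np.einsum("bij,nj->nbi", rotations, points) + translations[None, :, :]
    return np.einsum("nb,nbi->ni", weights, per_bone)

# Toy setup (hypothetical): two bones along the x-axis, bone 1 lifts by 0.2 in y.
centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
precisions = np.stack([np.eye(3) * 4.0, np.eye(3) * 4.0])
points = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [1.0, 0.0, 0.0]])
weights = gaussian_skinning_weights(points, centers, precisions)

rotations = np.stack([np.eye(3), np.eye(3)])
translations = np.array([[0.0, 0.0, 0.0], [0.0, 0.2, 0.0]])
warped = linear_blend_skinning(points, weights, rotations, translations)
```

In this toy case the midpoint between the two bones receives equal weights, so it moves by half of bone 1's translation, which is the smooth blending behaviour that makes LBS suitable for articulated shapes.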
Skeleton Construction:
- Function: Constructing a structured skeleton from the learned bone model to constrain the generation process.
- Mechanism: DINOv2 features are used to compute semantic relationships between bones, and skinning weights are used to compute morphological relationships, yielding a joint connectivity strength $\mathcal{T}_{j,k}$. Translation and angular constraints are applied to prevent implausible deformations.
- Design Motivation: Skeleton constraints ensure that the generated new objects maintain plausible motion patterns.
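As a rough illustration of how semantic and morphological cues could be combined into a pairwise bone connectivity score, the sketch below multiplies the cosine similarity of per-bone feature vectors by a normalized skinning-weight overlap. The summary does not give the actual formula for $\mathcal{T}_{j,k}$, so the product rule, the normalization, and the name `connectivity_strength` are all hypothetical; the features here are small hand-written vectors standing in for DINOv2 descriptors.

```python
import numpy as np

def connectivity_strength(bone_features, skin_weights):
    """Hypothetical pairwise bone connectivity score.

    bone_features: (B, D) one descriptor per bone (stand-in for DINOv2 features).
    skin_weights:  (N, B) skinning weights of N surface points.
    Returns a symmetric (B, B) matrix with a zero diagonal.
    """
    unit = bone_features / np.linalg.norm(bone_features, axis=1, keepdims=True)
    semantic = np.clip(unit @ unit.T, 0.0, 1.0)     # cosine similarity, negatives clipped
    overlap = skin_weights.T @ skin_weights          # how often two bones share points
    overlap = overlap / overlap.max()                # scale to [0, 1]
    strength = semantic * overlap                    # assumption: simple product
    np.fill_diagonal(strength, 0.0)                  # a bone is not connected to itself
    return strength

# Toy inputs: bones 0 and 1 share points and look alike; bone 2 is separate.
features = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
skin = np.array([[0.5, 0.5, 0.0],
                 [0.6, 0.4, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.1, 0.9]])
strength = connectivity_strength(features, skin)
```

Under these assumptions, a pair of bones scores highly only when it agrees on both cues, which mirrors the motivation for using two kinds of relationship rather than one.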
Canonical Score Distillation (CSD) $\leftarrow$ Core Contribution:
- Function: Allowing supervision signals from the diffusion model to propagate back to the canonical model via the warping mechanism.
- Mechanism: $$\nabla_\phi \mathcal{L}_{CSD} = \mathbb{E}[\underbrace{(\epsilon_\theta - \epsilon)}_{\text{diffusion prior}} \cdot \underbrace{\frac{\partial \mathcal{R}(\mathbf{X}_*)}{\partial \mathbf{X}_*}}_{\text{canonical rendering}} \cdot \underbrace{\frac{\partial W(\mathbf{X}^t)}{\partial \phi_w}}_{\text{warp refinement}}]$$
- Three gradient terms: diffusion prior provides appearance guidance $\rightarrow$ canonical rendering guarantees consistency $\rightarrow$ warp refinement optimizes deformation parameters.
- MVDream (a multi-view diffusion model) supervises 4 orthogonal rendered views simultaneously, encouraging 3D consistency.
- Design Motivation: Standard SDS only computes gradients in the observation space, failing to guarantee canonical space consistency. CSD ensures gradient propagation to the canonical model via the chain rule of warping.
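The chain-rule idea behind CSD can be checked on a toy problem. The sketch below is not the paper's pipeline: it assumes a scalar canonical field `sin(x · phi)` in place of NeuS, a pure translation `psi` in place of the skeletal warp, and a fixed random vector in place of the diffusion residual $(\epsilon_\theta - \epsilon)$, which is treated as a constant as in SDS. It only shows that a residual computed in observation space yields gradients for both the canonical parameters and the warp parameters.

```python
import numpy as np

def render(phi, psi, x_obs):
    """Toy 'rendering': warp observed points to canonical space, then query the field."""
    x_can = x_obs + psi              # toy backward warp W (translation only)
    return np.sin(x_can @ phi)       # toy canonical field R

def csd_toy_grads(phi, psi, x_obs, residual):
    """Analytic gradients of sum(residual * render) with the residual held constant."""
    x_can = x_obs + psi
    d_render = np.cos(x_can @ phi)   # derivative of sin(u) with respect to u
    weighted = residual * d_render   # residual times rendering Jacobian, shape (N,)
    grad_phi = weighted @ x_can      # gradient reaching the canonical parameters
    grad_psi = weighted.sum() * phi  # gradient reaching the warp parameters
    return grad_phi, grad_psi

def surrogate_loss(phi, psi, x_obs, residual):
    """Scalar whose gradient equals the CSD-style update in this toy setting."""
    return float(residual @ render(phi, psi, x_obs))

rng = np.random.default_rng(0)
x_obs = rng.normal(size=(5, 2))      # observed sample points
residual = rng.normal(size=5)        # stand-in for (eps_theta - eps)
phi = np.array([0.7, -0.3])
psi = np.array([0.1, 0.2])
grad_phi, grad_psi = csd_toy_grads(phi, psi, x_obs, residual)
```

Comparing `grad_phi` and `grad_psi` against finite differences of `surrogate_loss` is a quick way to confirm that the residual really does propagate through the warp to both parameter sets.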

Loss & Training¶

Extraction stage: $\mathcal{L}_{Ext} = \mathcal{L}_{recon}(\text{RGB + silhouette + optical flow}) + \mathcal{L}_{CSD} + \mathcal{L}_{reg}(\text{feature matching + cycle consistency})$
Generation stage: $\mathcal{L}_{Gen} = \mathcal{L}_{skel} + \mathcal{L}_{bone} + \mathcal{L}_{CSD} + \mathcal{L}_{reg}$
Rendering uses 4 orthogonal views at a resolution of 200×200.
Training takes approximately 5 hours for 12,000 iterations on a single A800 GPU.
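
For a sense of scale, the reported budget can be converted into per-iteration figures. This assumes the 5 hours are wall-clock time spread evenly over the iterations and that every iteration renders all 4 views at full resolution, neither of which is stated explicitly:

$$\frac{5 \times 3600\ \text{s}}{12{,}000\ \text{iterations}} = 1.5\ \text{s per iteration}, \qquad 4 \times 200 \times 200 = 160{,}000\ \text{rendered pixels per iteration}$$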

Key Experimental Results¶

Main Results¶

Method	CLIP↑	CLIP-T↑	R-Precision@10↑	GPT Eval3D↑
ProlificDreamer	33.1	95.9	56.3	959
MVDream	34.8	94.4	31.2	979
Ours	38.2	96.6	87.5	1098
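
To make the retrieval metric in the table concrete, here is a minimal sketch of an R-Precision@k computation. It assumes that "@10" means the true prompt must appear among the top 10 prompts retrieved for a render, and that a render-to-prompt similarity matrix (for example from CLIP embeddings) is already available. The function name and the toy matrix are illustrative, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def r_precision_at_k(similarity, k):
    """Fraction of renders whose own prompt ranks within the top k.

    similarity[i, j] is the score between render i and prompt j;
    prompt i is assumed to be the ground-truth prompt of render i.
    """
    true_scores = np.diag(similarity)[:, None]
    rank = (similarity > true_scores).sum(axis=1)   # prompts scoring strictly higher
    return float((rank < k).mean())

# Toy similarity matrix for 3 renders and 3 prompts (made-up numbers).
sim = np.array([[0.9, 0.1, 0.2],
                [0.8, 0.3, 0.1],
                [0.2, 0.9, 0.5]])
top1 = r_precision_at_k(sim, k=1)
top2 = r_precision_at_k(sim, k=2)
```

In this toy matrix only the first render ranks its own prompt first, while all three have it within the top two.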

Monocular Reconstruction Task (Chamfer Distance ↓ / F-score@2% ↑)¶

Method	Videos	Cat-Coco	Cat-Pikachu	Penguin	Shiba
BANMo	1	10.7/15.3	3.71/57.3	6.47/43.9	6.81/36.6
RAC	1	6.25/42.2	3.60/60.2	4.68/53.7	7.94/30.1
Ours	1	3.65/63.3	2.0/88.9	3.7/64.0	4.54/53.9
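
The two reconstruction metrics can be sketched on point clouds as follows. Conventions vary between papers (squared or unsquared distances, sum or mean of the two directions, and how the 2% threshold is scaled), so this version assumes unsquared distances, the sum of the two directional means, and a threshold passed in as an absolute distance. It is not necessarily the protocol used for the table above.

```python
import numpy as np

def chamfer_and_fscore(pred, gt, threshold):
    """Symmetric Chamfer distance and F-score between point sets (P, 3) and (G, 3)."""
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (P, G)
    pred_to_gt = dists.min(axis=1)    # nearest ground-truth point for each prediction
    gt_to_pred = dists.min(axis=0)    # nearest prediction for each ground-truth point
    chamfer = pred_to_gt.mean() + gt_to_pred.mean()
    precision = (pred_to_gt < threshold).mean()
    recall = (gt_to_pred < threshold).mean()
    if precision + recall == 0:
        return float(chamfer), 0.0
    return float(chamfer), float(2 * precision * recall / (precision + recall))

# Toy example: the prediction is the ground truth shifted by 0.1 along x.
gt_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pred_points = gt_points - np.array([0.1, 0.0, 0.0])
cd, f_loose = chamfer_and_fscore(pred_points, gt_points, threshold=0.2)
_, f_strict = chamfer_and_fscore(pred_points, gt_points, threshold=0.05)
```

The same shifted prediction scores a perfect F-score under the loose threshold and zero under the strict one, which is why the threshold has to be reported alongside the number.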

Ablation Study¶

Configuration	CLIP↑	R-Precision↑	Description
w/o bone+skel	27.1	35.6	No bone constraints, complete failure
w/o skel	28.4	40.1	No skeleton constraints, motion collapse
w/o bone	37.8	81.7	No bone surface constraints, slight degradation in quality
Full model	38.2	87.5	Full model is optimal

Key Findings¶

CSD contributes significantly to reconstruction: Without CSD on Cat-Coco, CD increases from 3.65 to 8.34 (+128%), and F-score drops from 63.3 to 32.6 (-48%).
R-Precision shows the largest gain: 87.5 vs. 31.2 for MVDream and 56.3 for ProlificDreamer, suggesting that skeleton constraints greatly improve text-model consistency.
Single-video reconstruction outperforms a multi-video baseline: on Cat-Coco, the CD of 3.65 beats BANMo trained on 4 videos (4.66), suggesting that CSD compensates for the information missing from a single video.
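
The relative changes quoted for the Cat-Coco CSD ablation can be reproduced with a few lines of arithmetic. The helper name below is illustrative; the input values are the ones reported in this section.

```python
def relative_change_percent(baseline, ablated):
    """Signed change of `ablated` relative to `baseline`, in percent."""
    return (ablated - baseline) / baseline * 100.0

cd_change = relative_change_percent(3.65, 8.34)       # Chamfer distance, lower is better
fscore_change = relative_change_percent(63.3, 32.6)   # F-score, higher is better
print(f"CD: {cd_change:+.1f}%  F-score: {fscore_change:+.1f}%")
```

This gives roughly +128.5% for CD and -48.5% for F-score, consistent with the +128% and -48% figures quoted for the ablation.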

Highlights & Insights¶

Key insight of CSD: Utilizing warping as a differentiable bridge between the canonical space and the observation space. This concept can be generalized to any task that requires performing SDS on a transformed space, such as cloth simulation and fluids.
Two-level design of skeleton constraints: Semantic correlation (DINOv2 feature similarity) + morphological correlation (skinning weight overlap) jointly determine the strength of joint connections, which is more robust than purely geometric or semantic alternatives.
Unified framework for generation and reconstruction: The same CSD framework can be used for both reconstruction from videos (enhancing unobserved regions) and generation from text (ensuring animation consistency).

Limitations & Future Work¶

High GPU memory consumption: CSD requires a long gradient chain from the camera space to the canonical space, and MVDream processes 4 views simultaneously, leading to high VRAM demands.
Limited resolution: It can only render at 200×200 resolution, limiting detail quality.
Restricted to skeleton-driven deformation: It cannot handle topological changes (e.g., object splitting) or non-skeleton-driven deformations such as cloth.
Reliance on video quality: Skeleton extraction depends on the quality of HOI detection and optical flow, which might fail under severe video occlusions.

vs. DreamFusion/ProlificDreamer: While they generate static 3D models, this work generates animatable 3D models. CSD is a natural extension of SDS.
vs. BANMo/RAC: These pure reconstruction methods rely on multiple video sequences. In contrast, this work utilizes the diffusion prior of CSD to compensate for information missing from a single video.
vs. 4D generation (such as MAV3D): MAV3D directly performs SDS on 4D NeRF with no canonical space to guarantee consistency.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ The concept of CSD is novel, presenting a significant contribution by using warping as a differentiable bridge for SDS.
Experimental Thoroughness: ⭐⭐⭐⭐ Validated on both generation and reconstruction tasks with detailed ablation studies, though the dataset scale is relatively small.
Writing Quality: ⭐⭐⭐⭐ Clear method illustrations and well-derived CSD equations.
Value: ⭐⭐⭐⭐⭐ Opens a new paradigm for animatable 3D generation, and the core concept of CSD is widely reusable.