Disco4D: Disentangled 4D Human Generation and Animation from a Single Image

Conference: CVPR 2025
arXiv: 2409.17280
Code: https://disco-4d.github.io/
Area: 3D Vision
Keywords: 4D Human Generation, Clothing Disentanglement, Gaussian Splatting, Human Animation, Single-image Reconstruction

TL;DR

Disco4D proposes a 4D human generation framework that disentangles clothing (represented by a Gaussian model) from the human body (represented by the SMPL-X model), generating animatable, editable, and layered 3D clothed human models from a single image, and supporting realistic 4D clothing dynamics.

Background & Motivation

Existing single-image 3D human reconstruction methods (such as PIFu, ECON) reconstruct the body and clothing as a single-layer holistic mesh, which cannot support applications like virtual try-on or clothing editing.
The single-layer non-animatable mesh produced by holistic reconstruction makes re-animation and dynamic customization extremely difficult.
Prior work has not achieved 4D layered human generation and animation from a single image or a few images.
A representation that can separate the body and clothing is needed, while simultaneously supporting high-fidelity generation, fine-grained editing, and realistic animation.

Method

Overall Architecture

Disco4D adopts a bottom-up construction approach: (1) represent the human body using the SMPL-X parametric model; (2) represent clothing using 3D Gaussians; (3) bind the clothing Gaussians to the SMPL-X mesh and iteratively optimize them; (4) achieve separation and extraction of clothing categories via identity encoding. The complete human representation is denoted as \(S_{human} = S_{body} \cup S_{cloth}\).
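The layered representation \(S_{human} = S_{body} \cup S_{cloth}\) can be pictured as two separate containers of Gaussians. The following is a minimal sketch, not the authors' data structures: the class and field names (`BodyGaussian`, `ClothGaussian`, `LayeredHuman`, `without_clothing`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BodyGaussian:
    face_id: int          # SMPL-X triangle this Gaussian is bound to
    opacity: float = 1.0  # body opacity is fixed in the description above


@dataclass
class ClothGaussian:
    face_id: int         # canonical SMPL-X triangle the Gaussian is embedded on
    local_coords: tuple  # three coefficients in the face's local frame
    # 15-dimensional identity encoding e (zero-initialised here as a placeholder)
    identity: list = field(default_factory=lambda: [0.0] * 15)


@dataclass
class LayeredHuman:
    body: list
    cloth: list

    def all_gaussians(self):
        # S_human = S_body ∪ S_cloth
        return self.body + self.cloth

    def without_clothing(self):
        # Dropping the clothing layer leaves the body layer untouched.
        return LayeredHuman(self.body, [])


human = LayeredHuman(
    body=[BodyGaussian(face_id=0), BodyGaussian(face_id=1)],
    cloth=[ClothGaussian(face_id=0, local_coords=(0.1, 0.0, 0.02)),
           ClothGaussian(face_id=1, local_coords=(0.0, 0.1, 0.02)),
           ClothGaussian(face_id=1, local_coords=(0.2, 0.1, 0.03))],
)
print(len(human.all_gaussians()), len(human.without_clothing().all_gaussians()))
```

Keeping the two layers in separate lists is what makes operations such as clothing removal a matter of dropping one container rather than editing a fused mesh.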

Key Designs

SMPL-X Gaussian Human Representation:
- Function: Provides a stable human anchor structure.
- Mechanism: Directly binds flat 3D Gaussians to each triangular face of the SMPL-X mesh (similar to SuGaR). The Gaussian positions \(\mu_{body}\) are determined by predefined barycentric coordinates, rotations are derived from face normals, and opacity is fixed to 1.0.
- Design Motivation: SMPL-X excels at capturing human structure and kinematics. Fixing the body representation maintains integrity, allowing the learning process to focus on clothing.
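To illustrate the binding described above, the sketch below places one Gaussian on a triangle using barycentric weights and derives the face normal that would orient it. It assumes a single toy triangle rather than a real SMPL-X mesh; the helper names are hypothetical, and the way a full rotation is built from the normal is not specified in these notes.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)


def barycentric_point(v0, v1, v2, w):
    # w = (w0, w1, w2) with w0 + w1 + w2 = 1 gives a point inside the face.
    return tuple(w[0] * a + w[1] * b + w[2] * c for a, b, c in zip(v0, v1, v2))


def face_normal(v0, v1, v2):
    # Unit normal from the two edge vectors leaving v0.
    return normalize(cross(sub(v1, v0), sub(v2, v0)))


# Toy triangle standing in for one SMPL-X face.
tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
mu_body = barycentric_point(*tri, (1 / 3, 1 / 3, 1 / 3))  # Gaussian centre
normal = face_normal(*tri)                                 # orientation cue
print(mu_body, normal)
```

Because the weights are predefined, moving the triangle's vertices moves the Gaussian with them, with no per-Gaussian position parameters to learn for the body.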
Mesh Embedding and Separable Optimization of Clothing Gaussians:
- Function: Models clothing as a layer independent of the human body.
- Mechanism: Each clothing Gaussian is embedded on a triangular face of the canonical SMPL-X mesh and positioned in that face's local coordinate system as \(\mu = O + \sigma i + \beta j + \gamma k\), where \(O\) is the local origin on the face, \(i, j, k\) are the local axes, and \(\sigma, \beta, \gamma\) are the Gaussian's coefficients along them. An SDF loss and pruning keep the clothing outside the body. An identity encoding \(e \in \mathbb{R}^{15}\) associates each Gaussian with its clothing category.
- Design Motivation: Embedding Gaussians into triangular faces allows them to animate naturally with SMPL-X deformations; identity encoding enables different garments to be independently extracted and edited.
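The local-frame embedding can be sketched as follows. This is an illustration under stated assumptions: the origin is taken as the face centroid, the first axis follows the first edge, the third axis is the face normal, and the second axis completes a right-handed frame. The notes do not say how the paper defines these, so `local_frame` and `embed` are hypothetical helpers.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)


def local_frame(v0, v1, v2):
    # Assumed frame: centroid origin, i along edge v0->v1, k = normal, j = k x i.
    origin = tuple((a + b + c) / 3.0 for a, b, c in zip(v0, v1, v2))
    i = normalize(sub(v1, v0))
    k = normalize(cross(sub(v1, v0), sub(v2, v0)))
    j = cross(k, i)
    return origin, i, j, k


def embed(frame, coeffs):
    # mu = O + c0 * i + c1 * j + c2 * k, with coeffs fixed per Gaussian.
    origin, i, j, k = frame
    return tuple(origin[d] + coeffs[0] * i[d] + coeffs[1] * j[d] + coeffs[2] * k[d]
                 for d in range(3))


coeffs = (0.1, 0.0, 0.05)  # 0.05 units above the face along its normal

canonical_tri = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
# The same face after the body rotates it 90 degrees about the z axis.
posed_tri = ((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0))

mu_canonical = embed(local_frame(*canonical_tri), coeffs)
mu_posed = embed(local_frame(*posed_tri), coeffs)
print(mu_canonical, mu_posed)
```

The coefficients stay constant while the frame is recomputed from the posed vertices, so the Gaussian rides along with the face; here the posed centre is exactly the canonical centre rotated by the same 90 degrees.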
Diffusion Model-Enhanced Texture Completion and 4D Animation:
- Function: Completes textures in occluded regions and learns clothing dynamics.
- Mechanism: Extracts texture details of occluded areas from a diffusion model using SDS loss. 4D animation predicts changes in position, rotation, and scale of clothing Gaussians via a deformation network \(S'' = \phi(S', t)\), while the body is driven by SMPL-X. Thus, the clothing follows the body while exhibiting its own dynamics.
- Design Motivation: Single images inevitably contain occluded regions, for which diffusion models provide reasonable priors; decoupled animation of clothing and body yields more realistic results.
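A deformation network of the form \(S'' = \phi(S', t)\) can be sketched as a small residual MLP that maps a Gaussian's position and a time value to offsets in position, rotation, and scale. The architecture below (one hidden layer, tanh activation, zero-initialised output layer, multiplicative scale update) is an assumption for illustration only; the notes do not describe the network the paper uses.

```python
import math
import random


def init_mlp(in_dim, hidden, out_dim, seed=0):
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    # Zero-initialised output layer: the network starts as the identity deformation.
    w2 = [[0.0] * hidden for _ in range(out_dim)]
    b2 = [0.0] * out_dim
    return w1, b1, w2, b2


def mlp(params, x):
    w1, b1, w2, b2 = params
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(w2, b2)]


def deform(params, position, rotation, scale, t):
    # Input: 3D position plus time. Output: 3 position, 4 quaternion, 3 log-scale offsets.
    out = mlp(params, list(position) + [t])
    d_pos, d_rot, d_scale = out[0:3], out[3:7], out[7:10]
    new_pos = [p + d for p, d in zip(position, d_pos)]
    q = [r + d for r, d in zip(rotation, d_rot)]
    norm = math.sqrt(sum(c * c for c in q))
    new_rot = [c / norm for c in q]  # re-normalise to a unit quaternion
    new_scale = [s * math.exp(d) for s, d in zip(scale, d_scale)]
    return new_pos, new_rot, new_scale


params = init_mlp(in_dim=4, hidden=8, out_dim=10)
pos, rot, scl = (0.1, 0.2, 0.3), (1.0, 0.0, 0.0, 0.0), (0.5, 0.5, 0.5)

pos0, rot0, scl0 = deform(params, pos, rot, scl, t=0.3)  # identity at initialisation
params[3][0] = 0.1                                       # pretend training moved a bias
pos1, rot1, scl1 = deform(params, pos, rot, scl, t=0.3)
print(pos0, pos1)
```

In this sketch the body pose would still come from SMPL-X; the network only adds a residual on top of the reposed clothing Gaussians, which mirrors the idea that clothing follows the body while keeping its own dynamics.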

Loss & Training

Total loss function: \(\mathcal{L} = \mathcal{L}_{ori} + \mathcal{L}_{id} + \mathcal{L}_{ani} + \mathcal{L}_{sdf} + \mathcal{L}_{SDS}\)

\(\mathcal{L}_{ori}\): Standard 3D Gaussian rendering loss.
\(\mathcal{L}_{id} = \mathcal{L}_{2d} + \mathcal{L}_{3d}\): Identity encoding loss, incorporating cross-entropy classification and 3D nearest-neighbor consistency regularization.
\(\mathcal{L}_{ani}\): Anisotropy constraint, preventing overly thin Gaussian kernels during deformation.
\(\mathcal{L}_{sdf}\): SDF loss, ensuring clothing Gaussians remain outside the SMPL-X body.
\(\mathcal{L}_{SDS}\): Score Distillation Sampling loss for texture completion in occluded regions.
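The sketch below shows how three of these terms could be computed and summed. It rests on several assumptions that are not taken from the paper: a unit sphere stands in for the SMPL-X body's signed distance function, the anisotropy term uses a common hinge on the ratio of largest to smallest scale with a hypothetical threshold `max_ratio`, and the rendering and SDS terms are left as placeholder zeros because they require a renderer and a diffusion model.

```python
import math


def cross_entropy(logits, label):
    # Classification part of the identity loss for one Gaussian or pixel.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def sdf_sphere(p, center, radius):
    # Stand-in body SDF: negative inside the sphere, positive outside.
    return math.dist(p, center) - radius


def sdf_loss(points, center, radius, margin=0.0):
    # Hinge penalty on clothing Gaussians whose centres fall inside the body.
    return sum(max(margin - sdf_sphere(p, center, radius), 0.0) for p in points) / len(points)


def anisotropy_loss(scales, max_ratio=3.0):
    # Penalise Gaussians whose longest axis exceeds max_ratio times the shortest.
    return sum(max(max(s) / min(s) - max_ratio, 0.0) for s in scales) / len(scales)


cloth_centres = [(0.0, 0.0, 1.5), (0.0, 0.0, 0.5)]   # second one is inside the body
cloth_scales = [(1.0, 1.0, 1.0), (4.0, 1.0, 1.0)]    # second one is too elongated

terms = {
    "ori": 0.0,                                       # placeholder: needs a renderer
    "id": cross_entropy([0.0] * 15, label=3),         # untrained 15-way logits
    "ani": anisotropy_loss(cloth_scales),
    "sdf": sdf_loss(cloth_centres, (0.0, 0.0, 0.0), 1.0),
    "SDS": 0.0,                                       # placeholder: needs a diffusion model
}
total = sum(terms.values())                           # unweighted sum, as in the formula
print(terms, total)
```

Only the offending Gaussians contribute to the SDF and anisotropy terms, so well-placed, well-shaped Gaussians are left free to fit the image.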

Key Experimental Results

Main Results

Method	CLIP(All)↑	CLIP(Pants)↑	PSNR(NV)↑	SSIM(NV)↑	LPIPS(NV)↓
Disco4D(CloSe)	0.856	0.858	20.10	0.918	0.081
LGM	0.829	0.727	20.50	0.939	0.077
DreamGaussian	0.734	0.693	20.08	0.939	0.089
SHERF	0.777	0.785	18.96	0.912	0.083

Ablation Study

Configuration	CLIP(All)↑	Assets↑	PSNR↑	SSIM↑	LPIPS↓
Disco4D (reposed + deform)	0.900	0.865	25.46	0.96	0.035
Disco4D (reposed)	0.853	0.774	23.94	0.95	0.049
DG4D (Disco4D init)	0.870	0.849	21.02	0.93	0.065
GaussianAvatar	0.822	0.768	20.01	0.93	0.069

Key Findings

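Disco4D outperforms all baselines on the CLIP scores for extracted clothing assets, the indicator of disentanglement quality: CLIP(Pants) reaches 0.858 versus 0.785 for the next-best method, SHERF. On novel-view PSNR, SSIM, and LPIPS, however, LGM scores slightly better (for example, 20.50 vs. 20.10 PSNR).
Adding learned clothing deformations on top of reposing raises PSNR from 23.94 to 25.46 and lowers LPIPS from 0.049 to 0.035.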
In user studies, Disco4D achieves an image consistency score of 3.142 (on a 5-point scale), far exceeding LGM's 2.338 and DreamGaussian's 2.017.
Using SMPL-X as a body anchor significantly improves the geometric accuracy of the face and limbs.
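To put the ablation numbers in perspective, the short calculation below converts the reported before/after values for the "reposed" and "reposed + deform" rows into absolute and relative changes. The helper `delta` is hypothetical; the input values are copied from the ablation table.

```python
def delta(before, after):
    # Absolute change and change relative to the starting value, in percent.
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


psnr_abs, psnr_rel = delta(23.94, 25.46)
lpips_abs, lpips_rel = delta(0.049, 0.035)
clip_abs, clip_rel = delta(0.853, 0.900)

print(f"PSNR:  {psnr_abs:+.2f} dB ({psnr_rel:+.1f}%)")
print(f"LPIPS: {lpips_abs:+.3f} ({lpips_rel:+.1f}%)")
print(f"CLIP:  {clip_abs:+.3f} ({clip_rel:+.1f}%)")
```

The learned deformation adds about 1.5 dB of PSNR and cuts LPIPS by a bit more than a quarter, so the perceptual gain is proportionally larger than the pixel-level one.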

Highlights & Insights

First to achieve single-image 4D layered human generation: Supports simultaneous body-clothing separation, animation, and editing.
Ingenious identity encoding mechanism: Employs a 15-dimensional learnable vector with 2D segmentation supervision and 3D nearest-neighbor regularization to achieve fine-grained clothing category separation.
No training on human-specific datasets: Relies on the priors of general-purpose diffusion models instead.
Rich editing capabilities: Supports clothing removal, color changing, material modification, and cross-character clothing transfer.
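The identity-encoding idea (per-Gaussian category scores, a 3D nearest-neighbor consistency term, and category-based extraction) can be sketched as below. The assumptions are significant: the 15-dimensional encoding is treated directly as class logits, the consistency term is a KL divergence to each Gaussian's nearest neighbors, and all function names are hypothetical. The paper's actual classifier and regulariser may differ.

```python
import math


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def knn_indices(points, idx, k):
    # Brute-force k nearest neighbours of points[idx], excluding itself.
    others = [j for j in range(len(points)) if j != idx]
    return sorted(others, key=lambda j: math.dist(points[idx], points[j]))[:k]


def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def knn_consistency(points, probs, k=1):
    # Neighbouring Gaussians should agree on their clothing category.
    total = 0.0
    for i in range(len(points)):
        neigh = knn_indices(points, i, k)
        total += sum(kl(probs[i], probs[j]) for j in neigh) / len(neigh)
    return total / len(points)


def extract(points, probs, category):
    # Keep only Gaussians whose most likely category matches.
    return [p for p, pr in zip(points, probs)
            if max(range(len(pr)), key=pr.__getitem__) == category]


def logits_for(category, dim=15):
    return [4.0 if d == category else 0.0 for d in range(dim)]


points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0), (5.1, 0.0, 0.0)]
clean = [softmax(logits_for(c)) for c in (0, 0, 1, 1)]
noisy = [softmax(logits_for(c)) for c in (0, 1, 1, 1)]  # one mislabelled Gaussian

loss_clean = knn_consistency(points, clean)
loss_noisy = knn_consistency(points, noisy)
garment = extract(points, clean, category=1)
print(loss_clean, loss_noisy, garment)
```

A spatially consistent labelling gives zero penalty, while a single stray label among its neighbors is penalised; once labels are clean, pulling out one garment for removal, recolouring, or transfer is a simple filter.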

Limitations & Future Work

The accuracy of the initial SMPL-X estimation affects the quality of all downstream steps.
Modeling loose clothing (like long dresses) and complex accessories remains challenging.
The texture quality from diffusion model distillation might be suboptimal in heavily occluded areas such as the back.
Clothing dynamics learning relies on video inputs; under the single-image modality, animation is limited to simple SMPL-X driving.

Related Work & Insights

Compared to NeRF-based methods like SHERF/ELICIT, the Gaussian representation offers significant advantages in efficiency and editing flexibility.
The layered representation idea of SMPL-X + Gaussian can be extended to other layered reconstruction tasks.
The design of identity encoding can inspire finer-grained scene decomposition approaches.
It provides a practical framework for virtual try-on, digital fashion, and content creation.

Rating

Novelty: ⭐⭐⭐⭐⭐ First to introduce body-clothing disentanglement to single-image 4D generation with an elegant framework design.
Experimental Thoroughness: ⭐⭐⭐⭐ Quantitative evaluation on multiple datasets plus user studies, but few real-world demos.
Writing Quality: ⭐⭐⭐⭐ Clear structure and detailed method description.
Value: ⭐⭐⭐⭐⭐ High practical value, providing a new paradigm for fields like virtual try-on.