WordRobe: Text-Guided Generation of Textured 3D Garments
Conference: ECCV 2024
arXiv: 2403.17541
Code: Planned to be open-source (project homepage available)
Area: 3D Garment Generation / Text-driven 3D Content Creation
Keywords: Text-to-3D Garment, Latent Space, UDF, ControlNet Texture, CLIP-guided Generation
TL;DR
Proposes WordRobe, which learns a latent space of 3D garment UDFs (unsigned distance fields) with a coarse-to-fine, two-stage encoder-decoder framework. A weakly supervised CLIP mapping network enables text-driven 3D garment generation and editing, and the view-composited property of ControlNet is used to generate view-consistent texture maps in a single forward pass, roughly 13 times faster than Text2Tex (~22 s vs. ~5 min per prompt).
Background & Motivation
Background: 3D garment modeling is in high demand for virtual try-on, game characters, and AR/VR experiences. Traditional methods rely on manual modeling using design tools like CLO or high-end 3D scanners, which are costly and difficult to scale.
Limitations of Prior Work:
- Parametric methods (BCNet, SMPLicit): Restricted by human body templates such as SMPL, they can only handle tight-fitting clothes, leading to limited garment varieties.
- Non-parametric methods (ReEF, xCloth): Can model various styles but output posed meshes with low-quality textures, making them difficult to use directly in graphics pipelines.
- DrapeNet: Although capable of learning a latent space, it lacks support for textures, offers uncontrollable generation, and requires explicit labels for editing.
- General text-to-3D (DreamBooth3D, etc.): Generate garments with poor surface quality. NeRF/SDF cannot handle open surfaces (cuffs, necklines), and multi-view optimization is extremely slow.
Key Challenge: A method is required that can both generate high-quality, unposed (canonical T-pose) 3D garment meshes and intuitively control shape and texture via text.
Key Insight: A divide-and-conquer strategy: first learn the latent space of garment geometry, then achieve text control through CLIP mapping, and finally utilize the zero-shot capability of ControlNet to efficiently generate textures.
Core Idea: A robust latent space of 3D garment UDFs + a weakly supervised CLIP-to-latent mapping + view-composited ControlNet texture generation in a single forward pass.
Method
Overall Architecture
Input text prompt \(\rightarrow\) CLIP text encoder yields embedding \(\psi\) \(\rightarrow\) Mapping Network predicts garment latent code \(\phi \in \Omega\) \(\rightarrow\) Coarse Decoder + Fine Decoder decode to UDF in two steps \(\rightarrow\) Marching Cubes extracts mesh \(\rightarrow\) UV parameterization \(\rightarrow\) Front and back depth maps are merged and fed into ControlNet \(\rightarrow\) Generates a view-composited RGB image \(\rightarrow\) Projected onto the UV texture map \(\rightarrow\) Final textured 3D garment.
Key Designs
- Coarse-to-Fine 3D Garment Latent Space:
  - Function: Encodes multi-category 3D garments into a 32-dimensional latent space \(\Omega\), allowing decoding back to high-quality geometry.
  - Mechanism: Uses a DGCNN as the encoder \(\xi\) to encode the garment surface point cloud into \(\phi \in \mathbb{R}^{32}\); employs two task-specific MLP decoders: \(D_{coarse}\) predicts a smooth UDF, and \(D_{fine}\) predicts a residual that corrects high-frequency details.
  - Key Formula: \(\sigma_{fine} = D_{coarse}(\phi) + D_{fine}(\phi) = \sigma_{coarse} + \sigma_{delta}\)
  - Design Motivation: A single decoder cannot simultaneously learn a regularized latent space and high-frequency geometric details (e.g., wrinkles, folds). The coarse decoder is responsible for the overall shape and latent space regularization, while the fine decoder is responsible for detailing.
  - Latent space disentanglement loss: \(\mathcal{L}_{latent} = \|\Sigma_b - \mathbf{I}_k\|\), with \(\Sigma_b\) read here as the covariance matrix of the latent codes in a training batch and \(\mathbf{I}_k\) the \(k \times k\) identity (\(k = 32\)). It encourages independence among latent dimensions, so that changing a single dimension changes a single attribute.
  - Coarse stage loss: \(\mathcal{L}_{coarse} = \lambda_{dist}\mathcal{L}_{dist} + \lambda_{grad}\mathcal{L}_{grad} + \lambda_{latent}\mathcal{L}_{latent}\)
  - Fine stage loss: \(\mathcal{L}_{fine} = \lambda_{dist}\mathcal{L}_{dist} + \lambda_{grad}\mathcal{L}_{grad}\)
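The disentanglement term and the residual decoding above can be illustrated with a minimal sketch. It assumes that \(\Sigma_b\) is the sample covariance of a batch of latent codes and that the norm is the Frobenius norm; neither detail is stated in this summary, and the function names are hypothetical.

```python
import math

def batch_covariance(codes):
    """Sample covariance (k x k) of a batch of k-dimensional latent codes.
    Assumption: Sigma_b is the unbiased batch covariance."""
    n, k = len(codes), len(codes[0])
    mean = [sum(c[j] for c in codes) / n for j in range(k)]
    cov = [[0.0] * k for _ in range(k)]
    for c in codes:
        for i in range(k):
            for j in range(k):
                cov[i][j] += (c[i] - mean[i]) * (c[j] - mean[j])
    return [[v / (n - 1) for v in row] for row in cov]

def latent_loss(codes):
    """||Sigma_b - I_k|| using the Frobenius norm (assumed).
    It is zero when dimensions are uncorrelated with unit variance."""
    cov = batch_covariance(codes)
    k = len(cov)
    squared = sum(
        (cov[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(k)
        for j in range(k)
    )
    return math.sqrt(squared)

def fine_udf(sigma_coarse, sigma_delta):
    """sigma_fine = sigma_coarse + sigma_delta at one query point."""
    return sigma_coarse + sigma_delta
```

A batch whose dimensions move together, such as `[[1, 1], [-1, -1], [2, 2], [-2, -2]]`, yields a larger `latent_loss` than a batch with uncorrelated, unit-variance dimensions, which is the behaviour the regularizer is meant to reward.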
- CLIP-Guided Weakly Supervised Mapping Network:
  - Function: Trains an \(MLP_{map}\) to map CLIP embeddings to garment latent codes, enabling text control.
  - Mechanism: Eliminates the need for manual text labeling. For each training garment, depth maps are rendered under random rotations and fed into ControlNet to generate garment images, from which a CLIP image encoder produces \(\psi_i\); in parallel, the encoder \(\xi\) produces \(\phi_i\). The \(MLP_{map}\) is trained on the \((\psi_i, \phi_i)\) pairs. Because CLIP embeds images and text in a shared space, text embeddings can be fed to the same network at inference time.
  - Training loss: Simple L1 loss \(\|MLP_{map}(\psi_i) - \phi_i\|_1\)
  - Design Motivation: 3D garment data lacks text annotations. Generating realistic garment images with ControlNet and passing them through a CLIP image encoder removes the annotation requirement.
  - Prompt template construction: Random combinations of "a garment made of {silk/cotton/wool/leather}, with {vibrant/dull/bright/shiny/matte} colors"
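The prompt template and the L1 objective can be sketched as follows. The two word lists come from the template quoted above; uniform sampling, the function names, and the plain-list representation of the vectors are assumptions made for illustration.

```python
import random

MATERIALS = ["silk", "cotton", "wool", "leather"]
COLOR_STYLES = ["vibrant", "dull", "bright", "shiny", "matte"]

def sample_prompt(rng):
    """Draw one random combination from the template (uniform sampling assumed)."""
    material = rng.choice(MATERIALS)
    style = rng.choice(COLOR_STYLES)
    return f"a garment made of {material}, with {style} colors"

def l1_loss(predicted_phi, target_phi):
    """L1 distance between the mapped latent code and the encoder's latent code."""
    return sum(abs(p - t) for p, t in zip(predicted_phi, target_phi))

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    print(sample_prompt(rng))
    print(l1_loss([1.0, -2.0], [0.5, 0.0]))
```

The template has 4 x 5 = 20 distinct prompts, so the variety in the training pairs comes mostly from the garments and the random rotations rather than from the text.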
- View-Composited ControlNet Texture Synthesis:
  - Function: Generates view-consistent, high-quality texture maps for 3D garments in a single forward inference pass.
  - Mechanism: The authors observe a key property of ControlNet: when multi-view depth maps are concatenated into a single image input, the generated RGB image maintains color and lighting consistency across the views. Front and back orthographic projections are used to render depth maps \(\pi_{depth}\), which are concatenated into a single 1024x1024 image. ControlNet generates \(\pi_{rgb}\), which is then projected onto the UV texture map.
  - Design Motivation: Multi-view optimization methods like Text2Tex are slow (~5 min/prompt) and suffer from view inconsistencies and patchy artifacts. The proposed method only requires ~22 seconds.
  - Orthographic projection selection: Perspective projection loses more information in tangent regions; front-back division is a natural choice for garments, reducing visible seams.
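The compositing step can be sketched with plain nested lists standing in for images. The side-by-side layout (front on the left, back on the right) and the helper names are assumptions; the summary only states that the two depth maps are concatenated into one 1024x1024 image.

```python
def composite_depth(front, back):
    """Concatenate two equally tall depth maps side by side.
    Assumed layout: front view on the left, back view on the right."""
    if len(front) != len(back):
        raise ValueError("front and back depth maps must have the same height")
    return [f_row + b_row for f_row, b_row in zip(front, back)]

def split_views(composite, front_width):
    """Split a generated composite image back into its two views
    so each half can be projected onto the UV map separately."""
    front = [row[:front_width] for row in composite]
    back = [row[front_width:] for row in composite]
    return front, back

if __name__ == "__main__":
    # Two 1024 x 512 placeholder depth maps give one 1024 x 1024 composite.
    front = [[0.0] * 512 for _ in range(1024)]
    back = [[1.0] * 512 for _ in range(1024)]
    composite = composite_depth(front, back)
    print(len(composite), len(composite[0]))
```

Because both views pass through the diffusion model as one image, they share a single denoising trajectory, which is a plausible reason for the consistent colors and lighting the authors report.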
Loss & Training
- Encoder \(\xi\) + \(D_{coarse}\) are jointly trained for 20 epochs (\(\lambda_{dist}=1.0, \lambda_{grad}=0.3, \lambda_{latent}=0.2\))
- \(D_{fine}\) is trained separately for 10 epochs
- \(MLP_{map}\): 10-layer MLP with skip connections, optimized with AdamW
- Training data: Approximately 20,000 unposed garments from [20], belonging to 19 categories, with 12 categories for training and 7 for testing.
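The two stage objectives can be written as simple weighted sums using the coefficients listed above. This is a sketch: the individual terms are passed in as precomputed numbers, and reusing \(\lambda_{dist}\) and \(\lambda_{grad}\) unchanged in the fine stage is an assumption, since the summary gives weights only for the coarse stage.

```python
# Loss weights reported for the coarse stage.
LAMBDAS = {"dist": 1.0, "grad": 0.3, "latent": 0.2}

def coarse_loss(l_dist, l_grad, l_latent, w=LAMBDAS):
    """L_coarse = lambda_dist * L_dist + lambda_grad * L_grad + lambda_latent * L_latent."""
    return w["dist"] * l_dist + w["grad"] * l_grad + w["latent"] * l_latent

def fine_loss(l_dist, l_grad, w=LAMBDAS):
    """L_fine = lambda_dist * L_dist + lambda_grad * L_grad.
    Assumption: the same lambda_dist and lambda_grad are reused here."""
    return w["dist"] * l_dist + w["grad"] * l_grad
```

The fine stage omits the latent term because \(D_{fine}\) is trained separately, after the encoder and \(D_{coarse}\) have already shaped the latent space.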
Key Experimental Results
Main Results: Garment Geometry Quality
| Method | CD↓ | P2S↓ |
|---|---|---|
| DrapeNet | 1.796 | 0.573 |
| Ours (Single Stage) | 1.631 | 0.494 |
| Ours (Full) | 1.078 | 0.329 |
Compared to DrapeNet, the full model reduces CD by approximately 40% (1.796 to 1.078) and P2S by approximately 43% (0.573 to 0.329).
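The relative reductions follow directly from the table values; a small check, with a hypothetical helper name:

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0

cd_reduction = relative_reduction(1.796, 1.078)   # DrapeNet vs. Ours (Full), CD
p2s_reduction = relative_reduction(0.573, 0.329)  # DrapeNet vs. Ours (Full), P2S

print(f"CD reduced by {cd_reduction:.1f}%")
print(f"P2S reduced by {p2s_reduction:.1f}%")
```

This gives about 40.0% for CD and 42.6% for P2S.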
Cross-Dataset Generalization (CLOTH3D)
| Method | CD (topwear)↓ | P2S (topwear)↓ |
|---|---|---|
| DrapeNet (trained on CLOTH3D) | 1.522 | 0.631 |
| Ours (trained on [20]) | 1.491 | 0.635 |
Even when trained only on the [20] dataset and tested on CLOTH3D, the model achieves a slightly lower CD (1.491 vs. 1.522) and an essentially equal P2S (0.635 vs. 0.631) compared with DrapeNet trained directly on CLOTH3D.
Texture Synthesis Comparison
| Method | CLIP Score (ViT-H/14)↑ | Time per prompt↓ |
|---|---|---|
| Text2Tex | 0.263 ± 0.047 | ~5 min |
| WordRobe | 0.304 ± 0.043 | ~22 sec |
WordRobe is roughly 13 times faster (~22 s vs. ~5 min per prompt), with a higher CLIP score and better view consistency.
Ablation Study
| Configuration | CD↓ | P2S↓ | Description |
|---|---|---|---|
| w/o \(\mathcal{L}_{grad}\) | 1.886 | 0.612 | Lacks gradient regularization |
| w/o \(\mathcal{L}_{latent}\) | 1.094 | 0.331 | Disentanglement loss is auxiliary but not decisive |
| Full | 1.078 | 0.329 | Optimal |
Interpolation evaluation:
| Configuration | \(\Delta_{area}\)↓ | \(\Delta_{vol}\)↓ |
|---|---|---|
| w/o \(\mathcal{L}_{latent}\) | 0.028 | 1.275 |
| with \(\mathcal{L}_{latent}\) | 0.022 | 1.206 |
Key Findings
- The coarse-to-fine two-stage decoding significantly reduces surface noise and holes.
- \(\mathcal{L}_{grad}\) plays a critical regularizing role in reducing high-frequency noise.
- \(\mathcal{L}_{latent}\) has little impact on CD/P2S but significantly improves the quality of interpolation.
- In user studies, 63% of users prefer WordRobe (vs. 27% preferring the DreamFusion variant).
Highlights & Insights
- Weakly supervised CLIP mapping scheme: Utilizes ControlNet generation \(\rightarrow\) CLIP encoding to construct training pairs, completely avoiding manual text annotation. The mechanism is clever and generalizable.
- View-composited property of ControlNet: This empirical finding is highly practical—rendering multi-view depth maps into a single composite image and feeding it to ControlNet naturally preserves view consistency in the output. This represents a new paradigm for texture generation without requiring multi-view optimization.
- Practicality of canonical T-pose generation: Directly interfaces with standard animation/simulation pipelines (rigging, skinning, cloth simulation), offering high industrial application value.
- CLIP arithmetic for latent editing: Leverages CLIP text-text vector arithmetic to automatically locate dimensions in the latent space that need modification, eliminating the need for explicit labeling.
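One plausible reading of the latent-editing idea is sketched below; it is not taken from the paper. It assumes that a source prompt and a target prompt have each been passed through the CLIP text encoder and the mapping network to obtain latent codes, and that the dimensions with the largest change are the ones to edit. All names, the top-k selection rule, and the `strength` parameter are hypothetical.

```python
def top_changed_dims(phi_source, phi_target, k=3):
    """Return the k latent dimensions that differ most between the two codes,
    together with the full per-dimension difference."""
    deltas = [t - s for s, t in zip(phi_source, phi_target)]
    order = sorted(range(len(deltas)), key=lambda i: abs(deltas[i]), reverse=True)
    return order[:k], deltas

def edit_latent(phi, phi_source, phi_target, k=3, strength=1.0):
    """Shift only the selected dimensions of `phi`, leaving the others untouched."""
    dims, deltas = top_changed_dims(phi_source, phi_target, k)
    edited = list(phi)
    for i in dims:
        edited[i] += strength * deltas[i]
    return edited
```

Restricting the edit to a few dimensions relies on the disentangled latent space: if dimensions were strongly correlated, changing only some of them could alter attributes the prompt did not mention.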
Limitations & Future Work
- Front-back orthographic projection loses texture information in tangent regions, requiring inpainting which can lead to blurry seams.
- The implicit UDF representation struggles to model fine-grained geometric details (pockets, buttons, etc.).
- Texture synthesis contains baked-in shadows/lighting/edge hallucinations, which limits its applicability to new illumination environments.
- Handles only single-piece garments, with no support for layered clothing yet.
- Both training and evaluation data are synthetic (CLOTH3D, [20]), and generalization to real-world garments remains to be verified.
Related Work & Insights
- vs DrapeNet: Both learn a garment UDF latent space, but DrapeNet has no textures, no text control, and requires explicit labels for editing. WordRobe addresses all three gaps.
- vs Text2Tex: Text2Tex uses progressive multi-view inpainting for texture generation, which is slow and view-inconsistent. WordRobe is a single-pass forward approach and is 13x faster.
- vs DreamBooth3D/DreamFusion: General text-to-3D methods built on NeRF/SDF representations cannot handle the open surfaces of clothing, and their geometric quality is far inferior to specialized methods.
Rating
- Novelty: ⭐⭐⭐⭐ The first text-driven unposed textured 3D garment generation framework, featuring three major innovations: coarse-to-fine, weakly supervised CLIP mapping, and view-composited texture.
- Experimental Thoroughness: ⭐⭐⭐⭐ Quantitative comparisons, user studies, cross-dataset generalization, and ablation analyses are comprehensively covered.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, complete formulation of the method, and rich illustrations.
- Value: ⭐⭐⭐⭐ High industrial application value; the concept of view-composited texture generation can be widely generalized.