ChatGarment: Garment Estimation, Generation and Editing via Large Language Models

Background & Motivation

3D garment modeling is widely used in virtual try-on, digital humans, gaming, and film production. Traditional 3D garment modeling relies on professional CAD software and trained fashion designers, which sets a very high barrier to entry. In recent years, deep learning-based methods have attempted to predict 3D garment shapes directly from images, but they face the following challenges:

Geometric Quality: Meshes reconstructed through end-to-end regression often lack detail, particularly in terms of folds and sewing lines.

Editability: The predicted 3D meshes are hard to edit parametrically afterwards (e.g., modifying sleeve length or neckline shape).

Physical Plausibility: Directly regressed shapes are not guaranteed to be physically simulatable, which makes them unsuitable for fabric simulation.

Language Interaction: It is challenging for users to control garment generation and editing through natural language descriptions.

GarmentCode provides a parametric representation framework for garments: each garment is defined by a set of JSON parameters, covering clothing type, measurements, and pattern details. This allows for the deterministic generation of sewing patterns, from which high-quality 3D garments are obtained via physical simulation. However, the parameter space of GarmentCode is complex (originally around 900 tokens), making it difficult to predict directly using neural networks.

This paper proposes ChatGarment, which integrates a vision-language model (VLM) with GarmentCode to achieve 3D garment generation and editing based on image or text inputs.

Method

GarmentCodeRC: Compressed Parameter Representation

The original JSON parameters of GarmentCode contain approximately 900 tokens, which is too long an output sequence for an LLM to predict reliably. This paper proposes GarmentCodeRC (Reduced Code), which compresses the parameters to around 350 tokens through the following strategies:

| Compression Strategy | Description | Token Reduction |
| --- | --- | --- |
| Redundant Parameter Removal | Delete redundant items that can be derived from other parameters | ~200 tokens |
| Numerical Precision Truncation | Reduce floating-point precision from 6 decimal places to 3 | ~150 tokens |
| Key Name Abbreviation | Substitute long key names with short key names | ~100 tokens |
| Default Value Omission | Omit parameters that are equal to their default values | ~100 tokens |

Total compression: \(900 \to 350\) tokens, a reduction of about 550 tokens (approximately 61%), with no reported loss in generation quality.
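The four strategies can be illustrated with a minimal sketch. The key names, abbreviation table, default values, and the "derivable" key below are all hypothetical; the actual GarmentCodeRC schema is not given in this summary. Rounding stands in for the precision reduction.

```python
import json

# Hypothetical tables for illustration only; not the real GarmentCodeRC schema.
KEY_ABBREVIATIONS = {"sleeve_length": "sl", "neckline_type": "nt", "waist_width": "ww"}
DEFAULTS = {"sleeve_length": 0.5, "neckline_type": "round", "waist_width": 1.0}
DERIVABLE_KEYS = {"sleeve_length_cm"}  # assumed recomputable from other parameters


def compress(params, precision=3):
    """Apply the four compression strategies to a flat parameter dict."""
    compact = {}
    for key, value in params.items():
        if key in DERIVABLE_KEYS:
            continue  # redundant parameter removal
        if isinstance(value, float):
            value = round(value, precision)  # precision reduction (6 -> 3 decimals)
        if DEFAULTS.get(key) == value:
            continue  # default value omission
        compact[KEY_ABBREVIATIONS.get(key, key)] = value  # key name abbreviation
    return compact


full = {
    "sleeve_length": 0.654321,
    "sleeve_length_cm": 39.25926,
    "neckline_type": "round",
    "waist_width": 0.912345,
}
compact = compress(full)
print(json.dumps(compact, separators=(",", ":")))  # {"sl":0.654,"ww":0.912}
```

Decoding would reverse the key mapping and restore omitted defaults, so only the precision reduction discards information.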

LLaVA Fine-tuning

ChatGarment is based on the LLaVA architecture and undergoes the following fine-tuning:

Input Modalities

  • Image Input: Garment photos or rendered images
  • Text Input: Natural language descriptions (e.g., "a V-neck short-sleeve dress")
  • Mixed Input: Image + editing instructions (e.g., "make the sleeves longer")

<ENDS> Token

For numerical parameters in JSON, standard text generation predicts each value digit by digit, which is inefficient and prone to error accumulation. This paper introduces a special <ENDS> token that marks the end of each numerical value:

"sleeve_length": 0.65<ENDS>

Functions of the <ENDS> token:

  1. Clarifying numerical boundaries to avoid generating unnecessary digits.
  2. Serving as a termination signal during decoding to improve the certainty of numerical predictions.
  3. Reducing cumulative error in numerical regression tasks.
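As a rough illustration of how such a terminator could be used on the text side, the sketch below serializes numbers with a trailing <ENDS> and recovers only the values that are properly terminated. The serialization format and the helper names `serialize` and `parse_numbers` are assumptions for illustration, not the paper's implementation.

```python
import re

ENDS = "<ENDS>"  # token name from the text; everything else here is assumed


def serialize(params):
    """Render a flat dict as key/value pairs, closing each number with <ENDS>."""
    parts = []
    for key, value in params.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            parts.append(f'"{key}": {value}{ENDS}')
        else:
            parts.append(f'"{key}": "{value}"')
    return ", ".join(parts)


# A number counts only if it is immediately followed by the terminator.
NUMBER_PATTERN = re.compile(r'"(\w+)":\s*(-?\d+(?:\.\d+)?)' + re.escape(ENDS))


def parse_numbers(text):
    """Recover the numeric fields that are terminated by <ENDS>."""
    return {key: float(num) for key, num in NUMBER_PATTERN.findall(text)}


text = serialize({"sleeve_length": 0.65, "neckline": "v", "skirt_length": 1.2})
print(text)
print(parse_numbers(text))
```

Because the parser requires the terminator, a value that was cut off mid-generation is skipped rather than silently accepted as a shorter number.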

Data Generation

| Data Type | Quantity | Description |
| --- | --- | --- |
| 3D Garment Models | 40,000 units | Parametric garments generated by GarmentCode |
| Multi-view Renderings | 1,000,000+ images | 25+ view renders per garment |
| Text Descriptions | 40,000 sentences | Automated generation + manual verification |
| Editing Instruction Pairs | 200,000+ pairs | (original parameters, editing instructions, target parameters) triplets |
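One way to picture an editing triplet is sketched below. The `EditTriplet` class, the `make_sleeve_edit` helper, the parameter names, and the instruction template are hypothetical; the summary does not describe how the pairs were actually produced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditTriplet:
    original: dict      # parameters before the edit
    instruction: str    # natural language editing instruction
    target: dict        # parameters after the edit


def make_sleeve_edit(original, delta):
    """Build one triplet by changing a single (hypothetical) parameter."""
    target = dict(original)  # copy so the original stays untouched
    target["sleeve_length"] = round(original["sleeve_length"] + delta, 3)
    direction = "longer" if delta > 0 else "shorter"
    return EditTriplet(original, f"make the sleeves {direction}", target)


triplet = make_sleeve_edit({"sleeve_length": 0.3, "neckline": "v"}, 0.2)
print(triplet.instruction)
print(triplet.target)
```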

Loss & Training

The training process is divided into three stages:

  1. Stage 1: Freeze the vision encoder, train the projection layer and LLM to enable the model to understand garment images.
  2. Stage 2: End-to-end fine-tuning to learn the mapping from images/text to GarmentCodeRC.
  3. Stage 3: Editing instruction fine-tuning to learn parameter modifications based on edit instructions.
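The staged schedule can be written down as a small table of which modules are updated in each stage. Only Stage 1 is taken from the text. The entries for Stages 2 and 3 are assumptions: "end-to-end" is read here as all modules trainable, and the summary does not say what Stage 3 freezes. The module names are illustrative.

```python
MODULES = ("vision_encoder", "projector", "llm")

# Stage 1 follows the text; stages 2 and 3 are assumptions (see lead-in).
STAGE_TRAINABLE = {
    1: {"projector", "llm"},
    2: {"vision_encoder", "projector", "llm"},
    3: {"vision_encoder", "projector", "llm"},
}


def requires_grad_flags(stage):
    """Return, per module, whether it is updated during the given stage."""
    trainable = STAGE_TRAINABLE[stage]
    return {name: name in trainable for name in MODULES}


for stage in (1, 2, 3):
    print(stage, requires_grad_flags(stage))
```

In a real training script, flags like these would typically be applied by toggling gradient tracking on each module's parameters before building the optimizer.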

Experimental Results

Garment Reconstruction Precision

| Method | Dress4D CD↓ (mm) | CAPE CD↓ (mm) | Parameter Accuracy↑ |
| --- | --- | --- | --- |
| SewFormer | 27.06 | 23.45 | 62.3% |
| NeuralTailor | 19.82 | 17.31 | 71.5% |
| DressCode | 8.47 | 7.92 | 83.1% |
| ChatGarment | 3.12 | 3.85 | 94.7% |

ChatGarment's Chamfer Distance (CD) on the Dress4D dataset is only 3.12 mm, an 88.5% reduction compared to SewFormer's 27.06 mm.
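For reference, the sketch below computes one common variant of the Chamfer Distance between two point sets and reproduces the relative reduction quoted above. The exact CD definition used in the evaluation (squared or unsquared distances, summed or averaged directions, sampling density) is not stated in this summary, so the symmetric mean form here is an assumption.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, averaged over both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(a, b) + one_way(b, a))


def relative_reduction(baseline, ours):
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


# Two toy point sets, offset by one unit along z.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
truth = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(pred, truth))             # 1.0
print(f"{relative_reduction(27.06, 3.12):.1%}")  # 88.5%
```

This brute-force version is quadratic in the number of points; evaluations on dense meshes would normally use a spatial index for the nearest-neighbour search.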

Text-to-Garment Generation

| Evaluation Dimension | ChatGarment | Baseline Methods |
| --- | --- | --- |
| Text Consistency (CLIP Score) | 0.312 | 0.247 |
| Geometric Quality (FID-3D) | 23.7 | 45.2 |
| User Preference Rate | 78.3% | 21.7% |

Garment Editing

ChatGarment supports multiple editing operations:

  • Local Editing: Modifying sleeve length, neckline, hemline, etc.
  • Style Transfer: Applying the style of one garment to another.
  • Semantic Editing: Modifying via natural language descriptions (e.g., "add a bow", "make the waist slimmer").

After an edit, only the parameters targeted by the instruction change; the parameters of the unedited parts stay the same.
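Because edits are expressed as parameter changes, this locality property can be checked mechanically by diffing the parameter dicts before and after an edit. The helpers `changed_keys` and `edit_is_local` and the parameter names are hypothetical, shown only to make the idea concrete.

```python
def changed_keys(before, after):
    """Return the set of parameter names whose values differ (including added or removed keys)."""
    keys = before.keys() | after.keys()
    return {k for k in keys if before.get(k) != after.get(k)}


def edit_is_local(before, after, allowed):
    """True if every changed parameter is one the instruction was allowed to touch."""
    return changed_keys(before, after) <= set(allowed)


before = {"sleeve_length": 0.3, "neckline": "v", "waist_width": 0.9}
after = {"sleeve_length": 0.5, "neckline": "v", "waist_width": 0.9}

print(changed_keys(before, after))
print(edit_is_local(before, after, allowed={"sleeve_length"}))
```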

Conclusion & Outlook

ChatGarment integrates VLMs with parametric garment representations. Through GarmentCodeRC compression (900→350 tokens), the <ENDS> numerical termination token, and large-scale data construction, it achieves high-quality 3D garment generation. On Dress4D it reaches a CD of 3.12 mm versus SewFormer's 27.06 mm, an 88.5% reduction. The system enables non-expert users to create and edit 3D garments via natural language interaction.