ChatGarment: Garment Estimation, Generation and Editing via Large Language Models

Background & Motivation

3D garment modeling holds significant application value in virtual try-on, digital humans, gaming, and film production. Traditional 3D garment modeling relies on professional CAD software and fashion designers, resulting in an extremely high barrier to entry. In recent years, deep learning-based methods have attempted to directly predict 3D garment shapes from images, but they face the following challenges:

Geometric Quality: Meshes reconstructed through end-to-end regression often lack detail, particularly in terms of folds and sewing lines.

Editability: The predicted 3D meshes are difficult to subject to subsequent parametric editing (e.g., modifying sleeve length, neckline shape).

Physical Plausibility: Directly regressed shapes do not guarantee physical simulation viability, making them unsuitable for fabric simulation.

Language Interaction: It is challenging for users to control garment generation and editing through natural language descriptions.

GarmentCode provides a parametric representation framework for garments: each garment is defined by a set of JSON parameters, covering clothing type, measurements, and pattern details. This allows for the deterministic generation of sewing patterns, from which high-quality 3D garments are obtained via physical simulation. However, the parameter space of GarmentCode is complex (originally around 900 tokens), making it difficult to predict directly using neural networks.

This paper proposes ChatGarment, which integrates a vision-language model (VLM) with GarmentCode to achieve 3D garment generation and editing based on image or text inputs.

Method

GarmentCodeRC: Compressed Parameter Representation

The original JSON parameters of GarmentCode contain approximately 900 tokens, which is a long sequence for an LLM to predict reliably. This paper proposes GarmentCodeRC (Reduced Code), which compresses the parameters to around 350 tokens through the following strategies:

Compression Strategy	Description	Token Reduction
Redundant Parameter Removal	Delete redundant items that can be derived from other parameters	~200 tokens
Numerical Precision Truncation	Reduce floating-point precision from 6 decimal places to 3	~150 tokens
Key Name Abbreviation	Substitute long key names with short key names	~100 tokens
Default Value Omission	Omit parameters that are equal to their default values	~100 tokens

Total compression: \(900 \to 350\) tokens (a reduction of approximately 61%), reportedly without degrading generation quality.
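The four strategies in the table can be illustrated with a minimal sketch. The key names, abbreviations, default values, and the derived key below are hypothetical placeholders, since the actual GarmentCodeRC tables are not given here; only the four steps and the token figures come from the text.

```python
import json

# Hypothetical lookup tables; the real GarmentCodeRC mappings are not specified here.
KEY_ABBREVIATIONS = {"sleeve_length": "sl", "neckline_type": "nk", "waist_width": "ww"}
DEFAULTS = {"sleeve_length": 0.5, "neckline_type": "round", "waist_width": 0.4}
DERIVED_KEYS = {"sleeve_length_cm"}  # assumed to be derivable from other parameters


def compress(params: dict, precision: int = 3) -> dict:
    """Apply the four compression strategies to a flat parameter dict."""
    compact = {}
    for key, value in params.items():
        if key in DERIVED_KEYS:
            continue  # redundant parameter removal
        if isinstance(value, float):
            value = round(value, precision)  # numerical precision truncation
        if key in DEFAULTS and value == DEFAULTS[key]:
            continue  # default value omission
        compact[KEY_ABBREVIATIONS.get(key, key)] = value  # key name abbreviation
    return compact


full = {
    "sleeve_length": 0.654321,
    "sleeve_length_cm": 39.259260,
    "neckline_type": "round",
    "waist_width": 0.412345,
}
short = compress(full)
print(json.dumps(short))  # {"sl": 0.654, "ww": 0.412}

# Token budget check using the approximate savings listed in the table.
savings = [200, 150, 100, 100]
remaining = 900 - sum(savings)
print(remaining)  # 350
```

Because omitted defaults and removed derived values can be restored from the lookup tables, a decoder built on the same tables could expand the compact form back into a full parameter set, up to the truncated precision.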

LLaVA Fine-tuning

ChatGarment is based on the LLaVA architecture and undergoes the following fine-tuning:

Input Modalities

Image Input: Garment photos or rendered images
Text Input: Natural language descriptions (e.g., "a V-neck short-sleeve dress")
Mixed Input: Image + editing instructions (e.g., "make the sleeves longer")

<ENDS> Token

For numerical parameters in JSON, standard text generation (predicting a value digit by digit) is inefficient and prone to error accumulation. This paper introduces a special <ENDS> token to mark the end of each numerical value:

"sleeve_length": 0.65<ENDS>

Functions of the <ENDS> token:

1. Clarifying numerical boundaries to avoid generating unnecessary digits.
2. Serving as a termination signal during decoding to improve the certainty of numerical predictions.
3. Reducing cumulative error in numerical regression tasks.
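As a rough illustration of the text format only, the sketch below appends the marker after each numeric value and strips it again to recover plain JSON. The helper names `add_ends` and `strip_ends` are hypothetical, and how the token is handled inside the model is not described here.

```python
import json
import re

ENDS = "<ENDS>"
# Matches a number immediately followed by the marker, e.g. 0.65<ENDS>
_NUMBER_WITH_ENDS = re.compile(r"(-?\d+(?:\.\d+)?)" + re.escape(ENDS))


def add_ends(params: dict) -> str:
    """Serialize a flat dict, appending <ENDS> after every numeric value."""
    parts = []
    for key, value in params.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            parts.append(f'"{key}": {value}{ENDS}')
        else:
            parts.append(f'"{key}": "{value}"')
    return "{" + ", ".join(parts) + "}"


def strip_ends(text: str) -> str:
    """Remove <ENDS> markers that directly follow a number."""
    return _NUMBER_WITH_ENDS.sub(r"\1", text)


params = {"sleeve_length": 0.65, "neckline": "v"}
marked = add_ends(params)
print(marked)  # {"sleeve_length": 0.65<ENDS>, "neckline": "v"}
restored = json.loads(strip_ends(marked))
```

The round trip returns the original dictionary, which shows that the marker adds an explicit boundary without changing the underlying values.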

Data Generation

Data Type	Quantity	Purpose
3D Garment Models	40,000 units	Parametric garments generated by GarmentCode
Multi-view Renderings	1,000,000+ images	25+ view renders per garment
Text Descriptions	40,000 sentences	Automated generation + manual verification
Editing Instruction Pairs	200,000+ pairs	(original parameters, editing instructions, target parameters) triplets
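One way to picture an editing sample is as a small record with the three fields named in the table. The class name `EditTriplet`, the field names, and the example values are hypothetical; the snippet also checks that 40,000 garments at 25 views each give the 1,000,000 renders listed above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EditTriplet:
    """One editing sample: (original parameters, instruction, target parameters)."""
    original: dict
    instruction: str
    target: dict


# Illustrative values only; not taken from the dataset.
sample = EditTriplet(
    original={"sleeve_length": 0.30},
    instruction="make the sleeves longer",
    target={"sleeve_length": 0.65},
)
line = json.dumps(asdict(sample))  # one JSON line per training sample

# Dataset size check from the table: garments x views per garment.
garments = 40_000
views_per_garment = 25
renders = garments * views_per_garment
print(renders)  # 1000000
```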

Loss & Training

The training process is divided into three stages:

1. Stage 1: Freeze the vision encoder and train the projection layer and LLM so that the model learns to understand garment images.
2. Stage 2: End-to-end fine-tuning to learn the mapping from images/text to GarmentCodeRC.
3. Stage 3: Editing instruction fine-tuning to learn parameter modifications based on edit instructions.
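The staged schedule can be written down as a small configuration sketch. The module labels (`vision_encoder`, `projector`, `llm`) are illustrative names, and the trainable sets for Stages 2 and 3 are assumptions: the text only states that Stage 1 freezes the vision encoder and that Stage 2 is end-to-end.

```python
from dataclasses import dataclass

ALL_MODULES = ("vision_encoder", "projector", "llm")  # illustrative labels


@dataclass(frozen=True)
class Stage:
    name: str
    goal: str
    trainable: tuple  # modules updated during this stage


STAGES = (
    # Stated in the text: vision encoder frozen, projector and LLM trained.
    Stage("stage1", "understand garment images", ("projector", "llm")),
    # Assumption: "end-to-end" is read as updating every module.
    Stage("stage2", "map images/text to GarmentCodeRC", ALL_MODULES),
    # Assumption: the text does not say which modules Stage 3 updates.
    Stage("stage3", "apply edit instructions", ALL_MODULES),
)


def frozen_modules(stage: Stage) -> tuple:
    """Return the modules that receive no gradient updates in this stage."""
    return tuple(m for m in ALL_MODULES if m not in stage.trainable)


for stage in STAGES:
    print(stage.name, "frozen:", frozen_modules(stage))
```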

Experimental Results

Garment Reconstruction Precision

Method	Dress4D CD↓ (mm)	CAPE CD↓ (mm)	Parameter Accuracy↑
SewFormer	27.06	23.45	62.3%
NeuralTailor	19.82	17.31	71.5%
DressCode	8.47	7.92	83.1%
ChatGarment	3.12	3.85	94.7%

On the Dress4D dataset, ChatGarment reaches a Chamfer Distance of only 3.12 mm, an 88.5% reduction relative to SewFormer's 27.06 mm.
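The 88.5% figure follows from (27.06 − 3.12) / 27.06. The sketch below checks that arithmetic and shows one common form of the Chamfer Distance on toy point sets. Chamfer definitions vary (squared versus unsquared distances, sums versus means), and the exact variant used in the evaluation is not stated here, so the symmetric mean nearest-neighbour form is an assumption.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


def relative_reduction(baseline: float, value: float) -> float:
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


# Toy point sets: two parallel segments one unit apart.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(pred, gt))  # 2.0

# Reduction of ChatGarment's Dress4D CD relative to SewFormer's.
print(round(100 * relative_reduction(27.06, 3.12), 1))  # 88.5
```

The brute-force nearest-neighbour search is quadratic in the number of points, which is fine for a toy check but would normally be replaced by a spatial index for full meshes.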

Text-to-Garment Generation

Evaluation Dimension	ChatGarment	Baseline Methods
Text Consistency (CLIP Score)	0.312	0.247
Geometric Quality (FID-3D)	23.7	45.2
User Preference Rate	78.3%	21.7%

Garment Editing

ChatGarment supports multiple editing operations:

- Local Editing: Modifying sleeve length, neckline, hemline, etc.
- Style Transfer: Applying the style of one garment to another.
- Semantic Editing: Modifying via natural language descriptions (e.g., "add a bow", "make the waist slimmer").

After editing, the changed parameters reflect the requested modification, while the parameters of unedited parts remain unchanged.
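Because garments are represented as parameters, this locality property can be checked with a simple dictionary comparison. The following is a minimal sketch with hypothetical helper and parameter names, not the evaluation procedure used in the paper.

```python
_MISSING = object()  # sentinel so that a missing key differs from an explicit None


def changed_keys(original: dict, edited: dict) -> set:
    """Return the parameter names whose values differ between the two dicts."""
    keys = original.keys() | edited.keys()
    return {
        k for k in keys
        if original.get(k, _MISSING) != edited.get(k, _MISSING)
    }


def edit_is_local(original: dict, edited: dict, allowed) -> bool:
    """True if every changed parameter is in the set the instruction may touch."""
    return changed_keys(original, edited) <= set(allowed)


# Illustrative parameters for the instruction "make the sleeves longer".
before = {"sleeve_length": 0.30, "neckline": "v", "waist_width": 0.40}
after = {"sleeve_length": 0.65, "neckline": "v", "waist_width": 0.40}

print(changed_keys(before, after))  # {'sleeve_length'}
print(edit_is_local(before, after, allowed={"sleeve_length"}))  # True
```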

Conclusion & Outlook

ChatGarment combines a VLM with a parametric garment representation. Through GarmentCodeRC compression (900→350 tokens), the <ENDS> numerical termination token, and large-scale data construction, it generates high-quality 3D garments from images or text. On Dress4D it reaches a Chamfer Distance of 3.12 mm, compared with 27.06 mm for SewFormer. The system lets non-expert users create and edit 3D garments through natural language interaction.