PartCraft: Crafting Creative Objects by Parts¶

Conference: ECCV 2024
arXiv: 2407.04604
Code: https://github.com/kamwoh/partcraft
Area: Others (Generative AI / Controllable Generation)
Keywords: Part-level control, text-to-image generation, textual inversion, attention loss, creative generation

TL;DR¶

This work proposes PartCraft, which achieves part-selection-based control for text-to-image generation for the first time. Users can "pick" different parts (such as a bird's head, wings, and body) from various objects, and the model naturally combines them into a novel and structurally coherent creative object.

Background & Motivation¶

Current creative control in generative AI (such as Stable Diffusion) primarily relies on textual descriptions or sketches, but:
- Imprecise Text Control: Complex visual details are difficult to describe precisely in language, leading to generation results that deviate from expectations.
- High Sketch Barrier: Not all users possess detailed drawing skills.
- Limitations of Prior Work: Methods like DreamBooth and Textual Inversion learn the "entire object" as a unit, making them unable to achieve part-level combinatorial control.
- Complex Extra Control Signals: Methods based on bounding boxes or segmentation masks require extensive extra inputs.

Core Motivation: Human creativity often recombines different parts of existing concepts—for instance, wanting an "ideal bird" with a bluebird's head, a cardinal's wings, and a sparrow's body. PartCraft allows users to achieve such creative combinations through simple part "selection".

Method¶

Overall Architecture¶

PartCraft is built on Stable Diffusion v1.5 and adopts a Textual Inversion strategy. The overall workflow is as follows:
1. Unsupervised Part Discovery: Leveraging DINOv2 features for three-tier hierarchical clustering to decompose objects into semantic parts.
2. Part Encoding: Mapping each part into the text token space.
3. Attention-Loss-Based Training: Ensuring that all parts are correctly placed in the image and do not overlap with each other.
4. Bottleneck Encoder: Accelerating convergence and enhancing generation fidelity.

Key Designs¶

Unsupervised Part Discovery (Three-Tier Hierarchical Clustering):
- DINOv2 is utilized to extract feature maps from all training images.
- Top Tier: K-means (k=2) separates foreground and background.
- Middle Tier: Clusters foreground patches into \(M\) semantic parts (e.g., bird's head, wings, etc.).
- Bottom Tier: Further subdivides each middle-tier cluster into \(K\) sub-categories (corresponding to the same part of different species).
- The regions of each image are thus labeled with the set \(p = \{(0, k_0), (1, k_1), \dots, (M, k_M)\}\), where \(k_m\) is the bottom-tier cluster index of part \(m\); this set serves as the text description during training.
- DINOv2 is preferred over models like VLPart for its higher flexibility and domain generalization capability.
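
The three tiers above can be sketched on toy data. This is a minimal illustration, not the authors' implementation: it uses a plain Lloyd's k-means on hand-made 2-D "patch features" in place of DINOv2 features, clusters at the patch level throughout, assigns background the single label `(0, 0)`, and assumes (hypothetically) that the smaller of the two top-tier clusters is the foreground. The names `kmeans` and `discover_parts` are illustrative only.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def discover_parts(patches, num_parts, num_sub):
    """Assign each patch a (part, sub_cluster) label; background is (0, 0)."""
    # Top tier: k=2 split. ASSUMPTION: the smaller cluster is the foreground.
    top = kmeans(patches, 2)
    fg_id = min((0, 1), key=top.count)
    fg_idx = [i for i, t in enumerate(top) if t == fg_id]
    labels = [(0, 0)] * len(patches)
    # Middle tier: split foreground patches into `num_parts` semantic parts.
    mid = kmeans([patches[i] for i in fg_idx], num_parts)
    for m in range(num_parts):
        part_idx = [i for i, lab in zip(fg_idx, mid) if lab == m]
        # Bottom tier: split each part into `num_sub` finer sub-clusters.
        sub = kmeans([patches[i] for i in part_idx], num_sub)
        for i, k in zip(part_idx, sub):
            labels[i] = (m + 1, k)
    return labels


# Toy features: two parts (x near 0 and x near 20), each with two sub-clusters,
# plus a distant background blob.
fg = [(x + dx, dy) for x in (0.0, 3.0, 20.0, 23.0) for dx, dy in ((0.0, 0.0), (0.2, 0.1))]
bg = [(100.0 + 0.1 * i, 50.0) for i in range(10)]
labels = discover_parts(fg + bg, num_parts=2, num_sub=2)
```

In this toy setup, patches that sit close together receive the same `(part, sub_cluster)` pair, which mirrors how the same part of different species ends up in different bottom-tier clusters.
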
Part Token Bottleneck Encoder:
- Traditional textual inversion learns each word embedding \(e(p)\) directly, so the tokens are optimized independently and share no information, which makes learning inefficient.
- PartCraft instead passes each embedding through a bottleneck network \(f(\cdot)\), a two-layer MLP with ReLU: \(y_p = f(e(p))\).
- Core Idea: First project tokens into a shared "part category space" (e.g., a general representation of a "head"), and then fine-tune it to adapt to specific details.
- Experiments show significantly accelerated convergence (traditional methods are a special case where \(f\) is the identity function).
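
To make \(y_p = f(e(p))\) concrete, here is a minimal sketch of a two-layer MLP with ReLU applied to a table of part embeddings. The dimensions, the narrow hidden width, the random initialization, and the names `BottleneckEncoder` and `embedding_table` are assumptions for illustration; the source only states that \(f\) is a two-layer MLP with ReLU.

```python
import random


def linear(x, weight, bias):
    """y = W x + b for a single vector x (weight is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_linear(rng, out_dim, in_dim):
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class BottleneckEncoder:
    """f(e) = W2 relu(W1 e + b1) + b2, shared by every part token."""

    def __init__(self, embed_dim, hidden_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_linear(rng, hidden_dim, embed_dim)
        self.w2, self.b2 = init_linear(rng, out_dim, hidden_dim)

    def __call__(self, e):
        return linear(relu(linear(e, self.w1, self.b1)), self.w2, self.b2)


# Hypothetical toy sizes: 2 parts x 3 sub-clusters, 8-dim embeddings.
rng = random.Random(1)
embedding_table = {
    (m, k): [rng.gauss(0.0, 1.0) for _ in range(8)] for m in (1, 2) for k in range(3)
}
f = BottleneckEncoder(embed_dim=8, hidden_dim=4, out_dim=8)
y_p = f(embedding_table[(1, 2)])  # projected token for part 1, sub-cluster 2
```

Because all tokens pass through the same weights, a gradient step for one part token also updates the projection used by every other token, which is one way to read the "shared part category space" idea. Replacing `f` with `lambda e: e` recovers plain textual inversion.
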
Entropy-Based Normalized Attention Loss:
- Training solely with the diffusion loss \(\mathcal{L}_{ldm}\) leads to part entanglement, because the head and body of the same species always co-occur in the training images.
- Formulate an attention regularization in the form of cross-entropy: \(\mathcal{L}_{attn} = \mathbb{E}_{z,t,m}[-(S_m \log \hat{A}_m + (1-S_m)\log(1-\hat{A}_m))]\)
- Here, \(\hat{A}_m\) represents the attention map normalized across all parts, and \(S_m\) is the segmentation mask of the \(m\)-th part.
- Role of normalization: the attention values of all parts sum to 1 at each image location, so parts compete for every location; together with the cross-entropy target \(S_m\), this pushes each location toward being claimed by a single part.
- Compared to the MSE attention loss of Break-a-Scene, the cross-entropy loss directly encodes the constraint that only one part appears at each location.
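
The loss above can be sketched in a few lines. This is an illustrative reading, not the authors' code: it assumes \(\hat{A}_m\) is obtained by dividing each part's raw attention by the sum over all parts at the same location, replaces the expectation with a mean over parts and locations, and clamps values so the logarithms stay finite. Attention maps and masks are flattened to 1-D lists for brevity.

```python
import math


def normalized_attention_loss(attn, masks, eps=1e-6):
    """Cross-entropy between part masks and attention normalized across parts.

    attn[m][i]:  raw, non-negative attention of part token m at location i.
    masks[m][i]: 1 if location i belongs to part m, else 0.
    """
    num_parts, num_locs = len(attn), len(attn[0])
    total = 0.0
    for i in range(num_locs):
        denom = sum(attn[m][i] for m in range(num_parts)) + eps
        for m in range(num_parts):
            a_hat = attn[m][i] / denom               # normalize across parts
            a_hat = min(max(a_hat, eps), 1.0 - eps)  # keep log() finite
            s = masks[m][i]
            total -= s * math.log(a_hat) + (1 - s) * math.log(1.0 - a_hat)
    return total / (num_parts * num_locs)


masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
# Each part attends mostly to its own region.
aligned = normalized_attention_loss([[0.9, 0.9, 0.1, 0.1], [0.1, 0.1, 0.9, 0.9]], masks)
# Both parts attend everywhere (entangled tokens).
entangled = normalized_attention_loss([[0.9] * 4, [0.9] * 4], masks)
```

In the entangled case every normalized value is about 0.5, so the loss is about \(\log 2\), while the aligned case scores about \(-\log 0.9\). An unnormalized attention map could stay high for both parts at once; after normalization that is no longer possible, which is the competition effect described above.
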

Loss & Training¶

Total loss: \(\mathcal{L}_{total} = \mathcal{L}_{ldm} + \lambda_{attn} \mathcal{L}_{attn}\), where \(\lambda_{attn} = 0.01\)

LoRA is used to fine-tune cross-attention blocks (instead of the full model) to reduce training overhead.
The attention loss is computed on cross-attention maps at the 16×16 resolution, which carry rich semantic information.
\(M=5\) parts (head, chest/belly, wings, legs, tail) are set for birds, and \(M=7\) parts are set for dogs.
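\(K=256\) sub-clusters per part are used, which is enough to cover all fine-grained categories (CUB-200-2011 has 200 species and Stanford Dogs has 120 breeds).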
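
With \(M=5\) and \(K=256\), a user's part selection reduces to a handful of \((m, k_m)\) pairs. The sketch below shows one way such a selection could be represented; the part ordering, the flat index `m * K + k`, and the helper names `token_id` and `compose` are hypothetical and are not taken from the paper or its code.

```python
PART_NAMES = ["head", "chest/belly", "wings", "legs", "tail"]  # M = 5 bird parts
K = 256  # sub-clusters per part


def token_id(part_index, cluster_index, k=K):
    """Hypothetical flat index for the part token (m, k_m)."""
    if not 0 <= cluster_index < k:
        raise ValueError("cluster index out of range")
    return part_index * k + cluster_index


def compose(selection):
    """Map {part name: chosen cluster index} to sorted (m, k_m) pairs, m in 1..M."""
    pairs = [(PART_NAMES.index(name) + 1, k) for name, k in selection.items()]
    return sorted(pairs)


# Pick a head, wings, and tail from three different (made-up) clusters.
pairs = compose({"wings": 200, "head": 12, "tail": 7})
ids = [token_id(m, k) for m, k in pairs]
```

Parts left out of the selection (here chest/belly and legs) are simply absent from the pair list; how the model fills them in is outside the scope of this sketch.
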

Key Experimental Results¶

Main Results¶

Part reconstruction is evaluated on CUB-200-2011 (birds) and Stanford Dogs:

Method	FID↓	CLIP↑	DINO↑	EMR↑	CoSim↑
Textual Inversion	10.10	0.784	0.607	0.305	0.842
DreamBooth	12.94	0.775	0.594	0.355	0.856
Custom Diffusion	37.61	0.694	0.504	0.338	0.833
Break-a-Scene	20.05	0.742	0.549	0.390	0.854
PartCraft	12.86	0.783	0.618	0.460	0.882

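In EMR (Exact Match Rate), PartCraft exceeds the strongest baseline, Break-a-Scene, by 0.070 (0.460 vs. 0.390, i.e. 7 percentage points). In CoSim it leads the best baseline, DreamBooth, by 0.026 (0.882 vs. 0.856) and Break-a-Scene by 0.028. Textual Inversion retains the best FID (10.10) and a marginally higher CLIP score (0.784 vs. 0.783).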
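
The margins can be recomputed directly from the table. The snippet below copies the EMR and CoSim columns as given and reports, for each metric, the best-scoring baseline and PartCraft's absolute lead over it; the function name is illustrative.

```python
scores = {
    "Textual Inversion": {"EMR": 0.305, "CoSim": 0.842},
    "DreamBooth": {"EMR": 0.355, "CoSim": 0.856},
    "Custom Diffusion": {"EMR": 0.338, "CoSim": 0.833},
    "Break-a-Scene": {"EMR": 0.390, "CoSim": 0.854},
    "PartCraft": {"EMR": 0.460, "CoSim": 0.882},
}


def margin_over_best_baseline(metric, ours="PartCraft"):
    """Return (best baseline name, absolute lead of `ours`) for a higher-is-better metric."""
    baselines = {name: row[metric] for name, row in scores.items() if name != ours}
    best = max(baselines, key=baselines.get)
    return best, round(scores[ours][metric] - baselines[best], 3)


emr_margin = margin_over_best_baseline("EMR")      # ("Break-a-Scene", 0.07)
cosim_margin = margin_over_best_baseline("CoSim")  # ("DreamBooth", 0.026)
```
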

Ablation Study¶

Configuration	FID↓	EMR↑	CoSim↑	Description
Full PartCraft	12.86	0.460	0.882	Full Model
w/o Bottleneck	16.36	~0.460	~0.882	FID degrades by 3.5, generation quality drops
MSE attn loss (BaS)	-	Significantly drops	Significantly drops	EMR/CoSim degrade substantially
w/o both	-	Worst	Worst	Double degradation

Key Findings¶

More part combinations make it harder: As the number of mixed species increases from 1 to 4, both EMR and CoSim decrease.
PartCraft still significantly outperforms other methods under 4-species combinations. Although Break-a-Scene also uses attention loss, its effect is inferior to the normalized entropy loss of this work.
Word embedding space visualization (tSNE) shows that PartCraft's part tokens naturally cluster semantically (heads cluster together, wings cluster together), whereas other methods yield chaotic embeddings.
Cross-domain transfer: The learned dog parts can be transferred to cats/lions (e.g., "a cat with beagle ears"), and can also be used for creative generation (e.g., a bird-shaped robot).
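
EMR and CoSim appear throughout these results, but this summary does not spell out how they are computed. The sketch below is one plausible reading, stated as an assumption: EMR is taken as the fraction of parts whose cluster index in the generated image equals the requested index, and CoSim as the mean cosine similarity between the centroids of the requested and the obtained clusters. The function names and toy centroids are made up for illustration.

```python
import math


def exact_match_rate(requested, predicted):
    """Fraction of parts whose predicted cluster index equals the requested one."""
    return sum(r == p for r, p in zip(requested, predicted)) / len(requested)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def centroid_cosim(requested, predicted, centroids):
    """Mean cosine similarity between centroids of requested and predicted clusters."""
    sims = [cosine(centroids[r], centroids[p]) for r, p in zip(requested, predicted)]
    return sum(sims) / len(sims)


# Toy example: four parts, one of which lands in the wrong cluster.
centroids = {0: (1.0, 0.0), 1: (0.0, 1.0), 2: (1.0, 1.0), 3: (-1.0, 0.0)}
requested = [3, 1, 2, 0]
predicted = [3, 1, 0, 0]
emr = exact_match_rate(requested, predicted)
cosim = centroid_cosim(requested, predicted, centroids)
```

Under this reading, a part that lands in a neighboring cluster counts as a full miss for EMR but still earns partial credit in CoSim, which would explain why CoSim values in the tables sit much closer together than EMR values.
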

Highlights & Insights¶

Select-to-create: Simplifies the creative process to "clicking and selecting parts," without requiring text descriptions or drawing skills, making it elegant and practical.
Entropy-normalized attention loss is the core contribution—solving the entanglement problem in multi-part learning, with clear design motivation (each position belongs to only one part).
The bottleneck encoder design leverages shared part knowledge: different instances of the same semantic part (e.g., heads of different birds) share one representation space.
The unsupervised part discovery scheme is flexible and scalable, requiring no part annotations.

Limitations & Future Work¶

Part discovery relies on the self-supervised features of DINOv2, which imposes an inherent ceiling on accuracy; stronger encoders (such as improved versions of VLPart) could be introduced.
Combination results for small parts (e.g., tails, legs) are relatively poor, because these parts occupy small areas in the images, making both clustering and attention supervision less precise.
Cross-domain part combination (e.g., combining animal parts with car parts) is still in its infancy.
Only validated on Stable Diffusion v1.5; upgrading to newer models might yield better performance.

PartCraft is closest to Break-a-Scene, but the latter's MSE attention loss is less effective than the entropy loss proposed in this work.
The approach could inspire work in 3D generation: if similar part-level composition control could be achieved on 3D models, it would hold immense application value.
The part discovery module can be used independently to provide semantic segmentation for other fine-grained control tasks.

Rating¶

Novelty: ⭐⭐⭐⭐ (The idea of part selection -> creative generation is novel, though the core technique is based on existing frameworks)
Experimental Thoroughness: ⭐⭐⭐⭐ (Two datasets + quantitative & qualitative + ablation + visualization + rich transfer experiments)
Writing Quality: ⭐⭐⭐⭐⭐ (Motivating examples are intuitive, and method descriptions are clear)
Value: ⭐⭐⭐⭐ (Holds practical application potential for creative design fields)