PartCraft: Crafting Creative Objects by Parts
Conference: ECCV 2024
arXiv: 2407.04604
Code: https://github.com/kamwoh/partcraft
Area: Others (Generative AI / Controllable Generation)
Keywords: Part-level control, text-to-image generation, textual inversion, attention loss, creative generation
TL;DR
This work proposes PartCraft, which achieves part-selection-based control for text-to-image generation for the first time. Users can "pick" different parts (such as a bird's head, wings, and body) from various objects, and the model naturally combines them into a novel and structurally coherent creative object.
Background & Motivation
Current creative control in generative AI (such as Stable Diffusion) primarily relies on textual descriptions or sketches, but:
- Imprecise Text Control: Complex visual details are difficult to describe precisely in language, leading to generation results that deviate from expectations.
- High Sketch Barrier: Not all users possess detailed drawing skills.
- Limitations of Prior Work: Methods like DreamBooth and Textual Inversion learn the "entire object" as a unit, making them unable to achieve part-level combinatorial control.
- Complex Extra Control Signals: Methods based on bounding boxes or segmentation masks require extensive extra inputs.
Core Motivation: Human creativity often recombines parts of existing concepts, for instance an "ideal bird" with a bluebird's head, a cardinal's wings, and a sparrow's body. PartCraft lets users build such combinations by simply selecting parts.
Method
Overall Architecture
PartCraft is built on Stable Diffusion v1.5 and adopts a Textual Inversion strategy. The overall workflow is as follows:
1. Unsupervised Part Discovery: Three-tier hierarchical clustering on DINOv2 features decomposes objects into semantic parts.
2. Part Encoding: Each part is mapped into the text token space.
3. Attention-Loss-Based Training: Ensures that all parts are correctly placed in the image and do not overlap with each other.
4. Bottleneck Encoder: Accelerates convergence and enhances generation fidelity.
Key Designs
Unsupervised Part Discovery (Three-Tier Hierarchical Clustering):
- DINOv2 is utilized to extract feature maps from all training images.
- Top Tier: K-means (k=2) separates foreground and background.
- Middle Tier: Clusters foreground patches into \(M\) semantic parts (e.g., bird's head, wings, etc.).
- Bottom Tier: Further subdivides each middle-tier cluster into \(K\) sub-categories (corresponding to the same part of different species).
- Each image is thus described by a set of (part, sub-category) cluster labels \(p = \{(0, k_0), (1, k_1), \dots, (M, k_M)\}\), one per region, which serves as its text description during training.
- DINOv2 is preferred over models like VLPart for its higher flexibility and domain generalization capability.
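The three tiers can be sketched on toy data as follows. This is only an illustrative sketch, not the authors' code: random vectors stand in for DINOv2 patch features, `kmeans` and `discover_parts` are hypothetical helper names, the choice of which top-tier cluster counts as foreground is arbitrary, and the bottom tier is assumed to cluster one mean descriptor per (image, part) pair. The background entry (index 0) is omitted for brevity.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one integer label per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(0)
    dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return dist.argmin(1)

def discover_parts(feats, image_ids, M, K):
    """Return {image_id: {m: k_m}} with part index m in 1..M."""
    # Top tier: split all patches into two groups. Treating cluster 1 as the
    # foreground is arbitrary here; a real pipeline needs a rule for this.
    fg = kmeans(feats, 2) == 1
    fg_feats, fg_ids = feats[fg], image_ids[fg]
    # Middle tier: M semantic parts over the foreground patches.
    part = kmeans(fg_feats, M, seed=1)
    labels = {}
    for m in range(M):
        sel = part == m
        if not sel.any():
            continue
        imgs = np.unique(fg_ids[sel])
        # One descriptor per (image, part): mean of that part's patches (assumption).
        desc = np.stack([fg_feats[sel & (fg_ids == i)].mean(0) for i in imgs])
        # Bottom tier: up to K sub-categories of this part.
        sub = kmeans(desc, min(K, len(desc)), seed=2)
        for i, k_m in zip(imgs, sub):
            labels.setdefault(int(i), {})[m + 1] = int(k_m)
    return labels

rng = np.random.default_rng(0)
feats = rng.normal(size=(8 * 64, 16))       # 8 toy images x 64 patches, 16-dim features
image_ids = np.repeat(np.arange(8), 64)
labels = discover_parts(feats, image_ids, M=3, K=4)
```

Each entry of `labels` plays the role of the per-image label set \(p\): a mapping from part index to the sub-category assigned to that part in that image.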
Part Token Bottleneck Encoder:
- Traditional textual inversion learns each word embedding \(e(p)\) directly; tokens are optimized independently, with no information shared between them, which makes learning inefficient.
- PartCraft introduces a bottleneck network \(f(\cdot)\), a two-layer MLP with ReLU, so that \(y_p = f(e(p))\).
- Core Idea: First project tokens into a shared "part category space" (e.g., a general representation of a "head"), and then fine-tune it to adapt to specific details.
- Experiments show significantly accelerated convergence (traditional methods are a special case where \(f\) is the identity function).
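A minimal forward-pass sketch of \(y_p = f(e(p))\) is shown below, assuming a lookup table with one embedding per (part, sub-category) pair and a single MLP shared by all tokens. The class name `BottleneckEncoder` and the sizes `embed_dim` and `hidden_dim` are hypothetical; the output size of 768 matches the text-embedding width of the CLIP text encoder used by Stable Diffusion v1.5. Training is not shown.

```python
import numpy as np

class BottleneckEncoder:
    """y_p = f(e(p)) with f = Linear -> ReLU -> Linear (forward pass only)."""

    def __init__(self, num_parts, num_sub, embed_dim=64, hidden_dim=32,
                 out_dim=768, seed=0):
        rng = np.random.default_rng(seed)
        # e(p): one learnable embedding per (part m, sub-category k) pair.
        self.table = rng.normal(0, 0.02, size=(num_parts, num_sub, embed_dim))
        # f: weights shared by every part token, so tokens no longer learn in isolation.
        self.w1 = rng.normal(0, 0.02, size=(embed_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0, 0.02, size=(hidden_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, parts):
        """parts: list of (m, k) index pairs -> array of shape (len(parts), out_dim)."""
        m, k = np.array(parts).T
        e = self.table[m, k]
        h = np.maximum(e @ self.w1 + self.b1, 0.0)   # ReLU
        return h @ self.w2 + self.b2

enc = BottleneckEncoder(num_parts=5, num_sub=256)
# Hypothetical selection: parts 0 and 1 from sub-category 12, part 2 from sub-category 87.
tokens = enc([(0, 12), (1, 12), (2, 87)])
```

Because `w1` and `w2` are shared, a gradient step for one part token also moves the mapping used by all others, which is one way to read the "shared part category space" idea.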
Entropy-Based Normalized Attention Loss:
- Training solely with the diffusion loss \(\mathcal{L}_{ldm}\) leads to part entanglement (since the head and body of the same species always appear in pairs).
- Formulate an attention regularization in the form of cross-entropy: \(\mathcal{L}_{attn} = \mathbb{E}_{z,t,m}[-(S_m \log \hat{A}_m + (1-S_m)\log(1-\hat{A}_m))]\)
- Here, \(\hat{A}_m\) represents the attention map normalized across all parts, and \(S_m\) is the segmentation mask of the \(m\)-th part.
- Why normalization matters: it forces the attention of all parts to sum to 1 at every image location, so parts compete for each location; together with the mask target \(S_m\), this pushes each location toward a single part.
- Compared to the MSE attention loss of Break-a-Scene, the entropy loss naturally fits the constraint of "only one part appearing."
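The loss above can be sketched in NumPy as follows, assuming the normalization is a plain division by the sum over parts and that every location in the toy segmentation belongs to exactly one part (background handling is left out). The function name `normalized_attention_loss` and the toy inputs are illustrative, not taken from the paper's code.

```python
import numpy as np

def normalized_attention_loss(attn, masks, eps=1e-8):
    """attn: (M, H, W) non-negative attention per part token.
    masks: (M, H, W) binary segmentation masks S_m."""
    # Normalize across parts so the maps sum to 1 at each location.
    a_hat = attn / (attn.sum(axis=0, keepdims=True) + eps)
    a_hat = np.clip(a_hat, eps, 1.0 - eps)            # keep log() finite
    bce = -(masks * np.log(a_hat) + (1.0 - masks) * np.log(1.0 - a_hat))
    return bce.mean()                                 # average over parts and locations

rng = np.random.default_rng(0)
M, H, W = 5, 16, 16
owner = rng.integers(0, M, size=(H, W))               # toy segmentation: one part per location
masks = (np.arange(M)[:, None, None] == owner).astype(float)
aligned = masks + 0.05                                # attention mostly on the correct part
shuffled = np.roll(masks, 1, axis=0) + 0.05           # attention mostly on a wrong part
loss_good = normalized_attention_loss(aligned, masks)
loss_bad = normalized_attention_loss(shuffled, masks)
```

In this toy setup `loss_good` is smaller than `loss_bad`: attention placed on the wrong part is penalized twice, once for missing its own mask and once for taking normalized mass away from the correct part.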
Loss & Training
Total loss: \(\mathcal{L}_{total} = \mathcal{L}_{ldm} + \lambda_{attn} \mathcal{L}_{attn}\), where \(\lambda_{attn} = 0.01\)
- LoRA is used to fine-tune cross-attention blocks (instead of the full model) to reduce training overhead.
- The attention loss is computed on cross-attention maps at the 16×16 resolution, which are rich in semantic information.
- \(M=5\) parts (head, chest/belly, wings, legs, tail) are set for birds, and \(M=7\) parts are set for dogs.
- \(K=256\) sub-categories per part, large enough to cover all fine-grained categories (CUB-200-2011 has 200 bird species and Stanford Dogs has 120 breeds).
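As a rough illustration of the training setup, the sketch below shows a generic LoRA-style linear layer and the weighted loss sum. It assumes the standard LoRA formulation (frozen weight plus a scaled low-rank update with a zero-initialized factor); the rank, `alpha`, the layer shape, and the names `LoRALinear` and `total_loss` are hypothetical and not taken from the paper.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        out_dim, in_dim = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                                  # frozen pretrained projection
        self.a = rng.normal(0, 0.01, size=(rank, in_dim))     # trainable factor A
        self.b = np.zeros((out_dim, rank))                    # trainable factor B, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.weight.T + self.scale * (x @ self.a.T) @ self.b.T

def total_loss(l_ldm, l_attn, lambda_attn=0.01):
    """L_total = L_ldm + lambda_attn * L_attn."""
    return l_ldm + lambda_attn * l_attn

rng = np.random.default_rng(1)
w = rng.normal(size=(320, 768))      # toy stand-in for one cross-attention projection
layer = LoRALinear(w)
x = rng.normal(size=(2, 768))
y = layer(x)
```

With `B` initialized to zero the layer initially reproduces the frozen projection, and only `a` and `b` (far fewer values than `w`) would be updated during fine-tuning.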
Key Experimental Results
Main Results
Part reconstruction is evaluated on CUB-200-2011 (birds) and Stanford Dogs:
| Method | FID↓ | CLIP↑ | DINO↑ | EMR↑ | CoSim↑ |
|---|---|---|---|---|---|
| Textual Inversion | 10.10 | 0.784 | 0.607 | 0.305 | 0.842 |
| DreamBooth | 12.94 | 0.775 | 0.594 | 0.355 | 0.856 |
| Custom Diffusion | 37.61 | 0.694 | 0.504 | 0.338 | 0.833 |
| Break-a-Scene | 20.05 | 0.742 | 0.549 | 0.390 | 0.854 |
| PartCraft | 12.86 | 0.783 | 0.618 | 0.460 | 0.882 |
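PartCraft exceeds Break-a-Scene, the strongest baseline on EMR (Exact Match Rate), by 7.0 points in EMR (0.460 vs. 0.390) and by 2.8 points in CoSim (0.882 vs. 0.854). The best baseline CoSim is DreamBooth's 0.856, which leaves a 2.6-point gap.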
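The margins can be checked directly from the table. The short script below copies the EMR and CoSim columns and reports, for each metric, the best baseline and PartCraft's absolute lead over it.

```python
# method: (EMR, CoSim), copied from the main results table
results = {
    "Textual Inversion": (0.305, 0.842),
    "DreamBooth": (0.355, 0.856),
    "Custom Diffusion": (0.338, 0.833),
    "Break-a-Scene": (0.390, 0.854),
    "PartCraft": (0.460, 0.882),
}

ours = results["PartCraft"]
baselines = {name: vals for name, vals in results.items() if name != "PartCraft"}

margins = {}
for idx, metric in enumerate(("EMR", "CoSim")):
    # Best baseline for this metric (higher is better for both).
    best = max(baselines, key=lambda name: baselines[name][idx])
    margins[metric] = (best, round(ours[idx] - baselines[best][idx], 3))

print(margins)
# {'EMR': ('Break-a-Scene', 0.07), 'CoSim': ('DreamBooth', 0.026)}
```

The gaps are absolute differences on a 0 to 1 scale, so 0.07 corresponds to 7 points rather than a 7% relative improvement.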
Ablation Study
| Configuration | FID↓ | EMR↑ | CoSim↑ | Description |
|---|---|---|---|---|
| Full PartCraft | 12.86 | 0.460 | 0.882 | Full Model |
| w/o Bottleneck | 16.36 | ~0.460 | ~0.882 | FID rises by 3.50 (12.86 to 16.36); generation quality drops |
| MSE attn loss (BaS) | - | Significantly drops | Significantly drops | EMR/CoSim degrade substantially |
| w/o both | - | Worst | Worst | Double degradation |
Key Findings
- More part combinations make it harder: As the number of mixed species increases from 1 to 4, both EMR and CoSim decrease.
- PartCraft still significantly outperforms other methods under 4-species combinations. Although Break-a-Scene also uses attention loss, its effect is inferior to the normalized entropy loss of this work.
- Word embedding space visualization (t-SNE) shows that PartCraft's part tokens naturally cluster semantically (heads cluster together, wings cluster together), whereas other methods yield chaotic embeddings.
- Cross-domain transfer: The learned dog parts can be transferred to cats/lions (e.g., "a cat with beagle ears"), and can also be used for creative generation (e.g., a bird-shaped robot).
Highlights & Insights
- Select-to-create: Simplifies the creative process to "clicking and selecting parts," without requiring text descriptions or drawing skills, making it elegant and practical.
- Entropy-normalized attention loss is the core contribution: it addresses the entanglement problem in multi-part learning, with a clear design motivation (each position belongs to only one part).
- The Bottleneck encoder design cleverly leverages "shared-part knowledge," allowing different instances of the same semantic (e.g., heads of different birds) to share a representation space.
- The unsupervised part discovery scheme is flexible and scalable, requiring no part annotations.
Limitations & Future Work
- Part discovery relies on the self-supervised features of DINOv2, which imposes an inherent ceiling on accuracy; stronger encoders (such as improved versions of VLPart) could be introduced.
- Combination results for small parts (e.g., tails, legs) are relatively poor, because these parts occupy small areas in the images, making both clustering and attention supervision less precise.
- Cross-domain part combination (e.g., combining animal parts with car parts) is still in its infancy.
- Only validated on Stable Diffusion v1.5; upgrading to newer models might yield better performance.
Related Work & Insights
- Closest to Break-a-Scene, but the latter's MSE attention loss is less effective than the entropy loss proposed in this work.
- Can inspire the 3D generation field: if similar part-level composition control could be achieved on 3D models, it would hold immense application value.
- The part discovery module can be used independently to provide semantic segmentation for other fine-grained control tasks.
Rating
- Novelty: ⭐⭐⭐⭐ (The idea of part selection -> creative generation is novel, though the core technique is based on existing frameworks)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Two datasets + quantitative & qualitative + ablation + visualization + rich transfer experiments)
- Writing Quality: ⭐⭐⭐⭐⭐ (Motivating examples are intuitive, and method descriptions are clear)
- Value: ⭐⭐⭐⭐ (Holds practical application potential for creative design fields)