ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts

Conference: CVPR 2025
arXiv: 2412.02912
Code: Project Page
Area: Image Generation
Keywords: 3D Shape Guidance, Text-to-Image, Shape Tokens, CLIP Space, Controllable Generation

TL;DR

ShapeWords encodes 3D shapes into special tokens that can be embedded directly in text prompts (via the Shape2CLIP module). This enables viewpoint-agnostic, 3D shape-guided text-to-image generation that significantly outperforms ControlNet depth-map conditioning in compositional settings.

Background & Motivation

Shape control in text-to-image generation faces three major challenges:

- Imbalance in Text-Visual Conditioning: Methods like ControlNet control shapes using depth/edge maps, but the shape condition often overrides the text description in compositional scenes (e.g., "a chair under a tree").
- Viewpoint Dependence: Depth maps only capture 2D information from a single viewpoint, discarding the full 3D geometry.
- Lack of Shape Exploration: Users might want to see variations of the target shape, but existing methods can only replicate it precisely or ignore the shape entirely.

Core Idea: Directly "writing" 3D shape information into text prompt tokens (rather than treating it as an external condition) allows shape and text to fuse naturally within the same space.

Method

Overall Architecture

Encode the 3D shape into a sequence of tokens \(\mathbf{B} \in \mathbb{R}^{65 \times 384}\) using Point-BERT.
Encode the text prompt as \(\mathbf{T}\) using OpenCLIP, with shape placeholders replaced by their category names.
The Shape2CLIP module learns a prompt residual \(\delta\mathbf{T}\) via cross-attention to modify the embeddings of the shape token and the EOS token.
Feed the modified embeddings into Stable Diffusion 2.1 to generate images.
Control the shape guidance strength using a user-specified parameter \(\lambda\).

Key Design 1: Shape2CLIP Residual Mapping

Function: Injecting 3D shape information into the text prompt embedding space.
Mechanism: Within a 6-layer cross-attention block, Point-BERT's shape representations act as key/value, and prompt embeddings act as query. The output residual \(\delta\mathbf{T}\) only modifies the embeddings of two key positions: the shape identifier token and the EOS token, leaving others unchanged: \(\mathbf{T}'[s,e] = \mathbf{T}[s,e] + \delta\mathbf{T}(\mathbf{B}, \mathbf{T}; \theta)\).
Design Motivation: Predicting a residual is less prone to overfitting than predicting the embeddings directly with a feed-forward mapping, which matters given the limited training data. Modifying only two tokens rather than the whole prompt preserves the original semantics of the rest of the prompt (such as scene descriptions and style instructions), preventing the shape information from "washing out" the textual meaning.
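
The residual update above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it uses a single randomly initialized, single-head cross-attention layer instead of the 6-layer block, and the prompt length (77), CLIP width (1024), attention width (64), token positions `s` and `e`, and all function names are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_residual(T, B, Wq, Wk, Wv, Wo):
    """One single-head cross-attention layer (hypothetical stand-in for Shape2CLIP).

    Prompt embeddings T act as queries; shape tokens B act as keys and values.
    """
    Q = T @ Wq                                        # (n_tok, d_attn)
    K = B @ Wk                                        # (n_shape, d_attn)
    V = B @ Wv                                        # (n_shape, d_attn)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (n_tok, n_shape)
    return (attn @ V) @ Wo                            # (n_tok, d_clip)

def apply_shape_residual(T, delta_T, s, e):
    """Add the residual only at the shape-token position s and the EOS position e."""
    T_new = T.copy()
    T_new[[s, e]] += delta_T[[s, e]]
    return T_new

rng = np.random.default_rng(0)
n_tok, d_clip = 77, 1024       # assumed prompt length and embedding width
n_shape, d_shape = 65, 384     # Point-BERT token sequence, as stated in the text
d_attn = 64                    # assumed attention width

T = rng.normal(size=(n_tok, d_clip))
B = rng.normal(size=(n_shape, d_shape))
Wq = rng.normal(size=(d_clip, d_attn)) * 0.02
Wk = rng.normal(size=(d_shape, d_attn)) * 0.02
Wv = rng.normal(size=(d_shape, d_attn)) * 0.02
Wo = rng.normal(size=(d_attn, d_clip)) * 0.02

s, e = 2, 6                    # hypothetical shape-token and EOS positions
delta_T = cross_attention_residual(T, B, Wq, Wk, Wv, Wo)
T_prime = apply_shape_residual(T, delta_T, s, e)

# Indices of prompt positions whose embeddings were modified.
changed = np.flatnonzero(np.any(T_prime != T, axis=1))
print(changed)
```

Every position other than `s` and `e` is returned unchanged, which is the property the design motivation relies on.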

Key Design 2: Controllable Shape Guidance Intensity

Function: Allowing users to flexibly control the trade-off between "faithfulness to the 3D shape" and "creative variation".
Mechanism: Linear interpolation via a parameter \(\lambda \in [0,1]\) during inference: \(\mathbf{T}'[s,e] = \mathbf{T}[s,e] + \lambda \cdot \delta\mathbf{T}\). Setting \(\lambda=0\) ignores the shape, \(\lambda=1\) enforces maximum shape constraint, and intermediate values yield stylized variations.
Design Motivation: Users might only have crude geometric prototypes (e.g., simple boxes) and require the model to perform reasonable detail synthesis while maintaining the basic structure.
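
A small sketch of the guidance-strength rule \(\mathbf{T}'[s,e] = \mathbf{T}[s,e] + \lambda \cdot \delta\mathbf{T}\), using toy arrays so the effect of \(\lambda\) is easy to read off. The function name `guide`, the toy shapes, and the positions `s` and `e` are assumptions for illustration only.

```python
import numpy as np

def guide(T, delta_T, s, e, lam):
    """Scale the shape residual by lam and add it at positions s and e only."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    T_new = T.copy()
    T_new[[s, e]] += lam * delta_T[[s, e]]
    return T_new

# Toy data: 5 prompt positions, 4-dimensional embeddings.
T = np.zeros((5, 4))
delta_T = np.ones((5, 4))
s, e = 1, 4  # hypothetical shape-token and EOS positions

for lam in (0.0, 0.5, 1.0):
    out = guide(T, delta_T, s, e, lam)
    # First column: value at the shape token; second: an untouched position.
    print(lam, out[s, 0], out[0, 0])
```

With `lam=0.0` the prompt embeddings are returned unchanged, and with `lam=1.0` the full residual is applied, matching the two endpoints described above.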

Key Design 3: SDS-based Training

Function: Training only the Shape2CLIP module while freezing all other components.
Mechanism: Depth-conditioned images generated by ControlNet from ShapeNet shapes serve as training data (1.58M pairs). The Shape2CLIP module itself is trained with the SDS loss \(\mathcal{L}_{\text{SDS}}(\theta) = W(t)\|\hat{\epsilon}_{i,k} - \epsilon_{i,k}\|_2^2\), where \(W(t)\) is DreamTime's time-dependent weighting function, used to improve stability.
Design Motivation: The SDS loss directly aligns the training target with generation quality rather than simply mimicking ControlNet outputs. Prompts for the training data specifically avoid mentioning concrete 3D structures, forcing the model to learn generalized mappings instead of memorizing specific shape-appearance pairs.
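
The loss \(W(t)\|\hat{\epsilon} - \epsilon\|_2^2\) can be evaluated numerically as below. This sketch only computes the loss value; it does not run a diffusion model or backpropagate into \(\theta\). The bell-shaped `placeholder_weight` is a hypothetical stand-in and is not DreamTime's actual weighting formula, and the latent shape (4, 64, 64) is likewise an assumption.

```python
import numpy as np

def placeholder_weight(t, mean=500.0, std=200.0):
    """Hypothetical bell-shaped W(t); NOT DreamTime's actual formula."""
    return float(np.exp(-0.5 * ((t - mean) / std) ** 2))

def weighted_noise_loss(eps_pred, eps_true, t):
    """W(t) times the squared L2 distance between predicted and true noise."""
    return placeholder_weight(t) * float(np.sum((eps_pred - eps_true) ** 2))

rng = np.random.default_rng(0)
eps_true = rng.normal(size=(4, 64, 64))   # assumed latent noise shape
eps_pred = eps_true + 0.1                 # toy prediction with a constant error

loss_mid = weighted_noise_loss(eps_pred, eps_true, t=500)
loss_early = weighted_noise_loss(eps_pred, eps_true, t=100)
print(loss_mid, loss_early)
```

Under this placeholder weight, the same prediction error contributes less at `t=100` than at `t=500`; any real schedule would change these relative contributions.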

Loss & Training

SDS loss + DreamTime time-weighting.

Key Experimental Results

Main Results: Generation Quality on Compositional Prompts

Method	FID↓	KID↓	Aes.↑	CLIP↑
ControlNet	97.0	10.40	5.24	26.9
CNet-Stop@60	90.5	10.25	5.20	28.3
CNet-Stop@80	92.4	9.72	5.17	27.5
ShapeWords	73.8	8.58	5.45	31.5

ShapeWords outperforms all ControlNet variants on all four metrics under compositional settings. Relative to vanilla ControlNet, it reduces FID by 24% (97.0 to 73.8) and improves CLIP similarity by 17% (26.9 to 31.5).
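
The relative changes quoted above can be reproduced from the table with a few lines of Python. The numbers are copied from the table; the dictionary layout and the helper name `rel_change` are illustrative choices.

```python
# Metric values copied from the results table above.
results = {
    "ControlNet":   {"FID": 97.0, "KID": 10.40, "Aes": 5.24, "CLIP": 26.9},
    "CNet-Stop@60": {"FID": 90.5, "KID": 10.25, "Aes": 5.20, "CLIP": 28.3},
    "CNet-Stop@80": {"FID": 92.4, "KID": 9.72,  "Aes": 5.17, "CLIP": 27.5},
    "ShapeWords":   {"FID": 73.8, "KID": 8.58,  "Aes": 5.45, "CLIP": 31.5},
}

def rel_change(new, old):
    """Signed relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old

ours = results["ShapeWords"]
for name, row in results.items():
    if name == "ShapeWords":
        continue
    fid = rel_change(ours["FID"], row["FID"])
    clip = rel_change(ours["CLIP"], row["CLIP"])
    print(f"vs {name}: FID {fid:+.1f}%, CLIP {clip:+.1f}%")
```

The 24% and 17% figures correspond to the comparison against vanilla ControlNet; the margins over the early-stopping variants are smaller.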

Shape Faithfulness Evaluation (Simple Prompts)

Method	S-IOU↑	S-CD↓
CNet-Stop@20 (category)	Lower	Higher
ShapeWords@20	Higher	Lower

ShapeWords consistently outperforms corresponding ControlNet variants across different steps.

Key Findings

ControlNet severely disregards textual context under compositional prompts (e.g., generating only the chair without the tree for "a chair under a tree").
ShapeWords utilizes ControlNet-generated data during training, but does not rely on any depth maps during inference.
In user studies, ShapeWords was consistently preferred in both shape and text compliance.
The shape guidance strength \(\lambda\) facilitates a smooth transition between shape preservation and creative variation.

Highlights & Insights

Shape-as-Word: The paradigm of encoding 3D geometry into language tokens elegantly unifies shape and text conditioning.
Compositional Scene Breakthrough: Demonstrates superior performance over depth-map methods in complex prompts requiring a blend of "shape + context".
Viewpoint Agnostic: Shape2CLIP encodes full 3D information rather than single-view perspectives, supporting multi-view consistent generation.

Limitations & Future Work

Trained on only 5 categories of ShapeNet, leaving generalization to unseen shape categories to be validated.
The alignment accuracy between the generated images' shapes and the target 3D models is still behind ControlNet (a trade-off between precision and flexibility).
Training relies on synthetic data generated by ControlNet, causing potential distribution mismatches with real-world images.
Future work could extend this approach to large-scale 3D datasets (e.g., Objaverse) and more diverse shape categories.

Related Work

ControlNet: The seminal baseline for depth-map conditioning, significantly outperformed by ShapeWords in compositional scenarios.
Continuous 3D Words: Closely related work that embeds 3D attributes into tokens, though it focuses on attributes rather than explicit shapes.
Point-BERT: Offers structure-aware feature encoding for 3D shapes.

Rating

⭐⭐⭐⭐ — The pioneering idea of embedding 3D shapes into the language space is clean and powerful, showing real practical value in solving compositional prompt scenarios. Limited class coverage remains the primary bottleneck.