Synthia: Novel Concept Design with Affordance Composition
Conference: ACL 2025
arXiv: 2502.17793
Code: Yes (https://github.com/HyeonjeongHa/SYNTHIA)
Area: Other
Keywords: affordance, concept design, curriculum learning, contrastive learning, T2I models
TL;DR
Synthia is a framework for novel concept design based on affordance composition. It fine-tunes a T2I model using a hierarchical concept ontology, an affordance sampling strategy, and curriculum learning, so that the generated designs are both visually novel and functionally coherent.
Background & Motivation
Text-to-image (T2I) models are widely applied in AI-driven design, but existing methods suffer from two core limitations:
- Neglecting functional coherence: Existing approaches rely on complex text descriptions to generate visual variations, without considering whether multiple functions can be integrated into a single coherent concept. For instance, when asked for both "driving" and "vacuuming" functions, Stable Diffusion simply generates a car with no vacuuming capability.
- Lacking a structured functional foundation: Feeding LLM-generated prompts directly into T2I models ignores the hierarchical "concept-part-affordance" structure.
The core idea of Synthia is to use affordances as control signals for concept composition, rather than relying on complex textual descriptions. For example, given the affordances "brew + deliver", the model should automatically synthesize a novel design that possesses the functions of both a coffee maker and a cart.
Method
Overall Architecture
Synthia comprises three phases:
1. Affordance Composition Curriculum Construction: creating training data ordered from easy to hard, based on the ontology.
2. Affordance-based Curriculum Learning: fine-tuning the T2I model with contrastive objectives.
3. Evaluation: automatic and human evaluation of faithfulness, novelty, practicality, and coherence.
Key Designs
- Hierarchical Concept Ontology \(\mathcal{O} = (\mathcal{S}, \mathcal{C}, \mathcal{P}, \mathcal{A})\)
    - Four-level structure: Super-category \(\rightarrow\) Concept \(\rightarrow\) Part \(\rightarrow\) Affordance.
    - Example: furniture \(\rightarrow\) sofa \(\rightarrow\) {leg, cushion} \(\rightarrow\) {support, rest}.
    - Scale: 30 super-categories, 590 concepts, 1,172 parts, 686 affordances.
    - Design Motivation: Provides a structured functional foundation for the T2I model to avoid compositions based solely on superficial visual features.
- Affordance Sampling Strategy
    - Defines concept distance \(D_\mathcal{C}(c_i, c_j)\) by fusing affordance-level Jaccard similarity and BERT semantic similarity (\(\alpha=0.7, \beta=0.3\)).
    - Derives affordance distance \(D_\mathcal{A}(a_i, a_j)\) by averaging the pairwise distances of associated concepts.
    - Design Motivation: Avoids redundant combinations caused by random sampling (e.g., cook + heat), ensuring the selection of sufficiently distinct affordance pairs.
- Three-Stage Curriculum Construction
    - Stage 1: Near-distance affordance pairs \(\rightarrow\) learning basic concept-affordance associations.
    - Stage 2: Medium-distance affordance pairs \(\rightarrow\) learning fine-grained compositional structures.
    - Stage 3: Far-distance affordance pairs \(\rightarrow\) challenging the model to synthesize genuinely novel and functionally coherent concepts.
    - In total, 600 affordance pairs are sampled; DALL-E generates 10 images per pair, and the top 3 are retained after CLIP filtering.
- Contrastive Learning Fine-Tuning
    - Positive Constraint: Pseudo-novel concept images corresponding to the target affordances.
    - Negative Constraint: Existing concept images in the ontology that already contain the target affordances.
    - Total Loss: \(\mathcal{L} = \mathcal{L}_{pos} - \gamma \cdot \mathcal{L}_{neg}\)
    - Includes a noise prediction loss to prevent catastrophic forgetting.
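The sampling and curriculum steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy ontology, the function names, the choice to turn similarity into distance as `1 - similarity`, the pairing of \(\alpha\) with Jaccard and \(\beta\) with semantic similarity, the exact-match stand-in for BERT similarity, and the equal-size three-way split are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical toy ontology: concept -> set of affordances.
CONCEPT_AFFORDANCES = {
    "coffee maker": {"brew", "heat"},
    "kettle": {"heat", "pour"},
    "cart": {"deliver", "carry"},
    "sofa": {"support", "rest"},
}


def jaccard(a, b):
    """Jaccard similarity between two affordance sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def exact_match_sim(ci, cj):
    """Stand-in for a BERT-based similarity (assumption): 1 if same concept, else 0."""
    return 1.0 if ci == cj else 0.0


def concept_distance(ci, cj, semantic_sim, alpha=0.7, beta=0.3):
    """Distance as 1 - fused similarity (assumed form and weight assignment)."""
    sim = alpha * jaccard(CONCEPT_AFFORDANCES[ci], CONCEPT_AFFORDANCES[cj])
    sim += beta * semantic_sim(ci, cj)
    return 1.0 - sim


def concepts_with(affordance):
    """All concepts in the toy ontology that carry the given affordance."""
    return [c for c, affs in CONCEPT_AFFORDANCES.items() if affordance in affs]


def affordance_distance(ai, aj, semantic_sim):
    """Average concept distance over all pairs of associated concepts."""
    pairs = [(ci, cj) for ci in concepts_with(ai) for cj in concepts_with(aj)]
    return sum(concept_distance(ci, cj, semantic_sim) for ci, cj in pairs) / len(pairs)


def build_curriculum(semantic_sim):
    """Sort affordance pairs by distance and cut them into three stages."""
    affordances = sorted(set().union(*CONCEPT_AFFORDANCES.values()))
    scored = sorted(
        (affordance_distance(a, b, semantic_sim), a, b)
        for a, b in combinations(affordances, 2)
    )
    third = len(scored) // 3  # equal-size split is an assumption
    return {
        "stage1_near": scored[:third],
        "stage2_medium": scored[third:2 * third],
        "stage3_far": scored[2 * third:],
    }


stages = build_curriculum(exact_match_sim)
for name, pairs in stages.items():
    print(name, len(pairs))
```

In this toy setup, a pair such as carry + deliver (both from the same concept) sits at the near end, while brew + deliver (from unrelated concepts) sits at the far end, which mirrors the motivation for avoiding redundant pairs.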
Inference
At inference time, only the affordances are provided as positive constraints, requiring no negative constraints or complex descriptions. Prompt format: "a new design that has functions of {desired affordances}."
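As a small illustration of the prompt format, the template can be filled from a list of affordances. The helper name `build_prompt` and the choice to join affordances with a comma are assumptions; the summary does not say how multiple affordances are separated.

```python
PROMPT_TEMPLATE = "a new design that has functions of {affordances}."


def build_prompt(affordances):
    """Fill the inference template; the ', ' separator is an assumption."""
    if not affordances:
        raise ValueError("at least one affordance is required")
    return PROMPT_TEMPLATE.format(affordances=", ".join(affordances))


print(build_prompt(["brew", "deliver"]))
```

The resulting string is the only input the fine-tuned model would need under this scheme; no negative prompt is built.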
Key Experimental Results
Main Results: Automatic and Human Evaluation (Table 1)
| Model | Faithfulness (Auto/Human) | Novelty (Auto/Human) | Practicality (Auto/Human) | Coherence (Auto/Human) |
|---|---|---|---|---|
| Stable Diffusion | 3.77/2.96 | 3.74/2.44 | 3.34/3.02 | 3.29/2.75 |
| Kandinsky3 | 3.38/2.95 | 4.02/2.98 | 2.92/3.01 | 3.89/3.41 |
| ConceptLab | 3.39/2.73 | 4.08/3.11 | 2.93/2.68 | 3.96/3.54 |
| Synthia | 3.99/3.81 | 4.55/3.89 | 3.35/3.38 | 4.81/4.06 |
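In human evaluation, Synthia improves novelty by 25.1% and coherence by 14.7%. These figures match the relative gain over the strongest baseline in Table 1, ConceptLab (novelty from 3.11 to 3.89, coherence from 3.54 to 4.06).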
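The reported percentages can be checked against the human scores in Table 1. The sketch below assumes the gain is measured relative to the strongest baseline for each metric; that reading is an interpretation that happens to reproduce the reported numbers, and the dictionary layout and function name are illustrative.

```python
# Human-evaluation scores copied from Table 1.
human_scores = {
    "Stable Diffusion": {"novelty": 2.44, "coherence": 2.75},
    "Kandinsky3": {"novelty": 2.98, "coherence": 3.41},
    "ConceptLab": {"novelty": 3.11, "coherence": 3.54},
    "Synthia": {"novelty": 3.89, "coherence": 4.06},
}


def relative_gain(metric):
    """Percentage gain of Synthia over the best baseline on one metric."""
    best_baseline = max(
        scores[metric] for model, scores in human_scores.items() if model != "Synthia"
    )
    return (human_scores["Synthia"][metric] - best_baseline) / best_baseline * 100


for metric in ("novelty", "coherence"):
    print(f"{metric}: {relative_gain(metric):.1f}%")
# novelty: 25.1%
# coherence: 14.7%
```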
Ablation Study
| Ablation Item | Key Findings |
|---|---|
| Training Data Size | Performance improves steadily from 200 \(\rightarrow\) 400 \(\rightarrow\) 600 pairs; 600 pairs is the best of the tested sizes. |
| Affordance Distance | Synthia's novelty consistently outperforms baselines on far-distance pairs. |
| Curriculum Learning vs. Random Training | Curriculum learning significantly outperforms random training in the early stages of training. |
| 3 or 4 Affordance Inputs | Although trained only on affordance pairs, Synthia maintains high performance when given 3 or 4 affordances. |
Key Findings
- For near-distance affordance pairs, existing T2I models tend to reproduce existing concepts rather than generate novel designs.
- Curriculum learning significantly accelerates training and guides the model to generate high-quality novel concepts.
- Synthia even outperforms DALL-E in relative evaluations, suggesting that the fine-tuned model's concept composition ability can surpass that of the model used to generate its training images.
- The inter-annotator agreement (IAA) in human evaluation is 67.5%, and the alignment rate between automatic and human evaluation is 91.25%.
Highlights & Insights
- Unique Affordance Perspective: Treats "functionality" as a first-class citizen in concept design, rather than relying solely on visual features.
- Well-Structured Ontology Design: The hierarchical concept ontology provides a solid foundation for structured functional compositions.
- Effective Curriculum Learning: Training on affordance compositions ordered from easy to hard outperforms random-order training, most clearly in the early stages of training.
- Simple Inference: Inference requires only the affordance keywords, with no complex prompts or negative constraints.
Limitations & Future Work
- Training data relies heavily on pseudo-novel images generated by DALL-E, so it is limited by the quality of DALL-E's outputs.
- Ontology construction requires manual design, which makes scaling to more domains costly.
- Training uses only affordance pairs, and the ablation tests inputs of up to 4 affordances; performance on larger compositions remains to be validated.
- Physical feasibility and manufacturing constraints are not considered.
Related Work & Insights
- ConceptLab leverages a diffusion prior to optimize generation but neglects functional coherence.
- Concept Weaver refines generations based on template images and similarly disregards affordances.
- The theory of combinatorial creativity (Han et al., 2018) provides a psychological foundation for far-distance affordance composition.
Rating
- Novelty: ⭐⭐⭐⭐ — Affordance-driven concept design is a novel perspective; the ontology + curriculum learning framework is original.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive automatic and human evaluations with thorough ablation studies.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure and intuitive illustrations.
- Value: ⭐⭐⭐⭐ — Practical application value for AI-assisted design.