LLEMA: Evolutionary Search with LLMs for Multi-Objective Materials Discovery¶
Conference: ICLR 2026
arXiv: 2510.22503
Code: github.com/scientific-discovery/LLEMA
Area: LLM / NLP
Keywords: materials discovery, LLM evolutionary search, multi-objective optimization, crystal structure generation, surrogate models
TL;DR¶
This paper proposes LLEMA, a framework that integrates LLM scientific knowledge with chemistry-rule-guided evolutionary search and memory-driven iterative optimization, achieving superior hit rates, stability, and Pareto front quality across 14 multi-objective materials discovery tasks.
Background & Motivation¶
Materials discovery requires searching a vast combinatorial space of chemical compositions and crystal structures while simultaneously satisfying multiple, often conflicting, objectives. The traditional discovery process is resource-intensive and slow, and existing approaches face the following challenges:
- Traditional generative models (CDVAE, G-SchNet, DiffCSP, MatterGen) require task-specific retraining, lack generalization capability, and do not leverage the extensive prior knowledge embedded in LLMs.
- Existing LLM-based methods (e.g., LLMatDesign) rely on prompt engineering or unguided material generation, producing candidates that are theoretically plausible but often thermodynamically unstable or unsynthesizable.
- Single-objective limitation: Most methods reduce materials discovery to a single-objective task, whereas real-world scenarios are inherently multi-objective (e.g., thermoelectric materials must simultaneously achieve high electrical conductivity and low thermal conductivity).
LLEMA is the first framework to simultaneously possess all four properties: domain knowledge integration, multi-objective optimization, rule-guided generation, and evolutionary optimization.
Method¶
Overall Architecture¶
LLEMA comprises four core components (see Figure 1); a minimal code sketch of the loop they form follows the list:
- (A) Materials candidate generation: The LLM generates candidates based on task descriptions and property constraints.
- (B) Crystallographic representation: Generated materials are converted into structured CIF files.
- (C) Physicochemical property prediction: Properties such as band gap and formation energy are predicted.
- (D) Fitness evaluation and feedback: Constraint satisfaction is assessed, and results are iteratively fed back via success/failure memory pools.
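The four components chain into one iterative loop. Below is a minimal Python sketch, with every callable passed in as a hypothetical placeholder for modules (A)–(D); none of these function names come from the paper:

```python
# A minimal sketch of the generate -> represent -> predict -> evaluate loop.
# All component callables are illustrative placeholders for modules (A)-(D).

def llema_loop(task, build_prompt, generate, to_cif, predict, score, n_iters=250):
    """Run the iterative discovery loop and return the success pool M+."""
    success_pool, failure_pool = [], []  # memory pools M+ and M-
    for _ in range(n_iters):
        prompt = build_prompt(task, success_pool, failure_pool)  # (A) prompt with memory
        for material in generate(prompt):                        # (A) LLM candidates
            cif = to_cif(material)                               # (B) CIF representation
            props = predict(cif)                                 # (C) property vector f(m)
            s = score(props, task["constraints"])                # (D) composite fitness
            (success_pool if s >= 0 else failure_pool).append((material, props, s))
    return success_pool
```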
Problem Formulation¶
The materials discovery task \(\mathcal{T}\) is modeled as a constrained multi-objective optimization problem over candidate materials \(m\) with predicted property vector \(f(m) \in \mathbb{R}^d\):

\[
\text{find } m \in \mathcal{M} \quad \text{s.t.} \quad c_i\bigl(f_i(m)\bigr) \text{ holds for all } i = 1, \dots, d,
\]

where each constraint \(c_i\) may take the form of an interval, lower-bound, or upper-bound constraint:

\[
f_i(m) \in [l_i, u_i], \qquad f_i(m) \geq l_i, \qquad f_i(m) \leq u_i .
\]
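As a concrete reading of the three constraint forms, a minimal sketch (the `Constraint` class is illustrative, not the paper's code): choosing finite or infinite \(l_i\), \(u_i\) recovers the interval, lower-bound, and upper-bound cases.

```python
import math
from dataclasses import dataclass

@dataclass
class Constraint:
    """One property constraint c_i; -inf/+inf bounds recover the
    pure lower-bound and upper-bound forms."""
    lower: float = -math.inf  # l_i
    upper: float = math.inf   # u_i

    def satisfied(self, value: float) -> bool:
        return self.lower <= value <= self.upper

# e.g. "bandgap >= 2.5 eV" from a task specification
bandgap = Constraint(lower=2.5)
assert bandgap.satisfied(3.1) and not bandgap.satisfied(1.2)
```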
Key Designs¶
Hypothesis Generation:
At each iteration \(n\), the LLM \(\pi_\theta\) samples a batch of candidate materials from prompt \(\mathbf{p}_n\), which consists of four components (assembled as in the sketch after this list):
- Task specification: Natural-language objectives and property constraints (e.g., "wide-bandgap semiconductor, bandgap ≥ 2.5 eV").
- Chemistry-informed design principles: Rules such as isoelectronic substitution, stoichiometry preservation, and phase stability, serving as evolutionary operators.
- Demonstration samples: Positive and negative examples sampled from the success pool \(\mathbb{M}^+\) and failure pool \(\mathbb{M}^-\).
- Crystallographic representation: The LLM outputs crystal configurations in JSON format (chemical formula, lattice parameters, atomic coordinates).
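A sketch of how the four components could be assembled into \(\mathbf{p}_n\); the template wording and the top-\(k\) selection helper are assumptions, not the paper's actual prompts:

```python
# Illustrative prompt assembly; template text is assumed, not from the paper.

def build_prompt(task_spec, design_rules, success_pool, failure_pool, k=3):
    """Pools hold (material_json, score) pairs; take the k best of each."""
    top_k = lambda pool: [m for m, _ in sorted(pool, key=lambda x: -x[1])[:k]]
    return "\n\n".join([
        f"Task: {task_spec}",                                       # objectives & constraints
        "Design principles:\n" + "\n".join(design_rules),           # chemistry rules as operators
        "Successful examples:\n" + "\n".join(top_k(success_pool)),  # from M+
        "Failed examples:\n" + "\n".join(top_k(failure_pool)),      # from M-
        "Output each candidate as JSON with keys: formula, "
        "lattice_parameters, atomic_coordinates.",                  # crystallographic format
    ])
```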
Physicochemical Property Prediction:
- A hierarchical prediction system first queries the Materials Project database for exact matches.
- Out-of-distribution candidates are evaluated using surrogate models (CGCNN, ALIGNN).
- This yields a property vector \(f(m) \in \mathbb{R}^d\); the database-to-surrogate fallback is sketched below.
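The database-first, surrogate-fallback hierarchy reduces to a few lines; `mp_lookup` and `surrogate` are assumed callables (e.g., wrappers around a Materials Project query and a pretrained CGCNN/ALIGNN model):

```python
# Sketch of the hierarchical predictor: exact database lookup first,
# surrogate-model fallback for out-of-distribution candidates.

def predict_properties(structure, mp_lookup, surrogate):
    """Return the property vector f(m) for a candidate structure."""
    known = mp_lookup(structure)  # exact match in Materials Project, or None
    if known is not None:
        return known              # trusted database values
    return surrogate(structure)   # predicted band gap, formation energy, ...
```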
Fitness Evaluation and Memory Management:
A composite scoring function \(S(m)\) aggregates the per-constraint satisfaction margins, with \(S(m) \geq 0\) exactly when all constraints are met:
- Candidates satisfying all constraints (\(S \geq 0\)) are added to the success pool \(\mathbb{M}^+\).
- Candidates violating constraints are added to the failure pool \(\mathbb{M}^-\); a minimal scoring-and-routing sketch follows.
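A minimal sketch of scoring and routing, under one explicit assumption: \(S(m)\) is taken as the worst signed constraint margin, which matches the \(S \geq 0\) threshold above but is not stated in that form by the paper.

```python
# Composite score as the worst signed constraint margin (an assumption
# consistent with the S >= 0 routing rule), plus memory-pool routing.

def margin(value, lower=float("-inf"), upper=float("inf")):
    return min(value - lower, upper - value)  # >= 0 iff lower <= value <= upper

def evaluate(material, props, constraints, success_pool, failure_pool):
    """props: property -> value; constraints: property -> (lower, upper)."""
    s = min(margin(props[p], lo, hi) for p, (lo, hi) in constraints.items())
    (success_pool if s >= 0 else failure_pool).append((material, s))  # M+ / M-
    return s
```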
Multi-Island Evolutionary Strategy:
- The population is divided into \(m=5\) independent islands.
- Islands are selected via Boltzmann sampling over island scores \(s_i\) at temperature \(\tau_c\): \(P_i = \frac{\exp(s_i/\tau_c)}{\sum_j \exp(s_j/\tau_c)}\)
- Within each island, top-\(k\) sampling from \(\mathbb{M}^+\) and \(\mathbb{M}^-\) constructs the prompt for the next iteration (island selection is sketched below).
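The selection formula translates directly into code; the per-island score definition (e.g., mean fitness of an island's members) is an assumption:

```python
import math
import random

def select_island(island_scores, tau=1.0):
    """Boltzmann sampling: P_i proportional to exp(s_i / tau_c)."""
    mx = max(island_scores)  # subtract the max for numerical stability
    weights = [math.exp((s - mx) / tau) for s in island_scores]
    return random.choices(range(len(island_scores)), weights=weights, k=1)[0]

# e.g., m = 5 islands; lower tau concentrates sampling on high-scoring islands
island = select_island([0.2, 0.8, 0.1, 0.5, 0.3], tau=0.5)
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged while avoiding overflow for large \(s_i/\tau_c\).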
Loss & Training¶
LLEMA involves no loss function or task-specific training. Users need only provide a CSV file containing task names and property constraints; the framework then automatically:
- Constructs prompts from the CSV and iteratively generates candidates.
- Applies surrogate models using publicly available pretrained weights (no retraining required).
- Prunes candidates violating hard constraints by assigning them low scores (a hypothetical CSV layout is sketched after this list).
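A hypothetical CSV layout and loader; the paper states only that task names and property constraints are provided, so the column names here are illustrative:

```python
import csv
import io

# One row per (task, property) constraint; blank cells mean "unbounded".
TASKS_CSV = """task,property,lower,upper
wide_bandgap_semiconductor,band_gap_eV,2.5,
wide_bandgap_semiconductor,formation_energy_eV_per_atom,,0.0
"""

tasks = {}
for row in csv.DictReader(io.StringIO(TASKS_CSV)):
    lo = float(row["lower"]) if row["lower"] else float("-inf")
    hi = float(row["upper"]) if row["upper"] else float("inf")
    tasks.setdefault(row["task"], {})[row["property"]] = (lo, hi)

print(tasks)  # {'wide_bandgap_semiconductor': {'band_gap_eV': (2.5, inf), ...}}
```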
Key Experimental Results¶
Main Results: 14 Materials Discovery Tasks¶
Hit rate (H.R.) and stability (Stab.), both reported in percent, are evaluated across 14 tasks spanning five domains: electronics, energy, coatings, optics, and aerospace. The table below shows five representative tasks.
| Method | Wide-Bandgap H.R./Stab. | SAW/BAW H.R./Stab. | Solid-State H.R./Stab. | Piezo H.R./Stab. | Transparent H.R./Stab. |
|---|---|---|---|---|---|
| CDVAE | 0.04/0.04 | 0.29/0.00 | 0.04/0.04 | 42.19/0.00 | 0.00/0.00 |
| MatterGen | 6.56/4.15 | 26.27/0.00 | 5.33/3.11 | 21.64/0.00 | 9.38/0.00 |
| LLMatDesign | 4.19/1.13 | 47.59/0.13 | 2.51/2.44 | 32.16/1.38 | 0.04/0.04 |
| LLEMA (Mistral) | 17.08/10.71 | 31.58/6.80 | 31.79/20.78 | 67.11/4.84 | 43.87/18.48 |
| LLEMA (GPT) | 33.62/22.42 | 59.88/10.74 | 46.17/25.37 | 63.46/3.22 | 39.11/14.85 |
LLEMA substantially outperforms all baselines across nearly all tasks, with a particularly pronounced advantage in stability: baseline methods may generate candidates that satisfy property constraints but are thermodynamically unstable.
Ablation Study: Component Contributions¶
| Method | Hit Rate (%) ↑ | Stability (%) ↑ | Memorization Rate (%) ↓ |
|---|---|---|---|
| LLM (direct generation) | 4.4 | 1.8 | 95.3 |
| + Memory feedback | 15.1 | 20.1 | 58.3 |
| + Mutation & crossover | 29.8 | 21.5 | 25.3 |
| LLEMA (full) | 30.2 | 27.6 | 16.6 |
Key Findings¶
- Evolutionary optimization substantially reduces memorization: The Materials Project repetition rate drops from 95.3% under pure LLM generation to 16.6% under LLEMA.
- Surrogate models are indispensable: Removing surrogate models causes both hit rate and stability to collapse to <5%, and the search degenerates into trivial repetition.
- Convergence dynamics: The proportion of valid candidates increases from ~27% at iteration 250 to ~33% at iteration 1000.
- Pareto front advantage: For wide-bandgap semiconductor and hard/rigid ceramic tasks, all Pareto-optimal solutions originate from LLEMA.
- Discovered candidates align with domain expert research: For example, ZrAl₂O₅ and Hf₀.₅Zr₀.₅O₂ correspond to well-known high-\(k\) dielectric material families.
Highlights & Insights¶
- Chemical knowledge encoded as evolutionary operators: Domain knowledge such as substitution rules, stoichiometry conservation, and oxidation state consistency is translated into operators that guide LLM generation, rather than being reduced to simple prompt engineering.
- Multi-island evolutionary strategy: Inspired by works such as FunSearch, the parallel island structure balances exploration and exploitation.
- High practical utility: New tasks can be initiated with only a CSV file; surrogate models use pretrained weights without retraining.
- Benchmark contribution: The paper introduces a benchmark of 14 industrially relevant multi-objective materials discovery tasks, each with well-defined physical constraints.
Limitations & Future Work¶
- Reliance on surrogate models (CGCNN, ALIGNN) for property prediction means that prediction errors accumulate and can misdirect the search.
- Experimental validation is absent: newly discovered materials are verified only computationally and have not been synthesized in the laboratory.
- Iterative LLM queries incur high costs; 250 iterations require a substantial number of API calls.
- Chemistry rules are currently designed manually by domain scientists; automated rule discovery is a natural direction for extension.
- Evaluation is limited to GPT-4o-mini and Mistral-Small; stronger LLMs may yield further performance gains.
Related Work & Insights¶
- Relationship to FunSearch (Romera-Paredes et al., 2024): LLEMA's multi-island evolutionary strategy is directly inspired by FunSearch, extending it from program synthesis to materials discovery.
- Complementarity with MatterGen: MatterGen performs inverse design via conditional sampling with diffusion models, whereas LLEMA employs LLM reasoning combined with evolutionary search; the two approaches are complementary.
- Implications for AI4Science: This work demonstrates how to integrate broad LLM knowledge with domain-specific constraints, and the paradigm is transferable to drug design, catalyst discovery, and related fields.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The first unified framework for materials discovery combining LLM evolutionary search, chemistry rules, and multi-objective optimization.
- Technical Depth: ⭐⭐⭐⭐ — The multi-layered framework incorporates surrogate models, multi-island evolution, and memory management.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 14 tasks, multiple baselines, comprehensive ablations, and qualitative analysis.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure and rigorous problem formulation.
- Value: ⭐⭐⭐⭐ — Low barrier to entry (CSV only), though dependent on surrogate model quality.
- Overall Rating: ⭐⭐⭐⭐ (8/10)