Foundation Molecular Grammar: Multi-Modal Foundation Models Induce Interpretable Molecular Grammar
Conference: ICML 2025
arXiv: 2505.22948
Code: https://github.com/shiningsunnyday/induction
Area: Interpretability
Keywords: Molecular grammar, Multi-modal foundation models, Graph grammar, Interpretability, Molecular generation
TL;DR
FMG leverages the chemical knowledge of multi-modal foundation models (MMFMs) to induce interpretable molecular graph grammars. It renders molecules as images, describes them in text, and aligns the two modalities through prompt learning, replacing traditional grammar learning methods that rely on expert annotations or heuristics.
Background & Motivation
Background: Data-efficient molecular generation methods use graph grammars to introduce interpretability, making the generation process understandable and controllable. A graph grammar decomposes molecules into meaningful substructures, encodes them as production rules, and recombines them to generate new molecules.
Limitations of Prior Work: Existing grammar learning relies on expert annotations or unreliable heuristic inference algorithms. Expert annotations are expensive and do not scale, while heuristic methods (e.g., frequent subgraph mining) ignore chemical semantics, so the quality of the resulting grammar rules is inconsistent.
Key Challenge: How can the interpretability advantages of graph grammars be kept without depending on expert knowledge? This calls for an automated, scalable, and chemically meaningful grammar induction method.
Goal: Automatically induce high-quality molecular graph grammars by leveraging the inherent chemical knowledge of multi-modal foundation models.
Key Insight: Render molecules as 2D images, have an MMFM describe them in text, then align the two modalities using prompt learning to identify meaningful molecular substructures.
Core Idea: Discover interpretable molecular grammar rules automatically through multi-modal descriptions (images + text) by utilizing the chemical common sense of MMFMs.
Method
Overall Architecture
- Input: Molecular graphs (SMILES or molecular graph structures)
- Intermediate process: (1) Render molecules as 2D images → extract visual features using an MMFM; (2) Generate text descriptions using an MMFM → extract semantic features; (3) Align both modalities via prompt learning → identify functional substructures
- Output: Interpretable molecular graph grammar (a set of production rules)
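To make the input and output of this pipeline concrete, the following is a minimal sketch of how a molecular graph and a single production rule could be represented. All names here (`Molecule`, `ProductionRule`, `extract_rule`) are hypothetical and are not taken from the FMG codebase, and the substructure is chosen by hand, whereas FMG relies on an MMFM to identify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Molecule:
    """A molecular graph: atom labels plus undirected bonds between atom indices."""
    atoms: tuple      # e.g. ("C", "C", "O")
    bonds: frozenset  # frozenset of frozenset({i, j})


@dataclass(frozen=True)
class ProductionRule:
    """Rewrites a nonterminal into a substructure with marked attachment points."""
    lhs: str              # nonterminal symbol (hypothetical naming)
    rhs_atoms: tuple      # atom labels of the substructure
    rhs_bonds: frozenset  # bonds inside the substructure, in local indices
    attachments: tuple    # local indices of atoms bonded to the rest of the molecule


def make_molecule(atoms, bonds):
    return Molecule(tuple(atoms), frozenset(frozenset(b) for b in bonds))


def extract_rule(mol, substructure, lhs="N0"):
    """Cut `substructure` (a set of atom indices) out of `mol` as one rule."""
    order = sorted(substructure)
    local = {g: i for i, g in enumerate(order)}  # global index -> local index
    inner, attach = set(), set()
    for bond in mol.bonds:
        i, j = tuple(bond)
        if i in local and j in local:
            inner.add(frozenset((local[i], local[j])))  # bond inside the substructure
        elif i in local:
            attach.add(local[i])  # bond crossing the cut
        elif j in local:
            attach.add(local[j])
    return ProductionRule(
        lhs,
        tuple(mol.atoms[g] for g in order),
        frozenset(inner),
        tuple(sorted(attach)),
    )


# Heavy atoms of ethanol: C-C-O. Atoms 1 and 2 form the hand-picked substructure.
ethanol = make_molecule(["C", "C", "O"], [(0, 1), (1, 2)])
rule = extract_rule(ethanol, {1, 2})
```

Here `rule` records the C-O fragment together with the atom through which it attaches to the remaining carbon. A grammar in this toy setting would be a set of such rules collected over a dataset.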
Key Designs
- Molecular Image Rendering and Visual Encoding:
  - Render molecular structures into standard 2D chemical structure images
  - Extract visual representations of molecules using the vision encoder of an MMFM (such as CLIP or LLaVA-like models)
  - Design Motivation: MMFMs have learned rich chemical structural knowledge (such as functional group identification and skeleton patterns) during pre-training
- Text Description and Semantic Encoding:
  - Leverage MMFMs to generate text descriptions for molecules (e.g., chemical properties, functional groups)
  - Extract text embeddings as semantic conditioning
  - Design Motivation: Text provides high-level semantic information complementary to vision, assisting in the identification of chemically meaningful substructures
- Cross-Modal Alignment via Prompt Learning:
  - Design learnable prompts to align visual and text modalities
  - Automatically discover consistent molecular substructure patterns across both modalities through alignment learning
  - These patterns serve as the induced grammar production rules
  - Design Motivation: Cross-modal consistency serves as a strong indicator of the chemical significance of a substructure: genuinely meaningful functional groups correspond to both the image and the text
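The cross-modal consistency idea can be illustrated with a small sketch that keeps only those candidate substructures whose image-side and text-side embeddings agree. This is a simplified stand-in and not the alignment procedure used by FMG: the thresholded cosine similarity, the function names, and the embedding values are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def consistent_substructures(candidates, threshold=0.8):
    """Keep candidates whose image and text embeddings agree.

    `candidates` maps a substructure name to (image_embedding, text_embedding).
    The threshold is an arbitrary illustrative value.
    """
    kept = []
    for name, (img_emb, txt_emb) in candidates.items():
        score = cosine(img_emb, txt_emb)
        if score >= threshold:
            kept.append((name, score))
    # Highest agreement first.
    return sorted(kept, key=lambda item: -item[1])


# Hypothetical embeddings; a real system would obtain them from an MMFM.
candidates = {
    "hydroxyl": ([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]),
    "arbitrary_fragment": ([0.1, 0.9, 0.0], [0.9, 0.0, 0.3]),
}
selected = consistent_substructures(candidates)
```

In this toy input only `hydroxyl` survives the filter, because its two embeddings point in nearly the same direction while those of `arbitrary_fragment` do not.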
Loss & Training
- Prompt learning uses a contrastive loss to align visual and text embeddings
- Grammar rules are extracted from aligned substructure patterns via inductive learning
- FMG acts as a plug-and-play module that can replace existing grammar learning methods
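As an illustration of what a contrastive alignment objective looks like, here is a minimal sketch of a symmetric InfoNCE loss in the style of CLIP, written in plain Python. The summary above only states that a contrastive loss is used, so the symmetric form, the temperature value, and the function names are assumptions rather than the authors' implementation.

```python
import math


def _normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img = [_normalize(v) for v in image_embs]
    txt = [_normalize(v) for v in text_embs]
    # logits[i][j]: scaled similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(i_vec, t_vec)) / temperature for t_vec in txt]
        for i_vec in img
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


# Matching pairs give a low loss; mismatched pairs give a high one.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing such a loss pulls each image embedding toward the text embedding of the same item and pushes it away from the others in the batch, which is the sense in which the two modalities become aligned.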
Key Experimental Results
Main Results
| Task | Metric | FMG | Previous Methods | Gain |
|---|---|---|---|---|
| Molecular Generation (Validity) | Validity↑ | Higher | Baseline | Significant |
| Molecular Generation (Diversity) | Diversity↑ | Higher | Baseline | Gain |
| Synthesizability | SA Score↑ | Better | Baseline | Gain |
| Property Prediction | Accuracy↑ | Better | Baseline | Gain |
Ablation Study
| Configuration | Key Metric | Description |
|---|---|---|
| Vision modality only | Decrease | Text provides key semantic information |
| Text modality only | Decrease | Vision provides structural details |
| Without prompt learning | Significant decrease | Cross-modal alignment is the core mechanism |
| Replaced with heuristic grammar | Decrease | FMG grammar is more chemically meaningful |
Key Findings
- FMG outperforms existing grammar learning methods in synthesizability, diversity, and data efficiency.
- The induced grammar rules possess built-in chemical interpretability.
- FMG can be applied as a plug-and-play replacement in existing grammar-based molecular generation and property prediction frameworks.
- The complementarity of multi-modal information is crucial for grammar quality.
Highlights & Insights
- Ingenious Utilization of MMFMs: Transferring the knowledge of large-scale pre-trained models to the specialized task of molecular grammar induction.
- Balancing Interpretability and Automation: Grammar rules are chemically meaningful and require no manual annotation.
- Plug-and-Play Design: Requires no changes to downstream generation/prediction frameworks; only the grammar learning component is replaced.
- Data Efficiency: Particularly advantageous in few-shot scenarios.
Limitations & Future Work
- MMFMs have limited chemical knowledge, so FMG may underperform in highly specialized chemical areas (such as organometallic chemistry).
- Rendering molecules as 2D images leads to the loss of 3D conformational information.
- The performance of prompt learning might be sensitive to the choice of MMFM.
- Scalability to larger molecules or polymers remains to be verified.
Related Work & Insights
- Tree-decomposition-based molecular generation methods like JT-VAE provide the foundation for grammar frameworks.
- The application of MMFMs in chemistry is an emerging direction, and this work represents an innovative application within it.
- Insights: The knowledge of MMFMs may be equally valuable for structural induction in other scientific fields (e.g., materials, proteins).
Rating
- Novelty: ⭐⭐⭐⭐⭐ Applying MMFMs to molecular grammar induction is a completely novel concept.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-task verification is provided, but the scale is relatively limited.
- Writing Quality: ⭐⭐⭐⭐ The logic is clear.
- Value: ⭐⭐⭐⭐ Provides a new tool for interpretable molecular generation.