Structural Reasoning Improves Molecular Understanding of LLM
| Conference/Journal | Year | Paper Link | Code |
|---|---|---|---|
| ACL 2025 | 2025 | arXiv 2410.05610 | - |
Area: LLMs + Chemistry / Molecular Understanding
Keywords: molecular reasoning, structural information, chain-of-thought, SMILES, molecule-to-text
TL;DR
This paper proposes the Molecular Structural Reasoning (MSR) framework. By explicitly incorporating six key structural elements of molecules (molecular formula, longest carbon chain, aromatic rings, ring compounds, functional groups, and chiral centers) as intermediate reasoning steps, it significantly improves LLM performance on molecular understanding tasks.
Background & Motivation
Problem Definition: LLMs are increasingly applied in chemistry (molecular description generation, retrosynthesis, text-to-molecule, etc.), but even state-of-the-art LLMs (GPT-4o, Llama3) cannot accurately infer key structural information of molecules. For example, the accuracy of counting aromatic rings is only about 50%-75%.
Why Structural Information Matters:
- Molecular properties (toxicity, solubility, boiling point, etc.) depend heavily on structural features.
- Chemists reason about molecules starting from structure: they first identify rings and carbon-chain backbones, then locate functional groups.
- Injecting accurate structural information into LLMs can improve the correctness of molecular generation.
Motivation: Design a framework that lets LLMs "sketch" the molecular structure before answering, in the spirit of chain-of-thought prompting but tailored to chemical structural reasoning.
Method
Overall Architecture
MSR consists of two modules: the Reasoning Module (generating structural information) and the Answering Module (generating the final answer based on the original input and structural information). Depending on whether the molecule is provided as an input, it operates in two modes:
- Analytic Reasoning: Input contains the molecule \(\rightarrow\) Use external tools (RDKit) to extract precise structural information \(\rightarrow\) Fine-tune the answering module.
- Synthetic Reasoning: Input does not contain the molecule (e.g., text descriptions) \(\rightarrow\) Fine-tune the reasoning module to infer structural information from text \(\rightarrow\) Fine-tune the answering module to generate the molecule.
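The two modes can be read as a single dispatch placed in front of the answering module. The sketch below is illustrative only: `msr_pipeline`, `extract_with_tool`, `reasoning_model`, `answering_model`, and the `aromatic_rings` key are hypothetical names, and the toy stand-ins return hard-coded values instead of calling RDKit or a fine-tuned model.

```python
from typing import Callable, Dict, Optional

Structure = Dict[str, str]  # e.g. {"aromatic_rings": "1"}; key names are assumptions


def msr_pipeline(
    text: str,
    molecule: Optional[str],
    extract_with_tool: Callable[[str], Structure],
    reasoning_model: Callable[[str], Structure],
    answering_model: Callable[[str, Optional[str], Structure], str],
) -> str:
    """Route a query through analytic or synthetic reasoning, then answer."""
    if molecule is not None:
        # Analytic reasoning: the molecule is given, so an external tool
        # (RDKit in the paper) supplies exact structural information.
        structure = extract_with_tool(molecule)
    else:
        # Synthetic reasoning: no molecule in the input, so a fine-tuned
        # reasoning module infers the structure from the text.
        structure = reasoning_model(text)
    return answering_model(text, molecule, structure)


# Toy stand-ins so the sketch runs without RDKit or a model.
def toy_tool(smiles: str) -> Structure:
    return {"aromatic_rings": "1"} if smiles == "c1ccccc1O" else {"aromatic_rings": "0"}


def toy_reasoner(text: str) -> Structure:
    return {"aromatic_rings": "1"} if "phenol" in text else {"aromatic_rings": "0"}


def toy_answerer(text: str, molecule: Optional[str], structure: Structure) -> str:
    mode = "analytic" if molecule is not None else "synthetic"
    return f"{mode}: aromatic_rings={structure['aromatic_rings']}"


analytic_out = msr_pipeline("Describe this molecule.", "c1ccccc1O", toy_tool, toy_reasoner, toy_answerer)
synthetic_out = msr_pipeline("Generate phenol.", None, toy_tool, toy_reasoner, toy_answerer)
```

Passing the three components as callables keeps the dispatch independent of any particular model or toolkit, which is convenient for a sketch but is not a claim about how the authors structured their code.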
Key Designs
- Six Key Structural Elements: Mimicking the reasoning process of chemists, the elements are defined from coarse to fine:
  - Molecular formula (atom types and counts)
  - Longest carbon chain length (backbone information)
  - Number of aromatic rings (stability and electronic properties)
  - Ring compounds (ring system types)
  - Functional groups (chemical reactivity)
  - Chiral centers (stereochemistry)
- Match Ratio Rejection Sampling: In synthetic reasoning, \(k\) candidate molecules are first generated with beam search. For each candidate, the match ratio between its structure and the predicted MSR is computed, and the candidate with the highest match ratio is returned as the output.
- Reliability Filtering: In synthetic reasoning, only structural components with sufficiently high inference accuracy are retained, while unreliable inference results are discarded.
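The last two designs can be sketched together: filter the predicted MSR down to its reliable components, then score each beam candidate against what remains. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the dictionary keys, the exact-match scoring rule, and the 0.8 threshold are all hypothetical, and a hand-written lookup table stands in for a real structure extractor such as RDKit.

```python
from typing import Callable, Dict, Mapping, Sequence

Structure = Mapping[str, object]


def filter_reliable(
    predicted: Structure, accuracy: Mapping[str, float], threshold: float
) -> Dict[str, object]:
    """Keep only the components whose measured accuracy reaches the threshold."""
    return {k: v for k, v in predicted.items() if accuracy.get(k, 0.0) >= threshold}


def match_ratio(candidate: Structure, target: Structure) -> float:
    """Fraction of target components that the candidate reproduces exactly."""
    if not target:
        return 0.0
    hits = sum(1 for key, value in target.items() if candidate.get(key) == value)
    return hits / len(target)


def select_candidate(
    candidates: Sequence[str], target: Structure, extract: Callable[[str], Structure]
) -> str:
    """Return the candidate with the highest match ratio.

    max() keeps the first maximal element, so ties fall back to beam order.
    """
    return max(candidates, key=lambda smiles: match_ratio(extract(smiles), target))


# Hand-written structures for ethanol and phenol (stand-in for a real extractor).
TOY_STRUCTURES = {
    "CCO": {"molecular_formula": "C2H6O", "aromatic_rings": 0, "functional_groups": ("hydroxyl",)},
    "c1ccccc1O": {"molecular_formula": "C6H6O", "aromatic_rings": 1, "functional_groups": ("hydroxyl",)},
}

# Per-component accuracies reported for MolT5-base on L+M in the ablation table.
accuracy = {"aromatic_rings": 0.825, "molecular_formula": 0.426, "functional_groups": 0.889}
# A made-up reasoning-module output with a deliberately wrong formula.
predicted = {"aromatic_rings": 1, "molecular_formula": "C6H7O", "functional_groups": ("hydroxyl",)}

target = filter_reliable(predicted, accuracy, threshold=0.8)  # threshold is an assumed value
best = select_candidate(["CCO", "c1ccccc1O"], target, TOY_STRUCTURES.__getitem__)
```

In this toy run the unreliable molecular formula is dropped before scoring, so the wrong formula cannot penalize the phenol candidate, which then matches both remaining components.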
Loss & Training
Training uses the standard sequence-to-sequence cross-entropy loss. The MSR is added to the training data either as an additional input (for analytic reasoning) or as an intermediate output (for the reasoning module in synthetic reasoning).
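One way to write these objectives is given below. The notation is an assumption of this note, not the paper's: \(x\) is the original input, \(s\) the MSR text, \(y\) the target answer, and \(\theta\) and \(\phi\) the parameters of the answering and reasoning modules.

\[
\mathcal{L}_{\text{answer}}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid y_{<t}, x, s\right),
\qquad
\mathcal{L}_{\text{reason}}(\phi) = -\sum_{t=1}^{|s|} \log p_\phi\left(s_t \mid s_{<t}, x\right)
\]

In analytic reasoning only \(\mathcal{L}_{\text{answer}}\) is needed, since \(s\) comes from the external tool. In synthetic reasoning both modules are fine-tuned, with \(\mathcal{L}_{\text{reason}}\) treating the MSR as the target sequence.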
Experiments
Main Results 1: Molecule-to-Text
L+M Dataset:
| Model | BLEU-2 | BLEU-4 | ROUGE-L | METEOR |
|---|---|---|---|---|
| MolT5-base | 0.738 | 0.535 | 0.539 | 0.718 |
| MolT5-base + MSR | 0.805 | 0.592 | 0.642 | 0.822 |
| MolT5-large | 0.769 | 0.556 | 0.557 | 0.743 |
| MolT5-large + MSR | 0.832 | 0.622 | 0.691 | 0.878 |
ChEBI-20 Dataset (with general LLMs):
| Model | BLEU-4 | ROUGE-L | METEOR |
|---|---|---|---|
| GPT-4o | 0.128 | 0.307 | 0.291 |
| GPT-4o + MSR | 0.174 | 0.313 | 0.341 |
| ChemT5-base + MSR | 0.560 | 0.626 | 0.657 |
| BioT5 (SOTA baseline) | 0.556 | 0.633 | 0.656 |
Main Results 2: Text-to-Molecule
| Model | BLEU | Exact Match | MACCS FTS | Morgan FTS | FCD↓ |
|---|---|---|---|---|---|
| MolT5-large | 0.564 | 0.000 | 0.757 | 0.395 | 17.50 |
| MolT5-large + MSR | 0.710 | 0.111 | 0.837 | 0.560 | 1.54 |
| MolT5-base | 0.684 | 0.000 | 0.760 | 0.475 | NaN |
| MolT5-base + MSR | 0.706 | 0.052 | 0.825 | 0.548 | 1.45 |
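For reference, the two FTS columns are fingerprint Tanimoto similarities: each molecule is mapped to a set of fingerprint bits (MACCS keys or Morgan fingerprints), and the score for a generated/reference pair is the size of the intersection divided by the size of the union, typically averaged over the test set. The sketch below works on hand-made bit sets and does no real fingerprinting; `tanimoto` and the bit indices are illustrative.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    union = bits_a | bits_b
    if not union:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return len(bits_a & bits_b) / len(union)


generated = {1, 4, 7, 9}
reference = {1, 4, 8, 9, 12}
score = tanimoto(generated, reference)  # 3 shared bits out of 6 distinct bits
```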
Ablation Study
Reasoning Module Accuracy (Synthetic Mode):
| Component | MolT5-base (L+M) | MolT5-base (ChEBI) | GPT-4o | Llama3 |
|---|---|---|---|---|
| Aromatic rings | 0.825 | 0.926 | 0.718 | 0.593 |
| Molecular formula | 0.426 | 0.458 | 0.298 | 0.084 |
| Functional groups | 0.889 | 0.957 | 0.298 | 0.137 |
Key Findings
- MSR consistently improves performance across the reported models and tasks, supporting the generality of the framework.
- Chemistry-specific LLM + MSR can outperform baselines pre-trained on much larger datasets (e.g., ChemT5-base + MSR \(\approx\) BioT5).
- General LLMs (GPT-4o, Llama3) exhibit much lower structural reasoning accuracy than fine-tuned chemistry LLMs, explaining their performance bottleneck on chemistry tasks.
- Rejection sampling in synthetic reasoning effectively improves the alignment between the generated molecules and the MSR.
- MSR enables models to reach good performance faster (improving training efficiency).
Highlights & Insights
- Accurately diagnoses the limitations of LLMs in understanding molecular structures and proposes targeted solutions.
- The dual-mode design of analytic/synthetic reasoning elegantly covers both scenarios where molecules serve as either inputs or outputs.
- Match ratio rejection sampling exploits the fact that structural information can be computed deterministically from a molecule, which lets generated candidates be verified against the reasoning output.
- Comprehensive experimental coverage: 3 tasks, 3 datasets, chemistry LLMs + general LLMs.
Limitations & Future Work
- The six structural elements are manually defined, which may not cover all crucial chemical characteristics.
- There is still significant room for improvement in the accuracy of the reasoning module for synthetic reasoning (e.g., molecular formula accuracy for MolT5-base is only about 43%-46%).
- Dependence on external tools (RDKit) subjects analytic reasoning to the limitations of the tool.
- Evaluation is only conducted on English chemical text, leaving cross-lingual applicability unknown.
- Rejection sampling increases computational overhead during inference.
Related Work & Insights
- Chemistry LLMs: MolT5 (Edwards et al., 2022), ChemT5 (Christofidellis et al., 2023), BioT5 (Pei et al., 2023).
- Chain-of-Thought Distillation: Ho et al. (2023), Magister et al. (2023) distill CoT into smaller models.
- Molecular Representations: SMILES (Weininger, 1988), SELFIES.
Rating
| Dimension | Score (1-10) |
|---|---|
| Novelty | 8 |
| Practicality | 7 |
| Experimental Thoroughness | 9 |
| Writing Quality | 8 |
| Overall Score | 8 |