Boosting LLM's Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning
Conference: ACL 2025
Authors: Xiang Zhuang, Bin Wu, Jiyu Cui, Kehua Feng, Xiaotong Li, Huabin Xing, Keyan Ding, Qiang Zhang, Huajun Chen (Zhejiang University, UCL)
arXiv: 2506.23056
Code: GitHub
Area: LLM/NLP, Chemical Reasoning, Molecular Structure Elucidation
Keywords: molecular structure elucidation, MCTS, knowledge base, reward model, spectral data, test-time scaling
TL;DR
This paper proposes K-MSE (Knowledge-enhanced Molecular Structure Elucidation), a framework that constructs a molecular substructure knowledge base to expand the chemical structure space coverage of LLMs, designs a specialized molecule-spectrum scorer to replace self-evaluation by LLMs, and incorporates Monte Carlo Tree Search (MCTS) to achieve test-time inference scaling. On the MolPuzzle benchmark, it improves the accuracy of GPT-4o-mini and GPT-4o from 3.7% and 27.8% to 27.3% and 39.8%, respectively.
Background & Motivation
- Core Problem: Molecular structure elucidation is a fundamental task in chemical experimental analysis — inferring molecular structures from spectral data such as NMR and IR. Even human experts require 10-15 minutes to process one molecule. Although LLMs have the potential to automate this process, they face two major challenges.
- Limitations of Prior Work: (1) LLMs lack comprehensive coverage of the molecular structure space — frequently misidentifying uncommon structures like thiophene as benzene rings (the most common aromatic structure); (2) LLMs cannot accurately evaluate their own reasoning results, lacking the domain knowledge necessary to assess the alignment between predicted molecules and spectral data, which leads to a lack of effective reward signals in tree search reasoning.
- Research Motivation: Expand chemical structure coverage with external knowledge and supply accurate rewards with a specialized scorer, so that MCTS can deliver test-time inference scaling for LLMs in molecular structure elucidation.
Method
Overall Architecture
K-MSE consists of three components:
1. Molecular Substructure Knowledge Base \(\mathcal{KB} = \{(s_i, d_i)\}\): Contains substructure SMILES representations and textual descriptions, with cyclic and acyclic substructures extracted from the MOSES molecular database.
2. Molecule-Spectrum Scorer: Consists of a molecular encoder \(g_m\) (GIN + MLP processing molecular graphs and Morgan fingerprints) and a spectral encoder \(g_s\) (Transformer processing chemical shifts, splitting patterns, and coupling constants of C-NMR/H-NMR).
3. MCTS Reasoning Framework: Starts by retrieving relevant substructures from the knowledge base, then iteratively performs Selection (UCT) \(\rightarrow\) Expansion (Critique + Rewrite) \(\rightarrow\) Evaluation (by the scorer) \(\rightarrow\) Backpropagation.
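The four-stage loop of the MCTS reasoning framework can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Node` class, the `search` function, the exploration constant `c`, and the choice to run selection over every node in the tree are hypothetical, and `expand` and `score` are placeholder callables standing in for the LLM Critique + Rewrite step and the molecule-spectrum scorer. The 0.5/0.5 value update mirrors the backpropagation rule quoted under Loss & Training.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """One candidate answer in the search tree (hypothetical structure)."""
    answer: str
    reward: float = 0.0   # the scorer's value for this answer alone
    q: float = 0.0        # value blended with descendants during backpropagation
    visits: int = 1
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def uct(node: Node, c: float = 1.4) -> float:
    """UCT score: exploitation term plus an exploration bonus (c is an assumption)."""
    parent_visits = node.parent.visits if node.parent else node.visits
    return node.q + c * math.sqrt(math.log(parent_visits + 1) / node.visits)


def search(initial: str,
           expand: Callable[[str], str],
           score: Callable[[str], float],
           iterations: int = 8) -> Node:
    """Run Selection -> Expansion -> Evaluation -> Backpropagation."""
    r0 = score(initial)
    root = Node(answer=initial, reward=r0, q=r0)
    nodes = [root]
    for _ in range(iterations):
        node = max(nodes, key=uct)                 # Selection
        new_answer = expand(node.answer)           # Expansion (critique + rewrite)
        r = score(new_answer)                      # Evaluation (external scorer)
        child = Node(answer=new_answer, reward=r, q=r, parent=node)
        node.children.append(child)
        nodes.append(child)
        cur = child                                # Backpropagation
        while cur.parent is not None:
            cur.parent.q = 0.5 * cur.q + 0.5 * cur.parent.q
            cur.parent.visits += 1
            cur = cur.parent
    return max(nodes, key=lambda n: n.reward)      # best-scored candidate


if __name__ == "__main__":
    # Toy stand-ins: "rewriting" appends a carbon, the "scorer" prefers length 4.
    best = search("C", lambda a: a + "C",
                  lambda a: 1.0 - abs(len(a) - 4) / 10, iterations=6)
    print(best.answer, best.reward)
```

Keeping `reward` separate from `q` means the final answer is chosen by the scorer's own judgment of each candidate, while `q` only steers which node is expanded next.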
Key Designs
- Knowledge Base Construction: Automatically extracts molecular substructures from the MOSES database and automatically generates reliable descriptions using LLMs combined with structural info from external tools, balancing both diversity and generalizability.
- Specialized Scorer: The molecular encoder utilizes GIN to encode molecular graphs + MLP to encode Morgan fingerprints. The spectral encoder discretizes NMR chemical shifts and coupling constants into token IDs before feeding them into a Transformer. Training employs the NT-Xent contrastive learning loss to maximize the embedding similarity of matched molecule-spectral pairs.
- Dual Role of the Scorer: It serves both as the MCTS reward model to evaluate candidate molecules (\(R(a') = \text{sim}(g_m(m_{a'}), g_s(n))\)) and as a retrieval bridge for the knowledge base — using the spectral encoder to encode query spectra and the molecular encoder to encode substructures for Top-k retrieval.
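The scorer's two roles reduce to the same operation: a similarity between a molecule-side embedding and a spectrum-side embedding. The sketch below illustrates this with pure-Python cosine similarity, assuming that \(\text{sim}\) is cosine similarity (the summary states this for training, not explicitly for the reward). The function names, the toy two-dimensional embeddings, and the three-entry knowledge base are hypothetical; real embeddings would come from \(g_m\) and \(g_s\).

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def reward(mol_emb: Sequence[float], spec_emb: Sequence[float]) -> float:
    """Role 1: score a candidate molecule against the query spectrum."""
    return cosine(mol_emb, spec_emb)


def retrieve_top_k(spec_emb: Sequence[float],
                   kb_embs: Dict[str, Sequence[float]],
                   k: int = 3) -> List[Tuple[str, float]]:
    """Role 2: rank knowledge-base substructures by similarity to the spectrum."""
    scored = [(smiles, cosine(emb, spec_emb)) for smiles, emb in kb_embs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy embeddings (made up for illustration only).
    kb = {"c1ccsc1": [0.9, 0.1], "c1ccccc1": [0.1, 0.9], "C=O": [0.5, 0.5]}
    query_spectrum = [1.0, 0.0]
    print(retrieve_top_k(query_spectrum, kb, k=2))
```

Because both roles share one embedding space, improving the scorer should benefit retrieval and reward quality at the same time.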
Loss & Training
The scorer is trained with the NT-Xent contrastive loss, which maximizes the cosine similarity of matched molecule-spectrum pairs and minimizes the similarity of the negative pairs within the batch; the temperature parameter \(\tau\) controls the sharpness of the resulting distribution. MCTS backpropagation updates a node's value as an equal-weight average of its child's value and its own previous value: \(Q(a) \leftarrow 0.5 \times Q(a') + 0.5 \times Q(a)\), where \(a'\) is the child of \(a\).
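To make the training objective concrete, the following is a minimal pure-Python sketch of a cross-modal NT-Xent loss. It is one common formulation, symmetric over the molecule-to-spectrum and spectrum-to-molecule directions with in-batch negatives; whether the paper uses exactly this variant is an assumption. The names `nt_xent` and `backprop_q` and the default `tau = 0.1` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def logsumexp(xs: List[float]) -> float:
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def nt_xent(mol_embs: List[Sequence[float]],
            spec_embs: List[Sequence[float]],
            tau: float = 0.1) -> float:
    """Pair i of mol_embs matches pair i of spec_embs; all others are negatives."""
    n = len(mol_embs)
    sims = [[cosine(m, s) / tau for s in spec_embs] for m in mol_embs]
    loss = 0.0
    for i in range(n):
        row = sims[i]                            # molecule i vs. every spectrum
        col = [sims[j][i] for j in range(n)]     # spectrum i vs. every molecule
        loss += logsumexp(row) - row[i]
        loss += logsumexp(col) - col[i]
    return loss / (2 * n)


def backprop_q(q_parent: float, q_child: float) -> float:
    """Equal-weight value update quoted in the text: 0.5*Q(a') + 0.5*Q(a)."""
    return 0.5 * q_child + 0.5 * q_parent


if __name__ == "__main__":
    mols = [[1.0, 0.0], [0.0, 1.0]]
    print(nt_xent(mols, [[1.0, 0.0], [0.0, 1.0]]))   # matched pairs: small loss
    print(nt_xent(mols, [[0.0, 1.0], [1.0, 0.0]]))   # swapped pairs: large loss
```

A smaller `tau` scales the similarities up before the softmax, so mismatched pairs are penalized more sharply.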
Experiments
Main Results: MolPuzzle Benchmark (216 molecules, zero-shot)
| Model | Method | Morgan FTS | MACCS FTS | ACC |
|---|---|---|---|---|
| GPT-4o-mini | baseline | 0.260 | 0.512 | 0.037 |
| GPT-4o-mini | + Self-Refine | 0.287 | 0.523 | 0.069 |
| GPT-4o-mini | + MCTSr | 0.281 | 0.530 | 0.069 |
| GPT-4o-mini | + K-MSE | 0.470 | 0.651 | 0.273 |
| GPT-4o | baseline | 0.493 | 0.690 | 0.278 |
| GPT-4o | + Self-Consistency | 0.551 | 0.732 | 0.347 |
| GPT-4o | + K-MSE | — | — | 0.398 |
| Llama-3.2-11B | baseline | 0.163 | 0.349 | 0.014 |
| Llama-3.2-11B | + K-MSE | 0.298 | 0.465 | 0.111 |
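The metric columns can be read with a small sketch. It assumes that FTS stands for fingerprint Tanimoto similarity, which is the usual meaning in molecule-generation evaluation but is not spelled out in this summary, and that ACC is exact-match accuracy over canonicalized structures. Fingerprints are represented here as collections of on-bit indices; in practice they would be produced by a cheminformatics toolkit. Both function names are hypothetical.

```python
from typing import Iterable, Sequence


def tanimoto(bits_a: Iterable[int], bits_b: Iterable[int]) -> float:
    """Tanimoto similarity |A and B| / |A or B| over sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both are empty
    return len(a & b) / len(union)


def exact_match_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of predictions identical to the reference string.

    Assumes both lists hold already-canonicalized SMILES of equal length.
    """
    pairs = list(zip(predicted, reference))
    return sum(p == r for p, r in pairs) / len(pairs)


if __name__ == "__main__":
    print(tanimoto([1, 2, 3], [2, 3, 4]))
    print(exact_match_accuracy(["CCO", "c1ccsc1"], ["CCO", "c1ccccc1"]))
```

Under this reading, a prediction can earn a high FTS by sharing most substructures with the target while still counting as wrong for ACC, which is why the FTS columns sit well above the ACC column.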
Ablation Study
| Ablated Component | Impact on GPT-4o-mini ACC |
|---|---|
| Full K-MSE | 0.273 |
| Remove knowledge base | Obvious decrease — LLMs struggle to identify uncommon substructures |
| Replace specialized scorer with LLM | Significant decrease — LLMs fail to accurately evaluate molecule-spectrum match |
| Remove molecular images from Critique | Decrease — Text-only critique struggles to detect structural errors |
| Remove chemical formulas from Critique | Decrease — Lacks chemical constraint information |
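Key Findings
- K-MSE achieves substantial improvements across all base models: GPT-4o-mini ACC +23.6 percentage points (0.037 to 0.273), GPT-4o ACC +12.0 percentage points (0.278 to 0.398), Llama-3.2-11B ACC +9.7 percentage points (0.014 to 0.111).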
- Existing general reasoning enhancement methods (Self-Refine, MCTSr, MAD) have limited effectiveness in molecular structure elucidation — the core bottleneck is the lack of domain knowledge.
- The specialized scorer is far superior to LLM self-evaluation — LLMs lack the domain knowledge to determine molecule-spectrum matches.
- Substructure information in the knowledge base is crucial for handling uncommon molecular structures.
- As a plug-and-play framework, K-MSE can be combined with any LLM.
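The accuracy gains quoted above follow directly from the main results table, and the arithmetic can be checked with a short script. The implied counts of correctly solved molecules are reconstructed here from the rounded accuracies and the 216-molecule benchmark size; they are an inference for illustration, not figures reported in this summary. The names `ACC` and `summarize` are hypothetical.

```python
from typing import Dict, Tuple

N_MOLECULES = 216  # MolPuzzle size stated in the results heading

# (baseline ACC, K-MSE ACC) copied from the main results table.
ACC: Dict[str, Tuple[float, float]] = {
    "GPT-4o-mini": (0.037, 0.273),
    "GPT-4o": (0.278, 0.398),
    "Llama-3.2-11B": (0.014, 0.111),
}


def summarize(acc: Dict[str, Tuple[float, float]], n: int) -> Dict[str, dict]:
    """Percentage-point gain and implied solved-molecule counts per model."""
    rows = {}
    for model, (base, kmse) in acc.items():
        rows[model] = {
            "gain_pp": round((kmse - base) * 100, 1),
            "base_correct": round(base * n),   # implied, not reported
            "kmse_correct": round(kmse * n),   # implied, not reported
        }
    return rows


if __name__ == "__main__":
    for model, row in summarize(ACC, N_MOLECULES).items():
        print(model, row)
```

Reading the gains as counts makes the small benchmark size tangible: for GPT-4o-mini the improvement corresponds to roughly 8 versus 59 solved molecules out of 216.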
Highlights & Insights
- The first work to apply test-time reasoning scaling combined with external knowledge enhancement to molecular structure elucidation.
- The dual-role design of the scorer, acting as both a reward model and a retrieval bridge, is highly elegant.
- The plug-and-play nature of the framework endows it with high practical value.
- The absolute accuracy gains on MolPuzzle are large: 23.6 percentage points for GPT-4o-mini and 12.0 percentage points for GPT-4o.
Limitations & Future Work
- Evaluation is only conducted on the MolPuzzle benchmark, which has a relatively small scale (216 molecules).
- The coverage of the scorer's training data may limit its generalization ability to rare molecular types.
- The increased number of MCTS iterations incurs significant inference time costs (API calls + scorer inference).
- The knowledge base is static, with online expansion or adaptive update mechanisms left unexplored.
- Only NMR and IR spectra are considered, leaving other commonly used analytical data like mass spectrometry (MS) unaddressed.
Related Work & Insights
- LLM Chemical Reasoning: ChemCrow (M. Bran et al., 2024) integrates external tools, ChatDrug (Liu et al., 2024) addresses molecular editing, and STRUCTCHEM (Ouyang et al., 2024) relies on predefined reasoning templates.
- Tree Search Reasoning: Tree-of-Thought (Yao et al., 2023) and MCTSr (Zhang et al., 2024a) lack accurate domain-specific reward models.
- Molecular Structure Elucidation: MolPuzzle (Guo et al., 2024) was the first to propose an LLM benchmark for this task.
Rating
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐⭐ |
| Technical Depth | ⭐⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐ |