
MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation

Conference: ICLR 2026 | arXiv: 2505.15054 | Code: GitHub / HuggingFace | Area: AI for Chemistry | Keywords: molecular recognition, molecule editing, molecule generation, molecule-language alignment, benchmark

TL;DR

This paper introduces MolLangBench, a benchmark constructed via automated tools and expert annotation to provide high-quality, unambiguous evaluation datasets for the molecule-language interface. It covers three task types (recognition / editing / generation) and three modalities (SMILES / image / graph), evaluates 16+ commercial LLMs and 5 chemistry-specific models, and reveals that even GPT-5 falls significantly short on basic molecular operations (generation accuracy only 43%).

Background & Motivation

Background: Recent work has extensively explored molecule-language alignment, but these approaches typically target downstream chemical tasks (e.g., property prediction, reaction prediction) directly, bypassing fundamental structure-level capabilities. This contrasts with the success of vision-language modeling: VLMs align text with visually observable content, whereas current molecular-language models try to align symbolic molecular structures with unobservable chemical properties, a mismatch that makes alignment considerably harder.

Limitations of Prior Work: (1) No benchmark systematically evaluates AI capabilities in basic molecular structure operations (recognition, editing, generation); (2) existing molecular benchmarks focus on high-level tasks (drug design, property prediction) while neglecting the prerequisite—whether models truly "understand" molecular structure; (3) existing datasets vary in quality and may contain ambiguities and uncertainties.

Key Challenge: If AI cannot reliably perform basic molecular structure recognition and manipulation, more complex chemical reasoning tasks (drug discovery, materials design) cannot be trusted. The chemist's workflow invariably begins with structural understanding.

Goal: To provide the first systematic, high-quality evaluation tool for fundamental molecule-language capabilities.

Key Insight: The benchmark is grounded in the actual chemist workflow (first recognize a structure, then manipulate it, then generate one), yielding a three-tier progressive task design.

Core Idea: Evaluate AI's fundamental molecular structure capabilities using deterministic, unambiguous, high-quality data, thereby exposing deficiencies in current models.

Method

Overall Architecture

MolLangBench evaluates three core capabilities of increasing difficulty:

  1. Molecular Structure Recognition: Given a molecule, answer structural questions in natural language (neighboring atoms, bond types, functional groups, ring structures, stereochemistry).
  2. Molecular Editing: Modify a given molecular structure according to natural language instructions.
  3. Molecular Generation: Generate a molecule from scratch based solely on a textual description.

Three molecular representations are supported: SMILES strings, molecular images (2D structure diagrams), and molecular graphs.

Key Designs

  1. Automated Construction Pipeline for Recognition Tasks:

    • Function: Guarantee deterministic and unambiguous answers
    • Mechanism: RDKit is used to automatically compute ground truth (single-hop neighbors, bond types, functional group identification, ring structures, stereochemistry, etc.), covering three categories: local topology, functional groups, and stereochemistry (see the RDKit sketch after this list)
    • Design Motivation: Automated tools ensure each question has a unique, definitive answer, avoiding subjectivity introduced by human annotation
    • Sampling Strategy: Label-balanced sampling from 10,000 candidate molecules, with deliberate selection of harder examples (e.g., cases where bonded atoms are non-adjacent in SMILES)
  2. Expert Annotation Pipeline for Editing and Generation Tasks:

    • Function: Construct precise mappings between natural language instructions and molecular structures
    • Mechanism: A three-stage pipeline—(1) annotators with chemistry backgrounds draft instructions/descriptions; (2) a second annotator conducts peer review with iterative revision until consensus is reached; (3) two independent validators reconstruct the molecular structure from text alone, and a sample is accepted only if both succeed
    • Design Motivation: Accurate molecular reconstruction from text alone constitutes the strongest validation of instruction/description unambiguity
    • Effort: Over 500 hours of expert annotation and validation
  3. Anti-Leakage and Robustness Design:

    • Function: Ensure evaluation results are not affected by data leakage or memorization
    • Mechanism: (1) Unique hash canary strings to detect leakage; (2) SMILES enumeration augmentation (enumerating from different starting atoms) to test robustness; editing accuracy across 5 different augmentations is 0.773 ± 0.027, indicating high consistency (this enumeration step also appears in the sketch after this list)
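
A minimal sketch of how such deterministic ground truth can be computed with RDKit; the molecule, SMARTS pattern, and atom-indexing convention below are illustrative assumptions, not the paper's exact templates. The last step also shows the SMILES enumeration augmentation from design point 3.

```python
from rdkit import Chem

# Aspirin, used purely for illustration.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# Local topology: neighbors and bond types of a given atom (0-indexed).
atom = mol.GetAtomWithIdx(3)  # the ester oxygen
print([(nbr.GetIdx(), nbr.GetSymbol()) for nbr in atom.GetNeighbors()])
print([str(bond.GetBondType()) for bond in atom.GetBonds()])

# Functional groups: SMARTS substructure matching.
acid = Chem.MolFromSmarts("C(=O)[OX2H1]")  # carboxylic acid
print(mol.GetSubstructMatches(acid))

# Ring structures: number of rings and the atom indices in each ring.
ring_info = mol.GetRingInfo()
print(ring_info.NumRings(), ring_info.AtomRings())

# Stereochemistry: chiral centers (aspirin has none).
print(Chem.FindMolChiralCenters(mol, includeUnassigned=True))

# SMILES enumeration (anti-memorization design): rewrite the same molecule
# starting from every atom and check that answers stay consistent.
variants = {Chem.MolToSmiles(mol, rootedAtAtom=i, canonical=False)
            for i in range(mol.GetNumAtoms())}
print(len(variants), "distinct SMILES for one molecule")
```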

Loss & Training

MolLangBench does not involve model training. Evaluation metrics: exact-match accuracy for recognition and editing tasks; accuracy (whether the generated molecule satisfies all specified conditions) for generation tasks, supplemented by Tanimoto similarity (molecular fingerprints) and pass@k metrics.
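
A minimal sketch of these metrics, assuming RDKit canonicalization for exact match and Morgan fingerprints (radius 2, 2048 bits) for Tanimoto similarity; the paper does not pin these settings down, so treat them as assumptions.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def exact_match(pred_smiles: str, gold_smiles: str) -> bool:
    """Exact match up to SMILES canonicalization; invalid SMILES is a miss."""
    pred = Chem.MolFromSmiles(pred_smiles)
    if pred is None:
        return False
    gold = Chem.MolFromSmiles(gold_smiles)
    return Chem.MolToSmiles(pred) == Chem.MolToSmiles(gold)

def tanimoto(pred_smiles: str, gold_smiles: str) -> float:
    """Tanimoto similarity over Morgan fingerprints (radius 2, 2048 bits)."""
    pred = Chem.MolFromSmiles(pred_smiles)
    if pred is None:
        return 0.0
    gold = Chem.MolFromSmiles(gold_smiles)
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048)
           for m in (pred, gold)]
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])
```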

Key Experimental Results

Main Results

Evaluation of 16 commercial LLMs (SMILES modality, core test set):

| Model | Recognition Acc. | Editing (Valid / Sim. / Acc.) | Generation (Valid / Sim. / Acc.) |
| --- | --- | --- | --- |
| GPT-5 | 0.862 | 0.960 / 0.923 / 0.855 | 0.920 / 0.741 / 0.430 |
| o3 | 0.918 | 0.945 / 0.903 / 0.785 | 0.670 / 0.546 / 0.290 |
| o4-mini | 0.872 | 0.930 / 0.885 / 0.740 | 0.820 / 0.651 / 0.350 |
| Gemini-2.5-Pro | 0.852 | 0.930 / 0.881 / 0.745 | 0.865 / 0.737 / 0.430 |
| Claude-Opus-4.1 | 0.814 | 0.950 / 0.884 / 0.705 | 0.920 / 0.725 / 0.330 |
| Llama-4-Maverick | 0.614 | 0.895 / 0.772 / 0.545 | 0.875 / 0.511 / 0.115 |
| Qwen3-Max | 0.486 | 0.690 / 0.561 / 0.360 | 0.465 / 0.104 / 0.000 |

Ablation Study

Chemistry-specific models vs. general-purpose LLMs:

| Model | Recognition | Editing Acc. | Generation Acc. |
| --- | --- | --- | --- |
| ChemDFM-13B | 0.300 | 0.025 | 0.000 |
| Galactica-120B | 0.290 | 0.040 | 0.000 |
| HIGHT (graph-language) | 0.127 | 0.000 | 0.000 |
| GPT-4o (general) | 0.593 | 0.525 | 0.115 |

SMILES vs. SELFIES representation (o3 model):

| Representation | Recognition | Editing Acc. | Generation Acc. |
| --- | --- | --- | --- |
| SMILES | 0.918 | 0.785 | 0.290 |
| SELFIES | 0.528 | 0.195 | 0.000 |
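
For context on what the model actually sees under each representation, the official selfies package converts between the two; the exact token string shown is illustrative and may vary with library version.

```python
import selfies as sf  # pip install selfies

smiles = "CC(=O)O"                # acetic acid
selfies_str = sf.encoder(smiles)  # e.g. '[C][C][=Branch1][C][=O][O]'
print(selfies_str)
print(sf.decoder(selfies_str))    # round-trips to an equivalent SMILES
```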

pass@k results (o3 model):

| Task | pass@1 | pass@3 | pass@5 |
| --- | --- | --- | --- |
| Editing (core) | 0.785 | 0.856 | 0.900 |
| Generation (core) | 0.290 | 0.485 | 0.545 |
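
pass@k is presumably the standard unbiased estimator from Chen et al. (2021); whether MolLangBench computes it exactly this way is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws from n samples
    (c of them correct) succeeds: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: some draw must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 5 generations per prompt, 2 of them correct:
print(pass_at_k(5, 2, 1), pass_at_k(5, 2, 3))  # 0.4 0.9
```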

Key Findings

  1. Generation is extremely challenging: The strongest model, GPT-5, reaches only 43.0% generation accuracy, and pass@5 climbs only to 54.5%; current AI is severely limited in constructing molecular structures from textual descriptions.
  2. Six error categories for the o3 model (editing / generation counts): invalid SMILES syntax (11/66), stereochemistry errors (9/15), chain-length errors (4/8), substituent misplacement (13/42), ring-structure errors (10/23), and extra/missing groups (1/3). Atom-counting and enumeration failures caused by BPE tokenization are one root cause (illustrated after this list).
  3. SELFIES is far inferior to SMILES: With the same o3 model, generation accuracy under SELFIES drops to 0%, attributed to the scarcity of SELFIES strings in LLM training data.
  4. Chemistry-specific models lag across the board: ChemDFM, Galactica, and others fall far below the general-purpose GPT-4o, suggesting that model scale currently outweighs domain-specific training.
  5. Structural understanding improves downstream reasoning: GPT-4o gains roughly 5 points on property prediction when it first describes the structure and then predicts, compared with predicting directly (BBBP: 0.551→0.603, BACE: 0.583→0.632).
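
Finding 2's tokenization point is easy to illustrate with the open-source tiktoken library (which BPE vocabulary the evaluated models actually use is an assumption): repeated atoms get merged into multi-character tokens, so a model never sees a carbon chain atom by atom.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("o200k_base")  # one OpenAI BPE vocabulary
smiles = "CCCCCCCCCC"                      # decane: ten carbon atoms
print([enc.decode([t]) for t in enc.encode(smiles)])
# The ten 'C' characters come back as a few merged chunks, which makes
# atom counting and position enumeration error-prone for the model.
```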

Highlights & Insights

  • Fills an important gap: The first comprehensive benchmark derived from the chemist's workflow that systematically evaluates AI's fundamental molecular structure capabilities.
  • High-quality data construction: 500+ hours of expert annotation and a three-stage verification process guarantee unambiguity—this itself constitutes a core contribution.
  • Reveals a course deviation: By skipping structural understanding and targeting property prediction directly, current molecular-language research may be heading in the wrong direction, analogous to a VLM attempting reasoning without first recognizing the objects in an image.
  • Accompanying training data: MolLangData provides large-scale training data, forming a complete ecosystem.

Limitations & Future Work

  • The editing/generation core sets each contain only 200 samples, which is relatively small (though the 500-hour annotation cost limits expansion).
  • Molecules are restricted to fewer than 40 heavy atoms (covering 93% of UniChem molecules); biomacromolecules are not addressed.
  • Evaluation relies on the Mathpix API to convert generated images back to SMILES, introducing an additional source of error.
  • Evaluation is primarily conducted on OpenAI models; coverage of open-source models could be more comprehensive.
  • vs. MoleculeNet: MoleculeNet focuses on property prediction, while MolLangBench targets the language-molecular structure interaction; the two operate at different levels of abstraction.
  • vs. MolX/Uni-MRL: These works jump to property prediction and molecule captioning while bypassing structural understanding as a prerequisite.
  • Analogy to GPQA: A "diamond set" of only 198 samples still serves as a standard benchmark for LLM scientific reasoning; quality outweighs scale.
  • Insight: AI for Science requires evaluation of fundamental capabilities before advanced tasks—this represents a "GLUE moment" for the chemistry domain.

Rating

  • Novelty: ⭐⭐⭐⭐ First comprehensive benchmark for the molecule-language structure interface, with a clearly defined problem that aligns well with real chemical workflows.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 16+ LLMs + 5 chemistry models + 3 modalities + SELFIES + pass@k + error analysis + downstream property experiments—extremely comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure, well-motivated, and thoroughly argued.
  • Value: ⭐⭐⭐⭐⭐ Provides a much-needed standardized evaluation tool for AI chemistry research, with potential to reorient the field's research priorities.