# Entropy-Guided Dynamic Tokens for Graph-LLM Alignment in Molecular Understanding
Conference: ICLR 2026 arXiv: 2602.02742 Code: None Area: Graph Learning / Molecular Understanding Keywords: Graph-LLM Alignment, Dynamic Tokens, Molecular Graph, Q-Former, Entropy Guidance
## TL;DR
This paper proposes EDT-Former (Entropy-guided Dynamic Token Transformer), which establishes efficient alignment between a frozen graph encoder and a frozen LLM via an entropy-guided dynamic token generation mechanism. Without fine-tuning the LLM backbone, EDT-Former achieves state-of-the-art performance across multiple benchmarks including molecular question answering, molecular instruction following, and property prediction.
## Background & Motivation
Molecular understanding is central to scientific discovery (e.g., drug design, materials discovery), yet LLMs face inherent challenges when processing molecular graph structures — LLMs excel at sequential text but molecules are graph-structured data encoding atomic connectivity, stereochemical information, and substructural context.
Existing graph-LLM bridging approaches largely adopt the Q-Former architecture from the vision-language domain, using fixed-length static query tokens to compress graph information. However, this paradigm suffers from three core issues:
1. Information loss from static tokens: Fixed-length token sequences cannot adapt to molecular complexity, so simple molecules may be over-represented while complex ones are under-represented. Q-Former was originally designed for visual tasks and cannot effectively capture the topological information and stereochemical properties inherent in graph-structured data.
2. Neglect of stereochemistry and substructure: Three-dimensional molecular configurations and functional groups are essential for understanding chemical properties, yet existing fixed-token methods struggle to retain these local and global features.
3. Expensive LLM fine-tuning: Most methods require fine-tuning the LLM backbone, incurring high computational cost and limiting generalization.
Core Idea: Leverage information entropy to adaptively determine how many tokens each molecule requires and which parts of the molecule each token should attend to, enabling dynamic, content-aware graph-to-text representation conversion.
## Method

### Overall Architecture
The EDT-Former pipeline consists of three core components:

- Graph Encoder (frozen): Encodes molecular graphs into node-level representations.
- EDT-Former Connector: Transforms graph representations into a token sequence whose length and content are determined dynamically.
- LLM (frozen backbone; only the embedding layer is fine-tuned): Receives the token sequence and textual instructions to produce outputs.
The overarching goal is to achieve effective molecular graph–language alignment through lightweight fine-tuning of the EDT-Former connector and embedding layer alone, without fine-tuning the LLM backbone.
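Since no code is released, the following is a minimal PyTorch sketch of this freeze pattern, assuming a Hugging Face-style causal LM (`get_input_embeddings`, `inputs_embeds`); the graph encoder and connector are placeholder modules, not the paper's implementation:

```python
import torch
import torch.nn as nn

class GraphLLMPipeline(nn.Module):
    """Frozen graph encoder -> trainable connector -> frozen LLM backbone."""

    def __init__(self, graph_encoder: nn.Module, connector: nn.Module, llm: nn.Module):
        super().__init__()
        self.graph_encoder = graph_encoder  # frozen: pre-trained molecular GNN
        self.connector = connector          # trainable: the EDT-Former connector
        self.llm = llm                      # frozen backbone, trainable embeddings

        # Freeze everything first ...
        for p in self.parameters():
            p.requires_grad = False
        # ... then re-enable only the connector and the LLM input embeddings.
        for p in self.connector.parameters():
            p.requires_grad = True
        for p in self.llm.get_input_embeddings().parameters():
            p.requires_grad = True

    def forward(self, graph, instruction_ids):
        # instruction_ids: (1, seq_len) token ids; batch size 1 for simplicity.
        node_feats = self.graph_encoder(graph)       # (num_nodes, d_graph)
        mol_tokens = self.connector(node_feats)      # (n_tokens, d_llm), n_tokens varies
        text_embeds = self.llm.get_input_embeddings()(instruction_ids)
        inputs = torch.cat([mol_tokens.unsqueeze(0), text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```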
### Key Designs
- Entropy-Guided Dynamic Tokenization: This is the core innovation. Unlike Q-Former's fixed number of static queries, EDT-Former dynamically determines the number and focus of tokens from the information distribution of the molecular graph (see the sketch after this list). Specifically:
- The information entropy distribution of node representations output by the graph encoder is first computed; high-entropy regions indicate information-rich or highly uncertain molecular fragments.
- The molecular graph is partitioned into multiple molecular patches according to the entropy distribution, each corresponding to an information-dense substructure.
- One or more dynamic tokens are generated per patch, with more tokens allocated to high-entropy regions to preserve richer information.
- Consequently, complex molecules automatically receive more tokens while simple molecules require fewer, balancing computational efficiency and representational quality.
- Molecular Patch Alignment: The dynamic tokens generated by EDT-Former are aligned with information-rich molecular patches, ensuring each token carries meaningful structural and chemical information. The design draws an analogy to vision: just as ViT patches contain spatially contiguous pixel information, molecular patches should encode topologically contiguous and chemically meaningful atomic clusters. Through this alignment mechanism, tokens retain local substructural features (e.g., functional groups) while integrating global structural information (e.g., molecular scaffold topology) via attention.
- Parameter-Efficient Alignment Training: EDT-Former adopts a parameter-efficient training scheme:
- The graph encoder is fully frozen, preserving pre-trained general molecular representation capabilities.
- The LLM backbone is fully frozen, preserving language understanding and generation capabilities.
- Only the EDT-Former connector and the LLM embedding layer are trained.
- This design substantially reduces computational cost compared to full fine-tuning while maintaining generalization.
- Training objectives include an alignment loss (aligning token representations with the LLM embedding space) and task-specific losses (e.g., QA accuracy, property prediction error).
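The exact entropy definition and patching rule are not spelled out in this summary, so the following is a hedged PyTorch sketch of the mechanism: Shannon entropy over softmaxed node features (an assumed reading of "entropy of node representations"), entropy-proportional token budgeting, and per-patch token generation via cross-attention. How patches are carved out of the graph (e.g., entropy-guided grouping of neighboring nodes) is elided; `allocate_tokens`, `PatchTokenizer`, and the budget values are hypothetical:

```python
import torch
import torch.nn.functional as F
from torch import nn

def node_entropy(node_feats: torch.Tensor) -> torch.Tensor:
    # Treat each node's softmaxed feature vector as a distribution and take
    # its Shannon entropy; high entropy flags information-rich fragments.
    probs = F.softmax(node_feats, dim=-1)
    return -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)  # (num_nodes,)

def allocate_tokens(patch_entropy: torch.Tensor,
                    min_per_patch: int = 1, budget: int = 32) -> torch.Tensor:
    # Every patch gets at least `min_per_patch` tokens; any spare budget is
    # split in proportion to patch entropy, so complex regions get more.
    n = patch_entropy.numel()
    counts = torch.full((n,), min_per_patch, dtype=torch.long)
    spare = budget - n * min_per_patch
    if spare > 0:
        weights = patch_entropy / patch_entropy.sum()
        counts = counts + (weights * spare).round().long()
    return counts

class PatchTokenizer(nn.Module):
    # Generates k_i dynamic tokens per patch by cross-attending a shared pool
    # of learnable queries to that patch's node representations.
    def __init__(self, d_model: int, n_heads: int = 4, max_queries: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(max_queries, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, patches: list[torch.Tensor], counts: torch.Tensor) -> torch.Tensor:
        tokens = []
        for nodes, k in zip(patches, counts.tolist()):
            k = min(int(k), self.queries.size(0))   # stay within the query pool
            q = self.queries[:k].unsqueeze(0)       # (1, k, d): first k queries
            kv = nodes.unsqueeze(0)                 # (1, patch_size, d)
            out, _ = self.attn(q, kv, kv)           # tokens attend to their patch
            tokens.append(out.squeeze(0))
        return torch.cat(tokens, dim=0)             # (sum(k_i), d): dynamic length
```

The key property this illustrates is that the output sequence length is data-dependent: a small, low-entropy molecule yields few tokens, while a complex one spends more of the budget.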
### Loss & Training
Training follows a multi-stage strategy: graph–text alignment pre-training (aligning token representations with the text embedding space) is performed first, followed by downstream task fine-tuning. The loss function combines a contrastive learning loss (pulling matched molecule–text pairs closer and pushing unmatched pairs apart) and a generative loss (e.g., cross-entropy for QA tasks). Crucially, the trainable parameters throughout the entire pipeline are limited to the EDT-Former connector and the embedding layer, drastically reducing the number of trainable parameters.
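As an illustration, here is a minimal sketch of such a two-part objective: a symmetric InfoNCE-style contrastive term on pooled molecule/text embeddings plus a token-level cross-entropy generative term. The temperature, the weighting `alpha`, and the `-100` label-masking convention are assumptions, not values from the paper:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(mol_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    mol = F.normalize(mol_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = mol @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(mol.size(0), device=mol.device)
    # Matched pairs sit on the diagonal; pull them together, push others apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def total_loss(mol_emb, txt_emb, lm_logits, labels, alpha: float = 1.0):
    # Generative term: standard next-token cross-entropy, masking padded
    # positions with the common -100 convention (an assumption here).
    gen = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                          labels.view(-1), ignore_index=-100)
    return gen + alpha * contrastive_loss(mol_emb, txt_emb)
```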
## Key Experimental Results

### Main Results
EDT-Former is evaluated on four categories of molecular understanding benchmarks and achieves or surpasses state-of-the-art performance on all of them:
| Benchmark | Task Type | EDT-Former | Prev. SOTA | Key Finding |
|---|---|---|---|---|
| MoleculeQA | Molecular QA | SOTA | Q-Former variants | Dynamic tokens significantly outperform static tokens |
| Mol-Instructions | Molecular instruction following | SOTA | Methods requiring LLM fine-tuning | Surpasses methods that require LLM fine-tuning without doing so |
| TDC | Property prediction | SOTA | Graph model + LLM fine-tuning | Consistently leads across multiple sub-tasks |
| MoleculeNet | Property prediction | SOTA | Traditional GNNs | Particularly advantageous in low-data regimes |
### Ablation Study
| Configuration | Key Metric Change | Note |
|---|---|---|
| Fixed tokens vs. dynamic tokens | Dynamic tokens significantly better | Validates the necessity of adaptive token generation |
| With entropy guidance vs. without | With guidance is better | Entropy signals effectively guide token allocation |
| Frozen LLM vs. fine-tuned LLM | Frozen LLM achieves comparable performance | Indicates EDT-Former's alignment quality is sufficiently high |
| Different graph encoders | EDT-Former remains effective across encoders | Framework is generalizable |
### Key Findings
- Entropy-guided dynamic tokens consistently outperform fixed-length tokens across all tasks, confirming the importance of adaptive representation length.
- EDT-Former surpasses methods requiring full LLM fine-tuning without modifying the LLM backbone, demonstrating the feasibility of efficient graph–language alignment.
- EDT-Former also performs strongly on molecular property prediction tasks that demand precise numerical understanding, indicating that dynamic tokens effectively preserve quantitative chemical information.
- Fine-tuning the embedding layer alone is a critical design choice: keeping the LLM entirely frozen, embedding layer included, yields noticeably worse results, while fine-tuning just the embedding layer substantially closes the gap.
## Highlights & Insights
- Adapting vision paradigms to molecules: The paper cleverly transplants the Q-Former paradigm from the vision-language domain into molecular understanding while addressing the fundamental deficiencies of direct transfer (static tokens, neglect of topological structure).
- Entropy as an information allocation signal: Using information entropy to determine token count and allocation is an elegant design — high-entropy regions genuinely require finer-grained representation.
- Frozen backbone + lightweight connector paradigm: EDT-Former further validates the effectiveness of the "freeze large model, train small connector" paradigm in the molecular domain.
- General value of variable-length representations: The dynamic token concept is transferable to other graph–language tasks (e.g., protein understanding, materials design).
## Limitations & Future Work
- The method is currently validated only on 2D molecular graphs; its capability for handling 3D molecular conformations (e.g., protein folding configurations) remains unexplored.
- The choice of upper and lower bounds on the dynamic token count may affect the efficiency–quality trade-off and warrants further sensitivity analysis.
- Although the embedding layer introduces few trainable parameters, sufficient alignment data is still required, which may be a bottleneck in low-resource chemistry domains.
- Comparison with the latest end-to-end molecular foundation models (e.g., Galactica, Mol-GPT) is insufficient.
- Entropy-guided token generation introduces additional computational steps; the impact on inference efficiency in large-scale molecular screening scenarios needs evaluation.
## Related Work & Insights
- BLIP-2 / Q-Former: EDT-Former draws direct design inspiration from Q-Former in the vision-language domain, while addressing the unique challenges of graph-structured data through dynamic tokenization and entropy guidance.
- MolCA, MoMu, and related molecular-language methods: These works establish the foundation for molecular-language alignment; EDT-Former builds upon them to achieve finer-grained alignment via dynamic tokens.
- GNN + LLM joint frameworks: EDT-Former belongs to the "graph encoder + connector + LLM" paradigm; its core contribution lies in the connector design.
- The proposed approach suggests that in multimodal alignment, connector design — particularly dynamic vs. static and content-aware vs. fixed — may matter more than simply scaling model size.
## Rating
- Novelty: ⭐⭐⭐⭐ (Entropy-guided dynamic tokens represent an effective innovation, though the overall framework remains a Q-Former variant)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Four benchmark categories with comprehensive ablation studies)
- Writing Quality: ⭐⭐⭐⭐ (Clear motivation and well-designed experiments)
- Value: ⭐⭐⭐⭐ (Provides a new, efficient paradigm for multimodal approaches to molecular understanding)