GlycanAA: Modeling All-Atom Glycan Structures via Hierarchical Message Passing and Multi-Scale Pre-training¶
Conference: ICML 2025
arXiv: 2506.01376
Code: https://github.com/kasawa1234/GlycanAA
Area: Graph Learning
Keywords: Glycan modeling, all-atom graphs, hierarchical message passing, multi-scale pre-training, GNN
TL;DR¶
This work proposes GlycanAA, the first all-atom glycan modeling approach. Each glycan is represented as a heterogeneous graph containing atom nodes and monosaccharide nodes, and a hierarchical message passing scheme captures multi-scale information ranging from local atomic interactions to global monosaccharide interactions. Multi-scale masked prediction pre-training on unlabeled glycans yields PreGlycanAA, which ranks first on average across the 11 tasks of the GlycanML benchmark.
Background & Motivation¶
1. Importance of Glycans¶
Glycans are complex macromolecules composed of sugar molecules, playing critical roles in biological processes such as extracellular matrix formation, cell-cell communication, immune response, and cell differentiation.
Key Challenge¶
Key Challenge: Prior methods model glycans as monosaccharide-level graphs and ignore atom-level structure. Applying small-molecule encoders directly to glycans also yields poor results, because the scale mismatch between small molecules and glycans leaves them with insufficient representation capability.
Solution Approach¶
Core Idea: Exploit the natural hierarchy of glycans, in which atoms constitute the local structure of each monosaccharide and monosaccharides constitute the global backbone. A hierarchical message passing model is designed to capture both scales simultaneously.
Method¶
Overall Architecture¶
- Represent glycans as heterogeneous graphs: atom nodes + monosaccharide nodes + different types of edges.
- Hierarchical message passing: three levels of interactions (atom-atom, atom-monosaccharide, monosaccharide-monosaccharide).
- Multi-scale masked prediction pre-training: self-supervised learning on 40,781 unlabeled glycans.
Key Designs¶
1. Heterogeneous Graph Representation¶
- Atom nodes: Encode atom types, charges, and other properties.
- Monosaccharide nodes: Encode monosaccharide types (e.g., Glucose, GlcNAc).
- Edge types: Covalent bonds between atoms, atom-monosaccharide assignment edges, and glycosidic bonds between monosaccharides.
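
The representation above can be sketched as a small container with two node sets and three edge lists. This is a minimal illustration, not the GlycanAA code: the class name `HeteroGlycanGraph`, its field names, and the toy two-unit glycan (three atoms per unit instead of full rings) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeteroGlycanGraph:
    """Toy heterogeneous glycan graph (hypothetical names, not the GlycanAA API)."""
    atom_types: list[str]   # one label per atom node
    mono_types: list[str]   # one label per monosaccharide node
    atom_atom: list[tuple[int, int]] = field(default_factory=list)  # covalent bonds
    atom_mono: list[tuple[int, int]] = field(default_factory=list)  # (atom, mono) assignment
    mono_mono: list[tuple[int, int]] = field(default_factory=list)  # glycosidic bonds

    def atoms_of(self, mono: int) -> list[int]:
        """Indices of the atom nodes assigned to one monosaccharide node."""
        return [a for a, m in self.atom_mono if m == mono]


# Heavily simplified glycan with two monosaccharide units.
g = HeteroGlycanGraph(
    atom_types=["C", "C", "O", "C", "C", "O"],
    mono_types=["Glucose", "GlcNAc"],
    # Bonds inside each unit, plus one bond (2, 3) linking the two units.
    atom_atom=[(0, 1), (1, 2), (3, 4), (4, 5), (2, 3)],
    atom_mono=[(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1)],
    mono_mono=[(0, 1)],
)
print(g.atoms_of(1))  # [3, 4, 5]
```

Keeping the three edge lists separate makes it easy to apply a different update rule to each relation type.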
2. Hierarchical Message Passing¶
- Atom-Atom: Propagation within monosaccharides to capture local covalent bond information.
- Atom-Monosaccharide: Aggregation of atom features into monosaccharide representations, enabling local-to-global information flow.
- Monosaccharide-Monosaccharide: Propagation along the glycan backbone to capture global topological information.
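
The three levels can be illustrated with one round of parameter-free mean aggregation. This sketch only shows the direction of information flow. The actual model presumably uses learned message functions per edge type, and the function names and the choice of mean pooling here are assumptions.

```python
def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def neighbors(edges, num_nodes):
    """Undirected adjacency lists built from an edge list."""
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def hierarchical_round(atom_h, mono_h, atom_atom, atom_mono, mono_mono):
    """One round: atom-atom, then atom-to-monosaccharide, then mono-mono."""
    # 1) Atom-atom: each atom averages itself with its bonded neighbours.
    adj = neighbors(atom_atom, len(atom_h))
    atom_h = [mean([atom_h[i]] + [atom_h[j] for j in adj[i]])
              for i in range(len(atom_h))]

    # 2) Atom-monosaccharide: pool member atoms into their monosaccharide node.
    members = [[] for _ in mono_h]
    for a, m in atom_mono:
        members[m].append(atom_h[a])
    mono_h = [mean([mono_h[m]] + members[m]) for m in range(len(mono_h))]

    # 3) Monosaccharide-monosaccharide: propagate along glycosidic bonds.
    adj = neighbors(mono_mono, len(mono_h))
    mono_h = [mean([mono_h[i]] + [mono_h[j] for j in adj[i]])
              for i in range(len(mono_h))]
    return atom_h, mono_h


# Two monosaccharides with two atoms each, using 1-dimensional toy features.
atom_h, mono_h = hierarchical_round(
    atom_h=[[1.0], [3.0], [5.0], [7.0]],
    mono_h=[[0.0], [0.0]],
    atom_atom=[(0, 1), (2, 3)],
    atom_mono=[(0, 0), (1, 0), (2, 1), (3, 1)],
    mono_mono=[(0, 1)],
)
```

After a single round, each monosaccharide representation already depends on the atoms of its neighbouring unit, which is the local-to-global flow described above.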
3. Multi-Scale Masked Prediction Pre-training¶
- Pre-training data: 40,781 high-quality unlabeled glycans filtered from the GlyTouCan database.
- Objective: randomly mask a portion of the atom nodes and monosaccharide nodes, then train the model to recover the masked node types, so that it learns dependencies at both scales.
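
A minimal sketch of the corruption step, assuming the two node sets are masked independently and that the prediction target is the original node type. The mask ratios (0.3), the `<mask>` token, and the function names are hypothetical, since this summary only says "a portion" of nodes is masked.

```python
import random

MASK = "<mask>"  # hypothetical mask token


def mask_nodes(labels, ratio, rng):
    """Hide a random subset of labels; return (corrupted labels, {index: original})."""
    k = max(1, round(ratio * len(labels)))
    picked = rng.sample(range(len(labels)), k)
    corrupted = list(labels)
    targets = {}
    for i in picked:
        targets[i] = labels[i]
        corrupted[i] = MASK
    return corrupted, targets


def multi_scale_mask(atom_types, mono_types, atom_ratio=0.3, mono_ratio=0.3, seed=0):
    """Mask atom nodes and monosaccharide nodes in the same training example."""
    rng = random.Random(seed)
    atoms, atom_targets = mask_nodes(atom_types, atom_ratio, rng)
    monos, mono_targets = mask_nodes(mono_types, mono_ratio, rng)
    return atoms, atom_targets, monos, mono_targets


atoms, atom_targets, monos, mono_targets = multi_scale_mask(
    ["C", "C", "O", "C", "C", "O"], ["Glucose", "GlcNAc"]
)
# A model would be trained to predict atom_targets and mono_targets
# from the corrupted graph, e.g. with a cross-entropy loss per scale.
```

Single-scale masking, as used in the ablation, would correspond to applying `mask_nodes` to only one of the two node sets.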
Key Experimental Results¶
Main Results: GlycanML Benchmark Ranking¶
| Method | Type | Average Rank across 11 Tasks | Description |
|---|---|---|---|
| PreGlycanAA | All-atom + Pre-training | 1st | Ours |
| GlycanAA | All-atom | 2nd | Without pre-training |
| SweetNet | Monosaccharide-level GNN | 3rd | Prev. SOTA |
| SchNet | Small molecule encoder | 8th | Scale mismatch |
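
For readers unfamiliar with rank-based summaries, the sketch below shows one common way to turn per-task scores into an average rank. It assumes an unweighted mean over tasks and "higher is better" scores. The benchmark's exact aggregation rule is not given in this summary, and the method names and numbers are invented purely for illustration.

```python
def average_ranks(scores):
    """scores[task][method] -> score (higher is better). Returns {method: mean rank}."""
    methods = list(next(iter(scores.values())))
    totals = {m: 0.0 for m in methods}
    for per_method in scores.values():
        # Rank 1 goes to the best score on this task (ties are not handled).
        ordered = sorted(methods, key=lambda m: per_method[m], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / len(scores) for m in methods}


# Invented scores for three hypothetical models on two hypothetical tasks.
toy = {
    "task_1": {"model_a": 0.80, "model_b": 0.75, "model_c": 0.60},
    "task_2": {"model_a": 0.70, "model_b": 0.72, "model_c": 0.50},
}
print(average_ranks(toy))  # {'model_a': 1.5, 'model_b': 1.5, 'model_c': 3.0}
```
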
Ablation Study¶
| Configuration | Ranking Trend | Description |
|---|---|---|
| Full PreGlycanAA | Optimal | Hierarchical passing + Pre-training |
| w/o Pre-training | Decrease | GlycanAA still ranks 2nd |
| w/o Atom-level Passing | Significant decrease | Degenerates to monosaccharide-level |
| w/o Monosaccharide-level Passing | Decrease | Loses global topology |
| Single-scale Masking | Decrease | Multi-scale outperforms single-scale |
Key Findings¶
- All-atom modeling outperforms monosaccharide-level modeling: GlycanAA ranks above SweetNet even without pre-training, which supports the value of atomic information.
- Pre-training delivers consistent improvements; multi-scale masking is more effective than single-scale masking.
- Small-molecule encoders perform poorly on glycans (e.g., SchNet ranks 8th).
Highlights & Insights¶
- Domain-Specific Architecture Design: Leverages the natural hierarchical structure of glycans to design heterogeneous graphs and multi-level message passing.
- Filling the Gap: Proposes the first effective all-atom level glycan encoder.
- Value of Self-Supervised Pre-training: Multi-scale masking allows the model to understand dependencies across different levels.
Limitations & Future Work¶
- Note on this summary: the source text was truncated in the latter part of the method section, and the complete experimental tables could not be obtained, so the tables above report only ranks and trends.
- Modeling glycan-protein interactions could serve as a next step for extension.
- 3D spatial coordinates are not utilized, which could be integrated with geometric GNNs.
- The pre-training dataset (about 40K glycans) is still relatively small compared with protein pre-training corpora.
Related Work & Insights¶
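- vs. SweetNet: SweetNet is a monosaccharide-level GNN that ignores atomic details; GlycanAA adds atom nodes on top of the monosaccharide backbone.
- vs. Protein Pre-training (ESM): GlycanAA applies a similar self-supervised pre-training approach to the glycan domain.
- vs. SchNet: SchNet is a small-molecule encoder that performs poorly when applied directly to glycans; GlycanAA is instead designed around the glycan hierarchy.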
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First all-atom glycan modeling
- Experimental Thoroughness: ⭐⭐⭐⭐ Full coverage of GlycanML 11 tasks
- Writing Quality: ⭐⭐⭐⭐ Clear structure
- Value: ⭐⭐⭐⭐⭐ Opens a new direction for glycan computational biology