
Towards Multiscale Graph-based Protein Learning with Geometric Secondary Structural Motifs

Conference: NeurIPS 2025 arXiv: 2602.00862 Code: Unavailable Area: Medical Imaging Keywords: protein representation learning, graph neural networks, multiscale, secondary structure, hierarchical graph

TL;DR

This paper proposes SSHG (Secondary Structure-based Hierarchical Graph), a framework that constructs two-level hierarchical graph representations from protein secondary structure motifs — an intra-motif residue-level graph and an inter-motif global graph — and employs a two-stage GNN to learn local and global features respectively. Theoretical guarantees of maximal expressiveness are provided, with empirical improvements in both accuracy and computational efficiency on enzyme classification and ligand affinity prediction tasks.

Background & Motivation

Graph neural networks have emerged as powerful tools for protein structure learning, enabling the capture of spatial relationships at the residue level. However, existing GNN approaches face two core challenges:

Insufficient multiscale modeling. Proteins are naturally organized hierarchically: primary structure (amino acid sequence) → secondary structure (α-helices, β-sheets, and other motifs) → tertiary/quaternary structure. Existing residue-level GNNs cannot effectively capture features at the secondary structure level. A canonical counterexample is the prion protein: the normal isoform PrP\(^\text{C}\) and the pathological isoform PrP\(^\text{Sc}\) share identical primary sequences but differ drastically in secondary structure — the normal form is α-helix-rich, while the pathological form converts to a β-sheet-enriched structure, causing fatal neurodegenerative disease. Pure residue-level GNNs cannot distinguish these two states.

Inefficient modeling of long-range dependencies. To capture global context, existing methods typically employ large cutoff radii (e.g., 16 Å), generating extremely dense graphs (up to ~15K edges) with substantial computational and memory overhead. Multiscale methods such as HoloProt use surface-based modeling but incur similarly high costs.

The core idea of this paper is to leverage domain knowledge — protein secondary structure — as natural hierarchical nodes: each secondary structure motif (e.g., an α-helix segment) serves as a higher-level node. This design simultaneously exploits biological prior knowledge and geometric information, enabling efficient multiscale protein modeling with very few edges (total edges \(< 3N\), where \(N\) is the number of residues), backed by theoretical guarantees of maximal expressiveness.

Method

Overall Architecture

SSHG is a modular two-stage GNN framework:

  1. The DSSP algorithm partitions the protein sequence into secondary structure motifs (α-helices, β-strands, loops, etc.).
  2. A two-level hierarchical graph is constructed: intra-structure graphs (residue-level graphs within each motif) and an inter-structure graph (global graph between motifs).
  3. A first-stage GNN learns local features within each motif; a second-stage GNN learns global features across motifs.

Any GNN backbone (e.g., GVP-GNN, ProNet, Mamba) can be flexibly selected.
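
Step 1 (grouping residues with consecutive identical DSSP tokens into motif subsequences) is simple to sketch. The function and token string below are illustrative, not the authors' code; a real pipeline would obtain the tokens from a DSSP run on the structure.

```python
from itertools import groupby

def segment_motifs(dssp_tokens):
    """Group consecutive residues sharing a DSSP token into motifs.

    dssp_tokens: per-residue secondary-structure labels, e.g. 'HHHEET...'.
    Returns a list of (token, start, end) tuples (end exclusive).
    """
    motifs, start = [], 0
    for token, run in groupby(dssp_tokens):
        length = len(list(run))
        motifs.append((token, start, start + length))
        start += length
    return motifs

# Toy example: three helix residues, two strand residues, one turn.
print(segment_motifs("HHHEET"))  # [('H', 0, 3), ('E', 3, 5), ('T', 5, 6)]
```

Each returned index range \(S_i\) then defines one intra-structure graph and one node of the inter-structure graph.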

Key Designs

  1. DSSP-based secondary structure segmentation and hierarchical graph construction

    • Function: Segments the protein sequence into motif subsequences based on secondary structure, and constructs two-level graphs.
    • Mechanism: DSSP assigns each residue a secondary structure type token (H = α-helix, E = β-strand, T = turn, etc., 9 types in total); residues with consecutive identical tokens are grouped into the same subsequence \(S_i\). An intra-structure graph \(\mathcal{G}_i\) is built for each \(S_i\) (residues as nodes; edges determined via the SCHull method from α-carbon coordinates). An inter-structure graph \(\mathcal{G}\) is constructed from the geometric centers of all \(S_i\) (motifs as nodes), with edge features encoding the local frame product \(g_i^\top g_j\) representing relative orientations.
    • Design Motivation: Secondary structures are well-established functional units in proteins; using them as hierarchical nodes provides both biological grounding and a significant reduction in edge count. The SCHull graph guarantees geometric completeness and sparsity.
  2. Two-stage GNN message passing

    • Function: Learns local intra-motif interactions in the first stage, then global inter-motif relationships in the second stage.
    • Stage 1: \(T_1\) rounds of message passing are performed independently on each intra-graph \(\mathcal{G}_i\), followed by readout to obtain motif embeddings \(\mathbf{s}_i = \text{readout}_1(\{\!\!\{\mathbf{f}_k^{(T_1)} | k \in \mathcal{V}(\mathcal{G}_i)\}\!\!\})\).
    • Stage 2: \(T_2\) rounds of message passing are performed on the inter-structure graph \(\mathcal{G}\) using \(\mathbf{s}_i\) as initial node features, producing global features \(\mathbf{s}_{\text{global}} = \text{readout}_2(\{\!\!\{\mathbf{s}_i^{(T_2)} | i \in \mathcal{V}(\mathcal{G})\}\!\!\})\).
    • Design Motivation: The two-stage design allows the first-stage GNN to process small graphs (each motif typically contains only tens of residues) with few layers, while the second-stage GNN can capture long-range dependencies in a single layer due to the substantially reduced node count in the motif graph.
  3. Theoretical guarantee: Maximal Expressiveness Theorem (Theorem 4.2)

    • Under injectivity assumptions on UPD, AGG, and readout, the two-stage SSHG GNN can distinguish any two protein structures that are inequivalent under rigid-body motion.
    • Sparsity guarantee (Proposition 3.2): the total edge count \(|\mathcal{E}| + \sum_i |\mathcal{E}_i| < 3N\), where \(N\) is the number of residues.
    • Design Motivation: This theorem establishes that the hierarchical design does not discard critical structural information — forming the theoretical cornerstone of the entire framework.
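
The two-stage control flow above can be sketched in plain NumPy. The chain adjacency inside motifs and the mean-neighbor updates below are stand-ins for a real backbone (GVP-GNN, ProNet, etc.) and for the SCHull/inter-motif edges; they are assumptions for illustration, showing only the stage-1 readout feeding stage 2.

```python
import numpy as np

def two_stage_forward(residue_feats, motifs, inter_edges, T1=2, T2=1):
    """Minimal two-stage aggregation sketch (mean-neighbor message passing).

    residue_feats: (N, d) per-residue features.
    motifs: list of residue-index arrays, one per secondary-structure motif.
    inter_edges: list of (i, j) motif pairs in the inter-structure graph.
    """
    # Stage 1: T1 rounds of local message passing within each motif,
    # followed by a mean readout to obtain motif embeddings s_i.
    motif_embs = []
    for idx in motifs:
        h = residue_feats[idx]
        for _ in range(T1):
            # Average each residue with its chain neighbours (toy intra-graph).
            left = np.vstack([h[:1], h[:-1]])
            right = np.vstack([h[1:], h[-1:]])
            h = (h + left + right) / 3.0
        motif_embs.append(h.mean(axis=0))
    s = np.stack(motif_embs)

    # Stage 2: T2 rounds of message passing on the sparse inter-motif graph.
    for _ in range(T2):
        agg = s.copy()
        deg = np.ones(len(s))
        for i, j in inter_edges:
            agg[i] += s[j]; agg[j] += s[i]
            deg[i] += 1; deg[j] += 1
        s = agg / deg[:, None]
    return s.mean(axis=0)  # global readout s_global
```

Because the motif graph has far fewer nodes than the residue graph, even a single stage-2 round propagates information across the whole protein, which is the efficiency argument behind the design.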

Loss & Training

Loss functions are selected according to the downstream task: cross-entropy for enzyme classification and MSE for ligand affinity prediction. Data augmentation includes adding Gaussian noise to coordinates (std = 0.1), anisotropic scaling (0.9–1.1), and random masking of amino acid types (probability 0.1–0.2).
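
The three augmentations can be combined in a few lines. The parameter values are the ones quoted above; the `mask_token` id and function signature are hypothetical, chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(coords, aa_types, mask_token=20):
    """Apply the augmentations described in the text:
    Gaussian coordinate noise (std 0.1), anisotropic scaling in [0.9, 1.1],
    and random masking of amino-acid types with probability 0.1-0.2.
    `mask_token` is a hypothetical id for the masked-residue symbol.
    """
    coords = coords + rng.normal(0.0, 0.1, size=coords.shape)
    coords = coords * rng.uniform(0.9, 1.1, size=(1, 3))  # per-axis scale
    p = rng.uniform(0.1, 0.2)
    masked = np.where(rng.random(len(aa_types)) < p, mask_token, aa_types)
    return coords, masked
```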

Key Experimental Results

Main Results

Enzyme Reaction Classification (EC)

| Method | Test Accuracy (%) | Training Time (s/epoch) | Parameters |
|---|---|---|---|
| GCN | 66.5 | 186 | — |
| GCN+SSHG | 71.2 | 150 | — |
| GVP-GNN | 68.5 | 334 | 1.0M |
| GVP-GNN+SSHG | 73.6 | 236 | 1.0M |
| IEConv | 87.2 | — | 9.8M |
| ProNet-Backbone | 86.4 | 210 | 1.3M |
| ProNet+SSHG | 87.2 | 140 | 1.3M |
| Mamba+SSHG | 88.4 | 157 | 1.5M |

Ligand Binding Affinity Prediction (LBA)

| Method | RMSE↓ | Pearson↑ | Spearman↑ | Training Time (s/epoch) |
|---|---|---|---|---|
| HoloProt-Full | 1.464 | 0.509 | 0.500 | 45 |
| ProNet-Backbone | 1.458 | 0.546 | 0.550 | 32 |
| ProNet+SSHG | 1.435 | 0.579 | 0.591 | 24 |
| Mamba+SSHG | 1.399 | 0.614 | 0.610 | 29 |

Ablation Study

Graph Construction Strategy Efficiency Comparison (ProNet backbone)

| Configuration | Avg. Edges | Training Time (s/epoch) | Memory (MiB) | Accuracy (%) |
|---|---|---|---|---|
| cutoff=4 | 1,034 | 138 | 1,290 | 78.1 |
| cutoff=10 | 11,316 | 210 | 14,548 | 86.4 |
| cutoff=16 | 14,881 | 247 | 17,768 | 87.0 |
| +SSHG | 1,593 | 140 | 1,818 | 87.2 |

Two-Stage GNN Parameter Allocation

| Configuration | Stage 1 Params | Stage 2 Params | Training Time (s/epoch) | Accuracy (%) |
|---|---|---|---|---|
| balanced | 0.69M | 0.69M | 140 | 87.2 |
| local-heavy | 1.03M | 0.34M | 136 | 87.4 |
| global-heavy | 0.34M | 1.03M | 142 | 87.1 |

Key Findings

  • SSHG delivers simultaneous accuracy gains and training speedups across all evaluated backbone architectures — a rare "win-win" outcome.
  • With only 1,593 edges (vs. 14,881 for cutoff=16), SSHG achieves higher accuracy, confirming the effectiveness of hierarchical sparse graphs.
  • Memory usage drops from 17,768 MiB to 1,818 MiB (a 90% reduction), which is critical for practical application to large-scale proteins.
  • Allocating more parameters to the first stage (local motif modeling) slightly outperforms allocating them to the second stage, indicating that fine-grained local representations are more important.

Highlights & Insights

  • The framework design is highly elegant: protein secondary structure — an existing piece of biological knowledge — is used as a natural hierarchical partition, enabling hierarchical graph construction without any learned components. This paradigm of "domain-knowledge-driven graph construction" is worth generalizing to other structured data domains.
  • The paper achieves a seamless integration of theory and experiment: maximal expressiveness and sparsity are formally guaranteed, while experiments consistently demonstrate accuracy improvements and efficiency gains. The edge count upper bound of \(< 3N\) in Proposition 3.2 renders the complexity analysis concise and compelling.

Limitations & Future Work

  • Validation is currently limited to enzyme classification and ligand affinity prediction; tasks such as protein fold classification and protein–protein interaction remain to be evaluated.
  • The readout function employs mean pooling rather than the injective aggregation required by theory; stronger aggregation schemes may yield further improvements.
  • Integration with pretrained protein language models (e.g., ESM-2) has not been explored, representing a promising future direction.
  • The secondary structure definition relies on the accuracy of the DSSP algorithm; robustness to predicted structures (e.g., AlphaFold outputs) requires further validation.
Comparison with Related Methods

  • vs. HoloProt: HoloProt achieves multiscale modeling via large-radius graphs and surface-based representations, at 300 s/epoch and 78.9% accuracy vs. 157 s/epoch and 88.4% for Mamba+SSHG — a substantial margin in favor of SSHG.
  • vs. IEConv: IEConv acquires multiscale features through hierarchical pooling applied across multiple layers, with 9.8M parameters vs. 1.3M for SSHG; SSHG achieves comparable accuracy at roughly one-seventh the parameter count.
  • vs. ProNet: As one of SSHG's backbones, ProNet+SSHG reduces training time by 33% while maintaining equivalent parameter count, with equal or improved accuracy.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The idea of leveraging secondary structure to construct hierarchical graphs is both intuitively natural and theoretically well-founded, representing a significant contribution to the protein GNN literature.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Two benchmark tasks, multiple backbones, and comprehensive efficiency analysis are provided, though broader task coverage would strengthen the work.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Theoretical derivations are rigorous, figures are clear, and the logical chain from biological motivation to method design to theoretical guarantees is coherent and complete.
  • Value: ⭐⭐⭐⭐⭐ — The framework provides a general plug-and-play module from which any GNN backbone can benefit, making it highly practical.