Generative Graph Pattern Machine¶
- Conference: NeurIPS 2025
- arXiv: 2505.16130
- Authors: Zehong Wang, Zheyuan Zhang, Tianyi Ma, Chuxu Zhang, Yanfang Ye (University of Notre Dame, University of Connecticut)
- Code: https://github.com/Zehong-Wang/G2PM
- Area: Graph Learning / Graph Pre-training / Transformer
- Keywords: graph pre-training, generative Transformer, substructure tokenization, masked substructure modeling, random walk, message-passing-free
TL;DR¶
This paper proposes the Generative Graph Pattern Machine (G2PM), a fully message-passing-free generative Transformer framework for graph pre-training. It tokenizes graph instances (nodes/edges/graphs) into substructure sequences via random walks and performs self-supervised pre-training under a Masked Substructure Modeling objective. G2PM comprehensively outperforms existing graph pre-training methods on node/link/graph classification and cross-domain transfer tasks, while exhibiting model and data scaling laws analogous to those observed in NLP and CV.
Background & Motivation¶
Transformers have achieved remarkable success in NLP (LLMs) and CV (ViT) through generative pre-training, yet this paradigm has not produced comparable breakthroughs in the graph domain, severely limiting the development of graph foundation models. Existing graph learning predominantly relies on message-passing GNNs (MPNNs), which suffer from the following fundamental bottlenecks:
Limited expressive power: Upper-bounded by the 1-WL test
Over-smoothing: Node representations converge as the number of layers increases
Over-squashing: Information is compressed and lost at bottleneck edges
Insufficient long-range dependency modeling: The locality of message passing restricts global information flow
Poor scalability: Scaling up model size or data volume fails to yield sustained performance improvements
Furthermore, existing graph pre-training research has primarily adopted contrastive learning, which is known to be inferior to generative objectives (as demonstrated by BERT and MAE) in learning generalizable semantic representations.
Transferring generative Transformer pre-training to graphs presents three unique challenges:

- Lack of sequential structure: Graphs possess neither the linear structure of text nor the grid structure of images
- Low semantic granularity: Nodes and edges are low-level semantic units, unlike the high-level semantics of words or image patches
- Sequence length: Treating nodes as tokens causes sequence length to grow linearly with the number of nodes, making the quadratic complexity of Transformers intractable
Method¶
Core Idea: Substructure Sequentialization¶
The central insight of G2PM is that meaningful graph substructures serve as the semantic building blocks of graphs (e.g., triangles in social networks represent stable relationships; benzene rings in molecular graphs encode chemical stability). Any graph instance can thus be represented as a sequence of substructures, analogous to word sequences in text or patch sequences in images.
Random Walk Tokenizer¶
To avoid the high computational cost of subgraph isomorphism matching (NP-complete), G2PM employs random walks to sample substructure patterns on the fly:
- For each graph instance, \(k\) unbiased random walks of length \(L\) are sampled
- Transition probability: \(P(v_{i+1}|v_0,...,v_i) = \mathbb{1}[(v_i,v_{i+1}) \in \mathcal{E}] / D(v_i)\), where \(D(v_i)\) is the degree of node \(v_i\)
- Each walk corresponds to a substructure pattern (e.g., [1,2,3,1,4,3] corresponds to a diamond structure)
Substructure encoding: the node feature sequence along a walk \(w = [\mathbf{x}_1, ..., \mathbf{x}_m]\) is fed into a lightweight Transformer encoder \(f\) to produce a substructure embedding \(\mathbf{p} = f(w)\).
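As a concrete illustration, the tokenizer can be sketched with an adjacency list and a small PyTorch Transformer. The data structures, the layer/head counts of the lightweight encoder, and the mean-pooling readout below are assumptions for illustration, not the official implementation:

```python
# Minimal sketch of the random-walk tokenizer under assumed data structures
# (`neighbors`: adjacency list as dict[int, list[int]], `x`: [num_nodes, feat_dim]
# node-feature tensor). Names and the mean-pooling readout are illustrative.
import random
import torch
import torch.nn as nn

def sample_walk(neighbors, start, length):
    """Unbiased random walk: the next node is a uniformly sampled neighbor."""
    walk = [start]
    for _ in range(length - 1):
        nbrs = neighbors[walk[-1]]
        walk.append(random.choice(nbrs) if nbrs else walk[-1])  # stay put at dead ends
    return walk

class SubstructureEncoder(nn.Module):
    """Lightweight Transformer f mapping a walk's node-feature sequence to one embedding p."""
    def __init__(self, feat_dim, hidden_dim=768, num_layers=2, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)  # no positional encoding

    def forward(self, walk_feats):          # [num_walks, walk_len, feat_dim]
        h = self.encoder(self.proj(walk_feats))
        return h.mean(dim=1)                # one substructure embedding per walk

# Tokenizing a node instance: k walks from the target node become k substructure tokens.
# walks = [sample_walk(neighbors, node_id, length=8) for _ in range(k)]
# tokens = substructure_encoder(torch.stack([x[w] for w in walks]))
```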
Key design decisions:

- No positional encoding: Experiments show positional embeddings yield no consistent benefit and introduce cubic complexity; they are therefore omitted entirely
- Task-agnostic: For node tasks, walks originate from the target node; for edge tasks, from the endpoints; for graph tasks, from randomly sampled nodes
- Elimination of message-passing bottlenecks: Random walks naturally capture long-range dependencies, possess expressive power beyond 1-WL, and are immune to over-smoothing and over-squashing
G2PM Backbone¶
A standard Transformer architecture serves as the backbone for both encoder and decoder. The input is a sequence of substructure tokens \(\mathbf{P} = [\mathbf{p}_1, ..., \mathbf{p}_n]\), which are encoded layer by layer through multi-head self-attention and feed-forward networks. Default configuration:
- Hidden dimension 768, 12 attention heads
- Encoder: 3 layers; Decoder: 1 layer
- Approximately 21.3M (encoder) + 7.1M (decoder) parameters
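For reference, a minimal instantiation of these defaults with standard Transformer blocks might look as follows; the 4× feed-forward width is an assumption, though it roughly reproduces the stated parameter counts:

```python
# Illustrative instantiation of the stated defaults (hidden dim 768, 12 heads,
# 3 encoder / 1 decoder layers). The 4x feed-forward width is assumed; at these
# sizes each layer has ~7.1M parameters, roughly matching the ~21.3M / ~7.1M figures.
import torch.nn as nn

def make_stack(num_layers, hidden_dim=768, num_heads=12):
    layer = nn.TransformerEncoderLayer(hidden_dim, num_heads,
                                       dim_feedforward=4 * hidden_dim,
                                       batch_first=True)
    return nn.TransformerEncoder(layer, num_layers)

encoder = make_stack(num_layers=3)   # processes visible substructure tokens
decoder = make_stack(num_layers=1)   # lightweight; discarded after pre-training
```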
Pre-training Objective: Masked Substructure Modeling (MSM)¶
Inspired by BERT/MAE-style masked modeling, but operating on substructures rather than words or patches:
- Sample a substructure sequence \(\mathbf{P} = [\mathbf{p}_1, ..., \mathbf{p}_n]\) from the graph
- Randomly mask a subset of substructures (high masking ratio by default), retaining visible substructures \(\mathbf{P}_{vis}\)
- The visible substructures are passed through the encoder to produce contextual representations \(\mathbf{H}_{vis}\)
- Learnable [MASK] tokens are inserted at masked positions, and the full sequence is fed to the decoder for reconstruction
Reconstruction target: Rather than reconstructing low-level node features, the model reconstructs high-level semantic embeddings produced by an EMA (exponential moving average) encoder:

\[\mathcal{L}_{\text{MSM}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \big\| \mathbf{z}_i - \text{sg}(\hat{\mathbf{p}}_i) \big\|_2^2\]

where \(\mathcal{M}\) is the set of masked positions, \(\mathbf{z}_i\) is the decoder's reconstruction at position \(i\), \(\hat{\mathbf{p}}_i = f_{\text{EMA}}(w_i)\) is the output of the EMA encoder, and \(\text{sg}\) denotes the stop-gradient operator.
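A minimal sketch of this objective, assuming MAE-style sparsity (loss computed on masked positions only) and placeholder names (`encoder`, `decoder`, `mask_token`, tensor shapes); the actual training loop will differ in detail:

```python
# Minimal sketch of Masked Substructure Modeling; shapes, module interfaces,
# and helper names are assumptions, not the official code.
import torch
import torch.nn.functional as F

def msm_loss(tokens, encoder, decoder, ema_tokens, mask_token, mask_ratio=0.75):
    """tokens / ema_tokens: [batch, n, d] substructure embeddings from f / f_EMA.
    mask_token: learnable parameter of shape [1, 1, d]."""
    batch, n, d = tokens.shape
    num_mask = int(n * mask_ratio)
    perm = torch.rand(batch, n, device=tokens.device).argsort(dim=1)
    masked_idx, visible_idx = perm[:, :num_mask], perm[:, num_mask:]
    expand = lambda idx: idx.unsqueeze(-1).expand(-1, -1, d)

    # MAE-style sparsity: only the visible substructures pass through the encoder.
    h_vis = encoder(torch.gather(tokens, 1, expand(visible_idx)))

    # Re-assemble the full sequence with learnable [MASK] tokens at masked positions.
    full = torch.zeros_like(tokens)
    full.scatter_(1, expand(visible_idx), h_vis)
    full.scatter_(1, expand(masked_idx), mask_token.expand(batch, num_mask, d))
    recon = decoder(full)

    # L2 reconstruction of the EMA encoder's embeddings at masked positions
    # (detach = stop-gradient target).
    pred = torch.gather(recon, 1, expand(masked_idx))
    target = torch.gather(ema_tokens, 1, expand(masked_idx)).detach()
    return F.mse_loss(pred, target)
```

Per the design-space study below, the EMA substructure encoder would be updated with momentum α = 0.99 every 10 steps.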
Data Augmentation Strategy¶
A mixed augmentation strategy is adopted, applying one feature-level and one structure-level augmentation simultaneously to each instance:

- Feature-level: feature masking (zeroing out a subset of features), node masking (masking entire nodes)
- Structure-level: substructure corruption (dropping nodes along a walk), substructure injection (replacing nodes with random nodes)
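Illustrative versions of the four augmentations, assuming a walk is a list of node ids with features in a `[walk_len, feat_dim]` tensor; the masking ratios are placeholders:

```python
# Illustrative forms of the four augmentations; ratios and sampling details are assumed.
import random
import torch

def feature_mask(feats, p=0.3):
    """Feature-level: zero out a random subset of feature dimensions."""
    keep = (torch.rand(feats.shape[-1]) > p).float()
    return feats * keep

def node_mask(feats, p=0.3):
    """Feature-level: zero the full feature vectors of randomly chosen walk positions."""
    keep = (torch.rand(feats.shape[0]) > p).float().unsqueeze(-1)
    return feats * keep

def substructure_corruption(walk, p=0.3):
    """Structure-level: drop nodes along the walk."""
    kept = [v for v in walk if random.random() > p]
    return kept or walk[:1]

def substructure_injection(walk, num_nodes, p=0.3):
    """Structure-level: replace walk nodes with uniformly random nodes."""
    return [random.randrange(num_nodes) if random.random() < p else v for v in walk]
```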
Downstream Adaptation¶
After pre-training, the decoder is discarded and a linear prediction head is appended to the encoder. Final predictions are made by mean-pooling over substructure representations.
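A sketch of this adaptation step, with the class count and token interface as assumptions:

```python
# Downstream adaptation sketch: pre-trained encoder + linear head over mean-pooled
# substructure representations (class count and interface are assumptions).
import torch.nn as nn

class G2PMClassifier(nn.Module):
    def __init__(self, encoder, hidden_dim=768, num_classes=40):
        super().__init__()
        self.encoder = encoder            # pre-trained G2PM encoder; the decoder is discarded
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):            # tokens: [batch, num_substructures, hidden_dim]
        h = self.encoder(tokens)
        return self.head(h.mean(dim=1))   # mean-pool over substructures, then predict
```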
Design Space Exploration (Key Ablation Study)¶
The paper conducts a systematic analysis of the design space, yielding several important insights:
| Design Dimension | Best Choice | Insight |
|---|---|---|
| Architecture style | MAE-style (sparse encoder + lightweight decoder) | SimMIM-style performs worse; sparsity encourages the encoder to learn stronger contextual representations |
| Model dimension | 768 (64.5M parameters) | Performance improves consistently with dimension, exhibiting favorable scaling behavior |
| Encoder/decoder depth | Encoder 3 layers / Decoder 1 layer | Deeper decoders lead to overfitting |
| Substructure encoder | Transformer | Outperforms GRU, GIN, and mean pooling |
| Positional encoding | Not used | No consistent benefit; high computational cost |
| Masking ratio | High (~0.7–0.8) | High masking reduces redundant walks and provides a more challenging learning signal |
| Mask token | Learnable | Outperforms fixed/random/sampled tokens |
| Reconstruction target | EMA semantic embeddings | Outperforms low-level features (mean pooling/concatenation) |
| Loss function | L2 loss | Outperforms L1; sensitivity to outliers helps capture subtle semantics |
| EMA momentum | α=0.99, updated every 10 steps | α=0 (no EMA) causes training collapse |
| Augmentation strategy | Mixed augmentation | Single augmentation has limited effect; diversity is critical |
Key Experimental Results¶
Node Classification (Homophilic Graphs)¶
Evaluated on 7 datasets (from Pubmed to ogbn-products), G2PM achieves the best performance among all self-supervised methods with an average rank of 2.0 (second only to supervised GPM):

- ogbn-arxiv: 72.31% (60M parameters), with performance improving consistently as parameter count increases
- ogbn-products: 80.56% (significantly outperforming GraphMAE2's 79.33%)
Model/Data Scaling¶
- Model scaling: On ogbn-arxiv, performance improves continuously from 68.2% to 72.3% as parameters scale from 0.4M to 64.5M, whereas GraphMAE/BGRL plateau or decline beyond ~3M parameters
- Data scaling: Increasing pre-training data consistently improves performance, while competing methods peak at small data fractions and subsequently decline
Graph Classification¶
G2PM achieves an average rank of 1.0 across 7 datasets (5 molecular + 2 social):

- ogbg-HIV: 78.7% (vs. GraphMAE 77.8%)
- IMDB-B: 83.0% (vs. GraphMAE 75.5%, a substantial gain)
- Without pre-training, the average rank drops to 8.3, demonstrating the critical role of pre-training for graph-level tasks
Link Prediction¶
G2PM outperforms GCN and GraphMAE on all three datasets: Cora, Pubmed, and ogbl-Collab.
Cross-Domain Transfer¶
G2PM demonstrates strong cross-domain transfer capability:

- Arxiv → Products: 81.3% (positive transfer, +0.7%)
- Arxiv → HIV: 76.8%
- HIV → Arxiv: 72.6% (positive transfer, +0.3%)
- HIV → PCBA: 77.9% (positive transfer, +2.3%)
Joint pre-training on Arxiv + FB15K237 + ChemBL achieves performance close to or exceeding graph foundation models (OFA, GFT) across three target domains.
Theoretical Insight: Substructure Dependencies¶
The paper analyzes the structural basis for MSM's effectiveness from three perspectives:
- Hierarchical dependency: Smaller substructures (e.g., triangles) are building blocks of larger ones (e.g., cliques); inference can proceed bidirectionally
- Functional reinforcement: Certain substructures co-occur statistically (e.g., high-density triangles correlate with higher-order cliques)
- Functional exclusion: Certain substructures are mutually exclusive (e.g., dense communities are incompatible with star-hub structures)
Limitations & Future Work¶
- The current framework uses unordered substructure sequences, suited for masked prediction; future work may explore ordered sequences to support next-token prediction and further improve scalability
- Random walk serves as an online tokenizer without learnable parameters; future work may design adaptive, learnable substructure tokenizers
- On heterophilic graphs (e.g., Actor), expanding the receptive field does not consistently improve performance, suggesting that key signals in certain graphs are inherently local
Highlights & Insights¶
- Methodological significance: G2PM demonstrates that graph learning can entirely abandon the message-passing paradigm, achieving state-of-the-art results through pure Transformer + random walk tokenization + generative pre-training, thereby charting a new direction for graph foundation models
- Scalability as a core contribution: This is the first self-supervised method in the graph domain to exhibit model and data scaling laws comparable to those in NLP and CV; while GraphMAE/BGRL plateau or decline beyond 3M parameters, G2PM continues to improve at 60M
- Comprehensive design space exploration: The ablation study yields multiple actionable insights (e.g., MAE outperforms SimMIM, high masking ratios are preferable, positional encodings are unnecessary) that offer valuable guidance for future work
- Cross-domain transfer capability deserves attention: By learning transferable substructure patterns, G2PM achieves positive transfer across domain and task boundaries—a capability that message-passing methods struggle to attain
- Relationship to GPM: G2PM can be viewed as a generative pre-training extension of GPM (a prior work from the same group with ViT-style substructure modeling), upgrading contrastive/supervised learning to generative self-supervised learning