Learning Graph Foundation Models on Riemannian Graph-of-Graphs¶

Conference: ICML 2026
arXiv: 2605.09993
Code: https://github.com/USTC-DataDarknessLab/R-GFM
Area: Graph Foundation Models / Self-supervised Representation / Riemannian Geometry
Keywords: Graph Foundation Model, Graph-of-Graphs, Riemannian MoE, adaptive-hop, domain generalization

TL;DR¶

R-GFM treats "subgraphs with different hop counts" as nodes in a higher-level Graph-of-Graphs, and uses a dynamic MoE routing mechanism to assign each GoG to the Riemannian manifold (hyperbolic / Euclidean / spherical) with the best-matched curvature. This simultaneously addresses two inherent limitations of existing graph foundation models: fixed receptive fields and single Euclidean embedding. It achieves up to 49% relative improvement on downstream tasks.

Background & Motivation¶

Background: Graph foundation models (GFMs, e.g., OFA, Prodigy, MDGFM) achieve cross-task and cross-domain transfer via pretraining on massive graphs, representing the "foundation model era" in graph ML.

Limitations of Prior Work: (1) Existing GFMs use fixed-hop subgraph sampling, such as 1-hop or 2-hop neighbors as the receptive field. However, downstream tasks have vastly different hop requirements—homogeneous citation networks only need 1-2 hops, while e-commerce fraud detection requires ≥4 hops to uncover long-chain collusion. Fixed hops inevitably lead to underfitting or noise in some tasks. (2) Existing methods embed all subgraphs into a single Euclidean space, but subgraphs with different hops have drastically different structures (locally dense vs. globally sparse and hierarchical), and a single geometry distorts representations.

Key Challenge: The conflict between fixed-structure receptive fields and heterogeneous hop requirements of downstream tasks; the conflict between single geometry and multi-scale structural heterogeneity.

Goal: (1) Design a pretraining paradigm that can adaptively capture multi-hop structures; (2) Enable the model to dynamically switch Riemannian manifolds in the representation space.

Key Insight: Elevate "subgraphs of different hops" to nodes in a Graph-of-Graphs (GoG), allowing the model to explicitly reason over scales; then use MoE to route each GoG to the expert with the best-matched geometry curvature.

Core Idea: "Structural scale as a first-class citizen"—adaptive-hop GoG addresses scale mismatch, confidence-aware dynamic Riemannian MoE addresses geometry mismatch.

Method¶

Overall Architecture¶

R-GFM consists of four stages: (A) Compute the coefficient of variation (CV) of node degree distribution to determine the candidate set of Riemannian experts, and for each training node \(v\), sample a set of \(1, 2, \ldots, K\)-hop subgraphs \(\{G_v^{(i)}\}_{i=1}^K\); (B) Pretrain a subgraph encoder with contrastive learning to embed each subgraph as \(\mathbf{X}_{\text{sub}}\); (C) Construct a sparse GoG \(\mathcal{G}\) based on subgraph similarity, and encode the GoG using dynamic MoE-based Riemannian routing; (D) Aggregate outputs from all experts to obtain a fused embedding for downstream node classification / link prediction tasks.

Key Designs¶

Adaptive-hop GoG Construction (adaptive hop count + similarity sparsification):
- Function: Avoids fixed receptive fields while controlling GPU memory.
- Mechanism: For each training node \(v\), incrementally increase hop count \(K\) using "online greedy + memory test", reverting to the last feasible \(K\) upon OOM, ensuring \(K \leq \mathcal{B}_{\text{GPU}}\). Subgraph embeddings are pretrained with NT-Xent contrastive loss. GoG edges are constructed by sampling from a distribution based on subgraph cosine similarity: \(\text{Prob}(i,j) = e^{\mathbf{S}[i,j]} / \sum_{u,v} e^{\mathbf{S}[u,v]}\), then sample \(\mathcal{B}_{\text{edge}} = 0.6 \cdot K(K-1)/2\) edges without replacement and symmetrize. Theoretically (Thm 3.2), the embedding noise of multi-hop sampling \(\|\boldsymbol\sigma_V\|_2 \leq \|\boldsymbol\sigma_F\|_2\) is strictly less than fixed-hop, and similarity-sparse GoG yields lower error than "edgeless GoG" or "fully connected GoG" (Thm 3.3).
- Design Motivation: Fixed hops miss long-range information or drown out local details; dense GoG introduces noise, sparse but random GoG lacks structural priors—similarity sparsification balances all three.
Dynamic MoE-based Riemannian Routing (dual dynamics: candidate set + Top-m):
- Function: Adaptively determines the number of Riemannian experts and how many to activate per instance, based on dataset structural heterogeneity.
- Mechanism: Quantify each training set's structural heterogeneity using the coefficient of variation \(\text{CV}(\mathcal{D}_i) = \text{std}(\deg)/\text{mean}(\deg)\), then use sliding statistics \(\mathcal{S}_i = \text{normalize}(\mu_i + \sigma_i)\) to accumulate heterogeneity; candidate expert set size is \(\lceil \mathcal{S}_i \cdot \zeta \rceil\), with curvatures alternately assigned as \(0, -1, +1, -2, +2, \ldots\) (covering hyperbolic, Euclidean, spherical). Routing scores \(\boldsymbol\alpha_{\mathcal{G}} = \text{softmax}(g(\mathcal{G})/\tau)\) are computed with a GCN encoder; confidence \(\text{conf} = (1/\psi) \sum_i \max \alpha^{(i)}\) increases during training, dynamically shrinking the number of activated experts \(m \leftarrow \max(1, m - \text{conf})\).
- Design Motivation: Fixed expert count is either insufficient or overfits; fixed top-\(m\) is also inflexible. Leveraging the observation that "router confidence increases during training" naturally reduces activated experts, improving generalization (theoretically, excess risk bound \(\mathcal{R}(\psi_D) \leq \mathcal{R}(\psi_F)\) is provided).
Theoretical Support for Domain Generalization:
- Function: Proves that R-GFM's error upper bound on unseen target domains is strictly lower than SOTA MDGFM.
- Mechanism: Introduce encoder classes \(\Phi_R\) and \(\Phi_M\) into the domain generalization error bound. Since R-GFM expands the encoder class expressiveness via GoG multi-hop and Riemannian MoE, and controls capacity via similarity sparsification and dynamic top-\(m\), Thm 3.5 shows \(\epsilon_{\text{R-GFM}} < \epsilon_{\text{MDGFM}}\).
- Design Motivation: The core evaluation for GFM is "can it still work on unseen graphs"—formal guarantees are required, not just empirical gains.

Loss & Training¶

In pretraining, the subgraph encoder uses NT-Xent contrastive loss; in the GoG encoding stage, downstream task losses are used (CE for node classification, BCE for link prediction) plus standard MoE load balancing loss. Leave-one-dataset-out transfer: pretrain on other graphs, fine-tune on the target graph (1-shot node classification / 5-shot link prediction).

Key Experimental Results¶

Main Results¶

Method	Wisconsin	Cornell	Citeseer	Cora	Pubmed	Computers	Photos	Texas
GCN	17.46	19.53	26.89	31.98	44.29	39.43	50.39	18.48
GAT	16.86	16.51	25.27	26.81	45.11	38.05	56.51	18.36
GFM (MDGFM, etc. baselines)	(slightly lower than R-GFM)	—	—	—	—	—	—	—
R-GFM	Best	Best	Best	Best	Best	Best	Best	Best

Consistently SOTA on 18 real-world graphs (10 main settings + 4 large-scale training sets + 4 test sets); up to 49% relative improvement on some datasets.

Ablation Study¶

Configuration	Impact
Fixed 1-hop subgraph only	Performance drops, proving adaptive-hop is necessary
Fully connected / edgeless GoG	Both worse than similarity-sparse GoG, consistent with Thm 3.3
Fixed top-\(m\) routing	Worse than confidence-aware dynamic top-\(m\)
Single Euclidean expert	Significant drop on highly heterogeneous datasets
Edge budget \(\mathcal{B}_{\text{edge}} = 0.6 \cdot K(K-1)/2\)	Both sparser and denser settings degrade performance

Key Findings¶

The maximum 49% improvement occurs on datasets with the highest structural heterogeneity, confirming that "dynamic geometry selection mainly benefits heterogeneous graphs."
In graph perturbation robustness tests, R-GFM shows the smallest relative drop under 30% random edge perturbation, attributed to the redundancy of GoG multi-hop information.
Cross-scale generalization: After pretraining on ArXiv_2023 + ogbn-Arxiv + Reddit + PubMed, the model remains robust on test sets like Cora / Ele-Computers / Books-History / Instagram.

Highlights & Insights¶

"Graph of Graphs" is not a new concept, but integrating it with adaptive-hop + Riemannian MoE as the core of a GFM is novel; this addresses two long-standing GFM challenges: hop and geometry.
"Router confidence increases → automatic shrinkage of top-\(m\)" is an elegant use of training dynamics: MoE capacity self-shrinks in late training to improve generalization, a strategy transferable to LLM MoE training.
Using node degree CV to pre-determine expert count avoids trial-and-error; this "capacity prediction based on data statistics" can be generalized to other MoE scenarios.

Limitations & Future Work¶

GoG construction requires traversing multi-hop subgraphs, resulting in higher time and memory complexity than fixed-hop GFM; further acceleration is needed for ultra-large graphs (million-node scale).
Riemannian manifolds currently only consider three constant curvatures; more complex mixed-curvature or learnable curvature settings remain unexplored.
The subgraph similarity threshold and edge budget 0.6 are empirical values, lacking a task-adaptive mechanism.
Transfer performance on domain-prior-rich data such as molecular graphs and knowledge graphs is not fully validated.

vs MDGFM: MDGFM is also GFM + theoretical analysis, but uses a single receptive field and single geometry; R-GFM is dynamic in both dimensions.
vs Graph MoE (GMoE, etc.): Their top-\(m\) is fixed and experts lack geometric priors; R-GFM introduces curvature as an inductive bias.
vs Hyperbolic GNNs (HGNN / HGCN): Hyperbolic methods use only a single negative curvature space; R-GFM uses MoE to adaptively mix curvatures, covering more structures.

Rating¶

Novelty: ⭐⭐⭐⭐ The combination of adaptive-hop GoG + Riemannian MoE is fresh, though each component has precedents
Experimental Thoroughness: ⭐⭐⭐⭐ 18 datasets + cross-domain generalization + perturbation robustness + theoretical guarantees, comprehensive coverage
Writing Quality: ⭐⭐⭐⭐ Clear structure, tight alignment of theory and method, intuitive illustrations
Value: ⭐⭐⭐⭐ Provides actionable solutions to two major GFM pain points, open-source code lowers reproduction barriers