
Graver: Generative Graph Vocabularies for Robust Graph Foundation Models Fine-tuning

Conference: NeurIPS 2025 arXiv: 2511.05592 Code: GitHub Area: Model Compression Keywords: graph foundation model, few-shot fine-tuning, graphon, vocabulary generation, MoE

TL;DR

This paper proposes Graver, a framework that decouples ego-graphs to extract transferable subgraph vocabularies, models their distributions via graphon experts, and routes relevant vocabularies to augment support samples through MoE-CoE, addressing the instability caused by structural mismatch in few-shot fine-tuning of graph foundation models (GFMs).

Background & Motivation

Background: Graph foundation models (GFMs) aim to achieve universal graph learning across domains and tasks via a pre-train-then-fine-tune paradigm. Prior methods such as GCOPE, MDGPT, and SAMGPT have made progress in multi-domain pre-training and cross-domain transfer.

Limitations of Prior Work: Few-shot prompt fine-tuning of GFMs suffers from severe instability — performance and adaptation efficiency are highly sensitive to the random selection of support samples. When selected support nodes exhibit structural patterns consistent with the pre-training graphs (e.g., triangles), performance is strong; when patterns diverge (e.g., ladder structures), performance degrades significantly, resulting in high variance.

Key Challenge: Under few-shot settings with extremely limited labeled samples (e.g., one-shot), the model cannot adequately capture structural patterns of the target domain, while the transfer of pre-trained knowledge is further constrained by domain discrepancy.

Goal: How to achieve robust and efficient GFM fine-tuning under randomly sampled support sets?

Key Insight: Generative augmentation — leveraging "graph vocabularies" (transferable subgraph patterns) learned during pre-training to augment support samples, thereby reducing dependence on specific support selections.

Core Idea: Extract transferable subgraph vocabularies from pre-training graphs, model their distributions via graphon generators, and use MoE-CoE routing at fine-tuning time to embed relevant vocabularies into support samples for contextual augmentation.

Method

Overall Architecture

A three-stage framework:

1. Pre-training stage: multi-domain alignment → ego-graph decoupling for vocabulary extraction → self-supervised contrastive pre-training
2. Vocabulary modeling stage: graphon experts separately model structural tokens and feature tokens
3. Fine-tuning stage: MoE-CoE routing generates vocabularies → augments support samples → prompt fine-tuning

Key Designs

1. Factor-aware Ego-Graph Decoupling

The ego-graph of node \(u\) is decomposed into \(K\) factor-aware subgraphs (vocabularies), with each neighbor \(v\) soft-assigned to channel \(k\): \(\alpha_{v \to k}^{(t)} \propto \text{Softmax}_k(\langle \mathbf{h}_{u,k}^{\mathcal{S}(t)}, \mathbf{h}_{v,k}^{\mathcal{S}(t)} \rangle / \tau)\)

  • Neighbors are assigned to \(K\) channels via soft routing
  • Mutual information regularization ensures semantic independence across channels: \(\mathcal{R}_{\text{MI}}^u = \sum_{i \neq j} I(\mathbf{h}_{u,i}^{\mathcal{S}}; \mathbf{h}_{u,j}^{\mathcal{S}})\)
  • Proposition 1 (theoretical transferability guarantee): The upper bound on semantic discrepancy is determined by the optimal matching distance over vocabulary combinations
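The soft channel assignment above can be sketched in a few lines. This is a minimal illustration of the routing equation only, assuming per-channel embeddings \(\mathbf{h}_{u,k}\) are already computed; the function name and shapes are illustrative, not the paper's code.

```python
import numpy as np

def route_neighbors(h_center, h_neighbors, tau=0.5):
    """Soft-assign each neighbor to K factor channels.

    h_center:    (K, d)    channel-wise embeddings of the ego node u
    h_neighbors: (N, K, d) channel-wise embeddings of its N neighbors
    Returns alpha of shape (N, K); each row sums to 1 (softmax over channels).
    """
    # Inner product <h_{u,k}, h_{v,k}> per neighbor v and channel k
    scores = np.einsum("kd,nkd->nk", h_center, h_neighbors) / tau
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exps = np.exp(scores)
    return exps / exps.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
alpha = route_neighbors(rng.normal(size=(4, 8)), rng.normal(size=(10, 4, 8)))
print(alpha.shape)  # (10, 4): 10 neighbors softly split across 4 channels
```

The MI regularizer would then be applied across the K channel embeddings to keep their semantics disjoint; that term is omitted here.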

2. Graphon Generative Experts

  • Structural tokens: Per-class adjacency matrices \(\{\mathcal{A}_i^{(c)}\}\) are collected to estimate a nonparametric graphon \(W_c^{\mathcal{A}}: [0,1]^2 \to [0,1]\)
  • Feature tokens: The feature distribution conditioned on latent positions is estimated as \(W_c^{\mathcal{X}}: [0,1] \to \mathbb{R}^d\)
  • Conditional generation: Latent positions \(\mathbf{u}_i \sim \mathcal{U}[0,1]\) are sampled, then \(\tilde{\mathcal{A}}[i,j] \sim \text{Bern}(W_c^{\mathcal{A}}(\mathbf{u}_i, \mathbf{u}_j))\)
  • Proposition 2 (distributional convergence): \(\|\mathbf{g}_c^{\text{gen}} - \mathbf{g}_c^{\text{emp}}\|_{\text{TV}} \to 0\) as \(N_c \to \infty\)
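A toy version of the estimate-then-sample pipeline follows. The degree-sorted step-function estimator used here is a standard simple choice and an assumption on my part, not necessarily the paper's exact nonparametric estimator; the sampling step mirrors the conditional generation above (\(\mathbf{u}_i \sim \mathcal{U}[0,1]\), Bernoulli edges).

```python
import numpy as np

def estimate_graphon(adjs, resolution=4):
    """Average degree-sorted adjacency matrices into a step-function graphon."""
    sums = np.zeros((resolution, resolution))
    counts = np.zeros((resolution, resolution))
    for A in adjs:
        order = np.argsort(-A.sum(axis=1))           # align nodes by degree
        A = A[np.ix_(order, order)]
        n = A.shape[0]
        idx = np.minimum(np.arange(n) * resolution // n, resolution - 1)
        for i in range(n):
            for j in range(n):
                sums[idx[i], idx[j]] += A[i, j]
                counts[idx[i], idx[j]] += 1
    return sums / np.maximum(counts, 1)              # block-wise edge density

def sample_graph(W, n, rng):
    u = rng.uniform(size=n)                          # latent positions u_i ~ U[0,1]
    idx = np.minimum((u * W.shape[0]).astype(int), W.shape[0] - 1)
    P = W[np.ix_(idx, idx)]                          # pairwise edge probabilities
    upper = np.triu(rng.uniform(size=(n, n)) < P, 1) # Bernoulli draws, upper triangle
    return (upper | upper.T).astype(int)             # symmetric, no self-loops

rng = np.random.default_rng(1)
adjs = []
for _ in range(5):                                   # synthetic stand-ins for {A_i^(c)}
    M = rng.uniform(size=(12, 12)) < 0.3
    adjs.append((np.triu(M, 1) | np.triu(M, 1).T).astype(int))
W = estimate_graphon(adjs)
A_gen = sample_graph(W, 8, rng)
print(W.shape, A_gen.shape)  # (4, 4) (8, 8)
```

The feature graphon \(W_c^{\mathcal{X}}\) would be handled analogously, mapping the same latent positions to per-node feature vectors.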

3. MoE-CoE Routing Network

A two-level hierarchical routing scheme:

  • MoE layer (which domain): \(\mathbf{S}_{\text{M}} = \text{Softmax}(\mathbf{W}_{\text{M}}^\top \cdot \phi(\hat{\mathbf{X}}_i^{\mathcal{T}}))\), selects relevant source domains
  • CoE layer (which class): \(\mathbf{S}_{\text{C}} = \text{Softmax}(\mathbf{W}_{\text{C}}^\top \cdot \phi(\hat{\mathbf{X}}_i^{\mathcal{T}} \| \tilde{\mathcal{X}}))\), combines class-level tokens
  • Vocabulary-augmented support samples: \(\tilde{G}_i^{\mathcal{T}} = G_i^{\mathcal{T}} \oplus \tilde{\mathbf{g}}_i^{\text{gen}}\)
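The two routing levels reduce to two softmaxes over different conditioning inputs. A minimal sketch, where the weight shapes, the top-m domain selection, and the concatenation standing in for \(\phi(\cdot \| \cdot)\) are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def route(x, x_tilde, W_moe, W_coe, top_m=2):
    # MoE layer: score each source domain from the target sample's features
    s_m = softmax(W_moe.T @ x)
    domains = np.argsort(-s_m)[:top_m]          # sparse activation: keep top-m domains
    # CoE layer: score class-level tokens, conditioned on target and generated features
    s_c = softmax(W_coe.T @ np.concatenate([x, x_tilde]))
    return domains, s_m, s_c

rng = np.random.default_rng(2)
d = 6
x, x_tilde = rng.normal(size=d), rng.normal(size=d)
domains, s_m, s_c = route(x, x_tilde,
                          rng.normal(size=(d, 5)),      # 5 candidate source domains
                          rng.normal(size=(2 * d, 3)))  # 3 class-level tokens
print(len(domains), s_m.shape, s_c.shape)  # 2 (5,) (3,)
```

The selected class-level tokens would then be generated by the corresponding graphon experts and concatenated onto the support graph (\(\tilde{G}_i^{\mathcal{T}} = G_i^{\mathcal{T}} \oplus \tilde{\mathbf{g}}_i^{\text{gen}}\)).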

Loss & Training

  • Pre-training: Contrastive link prediction + MI regularization: \(\mathcal{L}_{\text{pre}} = \mathcal{L}_{\text{contrastive}} + \lambda \sum_u \mathcal{R}_{\text{MI}}^u\)
  • Fine-tuning: Class prototype matching + MoE-CoE sparse activation regularization: \(\mathcal{L}_{\text{ftn}} = \mathcal{L}_{\text{cls}} + \mu \cdot \mathcal{L}_{\text{MoE-CoE}}\)
  • Pre-trained parameters \(\Theta^*\) are frozen; only graph prompts \(\mathcal{P}_\Omega\) and MoE-CoE weights are updated
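The fine-tuning setup (frozen \(\Theta^*\), trainable prompts and routing weights, combined loss) has the following structure. This is a PyTorch-style sketch with placeholder modules and losses, not the released code's API:

```python
import torch

backbone = torch.nn.Linear(16, 16)            # stands in for the pre-trained GFM, Theta*
prompt = torch.nn.Parameter(torch.zeros(16))  # graph prompt P_Omega
router = torch.nn.Linear(16, 4)               # stands in for the MoE-CoE weights

for p in backbone.parameters():
    p.requires_grad_(False)                   # Theta* is frozen

opt = torch.optim.Adam([prompt, *router.parameters()], lr=1e-3)
x = torch.randn(8, 16)                        # a batch of support-sample features
z = backbone(x + prompt)                      # prompt-augmented forward pass
loss_cls = z.pow(2).mean()                    # placeholder for prototype-matching loss
loss_moe = router(z).softmax(-1).max(-1).values.mean()  # placeholder routing regularizer
loss = loss_cls + 0.1 * loss_moe              # L_ftn = L_cls + mu * L_MoE-CoE
loss.backward()
opt.step()
print(backbone.weight.grad is None, prompt.grad is not None)  # True True
```

Only the prompt and router receive gradients; the frozen backbone's weights have no `.grad`, which is what keeps fine-tuning lightweight.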

Key Experimental Results

Main Results: One-shot Node Classification (Cross-Dataset)

| Method | Cora | CiteSeer | PubMed | arXiv | Tech | Home | Wiki-CS |
|---|---|---|---|---|---|---|---|
| GCN | 28.40±4.62 | 29.25±3.39 | 40.33±6.90 | 61.59±5.13 | 53.89±3.35 | 36.74±2.53 | 28.58±5.39 |
| SAMGPT | 46.79±6.54 | 38.65±6.35 | 51.92±9.50 | 73.60±— | — | — | — |
| Graver | Best | Best | Best | Best | Best | Best | Best |

  • Average gain: +2.8% on node classification, +3.2% on graph classification (over runner-up)
  • Average relative reduction in standard deviation: 54.0% (node) and 54.6% (graph)

Ablation Study

| Removed Component | Effect |
|---|---|
| w/o ego-graph decoupling | Performance drops; insufficient vocabulary discriminability |
| w/o graphon generation | Cannot augment support; degenerates to standard prompt tuning |
| w/o MoE-CoE | No selective routing; may introduce negative transfer |
| w/o MI regularization | Channel semantic overlap; reduced transferability |

Key Findings

  • Graver consistently outperforms 15 state-of-the-art baselines in both one-shot and five-shot settings
  • Graver's advantage is more pronounced in the harder cross-domain setting, owing to the domain adaptation capability of vocabulary augmentation
  • The most significant improvement is in robustness — standard deviation decreases by over 54%, indicating insensitivity to support sample selection
  • LLM-enhanced semantic alignment further enables zero-shot transfer

Highlights & Insights

  • First work to apply generative vocabularies for GFM fine-tuning augmentation: Unlike approaches that augment training data or graph structure, Graver augments the support set itself
  • Elegant application of graphon theory: Continuous graphon functions serve as generative models for graph vocabularies, combining theoretical elegance (distributional convergence guarantees) with practical utility
  • Sophisticated MoE-CoE design: Two-level routing handles "which domain" and "which class" separately, effectively mitigating negative transfer
  • Vivid "root + affix" analogy: Structural tokens = roots (determine semantic type); feature tokens = affixes (determine domain characteristics)

Limitations & Future Work

  • The vocabulary count \(K\) and graphon resolution are critical hyperparameters with limited sensitivity analysis
  • Graphon estimation may be unstable on large-scale sparse graphs
  • Multi-domain pre-training incurs high computational cost (simultaneously processing 7 datasets with LLM augmentation)
  • Ego-graph decoupling requires multiple iterations, adding pre-training overhead
  • Evaluation is limited to citation, shopping, and web page domains; scientific domains such as molecular and protein graphs are not assessed

Related Work & Reflections

  • GFT's computation-tree vocabularies are restricted to tree structures, whereas Graver's ego-graph vocabularies are more general
  • \(\mathcal{G}\)-Mixup also employs graphons, but for data augmentation; Graver uses them for vocabulary modeling
  • The approach conceptually aligns with in-context learning in NLP: guiding adaptation by embedding contextual vocabularies
  • Insight: prompt tuning for graphs may require more structural prior injection than its text or image counterparts

Rating

⭐⭐⭐⭐ (4/5)

The method is elegantly designed with solid theoretical support (two Propositions), comprehensive experiments, and significant robustness improvements. However, the overall system complexity is high, which may limit practical applicability; validation across a broader range of domains is also lacking.