Graver: Generative Graph Vocabularies for Robust Graph Foundation Models Fine-tuning¶
Conference: NeurIPS 2025 arXiv: 2511.05592 Code: GitHub Area: Model Compression Keywords: graph foundation model, few-shot fine-tuning, graphon, vocabulary generation, MoE
TL;DR¶
This paper proposes Graver, a framework that decouples ego-graphs to extract transferable subgraph vocabularies, models their distributions via graphon experts, and routes relevant vocabularies through an MoE-CoE gate to augment support samples, addressing the instability caused by structural mismatch in few-shot fine-tuning of graph foundation models (GFMs).
Background & Motivation¶
Background: Graph foundation models (GFMs) aim to achieve universal graph learning across domains and tasks via a pre-train-then-fine-tune paradigm. Prior methods such as GCOPE, MDGPT, and SAMGPT have made progress in multi-domain pre-training and cross-domain transfer.
Limitations of Prior Work: Few-shot prompt fine-tuning of GFMs suffers from severe instability — performance and adaptation efficiency are highly sensitive to the random selection of support samples. When selected support nodes exhibit structural patterns consistent with the pre-training graphs (e.g., triangles), performance is strong; when patterns diverge (e.g., ladder structures), performance degrades significantly, resulting in high variance.
Key Challenge: Under few-shot settings with extremely limited labeled samples (e.g., one-shot), the model cannot adequately capture structural patterns of the target domain, while the transfer of pre-trained knowledge is further constrained by domain discrepancy.
Goal: How to achieve robust and efficient GFM fine-tuning under randomly sampled support sets?
Key Insight: Generative augmentation — leveraging "graph vocabularies" (transferable subgraph patterns) learned during pre-training to augment support samples, thereby reducing dependence on specific support selections.
Core Idea: Extract transferable subgraph vocabularies from pre-training graphs, model their distributions via graphon generators, and use MoE-CoE routing at fine-tuning time to embed relevant vocabularies into support samples for contextual augmentation.
Method¶
Overall Architecture¶
A three-stage framework:

1. Pre-training stage: Multi-domain alignment → ego-graph decoupling for vocabulary extraction → self-supervised contrastive pre-training
2. Vocabulary modeling stage: Graphon experts separately model structural tokens and feature tokens
3. Fine-tuning stage: MoE-CoE routing generates vocabularies → augments support samples → prompt fine-tuning
Key Designs¶
1. Factor-aware Ego-Graph Decoupling
The ego-graph of node \(u\) is decomposed into \(K\) factor-aware subgraphs (vocabularies): \(\alpha_{v \to k}^{(t)} \propto \text{Softmax}_k(\langle \mathbf{h}_{u,k}^{\mathcal{S}(t)}, \mathbf{h}_{v,k}^{\mathcal{S}(t)} \rangle / \tau)\) (a minimal sketch follows this list)
- Neighbors are assigned to \(K\) channels via soft routing
- Mutual information regularization ensures semantic independence across channels: \(\mathcal{R}_{\text{MI}}^u = \sum_{i \neq j} I(\mathbf{h}_{u,i}^{\mathcal{S}}; \mathbf{h}_{u,j}^{\mathcal{S}})\)
- Proposition 1 (theoretical transferability guarantee): The upper bound on semantic discrepancy is determined by the optimal matching distance over vocabulary combinations
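
A minimal sketch of the soft channel assignment, assuming the per-channel embeddings \(\mathbf{h}_{u,k}\) come from \(K\) parallel projection heads; the MI regularizer is approximated here with a simple cross-channel cosine-similarity penalty rather than the paper's estimator, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def route_neighbors(h_center, h_neighbors, tau=0.5):
    """Soft assignment of each neighbor to one of K factor channels.

    h_center:    (K, d) channel-wise embeddings of the ego node u
    h_neighbors: (N, K, d) channel-wise embeddings of its N neighbors
    Returns alpha: (N, K) routing weights; each row sums to 1 over channels.
    """
    # <h_{u,k}, h_{v,k}> for every neighbor v and channel k
    scores = torch.einsum('kd,nkd->nk', h_center, h_neighbors)
    return F.softmax(scores / tau, dim=-1)

def channel_overlap_penalty(h_center):
    """Cheap stand-in for the MI regularizer: penalize cross-channel similarity."""
    h = F.normalize(h_center, dim=-1)            # (K, d)
    sim = h @ h.t()                              # (K, K) cosine similarities
    return (sim - torch.eye(h.size(0))).abs().sum()

# toy usage: K = 4 channels, d = 16, ego node with 5 neighbors
h_u = torch.randn(4, 16)
h_nb = torch.randn(5, 4, 16)
alpha = route_neighbors(h_u, h_nb)               # (5, 4) channel assignments
reg = channel_overlap_penalty(h_u)
```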
2. Graphon Generative Experts
- Structural tokens: Per-class adjacency matrices \(\{\mathcal{A}_i^{(c)}\}\) are collected to estimate a nonparametric graphon \(W_c^{\mathcal{A}}: [0,1]^2 \to [0,1]\)
- Feature tokens: The feature distribution conditioned on latent positions is estimated as \(W_c^{\mathcal{X}}: [0,1] \to \mathbb{R}^d\)
- Conditional generation: Latent positions \(u_i \sim \mathcal{U}[0,1]\) are sampled, then \(\tilde{\mathcal{A}}[i,j] \sim \text{Bern}(W_c^{\mathcal{A}}(u_i, u_j))\) (see the estimation-and-sampling sketch after this list)
- Proposition 2 (distributional convergence): \(\|\mathbf{g}_c^{\text{gen}} - \mathbf{g}_c^{\text{emp}}\|_{\text{TV}} \to 0\) as \(N_c \to \infty\)
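
A sketch of one way to realize a graphon expert, assuming a simple degree-sorted block-average (step-function) estimator; the paper's nonparametric estimator and resolution choices may differ, but the sampling step follows \(u_i \sim \mathcal{U}[0,1]\), \(\tilde{\mathcal{A}}[i,j] \sim \text{Bern}(W_c^{\mathcal{A}}(u_i, u_j))\):

```python
import numpy as np

def estimate_graphon(adjs, resolution=16):
    """Estimate a step-function graphon from a set of adjacency matrices.

    Each graph is sorted by node degree, pooled onto a common grid of edge
    densities, and the aligned grids are averaged (a simple block-average
    estimator; the paper's estimator may differ).
    """
    aligned = []
    for A in adjs:
        order = np.argsort(-A.sum(axis=1))       # sort nodes by degree
        A = A[order][:, order]
        n = A.shape[0]
        idx = np.linspace(0, n, resolution + 1).astype(int)
        W = np.zeros((resolution, resolution))
        for i in range(resolution):
            for j in range(resolution):
                block = A[idx[i]:idx[i + 1], idx[j]:idx[j + 1]]
                W[i, j] = block.mean() if block.size else 0.0
        aligned.append(W)
    return np.mean(aligned, axis=0)              # discrete grid approximating W_c^A

def sample_graph(W, n_nodes, rng=np.random.default_rng(0)):
    """Sample a graph: u_i ~ U[0,1], then A[i,j] ~ Bern(W(u_i, u_j))."""
    u = rng.uniform(size=n_nodes)
    cells = np.minimum((u * W.shape[0]).astype(int), W.shape[0] - 1)
    probs = W[np.ix_(cells, cells)]
    A = (rng.uniform(size=(n_nodes, n_nodes)) < probs).astype(int)
    A = np.triu(A, 1)
    return A + A.T                               # symmetric, no self-loops

# toy usage: estimate a class graphon from three random graphs, then sample one
rng = np.random.default_rng(0)
adjs = [(rng.uniform(size=(20, 20)) < 0.2).astype(int) for _ in range(3)]
adjs = [np.triu(A, 1) + np.triu(A, 1).T for A in adjs]
W_hat = estimate_graphon(adjs, resolution=8)
A_new = sample_graph(W_hat, n_nodes=15)
```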
3. MoE-CoE Routing Network
A two-level hierarchical routing scheme (a minimal routing sketch follows the list):

- MoE layer (which domain): \(\mathbf{S}_{\text{M}} = \text{Softmax}(\mathbf{W}_{\text{M}}^\top \cdot \phi(\hat{\mathbf{X}}_i^{\mathcal{T}}))\), selects relevant source domains
- CoE layer (which class): \(\mathbf{S}_{\text{C}} = \text{Softmax}(\mathbf{W}_{\text{C}}^\top \cdot \phi(\hat{\mathbf{X}}_i^{\mathcal{T}} \| \tilde{\mathcal{X}}))\), combines class-level tokens
- Vocabulary-augmented support samples: \(\tilde{G}_i^{\mathcal{T}} = G_i^{\mathcal{T}} \oplus \tilde{\mathbf{g}}_i^{\text{gen}}\)
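
A hedged sketch of the two-level gating, with hypothetical module and dimension names; top-k masking at the domain level stands in for the paper's sparse-activation mechanism:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLevelRouter(nn.Module):
    """Hierarchical gating: an MoE level scores source domains ("which domain"),
    then a CoE level scores classes within each domain ("which class")."""

    def __init__(self, dim, n_domains, n_classes, top_k=2):
        super().__init__()
        self.domain_gate = nn.Linear(dim, n_domains)              # S_M
        self.class_gate = nn.Linear(dim, n_domains * n_classes)   # S_C
        self.n_domains, self.n_classes, self.top_k = n_domains, n_classes, top_k

    def forward(self, x):
        # x: (B, dim) embedding of a target-domain support sample
        s_m = F.softmax(self.domain_gate(x), dim=-1)                          # (B, D)
        s_c = F.softmax(
            self.class_gate(x).view(-1, self.n_domains, self.n_classes), dim=-1
        )                                                                     # (B, D, C)
        # keep only the top-k domains, zeroing the rest to limit negative transfer
        topv, topi = s_m.topk(self.top_k, dim=-1)
        mask = torch.zeros_like(s_m).scatter_(-1, topi, topv)
        weights = mask.unsqueeze(-1) * s_c                                    # (B, D, C)
        return weights / weights.sum(dim=(1, 2), keepdim=True).clamp_min(1e-8)

# toy usage: weights[b, d, c] weighs the class-c graphon expert of source domain d
router = TwoLevelRouter(dim=32, n_domains=3, n_classes=5)
w = router(torch.randn(4, 32))   # (4, 3, 5); entries sum to 1 per sample
```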
Loss & Training¶
- Pre-training: Contrastive link prediction + MI regularization: \(\mathcal{L}_{\text{pre}} = \mathcal{L}_{\text{contrastive}} + \lambda \sum_u \mathcal{R}_{\text{MI}}^u\)
- Fine-tuning: Class prototype matching + MoE-CoE sparse activation regularization: \(\mathcal{L}_{\text{ftn}} = \mathcal{L}_{\text{cls}} + \mu \cdot \mathcal{L}_{\text{MoE-CoE}}\)
- Pre-trained parameters \(\Theta^*\) are frozen; only graph prompts \(\mathcal{P}_\Omega\) and MoE-CoE weights are updated (a minimal sketch of this setup follows)
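
A minimal sketch of the parameter-freezing setup, with stand-in modules for the encoder, prompts, and router; the MoE-CoE sparse-activation term is approximated here by a gate-entropy penalty, which is an assumption rather than the paper's exact regularizer:

```python
import torch
import torch.nn as nn

# Stand-in modules: a frozen pre-trained encoder (Theta*), learnable prompt
# tokens (P_Omega), and a linear gate standing in for the MoE-CoE router.
encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))
prompts = nn.Parameter(torch.zeros(4, 32))       # 4 learnable prompt tokens
router = nn.Linear(32, 3 * 5)                    # 3 domains x 5 classes

for p in encoder.parameters():
    p.requires_grad_(False)                      # freeze pre-trained weights Theta*

optimizer = torch.optim.Adam([prompts, *router.parameters()], lr=1e-3)

# one dummy step of L_ftn = L_cls + mu * L_MoE-CoE (entropy penalty as a
# stand-in for the sparse-activation regularizer)
x, y, mu = torch.randn(8, 32), torch.randint(0, 5, (8,)), 0.1
z = encoder(x) + prompts.mean(dim=0)             # prompt-conditioned embedding
gates = torch.softmax(router(z).view(8, 3, 5), dim=-1)
cls_scores = gates.sum(dim=1)                    # collapse domains -> class scores
entropy = -(gates * gates.clamp_min(1e-8).log()).sum(dim=(1, 2)).mean()
loss = nn.functional.cross_entropy(cls_scores, y) + mu * entropy

optimizer.zero_grad()
loss.backward()                                  # only prompts and router receive grads
optimizer.step()
```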
Key Experimental Results¶
Main Results: One-shot Node Classification (Cross-Dataset)¶
| Method | Cora | CiteSeer | PubMed | arXiv | Tech | Home | Wiki-CS |
|---|---|---|---|---|---|---|---|
| GCN | 28.40±4.62 | 29.25±3.39 | 40.33±6.90 | 61.59±5.13 | 53.89±3.35 | 36.74±2.53 | 28.58±5.39 |
| SAMGPT | 46.79±6.54 | 38.65±6.35 | 51.92±9.50 | 73.60±— | — | — | — |
| Graver | Best | Best | Best | Best | Best | Best | Best |
- Average gain: +2.8% on node classification, +3.2% on graph classification (over runner-up)
- Average relative reduction in standard deviation: 54.0% (node) and 54.6% (graph)
Ablation Study¶
| Removed Component | Effect |
|---|---|
| w/o ego-graph decoupling | Performance drops; insufficient vocabulary discriminability |
| w/o graphon generation | Cannot augment support; degenerates to standard prompt tuning |
| w/o MoE-CoE | No selective routing; may introduce negative transfer |
| w/o MI regularization | Channel semantic overlap; reduced transferability |
Key Findings¶
- Graver consistently outperforms 15 state-of-the-art baselines in both one-shot and five-shot settings
- Graver's advantage is more pronounced in the harder cross-domain setting, owing to the domain adaptation capability of vocabulary augmentation
- The most significant improvement is in robustness — standard deviation decreases by over 54%, indicating insensitivity to support sample selection
- LLM-enhanced semantic alignment further enables zero-shot transfer
Highlights & Insights¶
- First work to apply generative vocabularies for GFM fine-tuning augmentation: Unlike approaches that augment training data or graph structure, Graver augments the support set itself
- Elegant application of graphon theory: Continuous graphon functions serve as generative models for graph vocabularies, combining theoretical elegance (distributional convergence guarantees) with practical utility
- Sophisticated MoE-CoE design: Two-level routing handles "which domain" and "which class" separately, effectively mitigating negative transfer
- Vivid "root + affix" analogy: Structural tokens = roots (determine semantic type); feature tokens = affixes (determine domain characteristics)
Limitations & Future Work¶
- The vocabulary count \(K\) and graphon resolution are critical hyperparameters with limited sensitivity analysis
- Graphon estimation may be unstable on large-scale sparse graphs
- Multi-domain pre-training incurs high computational cost (simultaneously processing 7 datasets with LLM augmentation)
- Ego-graph decoupling requires multiple iterations, adding pre-training overhead
- Evaluation is limited to citation, shopping, and web page domains; scientific domains such as molecular and protein graphs are not assessed
Related Work & Insights¶
- GFT's computation-tree vocabularies are restricted to tree structures, whereas Graver's ego-graph vocabularies are more general
- \(\mathcal{G}\)-Mixup also employs graphons but for data augmentation; Graver uses them for vocabulary modeling
- The approach conceptually aligns with in-context learning in NLP: guiding adaptation by embedding contextual vocabularies
- Insight: Prompt tuning for graphs may require more structural prior injection than text or image counterparts
Rating¶
⭐⭐⭐⭐ (4/5)
The method is elegantly designed with solid theoretical support (two Propositions), comprehensive experiments, and significant robustness improvements. However, the overall system complexity is high, which may limit practical applicability; validation across a broader range of domains is also lacking.