GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation
- Conference: NeurIPS 2025
- arXiv: 2502.01113
- Code: https://github.com/rmanluo/gfm-rag
- Area: Graph Learning / RAG / Knowledge Graphs
- Keywords: Graph Foundation Model, RAG, Knowledge Graph, Multi-hop Reasoning, GNN
TL;DR
This paper proposes GFM-RAG, the first graph foundation model-driven retrieval-augmented generation framework, which performs single-pass multi-hop reasoning over knowledge graphs via a query-dependent GNN. With only 8M parameters, GFM-RAG achieves zero-shot generalization to unseen datasets and substantially outperforms state-of-the-art methods on multi-hop QA retrieval benchmarks.
Background & Motivation
Background: RAG is the dominant paradigm for injecting external knowledge into LLMs. Traditional RAG encodes documents as independent vectors for retrieval, which performs poorly on multi-hop questions requiring cross-document reasoning. GraphRAG methods (e.g., HippoRAG, LightRAG) explicitly model inter-knowledge relationships via graph structures.
Limitations of Prior Work: (a) Traditional vector retrieval fails to capture complex inter-document relationships; (b) multi-step retrieval methods (e.g., IRCoT) improve performance through iterative LLM reasoning but incur prohibitive computational overhead (several seconds per query); (c) existing GraphRAG methods (e.g., HippoRAG using Personalized PageRank) rely heavily on graph structure, which is often noisy and incomplete; (d) existing GNN-based methods require training from scratch for each new dataset, lacking generalizability.
Key Challenge: How can multi-hop reasoning capability be achieved within a single-step retrieval while maintaining cross-dataset generalizability?
Goal: To design a transferable Graph Foundation Model (GFM) that performs multi-hop reasoning retrieval in a single forward pass and generalizes directly to unseen datasets after pretraining.
Key Insight: The multi-hop message passing of a query-dependent GNN is theoretically equivalent to multi-hop logical reasoning on graphs. By mapping queries, entities, and relations into a unified semantic space, the model becomes universally applicable across different graphs.
Core Idea: A query-dependent message-passing GNN operating in a unified semantic space is pretrained on large-scale KGs, yielding a graph foundation model retriever that transfers across datasets.
Method
Overall Architecture
GFM-RAG operates in three stages: (1) KG-index construction: entities and relations are extracted from documents to build a knowledge graph index; (2) GFM Retriever: a query-dependent GNN reasons over the KG and outputs relevance scores for each entity with respect to the query; (3) Document ranking and generation: documents are ranked by weighted entity scores and fed into an LLM to generate the final answer.
- Input: user query \(q\) and document corpus \(\mathcal{D}\)
- Output: top-\(K\) relevant documents \(\mathcal{D}^K\) and an LLM-generated answer \(a\)
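As a rough map of the data flow between the three stages, here is a minimal interface sketch in Python; the type and function names (`KGIndex`, `build_kg_index`, `score_entities`, `rank_documents`) are hypothetical placeholders, not the released API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class KGIndex:
    """Stage 1 output: the graph index built offline from the document corpus."""
    num_entities: int
    triples: list[tuple[int, int, int]]   # (head, relation, tail) entity/relation ids
    entity_to_docs: dict[int, set[int]]   # inverted index: entity id -> ids of containing docs

def build_kg_index(corpus: list[str]) -> KGIndex:
    """Stage 1: LLM-driven OpenIE triple extraction plus entity resolution (sketch only)."""
    ...

def score_entities(query: str, index: KGIndex) -> np.ndarray:
    """Stage 2: query-dependent GNN returns one relevance score per entity."""
    ...

def rank_documents(entity_scores: np.ndarray, index: KGIndex, top_k: int) -> list[int]:
    """Stage 3: aggregate entity scores into document scores and return the top-K doc ids."""
    ...
```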
Key Designs
- KG-index Construction:
- Function: Extracts (entity, relation, entity) triples from documents via LLM-driven OpenIE to construct the knowledge graph index.
- Mechanism: In addition to directly extracted triples \(\mathcal{T}\), equivalence edges \(\mathcal{T}^+\) are added via entity resolution (embedding similarity), e.g., linking "USA" ↔ "United States of America", to enhance graph connectivity (a minimal entity-resolution sketch follows after this list).
- Design Motivation: Analogous to the hippocampal memory indexing theory, the KG-index serves as an "artificial hippocampus" that stores inter-knowledge associations, addressing the loss of relational structure inherent in independent vector encoding.
- Query-dependent GNN (GFM Retriever):
- Function: Performs query-conditioned message passing over the KG to compute relevance scores of each entity with respect to the query.
- Mechanism:
- Initialization: The query is encoded as \(\bm{q} \in \mathbb{R}^d\) by a sentence embedding model; entities mentioned in the query are initialized with \(\bm{q}\), while all others are initialized to zero vectors.
- Message Passing: \(L\) layers of query-dependent message passing are applied; relation embeddings are initialized using the same sentence model and updated via layer-specific MLPs; the message function employs non-parametric DistMult, with sum aggregation and linear layer updates.
- Output: A final MLP with sigmoid maps entity representations to relevance scores \(P_q \in \mathbb{R}^{|\mathcal{E}| \times 1}\).
- Design Motivation: Query-dependent message passing is theoretically proven to be equivalent to multi-hop logical reasoning (NBFNet), where \(L\) layers of message passing correspond to \(L\)-hop reasoning. The unified semantic space (query/entity/relation initialized with the same embedding model) enables cross-graph generalization.
- Novelty: Unlike conventional graph-specific GNNs, this approach achieves cross-graph transfer through semantic initialization (a minimal message-passing sketch follows after this list).
- Two-Stage Training:
- Function: Self-supervised pretraining followed by supervised fine-tuning.
- Mechanism:
- Stage 1 — KG Completion Pretraining: Head or tail entities in triples are randomly masked, and the GNN is trained to predict the masked entity, enhancing graph reasoning capability.
- Stage 2 — Document Retrieval Fine-tuning: The model is trained on annotated retrieval datasets where queries are natural language questions and target entities are derived from annotated supporting documents.
- Loss Function: A weighted combination of BCE loss and ranking loss, \(\mathcal{L} = \alpha \mathcal{L}_{BCE} + (1-\alpha) \mathcal{L}_{RANK}\), where ranking loss mitigates gradient vanishing caused by sparse positive samples.
- Training Scale: 60 KGs, 14M+ triples, 700k documents.
- Document Ranking:
- Function: Converts entity-level scores into document-level scores.
- Mechanism: The top-\(T\) entities are selected and weighted by an IDF-style inverse-document-frequency weight; document scores are then computed as \(P_d = M^\top F_e\) via the entity-to-document inverted index \(M\).
- Design Motivation: High-frequency entities appearing in many documents have low discriminative power; inverse frequency weighting reduces their influence.
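The following is a minimal sketch of the entity-resolution step in KG-index construction, assuming a `sentence-transformers` encoder and a simple cosine-similarity threshold; the encoder name, threshold, and relation label are illustrative assumptions rather than the paper's exact choices.

```python
# Minimal sketch: add equivalence edges between entity mentions whose embeddings
# are highly similar (e.g., "USA" <-> "United States of America").
# Assumes the `sentence-transformers` package; encoder and threshold are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

def equivalence_edges(entities: list[str], threshold: float = 0.9) -> list[tuple[int, int, str]]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(entities, normalize_embeddings=True)  # unit-norm vectors
    sim = emb @ emb.T                                        # cosine similarity matrix
    edges = []
    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            if sim[i, j] >= threshold:
                # Symmetric "equivalent" relation appended to the extracted triples T
                edges.append((i, j, "equivalent"))
                edges.append((j, i, "equivalent"))
    return edges

# Example: equivalence_edges(["USA", "United States of America", "Paris"])
```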
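Below is a minimal PyTorch sketch of the query-dependent message passing described above (query-initialized entity states, DistMult messages, sum aggregation, linear updates, sigmoid readout). It is an illustrative reimplementation of the idea, not the authors' released code; the layer widths and MLP shapes are assumptions.

```python
import torch
import torch.nn as nn

class QueryDependentGNN(nn.Module):
    def __init__(self, dim: int = 512, num_layers: int = 6):
        super().__init__()
        self.num_layers = num_layers
        # Layer-specific MLPs that update relation embeddings at every hop.
        self.rel_mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_layers)
        ])
        # Linear update applied after sum aggregation (input: [aggregated, previous state]).
        self.updates = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(num_layers)])
        self.readout = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, query_emb, rel_emb, edges, edge_type, query_entities, num_entities):
        # query_emb:      [dim] sentence embedding of the question
        # rel_emb:        [num_relations, dim] sentence embeddings of relation texts
        # edges:          LongTensor [num_edges, 2] of (head, tail) entity ids
        # edge_type:      LongTensor [num_edges] of relation ids
        # query_entities: ids of entities mentioned in the question
        h = query_emb.new_zeros(num_entities, query_emb.size(0))
        h[query_entities] = query_emb                      # non-query entities start at zero
        r = rel_emb
        for layer in range(self.num_layers):
            r = self.rel_mlps[layer](r)                    # hop-specific relation update
            # DistMult-style message: element-wise product of head state and relation embedding.
            msg = h[edges[:, 0]] * r[edge_type]
            agg = torch.zeros_like(h)
            agg.index_add_(0, edges[:, 1], msg)            # sum aggregation at tail entities
            h = self.updates[layer](torch.cat([agg, h], dim=-1))
        return torch.sigmoid(self.readout(h)).squeeze(-1)  # one relevance score per entity
```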
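And a short sketch of the document-ranking step, aggregating GNN entity scores into document scores through an entity-to-document inverted index with an IDF-style weight; the helper name, default \(T\)/\(K\), and exact weighting are illustrative.

```python
import numpy as np

def rank_documents(entity_scores: np.ndarray,              # [num_entities] GNN relevance scores
                   entity_to_docs: dict[int, set[int]],    # inverted index: entity -> doc ids
                   num_docs: int,
                   top_t: int = 10,
                   top_k: int = 5) -> list[int]:
    # Keep only the top-T scoring entities.
    top_entities = np.argsort(-entity_scores)[:top_t]

    # IDF-style weight: entities appearing in many documents count less.
    weights = {e: 1.0 / max(len(entity_to_docs.get(int(e), set())), 1) for e in top_entities}

    # Sparse version of P_d = M^T F_e: accumulate weighted entity scores per document.
    doc_scores = np.zeros(num_docs)
    for e in top_entities:
        for d in entity_to_docs.get(int(e), set()):
            doc_scores[d] += weights[e] * entity_scores[e]

    return list(np.argsort(-doc_scores)[:top_k])
```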
Loss & Training
- Joint optimization with BCE + ranking loss, \(\alpha = 0.3\).
- Negative samples are randomly drawn from the KG.
- Training uses 8 × A100 GPUs, batch size 4, learning rate \(5 \times 10^{-4}\).
- The model has only 8M parameters, with 6 message-passing layers and hidden dimension 512.
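A minimal sketch of the combined objective with the stated \(\alpha = 0.3\); the BCE term follows the description above, while the pairwise softplus ranking term is an assumption standing in for the paper's exact ranking loss.

```python
import torch
import torch.nn.functional as F

def retrieval_loss(scores: torch.Tensor,   # [num_entities] sigmoid relevance scores in [0, 1]
                   labels: torch.Tensor,   # [num_entities] 1 for target entities, else 0
                   alpha: float = 0.3) -> torch.Tensor:
    # BCE term over all entities.
    bce = F.binary_cross_entropy(scores, labels.float())

    # Ranking term (illustrative pairwise form): every positive should outscore
    # every negative. Assumes at least one positive and one negative entity.
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    rank = F.softplus(neg.unsqueeze(0) - pos.unsqueeze(1)).mean()

    return alpha * bce + (1 - alpha) * rank
```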
Key Experimental Results
Main Results — Multi-hop Retrieval
| Dataset | Metric | GFM-RAG | IRCoT+HippoRAG (prior SOTA) | Relative Gain |
|---|---|---|---|---|
| HotpotQA | R@2 | 78.3 | 67.0 | +16.9% |
| MuSiQue | R@2 | 49.1 | 45.3 | +8.4% |
| 2Wiki | R@2 | 90.8 | 75.8 | +19.8% |
| HotpotQA | R@5 | 87.1 | 83.0 | +4.9% |
| MuSiQue | R@5 | 58.2 | 57.6 | +1.0% |
| 2Wiki | R@5 | 95.6 | 93.9 | +1.8% |
Multi-hop QA
| Dataset | Metric | GFM-RAG | IRCoT+GFM-RAG | Prev. SOTA |
|---|---|---|---|---|
| HotpotQA | EM | 51.6 | 56.0 | 48.7 (FLARE) |
| MuSiQue | EM | 30.2 | 36.6 | 21.9 (IRCoT+HippoRAG) |
| 2Wiki | EM | 69.8 | 72.5 | 48.9 (Adaptive-RAG) |
Efficiency Analysis
| Method | HotpotQA Time (s) | R@5 |
|---|---|---|
| ColBERTv2 | 0.035 | 79.3 |
| HippoRAG | 0.255 | 77.7 |
| IRCoT+HippoRAG | 3.162 | 83.0 |
| GFM-RAG | 0.107 | 87.1 |
Ablation Study
| Configuration | Key Findings |
|---|---|
| w/o pretraining | Significant performance drop; pretraining is critical for generalization |
| BCE loss only | Inferior to BCE+Ranking; positive sparsity issue confirmed |
| w/o entity resolution | Reduced KG connectivity hinders multi-hop reasoning |
| Different sentence models | Performance is insensitive; confirms framework generality |
Key Findings
- GFM-RAG surpasses all multi-step methods in a single retrieval pass, while running roughly 30× faster than IRCoT+HippoRAG.
- Zero-shot generalization across 7 domain-specific RAG datasets, outperforming HippoRAG by an average of 18.9%.
- Model performance follows a neural scaling law: \(z \propto 0.24 x^{0.05} + 0.11 y^{0.03}\), indicating that larger data and model scales can yield further gains.
Highlights & Insights
- Elegant unified semantic space design: Query, entity, and relation representations are all initialized using the same sentence embedding model, enabling the GNN to transfer naturally to any new graph — a key design for achieving a "graph foundation model."
- Theoretical guarantee of single-step multi-hop equivalence: \(L\) layers of query-dependent message passing are theoretically equivalent to \(L\)-hop logical reasoning, avoiding the LLM overhead of multi-step retrieval.
- Path interpretability: Multi-hop reasoning paths can be extracted via gradient backpropagation, enhancing trustworthiness.
- Inverse-frequency-weighted document ranking draws on TF-IDF intuition to efficiently convert entity scores into document scores.
Limitations & Future Work
- KG construction depends on LLMs: OpenIE extraction quality directly impacts KG quality; results vary across different LLMs and may degrade for low-resource languages.
- 8M parameters vs. scaling law: Although a scaling law is demonstrated, training is only conducted at the 8M scale; whether larger-scale training encounters bottlenecks remains unknown.
- Entity resolution is a bottleneck: Current embedding-similarity-based resolution may fail for synonymous entities with dissimilar embeddings.
- KG construction overhead: Rebuilding the KG-index for each new dataset requires LLM calls, which is non-trivial in cost.
- Potential improvements: Combining GFM-RAG's graph reasoning with dense retrieval for hybrid retrieval; exploring larger-scale pretraining.
Related Work & Insights
- vs. HippoRAG: HippoRAG uses Personalized PageRank for graph retrieval, relying entirely on graph structure; GFM-RAG learns to reason via GNN, making it more robust to noisy and incomplete graphs.
- vs. IRCoT: IRCoT requires multi-step iterative LLM reasoning with high overhead; GFM-RAG achieves equivalent multi-hop reasoning in a single GNN pass, at 30× greater efficiency.
- vs. ULTRA/GFT and other graph foundation models: These works focus on graph tasks (node classification, link prediction); GFM-RAG is the first graph foundation model targeting RAG.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First work to apply a graph foundation model to RAG, with an elegant unified semantic space design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on 3 multi-hop QA benchmarks + 7 domain-specific datasets + efficiency analysis + scaling law + ablation study.
- Writing Quality: ⭐⭐⭐⭐ Clear structure and detailed method description, though notation is occasionally dense.
- Value: ⭐⭐⭐⭐⭐ Provides a powerful and generalizable solution for GraphRAG; zero-shot generalization achieved with only 8M parameters.