GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation

Conference: NeurIPS 2025
arXiv: 2502.01113
Code: https://github.com/rmanluo/gfm-rag
Area: Graph Learning / RAG / Knowledge Graphs
Keywords: Graph Foundation Model, RAG, Knowledge Graph, Multi-hop Reasoning, GNN

TL;DR

This paper proposes GFM-RAG, the first graph foundation model-driven retrieval-augmented generation framework, which performs single-pass multi-hop reasoning over knowledge graphs via a query-dependent GNN. With only 8M parameters, GFM-RAG achieves zero-shot generalization to unseen datasets and substantially outperforms state-of-the-art methods on multi-hop QA retrieval benchmarks.

Background & Motivation

Background: RAG is the dominant paradigm for injecting external knowledge into LLMs. Traditional RAG encodes documents as independent vectors for retrieval, which performs poorly on multi-hop questions requiring cross-document reasoning. GraphRAG methods (e.g., HippoRAG, LightRAG) explicitly model inter-knowledge relationships via graph structures.

Limitations of Prior Work: (a) Traditional vector retrieval fails to capture complex inter-document relationships; (b) multi-step retrieval methods (e.g., IRCoT) improve performance through iterative LLM reasoning but incur prohibitive computational overhead (several seconds per query); (c) existing GraphRAG methods (e.g., HippoRAG using Personalized PageRank) rely heavily on graph structure, which is often noisy and incomplete; (d) existing GNN-based methods require training from scratch for each new dataset, lacking generalizability.

Key Challenge: How can multi-hop reasoning capability be achieved within a single-step retrieval while maintaining cross-dataset generalizability?

Goal: To design a transferable Graph Foundation Model (GFM) that performs multi-hop reasoning retrieval in a single forward pass and generalizes directly to unseen datasets after pretraining.

Key Insight: The multi-hop message passing of a query-dependent GNN is theoretically equivalent to multi-hop logical reasoning on graphs. By mapping queries, entities, and relations into a unified semantic space, the model becomes universally applicable across different graphs.

Core Idea: A unified semantic space combined with query-dependent message-passing GNN is pretrained on large-scale KGs to produce a cross-dataset transferable graph foundation model retriever.

Method

Overall Architecture

GFM-RAG operates in three stages: (1) KG-index construction: entities and relations are extracted from documents to build a knowledge graph index; (2) GFM Retriever: a query-dependent GNN reasons over the KG and outputs relevance scores for each entity with respect to the query; (3) Document ranking and generation: documents are ranked by weighted entity scores and fed into an LLM to generate the final answer.

Input: user query \(q\) + document corpus \(\mathcal{D}\)
Output: top-K relevant documents \(\mathcal{D}^K\) and LLM-generated answer \(a\)
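The three stages can be sketched as a single driver function. This is a minimal illustration, not the authors' API: each stage is passed in as a placeholder callable (`build_kg_index`, `gfm_retriever`, `rank_documents`, `llm_generate` are hypothetical names), since the concrete components are detailed under Key Designs.

```python
def gfm_rag_answer(query, corpus, build_kg_index, gfm_retriever,
                   rank_documents, llm_generate, top_k=5):
    """Toy driver for the three-stage GFM-RAG pipeline (illustrative only)."""
    # Stage 1: KG-index construction (LLM-based OpenIE, done once per corpus).
    kg, entity_to_doc = build_kg_index(corpus)
    # Stage 2: a single forward pass of the query-dependent GNN yields a
    # relevance score for every entity in the KG.
    entity_scores = gfm_retriever(query, kg)
    # Stage 3: entity scores -> document scores -> top-K docs -> LLM answer.
    doc_scores = rank_documents(entity_scores, entity_to_doc)
    top_docs = sorted(corpus, key=lambda d: -doc_scores[d])[:top_k]
    return llm_generate(query, top_docs)
```

Injecting the stages as callables keeps the sketch honest about what is specified here (the data flow) versus what is not (each component's internals).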

Key Designs

  1. KG-index Construction:

    • Function: Extracts (entity, relation, entity) triples from documents via LLM-driven OpenIE to construct the knowledge graph index.
    • Mechanism: In addition to directly extracted triples \(\mathcal{T}\), equivalence edges \(\mathcal{T}^+\) are added via entity resolution (embedding similarity), e.g., linking "USA" ↔ "United States of America", to enhance graph connectivity.
    • Design Motivation: Analogous to the hippocampal memory indexing theory, the KG-index serves as an "artificial hippocampus" that stores inter-knowledge associations, addressing the loss of relational structure inherent in independent vector encoding.
  2. Query-dependent GNN (GFM Retriever):

    • Function: Performs query-conditioned message passing over the KG to compute relevance scores of each entity with respect to the query.
    • Mechanism:
      • Initialization: The query is encoded as \(\bm{q} \in \mathbb{R}^d\) by a sentence embedding model; entities mentioned in the query are initialized with \(\bm{q}\), while all others are initialized to zero vectors.
      • Message Passing: \(L\) layers of query-dependent message passing are applied; relation embeddings are initialized using the same sentence model and updated via layer-specific MLPs; the message function employs non-parametric DistMult, with sum aggregation and linear layer updates.
      • Output: A final MLP with sigmoid maps entity representations to relevance scores \(P_q \in \mathbb{R}^{|\mathcal{E}| \times 1}\).
    • Design Motivation: Query-dependent message passing is theoretically proven to be equivalent to multi-hop logical reasoning (NBFNet), where \(L\) layers of message passing correspond to \(L\)-hop reasoning. The unified semantic space (query/entity/relation initialized with the same embedding model) enables cross-graph generalization.
    • Novelty: Unlike conventional graph-specific GNNs, this approach achieves cross-graph transfer through semantic initialization.
  3. Two-Stage Training:

    • Function: Self-supervised pretraining followed by supervised fine-tuning.
    • Mechanism:
      • Stage 1 — KG Completion Pretraining: Head or tail entities in triples are randomly masked, and the GNN is trained to predict the masked entity, enhancing graph reasoning capability.
      • Stage 2 — Document Retrieval Fine-tuning: The model is trained on annotated retrieval datasets where queries are natural language questions and target entities are derived from annotated supporting documents.
      • Loss Function: A weighted combination of BCE loss and ranking loss, \(\mathcal{L} = \alpha \mathcal{L}_{BCE} + (1-\alpha) \mathcal{L}_{RANK}\), where ranking loss mitigates gradient vanishing caused by sparse positive samples.
    • Training Scale: 60 KGs, 14M+ triples, 700k documents.
  4. Document Ranking:

    • Function: Converts entity-level scores into document-level scores.
    • Mechanism: Top-\(T\) entities are selected and weighted by inverse document frequency (analogous to IDF); document scores are computed as \(P_d = M^\top F_e\) via an entity-to-document inverted index \(M\).
    • Design Motivation: High-frequency entities appearing in many documents have low discriminative power; inverse frequency weighting reduces their influence.
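The query-dependent message passing of design 2 can be sketched on a toy graph. This is a minimal illustration under stated assumptions, not the authors' implementation: embeddings are plain Python lists, the non-parametric DistMult message is an element-wise product, aggregation is a sum, and scoring is a sigmoid over a dot product with the query (standing in for the paper's final MLP); the layer-specific MLP updates of relation embeddings are omitted.

```python
import math

def distmult_message(h_src, rel):
    # Non-parametric DistMult message: element-wise product.
    return [a * b for a, b in zip(h_src, rel)]

def message_passing_layer(states, edges, rel_emb):
    """One layer: each (head, relation, tail) edge sends a DistMult
    message from head to tail; messages are sum-aggregated."""
    new_states = {e: list(v) for e, v in states.items()}  # keep self state
    for head, rel, tail in edges:
        msg = distmult_message(states[head], rel_emb[rel])
        new_states[tail] = [x + m for x, m in zip(new_states[tail], msg)]
    return new_states

def retrieve(query_vec, query_entities, entities, edges, rel_emb, n_layers=2):
    d = len(query_vec)
    # Query-dependent initialization: entities mentioned in the query start
    # from the query embedding, all other entities start from zero.
    states = {e: (list(query_vec) if e in query_entities else [0.0] * d)
              for e in entities}
    for _ in range(n_layers):  # L layers correspond to L-hop reasoning
        states = message_passing_layer(states, edges, rel_emb)
    # Sigmoid over a query dot product -> per-entity relevance score.
    return {e: 1 / (1 + math.exp(-sum(q * h for q, h in zip(query_vec, s))))
            for e, s in states.items()}
```

With two layers, relevance propagates two hops from the query entities, so entities reachable within two hops score above unreachable ones (which stay at the sigmoid's zero point, 0.5).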
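The document-ranking step (\(P_d = M^\top F_e\) with inverse-frequency weighting) can be sketched as below. The function name, toy data, and the simple \(1/\mathrm{df}\) weight are illustrative assumptions, not the paper's exact formulation; here the inverted index \(M\) is represented as an entity-to-document-list mapping rather than an explicit matrix.

```python
def rank_documents(entity_scores, entity_docs, n_docs, top_t=2):
    """Convert entity-level relevance scores into document-level scores."""
    # Keep only the top-T scoring entities.
    top = sorted(entity_scores, key=entity_scores.get, reverse=True)[:top_t]
    doc_scores = [0.0] * n_docs
    for e in top:
        docs = entity_docs.get(e, [])
        if not docs:
            continue
        # Inverse-frequency weight: entities appearing in many documents
        # are less discriminative, so their contribution is divided down.
        weight = entity_scores[e] / len(docs)
        for d in docs:
            doc_scores[d] += weight  # accumulate P_d = M^T F_e, row by row
    return doc_scores
```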

Loss & Training

  • Joint optimization with BCE + ranking loss, \(\alpha = 0.3\).
  • Negative samples are randomly drawn from the KG.
  • Training uses 8 × A100 GPUs, batch size 4, learning rate \(5 \times 10^{-4}\).
  • The model has only 8M parameters, with 6 message-passing layers and hidden dimension 512.
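A minimal sketch of the combined objective \(\mathcal{L} = \alpha \mathcal{L}_{BCE} + (1-\alpha)\mathcal{L}_{RANK}\) with \(\alpha = 0.3\), assuming a standard pairwise margin ranking loss as the \(\mathcal{L}_{RANK}\) term; the paper's exact ranking loss is not reproduced here, and all names and the margin value are illustrative.

```python
import math

def bce(p, y, eps=1e-9):
    # Binary cross-entropy for a single predicted probability p and label y.
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def combined_loss(pos_scores, neg_scores, alpha=0.3, margin=1.0):
    """alpha-weighted BCE plus a pairwise margin ranking term.

    The ranking term pushes every positive score above every negative by
    at least `margin`, which keeps gradients flowing even when positives
    are extremely sparse relative to negatives.
    """
    bce_term = (sum(bce(p, 1.0) for p in pos_scores)
                + sum(bce(n, 0.0) for n in neg_scores))
    bce_term /= len(pos_scores) + len(neg_scores)
    rank_term = sum(max(0.0, margin - (p - n))
                    for p in pos_scores for n in neg_scores)
    rank_term /= len(pos_scores) * len(neg_scores)
    return alpha * bce_term + (1 - alpha) * rank_term
```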

Key Experimental Results

Main Results — Multi-hop Retrieval

| Dataset  | Metric | GFM-RAG | IRCoT+HippoRAG (SOTA) | Gain   |
|----------|--------|---------|-----------------------|--------|
| HotpotQA | R@2    | 78.3    | 67.0                  | +16.9% |
| MuSiQue  | R@2    | 49.1    | 45.3                  | +8.4%  |
| 2Wiki    | R@2    | 90.8    | 75.8                  | +19.8% |
| HotpotQA | R@5    | 87.1    | 83.0                  | +4.9%  |
| MuSiQue  | R@5    | 58.2    | 57.6                  | +1.0%  |
| 2Wiki    | R@5    | 95.6    | 93.9                  | +1.8%  |

Multi-hop QA

| Dataset  | Metric | GFM-RAG | IRCoT+GFM-RAG | Prev. SOTA            |
|----------|--------|---------|---------------|-----------------------|
| HotpotQA | EM     | 51.6    | 56.0          | 48.7 (FLARE)          |
| MuSiQue  | EM     | 30.2    | 36.6          | 21.9 (IRCoT+HippoRAG) |
| 2Wiki    | EM     | 69.8    | 72.5          | 48.9 (Adaptive-RAG)   |

Efficiency Analysis

| Method         | Time per query on HotpotQA (s) | R@5  |
|----------------|--------------------------------|------|
| ColBERTv2      | 0.035                          | 79.3 |
| HippoRAG       | 0.255                          | 77.7 |
| IRCoT+HippoRAG | 3.162                          | 83.0 |
| GFM-RAG        | 0.107                          | 87.1 |

Ablation Study

| Configuration            | Key Findings                                                  |
|--------------------------|---------------------------------------------------------------|
| w/o pretraining          | Significant performance drop; pretraining is critical for generalization |
| BCE loss only            | Inferior to BCE+Ranking; positive-sparsity issue confirmed    |
| w/o entity resolution    | Reduced KG connectivity hinders multi-hop reasoning           |
| Different sentence models | Performance is insensitive; confirms framework generality    |

Key Findings

  • GFM-RAG surpasses all multi-step methods in a single retrieval pass while running roughly 30× faster than IRCoT+HippoRAG.
  • Zero-shot generalization across 7 domain-specific RAG datasets, outperforming HippoRAG by an average of 18.9%.
  • Model performance follows a neural scaling law: \(z \propto 0.24 x^{0.05} + 0.11 y^{0.03}\), indicating that larger data and model scales can yield further gains.

Highlights & Insights

  • Elegant unified semantic space design: Query, entity, and relation representations are all initialized using the same sentence embedding model, enabling the GNN to transfer naturally to any new graph — a key design for achieving a "graph foundation model."
  • Theoretical guarantee of single-step multi-hop equivalence: \(L\) layers of query-dependent message passing are theoretically equivalent to \(L\)-hop logical reasoning, avoiding the LLM overhead of multi-step retrieval.
  • Path interpretability: Multi-hop reasoning paths can be extracted via gradient backpropagation, enhancing trustworthiness.
  • Inverse-frequency-weighted document ranking draws on TF-IDF intuition to efficiently convert entity scores into document scores.

Limitations & Future Work

  • KG construction depends on LLMs: OpenIE extraction quality directly impacts KG quality; results vary across different LLMs and may degrade for low-resource languages.
  • 8M parameters vs. scaling law: Although a scaling law is demonstrated, training is only conducted at the 8M scale; whether larger-scale training encounters bottlenecks remains unknown.
  • Entity resolution is a bottleneck: Current embedding-similarity-based resolution may fail for synonymous entities with dissimilar embeddings.
  • KG construction overhead: Rebuilding the KG-index for each new dataset requires LLM calls, which is non-trivial in cost.
  • Potential improvements: Combining GFM-RAG's graph reasoning with dense retrieval for hybrid retrieval; exploring larger-scale pretraining.

Comparison with Related Work

  • vs. HippoRAG: HippoRAG uses Personalized PageRank for graph retrieval, relying entirely on graph structure; GFM-RAG learns to reason via a GNN, making it more robust to noisy and incomplete graphs.
  • vs. IRCoT: IRCoT requires multi-step iterative LLM reasoning with high overhead; GFM-RAG achieves equivalent multi-hop reasoning in a single GNN pass, roughly 30× faster.
  • vs. ULTRA/GFT and other graph foundation models: These works focus on graph tasks (node classification, link prediction); GFM-RAG is the first graph foundation model targeting RAG.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First work to apply a graph foundation model to RAG, with an elegant unified semantic space design.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on 3 multi-hop QA benchmarks + 7 domain-specific datasets + efficiency analysis + scaling law + ablation study.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure and detailed method description, though notation is occasionally dense.
  • Value: ⭐⭐⭐⭐⭐ Provides a powerful and generalizable solution for GraphRAG; zero-shot generalization achieved with only 8M parameters.