Fast-and-Frugal Text-Graph Transformers are Effective Link Predictors¶

Conference: ACL 2025
arXiv: 2408.06778
Code: https://github.com/idiap/fnf-tg
Area: Graph Learning
Keywords: Knowledge Graphs, Inductive Link Prediction, Graph Transformers, Text-Graph Fusion, Ego-graph Encoding

TL;DR¶

This paper proposes the Fast-and-Frugal Text-Graph (FnF-TG) Transformer, which uniformly encodes textual descriptions and graph structure (ego-graph) via the Transformer's self-attention mechanism. It outperforms SOTA models using large BERT and MPNNs on inductive link prediction tasks using only a small BERT, while extending to a fully inductive setting (where relations can also be inductive) for the first time.

Background & Motivation¶

Background: Knowledge graphs (KGs) are core components in many tasks such as information extraction, question answering, and reasoning. Text-attributed KGs provide richer knowledge representations by associating phrase descriptions with each entity and relation. Inductive link prediction requires the model to make predictions for entities unseen during training.

Limitations of Prior Work: - Early KG methods (e.g., TransE, DistMult) only support transductive prediction and cannot handle unseen entities. - Inductive methods that leverage textual descriptions require powerful but expensive text encoders. - Methods such as StATIK use MPNNs to process graph structure, but the expressive power of MPNNs is limited. - Existing methods assume a fixed set of relations (transductive relations) and cannot handle relations unseen during training.

Key Challenge: How to effectively fuse textual and graph structural information while maintaining high efficiency (fast and frugal)? Traditional methods either rely on large text encoders (expensive) or suffer from ineffective graph encoding (restricted performance).

Goal: (1) Encode the ego-graph (1-hop neighborhood) more effectively to reduce reliance on large text encoders; (2) achieve a fully inductive setting where relations can also be computed from text descriptions.

Key Insight: Leverage the inherent graph-processing capability of the Transformer's self-attention mechanism, directly injecting relation embeddings into the attention calculation, and simultaneously processing text and graph structure within a unified Transformer architecture.

Core Idea: Encode the ego-graph with a Graph Transformer, using relation embeddings as scaling factors in self-attention, letting even a small BERT surpass the SOTA.

Method¶

Overall Architecture¶

FnF-TG consists of three components: 1. Knowledge Graph (KG): Provides triples $(h_{KG}, r_{KG}, t_{KG})$ with their textual descriptions and extracts the ego-graph. 2. Text Transformer Encoder (TT): Encodes text descriptions into vector representations. 3. Graph Transformer Encoder (GT): Fuses text embeddings and graph structural information.

Key Designs¶

1. Text Transformer Encoder (TT)¶

Function: Encodes the text description of each entity and relation in the KG into dense vectors.
Mechanism: Uses the [CLS] vector from BERT, followed by two linear projection layers: $$\mathbf{x}_{TT} = \sigma(\text{BERT}_{\text{size}}(x_{KG})_{[\text{CLS}]} \mathbf{W}_0) \mathbf{W}_1$$ where $\sigma$ is the SiLU activation function, and $\mathbf{W}_0, \mathbf{W}_1 \in \mathbb{R}^{d \times d}$.
Query Encoding: The tail-prediction query $(h, r, \cdot)$ concatenates the head entity text and relation text $[h||r]_{KG}$; the head-prediction query $(\cdot, r, t)$ uses inverse relation text $[t||r^{-1}]_{KG}$.
Design Motivation: Allows flexible choice among various BERT sizes (base/medium/small/mini/tiny) to achieve a speed-performance trade-off.

2. Graph Transformer Encoder (GT) — Core Innovation¶

Function: Fuses the structural information of the ego-graph and the text embeddings output by the TT.
Mechanism: Injects graph relation embeddings into the Transformer's self-attention computation. For each pair of nodes $i, j$, the attention score is: $$e_{ij} = \frac{\mathbf{x}_i \mathbf{W}_Q \cdot \text{diag}(\mathbf{1} + \text{LN}(\mathbf{r}_{ij})\mathbf{W}_R) \cdot (\mathbf{x}_j \mathbf{W}_K)^\top}{\sqrt{d}}$$ where $\mathbf{r}_{ij}$ is the relation embedding (obtained by TT-encoding the relation text), mapped to a diagonal matrix plus an identity matrix via $\text{diag}(\mathbf{1} + ...)$.
Segment Embedding: Adds learnable $s_{\text{[CENTRE]}}$ and $s_{\text{[NEIGHBOUR]}}$ to distinguish the center node from its neighbors.
Leakage Prevention: Ensures that the target triple $(h,r,t)$ information does not leak into the ego-graph encoding.
Design Motivation: Leverages the intrinsic graph-processing capabilities of the Transformer (from the G2GT series), being more effective than MPNNs. Relation embeddings are computed from text (non-fixed set), enabling fully inductive settings.

3. Fully Inductive Relation Representation¶

Function: Enables relations to be computed from text descriptions as well, handling relations unseen during training.
Mechanism: The relation embedding $\mathbf{r}_{ij}$ is obtained by encoding the relation text with the TT module, serving both as the relation representation for link prediction and as input to the GT self-attention.
Design Motivation: Breaks the limitation of fixed relation sets in existing methods.

Loss & Training¶

Objective Function: TransE structural objective $f_{\text{TransE}}(h,r,t) = -\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_1$.
Loss Function: Margin-based ranking loss: $$Loss = \sum_{(t,t') \in (T \times T')} \max(0, 1 - f(t) + f(t'))$$
Negative Sampling: Two-sided reflexive sampling, where both head and tail entities serve as potential negative samples.
Controlled Experimental Setup: Unified computation budget (NVIDIA RTX3090 24GB), embedding size 768, batch size 128.

Key Experimental Results¶

Main Results¶

Results on WN18RR$_{\text{IND}}$ and FB15k-237$_{\text{IND}}$ test sets:

Model	WN18RR MRR	WN18RR H@10	FB15k MRR	FB15k H@10
BLP (BERT-base)	0.285	0.580	0.195	0.363
StATIK (BERT-base)	0.516	0.690	0.224	0.381
FnF-T (BERT-base, text-only)	0.373	0.647	0.266	0.453
FnF-TG (BERT-base)	0.732	0.875	0.316	0.524
FnF-TG (BERT-medium)	0.737	0.873	0.314	0.515
FnF-TG (BERT-small)	0.727	0.867	0.316	0.518
FnF-TG (BERT-tiny)	0.638	0.808	0.288	0.475

Results on Wikidata5M$_{\text{IND}}$ test set (transfer setting):

Model	MRR	H@1	H@3	H@10
StATIK (BERT-base)	0.770	0.765	0.771	0.779
FnF-TG (BERT-base)	0.799	0.741	0.833	0.911

Ablation Study¶

Ablation of controlled experimental setup (FnF-T, WN18RR$_{\text{IND}}$ MRR):

Cumulative Improvements	WN18RR MRR	FB15k MRR
BLP Original	0.285	0.195
+ Inductive Relations	0.281	0.219
+ Batch-tied negative samples	0.300	0.221
+ Larger embedding size	0.339	0.254
+ Larger batch size	0.366	0.260
+ Better sampling	0.373	0.266

Impact of text encoder sizes (WN18RR$_{\text{IND}}$): - Text-only: base=0.373 → tiny=0.193 (gap of 0.180) - Graph-aware: base=0.732 → tiny=0.638 (gap of only 0.094)

Key Findings¶

Huge performance boost from the GT encoder: MRR improves from 0.373 (text-only) to 0.732 (+0.359) on WN18RR, significantly outperforming StATIK's 0.516.
Small BERT + GT > Large BERT without GT: BERT-tiny + GT (0.638) substantially outperforms BERT-base text-only (0.373).
Graph structure encoding reduces reliance on large text encoders: The performance drop from base to tiny is 0.18 for text-only, but only 0.094 with GT.
Importance of controlled experiments: Realizing standard adjustments (e.g., embedding size, batch size) improves BLP from 0.285 to 0.373.
Diminished yet leading performance in transfer settings: Although the gap narrows on smaller test graphs with fewer neighbors, FnF-TG still outperforms StATIK.

Highlights & Insights¶

The 'Fast and Frugal' concept: Reduces the need for expensive text encoders through superior graph encoding, achieving a win-win in speed and cost.
Relation embedding injection in attention: The design using $\text{diag}(\mathbf{1} + \text{LN}(\mathbf{r}_{ij})\mathbf{W}_R)$ as a scaling factor is highly elegant, preserving original attention behavior when relations are absent.
Insights from controlled experiments: Many claimed SOTA advantages actually stem from differences in experimental settings. The paper honestly highlights this finding.
Extension to a fully inductive setup: Enables relations to be inductive for the first time, opening new avenues for KG link prediction.
Structural information as an effective surrogate: The inductive bias of explicit graph relations serves as an effective substitute for extracting identical information from text with heavy encoders.

Limitations & Future Work¶

Only encodes the 1-hop neighborhood (ego-graph); extending to multi-hop neighborhood could further improve performance.
The benchmark datasets for the fully inductive setting (unseen relations) are relatively new, and evaluation might not be exhaustive.
Uses TransE as the structural objective; more complex scoring functions (such as RotatE) may yield additional gains.
Limited performance gains in the Wikidata5M$_{\text{IND}}$ transfer setting, suggesting a need for better strategies in handling small neighborhood graphs.
Has not explored integration with Large Language Models (LLMs) such as GPT or LLaMA.

StATIK: A SOTA method using BERT + MPNN. This work replaces the MPNN with a pure Transformer and achieves superior performance.
G2GT series (Mohammadshahi & Henderson): Pioneering works injecting graph structure into Transformer attention, which this paper inherits and extends.
BLP (Daza et al.): A text-only inductive method, upon which this paper builds a fairer baseline.
Insight: The Transformer's attention mechanism is naturally suited for processing graph structure, eliminating the need for dedicated GNN designs.

Rating¶

Novelty: ⭐⭐⭐⭐ — The injection of relation embeddings into self-attention is simple and effective. The fully inductive setup is a novel contribution.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Solid and extensive controlled experiments, multiple datasets, various BERT sizes, and rigorous ablations.
Writing Quality: ⭐⭐⭐⭐⭐ — Well-structured; the transparency and honesty in handling controlled experiments are highly commendable.
Value: ⭐⭐⭐⭐ — The 'Fast and Frugal' philosophy holds strong practical value for real-world deployment.