Relational Graph Transformer¶

Conference: ICLR 2026
arXiv: 2505.10960
Code: GitHub
Area: Graph Learning
Keywords: Graph Transformer, Relational Deep Learning, Multi-Element Tokenization, Heterogeneous Temporal Graph, Positional Encoding

TL;DR¶

This paper proposes RelGT, the first graph Transformer specifically designed for relational databases. Through multi-element tokenization (a 5-tuple of feature/type/hop distance/time/local structure encodings) and a local–global hybrid attention mechanism, RelGT matches or outperforms GNN baselines on 19 of the 21 evaluated RelBench tasks (within a ±1% threshold), with improvements of up to 18%.

Background & Motivation¶

Enterprise data (financial transactions, e-commerce records, healthcare, etc.) is predominantly stored in relational databases. Relational Deep Learning (RDL) converts multi-table data into heterogeneous temporal graphs (Relational Entity Graphs, REGs) and applies GNNs to learn representations. However, GNNs exhibit inherent limitations:

Insufficient structural expressiveness: Message passing fails to capture complex structural patterns. For example, two transactions made by the same customer are 2-hop neighbors that are connected only indirectly, through the shared customer node.

Limited long-range dependency modeling: In a 2-layer GNN, two product nodes can never directly interact, because the shortest path between them spans 4 hops (product → transaction → customer → transaction → product).

Existing graph Transformers are ill-suited for REGs:

- Traditional positional encodings (Laplacian PE, node2vec) do not generalize to large-scale heterogeneous graphs.
- They lack the ability to model temporal dynamics and schema constraints.
- Existing tokenization schemes discard critical structural information.
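The hop-count argument above can be checked on a toy graph. The following is a minimal sketch, assuming a hypothetical four-edge relational entity graph (the node names `product_A`, `txn_1`, and so on are illustrative and not from the paper); it uses breadth-first search to confirm that two products bought by the same customer are 4 hops apart, while their transactions are 2 hops apart.

```python
from collections import deque

# Hypothetical toy graph: one customer with two transactions,
# each transaction referencing a different product.
edges = [
    ("product_A", "txn_1"),
    ("txn_1", "customer_X"),
    ("customer_X", "txn_2"),
    ("txn_2", "product_B"),
]

def build_adjacency(edge_list):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def hop_distance(adj, source, target):
    """Shortest-path hop count via breadth-first search; None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

adj = build_adjacency(edges)
product_hops = hop_distance(adj, "product_A", "product_B")
transaction_hops = hop_distance(adj, "txn_1", "txn_2")
```

Since an $L$-layer message-passing GNN only propagates information $L$ hops, a 2-layer model on this graph cannot let `product_A` see `product_B`.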

Method¶

Overall Architecture¶

RelGT comprises two core components:

Multi-Element Tokenization (§3.1): Each node in the REG is decomposed into 5 element encodings that are then concatenated.
Hybrid Transformer Network (§3.2): Local attention + global centroid attention.

Pipeline: seed node → temporally-aware sampling of $K$ neighbors → 5-element tokenization → local Transformer → global centroid attention → prediction head.

Key Designs¶

Multi-Element Tokenization (5-Tuple Representation)¶

Each sampled node $v_j$ is represented as a 5-tuple $(x_{v_j}, \phi(v_j), p(v_i, v_j), \tau(v_j) - \tau(v_i), \text{GNN-PE}_{v_j})$:

Node features $x_{v_j}$: A multi-modal encoder processes numerical/categorical/text/image column attributes → $h_{\text{feat}} \in \mathbb{R}^d$
Node type $\phi(v_j)$: Table-level one-hot → projected via a learnable matrix → $h_{\text{type}} \in \mathbb{R}^d$
Relative hop distance $p(v_i, v_j)$: Shortest-path hop count from seed to neighbor → one-hot encoding → $h_{\text{hop}} \in \mathbb{R}^d$
Relative time $\tau(v_j) - \tau(v_i)$: Timestamp difference → linear transformation → $h_{\text{time}} \in \mathbb{R}^d$
Subgraph GNN PE: A lightweight GNN operates on the sampled subgraph with random initial features → $h_{\text{pe}} \in \mathbb{R}^d$

Final combination: $$h_{\text{token}}(v_j) = O \cdot [h_{\text{feat}} \| h_{\text{type}} \| h_{\text{hop}} \| h_{\text{time}} \| h_{\text{pe}}]$$

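where $O \in \mathbb{R}^{5d \times d}$ is a learnable mixing matrix that maps the concatenated $5d$-dimensional vector back to a $d$-dimensional token (with this shape, the product reads most naturally under a row-vector convention).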
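To make the shapes in the 5-tuple concrete, here is a minimal NumPy sketch of assembling one token. It is an illustration under stated assumptions, not the authors' implementation: the weight names (`W_feat`, `W_type`, `W_hop`, `W_time`), the toy sizes, and the use of a plain linear map in place of the multi-modal feature encoder are all hypothetical, and the GNN-PE vector is assumed to be already $d$-dimensional.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8           # hidden size (illustrative)
num_types = 3   # number of tables / node types (illustrative)
max_hops = 2    # largest hop distance in the sampled subgraph (illustrative)
feat_dim = 5    # raw feature width; stands in for the multi-modal encoder input

# Hypothetical parameters; in a trained model these would be learned.
W_feat = rng.normal(size=(feat_dim, d))
W_type = rng.normal(size=(num_types, d))
W_hop = rng.normal(size=(max_hops + 1, d))
W_time = rng.normal(size=(1, d))
O = rng.normal(size=(5 * d, d))  # mixing matrix, shape (5d, d)

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def tokenize(x, node_type, hop, dt, gnn_pe):
    """Combine the five element encodings into one d-dimensional token."""
    h_feat = x @ W_feat                               # node features
    h_type = one_hot(node_type, num_types) @ W_type   # table-level type
    h_hop = one_hot(hop, max_hops + 1) @ W_hop        # hop distance to the seed
    h_time = np.array([dt]) @ W_time                  # tau(v_j) - tau(v_i)
    h_pe = gnn_pe                                     # assumed already in R^d
    concat = np.concatenate([h_feat, h_type, h_hop, h_time, h_pe])  # (5d,)
    return concat @ O                                 # (d,), row-vector convention

token = tokenize(
    x=rng.normal(size=feat_dim),
    node_type=1,
    hop=2,
    dt=-3.5,  # neighbor is older than the seed
    gnn_pe=rng.normal(size=d),
)
```

Each element keeps its own projection, so the type, hop, and time signals are encoded separately before $O$ mixes them.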

Elegant design of Subgraph GNN PE: Random node feature initialization breaks symmetry to enhance expressiveness (Sato et al., 2021), while $Z_{\text{random}}$ is re-sampled at each training step (randomization strategy) to approximately preserve permutation equivariance.
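The symmetry-breaking effect of random initial features can be illustrated with a small NumPy sketch. This is an assumption-laden toy, not the paper's PE network: the mean-aggregation layer, the `tanh` nonlinearity, the two-layer depth, and the 4-node path graph are all chosen here purely for illustration.

```python
import numpy as np

def gnn_pe(adjacency, weights, z):
    """Tiny mean-aggregation GNN over a sampled subgraph.

    adjacency: (K, K) symmetric 0/1 matrix; weights: list of (d, d) matrices;
    z: (K, d) initial node features (random in the randomized-PE setting).
    """
    k = adjacency.shape[0]
    a_hat = adjacency + np.eye(k)                      # add self-loops
    a_hat = a_hat / a_hat.sum(axis=1, keepdims=True)   # row-normalize (mean)
    h = z
    for w in weights:
        h = np.tanh(a_hat @ h @ w)
    return h

rng = np.random.default_rng(0)
K, d = 4, 8
# Path graph 0 - 1 - 2 - 3: nodes 0 and 3 are structurally symmetric.
A = np.array(
    [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=float,
)
weights = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(2)]

# Constant features: every node ends up with the same encoding.
pe_const = gnn_pe(A, weights, np.ones((K, d)))

# Random features, re-sampled at every "training step".
pe_step1 = gnn_pe(A, weights, rng.normal(size=(K, d)))
pe_step2 = gnn_pe(A, weights, rng.normal(size=(K, d)))
```

With constant inputs the encodings collapse to a single vector, whereas random inputs give each node a distinct encoding that changes from step to step, so a downstream model cannot rely on any particular random draw.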

Core advantage: No expensive global PE precomputation over the full graph is required; all encodings are local and lightweight.

Transformer Network: Local + Global¶

Local module: Full pairwise self-attention over $K$ sampled tokens for the seed node ($L$ Transformer layers), covering a broader receptive field than GNN message passing. Pooling is performed via a learnable linear combination.

$$h_{\text{local}}(v_i) = \text{Pool}(\text{FFN}(\text{Attention}(v_i, \{v_j\}_{j=1}^K))_L)$$
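The local module can be sketched as follows. This is a simplified illustration under several assumptions not stated in the summary: a single attention head, residual connections without layer normalization, a ReLU feed-forward block, and pooling implemented as a weight vector over the $K$ token positions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Full pairwise (single-head) self-attention over K tokens of width d."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (K, K): every token sees every token
    return softmax(scores, axis=-1) @ v

def ffn(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2       # ReLU feed-forward block

def local_module(tokens, layers, pool_weights):
    """Apply L Transformer layers, then pool K tokens into one vector."""
    h = tokens
    for p in layers:
        h = h + self_attention(h, p["Wq"], p["Wk"], p["Wv"])
        h = h + ffn(h, p["W1"], p["W2"])
    return pool_weights @ h                   # (K,) @ (K, d) -> (d,)

rng = np.random.default_rng(0)
K, d, L = 6, 8, 2                             # toy sizes; the paper uses K=300
tokens = rng.normal(size=(K, d))
layers = [
    {name: rng.normal(size=(d, d)) / np.sqrt(d)
     for name in ("Wq", "Wk", "Wv", "W1", "W2")}
    for _ in range(L)
]
pool_weights = np.full(K, 1.0 / K)            # would be learned; mean pooling at init
h_local = local_module(tokens, layers, pool_weights)
```

Because the score matrix is $K \times K$, any two sampled nodes interact in a single layer regardless of how many hops separate them in the graph.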

Global module: The seed node attends to $B$ learnable centroid tokens (centroids are dynamically updated during training via EMA K-Means), capturing database-level patterns that span beyond the local subgraph.

$$h_{\text{global}}(v_i) = \text{Attention}(v_i, \{c_b\}_{b=1}^B)$$

Final representation: $$h_{\text{output}}(v_i) = \text{FFN}([h_{\text{local}}(v_i) \| h_{\text{global}}(v_i)])$$
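A rough sketch of the global module and the final combination follows. The attention and feed-forward parameterization is hypothetical, and `ema_kmeans_update` is a simplified stand-in for the EMA K-Means procedure (nearest-centroid assignment followed by an exponential moving average toward each cluster's batch mean); the exact update rule used by the authors is not given in this summary.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def global_module(h_seed, centroids, Wq, Wk, Wv):
    """Seed representation (d,) attends to B centroid tokens (B, d)."""
    q = h_seed @ Wq
    k = centroids @ Wk
    v = centroids @ Wv
    attn = softmax(k @ q / np.sqrt(q.shape[0]))   # (B,) weights over centroids
    return attn @ v                               # (d,)

def ema_kmeans_update(centroids, batch, decay=0.99):
    """Simplified EMA K-Means step (an assumption, not the paper's exact rule)."""
    sq_dist = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    assign = sq_dist.argmin(axis=1)               # nearest centroid per sample
    updated = centroids.copy()
    for b in np.unique(assign):
        batch_mean = batch[assign == b].mean(axis=0)
        updated[b] = decay * updated[b] + (1.0 - decay) * batch_mean
    return updated

def combine(h_local, h_global, W1, W2):
    """FFN over the concatenation [h_local || h_global]."""
    z = np.concatenate([h_local, h_global])       # (2d,)
    return np.maximum(z @ W1, 0.0) @ W2           # (d,)

rng = np.random.default_rng(0)
d, B = 8, 16                                      # toy sizes; the paper uses B=4096
centroids = rng.normal(size=(B, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
W1 = rng.normal(size=(2 * d, d))
W2 = rng.normal(size=(d, d))

h_seed = rng.normal(size=d)
h_local = rng.normal(size=d)                      # placeholder for the local output
h_global = global_module(h_seed, centroids, Wq, Wk, Wv)
h_output = combine(h_local, h_global, W1, W2)
centroids = ema_kmeans_update(centroids, rng.normal(size=(32, d)))
```

Centroids that receive no samples in a batch are left unchanged, so the codebook drifts slowly toward the regions of representation space that training batches actually visit.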

Loss & Training¶

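Task-specific loss: Selected according to the downstream task (e.g., an L1/MAE loss for regression and a cross-entropy loss for classification, as in the standard RDL pipeline). MAE and AUC are the reported evaluation metrics; AUC itself is not used as a training loss.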
End-to-end training: Replaces the GNN component within the RDL pipeline (Robinson et al., 2024)
Model scale: 10–20M parameters, learning rate 1e-4
Sampling parameters: $K=300$ local neighbors, $B=4096$ global centroids
Depth: $L \in \{1, 4, 8\}$ searched when training nodes < 1M; fixed at $L=4$ otherwise
Batch size: 256 for < 1M nodes; 1024 for > 1M nodes
Dropout: $\{0.3, 0.4, 0.5\}$
Temporally-aware sampling enforces $\tau(v_j) \leq \tau(v_i)$ to prevent data leakage
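The temporal constraint in the last item can be sketched as a filtered breadth-first expansion. This is a minimal illustration, not the RelBench or RelGT sampler: the function name, the 2-hop limit, the uniform subsampling down to $K$, and the choice to always admit nodes without a timestamp (static tables) are assumptions made here.

```python
import random

def sample_temporal_neighbors(adj, timestamps, seed, seed_time, k,
                              max_hops=2, rng=None):
    """Collect up to k (neighbor, hop) pairs with tau(v_j) <= seed_time."""
    rng = rng or random.Random(0)
    seen = {seed}
    frontier = [seed]
    candidates = []
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for node in frontier:
            for nbr in adj.get(node, ()):
                if nbr in seen:
                    continue
                t = timestamps.get(nbr)
                # Skip nodes from the future; nodes without a timestamp
                # are treated as always admissible (an assumption).
                if t is not None and t > seed_time:
                    continue
                seen.add(nbr)
                candidates.append((nbr, hop))
                next_frontier.append(nbr)
        frontier = next_frontier
    if len(candidates) > k:
        candidates = rng.sample(candidates, k)
    return candidates

# Hypothetical toy database: one customer, three transactions, three products.
adj = {
    "cust": ["t1", "t2", "t3"],
    "t1": ["cust", "pA"], "t2": ["cust", "pB"], "t3": ["cust", "pC"],
    "pA": ["t1"], "pB": ["t2"], "pC": ["t3"],
}
timestamps = {"t1": 1, "t2": 4, "t3": 9}   # products carry no timestamp

sampled = sample_temporal_neighbors(adj, timestamps, "cust", seed_time=5, k=300)
```

Transaction `t3` happens after the seed time and is dropped; as a consequence `pC`, which is reachable only through `t3`, never enters the sample either.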

Key Experimental Results¶

Main Results¶

Benchmark: RelBench (7 datasets, 21 tasks) spanning e-commerce, clinical, social, and sports domains; training set sizes range from 1.3K to 5.4M.

Regression tasks (MAE↓):

| Dataset | Task | RDL (GNN) | HGT | HGT+PE | RelGT | Gain vs. RDL |
|---|---|---|---|---|---|---|
| rel-avito | ad-ctr | 0.041 | 0.046 | 0.048 | 0.035 | 15.85% |
| rel-trial | site-success | 0.400 | 0.443 | 0.440 | 0.326 | 18.43% |
| rel-amazon | item-ltv | 50.05 | 55.87 | 55.85 | 48.92 | 2.26% |
| rel-hm | item-sales | 0.056 | 0.064 | 0.064 | 0.054 | 4.29% |

Classification tasks (AUC↑):

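| Dataset | Task | RDL (GNN) | HGT | HGT+PE | RelGT | Gain vs. RDL |
|---|---|---|---|---|---|---|
| rel-f1 | driver-top3 | 0.755 | 0.708 | 0.763 | 0.835 | 10.56% |
| rel-avito | user-clicks | 0.659 | 0.638 | 0.646 | 0.683 | 3.64% |
| rel-stack | user-engagement | 0.902 | 0.885 | 0.882 | 0.905 | 0.35% |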

Overall statistics (±1% threshold): 10 tasks with clear improvement / 9 on par / 2 with marginal degradation.
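The Gain column and the ±1% bucketing can be reproduced with a few lines of arithmetic. The sketch below assumes the gain is the relative improvement over the RDL (GNN) baseline and that the ±1% threshold applies to that relative gain; both are inferred from the tables rather than stated explicitly. Gains recomputed from the rounded table entries can differ slightly from the reported ones (for ad-ctr, 0.041 vs. 0.035 gives about 14.6% rather than 15.85%), presumably because the reported gains were computed from unrounded scores.

```python
def relative_gain(baseline, model, higher_is_better):
    """Relative improvement over the baseline, in percent."""
    diff = (model - baseline) if higher_is_better else (baseline - model)
    return 100.0 * diff / baseline

def bucket(gain_pct, threshold=1.0):
    """Classify a gain using a symmetric percentage threshold."""
    if gain_pct > threshold:
        return "improved"
    if gain_pct < -threshold:
        return "degraded"
    return "on par"

# Values taken from the tables above.
ltv_gain = relative_gain(50.05, 48.92, higher_is_better=False)        # MAE, lower is better
clicks_gain = relative_gain(0.659, 0.683, higher_is_better=True)      # AUC, higher is better
engagement_gain = relative_gain(0.902, 0.905, higher_is_better=True)

labels = {
    "item-ltv": bucket(ltv_gain),
    "user-clicks": bucket(clicks_gain),
    "user-engagement": bucket(engagement_gain),
}
```

Under these assumptions, item-ltv and user-clicks count as clear improvements, while user-engagement falls inside the ±1% band and counts as on par.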

Ablation Study¶

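Values are relative performance changes versus the full RelGT model (negative = worse, positive = better); "n/a" marks entries that are not reported.

| Removed Component | ad-ctr | user-clicks | site-success | Trend |
|---|---|---|---|---|
| w/o global module | -6.00% | +7.85% | -19.08% | Task-dependent |
| w/o GNN PE | -1.14% | -15.15% | n/a | Consistently drops |
| w/o node type | -7.14% | +5.01% | n/a | Mixed |
| w/o hop encoding | -3.43% | +5.77% | n/a | Mixed |
| w/o relative time | -9.14% | +8.37% | n/a | Mixed |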

Key Findings¶

Subgraph GNN PE is the only component whose removal hurts on every ablated task: Performance degrades on both ad-ctr and user-clicks, as it is the sole explicit encoding of local structure (parent–child relationships, cycles, etc.).
The global module exhibits strong task dependency: Removing it causes a 19% drop on site-success (which requires global context), yet yields a 7.9% improvement on user-clicks (where local information suffices).
HGT+PE is inferior to RelGT: Even augmenting HGT with Laplacian PE does not match RelGT on the reported tasks, suggesting that multi-element decomposition is more effective than a single PE scheme.
No expensive precomputation required: All encodings are computed on sampled subgraphs, which avoids the cost of a full-graph Laplacian eigendecomposition.

Highlights & Insights¶

Multi-element tokenization paradigm: Extends the NLP Transformer's "token + position" concept to a 5-element representation, decoupling the encoding of different information dimensions — superior to compressing all information into a single PE.
Random-feature GNN PE: Elegantly leverages random initialization to enhance expressiveness, combined with per-step re-sampling to maintain equivariance — a theoretically and practically sound design.
Engineering-friendly: Directly replaces the GNN component in the RDL pipeline while keeping all other infrastructure unchanged.
Global centroid mechanism: EMA K-Means dynamically updates centroids during training without requiring additional preprocessing steps.

Limitations & Future Work¶

Recommendation tasks are not covered (9 of the 30 RelBench tasks are excluded), as recommendation requires specialized treatment such as pair-wise learning.
Temporal encoding relies on a simple linear transformation; more advanced schemes (e.g., periodic functions, learnable temporal kernels) could be incorporated.
The fixed sampling size of $K=300$ may be suboptimal for extremely large or small local structures.
Global centroids introduce noise for certain tasks; an adaptive on/off mechanism could be considered.
Exhaustive hyperparameter search was not conducted, suggesting that reported results may have further room for improvement.

Related Work & Insights¶

vs. GraphGPS: GPS targets homogeneous static graphs and cannot handle heterogeneous/temporal settings; RelGT is specifically designed for REGs.
vs. HGT: HGT handles heterogeneity but lacks effective PE and temporal modeling; RelGT's 5-element representation provides comprehensive coverage.
vs. RelGNN / ContextGNN: These methods enhance GNNs but lack the flexibility of the Transformer's full pairwise attention.
Insight: The multi-element tokenization paradigm is generalizable to other multi-dimensional heterogeneous graph scenarios (e.g., knowledge graphs, molecular networks, code dependency graphs).

Rating¶

Novelty: ★★★★☆ — Multi-element tokenization and random GNN PE are significant contributions.
Technical Depth: ★★★★☆ — Careful design with theoretical justification for each component.
Experimental Thoroughness: ★★★★★ — 21 tasks, multiple baselines, and comprehensive ablations.
Writing Quality: ★★★★☆ — Clear structure with excellent figures.