Large Language Models are Good Relational Learners
Conference: ACL 2025
arXiv: 2506.05725
Code: GitHub
Area: Relational Data Learning / LLM & Structured Data
Keywords: Relational Deep Learning, Graph Neural Networks, RAG, Graph Prompt Tuning, Relational Databases
TL;DR
The authors propose the Rel-LLM framework, which utilizes a GNN encoder to extract structured subgraph representations from relational databases and injects them as soft prompts into a frozen LLM. It achieves SOTA performance on relational deep learning (RDL) tasks on the RelBench benchmark and supports zero-shot prediction.
Background & Motivation
Background: LLMs perform exceptionally well in NLP, CV, information retrieval, and other fields, but still fall short in processing and reasoning over relational databases (RDBs). Approximately 73% of the world's data is stored in relational databases, where tables are interconnected via primary-foreign keys, forming complex network structures.
Limitations of Prior Work: Existing methods "flatten" relational databases into text documents before feeding them to LLMs. This approach suffers from three major issues: (1) the relational structure between tables is lost; (2) nested joins introduce redundant copies of entities; (3) the serialization of a large database often exceeds the LLM's context length limit.
Key Challenge: LLMs excel at textual reasoning but struggle with explicit relational structures, whereas GNNs excel at modeling graph structures but lack semantic understanding and generalization capabilities.
Goal: Enable LLMs to make effective use of the structured information in relational databases while preserving the relationship semantics between tables.
Key Insight: Modeling relational databases as heterogeneous graphs, encoding local subgraphs using GNNs, and mapping graph embeddings onto the latent space of the LLM via a projection layer to serve as soft prompts.
Core Idea: Utilizing GNNs to capture relational structures combined with a RAG framework to inject them into the LLM, achieving structure-aware relational reasoning.
Method
Overall Architecture
Rel-LLM consists of four components: (1) temporal-aware subgraph sampling to ensure causal consistency; (2) a heterogeneous GNN encoder to extract structural feature representations of entities; (3) a projection layer + denormalized prompt construction to organize graph embeddings into structured prompts processable by the LLM; (4) a frozen LLM that receives graph prompts and text embeddings for joint reasoning.
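The data flow through components (2) to (4) can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: it uses a single weight pair instead of per-type heterogeneous weights, hand-picked matrices instead of learned ones, and placeholder text embeddings. All names (`sage_layer`, `mean_pool`, `project`) are hypothetical.

```python
from typing import Dict, List

Vec = List[float]
Mat = List[List[float]]  # row-major: one row per output dimension


def matvec(w: Mat, x: Vec) -> Vec:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def add(a: Vec, b: Vec) -> Vec:
    return [x + y for x, y in zip(a, b)]


def sage_layer(h: Dict[str, Vec], nbrs: Dict[str, List[str]],
               w_self: Mat, w_nbr: Mat) -> Dict[str, Vec]:
    """One GraphSAGE-style layer with sum aggregation (simplified: no edge types)."""
    out = {}
    for node, feat in h.items():
        agg = [0.0] * len(feat)
        for n in nbrs.get(node, []):
            agg = add(agg, h[n])  # sum over neighbor embeddings
        pre = add(matvec(w_self, feat), matvec(w_nbr, agg))
        out[node] = [max(0.0, v) for v in pre]  # ReLU
    return out


def mean_pool(h: Dict[str, Vec]) -> Vec:
    """Average node embeddings into one subgraph-level vector h_g."""
    dim = len(next(iter(h.values())))
    return [sum(v[k] for v in h.values()) / len(h) for k in range(dim)]


def project(h_g: Vec, w_proj: Mat) -> Vec:
    """Linear map from the GNN space (d_g) to the LLM hidden space (d_l)."""
    return matvec(w_proj, h_g)


# Toy subgraph: one user row linked to two order rows (d_g = 2).
h0 = {"user:1": [1.0, 0.0], "order:7": [0.0, 1.0], "order:9": [1.0, 1.0]}
nbrs = {"user:1": ["order:7", "order:9"],
        "order:7": ["user:1"], "order:9": ["user:1"]}
identity = [[1.0, 0.0], [0.0, 1.0]]

h1 = sage_layer(h0, nbrs, identity, identity)
h_g = mean_pool(h1)

w_proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # d_g=2 -> d_l=4
graph_token = project(h_g, w_proj)

# Placeholder text-token embeddings; the soft prompt is prepended to them.
text_tokens = [[0.1] * 4, [0.2] * 4]
llm_input = [graph_token] + text_tokens
```

The frozen LLM would consume `llm_input` as its input embedding sequence, so gradients reach only the encoder and projection weights.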
Key Designs
- Relational Entity Graph (REG): Converts the relational database into a heterogeneous graph \(G = (\mathcal{V}, \mathcal{E}, \phi, \psi)\), where each row of data is a node, and primary-foreign key relationships are edges. Node and edge types are determined by table names and relations. Initial node embeddings are generated by a multimodal column encoder.
- Temporal-Aware Subgraph Sampling: Centered on the target entity and using the prediction time \(t^*\) as the cutoff, only neighboring nodes with timestamps earlier than \(t^*\) are sampled to avoid temporal information leakage.
- Heterogeneous GraphSAGE Encoder: Employs heterogeneous GraphSAGE with sum aggregation for \(L\)-layer message passing to obtain node embeddings \(\mathbf{h}_i^{(L)}\), which are then mean-pooled to obtain subgraph-level representations \(\mathbf{h}_g^{(L)}\).
- MLP Projection Layer: Projects graph embeddings from the GNN space \(\mathbb{R}^{d_g}\) to the LLM hidden space \(\mathbb{R}^{d_l}\) to achieve modality alignment.
- Denormalized Prompt Construction: Rooted at the target entity, recursively unfolding along primary-foreign key links (breadth-first, depth \(\zeta\), maximum \(n_{\text{nest}}\) entities per layer) to organize graph embeddings of associated entities into a nested JSON structure, reducing multi-hop reasoning requirements.
- Three Answer Generation Strategies: (1) Pure Text Generation—directly outputs readable text; (2) Token Distribution—outputs probability distributions for probabilistic tasks; (3) MLP Transformation—uses a lightweight network to project LLM hidden representations into the task space. Different tasks are suited to different strategies.
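The temporal cutoff and the denormalized prompt construction described above can be sketched together on a toy database. The sketch below is illustrative only and rests on several assumptions: the tables, row ids, and timestamps are invented; the cap \(n_{\text{nest}}\) is applied per parent entity; a visited set keeps each entity from appearing twice; and string placeholders such as `<graph_emb:order:7>` stand in for the projected graph embeddings.

```python
import json
from collections import deque
from typing import Any, Dict, List

# Hypothetical toy database: row id -> metadata; LINKS follows primary-foreign keys.
ROWS: Dict[str, Dict[str, Any]] = {
    "user:1":  {"table": "users",  "time": 0},
    "order:3": {"table": "orders", "time": 2},
    "order:7": {"table": "orders", "time": 5},
    "order:9": {"table": "orders", "time": 12},
    "item:4":  {"table": "items",  "time": 1},
}
LINKS: Dict[str, List[str]] = {
    "user:1": ["order:3", "order:7", "order:9"],
    "order:3": ["item:4"],
    "order:7": ["item:4"],
}


def make_entry(row_id: str) -> Dict[str, Any]:
    return {"id": row_id, "table": ROWS[row_id]["table"],
            "embedding": f"<graph_emb:{row_id}>", "children": []}


def denormalize(root: str, t_star: float, depth: int, n_nest: int) -> Dict[str, Any]:
    """Breadth-first unfolding from the target entity into a nested structure.

    Only rows with timestamp < t_star are kept (no temporal leakage),
    at most `depth` hops are followed, and each parent keeps at most
    `n_nest` children.
    """
    tree = make_entry(root)
    visited = {root}
    queue = deque([(root, tree, 0)])
    while queue:
        node_id, node, level = queue.popleft()
        if level == depth:
            continue
        for child_id in LINKS.get(node_id, []):
            if len(node["children"]) == n_nest:
                break
            if child_id in visited or ROWS[child_id]["time"] >= t_star:
                continue
            visited.add(child_id)
            child = make_entry(child_id)
            node["children"].append(child)
            queue.append((child_id, child, level + 1))
    return tree


tree = denormalize("user:1", t_star=10, depth=2, n_nest=2)
prompt_json = json.dumps(tree, indent=2)  # nested JSON handed to the prompt builder
```

With `t_star=10`, the row `order:9` (timestamp 12) never enters the prompt, which is the leakage guarantee the sampling step is meant to provide.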
Loss & Training
- Masked Table Modeling: Randomly selects a portion of nodes to be masked, replacing original features with a learnable mask token, and then tasks the LLM with reconstructing the properties (column name-value pairs) of the masked entities. Column order is randomly permuted to enhance robustness.
- The pre-training loss is a standard autoregressive NLL, averaged over the masked nodes: \(\mathcal{L}_{\text{pretrain}} = -\frac{1}{|\mathcal{V}_{\text{mask}}|} \sum_{v_i \in \mathcal{V}_{\text{mask}}} \sum_t \log p_\theta\left(y_i^{(t)} \mid y_i^{(<t)}, \hat{\mathbf{h}}_{\text{mask}}\right)\)
- Only the GNN encoder \(\phi_1\), projection layer \(\phi_2\), and mask token are optimized, while the LLM parameters \(\theta\) remain frozen.
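The masked-table objective can be made concrete with a small sketch. It assumes the probabilities the model assigns to the correct target tokens are already available as plain floats; in practice they would come from the frozen LLM's output distribution. The serialization format and the helper names `build_target` and `pretrain_loss` are hypothetical.

```python
import math
import random
from typing import Dict, List


def build_target(row: Dict[str, object], rng: random.Random) -> str:
    """Serialize a masked row as column name-value pairs in random column order."""
    cols = list(row.items())
    rng.shuffle(cols)  # column-order permutation used as augmentation
    return ", ".join(f"{name}: {value}" for name, value in cols)


def pretrain_loss(token_probs: List[List[float]]) -> float:
    """Autoregressive NLL, summed over tokens and averaged over masked nodes.

    token_probs[i][t] is the probability assigned to the correct token
    y_i^(t) of masked node v_i, given the preceding tokens and the mask prompt.
    """
    total = -sum(math.log(p) for probs in token_probs for p in probs)
    return total / len(token_probs)


rng = random.Random(0)
target = build_target({"price": 9.5, "category": "shoes"}, rng)

# Two masked nodes: the first target has two tokens, the second has one.
loss = pretrain_loss([[0.5, 0.5], [0.25]])
```

Here the summed log-probability is \(\log(0.5 \cdot 0.5 \cdot 0.25) = -\log 16\), so dividing by the two masked nodes gives a loss of \(\log 4\).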
Key Experimental Results
Experimental Setup
- Benchmark: RelBench, which contains 7 datasets and 30 prediction tasks; the results reported here cover entity classification and entity regression
- Backbone LLM: Llama 3.2-1B (128K context)
- Baselines: LightGBM, RDL (GNN + deep tabular model), ICL (LLM in-context learning), ICL+MLP
Main Results
Entity Classification (AUROC ↑):
| Dataset | LightGBM | RDL | ICL+MLP | Rel-Zero | Rel-LLM |
|---|---|---|---|---|---|
| rel-amazon user-churn (Test) | 52.22 | 70.42 | 66.56 | 60.07 | 71.89 |
| rel-event user-repeat (Test) | 68.04 | 76.89 | 76.72 | 68.12 | 79.26 |
| rel-stack user-engagement (Test) | 63.39 | 90.59 | 87.09 | 69.46 | 91.21 |
| Overall Average (Test) | 63.66 | 75.83 | 76.83 | 63.42 | 77.82 |
- Rel-LLM outperforms or matches SOTA on all datasets, achieving an average AUROC of 77.82.
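- Zero-shot Rel-Zero trails the fine-tuned Rel-LLM. It beats or matches LightGBM on the tasks listed above (for example, 60.07 vs. 52.22 on rel-amazon user-churn), but its overall average (63.42) is roughly on par with LightGBM's (63.66).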
Entity Regression (MAE ↓):
- On tasks such as rel-hm item-sales, Rel-LLM achieves the lowest MAE.
- Compared to ICL+MLP, Rel-LLM achieves a 5-15% improvement on most tasks.
Key Findings
- The GNN encoder effectively preserves relational structural information, avoiding information loss caused by text serialization.
- The graph prompt tuning approach does not require modification of LLM parameters, making training costs significantly lower than full fine-tuning.
- The masked table modeling during the pre-training phase endows the model with zero-shot transfer capabilities.
- Different tasks are suited to different answer generation strategies; classification tasks favor token distribution, while regression tasks favor MLP transformation.
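For the token-distribution strategy, one plausible reading is that the classification score is taken from the LLM's next-token distribution over the candidate answer tokens. The sketch below assumes a binary task with answer tokens "Yes" and "No" and renormalizes over just those two logits; both choices are assumptions of this example, not details confirmed by the summary above.

```python
import math
from typing import Dict


def answer_probability(logits: Dict[str, float],
                       positive: str = "Yes", negative: str = "No") -> float:
    """Softmax restricted to the two candidate answer tokens; returns P(positive)."""
    m = max(logits[positive], logits[negative])  # subtract max for stability
    e_pos = math.exp(logits[positive] - m)
    e_neg = math.exp(logits[negative] - m)
    return e_pos / (e_pos + e_neg)


# Hypothetical next-token logits; other vocabulary entries are ignored.
score = answer_probability({"Yes": math.log(3.0), "No": 0.0, "Maybe": 5.0})
```

A continuous score of this kind is what a ranking metric such as AUROC needs, which a single decoded text label cannot provide.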
Highlights & Insights
- Structure-Preserving RAG: Unlike traditional RAG, Rel-LLM does not retrieve text fragments but rather retrieves graph-structured subgraphs, preserving relational semantics.
- Efficient Fine-Tuning: The LLM stays frozen and only the GNN encoder, the projection layer, and the mask token are trained, so the number of trainable parameters is small relative to the LLM.
- Denormalization to Nested JSON: "Translates" graph structures into a format that LLMs can read. JSON was reported to outperform Markdown and CSV for tabular data encoding (Singha et al., 2023).
- Temporal Consistency: Strictly avoids information leakage through temporal-aware sampling, making it suitable for time-series forecasting scenarios.
- Zero-Shot Capability: Can obtain reasonable predictions on new tasks after pre-training without requiring further fine-tuning.
Limitations & Future Work
- It relies on small models like Llama 3.2-1B; whether it performs better on larger models remains unverified.
- The denormalization depth \(\zeta\) and the per-layer entity cap \(n_{\text{nest}}\) are hyperparameters that must be tuned manually for each database structure.
- It cannot be directly applied to unstructured data that lacks clear primary-foreign key relationships.
- It has only been validated on RelBench and has not been tested in more practical application scenarios.
Related Work
- Relational Tabular Learning: Benchmarks such as CTU, SJTUTable, and RelBench have driven deep learning research on relational data.
- LLMs for Tabular Data: Existing methods serialize tables into text but face challenges of context length limits and loss of structural information.
- Graph Prompt Learning: Injecting GNN embeddings as soft prompts for LLMs is a recent trend in graph-language multimodality.
Rating
⭐⭐⭐⭐ — Clever methodology design and comprehensive experiments. The study makes a meaningful exploration in the important yet overlooked direction of relational databases + LLMs. The combination of GNN + RAG + frozen LLM shows excellent scalability.
Additional Details
- Randomly permuting column order in pre-training serves as an effective data augmentation strategy, preventing the model from only learning reconstruction under a specific column sequence.
- The early experimental finding that JSON format outperforms Markdown and CSV for relational data encoding comes from Singha et al., 2023.
- In some tasks, the performance of ICL+MLP is close to Rel-LLM, suggesting that the representation capabilities of LLMs can be partially unleashed even through simple text serialization.