Relatron: Automating Relational Machine Learning over Relational Databases

Conference: ICLR 2026
arXiv: 2602.22552
Code: https://github.com/amazon-science/Automating-Relational-Machine-Learning
Area: Graph Learning / AutoML
Keywords: Relational Databases, Graph Neural Networks, Deep Feature Synthesis, Architecture Selection, Homophily

TL;DR

This work systematically compares relational deep learning (RDL, based on GNNs) and deep feature synthesis (DFS) on predictive tasks over relational databases and finds that neither dominates uniformly: performance is highly task-dependent. The authors propose Relatron, a task-embedding-based meta-selector that uses RDB task homophily and affinity embeddings for automatic architecture selection, achieving up to 18.5% improvement in joint architecture–hyperparameter search.

Background & Motivation

Background: Predictive modeling over relational databases (RDB) follows two main paradigms: DFS (programmatically composing aggregation primitives to generate feature tables, then applying a tabular learner) and RDL (end-to-end training of GNNs on heterogeneous entity–relation graphs). Both outperform relation-agnostic baselines.
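The DFS side of this comparison can be made concrete with a minimal sketch. It assumes a hypothetical two-table schema (`customers`, `orders`) and three aggregation primitives; the table names, column names, and the `synthesize_features` helper are illustrative only and are not taken from the paper or its code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy schema: a parent table (customers) and a child table (orders)
# linked by the foreign key customer_id.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 5.0},
]

# Aggregation primitives; each maps a non-empty list of child values to a scalar.
PRIMITIVES = {"count": len, "sum": sum, "mean": mean}


def synthesize_features(parents, children, key, column, child_name):
    """Aggregate one child column onto each parent row across a foreign key."""
    grouped = defaultdict(list)
    for row in children:
        grouped[row[key]].append(row[column])

    table = []
    for parent in parents:
        values = grouped.get(parent[key], [])
        features = dict(parent)
        for name, fn in PRIMITIVES.items():
            # Parents without children get a missing value instead of an error.
            features[f"{name}({child_name}.{column})"] = fn(values) if values else None
        table.append(features)
    return table


feature_table = synthesize_features(customers, orders, "customer_id", "amount", "orders")
```

The resulting `feature_table` is a flat table with one row per customer, which is what a tabular learner would consume; RDL instead keeps the tables as a graph and learns the aggregation end to end.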

Limitations of Prior Work: It remains unclear which paradigm is superior in which setting, so practitioners lack principled guidance for choosing between DFS and RDL. Validation performance is also an unreliable proxy for model selection: a more extensive search can lead to worse test performance (the "over-tuning" effect).

Key Challenge: (a) No single architecture dominates across all tasks; (b) there is a substantial gap between the configuration selected by validation and the test-optimal configuration, particularly when temporal splits introduce distribution shift.

Goal: Given an RDB task, automatically select between RDL and DFS and determine the specific architecture configuration.

Key Insight: A large-scale architecture search is conducted to build a "performance bank," followed by analysis of the factors driving the RDL–DFS performance gap. RDB task homophily and training scale emerge as key predictors.

Core Idea: High homophily → linear aggregation in DFS suffices; low homophily → nonlinear aggregation in RDL is advantageous. A meta-classifier is trained on task embeddings (homophily + affinity + scale) to enable automatic macro- and micro-architecture selection.

Method

Overall Architecture

Construct a decomposed design space for RDL and DFS → conduct large-scale architecture search to build a performance bank → analyze drivers of the performance gap → design task embeddings → train the meta-selector Relatron → apply loss landscape metrics for post-selection.

Key Designs

RDB Task Homophily (Definition 1):
- Function: Measures label consistency along meta-paths in an RDB task.
- Mechanism: Defines self-loop meta-paths \(m\) on an augmented heterogeneous graph and computes \(H(\mathcal{G};m) = \frac{1}{|\mathcal{E}_m|}\sum_{(u,v)\in\mathcal{E}_m} \mathcal{K}(\hat{y}_u, \hat{y}_v)\), where \(\mathcal{E}_m\) is the set of node pairs connected by \(m\) and \(\mathcal{K}\) is a label-similarity kernel. Dot-product similarity is used for classification tasks and Pearson correlation for regression. Adjusted homophily is also supported to correct for class imbalance.
- Design Motivation: Spearman \(\rho = -0.43\) (\(p < 0.05\)) indicates a moderate, statistically significant negative correlation between homophily and the RDL–DFS performance gap. Lower homophily corresponds to a larger RDL advantage.
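As a rough illustration of the homophily definition above, the following sketch computes the average dot-product kernel over meta-path node pairs for a classification task. It assumes hard labels encoded as one-hot vectors, in which case the kernel reduces to label agreement; the function names and the toy node pairs are hypothetical, and the adjusted variant is not shown.

```python
def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def dot(a, b):
    """Dot-product kernel K used for classification labels."""
    return sum(x * y for x, y in zip(a, b))


def metapath_homophily(pairs, labels, num_classes):
    """Average kernel value over node pairs (u, v) connected by one meta-path."""
    if not pairs:
        raise ValueError("the meta-path connects no node pairs")
    total = sum(
        dot(one_hot(labels[u], num_classes), one_hot(labels[v], num_classes))
        for u, v in pairs
    )
    return total / len(pairs)


# Toy example (assumed data): four labeled entities, four meta-path pairs.
labels = {0: 1, 1: 1, 2: 0, 3: 0}
pairs = [(0, 1), (0, 2), (2, 3), (1, 3)]
homophily = metapath_homophily(pairs, labels, num_classes=2)
```

Here two of the four pairs share a label, so the score sits halfway between fully heterophilous (0) and fully homophilous (1).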
Anchor Affinity Embeddings:
- Function: Captures structural, feature-based, and temporal properties of a task.
- Mechanism: Path affinity (single forward pass of randomly initialized GraphSAGE/NBFNet + linear fit), feature affinity (zero-training validation performance via TabPFN), temporal affinity (statistics of label evolution over time), and \(\log(N_{train})\) training scale.
- Design Motivation: Homophily alone captures message-passing preference but additional signals are needed for path model preference, feature quality, and temporal dynamics.
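The path-affinity idea (one forward pass through a randomly initialized message-passing layer, followed by a linear fit) can be sketched as below. This is only an approximation under stated assumptions: a single GraphSAGE-style mean-aggregation layer written in NumPy, a closed-form least-squares readout, and an in-sample \(R^2\) as the score. The function `path_affinity`, the ring graph, and the synthetic labels are all hypothetical; the paper's actual probe may differ in architecture and in how the fit is scored.

```python
import numpy as np


def path_affinity(adj, features, labels, hidden=4, seed=0):
    """Score how well an untrained mean-aggregation layer linearly explains labels."""
    rng = np.random.default_rng(seed)
    degree = adj.sum(axis=1, keepdims=True)
    neighbor_mean = adj @ features / np.maximum(degree, 1.0)

    # One GraphSAGE-style layer with random weights that are never trained.
    w_self = rng.normal(size=(features.shape[1], hidden))
    w_neigh = rng.normal(size=(features.shape[1], hidden))
    h = np.maximum(features @ w_self + neighbor_mean @ w_neigh, 0.0)

    # Linear readout fitted in closed form (least squares), no gradient steps.
    design = np.hstack([h, np.ones((h.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, labels, rcond=None)
    residual = labels - design @ coef
    ss_res = float((residual ** 2).sum())
    ss_tot = float(((labels - labels.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot  # in-sample R^2


# Toy graph (assumed data): a ring of 8 nodes with 3 random features each.
n = 8
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = 1.0
    adj[i, (i - 1) % n] = 1.0
rng = np.random.default_rng(42)
features = rng.normal(size=(n, 3))
labels = adj @ features[:, 0] / 2.0  # target depends on neighbors only
score = path_affinity(adj, features, labels)
```

Because nothing is trained by gradient descent, a probe of this kind costs roughly one forward pass plus a small linear solve per task.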
Loss Landscape Post-Selection:
- Function: Selects more robust checkpoints from the top candidates identified by validation performance.
- Mechanism: Three metrics — first-order \(P_1\) (worst-case finite-difference slope), second-order \(P_2\) (largest eigenvalue of the Hessian), and energy barrier \(P_{bar}\) (maximum loss ridge along a ray). Preference is given to flatter minima.
- Design Motivation: The validation–test gap is reflected in the loss landscape geometry; flatter minima are more robust to distribution shift.
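To make the flatness intuition concrete, here is a minimal sketch of two of the three quantities on toy quadratic losses: a worst-case finite-difference slope (in the spirit of \(P_1\)) and a maximum loss rise along a ray (in the spirit of \(P_{bar}\)). The exact definitions, step sizes, and direction sampling used in the paper are not given in this summary, so the functions, the `eps` and `radius` values, and the toy losses below are assumptions; the Hessian-based \(P_2\) is omitted.

```python
import math


def _unit(direction):
    """Normalize a direction vector to unit length."""
    norm = math.sqrt(sum(x * x for x in direction))
    return [x / norm for x in direction]


def first_order_sharpness(loss, theta, directions, eps=1e-2):
    """Largest finite-difference slope of the loss over the given directions."""
    base = loss(theta)
    slopes = []
    for d in directions:
        u = _unit(d)
        shifted = [t + eps * x for t, x in zip(theta, u)]
        slopes.append((loss(shifted) - base) / eps)
    return max(slopes)


def barrier_along_ray(loss, theta, direction, radius=1.0, steps=20):
    """Maximum loss increase encountered while moving along one ray."""
    base = loss(theta)
    u = _unit(direction)
    values = [
        loss([t + (radius * k / steps) * x for t, x in zip(theta, u)])
        for k in range(steps + 1)
    ]
    return max(values) - base


# Two toy minima at the origin (assumed losses): one sharp, one flat.
def sharp(p):
    return 10.0 * (p[0] ** 2 + p[1] ** 2)


def flat(p):
    return 0.1 * (p[0] ** 2 + p[1] ** 2)


origin = [0.0, 0.0]
axes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
candidates = {"sharp": sharp, "flat": flat}
p1 = {name: first_order_sharpness(f, origin, axes) for name, f in candidates.items()}
barrier = {name: barrier_along_ray(f, origin, [1.0, 0.0]) for name, f in candidates.items()}
preferred = min(p1, key=p1.get)  # post-selection favors the flatter candidate
```

In a post-selection setting, the candidates would be the top checkpoints by validation score rather than analytic functions, and the comparison would stay within one model family.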

Loss & Training

The meta-classifier is trained on the performance bank with leave-one-out (LOO) evaluation, using homophily, statistical, and temporal features. Search efficiency: computational cost is only \(1/10\) that of Fisher information matrix-based methods.
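The leave-one-out protocol can be sketched as follows. The summary does not specify the meta-classifier, so this example swaps in a 1-nearest-neighbor rule purely for illustration; the two-dimensional embeddings (a homophily score and \(\log(N_{train})\)) and the RDL/DFS labels in `bank` are made-up toy values, not entries from the paper's performance bank.

```python
import math

# Hypothetical performance bank: (task embedding, better paradigm on that task).
bank = [
    ([0.90, 4.0], "DFS"),
    ([0.80, 4.5], "DFS"),
    ([0.20, 6.0], "RDL"),
    ([0.10, 5.5], "RDL"),
    ([0.85, 5.0], "DFS"),
    ([0.30, 4.2], "RDL"),
]


def nearest_label(query, examples):
    """Predict the label of the closest example in Euclidean distance."""
    _, label = min(examples, key=lambda ex: math.dist(query, ex[0]))
    return label


def leave_one_out_accuracy(tasks):
    """Hold out each task in turn, fit on the rest, and score the prediction."""
    correct = 0
    for i, (embedding, label) in enumerate(tasks):
        rest = tasks[:i] + tasks[i + 1:]
        if nearest_label(embedding, rest) == label:
            correct += 1
    return correct / len(tasks)


loo_accuracy = leave_one_out_accuracy(bank)
```

With a bank of fewer than 20 tasks, leave-one-out uses every task as a test case once, which is why it is a natural evaluation choice at this scale.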

Key Experimental Results

Main Results

Method	LOO Accuracy (val selection)	LOO Accuracy (test selection)	Avg. Compute Time
Model-free (ours)	87.5%	79.2%	0.48 min
Training-free model	66.7%	66.7%	5 min
Autotransfer (anchor)	66.7%	66.7%	50 min
Simple heuristic	70.8%	75.0%	0 min

Relatron achieves up to 18.5% improvement over strong baselines in joint HPO, at 10× lower computational cost.

Ablation Study

Configuration	Kendall corr. (w/o g)	Kendall corr. (w/ g)	Note
Model-free	0.066	0.163	Best task similarity
Training-free	-0.038	-0.030	Negative correlation
Autotransfer	-0.049	-0.011	Expensive and negatively correlated
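For readers unfamiliar with the metric in this table, the sketch below computes a Kendall rank correlation (the tau-a variant, with ties counted as neither concordant nor discordant). It assumes the correlation is taken between an embedding-based task similarity and some measure of how well configurations transfer between tasks; the exact quantities, the meaning of "g", and the tau variant are not specified in this summary, and the two lists are invented toy values.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    score = 0
    pairs = 0
    for i, j in combinations(range(len(x)), 2):
        pairs += 1
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            score += 1  # concordant pair: both rankings agree
        elif product < 0:
            score -= 1  # discordant pair: rankings disagree
    return score / pairs


# Toy values (assumed): similarity of four source tasks to a target task,
# and the gain from transferring each source task's best configuration.
embedding_similarity = [0.9, 0.7, 0.4, 0.1]
transfer_gain = [0.30, 0.10, 0.20, 0.05]
tau = kendall_tau(embedding_similarity, transfer_gain)
```

A value near zero, as in most rows of the table, means the similarity ranking carries little information about which source tasks transfer well.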

Key Findings

RDL does not consistently outperform DFS: Performance is highly task-dependent, with each paradigm showing clear advantages in distinct settings.
Macro-selection resolves most of the problem: Once the correct paradigm (RDL/DFS) is chosen, the validation–test gap narrows substantially.
Homophily is the strongest predictor: Adjusted homophily yields a Spearman \(\rho = -0.43\) with the RDL–DFS performance gap.
Over-tuning effect: Larger search budgets can degrade test performance; Relatron's macro-selection mitigates this.
Validation is unreliable: Under temporal splits, the configuration selected by validation diverges significantly from the test-optimal configuration.

Highlights & Insights

The underestimated value of DFS: On suitable tasks, DFS can outperform sophisticated GNNs outright; the key is matching the method to task properties.
Theoretical explanation for the homophily-driven RDL advantage: Under low homophily, linear aggregation conflates positive and negative signals, whereas RDL can learn relational weights that flip contribution signs.
Loss landscape post-selection serves as a practical generalization metric transferable to other AutoML settings.

Limitations & Future Work

Task embedding correlations are overall modest (Kendall \(\tau\) at most 0.163), limiting the effectiveness of transfer HPO.
Foundation models (e.g., KumoRFM) are not included.
The performance bank is limited in scale (< 20 tasks); meta-learning would benefit from larger coverage.
Loss landscape metrics are applicable only to within-family comparisons.

vs. KumoRFM: Relational foundation models achieve strong performance, but implementation details are not publicly available. Relatron targets efficient from-scratch training scenarios.
vs. Autotransfer: Fisher information matrix-based task embeddings are computationally expensive and perform poorly on RDB tasks.
vs. Griffin: Cross-table attention frequently underperforms GNNs.

Rating

Novelty: ⭐⭐⭐⭐ The definition of RDB task homophily is novel, though the overall methodological framework follows standard meta-learning.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 17 tasks, large-scale architecture search, performance bank, and multi-faceted ablations — highly comprehensive.
Writing Quality: ⭐⭐⭐⭐ Clear structure with in-depth theoretical analysis.
Value: ⭐⭐⭐⭐⭐ Addresses a critical practical pain point in RDB ML; the performance bank has long-term research value.