Robust Tabular Foundation Models

Conference: AAAI 2026 arXiv: 2512.03307 Code: To be confirmed Area: Self-Supervised Learning / Tabular Foundation Models Keywords: tabular foundation model, adversarial training, synthetic data, distributionally robust optimization

TL;DR

This paper proposes RTFM, a model-agnostic adversarial training framework that performs min-max optimization over the parameter space of a synthetic data generator: an adversary searches for generator configurations that maximize the "optimality gap" between a tabular foundation model (TFM) and strong classical baselines such as gradient-boosted trees, and the TFM is then fine-tuned to close that gap. Using fewer than 100,000 additional synthetic datasets, RTFM significantly improves TabPFN V2 across multiple tabular benchmarks.

Background & Motivation

The Struggle of Deep Learning on Tabular Data

Despite remarkable successes in computer vision and natural language processing, deep learning has long struggled to surpass gradient-boosted trees (XGBoost, CatBoost, etc.) on structured tabular data. Multiple large-scale benchmark studies have confirmed this gap, spurring a wave of novel deep learning approaches for tabular tasks.

The Rise of Tabular Foundation Models (TFMs)

Models such as TabPFN adopt an in-context learning (ICL) paradigm: the model receives labeled training examples and test samples as a single input sequence and predicts for the test samples in one forward pass, in milliseconds and without any gradient updates. The core training strategy is pretraining on large volumes of synthetic datasets generated by structural causal models (SCMs). While TabPFN V2 already outperforms tree models on many datasets, it still lags behind on certain dataset types.

Core Insight: The Limitation of Fixed Priors

Existing TFMs (TabPFN, Mitra, TabICL) sample SCM parameters from a fixed prior distribution to generate training data. However, a fixed prior inevitably underrepresents certain regions of the parameter space — for instance, specific feature dimensionalities, proportions of categorical features, or degrees of nonlinearity — leading to performance degradation on real-world datasets sharing similar characteristics.

The key insight of this paper is that, since the data generator's parameters are explicitly parameterizable, the training process can be examined through the lens of adversarial robustness: an adversary actively seeks regions of the parameter space where the model performs worst, and training is then concentrated on those regions.

Method

Overall Architecture

RTFM is a two-phase iterative optimization framework (as illustrated in Figure 1), consisting of a maximization phase (parameter search) and a minimization phase (model training), alternating until convergence:

  1. Maximization Phase: The model weights \(\mathbf{W}\) are frozen, and a black-box optimizer (Optuna + TPE) searches the SCM parameter space \(\mathcal{P}\) for configurations that maximize the optimality gap.
  2. Minimization Phase: Based on the discovered parameters and their optimality gaps, a softmax sampling distribution \(Q\) is constructed; training data is generated according to \(Q\) and used to fine-tune the model to reduce the gap.
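The alternating loop can be sketched end-to-end with toy stand-ins (a scalar "model", random search instead of TPE, and a fabricated gap function). Nothing below is from the paper's implementation; it only conveys the control flow:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_gap(w, theta):
    """Toy stand-in for the optimality-gap estimate: pretend the
    model (a scalar w) is weakest where theta lies far from w."""
    return abs(theta - w)

def rtfm_loop(w=0.0, epochs=5, n_trials=20, lr=0.3):
    for _ in range(epochs):
        # Maximization: model frozen; search the parameter space
        # (the paper uses Optuna's TPE, here plain random search).
        thetas = rng.uniform(-2.0, 2.0, size=n_trials)
        gaps = np.array([estimate_gap(w, t) for t in thetas])

        # Build the softmax sampling distribution Q over candidates
        # (eta fixed here; RTFM solves it from the entropy constraint).
        q = np.exp(5.0 * gaps)
        q /= q.sum()

        # Minimization: "fine-tune" on configurations sampled from Q
        # (a crude update step stands in for actual training).
        for _ in range(50):
            theta = rng.choice(thetas, p=q)
            w += lr * 0.1 * (theta - w)
    return w

final_w = rtfm_loop()
```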

Definition of the Optimality Gap

Conventional adversarial training directly maximizes the model's loss, which risks steering optimization toward regions where no model can learn effectively — not a meaningful target for improvement. The central innovation of RTFM is to maximize the optimality gap rather than the absolute loss:

\[\delta_\theta(\mathbf{W}) = \mathbb{E}_{\phi \sim p(\Phi;\theta)}\left[\mathcal{L}_{PFN}(\mathbf{W};\phi) - H_\phi(Z_y|Z_x)\right]\]

where \(H_\phi(Z_y|Z_x)\) is the conditional entropy of the label given the features, i.e., the cross-entropy loss achieved by the Bayes-optimal predictor and a lower bound for any model. Since this conditional entropy is intractable in practice, the paper approximates it by the minimum cross-entropy loss among several strong baseline models (XGBoost, CatBoost, Random Forest, etc.):

\[\widehat{\delta}_\theta(\mathbf{W}) = \mathbb{E}_{\phi}\left[\mathcal{L}_{PFN}(\mathbf{W};\phi) - \min_{k \in [e]} \mathcal{L}(f_k;\phi)\right]\]

Because the best baseline's loss can only overestimate the conditional entropy, \(\widehat{\delta}_\theta(\mathbf{W})\) is a lower bound on the true optimality gap; a positive value therefore certifies genuine room for improvement.
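As a numeric illustration of the estimator (all loss values below are made up): the lower bound is the TFM's loss minus the best baseline loss, averaged over the datasets sampled for one configuration \(\theta\):

```python
import numpy as np

# Rows: synthetic datasets drawn from one SCM configuration theta.
# Columns of baseline_losses: the e baseline models.
tfm_losses = np.array([0.92, 0.61, 1.10, 0.75])       # L_PFN per dataset
baseline_losses = np.array([
    [0.85, 0.80, 0.95],
    [0.70, 0.55, 0.66],
    [1.02, 0.99, 1.05],
    [0.60, 0.58, 0.71],
])

# The min over baselines upper-bounds H(Z_y|Z_x), so the difference
# lower-bounds the optimality gap.
gap_hat = float(np.mean(tfm_losses - baseline_losses.min(axis=1)))
```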

Distributionally Robust Optimization (DRO) Formulation

Maximizing over a single parameter configuration is prone to overfitting. The paper elevates the problem to a distributionally robust optimization framework: the adversary selects a distribution \(Q\) over the parameter space rather than a single configuration, subject to a minimum entropy constraint \(H(Q) \geq H_{min}\) to prevent the distribution from collapsing to a point mass.

The paper proves (Appendix C) that the optimal solution to the DRO problem takes the form of a softmax distribution:

\[q_i^* \propto \exp(\eta \cdot \widehat{\delta}_{\theta_i}(\mathbf{W}))\]

The temperature parameter \(\eta\) is uniquely determined by \(H_{min}\) and the optimality gaps of candidate parameters, and can be efficiently solved via one-dimensional search (e.g., binary search). In practice, \(H_{min} = c \log(n_{trials})\) is used, where \(c \in (0,1)\) is a hyperparameter.
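The temperature solve can be sketched as a bisection on \(\eta\): the entropy of the softmax distribution decreases monotonically from \(\log n\) (at \(\eta = 0\)) toward 0, so the largest \(\eta\) satisfying \(H(Q) \geq H_{min}\) is found by a one-dimensional search. Function names below are illustrative, not the paper's:

```python
import numpy as np

def softmax_dist(gaps, eta):
    z = eta * np.asarray(gaps, dtype=float)
    z -= z.max()                      # numerical stability
    q = np.exp(z)
    return q / q.sum()

def entropy(q):
    q = q[q > 0]                      # convention: 0 * log 0 = 0
    return float(-(q * np.log(q)).sum())

def solve_eta(gaps, h_min, eta_hi=1e6, iters=200):
    """Largest eta with H(softmax(eta * gaps)) >= h_min, by bisection."""
    lo, hi = 0.0, eta_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(softmax_dist(gaps, mid)) >= h_min:
            lo = mid                  # entropy still high enough: raise eta
        else:
            hi = mid
    return lo

gaps = [0.30, 0.10, 0.05, 0.02]       # toy optimality-gap estimates
h_min = 0.8 * np.log(len(gaps))       # H_min = c * log(n_trials), c = 0.8
eta = solve_eta(gaps, h_min)
q_star = softmax_dist(gaps, eta)
```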

Implementation of the Maximization Phase

  • The Tree-structured Parzen Estimator (TPE) from the Optuna framework is used as the black-box optimizer, conducting \(n_{trials}=100\) search trials.
  • For each proposed parameter configuration \(\theta_i\), \(n_{ds}=20\) synthetic datasets are sampled and evaluated against \(e=7\) baseline models; the average optimality gap is computed.
  • A key speedup: fitting each (dataset, baseline model) pair is fully independent, enabling parallelization across \(n_{ds} \times e = 140\) CPU cores, reducing each trial to a matter of seconds.
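The independence of (dataset, baseline) pairs can be sketched with a standard-library executor; the scoring function below is a dummy placeholder for actually fitting a baseline, and a real run would use process-level parallelism across CPU cores rather than threads:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

import numpy as np

def fit_and_score(job):
    """Placeholder for: fit one baseline model on one synthetic
    dataset and return its cross-entropy loss."""
    ds_idx, model_idx = job
    rng = np.random.default_rng(1000 * ds_idx + model_idx)
    return ds_idx, model_idx, float(rng.uniform(0.3, 1.2))

n_ds, e = 20, 7
jobs = list(product(range(n_ds), range(e)))    # 140 independent fits

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fit_and_score, jobs))

losses = np.empty((n_ds, e))
for ds_idx, model_idx, loss in results:
    losses[ds_idx, model_idx] = loss

best_baseline = losses.min(axis=1)             # per-dataset proxy for H
```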

Implementation of the Minimization Phase

  • The softmax sampling distribution \(Q\) is constructed from the pairs \(\{(\theta_i, \widehat{\delta}_{\theta_i})\}\) obtained in the maximization phase.
  • Each training batch samples \(\theta_i\) from \(Q\), then samples a generator and dataset from \(p(\Phi;\theta_i)\).
  • Training uses a learning rate of \(1 \times 10^{-5}\), batch size 64, and 3,000 steps per round.
  • The full max-min cycle runs for 30 epochs.
  • Self-distillation mechanism: After the 5th epoch, the original TabPFN model is added to the baseline pool to prevent catastrophic forgetting of its original capabilities.

Parameter Space

SCMs are implemented as randomly initialized MLPs. Tunable parameters include the number of layers \(l\), hidden layer size \(h\), activation function \(a\), proportion of categorical features \(r_{cat}\), and others. The distributions of these hyperparameters are individually parameterized (e.g., \(r_{cat} \sim \text{TruncNorm}(\mu_{r_{cat}}, 0, 1)\)), and all parameterized means form the overall parameter vector \(\theta\).
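A minimal version of such a generator, assuming a tanh MLP and a two-class readout (the paper's SCM prior is richer, e.g. with categorical-feature handling, so this only conveys the shape of the idea):

```python
import numpy as np

def sample_scm_dataset(theta, n_samples=256, n_features=5, seed=0):
    """Draw one synthetic binary-classification dataset from a
    randomly initialized MLP acting as the structural model."""
    rng = np.random.default_rng(seed)
    n_layers, hidden = theta["n_layers"], theta["hidden"]
    X = rng.normal(size=(n_samples, n_features))
    z = X
    for _ in range(n_layers):                  # random MLP forward pass
        w = rng.normal(scale=1.0 / np.sqrt(z.shape[1]),
                       size=(z.shape[1], hidden))
        z = np.tanh(z @ w)                     # activation a (fixed to tanh)
    logits = z @ rng.normal(size=(hidden, 2))  # label head
    y = (logits[:, 1] > logits[:, 0]).astype(int)
    return X, y

theta = {"n_layers": 3, "hidden": 16}
X, y = sample_scm_dataset(theta)
```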

Key Experimental Results

Table 1: TabPertNet Benchmark Results

| Metric | Log. Reg. | MLP | Random Forest | CatBoost | XGBoost | TabPFN | TabPFN (RTFM) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mean Rank AUC | 5.1 | 4.6 | 4.0 | 3.8 | 4.6 | 3.2 | 2.7 |
| Mean Norm. AUC | 0.4253 | 0.5005 | 0.6481 | 0.6663 | 0.5222 | 0.7483 | 0.8167 |
| Rank-1 Wins | 1 | 8 | 5 | 7 | 5 | 11 | 17 |

On TabPertNet, RTFM improves TabPFN's mean normalized AUC from 0.7483 to 0.8167 (an absolute gain of 0.068) and increases Rank-1 wins from 11 to 17.

Table 2: TabArena Benchmark Results

| Metric | Log. Reg. | MLP | Random Forest | CatBoost | XGBoost | TabPFN | TabPFN (RTFM) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mean Rank AUC OVO | 4.9 | 6.3 | 4.8 | 3.4 | 4.5 | 2.2 | 1.9 |
| Mean Norm. AUC OVO | 0.4277 | 0.1801 | 0.5761 | 0.7749 | 0.5918 | 0.9031 | 0.9298 |
| Rank-1 Wins | 2 | 0 | 0 | 2 | 0 | 5 | 12 |

RTFM similarly leads across all metrics on TabArena, with Rank-1 wins jumping from 5 to 12. Wilcoxon signed-rank tests confirm that the improvements over the original TabPFN are statistically significant on both benchmarks (TabPertNet: \(p=0.0023\), TabArena: \(p=0.0103\)).

Highlights & Insights

  1. Optimality Gap vs. Absolute Loss: The most critical design insight — adversarial training should maximize the gap relative to the achievable optimum, not the absolute loss. This avoids wasting training resources on data distributions that are inherently unlearnable.

  2. Exceptional Synthetic Data Efficiency: Significant improvements are achieved using only 90,000 additional synthetic datasets — less than 1% of TabPFN's original pretraining data volume — demonstrating that precisely targeting weaknesses is far more efficient than large-scale data accumulation.

  3. Model Agnosticism: The RTFM framework is independent of any specific model architecture and is in principle applicable to any TFM (e.g., Mitra, TabICL) as well as extensible to regression tasks.

  4. "Jump" Phenomenon: On datasets where TabPFN originally trailed tree models (rank > 2), RTFM causes a direct leap to rank-1 performance in approximately 20–21% of cases, indicating that adversarial training effectively remedies specific weaknesses.

  5. Self-Distillation Against Forgetting: Adding the original model to the baseline pool after the 5th epoch is an elegant design choice that mitigates the performance regression sometimes induced by adversarial training.

Limitations & Future Work

  1. Classification Tasks Only: Although the framework is theoretically extensible to regression, experiments are conducted exclusively on classification benchmarks, leaving regression performance unknown.
  2. SCMs Restricted to MLPs: Only MLP-based data generators are employed; tree-structured SCMs are not included, which may limit the diversity of generated data and coverage of regions where tree models excel.
  3. Non-Negligible Computational Cost: The maximization phase requires fitting \(n_{trials} \times n_{ds} \times e = 14{,}000\) baseline models per round. Although highly parallelizable, this still demands substantial resources — 256 CPU cores and an A100 GPU.
  4. Bias in Optimality Gap Estimation: Approximating the Bayes-optimal value using the best performance of a finite set of baselines is inherently a lower-bound estimate, potentially underestimating the true gap and causing certain genuinely weak regions to be overlooked.
  5. Validation Limited to TabPFN V2: The generalizability of the framework to other TFMs (e.g., Mitra, TabICL) has not been empirically verified.

Related Work

  • TabPFN series (Hollmann et al. 2023, 2025): Pioneers the prior-data fitted network (PFN) paradigm for tabular foundation models; RTFM builds directly on this framework with adversarial fine-tuning.
  • Wu & Bergman 2025: A concurrent work that also adjusts SCM weights in a GAN-style training loop, but focuses narrowly on weight adjustment for a specific class of SCMs. RTFM provides a more general adversarial optimization framework over the parameter space.
  • DRO (Distributionally Robust Optimization) (Rahimian & Mehrotra 2019): One of the theoretical pillars of RTFM. Applying DRO to synthetic data generation represents a novel contribution to the field.
  • Madry et al. 2019: The seminal adversarial training work; RTFM transfers its core idea from adversarial perturbations in input space to adversarial search in the data generator's parameter space.

Takeaway: The idea of conducting adversarial search over the parameter space of synthetic data generators is broadly applicable — not limited to tabular data. Any foundation model pretrained on synthetic data could potentially adopt this framework to identify and strengthen its weak spots.

Rating

⭐⭐⭐⭐ (4/5)

Rationale: The paper presents a theoretically grounded and practically efficient adversarial training framework for TFMs. The DRO formulation offers a solid theoretical contribution, and the experimental improvements are statistically significant. Points are deducted primarily for the narrow experimental scope (classification only, TabPFN only, MLP-SCM only), which provides insufficient empirical support for the claimed generality of the framework.