MultiTab: A Scalable Foundation for Multitask Learning on Tabular Data¶
Conference: AAAI 2026 · arXiv: 2511.09970 · Code: Armanfard-Lab/MultiTab · Area: Recommender Systems / Tabular Data · Keywords: Multitask Learning, Transformer, Tabular Data, Masked Attention, Synthetic Data Benchmark
TL;DR¶
This paper proposes MultiTab-Net, the first multitask Transformer architecture for tabular data. It alleviates task competition via a multitask masked attention mechanism, and substantially outperforms both existing MLP-based multitask models and single-task Transformer models across datasets from the recommendation, census, and physics domains.
Background & Motivation¶
The Gap Between Tabular Data and Multitask Learning¶
Tabular data is the most abundant data type in the world, widely used in finance, healthcare, and e-commerce. Many real-world scenarios naturally require simultaneous prediction of multiple related targets:
Healthcare: The same patient record can be used to predict risks of both diabetes and hypertension.
E-commerce: Predicting not only clicks but also add-to-cart and purchase events.
Recommender Systems: Jointly optimizing CTR and CVR.
Multitask Learning (MTL) leverages inter-task correlations through shared representations to improve generalization and efficiency. However, existing tabular MTL work suffers from two limitations:
Narrow scope: Prior work mainly focuses on large-scale recommender systems (MMoE, PLE, STEM), with insufficient exploration of broader tabular scenarios.
Backbone limitations: Nearly all existing methods are MLP-based; MLPs lack explicit feature-interaction modeling and scale poorly in data-rich settings.
Why Transformers?¶
- MLPs implicitly learn feature interactions through fully connected layers, lacking explicit interaction modeling.
- Transformers dynamically model feature-to-feature and sample-to-sample dependencies via self-attention.
- The advantages of Transformers in large-data regimes have been extensively validated in NLP and CV.
- Tabular Transformers (FT-T, SAINT) have demonstrated strong capabilities for modeling inter-feature and inter-sample relationships.
Core Challenge: Task Competition¶
Extending Transformers to multitask settings introduces task tokens that alter the structure of the attention matrix. Unconstrained interaction among task tokens may cause the seesaw phenomenon — dominant tasks monopolize shared capacity, degrading overall performance.
Method¶
Overall Architecture¶
MultiTab-Net builds upon SAINT, with two core innovations:

1. Multi-token design: Each task is assigned a dedicated task token, rather than sharing a single CLS token.
2. Multitask masked attention: Task token interactions are selectively restricted in inter-feature attention to alleviate task competition.
Input processing pipeline:

- \(d\) feature tokens + \(t\) task tokens → each projected to vectors of dimension \(e\) via embedding networks.
- Concatenated to form \(x\in\mathbb{R}^{(d+t)\times e}\) → passed through \(N\) encoder blocks.
- Each encoder block consists of: Inter-Feature Attention + Inter-Sample Attention + FFN.
- The final \(t\) task tokens are independently processed by task-specific MLPs to produce predictions.
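The following PyTorch sketch illustrates this pipeline. It is a minimal sketch, not the official implementation: the SAINT-style encoder blocks (inter-feature + inter-sample attention + FFN) are replaced by a plain Transformer encoder stand-in, and all class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class MultiTabSketch(nn.Module):
    """Minimal sketch of the MultiTab-Net token pipeline (hypothetical names).

    d scalar features are embedded per-feature, t learnable task tokens are
    appended, the sequence passes through N encoder blocks, and each final
    task token feeds a task-specific head.
    """

    def __init__(self, d_features: int, n_tasks: int, e: int = 32, n_blocks: int = 2):
        super().__init__()
        # One embedding network per feature (here: a linear map per scalar feature).
        self.feature_embed = nn.ModuleList(nn.Linear(1, e) for _ in range(d_features))
        # One learnable token per task, instead of a single shared CLS token.
        self.task_tokens = nn.Parameter(torch.randn(n_tasks, e))
        # Stand-in for SAINT-style blocks (inter-feature + inter-sample attention + FFN).
        layer = nn.TransformerEncoderLayer(d_model=e, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_blocks)
        # Task-specific heads operating on the final task tokens.
        self.heads = nn.ModuleList(nn.Linear(e, 1) for _ in range(n_tasks))

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        b, d = x.shape
        feats = torch.stack(
            [emb(x[:, i : i + 1]) for i, emb in enumerate(self.feature_embed)], dim=1
        )  # (b, d, e)
        tasks = self.task_tokens.unsqueeze(0).expand(b, -1, -1)  # (b, t, e)
        h = self.encoder(torch.cat([feats, tasks], dim=1))  # (b, d + t, e)
        return [head(h[:, d + i]) for i, head in enumerate(self.heads)]
```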
Key Designs¶
1. Multitask Masked Attention¶
Function: Selectively suppresses interactions between certain tokens in inter-feature attention.
Mechanism: A mask \(M_A\) is added to the pre-activation attention scores of attention matrix \(A_i\); in the standard scaled dot-product form this reads

\(A_i=\operatorname{softmax}\!\left(\frac{Q_iK_i^T}{\sqrt{e}}+M_A\right)\)

\(M_A\) assigns \(-\infty\) to masked positions, which become zero after the softmax.
Three candidate masking strategies:
| Strategy | Description | Effect |
|---|---|---|
| F↛T | Feature-to-task attention is masked | Unstable; sometimes degrades performance |
| T↛T | Cross-task token attention is masked | Best; consistently optimal |
| F↛T & T↛T | Combination of both | Sub-optimal |
Design Motivation: T↛T masking prevents task tokens from directly influencing each other, thereby alleviating task competition. However, preserving T→F (task tokens attending to features) is necessary, as tasks must extract information from features. Masking F→T yields unstable results, suggesting that feature tokens also need to be aware of task context.
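A minimal sketch of how the T↛T mask could be constructed and applied, assuming token order [\(d\) feature tokens, then \(t\) task tokens] and that each task token still attends to itself; both assumptions are mine, not confirmed by the paper:

```python
import torch

def build_tt_mask(d: int, t: int) -> torch.Tensor:
    """Additive attention mask for the T↛T strategy (sketch, not official code)."""
    n = d + t
    mask = torch.zeros(n, n)
    task = torch.arange(d, n)
    # Block cross-task attention: task token i may not attend to task token j != i.
    mask[task.unsqueeze(1), task.unsqueeze(0)] = float("-inf")
    mask[task, task] = 0.0  # assumed: each task token still attends to itself
    return mask

def masked_attention(q, k, v, mask):
    # Pre-activation scores plus mask: A = softmax(Q K^T / sqrt(e) + M_A).
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores + mask, dim=-1) @ v
```

Since the mask is just a constant matrix added to the attention scores, the strategy adds essentially no computational overhead, consistent with the observation in the Highlights section below.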
2. Multi-Token vs. Single-Token¶
Function: Assigns each task an independent learnable task token, rather than a shared BERT-style CLS token.
Core advantages (confirmed by ablation):

- With 2 tasks, the gap between single- and multi-token is small.
- With 8 tasks (Higgs dataset), multi-token + T↛T masking achieves \(\Delta_m = 1.23\%\), while the best single-token configuration yields \(\Delta_m = -6.35\%\).
- A single shared token cannot adequately capture task-specific information across multiple tasks.
3. MultiTab-Bench: Synthetic Multitask Dataset Generator¶
Function: Generates synthetic multitask tabular data with controllable task correlation, task complexity, and number of tasks.
Mechanism: Weight matrices are constructed via eigendecomposition. Given a desired correlation matrix \(\mathbf{P}\), the eigendecomposition \(\mathbf{P}=\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^T\) is performed, and the weight matrix \(\mathbf{W}=\mathbf{Q}\boldsymbol{\Lambda}^{1/2}\mathbf{U}^T\) is constructed, where \(\mathbf{U}\) consists of orthonormal vectors. It can be shown that the cosine similarity between weight vectors of different tasks equals exactly \(P_{ij}\).
Label generation: \(y_i=\sum_{k=1}^{d_i}(\mathbf{w}_i^T\mathbf{x})^k+\epsilon_i\)
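A NumPy sketch of this construction. The function name, argument defaults, and the standard-normal input distribution are assumptions for illustration; the official generator may differ in such details:

```python
import numpy as np

def make_multitask_data(P, degrees, n_samples=10_000, d=16, noise=0.1, seed=0):
    """Sketch of the MultiTab-Bench construction described above.

    P       : (t, t) desired task-correlation matrix (PSD, unit diagonal).
    degrees : per-task polynomial degree d_i, controlling task complexity.
    Requires d >= t so that t orthonormal vectors exist in R^d.
    """
    rng = np.random.default_rng(seed)
    t = P.shape[0]
    # P = Q Λ Q^T, then W = Q Λ^{1/2} U^T with orthonormal U, so W W^T = P
    # and the cosine similarity between rows w_i, w_j equals P_ij exactly.
    eigvals, Q = np.linalg.eigh(P)
    U = np.linalg.qr(rng.standard_normal((d, t)))[0]  # (d, t) orthonormal columns
    W = Q @ np.diag(np.sqrt(np.clip(eigvals, 0.0, None))) @ U.T  # (t, d)

    X = rng.standard_normal((n_samples, d))
    ys = []
    for i in range(t):
        z = X @ W[i]
        # y_i = sum_{k=1}^{d_i} (w_i^T x)^k + eps_i, with task-specific noise
        ys.append(sum(z**k for k in range(1, degrees[i] + 1))
                  + noise * rng.standard_normal(n_samples))
    return X, np.stack(ys, axis=1), W
```

As a quick sanity check, for P = [[1, 0.5], [0.5, 1]] the returned W satisfies np.allclose(W @ W.T, P), i.e. the pairwise cosine similarities match the requested correlations.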
Advantages over MMoE's synthetic data:

- Supports an arbitrary number of tasks (MMoE's generator is limited to 2).
- Supports different polynomial degrees per task (controlling relative difficulty).
- Supports task-specific noise.
Loss & Training¶
- Binary/multi-class cross-entropy for classification tasks; RMSE for regression tasks.
- Adam optimizer with weight decay.
- Early stopping: monitors average AUC for classification tasks and average explained variance (EV) for regression tasks.
- All results are averaged over 5 random seeds.
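A minimal sketch of the combined objective implied by the points above; uniform task weighting and the use of logits for binary classification are assumptions, since the summary does not specify how per-task losses are aggregated:

```python
import torch
import torch.nn.functional as F

def multitask_loss(preds, targets, task_types):
    """Sum of per-task losses: BCE for (binary) classification, RMSE for regression."""
    total = 0.0
    for pred, y, kind in zip(preds, targets, task_types):
        if kind == "classification":
            total = total + F.binary_cross_entropy_with_logits(pred.squeeze(-1), y.float())
        else:  # regression
            total = total + torch.sqrt(F.mse_loss(pred.squeeze(-1), y))
    return total
```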
Key Experimental Results¶
Main Results¶
| Model | AliExpress \(\Delta_m\)↑ | ACS Income \(\Delta_m\)↑ | Higgs \(\Delta_m\)↑ |
|---|---|---|---|
| STL (baseline) | 0.0000 | 0.0000 | 0.0000 |
| MTL | 0.1129 | 0.0612 | -0.6531 |
| MMoE | 0.0873 | 0.0893 | -0.3525 |
| PLE | 0.2778 | 0.0892 | -0.0314 |
| STEM | 0.1763 | 0.0725 | 0.0571 |
| SAINT | 0.1146 | 0.0948 | -1.6514 |
| MultiTab-Net | 0.5512 | 0.1064 | 1.2337 |
MultiTab-Net achieves the highest multitask gain across all datasets. Notably, on the 8-task Higgs dataset, most baselines yield negative \(\Delta_m\) (i.e., MTL hurts performance relative to STL), whereas MultiTab-Net achieves a positive gain of 1.23%.
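For reference, \(\Delta_m\) is not defined in this note; it presumably follows the standard average per-task relative gain over single-task learning (STL), which is why the STL baseline sits at exactly 0:

\(\Delta_m=\frac{100\%}{T}\sum_{i=1}^{T}(-1)^{l_i}\frac{M_{m,i}-M_{\mathrm{STL},i}}{M_{\mathrm{STL},i}}\)

where \(M_{m,i}\) is the multitask model's metric on task \(i\) and \(l_i=1\) if lower is better for that metric (0 otherwise).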
Ablation Study¶
| Configuration | AliExpress \(\Delta_m\) | ACS Income \(\Delta_m\) | Higgs \(\Delta_m\) |
|---|---|---|---|
| Single-token, no mask | 0.2669 | 0.0893 | -6.3491 |
| Multi-token, no mask | 0.2579 | 0.0783 | 1.1182 |
| Multi-token, F↛T | 0.3698 | 0.0951 | 0.9626 |
| Multi-token, T↛T | 0.5512 | 0.1064 | 1.2337 |
| Multi-token, F↛T & T↛T | 0.2975 | 0.1007 | 1.0197 |
Key Finding: T↛T masking is consistently optimal across all datasets.
Computational Efficiency¶
| Model | AliExpress Params/FLOPs (M) | ACS Income Params/FLOPs (M) | Higgs Params/FLOPs (M) |
|---|---|---|---|
| SAINT (single-task Transformer baseline) | 3.62/9.70 | 0.49/1.35 | 5.50/15.02 |
| STEM (recent MTL baseline) | 1.55/3.11 | 0.69/1.29 | 1.25/2.51 |
| MultiTab-Net | 1.80/4.85 | 0.28/0.77 | 0.70/1.90 |
Compared to SAINT, MultiTab-Net achieves approximately 2× and 8× efficiency improvements on ACS Income and Higgs, respectively. The saving grows roughly in proportion to the number of tasks, since one shared multitask model replaces a separate single-task SAINT model per task.
Key Findings¶
- The advantage of the multi-token design grows with the number of tasks (negligible at 2 tasks; substantial at 8 tasks).
- Synthetic data experiments confirm MultiTab-Net's consistent superiority across varying task correlations, complexities, and task counts.
- Under non-uniform task complexity settings, MultiTab-Net's advantage becomes even more pronounced.
Highlights & Insights¶
- Filling a gap: MultiTab-Net is the first multitask Transformer for tabular data, bringing the benefits of attention mechanisms to the intersection of MTL and tabular learning.
- Simple yet effective masking: The intuition behind T↛T masking is straightforward — prevent tasks from interfering with each other — and it introduces virtually no additional computational overhead.
- Practical value of MultiTab-Bench: Supporting arbitrary task counts, adjustable correlations, and difficulty levels, it provides a standardized synthetic benchmark for MTL research.
- Inspired by STEM: STEM constrains cross-task updates during backpropagation via stop-gradient; MultiTab-Net achieves a more direct form of task isolation at the attention level during the forward pass.
Limitations & Future Work¶
- Limited dataset scale and diversity: Only 3 public datasets are evaluated, with task types primarily restricted to classification and regression; tasks such as ranking are not explored.
- Unfair comparison with XGBoost: XGBoost has limited native multitask support; the multioutput variant is only applicable when all tasks share the same output type.
- Static masking strategy: T↛T masking uniformly suppresses all cross-task interactions, without considering that information sharing between certain task pairs may be beneficial.
- Dynamic masking not explored: Adaptive masking learned from task correlations remains an open direction.
- Scalability unverified: Performance with more than 8 tasks remains unknown.
Related Work & Insights¶
- MMoE (Ma et al. 2018): Multi-gate mixture-of-experts architecture; pioneered the tabular MTL direction.
- PLE (Tang et al. 2020): Shared and task-specific experts; mitigates the seesaw phenomenon.
- STEM (Su et al. 2024): Stop-gradient constraints; directly inspired the masked attention design in MultiTab-Net.
- SAINT (Somepalli et al. 2021): Inter-sample attention; serves as the architectural foundation of MultiTab-Net.
- Insight: Task isolation at the attention level may represent a generalizable paradigm for multitask learning, worth exploring in CV and NLP settings.
Rating¶
- Novelty: ⭐⭐⭐⭐ (First multitask Transformer for tabular data; masking design is simple but effective)
- Experimental Thoroughness: ⭐⭐⭐⭐ (3 public datasets + synthetic data; greater diversity would strengthen the work)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure; mathematical derivation in the MultiTab-Bench section is rigorous)
- Value: ⭐⭐⭐⭐ (Fills an important gap; open-source code; synthetic benchmark has independent value)