# Hybrid Autoencoders for Tabular Data: Leveraging Model-Based Augmentation in Low-Label Settings
**Conference:** NeurIPS 2025 · **arXiv:** 2511.06961 · **Code:** None · **Area:** Self-Supervised Learning / Tabular Data · **Keywords:** tabular data, self-supervised learning, hybrid autoencoder, oblivious soft decision tree, low-label learning
## TL;DR
This paper proposes TANDEM (Tree-And-Neural Dual Encoder Model), a hybrid autoencoder architecture that jointly trains a neural network encoder and an Oblivious Soft Decision Tree (OSDT) encoder, and introduces a sample-level stochastic gating network as a learnable data augmentation mechanism. TANDEM achieves superior performance over strong baselines—including tree-based and deep learning methods—in low-label tabular data settings.
## Background & Motivation
Background: Tabular data is the dominant data format in domains such as healthcare and finance. Gradient-boosted decision trees (GBDT/XGBoost/CatBoost) typically outperform deep neural networks on tabular data and remain the preferred choice in practice.
Limitations of Prior Work: (1) Neural networks exhibit spectral bias, tending to fit smooth low-frequency functions and struggling to capture the sharp, high-frequency patterns common in tabular data; (2) self-supervised learning (SSL) for tabular data faces a data augmentation challenge: common augmentations such as noise injection or feature-value swapping can easily destroy critical feature relationships; (3) masked autoencoders (MAE), designed for homogeneous modalities such as images and text, likewise struggle on heterogeneous tabular features.
Key Challenge: SSL is particularly valuable in low-label settings, yet effective augmentation strategies for tabular data are lacking, and conventional augmentation methods tend to generate unrealistic samples.
Goal: To learn effective self-supervised representations on low-label tabular data that surpass conventional methods on downstream classification and regression tasks.
Key Insight: Replace data augmentation with model-based augmentation—leveraging the inductive biases of tree models to guide neural networks toward better representations.
Core Idea: Use an OSDT encoder as a "model-based augmentor" during training, transferring the tabular-friendly inductive biases of tree models to the neural network encoder via a shared decoder and alignment losses.
## Method
### Overall Architecture
TANDEM is a dual-encoder, shared-decoder masked autoencoder:

- Input \(x \in \mathbb{R}^D\) is first passed through a sample-level stochastic gating network (STG) to produce a feature mask \(g(x) \in [0,1]^D\)
- The masked view \(\tilde{x} = x \odot g(x)\) is fed in parallel to (i) a fully connected neural encoder → \(z^{NN}\) and (ii) an OSDT ensemble encoder → \(z^{OSDT}\)
- A shared decoder \(h\) reconstructs \(\hat{x}^{NN}\) and \(\hat{x}^{OSDT}\) from the respective latent representations
- After pretraining, only the neural encoder and a lightweight classification/regression head are used at inference
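To make the data flow concrete, here is a minimal PyTorch sketch of the pretraining forward pass. Module names, layer widths, and the tanh parameterization of \(\mu(x)\) are illustrative assumptions rather than the authors' implementation, and for brevity a single gate feeds both encoders (the paper additionally equips the OSDT with per-layer gates). `OSDTEncoder` is sketched after the Key Designs list.

```python
import torch
import torch.nn as nn


class GatingNetwork(nn.Module):
    """Sample-level stochastic gates: g(x) = min(1, max(0, 0.5 + mu(x) + eps))."""

    def __init__(self, dim, sigma=0.5):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())  # mu(x); tanh range is an assumption
        self.sigma = sigma

    def forward(self, x):
        eps = torch.randn_like(x) * self.sigma if self.training else 0.0
        return torch.clamp(0.5 + self.mu(x) + eps, 0.0, 1.0)


class Tandem(nn.Module):
    """Dual-encoder, shared-decoder masked autoencoder (pretraining view)."""

    def __init__(self, dim, hidden=256, n_trees=16, depth=4):
        super().__init__()
        latent = 2 ** depth  # z^NN sized to match z^OSDT in R^{2^L} for the shared decoder
        self.gate = GatingNetwork(dim)
        self.nn_enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, latent))
        self.osdt_enc = OSDTEncoder(dim, n_trees=n_trees, depth=depth)  # sketched below
        self.decoder = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                     nn.Linear(hidden, dim))  # shared by both encoders

    def forward(self, x):
        x_tilde = x * self.gate(x)       # masked view x ⊙ g(x); simplification: one gate for both encoders
        z_nn = self.nn_enc(x_tilde)
        z_osdt = self.osdt_enc(x_tilde)  # mean leaf distribution, in R^{2^L}
        return self.decoder(z_nn), self.decoder(z_osdt), z_nn, z_osdt
```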
### Key Designs
- Oblivious Soft Decision Tree (OSDT) Encoder:
    - Function: Serves as a differentiable tree encoder that extracts structured representations from tabular data
    - Design Motivation: Tree models are naturally suited to tabular data, capable of capturing sharp high-frequency patterns and conditional feature interactions, thereby compensating for the spectral bias of neural networks
    - Mechanism: An ensemble of \(T\) oblivious decision trees of fixed depth \(L\), with a shared projection vector \(w_\ell\) at each layer. Soft routing probabilities are computed as \(p_{\text{leaf}}(x) = \prod_{\ell=1}^{L} [\sigma_\ell^+(x)]^{b_\ell} \cdot [\sigma_\ell^-(x)]^{1-b_\ell}\), and the final representation is the mean leaf distribution across all trees: \(z^{OSDT}(x) = \frac{1}{T}\sum_{t=1}^T f_t^{OSDT}(x) \in \mathbb{R}^{2^L}\) (see the encoder sketch after this list)
    - Novelty: The OSDT encoder is used only during training and discarded at inference, avoiding the generalization limitations of tree models
- Stochastic Gating Network (STG) as Sample-Level Augmentation:
    - Function: Learns a feature mask for each input sample, enabling sample-level feature selection
    - Design Motivation: Replaces conventional fixed augmentations (noise, swapping, etc.) with a learnable input transformation that preserves semantic structure
    - Mechanism: The gating network \(f_\theta(x)\) outputs parameters \(\mu(x)\); gates are sampled via a truncated Gaussian perturbation: \(g(x) = \max(0, \min(1, 0.5 + \mu(x) + \epsilon))\), \(\epsilon \sim \mathcal{N}(0, \sigma^2)\)
    - Novelty: The neural encoder uses a single global gate, whereas the OSDT encoder employs an independent gate \(g_\ell^{OSDT}(x)\) at each tree layer, enabling hierarchical feature selection
- Joint Training Objective (combined in the training-step sketch at the end of "Loss & Training" below):
    - Reconstruction loss: \(\mathcal{L}_{\text{recon}} = \frac{1}{N}\sum(\|x - \hat{x}^{OSDT}\|_2^2 + \|x - \hat{x}^{NN}\|_2^2)\)
    - Alignment loss: \(\mathcal{L}_{\text{align}} = \frac{1}{N}\sum\|\hat{x}^{OSDT} - \hat{x}^{NN}\|_2^2\) (reconstruction output consistency)
    - Latent Representation Similarity (LRS) loss: \(\mathcal{L}_{\text{LRS}} = \frac{1}{N}\sum(1 - \frac{\langle z^{NN}, z^{OSDT} \rangle}{\|z^{NN}\| \cdot \|z^{OSDT}\|})\) (cosine distance)
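To make the soft routing concrete, below is a minimal sketch of the `OSDTEncoder` referenced in the architecture sketch, following the NODE-style oblivious-tree construction. The routing temperature `tau`, the initialization, and learning an independent projection per tree and per layer are my assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn


class OSDTEncoder(nn.Module):
    """Ensemble of T oblivious soft decision trees of depth L.

    Obliviousness: each tree applies one shared split (w_l, b_l) to all nodes
    at depth l, so each of the 2^L leaf probabilities factors into a product
    of L binary routing decisions. The output is the mean leaf distribution
    over the T trees, a vector in R^{2^L}.
    """

    def __init__(self, dim, n_trees=16, depth=4, tau=1.0):
        super().__init__()
        self.w = nn.Parameter(0.1 * torch.randn(n_trees, depth, dim))  # per-tree, per-layer projections
        self.b = nn.Parameter(torch.zeros(n_trees, depth))             # per-layer thresholds
        self.tau = tau                                                 # routing temperature (assumption)

    def forward(self, x):                                    # x: (N, D)
        logits = torch.einsum("nd,tld->ntl", x, self.w) - self.b
        right = torch.sigmoid(logits / self.tau)             # sigma_l^+(x), shape (N, T, L)
        left = 1.0 - right                                   # sigma_l^-(x)
        leaf = x.new_ones(x.shape[0], right.shape[1], 1)     # (N, T, 1)
        for l in range(right.shape[2]):                      # multiply routings layer by layer
            step = torch.stack([left[..., l], right[..., l]], dim=-1)   # (N, T, 2)
            leaf = (leaf.unsqueeze(-1) * step.unsqueeze(-2)).flatten(-2)
        return leaf.mean(dim=1)                              # (N, 2^L), mean over trees
```

Because every layer's split is shared across all nodes at that depth, the \(2^L\) leaf probabilities factor into \(L\) pairwise products, which is exactly what the layer-by-layer loop computes without ever materializing individual tree nodes.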
### Loss & Training
- Pretraining for 100 epochs, batch size 128, RMSprop optimizer
- Hyperparameters selected via Optuna over 50 trials based on validation loss
- Downstream evaluation: a single-layer MLP head is trained on top of the frozen encoder for 25 epochs, after which encoder and head are fine-tuned jointly for another 25 epochs
- The gating network is frozen during fine-tuning
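A compact sketch of one pretraining step combining the three losses under the settings above. Equal loss weighting and the learning rate are assumptions (the paper tunes hyperparameters with Optuna); note that `F.mse_loss` averages over all elements, which matches the \(\frac{1}{N}\sum\|\cdot\|_2^2\) formulas up to a constant factor of \(D\).

```python
import torch
import torch.nn.functional as F


def pretrain_step(model, optimizer, x):
    """One TANDEM-style pretraining step (equal loss weights are an assumption)."""
    x_hat_nn, x_hat_osdt, z_nn, z_osdt = model(x)
    recon = F.mse_loss(x_hat_osdt, x) + F.mse_loss(x_hat_nn, x)     # L_recon
    align = F.mse_loss(x_hat_osdt, x_hat_nn)                        # L_align
    lrs = (1.0 - F.cosine_similarity(z_nn, z_osdt, dim=-1)).mean()  # L_LRS (cosine distance)
    loss = recon + align + lrs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Per the settings above: 100 epochs, batch size 128, RMSprop (lr is a placeholder).
# model = Tandem(dim=D); optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
```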
## Key Experimental Results

### Main Results
Classification (19 datasets, 400 labels):
| Method | Mean Accuracy | Mean Rank |
|---|---|---|
| MLogReg | 0.6380 | 6.16 |
| MLP | 0.6721 | 4.84 |
| XGBoost | 0.6706 | 4.47 |
| CatBoost | 0.6731 | 4.16 |
| TabPFN | 0.7012 | 2.56 |
| TANDEM | 0.7124 | 1.58 |
Regression (13 datasets, 400 labels):
| Method | Mean MSE | Mean Rank |
|---|---|---|
| CatBoost | 0.3318 | 4.00 |
| XGBoost | 0.3405 | 4.15 |
| MLP | 0.3877 | 4.38 |
| TANDEM | 0.3234 | 3.38 |
TANDEM achieves the best mean metric and best average rank on both classification and regression.
### Ablation Study
Classification ablation (400 labels):
| Variant | Mean Accuracy | Mean Rank |
|---|---|---|
| SS-AE (standard autoencoder) | 0.6815 | 4.45 |
| SS-AE + Gating | 0.6941 | 3.61 |
| OSDT AE + Gating (tree only) | 0.6600 | 4.71 |
| TANDEM (no gating) | 0.6966 | 2.92 |
| TANDEM (no LRS or alignment loss) | 0.6971 | 2.79 |
| TANDEM (full) | 0.7124 | 1.74 |
- Removing either encoder or the gating network consistently degrades performance
- The full TANDEM model is uniformly best
- The OSDT-only encoder variant performs worst (0.6600), indicating that tree models alone are insufficiently flexible as encoders
### Key Findings
- The dual-encoder architecture significantly outperforms single-encoder variants, validating the value of complementary inductive biases
- The learnable gating network is more effective as an augmentation mechanism than fixed augmentations
- TANDEM remains robust across labeled-sample budgets from 50 to 1,000
- Spectral analysis reveals that the two encoders capture distinct and complementary frequency components
- TANDEM achieves the best results even on classification benchmarks where TabPFN is strongest
## Highlights & Insights
- Model-based augmentation replacing data augmentation: The core insight is to inject tree model inductive biases into neural network training as a form of augmentation, rather than relying on unreliable tabular data augmentations
- Neural-network-only inference: The OSDT encoder is used only during training and incurs no inference overhead, maintaining flexible compatibility with downstream tasks
- Dual role of the gating network: It simultaneously acts as a feature selector and an augmentor, providing effective input transformations without disrupting semantic structure
- Complementary spectral analysis: The dual-encoder design is explained from a frequency-domain perspective—the neural network captures low-frequency components while the tree captures high-frequency components
## Limitations & Future Work
- Pretraining requires approximately 2,000 unlabeled samples per class, which may be impractical in extreme low-data scenarios
- The OSDT depth \(L\) and number of trees \(T\) are critical hyperparameters that require careful tuning
- On individual datasets, TANDEM can still underperform certain baselines (e.g., MSE of 1.0057 vs. CatBoost's 0.6565 on the BF regression dataset)
- Integration with Transformer-based tabular methods (e.g., SAINT, FT-Transformer) remains unexplored
## Related Work & Insights
- NODE (popov2019node): Foundational work on differentiable oblivious decision trees
- TabPFN (hollmann2023tabpfn): A probabilistic inference-based classification method pretrained on synthetic data; primary competitor
- VIME / SCARF / SubTab: Existing tabular SSL methods, all of which underperform TANDEM
- STG (yamada2020): Original proposal of the stochastic gating network
- Insight: Exploiting the complementary inductive biases of heterogeneous models is an effective strategy for improving SSL representation quality, and the approach is generalizable to other model combinations
## Rating
- Novelty: ⭐⭐⭐⭐ The dual-encoder design with model-based augmentation is novel, and the perspective of treating gating as augmentation is distinctive
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 19 classification + 13 regression datasets, 100 repetitions, label range of 50–1000, comprehensive ablations
- Writing Quality: ⭐⭐⭐⭐ Motivation is clear; method is presented systematically and completely
- Value: ⭐⭐⭐⭐ Practically significant for low-label tabular learning, with strong generalizability