
Feature-aware Modulation for Learning from Temporal Tabular Data

Conference: NeurIPS 2025 | arXiv: 2512.03678 | Code: https://github.com/LAMDA-Tabular/Tabular-Temporal-Modulation | Area: Time Series | Keywords: Temporal distribution shift, feature modulation, Yeo-Johnson transform, concept drift, tabular data

TL;DR

This paper addresses distribution shift in temporal tabular data by proposing a feature-aware temporal modulation mechanism. Through learnable transformations conditioned on temporal context, it dynamically adjusts per-feature shift (\(\beta\)), scale (\(\gamma\)), and skewness (\(\lambda\)) to align feature semantics across time. On the TabReD benchmark, it is the first approach to enable deep learning methods to systematically outperform GBDT.

Background & Motivation

Background: In tabular learning, tree-based models (XGBoost, CatBoost, LightGBM) have long dominated due to their robustness. Recent deep models (FT-Transformer, TabR, TabM, etc.) have narrowed the gap, but still struggle to surpass GBDT under real-world temporal distribution shift. Existing methods generally assume i.i.d. data and overlook the fact that data distributions evolve over time.

Limitations of Prior Work: In real-world scenarios, the semantics of features change over time. For example, the definition of "high income" shifts with inflation — an annual salary of 500K may have indicated high income a decade ago but only represents average income today. Likewise, the meaning of "prime location" evolves as cities develop, even when coordinates remain fixed. Static models cannot capture such semantic drift; simple adaptive methods (e.g., directly concatenating temporal embeddings) may overfit short-term patterns and generalize poorly.

Key Challenge: Static models offer strong generalization but cannot adapt to temporal changes; adaptive models focus on immediate adjustment but may sacrifice long-term stability. This constitutes a fundamental dilemma between robustness and adaptability.

Goal: How can a balance between generalization and adaptability be achieved? What are the key factors?

Key Insight: The authors observe that semantic drift in features can be characterized by changes in distributional statistics (mean, standard deviation, skewness). Dynamically adjusting these statistics based on temporal context enables alignment of feature semantics across different time periods.

Core Idea: Apply lightweight feature modulation (shift + scale + nonlinear transformation) conditioned on temporal embeddings to align feature semantics across time, thereby achieving immunity to concept drift.

Method

Overall Architecture

The input consists of timestamped tabular features \((\mathbf{x}, t)\). A temporal embedding \(\psi(t)\) is first extracted from timestamp \(t\). A lightweight modulator then generates per-feature transformation parameters \((\gamma, \beta, \lambda)\), which are used to apply a Yeo-Johnson nonlinear transformation followed by an affine scaling to the raw features. The semantically aligned features are subsequently fed into any backbone network (MLP, TabM, etc.) for prediction. Modulation can be applied at the input, intermediate, and output layers.
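The modulator described above can be sketched in a few lines. The following NumPy illustration maps a temporal embedding \(\psi(t)\) to per-feature \((\gamma, \beta, \lambda)\) through a small MLP; the shapes, hidden width, near-identity initialization, and names (`modulator`, `m`, `d`, `h`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, h = 5, 16, 32   # num features, temporal-embedding dim, hidden width (assumed)

# Lightweight modulator: psi(t) -> one (gamma, beta, lambda) triple per feature.
W1, b1 = rng.normal(scale=0.1, size=(h, d)), np.zeros(h)
W2, b2 = rng.normal(scale=0.1, size=(3 * m, h)), np.zeros(3 * m)

def modulator(psi):
    """Generate per-feature modulation parameters from a temporal embedding."""
    hidden = np.tanh(W1 @ psi + b1)
    out = (W2 @ hidden + b2).reshape(3, m)
    gamma = 1.0 + out[0]   # scale, initialized near the identity (gamma = 1)
    beta = out[1]          # shift, initialized near zero
    lam = 1.0 + out[2]     # YJ exponent, initialized near the identity (lambda = 1)
    return gamma, beta, lam
```

Initializing \((\gamma, \beta, \lambda)\) near \((1, 0, 1)\) makes the modulation start as an identity map, so the backbone sees unmodified features until the modulator learns a useful adjustment.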

Key Designs

  1. Feature-aware Temporal Modulation Function:

    • Function: Dynamically reshapes each feature's distribution based on temporal context.
    • Mechanism: For each feature \(x_i\), the modulation function is \(\tilde{x}_i = \gamma_i(\psi(t)) \cdot \text{YJ}(x_i; \lambda_i(\psi(t))) + \beta_i(\psi(t))\), where \(\gamma\) controls scaling (corresponding to standard deviation alignment), \(\beta\) controls shift (corresponding to mean alignment), and \(\lambda\) controls nonlinear shape via the Yeo-Johnson transform (corresponding to skewness alignment). All modulation parameters are generated by a lightweight MLP conditioned on the temporal embedding \(\psi(t)\).
    • Design Motivation: Unlike FiLM, which performs only linear affine modulation, the incorporation of the Yeo-Johnson transform enables handling of nonlinear evolution in feature distribution shapes. The three statistics (mean, standard deviation, skewness) are sufficient to capture most temporal distribution drift observed in real-world datasets.
  2. Yeo-Johnson Nonlinear Transform:

    • Function: Provides a differentiable power transform applicable to both positive and negative values.
    • Mechanism: The YJ transform is defined as: for \(x \geq 0\), \(\text{YJ}(x;\lambda) = ((x+1)^\lambda - 1)/\lambda\) (with the limiting form \(\log(x+1)\) as \(\lambda \to 0\)); for \(x < 0\), \(\text{YJ}(x;\lambda) = -((-x+1)^{2-\lambda}-1)/(2-\lambda)\) (with the limiting form \(-\log(-x+1)\) as \(\lambda \to 2\)). The parameter \(\lambda\) controls the transformation strength: \(\lambda=1\) yields the identity, \(\lambda<1\) compresses the right tail, and \(\lambda>1\) stretches it.
    • Design Motivation: Compared to Box-Cox (which handles only positive values), YJ accommodates arbitrary real-valued features. Since \(\lambda\) is dynamically generated from the temporal embedding, the same feature can undergo different degrees of nonlinear correction at different time periods.
  3. Multi-layer Modulation Strategy:

    • Function: Applies modulation at different stages of the network (raw input / intermediate representations / output logits).
    • Mechanism: Modulation modules can be flexibly inserted at three positions: the raw input, intermediate hidden representations, and the final prediction output. Each position uses an independent modulator, but all modulators share the same temporal embedding, which keeps the design parameter-efficient and temporally consistent.
    • Design Motivation: Ablation studies show that three-layer modulation achieves the best performance (+2.09%), but applying modulation only at the input layer already yields 87.4% of the total gain (+1.83%), indicating that early semantic alignment is the most critical. This also implies that the method can be integrated into existing models at negligible cost — only a modulation module before the input layer is needed.
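The modulation function and YJ transform above can be written out directly. Below is a minimal NumPy sketch with scalar \(\lambda\) for simplicity (function names `yeo_johnson` and `modulate` are illustrative, not from the paper's code); it includes the standard limiting log forms at \(\lambda=0\) and \(\lambda=2\) for numerical stability.

```python
import numpy as np

def yeo_johnson(x, lam, eps=1e-6):
    """Yeo-Johnson power transform, defined for arbitrary real x."""
    x = np.asarray(x, dtype=float)
    pos = x >= 0
    out = np.empty_like(x)
    if abs(lam) < eps:                       # lambda -> 0 limit
        out[pos] = np.log1p(x[pos])
    else:
        out[pos] = ((x[pos] + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < eps:                 # lambda -> 2 limit
        out[~pos] = -np.log1p(-x[~pos])
    else:
        out[~pos] = -(((-x[~pos] + 1.0) ** (2.0 - lam) - 1.0) / (2.0 - lam))
    return out

def modulate(x, gamma, beta, lam):
    """Feature modulation: gamma * YJ(x; lambda) + beta."""
    return gamma * yeo_johnson(x, lam) + beta
```

At \(\lambda = 1\), `yeo_johnson` reduces to the identity on both branches, so `modulate` degenerates to a plain FiLM-style affine transform; varying \(\lambda\) is exactly what adds the nonlinear distribution reshaping.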

Loss & Training

Cross-entropy loss is used for classification tasks and MSE for regression. AdamW optimizer with early stopping (patience = 16 epochs) is employed. Hyperparameter tuning uses Optuna (100 trials), with results averaged over 15 random seeds per configuration. The temporal embedding dimension is fixed at 128 and includes periodic encodings for year/month/day/hour along with a trend component.
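The paper fixes the temporal embedding at 128 dimensions with periodic year/month/day/hour encodings plus a trend component; its exact construction is not reproduced here. The following much smaller sketch shows one plausible way to build such an embedding (sin/cos pairs for calendar cycles plus a normalized linear trend); the cycle choices, normalizations, and `trend_origin` are assumptions for illustration only.

```python
import numpy as np
from datetime import datetime

def temporal_embedding(t: datetime, trend_origin=datetime(2000, 1, 1)):
    """Illustrative periodic temporal embedding psi(t): sin/cos pairs for
    calendar cycles plus a linear trend term (design assumed, not the paper's)."""
    fractions = {
        "year": (t.timetuple().tm_yday - 1) / 365.0,  # position within the year
        "month": (t.month - 1) / 12.0,                # month within the year
        "day": (t.day - 1) / 31.0,                    # day within the month
        "hour": t.hour / 24.0,                        # hour within the day
    }
    feats = []
    for frac in fractions.values():
        feats += [np.sin(2 * np.pi * frac), np.cos(2 * np.pi * frac)]
    # Linear trend: days elapsed since a fixed origin, roughly normalized.
    feats.append((t - trend_origin).days / 3650.0)
    return np.array(feats)
```

The periodic pairs make nearby calendar positions (e.g., December and January) close in embedding space, while the trend term lets the modulator track monotone drift such as inflation.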

Key Experimental Results

Main Results

| Method | Type | Avg. Rank | HI (AUC ↑) | CT (RMSE ↓) |
|---|---|---|---|---|
| CatBoost | Static GBDT | 4.375 | 0.9639 | 0.4792 |
| TabM | Static Deep | 7.250 | 0.9640 | 0.4813 |
| MLP + Temporal Embedding | Adaptive | 14.375 | 0.9471 | 0.4801 |
| TabM + Temporal Embedding | Adaptive | 5.125 | 0.9629 | 0.4791 |
| TabM + Temporal Modulation (Ours) | Adaptive | 3.500 | 0.9641 | 0.4773 |
| MLP + Temporal Modulation (Ours) | Adaptive | 11.000 | 0.9593 | 0.4782 |

Ablation Study

| Modulation Position | Input | Middle | Output | Absolute Gain | Relative Gain |
|---|:---:|:---:|:---:|---|---|
| No modulation |  |  |  | 0.00% | — |
| Input only | ✓ |  |  | +1.83% | 87.4% |
| Output only |  |  | ✓ | +1.02% | 48.7% |
| Middle only |  | ✓ |  | +0.26% | 12.6% |
| All | ✓ | ✓ | ✓ | +2.09% | 100% |

Key Findings

  • First systematic superiority of deep models over GBDT: TabM + modulation (rank 3.5) outperforms CatBoost (rank 4.375), which the authors claim is the first such result under a temporal distribution shift setting.
  • Input-layer modulation contributes 87.4% of the total performance gain, confirming that early semantic alignment is most critical. Removing input-layer modulation leaves the remaining two layers recovering only 56.8% of the gain.
  • Even the simplest MLP with full-layer modulation outperforms most deep learning methods, demonstrating that the modulation mechanism itself is far more valuable than backbone complexity.
  • The key advantage of temporal modulation over temporal embedding lies in decoupling the temporal and input modalities: embedding-based approaches inject temporal information directly into the feature space and may introduce interference, whereas modulation influences features indirectly through parameterized transformations, naturally avoiding scaling issues.
  • Pilot study visualizations clearly illustrate: before modulation, feature distributions across time periods differ substantially → after modulation, distributions align → the model can learn consistent decision boundaries in a unified representation space.

Highlights & Insights

  • The "objective semantics vs. subjective semantics" analytical perspective is highly insightful. Objective semantics (e.g., coordinates, salary values) do not change over time, whereas subjective semantics ("high income," "prime location") depend on distributional context. The modulation mechanism essentially restores temporal consistency in subjective semantics. This analytical framework can be generalized to distribution shift problems in other domains.
  • The introduction of the Yeo-Johnson transform, compared to simple FiLM (linear affine only), adds minimal computational overhead while enabling handling of nonlinear changes in distribution shape. This approach of "using parameterized statistical transforms for feature engineering" is transferable to other domains.
  • The extreme lightweight design is impressive: adding a small MLP modulator only at the input layer already yields 87.4% of the total gain, demonstrating that well-designed inductive biases matter more than architectural complexity.

Limitations & Future Work

  • Full-layer modulation is incompatible with PLR embeddings (which map numerical values to trigonometric functions, disrupting the interpretability of distributional semantics), limiting integration with SOTA models such as TabR and ModernNCA.
  • Validation is limited to the TabReD benchmark, which contains only 8 datasets. Further evaluation on more diverse domains and larger-scale datasets is needed.
  • The temporal embedding design relies on predefined periodic priors (year/month/day/hour), which may be less effective for temporal drift without clear periodicity.
  • Analysis of the learned values of modulation parameters \((\gamma, \beta, \lambda)\) is absent — it remains unclear whether they genuinely align with the observed trends in feature statistics.
Comparison with Related Work

  • vs. FiLM (Perez et al. 2018): FiLM uses conditional affine transformations to modulate features in visual reasoning. This paper extends FiLM in two key respects: (1) replacing scale+shift with the Yeo-Johnson transform to add nonlinear distribution reshaping; (2) conditioning on temporal embeddings rather than question embeddings.
  • vs. Temporal Embedding Methods (Cai et al. 2025): Temporal embeddings concatenate time information directly to the input, relying on end-to-end learning to discover temporal patterns. This paper argues that such an approach suffers from scaling issues and modal coupling. Modulation acts indirectly through parameterized transforms, achieving cleaner decoupling.
  • vs. HyperNetworks (Ha et al. 2017): HyperNetworks generate full model weights, incurring high computational cost and low data efficiency. The proposed modulation scheme generates only \(3m\) parameters (\(m\) being the feature dimension), serving as an extremely lightweight alternative.

Rating

  • Novelty: ⭐⭐⭐⭐ The "semantic alignment" perspective is novel; Yeo-Johnson temporal modulation is a meaningful extension of FiLM.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparison on TabReD with ablations and pilot study visualizations, though limited to a single benchmark.
  • Writing Quality: ⭐⭐⭐⭐ Motivation analysis is clear; the "objective vs. subjective semantics" examples are intuitive and accessible.
  • Value: ⭐⭐⭐⭐ First method to enable deep models to surpass GBDT on temporal tabular tasks; lightweight and practically applicable.