Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment

Conference: ICML 2026
arXiv: 2605.04363
Code: https://github.com/seunghan96/DistPFN (available)
Area: Tabular Foundation Models / In-Context Learning / Test-Time Adaptation / Label Shift
Keywords: TabPFN, label shift, posterior adjustment, temperature scaling, plug-in correction

TL;DR

TabPFN-type "tabular foundation models" feed the training set directly to attention as in-context input, and this work finds that such models severely overfit the majority class of that training set. The authors introduce DistPFN, a one-line posterior reweighting \(\tilde{p}(y) \propto \hat{p}(y)^2 / p_{train}(y)\), which lifts TabPFN-v2 accuracy under strong label shift (\(\beta=5\)) from 75.9% to 78.3% on 253 OpenML datasets (78.9% with the temperature-scaled variant DistPFN-T), without retraining, estimating test priors, or modifying the architecture.

Background & Motivation

Background: Tabular classification has long been dominated by tree models like XGBoost/LightGBM/CatBoost. In 2023, TabPFN introduced the paradigm of "feeding the entire training set as a prompt to a pretrained Transformer, producing all test predictions in a single forward pass," bringing in-context learning to tabular data. TabPFN-v2 (Nature 2025) further achieved SOTA scale and generalization via dual-axis attention, spawning a series of "tabular foundation models" (Tabular FM) such as LoCalPFN, TabICL, TabFlex, MixturePFN, etc.

Limitations of Prior Work: The authors identify a critical, widely overlooked flaw in this family of models—extreme sensitivity to the class prior in the training set. On imbalanced datasets like CostaMadre1, TabPFN-v2 predicts 98.3% of test samples as the majority class, even when train and test distributions are identical. Among the 253 OpenML datasets examined, 84.6% are imbalanced, meaning this flaw affects most real-world tabular tasks. Moreover, even slight label shift between train and test distributions causes a dramatic performance drop.

Key Challenge: Classical label shift correction methods (EME, BBE, Logit Adjustment, Balanced Softmax) either require retraining or estimating the test set label prior—compromising TabPFN's zero-shot advantage or being infeasible in real deployments. Moreover, enabling these methods in standard (no shift) settings actually degrades performance (EME/BBE on LoCalPFN drop 1.5/1.1 points w/o shift). Thus, the core issue is: existing methods require extra data, retraining, or harm standard performance.

Goal: (1) provide a completely training-free plug-in posterior correction, (2) avoid any need to estimate test priors, (3) preserve base model performance w/o shift, and (4) deliver increasing gains as shift strength grows.

Key Insight: The fundamental difference between TabPFN-type models and traditional models is that the training set distribution is explicitly encoded in attention, not implicitly in model weights. Thus, \(p_{train}(y)\) is directly accessible and computable in TabPFN, whereas classical label shift methods must estimate priors from weights. Once this is recognized, the solution becomes straightforward.

Core Idea: Multiply the model output posterior \(\hat{p}(y)\) by the posterior-to-prior ratio \(\hat{p}(y)/p_{train}(y)\), then normalize: \(\tilde{p}(y) \propto \hat{p}(y)^2 / p_{train}(y)\). The effect is "dampening the pull of the training distribution, amplifying evidence from the test sample itself."
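As a minimal sketch of this reweighting, the following pure-Python snippet applies \(\hat{p}(y)^2 / p_{train}(y)\) to a single posterior vector. The function names `class_frequencies` and `distpfn_adjust`, the `eps` guard, and the toy numbers are illustrative assumptions, not taken from the paper's code.

```python
def class_frequencies(labels, num_classes):
    """Empirical class prior p_train(y) from integer training labels."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return [c / n for c in counts]


def distpfn_adjust(p_hat, p_train, eps=1e-12):
    """Return Norm(p_hat^2 / p_train) for one test sample.

    eps is a hypothetical guard against classes absent from the training set.
    """
    if len(p_hat) != len(p_train):
        raise ValueError("p_hat and p_train must have the same length")
    scores = [p * p / max(q, eps) for p, q in zip(p_hat, p_train)]
    total = sum(scores)
    return [s / total for s in scores]


# Toy example: a 90/10 imbalanced training set and a posterior
# that leans toward the majority class.
train_labels = [0] * 90 + [1] * 10
p_train = class_frequencies(train_labels, num_classes=2)  # [0.9, 0.1]
p_hat = [0.7, 0.3]
p_tilde = distpfn_adjust(p_hat, p_train)
print(p_tilde)
```

In this toy case the minority class already receives three times its training prior (0.3 vs 0.1), so the adjusted prediction moves from class 0 to class 1. When the posterior equals the training prior, the adjustment returns it unchanged.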

Method

Overall Architecture

The method comprises three compact components: (1) a one-line posterior adjustment formula, DistPFN; (2) a temperature scaling variant, DistPFN-T, which adaptively controls adjustment strength via cross-entropy; (3) an inverse-frequency resampling benchmark construction to systematically quantify "shift strength \(\beta\) vs accuracy" curves on OpenML. The pipeline: obtain logits from a single TabPFN forward pass → softmax → multiply by the adjustment factor \(\alpha\) (the posterior-to-prior ratio \(\hat{p}(y)/p_{train}(y)\) in plain DistPFN) → renormalize → output. All adjustment occurs at inference, fully plug-in, with no changes to TabPFN parameters or architecture.

Key Designs

  1. DistPFN: Posterior/Prior Ratio as Adjustment Factor:

    • Function: Corrects the model's biased posterior \(\hat{p}_{TabPFN}(y)\), which is pulled by the training distribution, toward a less majority-skewed distribution.
    • Mechanism: \(\tilde{p}_{DistPFN}(y) = \mathrm{Norm}\!\left(\hat{p}_{TabPFN}(y) \cdot \frac{\hat{p}_{TabPFN}(y)}{p_{train}(y)}\right) = \mathrm{Norm}\!\left(\frac{\hat{p}_{TabPFN}(y)^2}{p_{train}(y)}\right)\), where \(p_{train}(y)\) is the class frequency in the training set and \(\mathrm{Norm}(\cdot)\) denotes normalization over classes so that the adjusted probabilities sum to 1. The intuition is classic: this is a variant of the "removing training prior" idea from Saerens et al. 2002 (\(\tilde{p}(y|x) \propto \hat{p}(y|x)/p_{train}(y)\)), but it uses \(\hat{p}^2\) in the numerator to retain model prediction information and avoid overcorrection ("partial correction" per the paper).
    • Design Motivation: Classical prior correction assumes \(\hat{p}(y) \approx p_{train}(y)\) and fully divides out the training prior; in practice, \(\hat{p}(y)\) does not collapse entirely to \(p_{train}(y)\), so full correction overcompensates. The \(\hat{p}^2/p_{train}\) form is empirically validated by oracle experiments as a "near-optimal" compromise.
  2. DistPFN-T: Adaptive Temperature Scaling via KL/CE:

    • Function: Dynamically adjusts correction strength based on the "deviation between model prediction and training prior"—greater deviation indicates greater train-test distribution mismatch, warranting more aggressive adjustment, but first smooths overconfident predictions to prevent overcorrection.
    • Mechanism: Set temperature \(\tau = \mathrm{CE}(\hat{p}_{TabPFN}(y), p_{train}(y))\) (cross-entropy between training prior and current prediction), then apply temperature scaling: \(\hat{p}_{TabPFN\text{-}T}(y=c) = \mathrm{softmax}(\hat{p}_{TabPFN}(y=c)/\tau)\), finally compute \(\tilde{p}_{DistPFN\text{-}T}(y) = \mathrm{Norm}\!\left(\hat{p}_{TabPFN}(y) \cdot \frac{\hat{p}_{TabPFN\text{-}T}(y)}{p_{train}(y)}\right)\).
    • Design Motivation: The fixed \(\hat{p}^2/p_{train}\) form in DistPFN suffices for weak shift but can overcorrect under strong shift. DistPFN-T uses \(\tau\) as a self-monitoring signal: (a) greater deviation from the training prior → larger \(\tau\) → smoother predictions after scaling → milder but sustained adjustment; (b) the adjustment further amplifies minority classes when the prediction favors the majority and softens when it favors the minority, counterbalancing the bias in both directions.
  3. Inverse-Frequency Resampling Benchmark:

    • Function: Controls label shift strength via a scalar \(\beta \geq 0\) by modifying only the training set, enabling systematic "shift strength vs accuracy" curves on 253 OpenML datasets.
    • Mechanism: Assign each class \(c_k\) a sampling weight \(w_k = (1/p(y=c_k))^\beta\), normalize to \(\tilde{w}_k = w_k / \sum_j w_j\), and oversample (rather than undersample) the training set according to \(\tilde{w}_k\) to avoid information loss. \(\beta = 0\) gives every class the same weight, so resampling is uniform over samples and the original class distribution is kept (the no-shift setting); increasing \(\beta\) increasingly favors rare classes, making the training distribution diverge from the test distribution.
    • Design Motivation: Existing label shift benchmarks (e.g., TableShift) offer only a few real shift points, insufficient for continuous curves; this inverse-frequency approach enables large-scale, controlled evaluation at \(\beta \in \{0, 0.1, 0.5, 1, 2, 5\}\) × 253 datasets × 5 seeds, providing cleaner signals than real OOD benchmarks.
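The inverse-frequency weighting in item 3 can be sketched as follows. This is a sketch under stated assumptions rather than the paper's exact procedure: it assumes the class weight \(\tilde{w}_k\) is applied per sample and that training rows are drawn with replacement at the original set size. All function names are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels, beta):
    """Normalized class weights w_k = (1 / p(y=c_k))^beta."""
    n = len(labels)
    counts = Counter(labels)
    raw = {c: (n / k) ** beta for c, k in counts.items()}
    z = sum(raw.values())
    return {c: w / z for c, w in raw.items()}


def implied_class_distribution(labels, beta):
    """Expected class shares when every sample is drawn with
    probability proportional to the weight of its class."""
    counts = Counter(labels)
    weights = inverse_frequency_weights(labels, beta)
    mass = {c: counts[c] * weights[c] for c in counts}
    z = sum(mass.values())
    return {c: m / z for c, m in mass.items()}


def resample_indices(labels, beta, seed=0):
    """Draw len(labels) training indices with replacement (assumed scheme)."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(labels, beta)
    per_sample = [weights[y] for y in labels]
    return rng.choices(range(len(labels)), weights=per_sample, k=len(labels))


labels = [0] * 90 + [1] * 10
for beta in (0, 1, 2):
    print(beta, implied_class_distribution(labels, beta))
```

Under these assumptions the expected class share is proportional to \(p(y=c_k)^{1-\beta}\): \(\beta=0\) keeps the original 90/10 split, \(\beta=1\) balances the classes, and \(\beta=2\) inverts the split to 10/90.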

Loss & Training

No training or fine-tuning is required; the entire method is inference-time probability reweighting. The only "hyperparameter" is whether to use DistPFN-T (i.e., enable temperature scaling).
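A minimal sketch of this inference-time step with the temperature option as a flag is shown below. Several details are assumptions made for illustration: the temperature is applied to the logits, the cross-entropy is taken as \(-\sum_c p_{train}(c)\log\hat{p}(c)\), and \(\tau\) is clamped to hypothetical bounds `tau_min` and `tau_max`. The function names are not from the paper's repository.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(p, q, eps=1e-12):
    """CE(p, q) = -sum_c p_c * log(q_c), with q clipped away from zero."""
    return -sum(pc * math.log(max(qc, eps)) for pc, qc in zip(p, q))


def adjust_posterior(logits, p_train, use_temperature=False,
                     tau_min=1e-3, tau_max=1e3):
    """Reweight one sample's posterior by a posterior-to-prior ratio.

    use_temperature=False: Norm(p_hat * p_hat / p_train)    (DistPFN)
    use_temperature=True:  Norm(p_hat * p_hat_T / p_train)  (DistPFN-T),
    where p_hat_T is a temperature-scaled softmax (assumed to act on logits).
    """
    p_hat = softmax(logits)
    if use_temperature:
        tau = cross_entropy(p_train, p_hat)      # assumed argument order
        tau = min(max(tau, tau_min), tau_max)    # hypothetical clamp
        ratio_num = softmax([z / tau for z in logits])
    else:
        ratio_num = p_hat
    scores = [p * r / q for p, r, q in zip(p_hat, ratio_num, p_train)]
    total = sum(scores)
    return [s / total for s in scores]


logits = [math.log(0.7), math.log(0.3)]
p_train = [0.9, 0.1]
print(adjust_posterior(logits, p_train))
print(adjust_posterior(logits, p_train, use_temperature=True))
```

The flag reflects that enabling temperature scaling is the only choice to make; neither branch touches model parameters, and each needs only one extra pass over the class dimension.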

Key Experimental Results

253 OpenML datasets (50/50 train/test split, averaged over 5 seeds), 6 \(\beta\) levels, reporting both w/o shift and w/ shift averages. Compared against 16 baselines (including LogReg/SVM/MLP/kNN/RF/LightGBM/CatBoost/FT-Transformer/TabM/RealMLP/LoCalPFN/TabICL/TabPFN-v2, etc.).

Main Results

| Method | \(\beta=0\) | \(\beta=0.1\) | \(\beta=0.5\) | \(\beta=1\) | \(\beta=2\) | \(\beta=5\) | Mean (w/ shift) |
|---|---|---|---|---|---|---|---|
| CatBoost | 0.803 | 0.774 | 0.771 | 0.751 | 0.718 | 0.665 | 0.717 |
| RealMLP | 0.794 | 0.760 | 0.758 | 0.745 | 0.720 | 0.677 | 0.717 |
| TabPFN-v2 (base) | 0.818 | 0.797 | 0.796 | 0.790 | 0.782 | 0.759 | 0.775 |
| + DistPFN | 0.818 | 0.799 | 0.797 | 0.795 | 0.791 | 0.783 | 0.789 |
| + DistPFN-T | 0.818 | 0.799 | 0.798 | 0.797 | 0.796 | 0.789 | 0.792 |
| + DistPFN-Oracle (upper bound) | 0.818 | 0.803 | 0.802 | 0.800 | 0.797 | 0.792 | 0.796 |
| TabICL (base) | 0.806 | 0.783 | 0.781 | 0.770 | 0.747 | 0.704 | 0.742 |
| TabICL + DistPFN-T | 0.806 | 0.786 | 0.786 | 0.783 | 0.780 | 0.771 | 0.777 |
| LoCalPFN (base) | 0.816 | 0.794 | 0.793 | 0.788 | 0.778 | 0.753 | 0.771 |
| LoCalPFN + DistPFN-T | 0.816 | 0.798 | 0.797 | 0.796 | 0.794 | 0.787 | 0.791 |

Key observations: At \(\beta=5\), TabPFN-v2 + DistPFN-T lifts base from 75.9% to 78.9%; TabICL from 70.4% to 77.1% (+6.7pp); LoCalPFN from 75.3% to 78.7%. The consistent gains across three different FMs indicate model-agnostic effectiveness.
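The gains quoted above follow directly from the \(\beta=5\) column of the table. A short check, with the accuracies copied from the table and the dictionary layout chosen only for illustration:

```python
# (base accuracy, accuracy with DistPFN-T) at beta = 5, from the main results table
beta5_accuracy = {
    "TabPFN-v2": (0.759, 0.789),
    "TabICL": (0.704, 0.771),
    "LoCalPFN": (0.753, 0.787),
}

# Gain in percentage points, rounded to one decimal place
gains_pp = {
    model: round((adjusted - base) * 100, 1)
    for model, (base, adjusted) in beta5_accuracy.items()
}
print(gains_pp)
```

This gives +3.0pp for TabPFN-v2, +6.7pp for TabICL, and +3.4pp for LoCalPFN.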

Ablation Study

| Configuration | w/o shift | w/ shift (mean) | Notes |
|---|---|---|---|
| TabPFN-v2 (base) | 0.818 | 0.775 | Baseline |
| + EME (Saerens 2002, EM for test prior) | 0.801 | 0.786 | Drops 1.7pp w/o shift |
| + BBE (Lipton 2018, black-box test prior) | 0.805 | 0.789 | Drops 1.3pp w/o shift |
| + DistPFN | 0.818 | 0.789 | No loss w/o shift |
| + DistPFN-T | 0.818 | 0.792 | No loss w/o shift + largest gain w/ shift |
| + DistPFN-Oracle (true \(p_{test}(y)\)) | 0.818 | 0.796 | Upper bound, only 0.4pp gap |

Real OOD data (TableShift) also gains:
TableShift Diabetes (OOD): base 0.589 → DistPFN-T 0.600
TableShift Acsincome (OOD): base 0.795 → DistPFN-T 0.799

Key Findings

  • Greater shift, greater gain: As train-test KL divergence increases, DistPFN-T yields monotonically increasing accuracy improvements per dataset, confirming its direct effectiveness against label shift rather than incidental regularization.
  • Approaches oracle: Using the true test prior (DistPFN-Oracle) achieves 79.2% at \(\beta=5\); using only the predicted posterior (DistPFN-T) achieves 78.9%, within 0.3pp of the oracle upper bound. The small gap is credited to temperature scaling, which gives a smoother adjustment that is less prone to overcorrection.
  • No loss in no-shift is the biggest selling point: EME/BBE both drop 1–2pp w/o shift, making practitioners hesitant to enable them in deployment; DistPFN/DistPFN-T preserve base performance w/o shift (at \(\beta=0\), \(\hat{p}/p_{train} \approx 1\), so the adjustment factor is close to unity), making them safe to enable by default.
  • Single vs multiple instance nearly identical: TabPFN supports both single-sample and batch modes; adjustment factors computed per-sample or as test-set averages yield nearly identical results, indicating robustness to implementation details.
  • 84.6% of OpenML datasets are inherently imbalanced (minority/majority ratio < 1.0), so even in \(\beta=0\) "no shift" settings, the majority-class bias is a systemic issue for TabPFN-type models, not a corner case.

Highlights & Insights

  • The observation that "training prior is explicitly visible" is the paper's fulcrum—once it is clear that TabPFN encodes the entire training set in-context, \(p_{train}(y)\) no longer needs to be estimated, bypassing the entire engineering effort of "test prior estimation" in classical label shift literature. Table 2 cleanly separates explicit/implicit models, making this "paradigm difference" perspective highly instructive.
  • The "partial correction" \(\hat{p}^2/p_{train}\) is a well-judged engineering choice: full prior correction (\(\hat{p}/p_{train}\)) overcorrects in practice, because the model's \(\hat{p}\) does not fully absorb the training distribution; squaring the numerator retains model confidence, keeping the correction gentle rather than extreme.
  • DistPFN-T's use of \(\tau = \mathrm{CE}(\hat{p}, p_{train})\) as temperature is an elegant self-monitoring design: it relies solely on model outputs and known training priors, being fully self-contained.
  • The inverse-frequency oversample benchmark is a reusable methodological contribution—future label shift evaluations can directly use this \(\beta\) controller to draw continuous curves on fixed test sets, offering more controllability than real OOD data.
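To make the "partial correction" point concrete, the three reweightings discussed in this summary can be written side by side. The oracle line is an assumption based on the description of DistPFN-Oracle as using the true \(p_{test}(y)\); the paper's exact form may differ.

```latex
\begin{align*}
\text{Full prior removal:}\quad
  & \tilde{p}(y \mid x) \propto \frac{\hat{p}(y \mid x)}{p_{train}(y)} \\
\text{DistPFN (partial correction):}\quad
  & \tilde{p}(y \mid x) \propto \hat{p}(y \mid x)\cdot\frac{\hat{p}(y \mid x)}{p_{train}(y)}
    = \frac{\hat{p}(y \mid x)^{2}}{p_{train}(y)} \\
\text{Oracle (assumed form):}\quad
  & \tilde{p}(y \mid x) \propto \hat{p}(y \mid x)\cdot\frac{p_{test}(y)}{p_{train}(y)}
\end{align*}
```

Read this way, DistPFN keeps the structure of the oracle correction but substitutes the model's own posterior for the unknown test prior, which is why no test-prior estimation is needed.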

Limitations & Future Work

  • The method is theoretically designed for explicit-prior models (TabPFN family + kNN); although Table 2 claims it is "technically applicable to tree models," \(p_{train}\) is less "explicit" there, and actual gains are not shown in the main results, suggesting less clean applicability.
  • The paper acknowledges this is partial, not full, correction; even with temperature scaling, there remains a 0.4pp gap to oracle. Fully closing this gap may require some lightweight online test prior estimation, which would compromise the "zero training, zero estimation" appeal.
  • Additional limitations: (1) Only classification is tested, not regression label distribution shift; (2) Adjustment occurs at the softmax output—if the model's logits are poorly calibrated (deep model overconfidence), the adjustment factor may be exaggerated; (3) No analysis for extreme class counts (e.g., 100+ classes, long-tail); experiments focus on moderate class counts (≤10) in tabular tasks; (4) Temperature \(\tau\) computed via CE may overflow under extreme distributions, requiring numerical clamping in deployment.
  • Potential improvements: Extend this idea to calibration (using priors to correct confidence), apply to RAG-LLM output distribution correction, or combine with conformal prediction for uncertainty-aware, shift-robust predictions.

Comparison with Related Work

  • vs EME (Saerens 2002) / BBE (Lipton 2018): Both require iterative test prior estimation and inevitably degrade performance w/o shift; this work does not estimate test priors, strictly preserves performance w/o shift, and delivers larger gains w/ shift, a clear advance in the same line.
  • vs Logit Adjustment / Balanced Softmax: These require loss modification during training, unsuitable for frozen FMs like TabPFN-v2; this work is an inference-time plug-in, usable with any pretrained FM.
  • vs Drift-Resilient TabPFN (Helli 2024): That method targets temporal shift during pretraining and requires retraining; this work targets label shift via test-time adaptation, making the two orthogonal and combinable.
  • vs General TTA (test-time training): Most TTA methods require backpropagation on test samples, incurring high computational cost and risk; this work is pure forward pass + probability reweighting, with negligible extra computation.

Rating

  • Novelty: ⭐⭐⭐⭐ The observation that "TabPFN explicitly encodes training priors" + posterior/prior ratio + temperature scaling is a simple yet sharp combination, previously unexplored.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 253 OpenML datasets × 6 \(\beta\) × 5 seeds × 3 FMs × real TableShift OOD × EME/BBE comparison × oracle upper bound—extremely dense evaluation.
  • Writing Quality: ⭐⭐⭐⭐ Table 2's explicit/implicit split and Table 3's method comparison are clear; the derivation is logically smooth.
  • Value: ⭐⭐⭐⭐ One-line plug-in, directly deployable on any TabPFN-v2/LoCalPFN/TabICL, high industrial value, and opens up the research direction of "leveraging explicit priors in FMs".