
ConTextTab: A Semantics-Aware Tabular In-Context Learner

Conference: NeurIPS 2025 | arXiv: 2506.10707 | Code: SAP-samples/sap-rpt-1-oss | Area: Tabular Learning / In-Context Learning | Keywords: Tabular Learning, In-Context Learning, Semantic Encoding, Foundation Model, Zero-Shot Prediction

TL;DR

ConTextTab integrates semantic embeddings (text encodings of column names and categorical values) into a table-native ICL architecture, and pretrains on large-scale real-world tabular data (T4, ~2.18M tables). It achieves a new SOTA on the semantics-rich CARTE benchmark while remaining competitive with existing methods on non-semantic benchmarks.

Background & Motivation

  • Background: Table-native ICL methods such as TabPFN and TabICL perform well on small-to-medium scale tabular prediction tasks, but rely entirely on synthetic data for training and cannot leverage semantic information such as column names or categorical labels present in real-world data.
  • Limitations of Prior Work: LLM-based approaches like TabuLa-8B possess deep semantic understanding, but text serialization leads to poor token efficiency (at most 32 rows of context) and loses the 2D structure of tabular data.
  • Key Challenge: Table-native methods are efficient but semantics-free, whereas LLM-based methods are semantics-aware but inefficient.
  • Goal: To combine the advantages of both paradigms — injecting semantic understanding into a table-native ICL framework while training on real-world data.

Method

Overall Architecture

ConTextTab extends the TabPFN architecture along the following pipeline: multimodal embedding layer → alternating attention backbone → task-specific output heads. Each column is encoded by a type-specific encoder (text / date / numeric), column names are injected as "positional encodings" via text embeddings, and the overall design preserves row- and column-wise permutation equivariance.
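
To make the pipeline concrete, here is a minimal sketch of the alternating-attention backbone with the causal row mask and weight sharing described below. This is my reading of the paper's description, not the released implementation; all module names, dimensions, and masking details are illustrative assumptions.

```python
# Minimal sketch of the alternating-attention backbone (illustrative only:
# module names, dimensions, and masking details are assumptions).
import torch
import torch.nn as nn

class AlternatingBlock(nn.Module):
    def __init__(self, d: int, n_heads: int = 4):
        super().__init__()
        self.col_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.row_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d) for _ in range(3))

    def forward(self, x: torch.Tensor, n_context: int) -> torch.Tensor:
        # x: (rows, cols, d) grid of cell embeddings; the first n_context rows
        # are context rows, the rest are query rows.
        rows, _, _ = x.shape
        # "Horizontal" attention: each row attends across its columns (unmasked).
        h = self.norm1(x)
        x = x + self.col_attn(h, h, h)[0]
        # "Vertical" attention: each column attends across rows; every row may
        # only attend to context rows (queries never see other queries).
        v = self.norm2(x).transpose(0, 1)              # (cols, rows, d)
        mask = torch.ones(rows, rows, dtype=torch.bool, device=x.device)
        mask[:, :n_context] = False                    # True = attention blocked
        v = self.row_attn(v, v, v, attn_mask=mask)[0].transpose(0, 1)
        x = x + v
        return x + self.ff(self.norm3(x))

class SharedBackbone(nn.Module):
    """Weight sharing: one block unrolled `depth` times, like a depth-unrolled RNN."""
    def __init__(self, d: int, depth: int = 12):
        super().__init__()
        self.block = AlternatingBlock(d)               # a single set of parameters
        self.depth = depth

    def forward(self, x: torch.Tensor, n_context: int) -> torch.Tensor:
        for _ in range(self.depth):
            x = self.block(x, n_context)
        return x
```

Unrolling a single parameter-shared block for a fixed depth is what makes the backbone interpretable as a depth-unrolled RNN, as discussed under Key Designs below.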

Key Designs

  1. Multimodal Semantic Feature Encoding:

    • Text / Categorical Columns: Each cell is encoded into a vector using a pretrained text embedding model (default: all-MiniLM-L6-v2), then projected to the target dimension \(d\) via a learnable linear layer. Categorical columns follow the same path, preserving label semantics.
    • Date Columns: The day, month, and year components are embedded separately and summed, balancing the representation of relative magnitude against the recognition of special dates (e.g., holidays).
    • Numeric Columns: Values are clipped to the 2%–98% quantile range and standardized to zero mean and unit variance; since at most a 2% tail remains beyond either clip point, Chebyshev's inequality bounds the standardized values within \((-7.1, 7.1)\) (as \(1/\sqrt{0.02} \approx 7.1\)). Each value is then multiplied by a learnable vector, and a learnable bias is added. NaN values are replaced by 0, so the bias doubles as an "is-NaN" flag.
    • Column Names: Encoded with the same text embedding model, projected through an independent linear layer, and added to the corresponding cell embeddings.
    • All embeddings pass through LayerNorm before entering the backbone, fully preserving row- and column-wise permutation equivariance (see the encoding sketch after this list).
  2. Alternating Attention Backbone and Weight Sharing:

    • The architecture follows TabPFN's alternating "horizontal" (across columns) and "vertical" (across rows) self-attention structure.
    • Cross-column attention is unmasked; cross-row attention uses a causal mask (query rows attend only to context rows).
    • Weight sharing is enabled by default: a single transformer block shares parameters across all layers, interpretable as a depth-unrolled RNN. This reduces the trainable parameter count from 172M to 16M with no observed performance degradation.
  3. Large-Scale Real-World Data Training Strategy:

    • Training uses the T4 dataset, filtered to 2.18M tables (median size: 750 rows × 9 columns).
    • For each training instance, 1,000 rows are sampled; 50–900 rows serve as queries and the remainder as context.
    • A target column is selected at random (excluding date columns, numeric columns with >50% NaN, and columns with >20% unique values).
    • Non-numeric columns are upsampled to approximately balance regression and classification tasks.
    • An optional curriculum learning stage uses TabDPT data (123 tables, median 11k rows × 34 columns) to extend the training row count to 4,000.
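
The multimodal encoders in item 1 can be sketched as follows, assuming PyTorch and the sentence-transformers package; the target dimension \(d=256\), the year range, and all helper names are my own illustrative choices, not the released code.

```python
# Sketch of the per-column encoders (a minimal reading of the paper, not its
# code; projection layout, year range, and NaN handling are assumptions).
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

d = 256                                     # target embedding dimension (assumed)
text_model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim sentence embeddings
proj_cell = nn.Linear(384, d)               # learnable projection for cell text
proj_name = nn.Linear(384, d)               # independent projection for column names
emb_day, emb_month = nn.Embedding(32, d), nn.Embedding(13, d)
emb_year = nn.Embedding(300, d)             # e.g. years 1800-2099 (assumed range)

def encode_text_column(values: list[str]) -> torch.Tensor:
    """Text / categorical cells: embed the raw strings, then project to d."""
    emb = torch.from_numpy(text_model.encode(values)).float()    # (n, 384)
    return proj_cell(emb)                                        # (n, d)

def encode_date_column(day, month, year) -> torch.Tensor:
    """Dates: day/month/year embedded separately and summed."""
    return emb_day(day) + emb_month(month) + emb_year(year - 1800)

def encode_numeric_column(x: np.ndarray, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Numerics: clip to the 2%-98% quantiles, standardize, scale by a learnable
    vector w and shift by a learnable bias b; NaNs become 0 so b flags them."""
    lo, hi = np.nanquantile(x, 0.02), np.nanquantile(x, 0.98)
    z = np.clip(x, lo, hi)
    z = (z - np.nanmean(z)) / (np.nanstd(z) + 1e-8)  # |z| <= 1/sqrt(0.02) ~= 7.1
    z = np.nan_to_num(z, nan=0.0)
    return torch.from_numpy(z).float()[:, None] * w + b          # (n, d)

def add_column_name(cell_emb: torch.Tensor, name: str) -> torch.Tensor:
    """Column names act as 'positional encodings': embed, project, add, LayerNorm."""
    name_emb = torch.from_numpy(text_model.encode([name])).float()  # (1, 384)
    return F.layer_norm(cell_emb + proj_name(name_emb), (d,))
```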

Loss & Training

  • Classification: Standard cross-entropy loss with an MLP output head.
  • Regression: L2 loss computed on clipped, standardized targets; predictions are inverse-transformed back to the original scale at inference.
  • Alternative — Supervised Clustering Head: Cosine similarities between query–context row pairs are computed and compared against a same-class/different-class adjacency matrix via element-wise binary cross-entropy loss, imposing no upper bound on the number of classes (this head and the soft-binning variant are sketched after this list).
  • Alternative — Soft Binning: Numeric values are quantile-binned and soft-encoded via linear interpolation between adjacent bins, converting regression to classification; predictions are obtained as probability-weighted bin means.
  • Training: AdamW, lr=\(10^{-4}\), linear warmup for 1,000 steps, gradient accumulation to an effective batch size of 256, gradient clipping; 4–10M sampled training tables overall (2–5 epochs over the 2.18M-table corpus).
  • Inference: 8-fold bagging (8 bootstrap samples of context), default context size \(c=8192\), up to 500 columns.
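
Both alternative heads admit short sketches. The following is my reading of the descriptions above, assuming PyTorch; the mapping from cosine similarity to a BCE-ready probability and all function names are assumptions.

```python
# My sketches of the two alternative heads (assumed PyTorch translation of the
# text above; names and the similarity-to-probability mapping are assumptions).
import torch
import torch.nn.functional as F

def clustering_head_loss(q_emb: torch.Tensor,    # (n_q, d) query row embeddings
                         c_emb: torch.Tensor,    # (n_c, d) context row embeddings
                         q_labels: torch.Tensor, # (n_q,) integer class ids
                         c_labels: torch.Tensor  # (n_c,) integer class ids
                         ) -> torch.Tensor:
    # Cosine similarity of every query-context row pair ...
    sim = F.normalize(q_emb, dim=-1) @ F.normalize(c_emb, dim=-1).T  # (n_q, n_c)
    prob = ((sim + 1) / 2).clamp(1e-6, 1 - 1e-6)  # [-1,1] -> (0,1); assumed mapping
    # ... compared against the same-class adjacency matrix via element-wise BCE.
    target = (q_labels[:, None] == c_labels[None, :]).float()
    return F.binary_cross_entropy(prob, target)

@torch.no_grad()
def clustering_head_predict(q_emb, c_emb, c_labels, n_classes: int) -> torch.Tensor:
    # Predict via mean similarity to each class's context rows; the class set is
    # read off the context, so there is no architectural upper bound on classes.
    sim = F.normalize(q_emb, dim=-1) @ F.normalize(c_emb, dim=-1).T
    scores = torch.zeros(q_emb.size(0), n_classes)
    scores.index_add_(1, c_labels, sim)
    counts = torch.bincount(c_labels, minlength=n_classes).clamp(min=1)
    return (scores / counts).argmax(dim=-1)

def soft_bin(y: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    # Soft binning: each numeric target becomes a distribution over quantile-bin
    # centers via linear interpolation between its two nearest centers.
    yc = y.clamp(centers[0], centers[-1])
    idx = torch.searchsorted(centers, yc).clamp(1, len(centers) - 1)
    lo, hi = centers[idx - 1], centers[idx]
    w_hi = (yc - lo) / (hi - lo + 1e-12)
    soft = torch.zeros(y.size(0), len(centers))
    soft.scatter_(1, (idx - 1)[:, None], (1 - w_hi)[:, None])
    soft.scatter_(1, idx[:, None], w_hi[:, None])
    return soft   # train with cross-entropy; decode predictions as probs @ centers
```

Because the clustering loss only compares query rows against context rows, nothing fixes the class count in advance, which is where the "no upper bound on the number of classes" property comes from; for soft binning, the decode step is simply the probability-weighted mean of the bin centers.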

Key Experimental Results

Main Results

Evaluation covers 91 regression + 112 classification tasks, with dataset sizes ranging from 400 to ~400k training samples and 5 to 3k columns.

| Benchmark | ConTextTab Performance | Key Comparison |
| --- | --- | --- |
| CARTE (semantics-rich) | New SOTA, consistently best across all sample sizes | Significantly outperforms TabPFN / TabICL / TabDPT (\(p<0.05\)) |
| OpenML-CC18 (classification) | Competitive | No significant difference from the best model |
| TALENT-Tiny (mixed) | Competitive | No significant difference from the best model |
| TabReD (large-scale) | Competitive | Tuned tree ensembles have an advantage on large datasets |
| OpenML-CTR23 (regression) | Slightly weaker | No significant difference from tuned tree ensembles |

Ablation Study

| Ablation | Finding |
| --- | --- |
| Training data scale | Critical to model performance; data volume is the key factor |
| Weight sharing | Reduces trainable parameters from 172M to 16M with no performance impact |
| Text embedding model | all-MiniLM-L6-v2 achieves a good speed–accuracy trade-off |
| ISAB attention | Applied to the first \(m=3\) cross-row attention layers to reduce inference cost on large tables |
| Curriculum learning | A second training stage on large-table data yields further improvements |
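
For context on the ISAB row above: induced set attention (from the Set Transformer line of work) routes attention over \(n\) rows through a small set of learned inducing points, cutting the cost from \(O(n^2)\) to \(O(n \cdot m_{\text{ind}})\). Below is a minimal generic sketch; names and sizes are assumed, and how ConTextTab combines this with the context mask is not shown.

```python
# Generic ISAB sketch (standard Set Transformer construction, not the paper's
# exact variant; num_inducing and head count are illustrative assumptions).
import torch
import torch.nn as nn

class ISAB(nn.Module):
    """Induced Set Attention Block: rows attend to a few learned inducing
    points and read the summary back, instead of attending to all rows."""
    def __init__(self, d: int, num_inducing: int = 16, n_heads: int = 4):
        super().__init__()
        self.inducing = nn.Parameter(torch.randn(1, num_inducing, d))
        self.summarize = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, n_rows, d)
        i = self.inducing.expand(x.size(0), -1, -1)
        h, _ = self.summarize(i, x, x)    # inducing points summarize all rows
        out, _ = self.broadcast(x, h, h)  # each row reads the summary back
        return out
```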

Key Findings

  • Decisive gap from semantic understanding: On CARTE, TabPFN (without semantics) performs worse than untuned gradient boosted trees, while ConTextTab surpasses all single-model methods.
  • Clear low-data advantage: In sub-sampling experiments on CARTE (128 rows to full dataset), ConTextTab outperforms AutoGluon at ≤2,048 rows.
  • Parity on non-semantic benchmarks: On traditional benchmarks such as OpenML-CC18 and TALENT-Tiny, performance is competitive with TabPFN and tuned tree ensembles, with no significant gaps.
  • Large-scale dataset challenge: Tuned tree ensembles retain an advantage on large datasets (e.g., TabReD), in some cases even surpassing AutoGluon, indicating room for improvement in context scaling for ICL methods.

Highlights & Insights

  • Clear methodological contribution: This is the first work to systematically integrate semantic embeddings into a table-native ICL framework and train on real-world data, with a concise and effective design.
  • Surprising finding on weight sharing: A 172M → 16M parameter reduction with no performance loss suggests that the effective parameter space for tabular ICL may be far smaller than the total parameter count.
  • Permutation equivariance: Semantic encoding naturally preserves this property, removing the need for TabPFN-style bagging tricks such as averaging over random category-to-ID mappings.
  • Elegant supervised clustering head: Using cosine similarity with an adjacency matrix for classification imposes no upper bound on class count and preserves label semantics, offering an elegant alternative to the standard cross-entropy head.
  • Reasonable training efficiency: On a single H100 GPU at ~10 tables/s, full training takes 4–12 days.

Limitations & Future Work

  • No breakthrough on non-semantic benchmarks: Performance on traditional numeric tabular benchmarks merely matches rather than surpasses existing methods; improved numeric encoding or larger models are needed.
  • Large-scale dataset bottleneck: ICL methods still lag behind tuned tree ensembles on datasets with hundreds of thousands of samples; context scaling remains a critical bottleneck.
  • AutoGluon retains overall advantage: As a multi-model ensemble, AutoGluon generally outperforms single models, indicating headroom for single-model performance.
  • Fixed text embedding model: The lightweight MiniLM may lose information in complex semantic settings; stronger embedding models or end-to-end training warrant exploration.
  • Inference cost: The overhead of 8-fold bagging with 8,192 context rows is non-trivial and must be weighed in deployment scenarios.

Related Work

  • TabPFN / TabICL: Table-native ICL baselines trained on synthetic data; this paper extends their architecture with semantic capabilities.
  • TabuLa-8B: An LLM-based ICL method that curated the T4 dataset (reused in this work), but is constrained to 32-row contexts.
  • CARTE: A semantics-aware tabular pretraining method requiring task-specific fine-tuning; its benchmark is where this paper's main contributions shine.
  • TabDPT: Trains on real data with retrieval-based context selection, inspiring the curriculum learning strategy in this work.
  • Insight: Semantic information is severely underutilized in tabular learning; simply injecting text embeddings yields decisive advantages in semantics-rich settings. This also suggests that future tabular foundation models should move toward multimodal fusion.

Rating

| Dimension | Score (1–10) | Notes |
| --- | --- | --- |
| Novelty | 7 | First systematic integration of semantic embeddings into table-native ICL with real-world training, though individual components are not entirely new |
| Technical Depth | 8 | Multimodal encoding is carefully designed; alternative architectures such as the supervised clustering head and ISAB demonstrate thorough exploration |
| Experimental Thoroughness | 9 | Five major benchmarks, 203 datasets, extensive baselines, ablation studies, and sub-sampling experiments; very comprehensive |
| Writing Quality | 8 | Clear structure, well-articulated motivation, and detailed method descriptions |
| Value | 7 | Open-source code and model; high practical value in semantics-rich settings, though advantages are less clear in non-semantic settings |
| Overall | 7.8 | A solid and systematic contribution that sets a new standard for semantic tabular learning |