# Towards Scaling Laws for Symbolic Regression
Conference: NeurIPS 2025 | arXiv: 2510.26064 | Code: None | Area: Interpretability / Symbolic Regression | Keywords: Symbolic Regression, Scaling Laws, Transformer, Power Law, Compute-Optimal
## TL;DR
This work presents the first systematic study of scaling laws for symbolic regression (SR): it demonstrates that end-to-end Transformer-based SR follows power-law scaling trends across three orders of magnitude of compute, and derives empirical rules for the compute-optimal token-to-parameter ratio (\(\approx 15\)) and for how batch size and learning rate should scale with model size.
## Background & Motivation
- Symbolic regression aims to discover underlying mathematical expressions from observational data, offering both interpretability and generalization ability.
- Pretrained Transformer-based SR methods have recently approached the performance of genetic programming, yet scaling effects remain almost entirely unstudied — no existing work exceeds \(\sim 100\)M parameters.
- Motivated by LLM scaling laws (Kaplan et al., Hoffmann et al.), the authors raise the central question: Do analogous scaling laws exist for SR? If so, can they guide the design of next-generation SR models?
- Prior work primarily adjusts training details at a fixed scale, lacking systematic analysis of the scale–performance relationship.
## Method
### Overall Architecture
An end-to-end encoder–decoder Transformer architecture is adopted: the input is a table of numerical input–output pairs, and the output is a mathematical expression as a LaTeX string. The overall pipeline is:
- Data generation: Recursively generate a base expression set → insert random constants + sample datasets.
- Encoding: Each numerical value in the table is split into mantissa and exponent, then projected into an embedding space.
- Model inference: A table-aware encoder (row/column bidirectional attention) feeds into a standard autoregressive decoder to generate expressions.
- Evaluation: Sample 128 expressions and select the one with the highest \(R^2\) as the prediction.
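To make the last step concrete, here is a minimal sketch of the sample-and-select evaluation, assuming candidate expressions over \(x_1, x_2\); the helper names and the filtering of invalid candidates are my assumptions, not the paper's code:

```python
import numpy as np
import sympy as sp

def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination R^2."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def select_best_expression(candidates, X, y):
    """Among decoded candidate expression strings (e.g. the 128 samples
    drawn per problem), return the one with the highest R^2 on the data."""
    x1, x2 = sp.symbols("x1 x2")
    best_expr, best_r2 = None, -np.inf
    for expr_str in candidates:
        try:
            f = sp.lambdify((x1, x2), sp.sympify(expr_str), modules="numpy")
            y_pred = np.asarray(f(X[:, 0], X[:, 1]), dtype=float)
        except Exception:
            continue  # skip unparsable candidates
        if not np.all(np.isfinite(y_pred)):
            continue  # skip candidates with invalid numerics
        r2 = r2_score(y, y_pred)
        if r2 > best_r2:
            best_expr, best_r2 = expr_str, r2
    return best_expr, best_r2
```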
### Key Designs
- Two-step data generation (a generation sketch follows this list):
  - Step 1: Starting from the variables \(\{x_1, x_2\}\), recursively apply unary operators (exp, sin, neg, sqrt) and binary operators (+, −, ·, ÷) to generate all expression trees with depth ≤ 3; apply SymPy for canonicalization and deduplication, yielding \(|E| = 100{,}000\) base expressions.
  - Step 2: For each base expression, sample \(k = 3{,}600\) (expression, dataset) pairs: randomly insert integer constants (range −9 to 9, probability \(p = 0.2\)) and sample 64 data points from a Gaussian mixture distribution.
  - Advantage: avoids the over-sampling bias present in traditional approaches, yielding cleaner training data.
- Table-aware encoder architecture (a minimal encoder sketch also follows this list):
  - Traditional methods merge each input point into a single embedding; this work generates an independent embedding for every cell of the table.
  - Mantissa and exponent are each up-projected to the embedding dimension and summed.
  - Inspired by tabular foundation models (TabPFN, etc.), each layer performs both row attention (across variables) and column attention (across data points).
  - The decoder cross-attends only to the updated embeddings of the target cells.
- End-to-end training pipeline (a tokenization example follows this list):
- Expressions including constants are predicted directly, without BFGS post-processing.
- Target expressions are represented as LaTeX strings, with constants tokenized digit by digit.
- All model scales share the same data generation and evaluation protocol, ensuring fair scaling analysis.
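Below is a schematic of the two-step generator referenced above. It is a sketch under assumptions: the exact constant-insertion rule and the Gaussian-mixture parameters are not specified here, and `sympy.simplify` stands in for the paper's canonicalization step.

```python
import random
import numpy as np
import sympy as sp

x1, x2 = sp.symbols("x1 x2")
UNARY = [sp.exp, sp.sin, lambda e: -e, sp.sqrt]
BINARY = [lambda a, b: a + b, lambda a, b: a - b,
          lambda a, b: a * b, lambda a, b: a / b]

def base_expressions(max_depth: int = 3) -> set:
    """Step 1: enumerate expression trees over {x1, x2} up to max_depth,
    canonicalize with SymPy, and deduplicate via the set. Full enumeration
    is expensive; the paper reports |E| = 100,000 surviving expressions."""
    exprs = {x1, x2}
    for _ in range(max_depth):
        new = set()
        for e in exprs:
            for u in UNARY:
                new.add(sp.simplify(u(e)))
        for a in exprs:
            for b in exprs:
                for op in BINARY:
                    new.add(sp.simplify(op(a, b)))
        exprs |= new
    return exprs

def insert_constants(expr, p: float = 0.2):
    """Step 2a: with probability p, attach a random nonzero integer in
    [-9, 9] to a subexpression (one plausible reading of the procedure)."""
    if expr.args:
        expr = expr.func(*[insert_constants(a, p) for a in expr.args])
    if random.random() < p:
        expr = random.choice([k for k in range(-9, 10) if k != 0]) * expr
    return expr

def sample_dataset(expr, n_points: int = 64, n_components: int = 3):
    """Step 2b: draw 64 input points from a random Gaussian mixture and
    evaluate the expression; non-finite outputs are dropped."""
    means = 2.0 * np.random.randn(n_components, 2)
    comp = np.random.randint(n_components, size=n_points)
    X = means[comp] + np.random.randn(n_points, 2)
    f = sp.lambdify((x1, x2), expr, modules="numpy")
    with np.errstate(all="ignore"):
        y = np.broadcast_to(np.asarray(f(X[:, 0], X[:, 1]), dtype=float),
                            (n_points,))
    mask = np.isfinite(y)
    return X[mask], y[mask]
```

In the paper, Step 2 draws \(k = 3{,}600\) such (expression, dataset) pairs per base expression.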
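A minimal PyTorch sketch of the encoder's two core pieces, the per-cell embedding and one row/column attention layer, follows; norm placement, the feed-forward sublayer, and the exact exponent encoding are assumptions:

```python
import torch
import torch.nn as nn

class CellEmbedding(nn.Module):
    """Per-cell embedding: mantissa and exponent are each up-projected
    to the model dimension and summed, as described above."""
    def __init__(self, d_model: int):
        super().__init__()
        self.mantissa_proj = nn.Linear(1, d_model)
        self.exponent_proj = nn.Linear(1, d_model)

    def forward(self, mantissa: torch.Tensor, exponent: torch.Tensor):
        # both inputs: (batch, n_points, n_cols, 1)
        return self.mantissa_proj(mantissa) + self.exponent_proj(exponent)

class RowColumnBlock(nn.Module):
    """One encoder layer: attention across each row (the variables of one
    data point), then across each column (one variable over all points)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_row = nn.LayerNorm(d_model)
        self.norm_col = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        b, n, c, d = h.shape  # (batch, n_points, n_cols, d_model)
        # Row attention: mix information across columns within each point.
        r = h.reshape(b * n, c, d)
        r = self.norm_row(r + self.row_attn(r, r, r, need_weights=False)[0])
        h = r.reshape(b, n, c, d)
        # Column attention: mix information across points within each column.
        col = h.transpose(1, 2).reshape(b * c, n, d)
        col = self.norm_col(col + self.col_attn(col, col, col, need_weights=False)[0])
        return col.reshape(b, c, n, d).transpose(1, 2)
```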
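Finally, a tiny illustration of digit-by-digit constant tokenization for the LaTeX targets (a hypothetical tokenizer; the paper's vocabulary is not reproduced here):

```python
import re

def tokenize_latex(expr: str) -> list[str]:
    """Split a LaTeX target into tokens: LaTeX commands stay whole,
    numbers are split digit by digit, everything else is one character."""
    return re.findall(r"\\[A-Za-z]+|[0-9]|\S", expr)

print(tokenize_latex(r"3\sin{x_{1}} + 42"))
# ['3', '\\sin', '{', 'x', '_', '{', '1', '}', '}', '+', '4', '2']
```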
### Loss & Training
- Loss function: Standard cross-entropy loss between predicted tokens and ground-truth expression tokens.
- Optimizer: AdamCPR (\(\beta_1 = 0.9\), \(\beta_2 = 0.98\)) with linear warmup (first 5% of steps) followed by cosine annealing.
- FLOPs estimation: \(\text{FLOPs} \approx 6 \cdot (N_{enc} \cdot D_{in} + N_{dec} \cdot D_{out})\), where \(N_{enc}\) and \(N_{dec}\) denote the encoder and decoder feed-forward parameter counts (with total \(N = N_{enc} + N_{dec}\)), and \(D_{in}\), \(D_{out}\) are the numbers of input and output tokens processed (a small helper follows this list).
- Hyperparameter search strategy: For each model scale, batch size and learning rate are grid-searched at a token-to-parameter ratio of 20; the optimal configuration is then used to sweep ratios from 5 to 80.
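The compute estimate can be made concrete with a small helper; the parameter and token counts in the example are placeholders, not values from the paper:

```python
def training_flops(n_enc: float, n_dec: float, d_in: float, d_out: float) -> float:
    """FLOPs ~ 6 * (N_enc * D_in + N_dec * D_out): roughly 2 FLOPs per
    parameter per token in the forward pass plus 4 in the backward pass.
    Encoder parameters see the input (table) tokens; decoder parameters
    see the output (expression) tokens."""
    return 6.0 * (n_enc * d_in + n_dec * d_out)

# Hypothetical budget: 30M encoder / 15M decoder parameters, 1e9 input
# tokens and 2e8 output tokens processed over training.
print(f"{training_flops(30e6, 15e6, 1e9, 2e8):.2e}")  # 1.98e+17
```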
## Key Experimental Results
### Main Results
Detailed architectures of the five model scales (6.5M–93M parameters):
| Model | Dimension | Encoder Layers | Decoder Layers | Attention Heads | Parameters |
|---|---|---|---|---|---|
| XS | 256 | 3 | 3 | 4 | 6.48M |
| S | 320 | 4 | 4 | 5 | 13.40M |
| M | 384 | 5 | 5 | 6 | 24.01M |
| L | 448 | 7 | 7 | 7 | 45.53M |
| XL | 512 | 11 | 11 | 8 | 93.08M |
Best performance of each model at maximum compute budget:
| Model | Max FLOPs | \(\text{Acc}_{\text{solved}}\) | \(\text{Acc}_{R^2>0.99}\) | Val. Loss |
|---|---|---|---|---|
| 6.5M | 7.20e+16 | 0.149 | 0.526 | 0.424 |
| 13.5M | 2.88e+17 | 0.271 | 0.667 | 0.312 |
| 24M | 9.81e+17 | 0.378 | 0.762 | 0.240 |
| 45.5M | 3.53e+18 | 0.519 | 0.835 | 0.168 |
| 93M | 1.47e+19 | 0.597 | 0.883 | 0.105 |
### Ablation Study
- Token-to-parameter ratio sweep: Ratios from 5 to 80 are evaluated; the optimum is \(\approx 15\), with a slight upward trend as the compute budget increases, suggesting that data volume should grow slightly faster than model parameters (a worked example follows this list).
- Batch size scaling: The optimal batch size increases with model scale — 32 for 6.5M, 128 for 13.5M, and 256 for 93M.
- Learning rate scaling: The optimal learning rate increases with compute budget (4.6e-4 for 6.5M and 24M; 1.0e-3 for 93M), which is opposite to the trend observed in LLMs where learning rate decreases with scale.
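As a quick worked example of applying the ratio (arithmetic only; it assumes the ratio is taken over training tokens as defined in the paper):

\[
D^{*} \approx 15\,N \quad\Rightarrow\quad N = 93\text{M} \;\implies\; D^{*} \approx 15 \times 9.3 \times 10^{7} \approx 1.4 \times 10^{9}\ \text{tokens}.
\]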
### Key Findings
- Power-law scaling: \(\text{Acc}_{\text{solved}}\) grows from \(\sim 0.03\) at the lowest compute budget to \(\sim 0.60\) at the highest, following a clear power-law trend; extrapolation predicts 0.8 accuracy at \(3.8 \times 10^{21}\) FLOPs.
- \(\text{Acc}_{R^2>0.99}\) improves faster: Approximate matching is considerably easier than exact matching; the 93M model already achieves 0.883.
- No sign of saturation: The largest model continues to improve at the largest compute budget, suggesting that further scaling can yield additional gains.
## Highlights & Insights
- First scaling laws for SR: This work demonstrates that symbolic regression obeys power-law scaling analogous to LLMs, providing a new design principle for the field — systematic performance improvement through scale rather than elaborate tricks.
- Learning rate trend opposite to LLMs: The optimal learning rate increases with scale in SR, revealing fundamental differences in training dynamics across task types.
- Direct practical value of token-to-parameter ratio: The optimal ratio of \(\approx 15\) provides practitioners with an actionable heuristic for resource allocation.
- Data generation methodology: Recursive generation combined with canonicalization and deduplication outperforms traditional random sampling, ensuring uniform coverage of expression space.
- Table-aware encoder: The row/column bidirectional attention design represents an effective cross-domain transfer from tabular foundation models.
## Limitations & Future Work
- Limited expression complexity: Only expressions with ≤ 2 variables and small integer constants are considered; real-world SR tasks typically involve more variables and floating-point constants.
- Single-seed training: Due to compute constraints, each configuration is trained with a single seed, so run-to-run variance goes unquantified.
- Limited compute range: Three orders of magnitude provide limited reliability for extrapolation predictions.
- No comparison with existing SR methods: The focus is on scaling insights without verifying whether the approach surpasses GP or other deep SR methods.
- Directions for improvement:
- Extend to more variables, floating-point constants, and more complex operator sets.
- Verify whether end-to-end SR + improved data generation + scaling comprehensively outperforms alternative methods.
- Train at larger scale (> 100M parameters) to validate extrapolation predictions.
## Related Work & Insights
- E2E (Kamienny et al., NeurIPS 2022): The closest predecessor, the first end-to-end Transformer for SR; this work shifts focus from loss engineering to scaling effects.
- Biggio et al.: First proposed pretraining Transformers for SR, using a set encoder with BFGS constant optimization.
- Chinchilla (Hoffmann et al.): Compute-optimal scaling laws for language models; the methodology is directly borrowed by this work.
- TabPFN (Hollmann et al., Nature 2025): A tabular foundation model whose row/column attention architecture is adopted here.
- Core insight: Symbolic regression should follow the same paradigm as LLMs — systematically leveraging the benefits of scale rather than pursuing elaborate training tricks.
## Rating
- Novelty: ⭐⭐⭐⭐ — First to introduce scaling law analysis into symbolic regression, filling an important gap.
- Technical Depth: ⭐⭐⭐ — The methodology is relatively standard (Transformer + synthetic data); the primary contribution lies in experimental design and analysis.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Five model scales, three orders of magnitude of compute, and systematic hyperparameter sweeps, though single-seed training is a weakness.
- Value: ⭐⭐⭐⭐ — Token-to-parameter ratio and batch size/learning rate scaling trends are directly useful to SR practitioners.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear structure, well-designed figures, and prominent presentation of key findings.