Transfer Learning Beyond the Standard Model¶
- Conference: NeurIPS 2025
- arXiv: 2510.19168
- Code: none released (uses the publicly available Quijote simulation suite)
- Area: Physics
- Keywords: Transfer Learning, Cosmological Inference, ΛCDM, Negative Transfer, Foundation Models
TL;DR¶
This work investigates whether neural networks pre-trained on the standard cosmological model (ΛCDM) can transfer to beyond-standard-model scenarios (massive neutrinos, modified gravity, primordial non-Gaussianity). The study finds that a dummy node architecture can reduce simulation requirements by an order of magnitude, but negative transfer emerges when parameters exhibit strong physical degeneracies (e.g., \(\sigma_8\)–\(M_\nu\)).
Background & Motivation¶
Background: Simulation-based inference (SBI) has been successfully applied to ΛCDM cosmological parameter inference. A central goal of Stage-IV surveys (e.g., DESI) is to detect new physics beyond the standard model—massive neutrinos, modified gravity, and primordial non-Gaussianity.
Limitations of Prior Work: Simulations for beyond-ΛCDM scenarios are far more computationally expensive than ΛCDM simulations and must cover a substantially larger parameter space, forming the primary bottleneck for inference.
Key Challenge: Training inference models requires large numbers of costly beyond-ΛCDM simulations, yet computational budgets are limited.
Goal: To assess whether transfer learning via ΛCDM pre-training followed by beyond-ΛCDM fine-tuning can reduce the number of required beyond-ΛCDM simulations.
Key Insight: Drawing an analogy to the foundation model paradigm—ΛCDM as the "foundation model" and beyond-ΛCDM tasks as "downstream tasks."
Core Idea: A dummy node is appended to the output layer of the pre-trained network, providing latent capacity that receives no supervision during pre-training and can be repurposed during fine-tuning to learn new-physics parameters. The same setup also exposes negative transfer induced by physical parameter degeneracies.
Method¶
Overall Architecture¶
Pre-training on Quijote ΛCDM simulations (32,768 samples) → weight transfer (frozen or fine-tuned) → fine-tuning on a small number of beyond-ΛCDM simulations (50–2,000) → evaluation of parameter-inference MSE.
Key Designs¶
- Dummy Node Architecture (see the sketch after this list):
  - Function: additional "dummy" nodes are appended to the output layer during pre-training.
  - Mechanism: during pre-training, the network outputs the 5 ΛCDM parameters plus \(N\) dummy nodes, with the MSE loss computed only over the ΛCDM parameters; during fine-tuning, the dummy nodes are repurposed to output the new-physics parameters (e.g., \(M_\nu\), \(f_{R0}\)).
  - Design Motivation: the dummy nodes acquire spare representational capacity during pre-training that can be reused to learn new-physics signals during fine-tuning, analogous to modular head designs in foundation models.
- Comparison of Three Transfer Architectures:
  - Dummy node: best-performing; provides additional representational capacity.
  - No-dummy (weight initialization only): second-best; outputs for the new parameters are initialized from scratch.
  - Attach head (frozen pre-trained network + appended inference head): worst; the frozen representations are too rigid to adapt.
- Three Beyond-ΛCDM Scenarios:
  - Massive neutrinos, \(M_\nu \in [0.01, 1.0]\) eV: strongly degenerate with \(\sigma_8\).
  - Modified gravity \(f(R)\): \(f_{R0} \in [-3\times10^{-4}, 0]\).
  - Primordial non-Gaussianity: equilateral \(f_{NL} \in [-600, 600]\) and local \(f_{NL} \in [-300, 300]\).
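Since the paper releases no code, a minimal PyTorch sketch of the dummy-node scheme may help fix ideas. Only the 79-bin \(P(k)\) input, the 5 ΛCDM outputs, the appended dummy nodes, and the masked pre-training loss follow the paper's description; the hidden sizes, the class name `DummyNodeMLP`, and `N_DUMMY = 1` are illustrative assumptions.

```python
import torch
import torch.nn as nn

N_PK_BINS = 79   # input: 79-bin matter power spectrum (from the paper)
N_LCDM = 5       # output: 5 LCDM parameters (from the paper)
N_DUMMY = 1      # number of dummy nodes (illustrative choice)

class DummyNodeMLP(nn.Module):
    """Fully connected regressor with dummy output nodes appended.

    The paper uses simple fully connected networks with Optuna-tuned
    hyperparameters; the widths below are placeholders.
    """
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(N_PK_BINS, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.head = nn.Linear(128, N_LCDM + N_DUMMY)

    def forward(self, pk):
        return self.head(self.backbone(pk))

def pretrain_loss(pred, theta_lcdm):
    # MSE over the 5 LCDM parameters only: the dummy nodes receive
    # no supervision during pre-training.
    return ((pred[:, :N_LCDM] - theta_lcdm) ** 2).mean()

def finetune_loss(pred, theta_all):
    # After transfer, the dummy node is repurposed to predict the
    # new-physics parameter (e.g., M_nu), so the loss covers all outputs.
    return ((pred - theta_all) ** 2).mean()
```

In this reading, the dummy-node variant fine-tunes the whole network on the small beyond-ΛCDM set, whereas the attach-head baseline keeps the pre-trained backbone frozen, which is consistent with the rigidity described above.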
Loss & Training¶
- MSE loss, AdamW optimizer (\(\beta_1=0.5\), \(\beta_2=0.999\)).
- Learning rates searched over \([10^{-5}, 10^{-1}]\) for pre-training and over the more conservative \([10^{-6}, 10^{-3}]\) for fine-tuning.
- Optuna hyperparameter search (100 trials).
- Input: 79-bin matter power spectrum \(P(k)\), \(k \in [0.0089,\, 0.5]\ h/\mathrm{Mpc}\).
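A hedged sketch of how this hyperparameter search could be wired up with Optuna and AdamW, reusing `DummyNodeMLP` and `pretrain_loss` from the sketch above. The synthetic data and the `train_and_validate` helper are placeholders; the optimizer betas, learning-rate range, and trial count match the values listed above.

```python
import optuna
import torch

def train_and_validate(model, optimizer, n_steps=200):
    # Placeholder loop on synthetic data standing in for Quijote P(k)
    # spectra and LCDM parameters; returns the final-step MSE as a
    # stand-in for a proper validation score.
    x = torch.randn(512, N_PK_BINS)
    y = torch.randn(512, N_LCDM)
    for _ in range(n_steps):
        optimizer.zero_grad()
        loss = pretrain_loss(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()

def objective(trial: optuna.Trial) -> float:
    # Log-uniform search over the pre-training range [1e-5, 1e-1];
    # fine-tuning would use the more conservative [1e-6, 1e-3].
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    model = DummyNodeMLP()                        # from the previous sketch
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  betas=(0.5, 0.999))
    return train_and_validate(model, optimizer)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)           # 100 trials, as in the paper
```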
Key Experimental Results¶
Main Results — Simulation Efficiency¶
| Beyond-ΛCDM Scenario | Transfer Learning Effect | Simulation Savings |
|---|---|---|
| Massive neutrinos \(P(k)\) | Total MSE substantially improved | ~10× |
| Massive neutrinos, marked power spectrum \(MP(k)\) | Negative transfer on \(\sigma_8\) and \(M_\nu\) | Uncertain |
| Modified gravity f(R) | Substantially improved | ~10× |
| Equilateral \(f_{NL}\) | Consistent improvement | Significant |
| Local \(f_{NL}\) | No improvement (attributed to prior mismatch) | None |
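For context on the negative-transfer row (a standard linear-theory result, not derived in the paper): massive neutrinos suppress small-scale matter power by roughly

\[
\frac{\Delta P(k)}{P(k)} \approx -8\, f_\nu, \qquad f_\nu \equiv \frac{\Omega_\nu}{\Omega_m},
\]

so at the level of the power spectrum a nonzero \(M_\nu\) mimics a lower \(\sigma_8\); this is the degeneracy that the pre-trained \(\sigma_8\) mapping must partially unlearn.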
Ablation Study — Architecture Comparison¶
| Architecture | Total MSE Performance | Degree of Negative Transfer |
|---|---|---|
| Dummy node | Best | Mild (only under \(\sigma_8\)–\(M_\nu\) degeneracy) |
| No-dummy | Second-best | Moderate |
| Attach head | Worst | Severe (negative transfer also in total MSE) |
Key Findings¶
- Dummy node consistently optimal: outperforms the no-transfer baseline in total MSE across all scenarios.
- Negative transfer driven by physical degeneracy: the signals of \(\sigma_8\) and \(M_\nu\) in the marked power spectrum overlap substantially, requiring the pre-trained \(\sigma_8\) mapping to be "unlearned" before \(M_\nu\) can be learned.
- SHAP analysis reveals the mechanism: during pre-training, small-scale power-spectrum information is used to infer \(\sigma_8\); after fine-tuning, the same information is reassigned to \(M_\nu\), and the SHAP values for \(\sigma_8\) reverse sign (a sketch of this kind of analysis follows this list).
- Benefits emerge with as few as 2,000 pre-training simulations: the full 32K simulation set is not required; a small pre-training corpus already confers transfer advantages.
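To make the SHAP finding concrete, here is a minimal sketch of such an attribution analysis with the `shap` package, reusing the model from the earlier sketch. The choice of `DeepExplainer` and the stand-in data are assumptions, since the paper does not specify its SHAP setup.

```python
import shap
import torch

# Stand-in inputs; in the paper these would be Quijote P(k) spectra.
background = torch.randn(100, N_PK_BINS)   # reference distribution
x_eval = torch.randn(32, N_PK_BINS)        # samples to explain

model = DummyNodeMLP()                     # trained weights would be loaded here
model.eval()

explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(x_eval)

# Depending on the shap version, attributions come back as one array per
# output node or as a single stacked array. Comparing the sigma_8
# attributions before and after fine-tuning would show the reported sign
# reversal on the small-scale k-bins.
```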
Highlights & Insights¶
- The double-edged nature of the foundation model paradigm in physics: pre-training can accelerate inference but may also bias representations—"pre-training on large standard-model datasets can dramatically reduce costs, but may also bias representations in ways that hinder the discovery of new physics."
- Negative transfer as a physical signal: the emergence of negative transfer itself reflects the physical degeneracy structure of the parameter space and can serve as a diagnostic tool.
- Elegance of the dummy node design: conceptually simple yet highly effective, offering an architectural pattern applicable to a broad range of transfer learning tasks.
Limitations & Future Work¶
- Only simple fully connected networks are used; more expressive architectures (e.g., normalizing flows) are not evaluated.
- Only the matter power spectrum is considered; realistic observables (galaxy clustering, weak lensing) are not validated.
- The failure on local \(f_{NL}\) is attributed to prior mismatch rather than the method itself.
- Systematic errors and observational noise are not considered.
Related Work & Insights¶
- vs. Multi-fidelity SBI (Thiele 2025; Saoulis 2025): those works transfer between different fidelity levels of the same physics; this paper transfers between different physical models, a more challenging setting.
- vs. Foundation models (BERT, CLIP): the dummy node is analogous to a modular head design; this paper demonstrates that such a paradigm is also effective for physical inference.
- Implication: any application of foundation models to scientific inference should be alert to negative transfer—particularly when new parameters are degenerate with parameters encountered during pre-training.
Rating¶
- Novelty: ⭐⭐⭐⭐ First systematic study of transfer from the standard cosmological model to beyond-standard-model scenarios; the negative transfer finding is of independent value.
- Experimental Thoroughness: ⭐⭐⭐⭐ Four beyond-ΛCDM scenarios + three architectures + SHAP analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear structure with a well-balanced presentation of physical intuition and machine learning methodology.
- Value: ⭐⭐⭐⭐⭐ Carries broad cautionary implications for the application of foundation models to physical inference.