Large Language Bayes

Conference: NeurIPS 2025 arXiv: 2504.14025 Code: To be confirmed Area: Optimization Keywords: LLM, Bayesian inference, probabilistic programming, model averaging, Stan, joint distribution

TL;DR

This work mathematically "glues" an LLM and a probabilistic programming language (PPL; here, Stan) into a joint distribution \(p(z,x,m|t) = p_{\text{LLM}}(m|t)\,p_{\text{PPL}}(z,x|m)\). Given only an informal problem description and data, the system automatically samples candidate formal models from the LLM, performs Bayesian inference within each model, and produces a marginal-likelihood-weighted model average, requiring no user-written probabilistic model.

Background & Motivation

Background: Bayesian inference requires users to specify formal models (e.g., prior distributions, likelihood functions), which demands substantial statistical expertise. LLMs can interpret natural-language descriptions but cannot perform rigorous probabilistic inference. Probabilistic programming languages (Stan, Pyro) support rigorous inference but require formal model specifications as input.

Limitations of Prior Work: The traditional Bayesian workflow requires a statistician to (1) understand the problem, (2) manually construct a model, (3) write PPL code, (4) run inference, and (5) diagnose and iterate. Steps 2–3 constitute the primary bottleneck. Existing LLM-for-statistics approaches merely prompt LLMs to generate code, without incorporating the LLM's uncertainty over model space into a statistical framework.

Key Challenge: While LLMs can infer modeling intent from natural language, their model outputs are merely code strings rather than components of a probability distribution. The key challenge is how to make the LLM's model selection an organic part of the Bayesian framework.

Goal: To develop a mathematically rigorous framework that unifies the model-generation capability of LLMs with the inference capability of PPLs into a single Bayesian inference problem.

Key Insight: Treat the LLM as a prior \(p(m|t)\) over model space, and let the PPL handle the likelihood and posterior within each model; the joint distribution then naturally defines model averaging.

Core Idea: LLM as model prior + PPL as within-model inference = a complete Bayesian inference system, requiring only natural-language input from the user.

Method

Overall Architecture

The user provides a natural-language problem description \(t\) and data \(x\) → the LLM generates \(N\) candidate formal models \(m_1, \ldots, m_N\) (Stan code) → approximate inference (MCMC/VI) is run within the PPL for each \(m_i\) → all model posteriors are averaged using marginal likelihood weights \(p(x|m_i)\) → a predictive distribution is returned.
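
Because the candidate models are themselves draws from the LLM prior, \(m_i \sim p(m|t)\), the prior terms cancel in the self-normalized importance-sampling estimate, leaving weights proportional to the marginal likelihoods alone. Writing \(x^\star\) for a new observation, the posterior predictive is approximated by the mixture

\[
p(x^\star|x,t) \;\approx\; \sum_{i=1}^{N} w_i \, p(x^\star|x,m_i), \qquad w_i = \frac{p(x|m_i)}{\sum_{j=1}^{N} p(x|m_j)}.
\]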

Key Designs

  1. Mathematical Gluing of the Joint Distribution:

    • Function: Defines \(p(z,x,m|t) = p_{\text{LLM}}(m|t)\,p_{\text{PPL}}(z,x|m)\)
    • Mechanism: \(m\) is the formal model generated by the LLM (Stan code), \(z\) denotes the within-model latent variables, and \(x\) is the observed data. The LLM supplies the model-space prior \(p(m|t)\), while the PPL supplies the conditional distribution \(p(z,x|m)\) within a given model.
    • Design Motivation: This construction unifies model-selection uncertainty and parameter uncertainty within a single Bayesian framework.
    • Posterior inference target: \(p(z,m|x,t) \propto p(m|t) \cdot p(x|m) \cdot p(z|x,m)\)
  2. Inference Recipe:

    • Function: Combines self-normalized importance sampling, MCMC, and importance-weighted variational inference.
    • Steps: (a) Sample \(N\) models from the LLM; (b) run Stan's MCMC (or VI) for each model to obtain \(p(z|x,m_i)\); (c) estimate the marginal likelihood \(p(x|m_i)\) via bridge sampling or warp bridge sampling; (d) compute the weighted average (see the sketch after this list).
    • Design Motivation: The model space is discrete and vast, making exhaustive enumeration infeasible. The LLM serves as an importance sampler over model space, with marginal likelihoods as importance weights.
  3. LLM Prompt Engineering:

    • Function: A system prompt is designed to encourage the LLM to reason about modeling strategies before generating Stan code.
    • Mechanism: Includes 6 in-context learning examples demonstrating the reasoning process from problem description to model selection.
    • Design Motivation: Encouraging reasoning prior to code generation improves both the quality and diversity of the generated models.
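
Step (c) above relies on the bridge-sampling identity: for any proposal density \(g(z)\) and bridge function \(h(z)\),

\[
p(x|m) \;=\; \frac{\mathbb{E}_{g(z)}\!\left[p(z,x|m)\,h(z)\right]}{\mathbb{E}_{p(z|x,m)}\!\left[g(z)\,h(z)\right]},
\]

with both expectations estimated from samples (warp bridge sampling first transforms the posterior draws so that \(g\) overlaps them better). A minimal Python sketch of the full recipe, assuming cmdstanpy as the Stan backend; `sample_stan_model_from_llm` and `log_marginal_likelihood` are hypothetical placeholders for the paper's prompted LLM call and its bridge-sampling estimator:

```python
import numpy as np
from cmdstanpy import CmdStanModel

def sample_stan_model_from_llm(description: str) -> str:
    """Hypothetical placeholder: one LLM call (system prompt + in-context
    examples) returning a candidate Stan program, i.e. a draw m_i ~ p(m|t)."""
    raise NotImplementedError

def log_marginal_likelihood(fit) -> float:
    """Hypothetical placeholder for the (warp) bridge-sampling estimate of
    log p(x|m_i) computed from the posterior draws in `fit`."""
    raise NotImplementedError

def llm_bayes(description: str, data: dict, n_models: int = 10):
    fits, log_ml = [], []
    for i in range(n_models):
        stan_code = sample_stan_model_from_llm(description)   # (a) m_i ~ p(m|t)
        path = f"candidate_{i}.stan"
        with open(path, "w") as f:
            f.write(stan_code)
        try:
            fit = CmdStanModel(stan_file=path).sample(data=data)  # (b) p(z|x,m_i)
        except Exception:
            continue  # discard models that fail to compile or sample
        fits.append(fit)
        log_ml.append(log_marginal_likelihood(fit))           # (c) log p(x|m_i)
    # (d) Self-normalized weights w_i ∝ p(x|m_i), computed stably in log space.
    lw = np.array(log_ml) - max(log_ml)
    w = np.exp(lw)
    return fits, w / w.sum()
```

Sampling from the returned mixture (choose model \(i\) with probability \(w_i\), then draw from that model's posterior predictive) yields draws from the model-averaged predictive distribution.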

Loss & Training

No training is involved — the framework leverages existing pretrained LLMs (GPT-4) and PPLs (Stan). The contribution is an inference algorithm rather than a learned system.

Key Experimental Results

Main Results

| Evaluation Dimension | Result | Notes |
| --- | --- | --- |
| Predictive validity | Meaningful predictions produced from informal descriptions | End-to-end validation |
| vs. naive averaging | Weighted averaging outperforms uniform averaging | Marginal likelihood weights are effective |
| Model diversity | LLM-generated models cover diverse modeling strategies | The strategy-discussion step in the system prompt is beneficial |
| Posterior coverage | 95% credible intervals achieve target coverage | Statistical consistency validated |

Ablation Study

| Configuration | Key Finding | Notes |
| --- | --- | --- |
| With/without "thinking" step | Models are more diverse when strategy discussion is included | Prompting affects model-space exploration |
| \(N\) (number of candidate models) | More models → better coverage, with diminishing returns | ~10–20 models are generally sufficient |
| Marginal likelihood estimation method | Warp bridge sampling is most stable | Bridge sampling may be unreliable for multimodal posteriors |
| Availability of \(p(m\mid t)\) | A uniform approximation remains effective | Needed because commercial APIs do not expose log-probabilities |

Key Findings

  • The framework produces meaningful posterior predictive distributions from natural-language descriptions across diverse tasks (regression, classification, time series).
  • The LLM's implicit "prior" over models captures statistically reasonable modeling preferences.
  • Bayesian model averaging automatically down-weights inappropriate models via marginal likelihoods.

Highlights & Insights

  • Mathematical Elegance: Treating the LLM as a prior over model space is a natural conceptual move, yet making it mathematically rigorous requires careful treatment of numerous subtleties (heterogeneous latent variable spaces, inaccessible \(p(m|t)\), etc.). This paper resolves all of these issues coherently.
  • Natural Language as a Statistical Modeling Interface: From a broader perspective, this work opens the possibility for non-experts to conduct Bayesian analysis directly from natural-language descriptions — a significant step toward the democratization of statistics.
  • Proper Treatment of Model Uncertainty: Unlike conventional AutoML, which selects a single best model, this paper performs Bayesian model averaging — a particularly valuable property when sample sizes are small and model selection is difficult.

Limitations & Future Work

  • \(p(m|t)\) is generally inaccessible via commercial LLM APIs, necessitating a uniform approximation.
  • Chain-of-thought reasoning in the LLM renders \(p(m|t)\) even less tractable, since the probability of the emitted code must be marginalized over all possible reasoning traces.
  • Estimation of the marginal likelihood \(p(x|m)\) may be unreliable in high-dimensional models.
  • Since latent variable spaces \(z\) differ across models, model averaging must be performed in predictive space rather than parameter space.
  • LLM-generated Stan code may contain syntax errors or numerical instabilities.
  • Scalability: MCMC becomes prohibitively slow for large datasets and complex models.

Comparison with Related Approaches

  • vs. AutoML/CASH: AutoML searches for a single optimal model from a candidate set; this paper performs Bayesian model averaging, explicitly retaining uncertainty over models.
  • vs. LLM-for-code (Codex/GPT-4): Using LLMs solely for code generation does not address uncertainty in model selection; this paper integrates the generation process into a Bayesian framework.
  • vs. BMA (Bayesian Model Averaging): Classical BMA requires a manually specified model set; this paper uses the LLM to generate the model set automatically.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The joint distribution formulation combining LLMs and PPLs is a mathematically beautiful and pioneering framework.
  • Experimental Thoroughness: ⭐⭐⭐ Proof-of-concept experiments are provided, but large-scale benchmark comparisons are lacking.
  • Writing Quality: ⭐⭐⭐⭐⭐ Mathematical derivations are elegant and the problem motivation is clearly articulated.
  • Value: ⭐⭐⭐⭐⭐ This work may inaugurate a new paradigm of "natural language → Bayesian inference."