# Large Language Bayes
- Conference: NeurIPS 2025
- arXiv: 2504.14025
- Code: To be confirmed
- Area: Optimization
- Keywords: LLM, Bayesian inference, probabilistic programming, model averaging, Stan, joint distribution
## TL;DR
This work mathematically "glues" an LLM and a probabilistic programming language (PPL/Stan) into a joint distribution \(p(z,x,m|t) = p(m|t)_{\text{LLM}} \cdot p(z,x|m)_{\text{PPL}}\). Given only an informal problem description and data, the system automatically samples candidate formal models from the LLM, performs Bayesian inference within each model, and produces a marginal-likelihood-weighted model average — requiring no user-written probabilistic model.
## Background & Motivation
Background: Bayesian inference requires users to specify formal models (e.g., prior distributions, likelihood functions), which demands substantial statistical expertise. LLMs can interpret natural-language descriptions but cannot perform rigorous probabilistic inference. Probabilistic programming languages (Stan, Pyro) support rigorous inference but require formal model specifications as input.
Limitations of Prior Work: The traditional Bayesian workflow requires a statistician to (1) understand the problem, (2) manually construct a model, (3) write PPL code, (4) run inference, and (5) diagnose and iterate. Steps 2–3 constitute the primary bottleneck. Existing LLM-for-statistics approaches merely prompt LLMs to generate code, without incorporating the LLM's uncertainty over model space into a statistical framework.
Key Challenge: While LLMs can infer modeling intent from natural language, their model outputs are merely code strings rather than components of a probability distribution. The key challenge is how to make the LLM's model selection an organic part of the Bayesian framework.
Goal: To develop a mathematically rigorous framework that unifies the model-generation capability of LLMs with the inference capability of PPLs into a single Bayesian inference problem.
Key Insight: Treat the LLM as a prior \(p(m|t)\) over model space and let the PPL handle the likelihood and posterior within a given model; the resulting joint distribution then defines model averaging naturally.
Core Idea: LLM as model prior + PPL as within-model inference = a complete Bayesian inference system, requiring only natural-language input from the user.
## Method
### Overall Architecture
The user provides a natural-language problem description \(t\) and data \(x\) → the LLM generates \(N\) candidate formal models \(m_1, \ldots, m_N\) (Stan code) → approximate inference (MCMC/VI) is run within the PPL for each \(m_i\) → all model posteriors are averaged using marginal likelihood weights \(p(x|m_i)\) → a predictive distribution is returned.
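Spelled out, the quantity this pipeline targets is the Bayesian-model-averaged posterior predictive implied by the joint distribution (the symbol \(x^{*}\) for a new observation is introduced here for exposition and is not necessarily the paper's notation):

\[
p(x^{*} \mid x, t) \;=\; \sum_{m} p(m \mid x, t)\, p(x^{*} \mid x, m),
\qquad
p(m \mid x, t) \;\propto\; p(m \mid t)\, p(x \mid m),
\]

with the sum over model space approximated by the \(N\) LLM-sampled models, each weighted by its estimated marginal likelihood \(p(x|m_i)\).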
### Key Designs
- Mathematical Gluing of the Joint Distribution:
  - Function: Defines \(p(z,x,m|t) = p(m|t)_{\text{LLM}} \cdot p(z,x|m)_{\text{PPL}}\)
  - Mechanism: \(m\) is the formal model generated by the LLM (Stan code), \(z\) denotes the within-model latent variables, and \(x\) is the observed data. The LLM supplies the model-space prior \(p(m|t)\), while the PPL supplies the conditional distribution \(p(z,x|m)\) within a given model.
  - Design Motivation: This construction unifies model-selection uncertainty and parameter uncertainty within a single Bayesian framework.
  - Posterior inference target: \(p(z,m|x,t) \propto p(m|t) \cdot p(x|m) \cdot p(z|x,m)\)
- Inference Recipe:
  - Function: Combines self-normalized importance sampling, MCMC, and importance-weighted variational inference.
  - Steps: (a) sample \(N\) candidate models from the LLM; (b) run Stan's MCMC (or VI) within each model to obtain draws from \(p(z|x,m_i)\); (c) estimate the marginal likelihood \(p(x|m_i)\) via bridge sampling or warp bridge sampling; (d) compute the weighted model average (a minimal code sketch follows this list).
  - Design Motivation: The model space is discrete and vast, making exhaustive enumeration infeasible. The LLM serves as an importance sampler over model space, with marginal likelihoods as importance weights.
- LLM Prompt Engineering:
  - Function: A system prompt is designed to encourage the LLM to reason about modeling strategies before generating Stan code.
  - Includes 6 in-context learning examples demonstrating the reasoning process from problem description to model selection.
  - Design Motivation: Encouraging reasoning prior to code generation improves both the quality and diversity of the generated models.
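A minimal sketch of steps (a)–(d), assuming hypothetical helpers `sample_stan_models` (the LLM call), `fit_stan` (MCMC), `log_marginal_likelihood` (a bridge-sampling estimate), and `predictive_draws`; these names are placeholders for illustration, not the paper's released code:

```python
import numpy as np

def llm_bayes(description, data, x_new, n_models=10, n_pred=1000):
    """Marginal-likelihood-weighted model average over LLM-proposed Stan models.

    Placeholder helpers (assumed, not from the paper's code):
      sample_stan_models(t, n)        -> list of N Stan program strings
      fit_stan(m, data)               -> posterior draws of z under model m
      log_marginal_likelihood(m, fit) -> bridge-sampling estimate of log p(x|m)
      predictive_draws(m, fit, x_new) -> 1-D array of draws from p(x*|x, m)
    """
    rng = np.random.default_rng()
    models = sample_stan_models(description, n_models)           # (a) m_i sampled from the LLM, i.e. from p(m|t)

    fits, log_evidence = [], []
    for m in models:
        try:
            fit = fit_stan(m, data)                               # (b) MCMC within model m_i
            log_evidence.append(log_marginal_likelihood(m, fit))  # (c) log p(x|m_i)
            fits.append((m, fit))
        except RuntimeError:
            continue                                              # skip models that fail to compile or sample

    # (d) Self-normalized weights. Because the m_i are drawn from p(m|t) itself,
    # the prior cancels and each weight reduces to the marginal likelihood p(x|m_i);
    # normalize in log space for numerical stability.
    log_w = np.asarray(log_evidence)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()

    # Model-averaged posterior predictive: mix per-model predictive draws with probabilities w.
    per_model = [predictive_draws(m, fit, x_new) for m, fit in fits]
    picks = rng.choice(len(per_model), size=n_pred, p=w)
    samples = np.array([rng.choice(per_model[i]) for i in picks])
    return samples, w
```

The log-space normalization matters in practice: estimated log evidences across candidate models can differ by hundreds of nats, so exponentiating them directly would overflow or underflow.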
### Loss & Training
No training is involved — the framework leverages existing pretrained LLMs (GPT-4) and PPLs (Stan). The contribution is an inference algorithm rather than a learned system.
## Key Experimental Results
### Main Results
| Evaluation Dimension | Result | Notes |
|---|---|---|
| Predictive validity | Meaningful predictions produced from informal descriptions | End-to-end validation |
| vs. naive averaging | Weighted averaging outperforms uniform averaging | Marginal likelihood weights are effective |
| Model diversity | LLM-generated models cover diverse modeling strategies | The strategy-discussion step in the system prompt is beneficial |
| Posterior coverage | 95% credible intervals achieve target coverage | Statistical consistency validated |
### Ablation Study
| Configuration | Key Finding | Notes |
|---|---|---|
| With/without "thinking" step | Models are more diverse when strategy discussion is included | Prompting affects model-space exploration |
| \(N\) (number of candidate models) | More models → better coverage, with diminishing returns | ~10–20 models are generally sufficient |
| Marginal likelihood estimation method | Warp bridge sampling is most stable | Bridge sampling may be unreliable for multimodal posteriors |
| Availability of \(p(m \mid t)\) | Uniform approximation remains effective | Commercial LLM APIs do not expose token log-probabilities |
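A plausible reading of the last row, using the importance-sampling view from the Method section (my gloss, not a claim quoted from the paper): when the candidate models are drawn from the LLM itself, the LLM is the proposal distribution, so the prior term cancels from the self-normalized importance weights and explicit values of \(p(m|t)\) are never required:

\[
w_i \;=\; \frac{p(m_i \mid t)\, p(x \mid m_i)}{q(m_i)} \;=\; p(x \mid m_i)
\quad \text{when } q(\cdot) = p(\cdot \mid t),
\qquad
\hat{p}(x^{*} \mid x, t) \;=\; \frac{\sum_i w_i\, p(x^{*} \mid x, m_i)}{\sum_j w_j}.
\]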
### Key Findings
- The framework produces meaningful posterior predictive distributions from natural-language descriptions across diverse tasks (regression, classification, time series).
- The LLM's implicit "prior" over models captures statistically reasonable modeling preferences.
- Bayesian model averaging automatically down-weights inappropriate models via marginal likelihoods.
## Highlights & Insights
- Mathematical Elegance: Treating the LLM as a prior over model space is a natural conceptual move, yet making it mathematically rigorous requires careful treatment of numerous subtleties (heterogeneous latent variable spaces, inaccessible \(p(m|t)\), etc.). This paper resolves all of these issues coherently.
- Natural Language as a Statistical Modeling Interface: From a broader perspective, this work opens the possibility for non-experts to conduct Bayesian analysis directly from natural-language descriptions — a significant step toward the democratization of statistics.
- Proper Treatment of Model Uncertainty: Unlike conventional AutoML, which selects a single best model, this paper performs Bayesian model averaging — a particularly valuable property when sample sizes are small and model selection is difficult.
## Limitations & Future Work
- \(p(m|t)\) is generally inaccessible via commercial LLM APIs, necessitating a uniform approximation.
- Chain-of-thought reasoning in the LLM renders \(p(m|t)\) even less tractable.
- Estimation of the marginal likelihood \(p(x|m)\) may be unreliable in high-dimensional models.
- Since latent variable spaces \(z\) differ across models, model averaging must be performed in predictive space rather than parameter space.
- LLM-generated Stan code may contain syntax errors or numerical instabilities.
- Scalability: MCMC becomes prohibitively slow for large datasets and complex models.
## Related Work & Insights
- vs. AutoML/CASH: AutoML searches for the optimal model from a candidate set; this paper performs Bayesian model averaging, explicitly retaining uncertainty over models.
- vs. LLM-for-code (Codex/GPT-4): Using LLMs solely for code generation does not address uncertainty in model selection; this paper integrates the generation process into a Bayesian framework.
- vs. BMA (Bayesian Model Averaging): Classical BMA requires a manually specified model set; this paper uses the LLM to generate the model set automatically.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ The joint distribution formulation combining LLMs and PPLs is a mathematically beautiful and pioneering framework.
- Experimental Thoroughness: ⭐⭐⭐ Proof-of-concept experiments are provided, but large-scale benchmark comparisons are lacking.
- Writing Quality: ⭐⭐⭐⭐⭐ Mathematical derivations are elegant and the problem motivation is clearly articulated.
- Value: ⭐⭐⭐⭐⭐ This work may inaugurate a new paradigm of "natural language → Bayesian inference."