Large Language Bayes

Conference: NeurIPS 2025 arXiv: 2504.14025 Code: To be confirmed Area: Optimization Keywords: LLM, Bayesian inference, probabilistic programming, model averaging, Stan, joint distribution

TL;DR

This work mathematically "glues" an LLM and a probabilistic programming language (PPL; here, Stan) into a joint distribution \(p(z,x,m|t) = p_{\text{LLM}}(m|t)\,p_{\text{PPL}}(z,x|m)\). Given only an informal problem description and data, the system automatically samples candidate formal models from the LLM, performs Bayesian inference within each model, and produces a marginal-likelihood-weighted model average, requiring no user-written probabilistic model.

Background & Motivation

Background: Bayesian inference requires users to specify formal models (e.g., prior distributions, likelihood functions), which demands substantial statistical expertise. LLMs can interpret natural-language descriptions but cannot perform rigorous probabilistic inference. Probabilistic programming languages (Stan, Pyro) support rigorous inference but require formal model specifications as input.

Limitations of Prior Work: The traditional Bayesian workflow requires a statistician to (1) understand the problem, (2) manually construct a model, (3) write PPL code, (4) run inference, and (5) diagnose and iterate. Steps 2–3 constitute the primary bottleneck. Existing LLM-for-statistics approaches merely prompt LLMs to generate code, without incorporating the LLM's uncertainty over model space into a statistical framework.

Key Challenge: While LLMs can infer modeling intent from natural language, their model outputs are merely code strings rather than components of a probability distribution. The key challenge is how to make the LLM's model selection an organic part of the Bayesian framework.

Goal: To develop a mathematically rigorous framework that unifies the model-generation capability of LLMs with the inference capability of PPLs into a single Bayesian inference problem.

Key Insight: Treat the LLM as a prior \(p(m|t)\) over model space, and let the PPL handle the likelihood and posterior within each model; the joint distribution then naturally defines model averaging.

Core Idea: LLM as model prior + PPL as within-model inference = a complete Bayesian inference system, requiring only natural-language input from the user.

Method

Overall Architecture

The user provides a natural-language problem description \(t\) and data \(x\) → the LLM generates \(N\) candidate formal models \(m_1, \ldots, m_N\) (Stan code) → approximate inference (MCMC/VI) is run within the PPL for each \(m_i\) → all model posteriors are averaged using marginal likelihood weights \(p(x|m_i)\) → a predictive distribution is returned.
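
Because the candidate models are themselves draws from the LLM prior, \(m_i \sim p(m|t)\), the prior terms cancel in the self-normalized importance-sampling estimate, leaving weights proportional to the marginal likelihoods alone. Writing \(x^\star\) for a new observation, the posterior predictive is approximated by the mixture

\[
p(x^\star|x,t) \;\approx\; \sum_{i=1}^{N} w_i \, p(x^\star|x,m_i), \qquad w_i = \frac{p(x|m_i)}{\sum_{j=1}^{N} p(x|m_j)}.
\]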

Key Designs

  1. Mathematical Gluing of the Joint Distribution:

    • Function: Defines \(p(z,x,m|t) = p_{\text{LLM}}(m|t)\,p_{\text{PPL}}(z,x|m)\)
    • Mechanism: \(m\) is the formal model generated by the LLM (Stan code), \(z\) denotes the within-model latent variables, and \(x\) is the observed data. The LLM supplies the model-space prior \(p(m|t)\), while the PPL supplies the conditional distribution \(p(z,x|m)\) within a given model.
    • Design Motivation: This construction unifies model-selection uncertainty and parameter uncertainty within a single Bayesian framework.
    • Posterior inference target: \(p(z,m|x,t) \propto p(m|t) \cdot p(x|m) \cdot p(z|x,m)\)
  2. Inference Recipe:

    • Function: Combines self-normalized importance sampling, MCMC, and importance-weighted variational inference.
    • Steps: (a) Sample \(N\) models from the LLM; (b) run Stan's MCMC (or VI) for each model to obtain \(p(z|x,m_i)\); (c) estimate the marginal likelihood \(p(x|m_i)\) via bridge sampling or warp bridge sampling; (d) compute the weighted average (see the sketch after this list).
    • Design Motivation: The model space is discrete and vast, making exhaustive enumeration infeasible. The LLM serves as an importance sampler over model space, with marginal likelihoods as importance weights.
  3. LLM Prompt Engineering:

    • Function: A system prompt is designed to encourage the LLM to reason about modeling strategies before generating Stan code.
    • Mechanism: Includes 6 in-context learning examples demonstrating the reasoning process from problem description to model selection.
    • Design Motivation: Encouraging reasoning prior to code generation improves both the quality and diversity of the generated models.
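
Step (c) above relies on the bridge-sampling identity: for any proposal density \(g(z)\) and bridge function \(h(z)\),

\[
p(x|m) \;=\; \frac{\mathbb{E}_{g(z)}\!\left[p(z,x|m)\,h(z)\right]}{\mathbb{E}_{p(z|x,m)}\!\left[g(z)\,h(z)\right]},
\]

with both expectations estimated from samples (warp bridge sampling first transforms the posterior draws so that \(g\) overlaps them better). A minimal Python sketch of the full recipe, assuming cmdstanpy as the Stan backend; `sample_stan_model_from_llm` and `log_marginal_likelihood` are hypothetical placeholders for the paper's prompted LLM call and its bridge-sampling estimator:

```python
import numpy as np
from cmdstanpy import CmdStanModel

def sample_stan_model_from_llm(description: str) -> str:
    """Hypothetical placeholder: one LLM call (system prompt + in-context
    examples) returning a candidate Stan program, i.e. a draw m_i ~ p(m|t)."""
    raise NotImplementedError

def log_marginal_likelihood(fit) -> float:
    """Hypothetical placeholder for the (warp) bridge-sampling estimate of
    log p(x|m_i) computed from the posterior draws in `fit`."""
    raise NotImplementedError

def llm_bayes(description: str, data: dict, n_models: int = 10):
    fits, log_ml = [], []
    for i in range(n_models):
        stan_code = sample_stan_model_from_llm(description)   # (a) m_i ~ p(m|t)
        path = f"candidate_{i}.stan"
        with open(path, "w") as f:
            f.write(stan_code)
        try:
            fit = CmdStanModel(stan_file=path).sample(data=data)  # (b) p(z|x,m_i)
        except Exception:
            continue  # discard models that fail to compile or sample
        fits.append(fit)
        log_ml.append(log_marginal_likelihood(fit))           # (c) log p(x|m_i)
    # (d) Self-normalized weights w_i ∝ p(x|m_i), computed stably in log space.
    lw = np.array(log_ml) - max(log_ml)
    w = np.exp(lw)
    return fits, w / w.sum()
```

Sampling from the returned mixture (choose model \(i\) with probability \(w_i\), then draw from that model's posterior predictive) yields draws from the model-averaged predictive distribution.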

Loss & Training

No training is involved — the framework leverages existing pretrained LLMs (GPT-4) and PPLs (Stan). The contribution is an inference algorithm rather than a learned system.

Key Experimental Results

Main Results

| Evaluation Dimension | Result | Notes |
| --- | --- | --- |
| Predictive validity | Meaningful predictions produced from informal descriptions | End-to-end validation |
| vs. naive averaging | Weighted averaging outperforms uniform averaging | Marginal likelihood weights are effective |
| Model diversity | LLM-generated models cover diverse modeling strategies | The strategy-discussion step in the system prompt is beneficial |
| Posterior coverage | 95% credible intervals achieve target coverage | Statistical consistency validated |

Ablation Study

| Configuration | Key Finding | Notes |
| --- | --- | --- |
| With/without "thinking" step | Models are more diverse when strategy discussion is included | Prompting affects model-space exploration |
| \(N\) (number of candidate models) | More models → better coverage, with diminishing returns | ~10–20 models are generally sufficient |
| Marginal likelihood estimation method | Warp bridge sampling is most stable | Bridge sampling may be unreliable for multimodal posteriors |
| Availability of \(p(m\mid t)\) | A uniform approximation remains effective | Needed because commercial APIs do not expose log-probabilities |

Key Findings

  • The framework produces meaningful posterior predictive distributions from natural-language descriptions across diverse tasks (regression, classification, time series).
  • The LLM's implicit "prior" over models captures statistically reasonable modeling preferences.
  • Bayesian model averaging automatically down-weights inappropriate models via marginal likelihoods.

Highlights & Insights

  • Mathematical Elegance: Treating the LLM as a prior over model space is a natural conceptual move, yet making it mathematically rigorous requires careful treatment of numerous subtleties (heterogeneous latent variable spaces, inaccessible \(p(m|t)\), etc.). This paper resolves all of these issues coherently.
  • Natural Language as a Statistical Modeling Interface: From a broader perspective, this work opens the possibility for non-experts to conduct Bayesian analysis directly from natural-language descriptions — a significant step toward the democratization of statistics.
  • Proper Treatment of Model Uncertainty: Unlike conventional AutoML, which selects a single best model, this paper performs Bayesian model averaging — a particularly valuable property when sample sizes are small and model selection is difficult.

Limitations & Future Work

  • \(p(m|t)\) is generally inaccessible via commercial LLM APIs, necessitating a uniform approximation.
  • Chain-of-thought reasoning in the LLM renders \(p(m|t)\) even less tractable, since the probability of the emitted code must be marginalized over all possible reasoning traces.
  • Estimation of the marginal likelihood \(p(x|m)\) may be unreliable in high-dimensional models.
  • Since latent variable spaces \(z\) differ across models, model averaging must be performed in predictive space rather than parameter space.
  • LLM-generated Stan code may contain syntax errors or numerical instabilities.
  • Scalability: MCMC becomes prohibitively slow for large datasets and complex models.

Comparison with Related Approaches

  • vs. AutoML/CASH: AutoML searches for a single optimal model from a candidate set; this paper performs Bayesian model averaging, explicitly retaining uncertainty over models.
  • vs. LLM-for-code (Codex/GPT-4): Using LLMs solely for code generation does not address uncertainty in model selection; this paper integrates the generation process into a Bayesian framework.
  • vs. BMA (Bayesian Model Averaging): Classical BMA requires a manually specified model set; this paper uses the LLM to generate the model set automatically.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The joint distribution formulation combining LLMs and PPLs is a mathematically beautiful and pioneering framework.
  • Experimental Thoroughness: ⭐⭐⭐ Proof-of-concept experiments are provided, but large-scale benchmark comparisons are lacking.
  • Writing Quality: ⭐⭐⭐⭐⭐ Mathematical derivations are elegant and the problem motivation is clearly articulated.
  • Value: ⭐⭐⭐⭐⭐ This work may inaugurate a new paradigm of "natural language → Bayesian inference."