Adaptive Querying with AI Persona Priors

Conference: ICML 2026
arXiv: 2605.00696
Code: https://github.com/yw3453/adaptive-query-ai-persona-priors (Available)
Area: Bayesian Experimental Design / Adaptive Querying / LLM Applications
Keywords: AI Persona, Bayesian Adaptive Querying, Digital Twins, Adaptive Testing, Cold Start

TL;DR

The authors treat "the distribution of LLM-generated responses conditioned on a persona" as a component of a finite-mixture Bayesian prior. After asking only a few questions, the persona posterior is updated in closed form and used to predict the user's responses to the remaining items, outperforming classical CAT/IRT baselines.

Background & Motivation

Background: Adaptive querying is a core tool in scenarios such as Computerized Adaptive Testing (CAT), questionnaire surveys, and recommendation cold starts. Mainstream solutions either follow Item Response Theory (IRT/CAT), which parameterizes item-user relationships with low-dimensional latent traits, or use Neural Bayesian Experimental Design (BED) for amortized inference or variational approximation on more flexible models.

Limitations of Prior Work: IRT/CAT traits have low dimensionality and require large-scale historical calibration data for every item; new items entering the pool require recalibration. Neural BED is flexible but requires training surrogate or policy networks and performing nested Monte Carlo integration during deployment, leading to poor real-time performance. Both paths are difficult to apply in cold-start user and cold-start item scenarios.

Key Challenge: The trade-off between expressivity (capturing high-dimensional heterogeneous response patterns) and computability (real-time closed-form posterior updates). Achieving expressivity requires complex models, while maintaining computability often limits the approach to low-dimensional parametric models.

Goal: To construct a prior that simultaneously possesses (1) high expressivity (capturing the diversity of real user response patterns), (2) closed-form posterior updates, and (3) no requirement for large-scale calibration data for items.

Key Insight: LLMs can simulate response distributions for specific groups after being injected with persona profiles. By using a persona dictionary to offline pre-calculate the response distribution for each persona \(\times\) each item, "which persona the user belongs to" can be treated as a discrete latent variable \(\theta \in \{1,\dots,n\}\), thereby reducing the generative model to a finite mixture distribution.

Core Idea: Using LLM-generated persona response distributions as components of a finite mixture prior, transforming Bayesian adaptive querying into closed-form posterior updates on discrete latent variables and one-step-ahead entropy minimization.

Method

Overall Architecture

The method is divided into offline and online stages. Offline stage: A persona dictionary is utilized (this paper uses \(n=2058\) real US respondent profiles from Twin-2K-500); for each persona \(\xi_\theta\) and each item \(x\), GPT-5-mini is prompted to obtain a \(K\)-category response distribution \(\mu_{\theta,x} \in \Delta^{K-1}\), all of which are cached in a lookup table. Online stage: Initialize the persona prior \(p(\theta)\) for a new user (estimated via EM from training users). In each step, an item is selected based on history \(h_t\), response \(Y_{x_{t+1}}\) is observed, the persona posterior is updated in closed form, and the response distribution for the target item \(I^\star\) is predicted using the mixture distribution. After the budget is exhausted, final predictions are made and log loss / Brier / ordinal MSE are calculated.
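
The online update described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the array name `mu` (shape personas × items × categories), the function names, and the toy numbers are all hypothetical, and the lookup table is assumed to be strictly positive so that logarithms are finite.

```python
import numpy as np

def update_posterior(log_post, mu, item, response):
    """One closed-form Bayes step on the persona posterior.

    log_post: log p(theta | history), shape (n_personas,)
    mu:       cached lookup table, mu[theta, item, k] = p(Y_item = k | theta)
              (assumed strictly positive here)
    """
    log_post = log_post + np.log(mu[:, item, response])
    # Renormalize in log space for numerical stability.
    return log_post - np.logaddexp.reduce(log_post)

def predict(log_post, mu, item):
    """Mixture predictive: sum_theta mu[theta, item, k] * p(theta | history)."""
    return np.exp(log_post) @ mu[:, item, :]

# Toy table with made-up numbers: 3 personas, 2 items, K = 2 categories.
mu = np.array([
    [[0.9, 0.1], [0.8, 0.2]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.1, 0.9], [0.2, 0.8]],
])
log_prior = np.log(np.full(3, 1 / 3))

# The user answers category 0 on item 0; then predict item 1.
log_post = update_posterior(log_prior, mu, item=0, response=0)
posterior = np.exp(log_post)
pred_item1 = predict(log_post, mu, item=1)
print(posterior, pred_item1)
```

Each update costs one vector multiply over the \(n\) personas, which is why no sampling or variational approximation is needed at query time.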

Key Designs

  1. Persona-induced Latent Variable Model:

    • Function: Replaces "continuous low-dimensional ability traits" in traditional IRT with "discrete persona membership" and uses the LLM to provide \(p(Y_x \mid \theta)\).
    • Mechanism: Under the conditional independence assumption \(p(\theta, Y)=p(\theta)\prod_i p(Y_i \mid \theta)\), \(\theta\) is discrete and each item likelihood is categorical, so the posterior \(p(\theta \mid Y_{I_t}) \propto p(\theta)\prod_{i \in I_t}\mu_{\theta,i,Y_i}\) is available in closed form; the predictive distribution \(p(Y_x=k \mid Y_{I_t})=\sum_\theta \mu_{\theta,x,k}\,p(\theta\mid Y_{I_t})\) is likewise a finite sum over personas.
    • Design Motivation: Completely bypass nested Monte Carlo and variational approximations to achieve both a "flexible prior" and "real-time inference" simultaneously; additionally, each persona retains interpretable semantic labels, facilitating downstream user clustering.
  2. Greedy One-step-ahead Adaptive Querying:

    • Function: Selects the item from the feasible set \(\mathcal{I}_{\text{feas}} \setminus I_t\) that best compresses the target posterior uncertainty at each step.
    • Mechanism: The sum of marginal entropies over the target item set serves as the uncertainty measure, \(U(P_t)=\sum_{x' \in I^\star} H(Y_{x'} \mid h_t)\). For each candidate \(x\), compute the expected post-query uncertainty \(\Delta_U(x \mid h_t) = \sum_k p(Y_x=k\mid Y_{I_t})\sum_{x' \in I^\star} H(Y_{x'}\mid h_t, Y_x=k)\) and select the candidate that minimizes it. Because the persona model makes both \(p(Y_x \mid Y_{I_t})\) and \(H(Y_{x'} \mid \ldots)\) finite sums over personas, the entire greedy step is cheap to evaluate.
    • Design Motivation: Since classical BED requires high-dimensional integration for the predictive distribution, one-step-ahead is essentially unusable under large-scale item banks; the persona model removes this bottleneck, making the greedy algorithm—previously restricted to toy scales—truly practical.
  3. Empirical Bayes Prior Learning + Scoring Rule Evaluation:

    • Function: Performs EM fitting of the persona prior \(p(\theta)\) on real data to mitigate model misspecification where "synthetic personas do not match the real population."
    • Mechanism: Maximize the marginal likelihood of the training users, \(\sum_j \log \sum_\theta p(\theta)\,p(Y^{(j)}\mid\theta)\). The E-step computes the responsibilities \(\gamma_{j,\theta}\propto p(\theta)\,p(Y^{(j)}\mid\theta)\), and the M-step sets \(p(\theta)\) to the average responsibility across users. Predictions are evaluated with proper scoring rules (log loss, which pairs with Shannon entropy, and the Brier score, which pairs with Gini impurity), so the objective being optimized and the evaluation metric correspond mathematically.
    • Design Motivation: Synthetic persona dictionaries are inevitably misspecified for real populations; EM concentrates the model's mass on a few personas that best match the training users, effectively "softly selecting a useful subset of personas," resulting in better robustness.
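
The greedy one-step-ahead rule from design 2 can be sketched as follows. This is an illustrative sketch, not the paper's code: the names `expected_uncertainty` and `greedy_pick`, the table layout `mu[theta, item, k]`, and the toy numbers are assumptions, and Shannon entropy is used as the uncertainty measure.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats along the last axis (clipped to avoid log(0))."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def expected_uncertainty(post, mu, x, targets):
    """Delta_U(x | h_t): expected sum of target-item entropies after asking item x.

    post:    current persona posterior p(theta | h_t), shape (n_personas,)
    mu:      lookup table, mu[theta, item, k] = p(Y_item = k | theta)
    targets: list of target item indices (the set I*)
    """
    pred_x = post @ mu[:, x, :]                  # p(Y_x = k | h_t)
    total = 0.0
    for k, p_k in enumerate(pred_x):
        if p_k == 0.0:
            continue                             # impossible answer, contributes nothing
        post_k = post * mu[:, x, k] / p_k        # posterior if the answer were k
        target_pred = np.einsum("n,ntk->tk", post_k, mu[:, targets, :])
        total += p_k * entropy(target_pred).sum()
    return total

def greedy_pick(post, mu, candidates, targets):
    """Return the candidate item with the smallest expected uncertainty."""
    scores = [expected_uncertainty(post, mu, x, targets) for x in candidates]
    return candidates[int(np.argmin(scores))]

# Toy table (made-up numbers): 2 personas, 3 items, K = 2.
# Item 0 is uninformative, item 1 separates the personas, item 2 is the target.
mu = np.array([
    [[0.5, 0.5], [0.9, 0.1], [0.9, 0.1]],
    [[0.5, 0.5], [0.1, 0.9], [0.1, 0.9]],
])
post = np.array([0.5, 0.5])
chosen = greedy_pick(post, mu, candidates=[0, 1], targets=[2])
print(chosen)
```

In the toy table, asking the uninformative item leaves the target entropy at \(\ln 2\), while asking the discriminating item lowers it, so the rule selects item 1. The cost per step is one pass over candidates × categories × personas, with no Monte Carlo integration.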

Loss & Training

No gradient-based training is involved. The only fitting steps are (1) offline LLM prompting to extract \(\mu_{\theta,x}\) and (2) EM estimation of the persona prior from real users. Online querying relies entirely on closed-form Bayesian updates and greedy search. The CAT baselines (GRM/GPCM and their multidimensional variants) follow the usual convention: item parameters are fitted with EM, and inference then uses gridded posteriors.
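
The EM step for the persona prior can be sketched as below. This is a hedged sketch under assumptions rather than the released code: the function names, the stopping rule, and the toy data are hypothetical, every training user is assumed to have answered every item, and the lookup table is assumed strictly positive.

```python
import numpy as np

def user_log_lik(mu, responses):
    """log p(Y^(j) | theta) for every user j and persona theta.

    mu:        lookup table, mu[theta, item, k] (assumed strictly positive)
    responses: integer array, responses[j, i] = category chosen by user j on item i
    Returns an array of shape (n_users, n_personas).
    """
    n_items = responses.shape[1]
    # mu[:, i, responses[:, i]] has shape (n_personas, n_users); transpose and sum over items.
    return sum(np.log(mu[:, i, responses[:, i]]).T for i in range(n_items))

def fit_prior_em(log_lik, n_iter=200, tol=1e-10):
    """EM for the mixture weights p(theta), with the components mu held fixed."""
    n_personas = log_lik.shape[1]
    prior = np.full(n_personas, 1.0 / n_personas)
    for _ in range(n_iter):
        # E-step: responsibilities gamma[j, theta] ∝ p(theta) * p(Y^(j) | theta).
        log_joint = log_lik + np.log(prior + 1e-300)
        log_joint -= log_joint.max(axis=1, keepdims=True)
        resp = np.exp(log_joint)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: the new prior is the average responsibility across users.
        new_prior = resp.mean(axis=0)
        converged = np.abs(new_prior - prior).max() < tol
        prior = new_prior
        if converged:
            break
    return prior

# Toy data (made-up): 2 personas, 1 item, K = 2; three users answer 0, one answers 1.
mu = np.array([[[0.9, 0.1]], [[0.1, 0.9]]])
responses = np.array([[0], [0], [0], [1]])
prior = fit_prior_em(user_log_lik(mu, responses))
print(prior)
```

Only the \(n\) mixture weights are learned here; the per-persona response distributions stay as the LLM produced them, which is what keeps this step light compared with calibrating item parameters.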

Key Experimental Results

Main Results

WorldValuesBench (91 items, 88,459 users, 4-point Likert) + 100,000 synthetic users, 5 items as prediction targets, remaining 86 items as the queryable set, budget \(T \in \{5, 10, 20, 40, 86\}\).

| Setting | Method | Log loss at \(T=5\) | Log loss at \(T=20\) | Note |
|---|---|---|---|---|
| Synthetic Users (well-specified) | Greedy (persona) | Best | Best | Significantly lower than CAT; curve approaches Full oracle |
| Synthetic Users | Non-adaptive Bayesian Design | Second best | Second best | Adaptive advantage is evident with synthetic data |
| Synthetic Users | CAT/IRT Series | Trailing significantly | Still trailing | Structural model misspecification |
| Real WVB | Greedy (persona, EM prior) | Best | Comparable to non-adaptive | Adaptive dominates at small budgets |
| Real WVB | Non-adaptive (persona) | Second | Can exceed greedy | More robust at large budgets, less affected by misspecification |
| Real WVB | CAT (GRM/GPCM/M-) | Trailing | Trailing | Even with 70,000 training users |

Ablation Study

| Configuration | Phenomenon | Interpretation |
|---|---|---|
| Greedy + EM prior | Optimal on real data | EM prior effectively mitigates the mismatch between persona dictionary and real population |
| Greedy + Uniform prior | Significant gap from EM | Performance degrades without training data, but still remains competitive with CAT |
| Random / Random Fixed (persona model) | Moderate | Validates the independent contributions of the "querying strategy" and "persona model" |
| Full (all 86 items queried) | Near upper bound but not absolute best | Under misspecified settings, more observations do not necessarily lead to better predictions |

Key Findings

  • On well-specified synthetic data, the persona model structurally outperforms CAT: the low-dimensional traits assumed by CAT do not align with the data generation process.
  • On real WVB, greedy is most effective for small budgets (\(T \le 10\)), but as the budget increases, non-adaptive design can surpass greedy—a typical phenomenon of greedy overconfidence and bias from early incorrect inferences in misspecified models.
  • The EM-fitted persona prior concentrates mass on very few personas, effectively "auto-selecting a subset" from the 2058 original personas, which is crucial for real-population inference.
  • CAT still loses to the persona method even when given 70k training users; when items have no calibration data, CAT is unusable, whereas the persona method only needs one more LLM prompt to incorporate new items.

Highlights & Insights

  • Upgrading "LLM as a simulator" to "LLM generating components of a Bayesian model" is an elegant perspective shift: heuristic persona simulation becomes probabilistic inference with proper posteriors.
  • Discrete latent variables + categorical likelihood make the closed-form posterior a finite sum, bypassing the nested MC difficulties the BED community has been solving. It is an underrated case of "structure choice as computing power."
  • Treating "no need to retrain item parameters when the item bank expands" as a system-level selling point is highly practical—this is the natural advantage of LLM-priors, transferable to recommendation cold starts, medical diagnosis, psychological scale generation, etc.
  • The phenomenon where "greedy is overtaken by non-adaptive under misspecification" provides a useful engineering reminder: adaptive is not a panacea; model mismatch can turn greedy into a noise amplifier.

Limitations & Future Work

  • The quality of LLM-provided persona response distributions directly determines prior quality; for domains unfamiliar to LLMs (low-resource languages, specialized questionnaires), offline distributions may be poor.
  • Currently only supports categorical items; continuous/ordinal items require extending the likelihood form.
  • One-step-ahead greedy degrades over long budgets; the paper mentions replacing it with RL multi-step planning but does not implement it, leaving it for future work.
  • The fixed persona dictionary is a potential bottleneck: as user populations change, the dictionary needs constant updates or expansion, otherwise even EM cannot save the model from misspecification.

Comparison with Related Work

  • vs Classical CAT/IRT: CAT uses continuous low-dimensional traits + item parameters; this paper uses discrete personas + LLM-given likelihoods. The former needs large-scale calibration data per item, while the latter needs only a prompt.
  • vs Neural BED (Foster et al. 2021, Ivanova et al. 2021): Neural BED learns amortized surrogate/policy networks but loses exact posteriors; this paper retains exact posteriors.
  • vs Collaborative Filtering: CF uses similarity/matrix factorization for existing ratings; this paper has an explicit generative model, closed-form Bayesian updates, and active item selection, and does not require target population historical ratings.
  • vs persona-based simulation (Argyle/Aher/Horton): They use LLM personas as heuristic simulation tools; this paper embeds persona outputs into a Bayesian model, granting statistical guarantees of Bayesian inference.

Rating

  • Novelty: ⭐⭐⭐⭐ The perspective of using persona discrete latent variables to make BED inference closed-form is clear and practical, although individual components have precedents.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Synthetic and real experiments, multiple baselines, and multiple scoring rules are covered; the main limitation is that all items are 4-point Likert questions.
  • Writing Quality: ⭐⭐⭐⭐ The problem motivation and mathematical derivations are clean and efficient; the correspondence between Bayesian inference and scoring rules is well-explained.
  • Value: ⭐⭐⭐⭐ Directly applicable to recommendation cold starts, questionnaires, psychometrics, etc., especially in scenarios where item banks are frequently updated.