Adaptive Querying with AI Persona Priors

Conference: ICML 2026
arXiv: 2605.00696
Code: https://github.com/yw3453/adaptive-query-ai-persona-priors (available)
Area: Bayesian Experimental Design / Adaptive Querying / LLM Applications
Keywords: AI Persona, Bayesian Adaptive Querying, Digital Twin, Adaptive Testing, Cold Start

TL;DR

The authors package the distribution of LLM responses under persona conditioning as a finite-mixture Bayesian prior. Because the posterior over personas updates in closed form, the method can predict a user's remaining responses after only a few questions, and it outperforms classic CAT/IRT baselines.

Background & Motivation

Background: Adaptive querying is a core tool in scenarios such as computerized adaptive testing (CAT), surveys, and recommendation cold start. Mainstream approaches either follow Item Response Theory (IRT/CAT), parameterizing item-user relationships with low-dimensional latent traits, or use neural Bayesian Experimental Design (BED), employing amortized inference or variational approximations in more flexible models.

Limitations of Prior Work: IRT/CAT's trait dimension is too low and requires large-scale historical calibration data for each item; new items require recalibration. Neural BED is flexible but needs to train surrogate/policy networks and still requires nested Monte Carlo integration at deployment, which is slow in real time. Both approaches struggle in cold start scenarios for users or items.

Key Challenge: The trade-off between expressiveness (capturing high-dimensional heterogeneous response patterns) and computational tractability (real-time closed-form posterior updates). Expressiveness requires complex models, while tractability demands low-dimensional parameterizations.

Goal: Construct a prior that simultaneously achieves (1) high expressiveness (capturing real user response diversity), (2) closed-form posterior updates, and (3) no need for extensive calibration data per item.

Key Insight: LLMs, when injected with persona profiles, can simulate the response distribution of specific groups. By precomputing the response distribution for each persona × item offline using a persona dictionary, "which persona a user belongs to" can be treated as a discrete latent variable \(\theta \in \{1,\dots,n\}\), reducing the generative model to a finite mixture distribution.

Core Idea: Use LLM-generated persona response distributions as components of a finite mixture prior, transforming Bayesian adaptive querying into closed-form posterior updates and one-step lookahead entropy minimization over discrete latent variables.

Method

Overall Architecture

The method consists of an offline and an online phase. Offline: given a persona dictionary (here, \(n=2058\) real US respondent profiles from Twin-2K-500), GPT-5-mini is prompted for each persona \(\xi_\theta\) and item \(x\) to produce a \(K\)-class response distribution \(\mu_{\theta,x} \in \Delta^{K-1}\), and all distributions are cached as a lookup table. Online: for a new user, inference starts from the persona prior \(p(\theta)\) (estimated via EM on training users). At each step the method selects an item \(x_{t+1}\) based on the history \(h_t\), observes the response \(Y_{x_{t+1}}\), updates the persona posterior in closed form, and uses the mixture to predict the response distributions of the target items \(I^\star\). Once the budget is exhausted, the final predictions are scored with log loss, Brier score, and ordinal MSE.
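The online update described above reduces to a few array operations. The following is a minimal sketch, not the authors' code: the function names `posterior_update` and `predictive`, the array layout `mu[theta, x, k]`, the toy sizes, and the small constant added inside the logarithm are all illustrative assumptions.

```python
import numpy as np

def posterior_update(log_post, mu, item, response):
    """One closed-form Bayes step over personas.

    log_post: (n,) log posterior over personas given the history so far
    mu:       (n, m, K) cached table, mu[theta, x, k] = p(Y_x = k | theta)
    """
    # Multiply by the categorical likelihood of the observed answer (in log space).
    log_post = log_post + np.log(mu[:, item, response] + 1e-12)
    # Renormalize so the posterior sums to one.
    return log_post - np.logaddexp.reduce(log_post)

def predictive(log_post, mu, item):
    """Mixture predictive p(Y_item = k | history): a finite sum over personas."""
    return np.exp(log_post) @ mu[:, item, :]

# Toy lookup table standing in for the LLM-generated distributions (assumption).
rng = np.random.default_rng(0)
n, m, K = 6, 4, 3
mu = rng.dirichlet(np.ones(K), size=(n, m))   # shape (n, m, K)

log_post = np.log(np.full(n, 1.0 / n))        # uniform prior over personas
log_post = posterior_update(log_post, mu, item=0, response=2)
pred = predictive(log_post, mu, item=1)       # predicted distribution for item 1
```

Working in log space keeps the product of many small likelihood terms numerically stable when the history grows long.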

Key Designs

  1. Persona-induced Latent Variable Model:

    • Function: Replaces the "continuous low-dimensional ability trait" in traditional IRT with "discrete persona membership," with LLM providing \(p(Y_x \mid \theta)\).
    • Mechanism: Under the conditional independence assumption \(p(\theta, Y)=p(\theta)\prod_i p(Y_i \mid \theta)\), since \(\theta\) is discrete and item likelihood is categorical, the posterior \(p(\theta \mid Y_{I_t}) \propto p(\theta)\prod_{i \in I_t}\mu_{\theta,i,Y_i}\) is fully closed-form; the predictive distribution \(p(Y_x=k \mid Y_{I_t})=\sum_\theta \mu_{\theta,x,k}\,p(\theta\mid Y_{I_t})\) is also a finite sum.
    • Design Motivation: Completely avoids nested Monte Carlo and variational approximations, achieving both "flexible prior" and "real-time inference" in a single model; each persona retains interpretable semantic labels, facilitating downstream user clustering.
  2. Greedy One-step Lookahead Adaptive Querying:

    • Function: At each step, selects from the feasible item set \(\mathcal{I}_{\text{feas}} \setminus I_t\) the item that most reduces posterior uncertainty over the targets.
    • Mechanism: Uncertainty is the sum of marginal entropies over the target items, \(U(P_t)=\sum_{x' \in I^\star} H(Y_{x'} \mid h_t)\). For each candidate \(x\), the method computes the expected post-query uncertainty \(\Delta_U(x \mid h_t) = \sum_k p(Y_x=k\mid Y_{I_t})\sum_{x' \in I^\star} H(Y_{x'}\mid h_t, Y_x=k)\) and selects the candidate that minimizes it, which is the same as maximizing the expected reduction from \(U(P_t)\). Under the persona model, \(p(Y_x \mid Y_{I_t})\) and \(H(Y_{x'} \mid \ldots)\) are finite sums over personas, so the greedy procedure is efficient.
    • Design Motivation: Classic BED requires high-dimensional integration for predictive distributions, making one-step lookahead infeasible for large item pools; the persona model removes this bottleneck, making greedy algorithms practical at scale.
  3. Empirical Bayes Prior Learning + Scoring Rule Evaluation:

    • Function: Fits the persona prior \(p(\theta)\) via EM on real data, mitigating model misspecification from mismatch between synthetic persona and real populations.
    • Mechanism: Maximizes the marginal likelihood of training users \(\sum_j \log \sum_\theta p(\theta)\,p(Y^{(j)}\mid\theta)\); the E-step computes responsibilities \(\gamma_{j,\theta}\propto p(\theta)p(Y^{(j)}\mid\theta)\), and the M-step sets \(p(\theta)\) to the average responsibility. Evaluation uses proper scoring rules paired with their matching uncertainty measures (log loss with Shannon entropy, Brier score with Gini impurity), so the uncertainty measure used for item selection and the evaluation metric are mathematically aligned.
    • Design Motivation: The synthetic persona dictionary is inevitably misspecified for real populations; EM concentrates mass on the personas best matching training users, effectively "softly selecting a useful persona subset," improving robustness.
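The greedy one-step lookahead from design 2 can be sketched in a few lines. This is an illustration under assumptions, not the paper's implementation: the helper names `entropy`, `expected_uncertainty`, and `greedy_select`, and the tiny two-persona table, are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy along the last axis (natural log)."""
    return -np.sum(p * np.log(p + 1e-12), axis=-1)

def expected_uncertainty(post, mu, x, targets):
    """Expected summed target entropy after querying item x.

    post: (n,) current persona posterior
    mu:   (n, m, K) cached table, mu[theta, x, k] = p(Y_x = k | theta)
    """
    joint = post[:, None] * mu[:, x, :]        # (n, K): p(theta, Y_x = k | h_t)
    p_y = joint.sum(axis=0)                    # (K,):   p(Y_x = k | h_t)
    post_k = joint / np.maximum(p_y, 1e-12)    # (n, K): p(theta | h_t, Y_x = k)
    # Target predictives under each hypothetical answer: (K, n_targets, K)
    pred = np.einsum("nk,ntc->ktc", post_k, mu[:, targets, :])
    # Average the summed target entropies over the possible answers.
    return float(np.sum(p_y * entropy(pred).sum(axis=1)))

def greedy_select(post, mu, candidates, targets):
    """Pick the candidate with the lowest expected post-query uncertainty."""
    scores = [expected_uncertainty(post, mu, x, targets) for x in candidates]
    return candidates[int(np.argmin(scores))]

# Toy table: item 0 is the target, item 1 is uninformative, item 2 separates personas.
mu = np.array([
    [[0.9, 0.1], [0.5, 0.5], [0.95, 0.05]],
    [[0.1, 0.9], [0.5, 0.5], [0.05, 0.95]],
])
post = np.array([0.5, 0.5])
chosen = greedy_select(post, mu, candidates=[1, 2], targets=[0])
```

In this toy table the selection falls on item 2, because item 1 has identical response distributions under both personas and therefore cannot change the posterior.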

Loss & Training

No gradient-based training. Training occurs in two places: (1) offline LLM prompting to extract \(\mu_{\theta,x}\); (2) EM estimation of the prior on real users. All online querying is based on closed-form Bayesian updates and greedy search. CAT baselines (GRM/GPCM and multidimensional variants) are trained via EM for item parameters, then use grid-based posterior inference.
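The EM estimation of the persona prior mentioned above could look like the following sketch. It assumes every training user answered every item, and the function name `fit_prior_em`, the stopping rule, and the simulated data are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def fit_prior_em(mu, Y, n_iter=200, tol=1e-8):
    """Empirical-Bayes fit of the persona prior p(theta) by EM.

    mu: (n, m, K) cached table, mu[theta, i, k] = p(Y_i = k | theta)
    Y:  (J, m) integer responses of J training users (fully observed, an assumption)
    """
    n, m, _ = mu.shape
    log_mu = np.log(mu + 1e-12)
    # loglik[j, theta] = sum_i log mu[theta, i, Y[j, i]]
    loglik = log_mu[:, np.arange(m), Y].sum(axis=-1).T       # (J, n)
    prior = np.full(n, 1.0 / n)
    prev = -np.inf
    for _ in range(n_iter):
        log_joint = loglik + np.log(prior + 1e-300)
        log_marg = np.logaddexp.reduce(log_joint, axis=1)     # per-user marginal
        resp = np.exp(log_joint - log_marg[:, None])          # E-step: responsibilities
        prior = resp.mean(axis=0)                             # M-step: average responsibility
        total = log_marg.sum()
        if total - prev < tol:                                # marginal likelihood has plateaued
            break
        prev = total
    return prior

# Simulated check: users drawn from only two of three personas (toy data).
rng = np.random.default_rng(0)
n, m, K, J = 3, 20, 4, 500
mu = rng.dirichlet(np.ones(K), size=(n, m))
true_prior = np.array([0.8, 0.2, 0.0])
theta = rng.choice(n, size=J, p=true_prior)
Y = np.array([[rng.choice(K, p=mu[t, i]) for i in range(m)] for t in theta])
prior_hat = fit_prior_em(mu, Y)
```

Because the likelihood table is fixed, only the mixture weights are updated, which is why each iteration is a single pass over the precomputed `loglik` matrix.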

Key Experimental Results

Main Results

Experiments use WorldValuesBench (91 items, 88,459 users, 4-point Likert responses) plus 100,000 synthetic users. Five items serve as prediction targets and the remaining 86 form the query pool, with budgets \(T \in \{5, 10, 20, 40, 86\}\).
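For reference, the two proper scoring rules used in the evaluation can be computed as in this sketch. The function names and the example predictions are assumptions. The Brier score here uses the multiclass convention that sums squared errors over all classes. Ordinal MSE is left out because its exact definition is not given in this summary.

```python
import numpy as np

def log_loss(pred, y):
    """Mean negative log-probability assigned to the observed categories.

    pred: (N, K) predictive probabilities; y: (N,) integer labels.
    """
    return float(-np.mean(np.log(pred[np.arange(len(y)), y] + 1e-12)))

def brier(pred, y):
    """Mean multiclass Brier score: squared distance to the one-hot label."""
    onehot = np.eye(pred.shape[1])[y]
    return float(np.mean(np.sum((pred - onehot) ** 2, axis=1)))

# Two hypothetical predictions on a 4-point scale.
pred = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.25, 0.25, 0.25, 0.25]])
y = np.array([0, 3])
scores = (log_loss(pred, y), brier(pred, y))
```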

| Setting | Method | Log loss at \(T=5\) | Log loss at \(T=20\) | Notes |
| --- | --- | --- | --- | --- |
| Synthetic users (well-specified) | Greedy (persona) | Best | Best | Significantly lower than CAT; curve close to Full oracle |
| Synthetic users | Non-adaptive Bayesian Design | Second best | Second best | Adaptive advantage clear on synthetic data |
| Synthetic users | CAT/IRT series | Clearly worse | Still worse | Structural model misspecification |
| Real WVB | Greedy (persona, EM prior) | Best | Comparable to non-adaptive | Adaptive superior at low budget |
| Real WVB | Non-adaptive (persona) | Second | Can surpass greedy | More robust at high budget, less affected by misspecification |
| Real WVB | CAT (GRM/GPCM/M-) | Worse | Worse | Even with 70k training users |

Ablation Study

| Configuration | Phenomenon | Interpretation |
| --- | --- | --- |
| Greedy + EM prior | Best on real data | EM prior effectively mitigates mismatch between persona dictionary and real population |
| Greedy + Uniform prior | Noticeably worse than EM | Performance degrades without training data, but still matches CAT |
| Random / Random Fixed (persona model) | Moderate | Validates independent contributions of "querying strategy" and "persona model" |
| Full (all 86 items queried) | Near upper bound but not absolutely optimal | With misspecification, more observations do not always yield better predictions |

Key Findings

  • On well-specified synthetic data, the persona model structurally outperforms CAT: CAT's low-dimensional trait assumption mismatches the data-generating process.
  • On real WVB, greedy is most effective at low budget (\(T \le 10\)), but as the budget increases, non-adaptive design can surpass it. This is a typical case of overconfident greedy inference being misled by early errors under model misspecification.
  • The EM-fitted persona prior concentrates mass on a small subset of personas, effectively "automatically selecting a subset" from the original 2058 personas, which is crucial for real population inference.
  • CAT loses to the persona method even with the advantage of 70k training users; when items lack calibration data, CAT is unusable, while the persona method only requires an additional LLM prompt to incorporate new items.

Highlights & Insights

  • Elevates "LLM as simulator" to "LLM as generator of Bayesian model components," a compelling perspective shift: heuristic persona simulation becomes probabilistic inference with proper posterior.
  • Discrete latent variables + categorical likelihood yield closed-form posteriors as finite sums, circumventing the nested MC challenge long faced by the BED community. It is an underrated example of a structural modeling choice buying computational tractability.
  • "No need to retrain item parameters when expanding the item pool" is a system-level advantage and a natural strength of an LLM-derived prior, transferable to recommendation cold start, diagnostics, psychometric scale generation, etc.
  • The phenomenon of "greedy being surpassed by non-adaptive under misspecification" offers a practical engineering reminder: adaptive is not a panacea; model mismatch can turn greedy into a noise amplifier.

Limitations & Future Work

  • The quality of persona response distributions from LLM directly determines prior quality; for domains unfamiliar to LLMs (low-resource languages, specialized questionnaires), offline distributions may be poor.
  • Currently supports only categorical items; continuous/ordinal items require extending the likelihood form.
  • One-step lookahead greedy degrades with long budgets; the paper mentions possible RL multi-step planning, but this is left for future work.
  • Fixed persona dictionary is a potential bottleneck: as user populations shift, the dictionary must be updated or expanded; otherwise, even EM cannot remedy misspecification.

Comparison with Related Work

  • vs Classic CAT/IRT: CAT uses continuous low-dimensional traits + item parameters; this work uses discrete personas + LLM-provided likelihoods. The former requires extensive calibration data per item, the latter only a single prompt.
  • vs Neural BED (Foster et al. 2021, Ivanova et al. 2021): Neural BED learns amortized surrogate/policy networks but loses the exact posterior; this work retains it.
  • vs Collaborative Filtering: CF applies similarity or matrix factorization to existing ratings; this work has an explicit generative model, closed-form Bayesian updates, and active item selection, and does not require historical ratings from the target population.
  • vs Persona-based Simulation (Argyle/Aher/Horton): These works use LLM personas as heuristic simulation tools; this work embeds persona outputs into a Bayesian model, so predictions come from proper posterior inference rather than ad hoc simulation.

Rating

  • Novelty: ⭐⭐⭐⭐ The perspective of using persona discrete latent variables to make BED inference closed-form is clear and practical, though individual components have prior art.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Both synthetic and real experiments, multiple baselines, and scoring rules are covered, though item types are limited to 4-category Likert.
  • Writing Quality: ⭐⭐⭐⭐ Problem motivation and mathematical derivations are clean; the correspondence between Bayesian and scoring rules is clearly explained.
  • Value: ⭐⭐⭐⭐ Direct practical value for recommendation cold start, surveys, psychometrics, especially in scenarios with frequent item pool updates.