Adaptive Querying with AI Persona Priors
Conference: ICML 2026
arXiv: 2605.00696
Code: https://github.com/yw3453/adaptive-query-ai-persona-priors (Available)
Area: Bayesian Experimental Design / Adaptive Querying / LLM Applications
Keywords: AI Persona, Bayesian Adaptive Querying, Digital Twins, Adaptive Testing, Cold Start
TL;DR
The authors package "the distribution of LLM-generated responses conditioned on a persona" as a component of a finite mixture Bayesian prior. After only a few questions, the persona posterior is updated in closed form and used to predict the user's remaining responses, outperforming classical CAT/IRT baselines.
Background & Motivation
Background: Adaptive querying is a core tool in scenarios such as Computerized Adaptive Testing (CAT), questionnaire surveys, and recommendation cold starts. Mainstream solutions either follow Item Response Theory (IRT/CAT), which parameterizes item-user relationships with low-dimensional latent traits, or use Neural Bayesian Experimental Design (BED) for amortized inference or variational approximation on more flexible models.
Limitations of Prior Work: IRT/CAT traits have low dimensionality and require large-scale historical calibration data for every item; new items entering the pool require recalibration. Neural BED is flexible but requires training surrogate or policy networks and performing nested Monte Carlo integration during deployment, leading to poor real-time performance. Both paths are difficult to apply in cold-start user and cold-start item scenarios.
Key Challenge: The trade-off between expressivity (capturing high-dimensional heterogeneous response patterns) and computability (real-time closed-form posterior updates). Achieving expressivity requires complex models, while maintaining computability often limits the approach to low-dimensional parametric models.
Goal: To construct a prior that simultaneously possesses (1) high expressivity (capturing the diversity of real user response patterns), (2) closed-form posterior updates, and (3) no requirement for large-scale calibration data for items.
Key Insight: LLMs can simulate response distributions for specific groups after being injected with persona profiles. By using a persona dictionary to offline pre-calculate the response distribution for each persona \(\times\) each item, "which persona the user belongs to" can be treated as a discrete latent variable \(\theta \in \{1,\dots,n\}\), thereby reducing the generative model to a finite mixture distribution.
Core Idea: Using LLM-generated persona response distributions as components of a finite mixture prior, transforming Bayesian adaptive querying into closed-form posterior updates on discrete latent variables and one-step-ahead entropy minimization.
Method
Overall Architecture
The method is divided into offline and online stages. Offline stage: A persona dictionary is utilized (this paper uses \(n=2058\) real US respondent profiles from Twin-2K-500); for each persona \(\xi_\theta\) and each item \(x\), GPT-5-mini is prompted to obtain a \(K\)-category response distribution \(\mu_{\theta,x} \in \Delta^{K-1}\), all of which are cached in a lookup table. Online stage: Initialize the persona prior \(p(\theta)\) for a new user (estimated via EM from training users). In each step, an item is selected based on history \(h_t\), response \(Y_{x_{t+1}}\) is observed, the persona posterior is updated in closed form, and the response distribution for the target item \(I^\star\) is predicted using the mixture distribution. After the budget is exhausted, final predictions are made and log loss / Brier / ordinal MSE are calculated.
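The online update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the nested list `mu[theta][item][k]` stands in for the cached LLM lookup table, and the function names and toy numbers are assumptions made for the example.

```python
def update_posterior(prior, mu, item, response):
    """One closed-form Bayes step: p(theta | y) is proportional to p(theta) * mu[theta][item][y]."""
    unnorm = [p * mu[t][item][response] for t, p in enumerate(prior)]
    z = sum(unnorm)
    if z == 0.0:
        raise ValueError("response has zero probability under every persona")
    return [u / z for u in unnorm]


def predictive(posterior, mu, item):
    """Mixture predictive: p(Y_item = k | history) = sum_theta mu[theta][item][k] * p(theta | history)."""
    n_cat = len(mu[0][item])
    return [
        sum(posterior[t] * mu[t][item][k] for t in range(len(posterior)))
        for k in range(n_cat)
    ]


# Toy lookup table (hypothetical numbers): 2 personas x 2 items x 2 categories.
mu = [
    [[0.9, 0.1], [0.8, 0.2]],  # persona 0
    [[0.2, 0.8], [0.3, 0.7]],  # persona 1
]
prior = [0.5, 0.5]

# The user answers category 1 on item 0; predict their answer to item 1.
post = update_posterior(prior, mu, item=0, response=1)
pred = predictive(post, mu, item=1)
print(post, pred)
```

With thousands of personas and long histories, accumulating the product in log space would be the safer choice; the direct product is used here only to keep the sketch short.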
Key Designs
- Persona-induced Latent Variable Model:
    - Function: Replaces "continuous low-dimensional ability traits" in traditional IRT with "discrete persona membership" and uses the LLM to provide \(p(Y_x \mid \theta)\).
    - Mechanism: Under the conditional independence assumption \(p(\theta, Y)=p(\theta)\prod_i p(Y_i \mid \theta)\), since \(\theta\) is discrete and each item likelihood is categorical, the posterior \(p(\theta \mid Y_{I_t}) \propto p(\theta)\prod_{i \in I_t}\mu_{\theta,i,Y_i}\) is entirely closed-form; the predictive distribution \(p(Y_x=k \mid Y_{I_t})=\sum_\theta \mu_{\theta,x,k}\,p(\theta\mid Y_{I_t})\) is also a finite sum.
    - Design Motivation: Completely bypass nested Monte Carlo and variational approximations to achieve both a "flexible prior" and "real-time inference"; additionally, each persona retains interpretable semantic labels, facilitating downstream user clustering.
- Greedy One-step-ahead Adaptive Querying:
    - Function: At each step, selects the item from the feasible set \(\mathcal{I}_{\text{feas}} \setminus I_t\) that most reduces the uncertainty about the target items.
    - Mechanism: The sum of marginal entropies of the target item set is used as the uncertainty \(U(P_t)=\sum_{x' \in I^\star} H(Y_{x'} \mid h_t)\). For each candidate \(x\), compute the expected post-query uncertainty \(\Delta_U(x \mid h_t) = \sum_k p(Y_x=k\mid Y_{I_t})\sum_{x' \in I^\star} H(Y_{x'}\mid h_t, Y_x=k)\) and select the candidate that minimizes it. Because the persona model ensures \(p(Y_x \mid Y_{I_t})\) and \(H(Y_{x'} \mid \ldots)\) are finite sums over personas, each greedy step reduces to exact sums over candidates, response categories, and personas, with no sampling.
    - Design Motivation: Classical BED requires high-dimensional integration for the predictive distribution, so one-step-ahead selection is essentially unusable with large item banks; the persona model removes this bottleneck, making the greedy algorithm, previously restricted to toy scales, practical.
- Empirical Bayes Prior Learning + Scoring Rule Evaluation:
    - Function: Performs EM fitting of the persona prior \(p(\theta)\) on real data to mitigate model misspecification where "synthetic personas do not match the real population."
    - Mechanism: Maximize the marginal likelihood of training users \(\sum_j \log \sum_\theta p(\theta)\,p(Y^{(j)}\mid\theta)\). The E-step calculates the responsibility \(\gamma_{j,\theta}\propto p(\theta)p(Y^{(j)}\mid\theta)\), and the M-step updates \(p(\theta)\) to the average responsibility. Evaluation uses proper scoring rules matched to the uncertainty measure (log loss pairs with Shannon entropy, Brier score with Gini impurity), so the quantity minimized during querying corresponds mathematically to the evaluation metric.
    - Design Motivation: Synthetic persona dictionaries are inevitably misspecified for real populations; EM concentrates the model's mass on a few personas that best match the training users, effectively "softly selecting a useful subset of personas," resulting in better robustness.
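The greedy one-step-ahead rule from the Key Designs above can be sketched as follows. This is an illustrative reading of the \(\Delta_U\) formula in plain Python, assuming a cached table `mu[theta][item][k]`; the helper names and the toy item bank are hypothetical and not taken from the paper's repository.

```python
import math


def entropy(p):
    """Shannon entropy (nats) of a categorical distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def predictive(post, mu, item):
    """Mixture predictive for one item under the current persona posterior."""
    return [
        sum(post[t] * mu[t][item][k] for t in range(len(post)))
        for k in range(len(mu[0][item]))
    ]


def expected_uncertainty(post, mu, cand, targets):
    """Expected sum of target-item entropies after observing the answer to `cand`."""
    total = 0.0
    for k, pk in enumerate(predictive(post, mu, cand)):
        if pk == 0.0:
            continue  # this answer cannot occur under the current posterior
        # Closed-form posterior if the user answered category k on `cand`.
        new_post = [post[t] * mu[t][cand][k] / pk for t in range(len(post))]
        total += pk * sum(entropy(predictive(new_post, mu, x)) for x in targets)
    return total


def greedy_pick(post, mu, candidates, targets):
    """Choose the candidate with the smallest expected post-query uncertainty."""
    return min(candidates, key=lambda x: expected_uncertainty(post, mu, x, targets))


# Toy bank (hypothetical numbers): item 0 separates the personas, item 1 does not,
# item 2 is the prediction target.
mu = [
    [[0.9, 0.1], [0.5, 0.5], [0.95, 0.05]],  # persona 0
    [[0.1, 0.9], [0.5, 0.5], [0.05, 0.95]],  # persona 1
]
prior = [0.5, 0.5]
choice = greedy_pick(prior, mu, candidates=[0, 1], targets=[2])
print(choice)
```

In this toy bank the uninformative item leaves the target entropy at \(\log 2\), so the rule prefers the item whose answers separate the two personas.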
Loss & Training
No gradient-based training is involved. Training occurs in two places: (1) Offline LLM prompting to extract \(\mu_{\theta,x}\); (2) EM estimation of the prior using real users. Online querying is entirely based on closed-form Bayesian updates and greedy search. CAT baselines (GRM/GPCM and multidimensional variants) follow the convention of EM training for item parameters, followed by inference using gridded posteriors.
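The EM estimation of the prior mentioned above only updates mixture weights, since the component distributions are fixed by the LLM cache. A minimal sketch under that assumption follows; the input `lik[j][theta]` is assumed to hold the precomputed likelihood \(p(Y^{(j)} \mid \theta)\) of each training user under each persona, and the function name, defaults, and toy values are illustrative.

```python
def em_prior(lik, n_iter=200, tol=1e-10):
    """Fit persona prior weights by EM with fixed mixture components.

    lik[j][theta] is the likelihood of training user j's responses under persona theta.
    """
    n_users, n_personas = len(lik), len(lik[0])
    pi = [1.0 / n_personas] * n_personas  # start from a uniform prior
    for _ in range(n_iter):
        acc = [0.0] * n_personas
        for row in lik:
            # E-step: responsibilities gamma[j][theta] proportional to pi[theta] * lik[j][theta].
            joint = [pi[t] * row[t] for t in range(n_personas)]
            z = sum(joint)
            for t in range(n_personas):
                acc[t] += joint[t] / z
        # M-step: the new prior is the average responsibility over users.
        new_pi = [a / n_users for a in acc]
        converged = max(abs(a - b) for a, b in zip(new_pi, pi)) < tol
        pi = new_pi
        if converged:
            break
    return pi


# Toy likelihoods (hypothetical): three users resemble persona 0, one resembles persona 1.
lik = [[0.9, 0.1], [0.8, 0.2], [0.9, 0.1], [0.1, 0.9]]
pi_hat = em_prior(lik)
print(pi_hat)
```

For real response vectors the per-user likelihoods are products of many probabilities, so a log-space E-step would be needed to avoid underflow.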
Key Experimental Results
Main Results
Setup: WorldValuesBench (91 items, 88,459 users, 4-point Likert) plus 100,000 synthetic users. Five items serve as prediction targets and the remaining 86 items form the queryable set, with budgets \(T \in \{5, 10, 20, 40, 86\}\).
| Setting | Method | \(T=5\) Log loss | \(T=20\) Log loss | Note |
|---|---|---|---|---|
| Synthetic Users (well-specified) | Greedy (persona) | Best | Best | Significantly lower than CAT; Curve approaches Full oracle |
| Synthetic Users | Non-adaptive Bayesian Design | Second Best | Second Best | Adaptive advantage is evident with synthetic data |
| Synthetic Users | CAT/IRT Series | Trailing significantly | Still trailing | Structural model misspecification |
| Real WVB | Greedy (persona, EM prior) | Best | Comparable to non-adaptive | Adaptive dominates at small budgets |
| Real WVB | Non-adaptive (persona) | Second | Can exceed greedy | More robust at large budgets, less affected by misspecification |
| Real WVB | CAT (GRM/GPCM and multidimensional variants) | Trailing | Trailing | Even with 70,000 training users |
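For reference, the three evaluation metrics named in the method description can be computed per target item as sketched below. Log loss and the multiclass Brier score follow their standard definitions; the ordinal MSE shown here is an assumed definition (squared gap between the predictive mean category and the true category), since the summary does not spell out the paper's exact form.

```python
import math


def log_loss(pred, y):
    """Negative log probability assigned to the true category (clipped for safety)."""
    return -math.log(max(pred[y], 1e-12))


def brier(pred, y):
    """Multiclass Brier score: squared distance to the one-hot true label."""
    return sum((p - (1.0 if k == y else 0.0)) ** 2 for k, p in enumerate(pred))


def ordinal_mse(pred, y):
    """ASSUMED definition: squared gap between predictive mean category and truth."""
    mean = sum(k * p for k, p in enumerate(pred))
    return (mean - y) ** 2


# Hypothetical predictive distribution over a 4-point Likert item; true answer is index 2.
pred = [0.1, 0.2, 0.6, 0.1]
y = 2
scores = (log_loss(pred, y), brier(pred, y), ordinal_mse(pred, y))
print(scores)
```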
Ablation Study
| Configuration | Phenomenon | Interpretation |
|---|---|---|
| Greedy + EM prior | Optimal on real data | EM prior effectively mitigates the mismatch between persona dictionary and real population |
| Greedy + Uniform prior | Significant gap from EM | Performance degrades without training data, but still remains competitive with CAT |
| Random / Random Fixed (persona model) | Moderate | Validates the independent contributions of the "querying strategy" and "persona model" |
| Full (all 86 items queried) | Near upper bound but not absolute best | Under misspecified settings, more observations do not necessarily lead to better predictions |
Key Findings
- On well-specified synthetic data, the persona model structurally outperforms CAT: the low-dimensional traits assumed by CAT do not align with the data generation process.
- On real WVB, greedy is most effective for small budgets (\(T \le 10\)), but as the budget increases, non-adaptive design can surpass it. This is consistent with greedy overconfidence: under a misspecified model, early incorrect inferences bias later item choices.
- The EM-fitted persona prior concentrates mass on very few personas, effectively "auto-selecting a subset" from the 2058 original personas, which is crucial for real-population inference.
- CAT still loses to the persona method even when given 70k training users; when items have no calibration data, CAT is unusable, whereas the persona method only needs one more LLM prompt to incorporate new items.
Highlights & Insights
- Upgrading "LLM as a simulator" to "LLM generating components of a Bayesian model" is an elegant perspective shift: heuristic persona simulation becomes probabilistic inference with proper posteriors.
- Discrete latent variables + categorical likelihood make the closed-form posterior a finite sum, bypassing the nested MC difficulties the BED community has been solving. It is an underrated case of "structure choice as computing power."
- Treating "no need to retrain item parameters when the item bank expands" as a system-level selling point is highly practical. This is a natural advantage of LLM priors and is transferable to recommendation cold starts, medical diagnosis, psychological scale generation, etc.
- The phenomenon where "greedy is overtaken by non-adaptive under misspecification" provides a useful engineering reminder: adaptive is not a panacea; model mismatch can turn greedy into a noise amplifier.
Limitations & Future Work
- The quality of LLM-provided persona response distributions directly determines prior quality; for domains unfamiliar to LLMs (low-resource languages, specialized questionnaires), offline distributions may be poor.
- Currently only supports categorical items; continuous/ordinal items require extending the likelihood form.
- One-step-ahead greedy degrades over long budgets; the paper mentions replacing it with RL multi-step planning but does not implement it, leaving it for future work.
- The fixed persona dictionary is a potential bottleneck: as user populations change, the dictionary needs constant updates or expansion, otherwise even EM cannot save the model from misspecification.
Related Work & Insights
- vs Classical CAT/IRT: CAT uses continuous low-dimensional traits + item parameters; this paper uses discrete personas + LLM-given likelihoods. The former needs large-scale calibration data per item, while the latter needs only a prompt.
- vs Neural BED (Foster et al. 2021, Ivanova et al. 2021): Neural BED learns amortized surrogate/policy networks but loses exact posteriors; this paper retains exact posteriors.
- vs Collaborative Filtering: CF uses similarity/matrix factorization for existing ratings; this paper has an explicit generative model, closed-form Bayesian updates, and active item selection, and does not require target population historical ratings.
- vs persona-based simulation (Argyle/Aher/Horton): They use LLM personas as heuristic simulation tools; this paper embeds persona outputs into a Bayesian model, which brings the statistical guarantees of Bayesian inference.
Rating
- Novelty: ⭐⭐⭐⭐ The perspective of using persona discrete latent variables to make BED inference closed-form is clear and practical, although individual components have precedents.
- Experimental Thoroughness: ⭐⭐⭐⭐ Synthetic and real experiments, multiple baselines, and multiple scoring rules are covered; the main limitation is that items are restricted to 4-point Likert responses.
- Writing Quality: ⭐⭐⭐⭐ The problem motivation and mathematical derivations are clean and efficient; the correspondence between Bayesian inference and scoring rules is well-explained.
- Value: ⭐⭐⭐⭐ Directly applicable to recommendation cold starts, questionnaires, psychometrics, etc., especially in scenarios where item banks are frequently updated.