ICML2025 Medical LLM In-Context Exploration Multi-Armed Bandit Contextual Bandit UCB Algorithm Distillation Exploration-Exploitation

EVOLvE: Evaluating and Optimizing LLMs For In-Context Exploration¶

Conference: ICML2025
arXiv: 2410.06238
Code: allenanie/EVOLvE
Area: LLM Exploration / Reinforcement Learning / Bandit
Keywords: In-Context Exploration, Multi-Armed Bandit, Contextual Bandit, UCB, Algorithm Distillation, Exploration-Exploitation

TL;DR¶

This paper proposes the BanditBench benchmark and three enhancement strategies (inference-time algorithm guidance, few-shot demonstration, and oracle fine-tuning) to systematically evaluate and improve the in-context exploration capabilities of LLMs in bandit environments, enabling smaller models to outperform larger ones through algorithm distillation.

Background & Motivation¶

LLMs have been widely applied in scenarios such as recommendation systems, education, and healthcare, which require making optimal decisions under uncertainty. However, these scenarios are fundamentally different from traditional prediction tasks: the LLM can only observe the rewards of its selected actions (partial feedback). Thus, it must actively explore to discover the optimal strategy while avoiding the opportunity costs associated with excessive exploration. This is the classic exploration-exploitation tradeoff problem.

Although theoretical optimal algorithms like UCB and Thompson Sampling exist in the bandit/RL fields, how LLMs balance exploration and exploitation under uncertainty, and whether they can self-improve via in-context learning, remains under-studied. Prior work (Krishnamurthy et al., 2024) only conducted preliminary evaluations on small-scale MABs, concluding that LLMs perform poorly without heavy interventions.

The core motivation of this paper is: Can the knowledge of known optimal algorithms be leveraged to efficiently inject exploration capabilities into LLMs?

Method¶

1. BanditBench Benchmark¶

Constructs a natural language interaction environment covering both Multi-Armed Bandit (MAB) and Contextual Bandit (CB):

MAB Setup: Varies across two dimensions—action space (\(K=5\) / \(K=20\)), reward distribution (Bernoulli / Gaussian), and exploration difficulty (\(\Delta_{\min}=0.5\) for easy / \(\Delta_{\min}=0.2\) for hard). Action descriptions are categorized into non-semantic "Video" and semantic "Clothes". There are 16 configurations in total.
CB Setup: A semi-synthetic task based on the MovieLens dataset. The context includes user features (age, gender, occupation, zip code) and movie features. The action space is \(K=10\) (easy) or \(K=30\) (hard). True rewards are constructed via SVD low-rank decomposition: \(r_{i,j} = u_i^T \Sigma v_j\).

2. History Representation Formats¶

Defines three textual formats to convey interaction histories to the LLM:

Raw History (RH): Records the complete sequence of \((t', a_{t'}, r_{t'})\), where context length grows linearly.
Summarized History (SH): Compressed into the empirical mean \(\hat{\mathbb{E}}[r^a]\) of each arm, sample counts \(N_t(a)\), and the current step \(t\). Only applicable to MAB.
Algorithm-Guided Support (AG): Directly provides the exploitation values and exploration bonuses computed by UCB.

3. Three Enhancement Strategies¶

(a) Inference-Time Algorithm Guidance (AG)¶

Explicitly provides the exploitation values and exploration bonuses calculated by the UCB algorithm to the LLM at inference time:

UCB Formula for MAB:

\[V^{\text{exploit}}(a,t) = \hat{\mu}_t(a), \quad V^{\text{explore}}(a,t) = \alpha \sqrt{\frac{\log(t)}{N_t(a)}}\]

The LLM only needs to perform addition and argmax to make optimal decisions, bypassing the need to infer reward distributions from raw history.

LinUCB for CB: Assuming a linear reward model \(\mathbb{E}[r_t^a | x_t^a] = (x_t^a)^T \theta^*\), ridge regression is used to estimate \(\hat{\theta}\), with the confidence interval given by \(\alpha\sqrt{(x_t^a)^T(D_a^T D_a + \lambda I_d)^{-1} x_t^a}\).

(b) In-Context Few-Shot Demonstration¶

Places 5 oracle trajectories generated by UCB as few-shot examples in the LLM's context. The challenge is that the bandit context length scales linearly with steps and number of actions, potentially exceeding the LLM's effective context window.

(c) Oracle Fine-Tuning (OFT)¶

Supervised fine-tuning is performed using trajectories generated by UCB, with the loss function:

\[\mathcal{L}_{\text{OFT}}(\pi) = -\mathbb{E}_{(\phi(H_t^{\text{UCB}}), a_t^{\text{UCB}}) \sim \mathcal{D}_{\text{OFT}}} [\log \pi(a_t^{\text{UCB}} | \phi(H_t^{\text{UCB}}))]\]

Note that OFT \(\neq\) Behavior Cloning: OFT distills the policy iteration process (a sequence of continuously improving policies \(\{\pi_1, \pi_2, \ldots, \pi_*\}\)) rather than replicating a single static policy. The model must determine its current phase of improvement based on historical interactions to output the corresponding action.

4. Functional Analysis of Exploration Efficiency¶

Fits the cumulative regret curve using a parametric function:

\[f(T) = \frac{\lambda \log(T)^\alpha}{\Delta_{\min}} + \beta T + \lambda_2\]

Smaller \(\alpha\) \(\rightarrow\) slower logarithmic growth \(\rightarrow\) more efficient exploration.
\(\beta = 0\) \(\rightarrow\) no linear regret \(\rightarrow\) achieves theoretical optimality (sublinear regret).
A strong algorithm should exhibit small \(\alpha\) and \(\beta \approx 0\).

Key Experimental Results¶

Evaluated models: Gemma-2B, Gemma-9B, Gemini-1.5 Flash, and Gemini-1.5 Pro. Results are averaged over 30 independent runs, using t-tests (\(p < 0.05\)) for pairwise win-rates.

Inference-Time Support Win-Rate (%)¶

Method	Gemma-2B	Gemma-9B	Flash	Pro	Flash(CB)	Pro(CB)
RH	7.6	10.5	27.7	45.5	0.0	7.1
SH	10.5	5.3	34.8	60.0	—	—
AG	4.9	4.1	32.2	59.6	46.4	64.3
UCB/LinUCB	—	—	—	—	90.6	96.4

Algorithm Distillation Win-Rate (%)¶

Method	Flash(MAB)	Pro(MAB)	Flash(CB)	Pro(CB)
Few-shot + RH	27.5	39.1	3.6	7.1
Few-shot + AG	50.2	56.4	60.7	25.0
OFT + RH	65.6	—	28.6	—
OFT + AG	28.3	—	89.3	—

Key Findings¶

OFT-tuned Flash outperforms the larger Pro model, demonstrating the strong potential of algorithm distillation.
AG yields the most significant improvements in large action spaces (\(K=20\), \(K=30\)) and complex CB tasks.
Task difficulty affects distillation performance: Few-shot benefits more from simple data (50.2 vs 43.0), while OFT benefits more from hard data (65.6 vs 54.5).
Surprising representation preference contrast: Few-shot prefers SH (50.2 vs 27.5), whereas OFT prefers RH (65.6 vs 28.3).
Regret analysis shows: Pro + AG/SH approaches sublinear regret; Flash + OFT also achieves sublinear regret with a lower \(\alpha\); Gemma-2B/9B basically remain in the linear regret regime.
Easy-to-hard generalization: Policy distilled from simple \(K=5\) tasks generalizes well to hard \(K=20\) tasks (34.0% \(\rightarrow\) 48.4%).

Exploration Behavior Analysis (MinFrac / OptFrac)¶

Metric/Steps	100	250	500	750	1000
UCB MinFrac	82.3%	48.6%	27.8%	19.6%	15.3%
Flash(AG) MinFrac	11.3%	4.5%	2.3%	1.5%	1.1%
UCB OptFrac	32.7%	49.4%	58.7%	62.6%	65.0%
Flash(AG) OptFrac	14.4%	15.6%	16.3%	16.6%	16.8%

UCB exhibits an ideal pattern of "exploring widely first, then exploiting", whereas LLMs (even with AG) show severe under-exploration, with OptFrac barely growing.

Highlights & Insights¶

Systematic Benchmark: BanditBench is the first testbed for LLM exploration capabilities that spans MAB + CB with various difficulties, action spaces, and reward distributions.
Smaller Model Outperforms Larger Model: Through OFT, Flash outperforms Pro on both MAB and CB, showing that exploration capabilities can be efficiently injected via algorithm distillation.
Profound Distinction between OFT and BC: OFT distills policy improvement trajectories rather than a static policy, granting the model the capability to "learn how to learn".
Huge Gain of AG in CB: The OFT win-rate for CB leaps from 28.6% to 89.3%, indicating that structured guiding information is crucial for complex decision-making.
Regret Function Fitting: Innovatively analyzes LLM exploration efficiency via \(f(T) = \lambda\log(T)^\alpha/\Delta_{\min} + \beta T + \lambda_2\), providing a theoretical comparison framework.

Limitations & Future Work¶

Environmental Limitations: Only evaluates bandits (stateless RL), without addressing stateful full RL problems (MDPs).
Algorithmic Dependence: Both AG and OFT rely on the specific UCB algorithm; their applicability to more complex, black-box optimal algorithms remains unverified.
LLM Exploration is Still Far Inferior to UCB: Even under the optimal setting, there remains a notable gap between LLMs and UCB/LinUCB's 90%+ win-rate.
CB Dataset Limitations: Only evaluates on the MovieLens dataset; generalization needs further validation.
Fine-Tuning Only Conducted on Flash: Tuning results on Pro are not reported, which limits the completeness of the conclusions.
OptFrac Barely Grows: Suggests that the LLM has not truly "learned" meaningful exploration, but rather passively exploits the provided statistical information.

Rating¶

Novelty: ⭐⭐⭐⭐ — First to systematically distill optimal bandit algorithm knowledge into LLMs; the distinction between OFT and BC is profound.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 16 MAB + 2 CB configurations, 4 models, 30 repetitions, and extensive ablation studies.
Writing Quality: ⭐⭐⭐⭐ — Structured clearly, with complete formal definitions and creative regret function analysis.
Value: ⭐⭐⭐⭐ — Highly practical guidelines for the LLM-as-agent field, providing clear paths for improvement.