# Exploration with Foundation Models: Capabilities, Limitations, and Hybrid Approaches
**Conference:** NeurIPS 2025 | **arXiv:** [2509.19924](https://arxiv.org/abs/2509.19924) | **Code:** None | **Area:** Reinforcement Learning | **Keywords:** foundation models, exploration, reinforcement-learning, VLM, knowing-doing gap
## TL;DR
This paper systematically evaluates the zero-shot exploration capabilities of LLMs/VLMs on classic RL exploration tasks (bandits, Gridworld, Atari), identifies a knowing-doing gap in VLMs — where high-level reasoning succeeds but low-level control fails — and proposes a simple VLM-RL hybrid framework that substantially accelerates learning under idealized conditions.
## Background & Motivation
Exploration under sparse rewards remains a fundamental challenge in RL. Foundation models (LLMs/VLMs) possess strong semantic priors and reasoning capabilities; whether these can be leveraged to improve exploration efficiency is an open question.
Limitations of prior work:
Narrow evaluation scope: prior MAB studies focus on complex prompt engineering and do not examine how simple changes in instruction phrasing affect exploration.
Incomplete environment hierarchy: No systematic, progressive evaluation spanning simple (bandit) to complex (Atari) settings.
Unclear failure modes: Why do VLMs fail in visual environments — is it a comprehension problem or an execution problem?
This paper addresses these questions through a three-level progressive evaluation (MAB → Gridworld → Atari) and reveals the root causes of failure via qualitative analysis.
## Method

### Overall Architecture
A three-tier evaluation framework, plus a hybrid control scheme:

1. Multi-Armed Bandit (isolates the exploration–exploitation tradeoff): compares the effect of implicit vs. explicit prompts on LLM exploration behavior.
2. Gridworld (introduces state transitions and memory requirements): tests LLM spatial navigation in deterministic/stochastic environments.
3. Atari hard-exploration games (high-dimensional visual input + sparse rewards): evaluates GPT-4o zero-shot gameplay.
4. Hybrid framework: a VLM periodically takes over control from a PPO agent to serve as a semantic exploration guide.
### Key Designs
Prompt design ablation:

- Implicit (v1): "Your goal is to maximize the total reward by pulling the arm with the highest probability" → requires the LLM to infer the need for exploration.
- Explicit (v2): "Your goal is to maximize the total reward by finding out which arm has the highest probability" → directly instructs exploration.
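A minimal sketch of how these two variants slot into a bandit evaluation loop, assuming a hypothetical `query_llm` chat wrapper (everything beyond the two quoted goal sentences is illustrative, not code from the paper):

```python
# Sketch: implicit-vs-explicit prompt ablation on a Bernoulli bandit.
# `query_llm` is a hypothetical wrapper around whatever chat API is used.
import random

PROMPTS = {
    "implicit_v1": ("Your goal is to maximize the total reward by pulling "
                    "the arm with the highest probability."),
    "explicit_v2": ("Your goal is to maximize the total reward by finding out "
                    "which arm has the highest probability."),
}

def run_llm_bandit(query_llm, variant, probs, horizon=100):
    """Roll out one LLM-driven bandit episode; return the (arm, reward) history."""
    history = []
    for t in range(horizon):
        context = "\n".join(f"step {i}: arm {a}, reward {r}"
                            for i, (a, r) in enumerate(history))
        prompt = (f"{PROMPTS[variant]}\nThere are {len(probs)} arms.\n"
                  f"History so far:\n{context}\n"
                  f"Reply with only the index of the arm to pull next.")
        try:
            arm = int(query_llm(prompt).strip()) % len(probs)
        except ValueError:
            arm = random.randrange(len(probs))   # fall back on malformed replies
        reward = int(random.random() < probs[arm])  # Bernoulli arm pull
        history.append((arm, reward))
    return history
```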
Temporal information in Atari: A frame skip of \(m=6\) steps (rather than the usual stack of 4 consecutive frames) is introduced to increase temporal diversity and help the VLM infer motion direction. A unified minimal prompt is used across all games.
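A minimal sketch of assembling such a skipped-frame observation (the paper releases no code; the number of frames kept per query, `n_frames=4`, is an assumption):

```python
# Sketch: sample every m-th frame so the VLM sees a wider temporal window.
# `env` is any Gymnasium-style Atari env with built-in frame skipping disabled.

def skipped_frames(env, action, m=6, n_frames=4):
    """Repeat `action`, keeping one frame every m steps (n_frames kept in total)."""
    frames, done = [], False
    for _ in range(n_frames):
        for _ in range(m):
            obs, reward, terminated, truncated, info = env.step(action)
            if terminated or truncated:
                done = True
                break
        frames.append(obs)  # these spaced-out frames are what the VLM receives
        if done:
            break
    return frames
```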
Hybrid algorithm:

- The PPO agent relinquishes control to the VLM with probability \(\epsilon\) for \(T\) steps.
- The VLM acts as a "semantic explorer," steering the agent toward promising state regions.
- PPO resumes standard on-policy learning from the new states.
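A minimal sketch of this takeover loop, assuming illustrative `ppo_agent` and `vlm_policy` interfaces (the names are hypothetical; keeping VLM-controlled transitions out of the update buffer is an assumption consistent with PPO being on-policy):

```python
import random

def hybrid_rollout(env, ppo_agent, vlm_policy, epsilon=0.05, T=50, horizon=10_000):
    """PPO collects on-policy data; with prob. epsilon the VLM takes over for T steps."""
    obs, _ = env.reset()
    vlm_steps_left = 0
    for t in range(horizon):
        if vlm_steps_left == 0 and random.random() < epsilon:
            vlm_steps_left = T                 # hand control to the VLM
        if vlm_steps_left > 0:
            action = vlm_policy(obs)           # semantic exploration step
            vlm_steps_left -= 1
        else:
            action = ppo_agent.act(obs)        # standard on-policy step
        next_obs, reward, terminated, truncated, _ = env.step(action)
        if vlm_steps_left == 0:
            # assumption: only PPO-controlled transitions feed the on-policy buffer
            ppo_agent.store(obs, action, reward, next_obs, terminated)
        obs = next_obs
        if terminated or truncated:
            obs, _ = env.reset()
```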
### Loss & Training
- The hybrid framework uses the standard PPO loss (the clipped surrogate objective; written out below).
- The VLM operates zero-shot without any training.
- Baselines: PPO + RND (Random Network Distillation) as a strong exploration baseline.
- Evaluation metrics: cumulative reward, regret, and learning curves.
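For reference, "standard PPO loss" denotes the textbook clipped surrogate objective (not reproduced from the paper; \(\epsilon_{\text{clip}}\) is the clipping parameter, distinct from the takeover probability \(\epsilon\) above):

\[
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\Big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon_{\text{clip}},\, 1+\epsilon_{\text{clip}}\big)\,\hat{A}_t\Big)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
\]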
## Key Experimental Results

### MAB Experiments
| Model | Implicit prompt (v1) | Explicit prompt (v2) | UCB | Thompson Sampling |
|---|---|---|---|---|
| GPT-3.5 | High regret | Moderate regret | Low regret | Low regret |
| GPT-4 | Moderate regret | Near-optimal | Low regret | Low regret |
| Gemini 1.0 | High regret | Moderate regret | — | — |
| Gemini 1.5 | Moderate regret | Moderate-to-low regret | — | — |
Suboptimality gap analysis (GPT-4, explicit prompt):
| \(\Delta\) | GPT-4 vs. UCB/TS |
|---|---|
| 0.6 | Competitive |
| 0.4 | Competitive |
| 0.2 | Clearly inferior |
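For context: regret here is cumulative pseudo-regret \(\sum_t (\mu^* - \mu_{a_t})\), and the suboptimality gap \(\Delta\) is the mean-reward difference between the best and next-best arm. The UCB column is presumably the classic UCB1 rule; a minimal sketch of that baseline (textbook algorithm, not code from the paper):

```python
import math
import random

def ucb1(probs, horizon=1000):
    """Classic UCB1 on a Bernoulli bandit; returns cumulative pseudo-regret."""
    k = len(probs)
    counts = [0] * k
    values = [0.0] * k              # running mean reward per arm
    regret, best = 0.0, max(probs)
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1             # pull each arm once to initialize
        else:
            arm = max(range(k), key=lambda a:
                      values[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = int(random.random() < probs[arm])
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        regret += best - probs[arm]
    return regret
```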
### Atari Zero-Shot Experiments
| Game | GPT-4o | RB 250K | RB 2.5M | RB 25M | Human |
|---|---|---|---|---|---|
| Freeway | 21 | 8 | 32 | 32 | 29.6 |
| Gravitar | 500 | 64 | 199 | 2405 | 3351 |
| Montezuma | 0 | 0 | 50 | 544 | 4753 |
| Pitfall | -158 | -26 | -7 | -7 | 6464 |
| Private Eye | -1000 | 503 | 125 | 1573 | 69571 |
| Solaris | 600 | 681 | 1137 | 2093 | 12326 |
| Venture | 0 | 8 | 20 | 1513 | 1188 |
### Gridworld Results
| Setting | Action Only | Simple Plan | Focused Plan | PPO/RecPPO |
|---|---|---|---|---|
| Deterministic | LLM performs well | LLM excels | LLM excels | Slow convergence |
| Stochastic (partial obs.) | Severe degradation | Some improvement | Some improvement | Eventually converges |
### Hybrid Framework Results (Freeway)
| Method | Score after 100K steps | Convergence Speed |
|---|---|---|
| Vanilla PPO | ~5 | Slow |
| PPO + RND | ~15 | Moderate |
| PPO + VLM | ~25 | Fast |
## Key Findings
- Explicit prompts substantially improve exploration: LLMs do not infer the need to explore on their own and require explicit instruction.
- Knowing-doing gap: VLMs correctly identify "move upward" in Freeway and recognize enemies to fire at in Gravitar (+250 points), yet completely fail in games requiring precise temporal control.
- Failure mode taxonomy:
- Precise control failure: Montezuma (correct reasoning of "get the key" but unable to execute the jump).
- Self-identification failure: Venture (unable to identify the pink square as the player character).
- Temporal reasoning failure: Pitfall (understands "jump over the pit" but misjudges timing).
- Hybrid framework is effective under idealized conditions: On Freeway, where the VLM strategy is correct and control is simple, the hybrid approach substantially outperforms PPO+RND.
## Highlights & Insights
- Precise characterization of the knowing-doing gap: The failure is not that VLMs misunderstand the game, but that they cannot translate understanding into precise low-level actions — a fundamental bottleneck for current VLMs as autonomous agents.
- Progressive evaluation design: The MAB → Gridworld → Atari progression systematically exposes the capability boundaries of foundation models at each level.
- Honest experimental design: The hybrid framework is validated on Freeway — a game where the VLM is already known to perform well — and is explicitly presented as an upper-bound analysis rather than a general solution.
- Practical implication: Foundation models are better suited as "semantic accelerators" for RL than as end-to-end controllers.
## Limitations & Future Work
- The hybrid framework is validated on only one game (Freeway); generalizability remains unknown.
- The computational cost of VLM inference (per-step GPT-4o calls) and its tradeoff with sample efficiency are not quantified.
- Atari evaluation is limited to GPT-4o; open-source VLMs are not compared.
- The intervention mechanism is non-adaptive — decisions about when to transfer control to the VLM and when to return it to RL should be grounded in uncertainty estimates.
- Alternative integration modes, such as using VLMs as reward shapers or state abstractors, are not explored.
## Related Work & Insights
- Atari-GPT (Waytowich et al.): evaluates VLMs on dense-reward Atari; this paper focuses on sparse-reward hard-exploration games.
- BALROG (Paglieri et al.): identifies the knowing-doing gap in NetHack; this paper independently validates the same phenomenon in Atari.
- TextAtari (Li et al.): removing the visual bottleneck substantially improves LLM reasoning, confirming that low-level control is the primary bottleneck.
- Intelligent Go-Explore (Lu et al.): uses GPT-4 to replace handcrafted heuristics for selecting states to revisit — a more successful paradigm than the direct control approach taken here.
- Motif (Klissarov et al.): employs LLMs as intrinsic reward functions rather than direct controllers, making better use of semantic understanding capabilities.
## Rating
- Novelty: 7/10 — The systematic evaluation is valuable, but the knowing-doing gap concept has prior precedent.
- Experimental Thoroughness: 7/10 — The progressive evaluation design is strong; validation of the hybrid framework is insufficient.
- Value: 6/10 — The hybrid framework is overly simplistic and serves more as a proof of concept.
- Writing Quality: 8/10 — Structure is clear; qualitative analysis is vivid and well-illustrated.