ICLR 2026 Oral Reinforcement Learning behavior foundation models task inference zero-shot RL successor features UCB linear bandits

Optimistic Task Inference for Behavior Foundation Models¶

Conference: ICLR 2026 Oral arXiv: 2510.20264 Code: GitHub Area: Reinforcement Learning / Foundation Models / Zero-shot RL Keywords: behavior foundation models, task inference, zero-shot RL, successor features, UCB, linear bandits

TL;DR¶

This paper proposes OpTI-BFM — a test-time task inference method for Behavior Foundation Models that requires neither a complete reward function nor an annotated dataset, and recovers oracle performance within approximately 5 episodes of environment interaction. The core insight is to exploit the linear structure of successor features to reduce task inference to a linear bandit problem, employing a UCB strategy for optimistic exploration in task embedding space, with formal regret guarantees.

Background & Motivation¶

Background: Behavior Foundation Models (BFMs) leverage Universal Successor Features (USFs) for zero-shot RL — during pretraining, a family of policies \((\pi_z)_{z \in \mathcal{Z}}\) is learned, and at test time the optimal policy \(\pi_{z_r}\) can be retrieved immediately given a reward function \(r\).

Limitations of Prior Work: - Standard task inference (Eq. 4) requires computing \(z_r = \text{Cov}(\phi)^{-1} \mathbb{E}[\phi(s) r(s)]\) over an inference dataset \(\mathcal{D}\), which demands (i) access to an inference dataset and (ii) reward labels for each state. - Pretraining data may be unavailable or proprietary; annotating rewards from pixels is costly. - Existing methods (e.g., FB) use 50K annotated samples for task inference by default.

Key Challenge: BFM zero-shot policy retrieval is computationally efficient (requiring only a single linear regression) but data-annotation-inefficient (requiring large amounts of labeled data for task inference).

Core Idea: Transform task inference from offline regression to online interaction — exploiting the linear structure \(V^{\pi_z}(s) = z^\top \psi^{\pi_z}(s)\) to reduce policy selection to a linear bandit, and performing optimistic exploration in task embedding space via UCB.

Method¶

Overall Architecture¶

A pretrained BFM (providing policy family \(\pi_z\), successor features \(\psi\), and features \(\phi\)) → At test time: maintain a posterior estimate \(\hat{z}_t\) and confidence set \(\mathcal{C}_t\) for \(z_r\) → Select the optimistic \(z_t\) at each step (optimal under the upper confidence bound over \(\mathcal{C}_t\)) → Execute \(\pi_{z_t}\) to collect states and rewards → Update \(\hat{z}_t\) → Repeat until convergence.

Key Designs¶

Online Task Embedding Estimation
- Function: Maintain a regularized least-squares estimate of \(z_r\).
- Mechanism: \(\hat{z}_t = V_t^{-1} \sum_{i=0}^t \phi(s_i) r_i\), where \(V_t = \lambda I_d + \sum_{i=0}^t \phi(s_i) \phi(s_i)^\top\). Updated after each new interaction \((s_t, r_t)\).
- Confidence set: \(\mathcal{C}_t = \{z : \|z - \hat{z}_t\|_{V_t} \leq \beta_t\}\), an ellipsoidal confidence region.
Optimistic Task Selection (UCB)
- Function: Select the optimistic task embedding among all plausible candidates.
- Mechanism: \(z_t = \arg\max_{z \in \mathcal{C}_t} \psi^{\pi_z}(s_t)^\top z\) — selects the task embedding within the confidence set that maximizes predicted return at the current state.
- Key distinction from linear bandits: Two distinct "contexts" are involved — \(\phi\) (for regression and confidence interval estimation) and \(\psi\) (for the acquisition function). Regressing on \(\phi\) yields tighter estimates.
- Design Motivation: Optimism naturally balances exploration and exploitation — exploring (selecting informative policies) under high uncertainty, and exploiting (selecting the estimated best policy) under low uncertainty.
Regret Bound
- Function: Formally bound the regret of OpTI-BFM.
- Core Result: For a perfectly trained BFM (satisfying the linear structure), the regret after \(n\) episodes is \(R_n = \tilde{O}(\sqrt{n \cdot d})\), where \(d\) is the task embedding dimension.
- Key Assumptions: Perfectly trained BFM (accurate SFs, optimal policies) + reward lies within the linear span of the features.
- Design Motivation: By formalizing the connection to the classical LinUCB algorithm, the method directly inherits theoretical guarantees from the linear bandit literature.
Thompson Sampling Variant (OpTI-BFM-TS)
- Avoids the optimization problem by directly sampling from the posterior: \(z_t \sim \mathcal{N}(\hat{z}_t, \beta_t V_t^{-1})\).
- Simpler but slightly lower performance (a notable gap from UCB on Cheetah).

Loss & Training¶

Pretraining: Standard Forward-Backward (FB) framework.
Test-time inference: Purely online; no additional training required. Each step only updates \(V_t\) and \(\hat{z}_t\) (matrix-vector operations).
Computational overhead: Approximately 5× slower than pure policy inference (Table 1), but the absolute time cost remains small.

Key Experimental Results¶

Main Results (DMC Walker / Cheetah / Quadruped, 4 tasks each)¶

Method	Episodes to Recover Oracle Performance	Annotation Requirement
Oracle (offline inference)	—	50K annotated samples
OpTI-BFM	~5 episodes (5K steps)	5K online rewards
OpTI-BFM-TS	~8 episodes	8K online rewards
LoLA (black-box policy search)	~30+ episodes	30K+
Random	Does not converge	—

OpTI-BFM matches offline inference requiring 50K annotated samples using 10× fewer annotations.

Ablation Study¶

Configuration	Result	Notes
Per-step update vs. per-episode update	Per-step update converges faster	Per-episode update has theoretical guarantees
OpTI-BFM vs. RND (task-agnostic exploration)	OpTI-BFM is more data-efficient	Demonstrates value of task-aware exploration
OpTI-BFM vs. random baseline	Active data collection greatly outperforms passive	Validates the importance of active collection
Non-stationary rewards (decay \(\rho < 1\))	Can track changing rewards	Requires an appropriate forgetting factor
Warm-start (\(n\) offline annotated samples)	Initial performance improves rapidly with \(n\)	Compatible with offline data

Key Findings¶

Oracle performance is recovered within 5 episodes across all environments and tasks, a direct consequence of fully exploiting the BFM linear structure.
LoLA (black-box search) ignores the linear structure and thus converges more than 6× slower, demonstrating the value of structure-aware exploration.
OpTI-BFM naturally extends to non-stationary rewards (via a forgetting factor \(\rho\)) and warm-start settings (via offline data pre-initialization).
Computational overhead is manageable (~5×); the primary bottleneck is environment interaction rather than inference itself.

Highlights & Insights¶

Eliminates the most critical practical bottleneck of BFMs: Prior "zero-shot" BFMs required 50K annotated samples for task inference — a substantial data demand that severely limits real-world deployment. OpTI-BFM reduces this requirement from 50K to ~5K (online, without pre-stored datasets).
Elegant theoretical connection: Task inference in BFMs is precisely reduced to the classical linear bandit problem, inheriting a rich set of theoretical tools (regret bounds, ellipsoidal confidence sets, the elliptical potential lemma).
Practical unified interface: OpTI-BFM supports three modes — (1) purely online with no prior knowledge, (2) warm-start with a small amount of offline data, and (3) non-stationary reward tracking — making it a unified framework for BFM deployment.

Limitations & Future Work¶

Relies on the linear SF structure of BFMs — nonlinear reward functions fall outside the scope of the theoretical guarantees.
Theoretical analysis assumes a perfectly trained BFM (accurate SFs and optimal policies); how approximation errors propagate in practice requires further analysis.
The regret bound covers only episode-level updates, whereas experiments show that step-level updates perform better — a theory-practice gap remains.
Evaluation is limited to DMC environments (Walker/Cheetah/Quadruped); more complex high-dimensional visual settings are not tested.
The dimension \(d\) of the task embedding space \(\mathcal{Z}\) affects regret — high-dimensional embedding settings may require substantially more exploration.

vs. Standard BFM task inference (FB): Requires 50K annotated samples and a pre-stored dataset; OpTI-BFM requires only ~5 episodes of online interaction.
vs. LoLA (Sikchi 2025): Black-box policy search that ignores the linear structure, converging more than 6× slower.
vs. LinUCB (Abbasi-Yadkori 2011): The theoretical foundation of OpTI-BFM, extended to the novel BFM setting where regression and acquisition use distinct contexts.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ The connection between BFM task inference and linear bandits is remarkably elegant, directly addressing the core practical bottleneck of BFMs.
Experimental Thoroughness: ⭐⭐⭐⭐ Standard benchmarks with diverse ablations (non-stationarity, warm-start, data efficiency comparisons).
Writing Quality: ⭐⭐⭐⭐⭐ Theoretical motivation is clear, with a smooth progression from intuition to formal derivation.
Value: ⭐⭐⭐⭐⭐ The ICLR Oral distinction is well deserved; the work substantially advances the practical deployment of BFMs.