# Optimistic Task Inference for Behavior Foundation Models
**Conference:** ICLR 2026 (Oral) · **arXiv:** 2510.20264 · **Code:** GitHub · **Area:** Reinforcement Learning / Foundation Models / Zero-shot RL · **Keywords:** behavior foundation models, task inference, zero-shot RL, successor features, UCB, linear bandits
## TL;DR
This paper proposes OpTI-BFM — a test-time task inference method for Behavior Foundation Models that requires neither a complete reward function nor an annotated dataset, and recovers oracle performance within approximately 5 episodes of environment interaction. The core insight is to exploit the linear structure of successor features to reduce task inference to a linear bandit problem, employing a UCB strategy for optimistic exploration in task embedding space, with formal regret guarantees.
## Background & Motivation
Background: Behavior Foundation Models (BFMs) leverage Universal Successor Features (USFs) for zero-shot RL — during pretraining, a family of policies \((\pi_z)_{z \in \mathcal{Z}}\) is learned, and at test time the optimal policy \(\pi_{z_r}\) can be retrieved immediately given a reward function \(r\).
Limitations of Prior Work:
- Standard task inference (Eq. 4) requires computing \(z_r = \text{Cov}(\phi)^{-1} \mathbb{E}[\phi(s) r(s)]\) over an inference dataset \(\mathcal{D}\), which demands (i) access to an inference dataset and (ii) reward labels for each state.
- Pretraining data may be unavailable or proprietary; annotating rewards from pixels is costly.
- Existing methods (e.g., FB) use 50K annotated samples for task inference by default.
Key Challenge: BFM zero-shot policy retrieval is computationally cheap (a single linear regression) but annotation-hungry: task inference needs large amounts of reward-labeled data.
Core Idea: Transform task inference from offline regression to online interaction — exploiting the linear structure \(V^{\pi_z}(s) = z^\top \psi^{\pi_z}(s)\) to reduce policy selection to a linear bandit, and performing optimistic exploration in task embedding space via UCB.
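In one line, in the paper's notation (a sketch assuming the reward is realizable as \(r(s) = \phi(s)^\top z_r\)):

```latex
% If r(s) = \phi(s)^\top z_r, the value of deploying any policy \pi_z under
% the true task is linear in the unknown z_r, with "arm features" \psi^{\pi_z}(s):
\[
  V_r^{\pi_z}(s) \;=\; \psi^{\pi_z}(s)^\top z_r ,
\]
% so selecting which policy to deploy is a linear bandit over the arms
% \{\psi^{\pi_z}(s)\}_{z \in \mathcal{Z}} with unknown parameter z_r.
```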
## Method
### Overall Architecture
A pretrained BFM (providing the policy family \(\pi_z\), successor features \(\psi\), and state features \(\phi\)) → at test time, maintain a regularized least-squares estimate \(\hat{z}_t\) and a confidence set \(\mathcal{C}_t\) for \(z_r\) → at each step select the optimistic \(z_t\) (best case over \(\mathcal{C}_t\)) → execute \(\pi_{z_t}\) to collect states and rewards → update \(\hat{z}_t\) → repeat until convergence.
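Below is a minimal NumPy sketch of this loop. The BFM interface (`bfm.phi`, `bfm.psi`, `bfm.act`), the candidate-sampling approximation of the \(\arg\max\) over \(\mathcal{C}_t\), the old-style `env.step` return, and all hyperparameters are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def sample_confidence_set(z_hat, V_inv, beta, n):
    """Sample candidates on the boundary of {z : ||z - z_hat||_V <= beta}."""
    d = z_hat.shape[0]
    u = np.random.randn(n, d)
    u /= np.linalg.norm(u, axis=1, keepdims=True)  # random unit directions
    L = np.linalg.cholesky(V_inv)                  # maps the sphere onto the ellipsoid
    return [z_hat + beta * (L @ ui) for ui in u]

def opti_bfm_episode(env, bfm, V, b, beta, n_candidates=256):
    """One test-time episode: optimistic selection + per-step LS update."""
    s, done = env.reset(), False
    while not done:
        V_inv = np.linalg.inv(V)
        z_hat = V_inv @ b  # \hat{z}_t = V_t^{-1} sum_i phi(s_i) r_i
        # Optimism: score candidates from C_t by predicted return at s.
        candidates = sample_confidence_set(z_hat, V_inv, beta, n_candidates)
        z_t = max(candidates, key=lambda z: bfm.psi(s, z) @ z)
        s_next, r, done, _ = env.step(bfm.act(s, z_t))
        phi_s = bfm.phi(s)
        V += np.outer(phi_s, phi_s)  # V_t update (regression on phi)
        b += phi_s * r
        s = s_next
    return V, b
```

Initialize with \(V_0 = \lambda I_d\) and \(b_0 = 0\), matching the regularized least-squares estimator detailed below.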
### Key Designs
- **Online Task Embedding Estimation**
    - Function: maintain a regularized least-squares estimate of \(z_r\).
    - Mechanism: \(\hat{z}_t = V_t^{-1} \sum_{i=0}^{t} \phi(s_i) r_i\), where \(V_t = \lambda I_d + \sum_{i=0}^{t} \phi(s_i) \phi(s_i)^\top\); updated after each new interaction \((s_t, r_t)\).
    - Confidence set: \(\mathcal{C}_t = \{z : \|z - \hat{z}_t\|_{V_t} \leq \beta_t\}\), an ellipsoidal confidence region.
- **Optimistic Task Selection (UCB)**
    - Function: select the optimistic task embedding among all plausible candidates.
    - Mechanism: \(z_t = \arg\max_{z \in \mathcal{C}_t} \psi^{\pi_z}(s_t)^\top z\), i.e., the embedding within the confidence set that maximizes predicted return at the current state.
    - Key distinction from standard linear bandits: two distinct "contexts" are involved: \(\phi\) is used for regression and confidence-set estimation, while \(\psi\) feeds the acquisition function. Regressing on \(\phi\) yields tighter estimates.
    - Design motivation: optimism naturally balances exploration and exploitation, exploring (selecting informative policies) under high uncertainty and exploiting (selecting the estimated best policy) under low uncertainty.
- **Regret Bound**
    - Function: formally bound the regret of OpTI-BFM, i.e., the cumulative value lost relative to always deploying the oracle policy \(\pi_{z_r}\).
    - Core result: for a perfectly trained BFM (satisfying the linear structure), the regret after \(n\) episodes is \(R_n = \tilde{O}(\sqrt{n \cdot d})\), where \(d\) is the task-embedding dimension.
    - Key assumptions: a perfectly trained BFM (accurate SFs, optimal policies) and a reward within the linear span of the features.
    - Design motivation: by formalizing the connection to the classical LinUCB algorithm, the method directly inherits theoretical guarantees from the linear bandit literature.
- **Thompson Sampling Variant (OpTI-BFM-TS)**
    - Avoids the inner optimization problem by sampling directly from the posterior: \(z_t \sim \mathcal{N}(\hat{z}_t, \beta_t V_t^{-1})\) (see the sketch after this list).
    - Simpler, but slightly lower performance (a notable gap from UCB on Cheetah).
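A sketch of the TS selection step, reusing the same least-squares statistics; the Gaussian form follows \(\mathcal{N}(\hat{z}_t, \beta_t V_t^{-1})\) above, and the function name and `rng` handling are illustrative.

```python
import numpy as np

def ts_select(z_hat, V_inv, beta, rng=None):
    """Thompson sampling: one posterior draw replaces the argmax over C_t."""
    rng = rng or np.random.default_rng()
    # z_t ~ N(z_hat, beta_t * V_t^{-1}), as in the variant described above.
    return rng.multivariate_normal(z_hat, beta * V_inv)
```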
### Loss & Training
- Pretraining: Standard Forward-Backward (FB) framework.
- Test-time inference: purely online; no additional training required. Each step updates only \(V_t\) and \(\hat{z}_t\) via matrix-vector operations (a rank-one inverse update keeps this cheap; see the sketch below).
- Computational overhead: Approximately 5× slower than pure policy inference (Table 1), but the absolute time cost remains small.
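Since \(V_t\) changes by a rank-one term per step, \(V_t^{-1}\) can be maintained in \(O(d^2)\) via the Sherman–Morrison identity. This is a standard trick and an assumption here; the paper does not necessarily implement it this way.

```python
import numpy as np

def rank1_inv_update(V_inv, phi_s):
    """Sherman-Morrison: (V + p p^T)^{-1} = V^{-1} - (V^{-1} p)(V^{-1} p)^T / (1 + p^T V^{-1} p)."""
    v = V_inv @ phi_s
    return V_inv - np.outer(v, v) / (1.0 + phi_s @ v)
```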
## Key Experimental Results
### Main Results (DMC Walker / Cheetah / Quadruped, 4 tasks each)
| Method | Episodes to Recover Oracle Performance | Annotation Requirement |
|---|---|---|
| Oracle (offline inference) | — | 50K annotated samples |
| OpTI-BFM | ~5 episodes (5K steps) | 5K online rewards |
| OpTI-BFM-TS | ~8 episodes | 8K online rewards |
| LoLA (black-box policy search) | ~30+ episodes | 30K+ online rewards |
| Random | Does not converge | — |
With roughly 10× fewer reward annotations, OpTI-BFM matches offline inference that uses 50K annotated samples.
### Ablation Study
| Configuration | Result | Notes |
|---|---|---|
| Per-step update vs. per-episode update | Per-step update converges faster | Per-episode update has theoretical guarantees |
| OpTI-BFM vs. RND (task-agnostic exploration) | OpTI-BFM is more data-efficient | Demonstrates value of task-aware exploration |
| OpTI-BFM vs. random baseline | Active data collection greatly outperforms passive | Validates the importance of active collection |
| Non-stationary rewards (decay \(\rho < 1\)) | Can track changing rewards | Requires an appropriate forgetting factor (sketched below) |
| Warm-start (\(n\) offline annotated samples) | Initial performance improves rapidly with \(n\) | Compatible with offline data (sketched below) |
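For the last two rows, hedged sketches of both variants: a forgetting-factor update for non-stationary rewards, and warm-starting \((V, b)\) from offline labels. Decaying toward the \(\lambda I\) prior (so \(V\) stays well conditioned) is an assumption, not necessarily the paper's exact scheme.

```python
import numpy as np

def discounted_update(V, b, phi_s, r, rho=0.99, lam=1.0):
    """Forgetting factor rho < 1: down-weight old evidence to track a
    changing reward, decaying toward the lam * I prior."""
    d = V.shape[0]
    V = rho * V + (1.0 - rho) * lam * np.eye(d) + np.outer(phi_s, phi_s)
    b = rho * b + phi_s * r
    return V, b

def warm_start(bfm, states, rewards, lam=1.0):
    """Initialize (V, b) from n offline annotated samples before going online."""
    d = bfm.phi(states[0]).shape[0]
    V, b = lam * np.eye(d), np.zeros(d)
    for s, r in zip(states, rewards):
        phi_s = bfm.phi(s)
        V += np.outer(phi_s, phi_s)
        b += phi_s * r
    return V, b
```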
### Key Findings
- Oracle performance is recovered within 5 episodes across all environments and tasks, a direct consequence of fully exploiting the BFM linear structure.
- LoLA (black-box search) ignores the linear structure and thus converges more than 6× slower, demonstrating the value of structure-aware exploration.
- OpTI-BFM naturally extends to non-stationary rewards (via a forgetting factor \(\rho\)) and warm-start settings (via offline data pre-initialization).
- Computational overhead is manageable (~5×); the primary bottleneck is environment interaction rather than inference itself.
## Highlights & Insights
- Eliminates the most critical practical bottleneck of BFMs: Prior "zero-shot" BFMs required 50K annotated samples for task inference — a substantial data demand that severely limits real-world deployment. OpTI-BFM reduces this requirement from 50K to ~5K (online, without pre-stored datasets).
- Elegant theoretical connection: Task inference in BFMs is precisely reduced to the classical linear bandit problem, inheriting a rich set of theoretical tools (regret bounds, ellipsoidal confidence sets, the elliptical potential lemma).
- Practical unified interface: OpTI-BFM supports three modes — (1) purely online with no prior knowledge, (2) warm-start with a small amount of offline data, and (3) non-stationary reward tracking — making it a unified framework for BFM deployment.
## Limitations & Future Work
- Relies on the linear SF structure of BFMs — nonlinear reward functions fall outside the scope of the theoretical guarantees.
- Theoretical analysis assumes a perfectly trained BFM (accurate SFs and optimal policies); how approximation errors propagate in practice requires further analysis.
- The regret bound covers only episode-level updates, whereas experiments show that step-level updates perform better — a theory-practice gap remains.
- Evaluation is limited to DMC environments (Walker/Cheetah/Quadruped); more complex high-dimensional visual settings are not tested.
- The dimension \(d\) of the task embedding space \(\mathcal{Z}\) affects regret — high-dimensional embedding settings may require substantially more exploration.
## Related Work & Insights
- vs. Standard BFM task inference (FB): Requires 50K annotated samples and a pre-stored dataset; OpTI-BFM requires only ~5 episodes of online interaction.
- vs. LoLA (Sikchi 2025): Black-box policy search that ignores the linear structure, converging more than 6× slower.
- vs. LinUCB (Abbasi-Yadkori 2011): The theoretical foundation of OpTI-BFM, extended to the novel BFM setting where regression and acquisition use distinct contexts.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ The connection between BFM task inference and linear bandits is remarkably elegant, directly addressing the core practical bottleneck of BFMs.
- Experimental Thoroughness: ⭐⭐⭐⭐ Standard benchmarks with diverse ablations (non-stationarity, warm-start, data efficiency comparisons).
- Writing Quality: ⭐⭐⭐⭐⭐ Theoretical motivation is clear, with a smooth progression from intuition to formal derivation.
- Value: ⭐⭐⭐⭐⭐ The ICLR Oral distinction is well deserved; the work substantially advances the practical deployment of BFMs.