# Improving Planning and MBRL with Temporally-Extended Actions

Conference: NeurIPS 2025 | arXiv: 2505.15754 | Code: None | Area: Reinforcement Learning | Keywords: temporally-extended actions, model-based RL, planning, action duration, multi-armed bandit
## TL;DR
This paper proposes treating action duration as an additional optimization variable in shooting-based planning and MBRL, combined with a multi-armed bandit (MAB) mechanism for automatic duration range selection. The approach significantly accelerates planning across multiple environments and solves challenging tasks that standard methods fail to handle.
## Background & Motivation
Background: Continuous systems are commonly approximated in discrete time, where a small simulation step \(\delta t\) requires a long planning horizon \(D\), imposing a substantial computational burden on shooting methods such as CEM/MPPI.
Limitations of Prior Work: The optimal frame-skip value varies across environments. While model-free RL has explored learning frame-skip, this problem remains unaddressed in planning and MBRL. Long rollouts also exacerbate compounding errors.
Key Challenge: Small \(\delta_t\) ensures accuracy but enlarges the search space; large frame-skip reduces search complexity but sacrifices flexibility.
Goal: Enable the planner to jointly optimize actions and duration \(\delta t_k \in [\delta t_{\min}, \delta t_{\max}]\) at each step.
Key Insight: Treat \(\delta t\) as a continuous optimization variable, with MAB automatically selecting \(\delta t_{\max}\).
Core Idea: Action duration as a planner optimization variable, combined with a learned temporally-extended dynamics model and MAB-based automatic range selection.
## Method

### Overall Architecture
CEM outputs a pair \((a_k, \delta t_k)\) at each decision step. With \(e_k = \lfloor \delta t_k / \delta t_{\text{env}} \rfloor\) env steps consumed by decision \(k\) and \(e_{<k} = \sum_{j<k} e_j\), the return is \(J = \sum_k \gamma_1^{e_{<k}} \sum_{t=1}^{e_k} \gamma_2^{t-1}\, \mathcal{R}(s_{k,t-1}, a_k)\): \(\gamma_1\) discounts across elapsed env steps between decisions, while \(\gamma_2\) discounts within an action's duration.
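To make the double-discounted return and the joint \((a, \delta t)\) search concrete, here is a minimal NumPy sketch. It is illustrative only: `step_fn`, the Gaussian CEM parameterization, and the population sizes are assumptions rather than the authors' implementation, and the inner env-step loop stands in for what the paper replaces with a single constant-time call to \(\hat{F}_{\text{TE}}\).

```python
import numpy as np

def te_return(step_fn, s, plan, dt_env, gamma1=0.99, gamma2=0.99):
    """Return J for a plan of (action, duration) pairs.

    step_fn(s, a) -> (s_next, r) is a one-env-step dynamics model; the
    paper's TE model F_TE(s, a, dt) would replace the inner loop with a
    single constant-time call.
    """
    J, elapsed = 0.0, 0
    for a, dt in plan:
        e_k = int(dt // dt_env)          # env steps covered by this action
        inner = 0.0
        for t in range(e_k):
            s, r = step_fn(s, a)
            inner += gamma2 ** t * r     # gamma_2 discounts within the action
        J += gamma1 ** elapsed * inner   # gamma_1 discounts by elapsed env steps
        elapsed += e_k
    return J

def cem_plan(step_fn, s0, act_dim, depth, dt_env, dt_min, dt_max,
             iters=5, pop=500, n_elite=50):
    """CEM over the joint (action, duration) space: each decision step
    contributes act_dim + 1 variables instead of act_dim."""
    mu = np.zeros((depth, act_dim + 1))
    mu[:, -1] = 0.5 * (dt_min + dt_max)  # initialize durations mid-range
    sigma = np.ones_like(mu)
    for _ in range(iters):
        cands = np.random.randn(pop, *mu.shape) * sigma + mu
        cands[..., -1] = np.clip(cands[..., -1], dt_min, dt_max)
        scores = [te_return(step_fn, s0, [(x[:-1], x[-1]) for x in c], dt_env)
                  for c in cands]
        elites = cands[np.argsort(scores)[-n_elite:]]
        mu, sigma = elites.mean(0), elites.std(0) + 1e-6
    return mu[0, :-1], mu[0, -1]         # executed action and its duration
```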
### Key Designs
- Temporally-Extended Dynamics Model \(\hat{F}_{\text{TE}}\):
    - Takes \((s, a, \delta t)\) as input and outputs the next-state distribution and reward.
    - Inference time is constant in \(\delta t\) (versus the iterative single-step model \(F_{\text{IP}}\), whose cost scales linearly with \(\delta t\)).
    - Shorter rollouts reduce compounding errors.
- MAB for Automatic \(\delta t_{\max}\) Selection (see the bandit sketch after this list):
    - \(m = \log_2(T)\) exponentially spaced candidate values.
    - UCB with an EMA reward estimate: \(\arg\max_i \big(\hat{R}_{i,T} + c\sqrt{2\log T / N(i,T)}\big)\).
    - Each candidate maintains an independent dataset and model.
- Search Space Analysis:
    - Standard: search space \(|\mathcal{A}|^H\), optimizing \(H|\mathcal{A}|\) variables.
    - TE (\(m\)-fold durations): search space \(|\mathcal{A}|^{H/m}\), optimizing \((H/m)(|\mathcal{A}|+1)\) variables.
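The bandit component admits an equally compact sketch. The arm spacing (powers of two times \(\delta t_{\text{env}}\)), the EMA step size, and the constructor arguments are assumptions for illustration, not the paper's exact settings:

```python
import math

class DtMaxBandit:
    """UCB over m = log2(T) exponentially spaced dt_max candidates.

    An EMA reward estimate (instead of a running mean) tracks the
    non-stationary objective: an arm's value drifts as its private
    dataset grows and its TE model improves.
    """

    def __init__(self, dt_env, T, c=1.0, ema=0.1):
        m = max(1, int(math.log2(T)))
        self.arms = [dt_env * 2 ** i for i in range(1, m + 1)]  # dt_max candidates
        self.c, self.ema = c, ema
        self.n = [0] * m       # pull counts N(i, t)
        self.r = [0.0] * m     # EMA estimates R_hat_i
        self.t = 0

    def select(self):
        """Arm maximizing R_hat_i + c * sqrt(2 log t / N(i, t))."""
        self.t += 1
        for i, n in enumerate(self.n):
            if n == 0:         # try every arm at least once
                return i
        return max(range(len(self.arms)),
                   key=lambda i: self.r[i]
                   + self.c * math.sqrt(2 * math.log(self.t) / self.n[i]))

    def update(self, i, episode_return):
        """Credit arm i with the return of the episode that used it."""
        self.n[i] += 1
        self.r[i] += self.ema * (episode_return - self.r[i])
```

Consistent with the design above, each arm would additionally own its dataset and TE model, with only the selected arm trained and used for planning in a given episode.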
## Key Experimental Results

### Planning Experiments (Exact Dynamics)
| Environment | Standard | TE | Improvement |
|---|---|---|---|
| Mountain Car | Requires \(D \geq 60\) | Solved with \(D_{\text{TE}} \geq 4\) | 15× horizon reduction |
| Multi-hill MC (5 instances) | Solves 1/5 | Solves all | Resolves previously intractable tasks |
| Dubins Car (102-dim) | OOM at 4GB | 103MB, successful | 40× memory reduction |
### MBRL Experiments
| Environment | PETS | TE(F) | TE(D) | Max Gain |
|---|---|---|---|---|
| Reacher | -6.5 | -4.2 | -4.5 | +35% |
| HalfCheetah | ~4900 | ~6100 | ~5800 | +24% |
| Hopper | ~230 | ~680 | ~350 | +195% |
| Walker | ~420 | ~610 | ~540 | +45% |
### Ablation Study
| Configuration | Effect |
|---|---|
| Fixed \(\gamma_1\), decreasing \(\gamma_2\) | More decision steps, total steps unchanged |
| Fixed \(\gamma_2\), decreasing \(\gamma_1\) | Fewer total steps |
| Shared vs. independent models | Independent models perform better |
| TE(F) vs. TE(D) | MAB auto-selection approaches manual optimum |
### Key Findings
- Sparse-reward environments benefit most: a shallow decision depth now covers a long temporal lookahead, letting the planner reach distant rewards that standard depth-limited search misses.
- MBRL advantages: shorter rollouts reduce compounding errors and accelerate convergence.
- Hopper shows the largest improvement (+195%).
- MAB eliminates the need for manual hyperparameter tuning.
## Highlights & Insights
- Simplicity and effectiveness: Adding a single optimization dimension substantially extends the capability boundary of the planner.
- MAB-based hyperparameter automation: EMA + UCB addresses the non-stationary bandit problem.
- Search space analysis: the \(2^H\) vs. \(2^{H/m}\) comparison (for a binary action space) is intuitive and illuminating; a worked instance follows this list.
- The framework is transferable to any system involving a trade-off between temporal resolution and search depth.
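Plugging in the Mountain Car numbers from above makes the scaling concrete: with binary actions, \(H = 60\), and \(m = 15\) (so \(H/m = D_{\text{TE}} = 4\)), the candidate space shrinks from \(2^{60} \approx 10^{18}\) action sequences to \(2^{4} = 16\).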
## Limitations & Future Work
- Predicting states reached after \(e_k\) steps from \(s\) is inherently more difficult than single-step prediction, posing accuracy challenges for \(\hat{F}_{\text{TE}}\) under long action durations.
- Maintaining independent models for each \(\delta t_{\max}\) candidate increases memory and training time overhead.
- Validation is limited to shooting-based planners (CEM); gradient-based methods and tree search (MCTS) have not been explored.
- Theoretical analysis is absent: under what conditions do temporally-extended actions preserve optimality, and how does error propagation scale with \(\delta t\)?
- The dual discount factors \(\gamma_1, \gamma_2\) introduce additional degrees of freedom, though setting them equal suffices as a default.
## Comparison of Two Variants
| Property | \(A_{\text{TE}}\)(F) Fixed Range | \(A_{\text{TE}}\)(D) Dynamic Selection |
|---|---|---|
| \(\delta t_{\max}\) | Manually specified | Automatically selected by MAB |
| Number of models | 1 | \(m = \log_2(T)\) |
| Dataset | Shared | Independent per candidate |
| Tuning requirement | Requires tuning \(\delta t_{\max}\) | None (MAB automated) |
| Performance | Slightly better when hand-tuned optimally | Approaches hand-tuned optimum; more robust |
## Related Work & Insights
- vs. PETS (Chua et al. 2018): TE + MAB is integrated into PETS with minimal architectural changes.
- vs. Ni & Jang (2022): Model-free timescale learning; this work directly optimizes duration in MBRL.
- vs. Options (Sutton et al. 1999): Options require learning initiation and termination conditions; the proposed approach is substantially more lightweight.
## Rating
- Novelty: ⭐⭐⭐⭐ First systematic study of action duration optimization in MBRL and planning.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Two major settings, 7+ environments, complete ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, consistent notation.
- Value: ⭐⭐⭐⭐ Highly practical; directly integrable into existing MBRL frameworks.