Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents

- Conference: ACL 2026
- arXiv: 2604.05808
- Code: GitHub
- Area: LLM Agent / Hierarchical Reinforcement Learning
- Keywords: Hierarchical Reinforcement Learning, Step-Level Transitions, Local Progress, Token Efficiency, Offline RL

TL;DR

This paper proposes STEP-HRL, which introduces a local progress module that iteratively compresses the interaction history within each subtask into a compact textual summary, so that both the high-level and low-level policies decide from single-step transitions rather than full histories. On ScienceWorld it scores 57.3 against 48.2 for the HRL baseline GLIDER, and it reaches 78.4 on ALFWorld, while using fewer tokens and generalizing better to unseen task variants.

Background & Motivation

Background: LLM agents have demonstrated strong capabilities in interactive decision-making tasks. Reinforcement learning provides a principled mechanism for improving agents through environment interaction and reward feedback. Existing LLM agents predominantly adopt a "history-conditioned" paradigm, where policies condition on ever-growing historical sequences.

Limitations of Prior Work: (1) The quadratic complexity of attention mechanisms makes reasoning over long histories computationally expensive; (2) unfiltered history accumulates redundant or irrelevant information that may obscure decision-critical signals; (3) existing HRL methods introduce temporal abstraction but still condition both high- and low-level policies on cumulative histories, inheriting the long-context dependency problem.

Key Challenge: History conditioning is a modeling choice, not a requirement of RL. Conflating long-horizon decision-making with long-context reasoning adds avoidable computational overhead and reasoning noise.

Goal: Design a progress-based rather than history-based HRL framework in which policies rely solely on single-step transitions for decision-making.

Key Insight: The sequence of completed subtasks naturally constitutes a compact summary of global progress; the remaining challenge is how to compactly represent the local interaction history within each subtask.

Core Idea: Introduce a local progress policy \(\pi_\theta^p\) that iteratively compresses the intra-subtask interaction history into a compact textual representation at each step. The low-level policy conditions only on the current subtask, local progress, and current observation, eliminating dependence on the full history.

Method

Overall Architecture

STEP-HRL comprises three policies sharing parameters: (1) the high-level policy \(\pi_\theta^h\) generates the next subtask conditioned on the task instruction, completed subtasks, the final progress of the previous subtask, and the current observation; (2) the low-level policy \(\pi_\theta^l\) generates primitive actions conditioned on the current subtask, local progress, and current observation; (3) the local progress policy \(\pi_\theta^p\) updates local progress conditioned on the current subtask, previous action, current observation, and prior-step progress. Training proceeds in two stages: behavioral cloning initialization → step-level offline RL optimization.
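
The interaction between the three policies can be sketched as a nested loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the callables `high`, `low`, and `progress` stand in for three prompts to one shared LLM, and `env_step`, `subtask_done`, and the `"DONE"` sentinel are hypothetical names introduced here, since the summary does not say how subtask termination is detected.

```python
from typing import List, Tuple


def run_episode(task, obs, high, low, progress, env_step, subtask_done,
                max_subtasks=5, max_steps=10):
    """Roll out one episode using only constant-size inputs per decision."""
    completed: List[str] = []   # global progress: the completed subtasks
    final_progress = ""         # final progress of the previous subtask
    trace: List[Tuple[str, str, str, str]] = []
    for _ in range(max_subtasks):
        subtask = high(task, completed, final_progress, obs)
        if subtask == "DONE":   # hypothetical stop signal
            break
        p = ""                  # local progress starts empty for each subtask
        for _ in range(max_steps):
            action = low(subtask, p, obs)          # a_t from (g_k, p_t, o_t)
            trace.append((subtask, p, obs, action))
            obs = env_step(action)                 # o_{t+1}
            p = progress(subtask, action, obs, p)  # p_{t+1}
            if subtask_done(subtask, obs):
                break
        final_progress = p
        completed.append(subtask)
    return completed, trace


# Toy stand-ins for the LLM policies and the environment (all hypothetical).
PLAN = ["find kettle", "heat kettle"]


def high(task, completed, prev_final_progress, obs):
    return PLAN[len(completed)] if len(completed) < len(PLAN) else "DONE"


def low(subtask, p, obs):
    return f"do[{subtask}]"


def progress(subtask, prev_action, obs, prev_p):
    return f"{subtask}: last action {prev_action!r} gave {obs!r}"


def env_step(action):
    return f"ok {action}"


def subtask_done(subtask, obs):
    return obs.startswith("ok")


completed, trace = run_episode("boil water", "you are in a kitchen",
                               high, low, progress, env_step, subtask_done)
print(completed)
```

Note that no call receives the full trajectory: each policy sees only the current subtask, the latest progress summary, and the current observation (plus the list of completed subtasks for the high-level policy).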

Key Designs

Local Progress Policy:
- Function: Iteratively compresses the growing intra-subtask interaction history into a fixed-size textual summary.
- Mechanism: \(p_t^k \sim \pi_\theta^p(\cdot | g_k, a_{t-1}^k, o_t^k, p_{t-1}^k)\). At each step, the policy receives the previous progress, the last action, and the current observation, selectively extracts subtask-relevant information, and outputs an updated compact progress summary. Initialized as empty: \(p_0^k = \varnothing\).
- Design Motivation: Unlike simple history truncation, local progress is selective: it retains only subtask-relevant information and discards redundancy.
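
To illustrate the recursive update, the sketch below replaces the LLM progress policy with a simple keyword filter. The rule-based `update_progress`, the `max_facts` cap, and the example steps are assumptions made purely for illustration; in the paper the summary is generated by the shared LLM.

```python
def update_progress(subtask, prev_action, obs, prev_progress, max_facts=3):
    """Rule-based stand-in for the progress policy (illustrative assumption)."""
    facts = [f for f in prev_progress.split("; ") if f]
    keywords = set(subtask.lower().split())
    # Keep the new step only if the observation mentions a subtask keyword.
    if keywords & set(obs.lower().split()):
        facts.append(f"{prev_action} -> {obs}")
    # Cap the summary so its size stays bounded as the subtask gets longer.
    return "; ".join(facts[-max_facts:])


subtask = "find thermometer"
steps = [
    ("look around", "you see a table and a door"),
    ("open drawer", "the drawer holds a thermometer"),
    ("look around", "you see a table and a door"),
    ("take thermometer", "you take the thermometer"),
]

progress_summary = ""   # empty initial progress
history = []
for action, obs in steps:
    history.append(f"{action} -> {obs}")
    progress_summary = update_progress(subtask, action, obs, progress_summary)

full_history = "; ".join(history)
print(len(full_history), len(progress_summary))
```

The full history grows with every step, while the summary keeps only the two steps that mention the thermometer.
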
Step-Level Transition Construction:
- Function: Enables both low-level and high-level policies to make decisions based on constant-size inputs.
- Mechanism: The low-level step transition is \((o_t^k, p_t^k, a_t^k, \hat{r}_t^k, o_{t+1}^k, p_{t+1}^k)\); the high-level step transition is \((\hat{p}_{k-1}, o_0^k, g_k, R_k, \hat{p}_k, o_0^{k+1})\), where \(\hat{p}_k\) denotes the final local progress upon completion of subtask \(g_k\).
- Design Motivation: Step-level transitions are designed to be Markovian: provided the progress summary retains the decision-relevant information, decisions can be made without backtracking through the full history.
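
The two tuple layouts can be written down as plain records. The sketch below is one possible encoding, not the authors' data format: the class and field names are hypothetical, and giving reward 0 to every step before subtask completion is an assumption (the summary only states that completion yields an intrinsic reward of 1).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LowLevelTransition:
    obs: str                 # o_t^k
    progress: str            # p_t^k
    action: str              # a_t^k
    intrinsic_reward: float  # intrinsic reward for this step
    next_obs: str            # o_{t+1}^k
    next_progress: str       # p_{t+1}^k


@dataclass(frozen=True)
class HighLevelTransition:
    prev_final_progress: str  # final progress of subtask k-1
    first_obs: str            # o_0^k
    subtask: str              # g_k
    reward: float             # R_k
    final_progress: str       # final progress of subtask k
    next_first_obs: str       # o_0^{k+1}


def low_level_transitions(obs: List[str], progress: List[str],
                          actions: List[str], completed: bool) -> List[LowLevelTransition]:
    """Slice one subtask segment (T actions, T+1 observations) into step tuples."""
    if not (len(obs) == len(progress) == len(actions) + 1):
        raise ValueError("expected T actions and T+1 observations/progress entries")
    last = len(actions) - 1
    return [
        LowLevelTransition(
            obs[t], progress[t], actions[t],
            1.0 if (completed and t == last) else 0.0,  # assumed reward shape
            obs[t + 1], progress[t + 1],
        )
        for t in range(len(actions))
    ]


transitions = low_level_transitions(
    obs=["o0", "o1", "o2"], progress=["", "p1", "p2"],
    actions=["a0", "a1"], completed=True,
)
print([t.intrinsic_reward for t in transitions])
```

Each record has a fixed number of fields regardless of how long the episode is, which is what makes the inputs to the critics and policies constant-size.
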
Step-Level Offline RL (IQL-based):
- Function: Further optimizes policies after behavioral cloning initialization.
- Mechanism: Built on the Implicit Q-Learning framework, the three policies share parameters but are each equipped with independent critic networks (utterance-level \(V\) and \(Q\)). Value functions are learned via expectile regression, and policies are optimized via advantage-weighted regression. The low-level policy uses intrinsic rewards (subtask completion = 1); the high-level policy uses extrinsic environment rewards.
- Design Motivation: Behavioral cloning merely imitates experts, whereas offline RL can discover superior policies; step-level transitions yield more stable value estimates for RL.
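
The role of expectile regression and advantage weighting can be seen on a toy example. This is a pure-Python sketch of the generic IQL ingredients, not the paper's training code: the function names, the sample Q-values, and the hyperparameters `tau`, `beta`, and `max_weight` are illustrative assumptions.

```python
import math


def expectile_loss(diff, tau):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2, with diff = Q - V."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def expectile(samples, tau, iters=100):
    """Value v minimising the summed expectile loss, by fixed-point iteration."""
    v = sum(samples) / len(samples)
    for _ in range(iters):
        weights = [tau if q > v else 1.0 - tau for q in samples]
        v = sum(w * q for w, q in zip(weights, samples)) / sum(weights)
    return v


def awr_weight(q, v, beta=3.0, max_weight=100.0):
    """Advantage-weighted regression weight exp(beta * (Q - V)), clipped."""
    return min(math.exp(beta * (q - v)), max_weight)


q_samples = [0.0, 1.0, 2.0, 3.0, 4.0]   # Q-values of actions seen in the data
v_mean = expectile(q_samples, tau=0.5)  # tau = 0.5 recovers the mean
v_high = expectile(q_samples, tau=0.9)  # tau > 0.5 leans toward the best actions
print(v_mean, v_high)
print(awr_weight(q=4.0, v=v_high), awr_weight(q=0.0, v=v_high))
```

With `tau` above 0.5 the value estimate moves toward the better in-dataset actions without querying actions outside the data, and the policy loss then upweights actions whose Q-value exceeds that estimate.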

Loss & Training

The behavioral cloning stage employs autoregressive cross-entropy loss. The offline RL stage jointly optimizes: a Q-function TD regression loss, a value function expectile loss, and an advantage-weighted policy loss. All three policies share LLM parameters to facilitate cross-level knowledge transfer.
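
For reference, the four losses named here can be written in their standard forms. These are the generic behavioral cloning and Implicit Q-Learning objectives, given as a sketch: \(s\) denotes a step-level input such as \((g_k, p_t^k, o_t^k)\), \(\mathcal{D}\) the offline dataset, and \(\tau\), \(\gamma\), \(\beta\) the expectile, discount, and temperature hyperparameters. The paper's exact weighting, target networks, and utterance-level details are not given in this summary and may differ.

\[
\mathcal{L}_{\mathrm{BC}}(\theta) = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\Big[\sum_{i=1}^{|a|} \log \pi_\theta(a_i \mid s, a_{<i})\Big]
\]

\[
\mathcal{L}_{V} = \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[L_2^{\tau}\big(Q(s,a) - V(s)\big)\big],
\qquad L_2^{\tau}(u) = \big|\tau - \mathbb{1}(u < 0)\big|\,u^2
\]

\[
\mathcal{L}_{Q} = \mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\big[\big(r + \gamma V(s') - Q(s,a)\big)^2\big]
\]

\[
\mathcal{L}_{\pi}(\theta) = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\exp\big(\beta\,(Q(s,a) - V(s))\big)\,\log \pi_\theta(a \mid s)\big]
\]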

Key Experimental Results

Main Results

ScienceWorld (30 science task families)

| Method | Total Score | Token Usage | Generalization (Unseen Variants) |
| --- | --- | --- | --- |
| ReAct | 32.1 | High | Low |
| GLIDER (HRL) | 48.2 | High | Medium |
| STEP-HRL (BC only) | 52.7 | Low | Medium |
| STEP-HRL (BC + RL) | 57.3 | Low | High |

Ablation Study

| Configuration | ScienceWorld | ALFWorld |
| --- | --- | --- |
| No local progress (full history) | 44.8 | 62.3 |
| Fixed-window truncation | 47.2 | 65.1 |
| Local progress (STEP-HRL) | 57.3 | 78.4 |

Key Findings

- STEP-HRL with behavioral cloning alone already surpasses the existing HRL baseline GLIDER on ScienceWorld (52.7 vs. 48.2), supporting the effectiveness of step-level transitions in their own right.
- Offline RL adds a further 4.6 points on ScienceWorld (57.3 vs. 52.7), indicating that step-level transitions support effective RL optimization beyond imitation.
- The local progress module outperforms fixed-window truncation by 10.1 points on ScienceWorld and 13.3 points on ALFWorld, so selective information retention is substantially better than naive truncation.
- Parameter sharing across the three policies reduces training and inference overhead while promoting cross-level knowledge transfer.
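
The point gaps quoted in these findings follow directly from the two tables. The short check below recomputes them; the dictionary layout and the `gain` helper are introduced here for illustration, and the numbers are copied from the tables as reported.

```python
# Scores as reported in the Main Results and Ablation Study tables.
scienceworld = {
    "ReAct": 32.1,
    "GLIDER (HRL)": 48.2,
    "STEP-HRL (BC only)": 52.7,
    "STEP-HRL (BC + RL)": 57.3,
}
ablation = {
    "full history": {"ScienceWorld": 44.8, "ALFWorld": 62.3},
    "fixed window": {"ScienceWorld": 47.2, "ALFWorld": 65.1},
    "local progress": {"ScienceWorld": 57.3, "ALFWorld": 78.4},
}


def gain(a, b):
    """Point difference rounded to one decimal, matching the tables' precision."""
    return round(a - b, 1)


bc_vs_glider = gain(scienceworld["STEP-HRL (BC only)"], scienceworld["GLIDER (HRL)"])
rl_vs_bc = gain(scienceworld["STEP-HRL (BC + RL)"], scienceworld["STEP-HRL (BC only)"])
progress_vs_window = {
    env: gain(ablation["local progress"][env], ablation["fixed window"][env])
    for env in ("ScienceWorld", "ALFWorld")
}
print(bc_vs_glider, rl_vs_bc, progress_vs_window)
```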

Highlights & Insights

- The core insight is that long-horizon decision-making does not require long-context reasoning: step-level transitions show that information compression can substitute for history accumulation.
- Local progress functions as an information bottleneck that focuses the policy on subtask-relevant signals and filters out noise.
- The parameter-sharing design across three policies strikes a favorable balance between efficiency and performance.

Limitations & Future Work

- The quality of local progress depends on the LLM's summarization capability; weaker LLMs may produce low-quality progress representations.
- Subtask decomposition and progress annotations in expert demonstrations are generated by DeepSeek, which may introduce its inherent biases.
- Evaluation is limited to text-based environments (ScienceWorld, ALFWorld); applicability to visual or multimodal environments remains unexplored.
- Offline RL is constrained by the quality and diversity of collected data.

Comparison with related methods:
- vs. GLIDER: GLIDER employs HRL but still conditions on full histories; STEP-HRL eliminates historical dependence through local progress.
- vs. ReAct: ReAct interleaves reasoning and acting but lacks hierarchical structure; STEP-HRL adds hierarchical abstraction and step-level optimization.
- vs. Decision Transformer: DT reformulates decision-making as sequence prediction and requires complete trajectories; STEP-HRL requires only single-step transitions.

Rating

- Novelty: ⭐⭐⭐⭐ The HRL design combining step-level transitions and local progress is novel and well-motivated.
- Experimental Thoroughness: ⭐⭐⭐⭐ Two benchmarks, detailed ablations, token analysis, and generalization evaluation.
- Writing Quality: ⭐⭐⭐⭐⭐ Problem formulation is clear and method derivation is complete.
- Value: ⭐⭐⭐⭐ Provides a more efficient framework for long-horizon decision-making in LLM agents.