Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning¶
Conference: ICML2025
arXiv: 2506.07744
Code: GitHub
Area: Offline Hierarchical RL
Keywords: Offline reinforcement learning, hierarchical RL, trajectory stitching, graph search, temporal distance representation, subgoal selection
TL;DR¶
This paper proposes the Graph-Assisted Stitching (GAS) framework, which replaces explicit high-level policy learning with graph-search-based subgoal selection. By constructing a graph through clustering in the Temporal Distance Representation (TDR) space and performing shortest-path planning, GAS enables highly efficient cross-trajectory stitching in offline HRL. It achieves a breakthrough, boosting the performance on the most challenging antmaze-giant-stitch task from the Prev. SOTA of 1.0 to 88.3.
Background & Motivation¶
Offline hierarchical reinforcement learning (offline HRL) tackles long-horizon, sparse-reward tasks by utilizing a high-level policy to generate subgoals and a low-level policy to execute actions. However, existing methods suffer from three major bottlenecks:
Temporal-unaware subgoal sampling: Most methods sample subgoal candidates at fixed step intervals along a trajectory (e.g., every \(c\) steps), neglecting the actual temporal distance relationships between states, which leads to redundant or inefficient subgoals.
Insufficient long-horizon reasoning capability: Methods like HIQL perform well on antmaze-medium but experience a sharp decline in ultra-long horizon environments like antmaze-giant, because sparse rewards provide extremely weak learning signals for the high-level policy.
Lack of cross-trajectory stitching capability: Offline datasets contain multiple trajectories aiming at different goals. Ideally, a policy should stitch useful segments from various trajectories to form new paths. However, prior methods lack an effective cross-goal stitching mechanism.
Core Idea: Since training a high-level policy is difficult, it is better to replace it with graph search—specifically, construct a state graph in the TDR space and directly select the sequence of subgoals using the Dijkstra shortest-path algorithm.
Method¶
GAS comprises four core components:
1. Temporal Distance Representation (TDR) Pre-training¶
Embed states into a latent space \(\mathcal{H}\) such that the Euclidean distance between two states equals the minimum number of time steps required for the optimal policy to transition between them:
Train the representation on offline data via an IQL-style expectile regression loss:
where \(V(s,g) = -\|\psi(s) - \psi(g)\|_2\), and \(\ell_\tau^2(x) = |\tau - \mathbb{1}(x<0)| \cdot x^2\).
2. TD-Aware Graph Construction¶
Cluster all states in the TDR space with a fixed temporal distance interval \(H_{\text{TD}}\):
- The first state becomes the center of the first cluster.
- Subsequent states are assigned to the nearest cluster if their distance to its center is \(\leq H_{\text{TD}}\); otherwise, a new cluster is created.
- Cluster center = mean of the states in the cluster \(\rightarrow\) used as graph nodes.
- Construct edges between nodes with distance \(\leq H_{\text{TD}}\).
Semantically similar states from different trajectories are aggregated into the same node, naturally enabling cross-trajectory stitching.
3. Temporal Efficiency (TE) Metric Filtering¶
To prevent low-quality states (from suboptimal trajectories) from contaminating the graph structure, a TE metric is defined to measure the execution efficiency of transitions at the current state:
where \(s_{\text{opt}}\) is the optimal future state along the trajectory at a temporal distance of \(H_{\text{TD}}\), and \(s_{\text{reached}}\) is the state actually reached after \(H_{\text{TD}}\) steps. Only highly efficient states with \(\theta_{\text{TE}} \geq 0.9\) are kept for graph construction. The selected states account for only 2%–8% of the dataset, and the number of nodes is less than 1% of the dataset overall.
4. Task Planning and Execution¶
No high-level policy is required during inference:
- Precompute the shortest distances from all graph nodes to the final goal using Dijkstra's algorithm.
- Find reachable nodes within the \(H_{\text{TD}}\) range of the current state.
- Select \(v_{\text{subgoal}} = \arg\min_{v \in \mathcal{V}_{\text{near}}} (\text{Dists}[v] + \|h_{\text{cur}} - v\|_2)\).
- The low-level policy executes actions along the direction vector \(\vec{h}_{\text{dir}} = \text{dir}(\psi(s_t), \psi(s_{\text{sub}}))\).
The low-level policy is trained using DDPG+BC loss and IQL value functions, with subgoal sampling adjusted to TDR distance rather than fixed steps.
Key Experimental Results¶
State-based Benchmarks (OGBench)¶
| Task | GCBC | HIQL | HHILP | GAS |
|---|---|---|---|---|
| antmaze-medium-navigate | 33.1 | 95.3 | 96.3 | 96.3 |
| antmaze-large-navigate | 23.4 | 89.9 | 86.8 | 93.2 |
| antmaze-giant-navigate | 0.0 | 67.3 | 53.1 | 77.6 |
| antmaze-medium-stitch | 43.2 | 92.0 | 96.0 | 98.1 |
| antmaze-large-stitch | 2.3 | 71.7 | 34.1 | 96.3 |
| antmaze-giant-stitch | 0.0 | 1.0 | 0.0 | 88.3 |
| antmaze-large-explore | 0.0 | 2.9 | 2.4 | 94.2 |
| scene-play | 5.4 | 40.0 | 43.4 | 73.6 |
| kitchen-partial | 69.5 | 73.1 | 66.7 | 87.3 |
Key Findings¶
- Breakthrough improvement on stitching tasks: Performance on antmaze-giant-stitch surges from HIQL's 1.0 \(\rightarrow\) 88.3 (+8730%).
- Enormous advantage in exploratory datasets: Performance on antmaze-large-explore improves from HIQL's 2.9 \(\rightarrow\) 94.2.
- Equally effective in visual environments: Reaches 55.8 on visual-antmaze-giant-stitch (compared to only 3.6 for Prev. SOTA).
TE Filtering Ablation¶
| Task | Graphing with All States | With TE Filtering | Gain |
|---|---|---|---|
| antmaze-giant-navigate | 63.4 | 77.6 | +14.2 |
| antmaze-giant-stitch | 75.3 | 88.3 | +13.0 |
| antmaze-large-explore | 75.4 | 94.2 | +18.8 |
Highlights & Insights¶
- Paradigm shift: Replacing high-level policy learning with graph search bypasses the fundamental difficulty of training high-level policies under sparse rewards.
- Exquisite design of the TE metric: Simply using Cosine Similarity effectively filters out suboptimal states. Utilizing only 2%–8% of the data yields better results than using the entire dataset.
- Unified \(H_{\text{TD}}\) parameter: A single hyperparameter simultaneously controls the clustering interval, edge construction threshold, subgoal sampling distance, and reachable range during execution, ensuring consistency between training and testing.
- Validation in extreme stitching scenarios: Achieves a quantum leap from 1.0 to 88.3 on antmaze-giant-stitch, the task that demands the strongest stitching capability.
Limitations & Future Work¶
- Performance gap in visual environments: Compared to state-based environments, overall performance in visual environments remains suboptimal. The authors acknowledge a lack of specialized representation learning for high-dimensional visual inputs.
- \(H_{\text{TD}}\) requires environment-specific tuning: The optimal value of \(H_{\text{TD}}\) varies across tasks, and there lacks an adaptive selection mechanism.
- Static graph structure: The graph is constructed once during inference and is not updated online based on execution feedback.
- Dijkstra algorithm is re-run per episode: Although the number of nodes is <1%, latency might still become an issue in large-scale environments.
Related Work & Insights¶
- HIQL (Park et al., 2023): Utilizes a latent space representation learned via value functions for the high-level policy. GAS inherits the TDR and low-level training implementation from HIQL.
- HHILP (Park et al., 2024c): Introduces TDR representation learning. Building on this, GAS replaces the high-level policy with graph search.
- OGBench (Park et al., 2025a): Provides a standardized evaluation benchmark containing three categories of datasets: navigate, stitch, and explore.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Replacing high-level policies with graph search and the TE filtering metric are core innovations.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers locomotion/navigation/manipulation, state/visual modalities, and 5 ablation studies.
- Writing Quality: ⭐⭐⭐⭐ — Clearly structured with intuitive diagrams and complete mathematical formulations.
- Value: ⭐⭐⭐⭐⭐ — Achieves order-of-magnitude improvements in offline HRL stitching tasks, showing significant practical value.