Hierarchical Reinforcement Learning with Targeted Causal Interventions
Conference: ICML2025
arXiv: 2507.04373
Code: GitHub
Area: Reinforcement Learning
Keywords: Hierarchical Reinforcement Learning, Causal Discovery, Subgoal Structure, Intervention Sampling, Long-Horizon Sparse Rewards
TL;DR
This paper proposes the HRC framework, which models the relationships between subgoals in hierarchical reinforcement learning as a causal graph. It learns the subgoal structure using a causal discovery algorithm and performs targeted interventions based on causal effect prioritization, significantly reducing the training cost for long-horizon sparse reward tasks.
Background & Motivation
Traditional RL performs poorly in long-horizon sparse reward tasks. HRL mitigates this issue by decomposing tasks into a hierarchy of subgoals. The core challenge lies in how to efficiently discover the hierarchical structure among subgoals and leverage it to accelerate training.
Existing works (Hu et al., 2022; Nguyen et al., 2024) attempt to infer causal relationships among subgoals using causal discovery algorithms, but suffer from the following limitations:
- Direct adaptation of general-purpose causal discovery algorithms (Ke et al., 2019) without customization for HRL scenarios.
- Random selection from the controllable subgoal set for exploration, failing to prioritize based on causal effects.
- Lack of theoretical analysis and performance guarantees.
To address these issues, this paper proposes a systematic causal hierarchical reinforcement learning framework.
Method
Overall Architecture: HRC (Hierarchical RL via Causality)
HRC associates environmental "unlocking factor" variables \(\mathcal{X} = \{X_1, \ldots, X_n\}\) with subgoals \(\Phi = \{g_1, \ldots, g_n\}\) in a one-to-one mapping, where subgoal \(g_i\) is achieved when \(X_i^t = 1\). The causal relationships among subgoals are represented by the subgoal structure \(\mathscr{G}\) (a directed graph).
The algorithm maintains two key sets:
- Controllable Set (CS): Subgoals that have already been mastered.
- Intervention Set (IS): Subgoals utilized for causal exploration.
Algorithm Flow (Algorithm 1)
- Initialization: Train root subgoals (subgoals without parent nodes) and add them to CS.
- Loop (until the final goal is added to IS):
    - Select a subgoal \(g_{\text{sel}}\) from CS \(\rightarrow\) add to IS.
    - Intervention Sampling: Perform interventions on subgoals in IS and collect trajectory data \(D_I\).
    - Causal Discovery: Infer the subgoal structure \(\hat{\mathscr{G}}\).
    - Identify Reachable Subgoals (CCS): Subgoals whose parent nodes are all in IS.
    - Train Reachable Subgoals \(\rightarrow\) add to CS.
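The loop above can be sketched on a toy graph. This is a minimal illustration, not the authors' implementation: it assumes AND-type subgoals, replaces intervention sampling and causal discovery with an oracle that reads the true parent sets, and treats training as instantaneous. The names `hrc_loop`, `parents`, and `select`, as well as the crafting graph, are hypothetical.

```python
def hrc_loop(parents, final_goal, select):
    """Toy version of the HRC outer loop; returns the order subgoals enter IS.

    parents: dict mapping each subgoal to the list of its parents (the true
             graph, used here as an oracle in place of causal discovery).
    select:  function that picks one subgoal from a sorted candidate list.
    """
    # Initialization: root subgoals (no parents) are trained first and form CS.
    controllable = {g for g, ps in parents.items() if not ps}
    intervention = []  # IS, kept as a list to record insertion order

    while final_goal not in intervention:
        candidates = sorted(controllable - set(intervention))
        if not candidates:
            raise ValueError("final goal is unreachable from the root subgoals")
        g_sel = select(candidates)
        intervention.append(g_sel)

        # ASSUMPTION: oracle stand-in for "intervene, then run causal discovery".
        # A subgoal becomes reachable once all of its parents are in IS.
        reachable = {
            g for g, ps in parents.items()
            if ps and g not in controllable and set(ps) <= set(intervention)
        }
        controllable |= reachable  # "train" reachable subgoals and add them to CS

    return intervention


# Hypothetical crafting graph; "pickaxe" is the final goal.
parents = {
    "wood": [], "stone": [], "flower": [],
    "stick": ["wood"],
    "bouquet": ["flower"],          # distractor: no path to the final goal
    "pickaxe": ["stick", "stone"],
}
order = hrc_loop(parents, "pickaxe", select=lambda c: c[0])
print(order)
```

With the naive "first candidate" selector, the distractors `flower` and `bouquet` both enter IS before the final goal does, which is the wasted effort a targeted selector is meant to avoid.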
Targeted Strategy
The key innovation lies in how \(g_{\text{sel}}\) is selected from CS. The paper proposes two causally guided ranking rules:
Rule 1: Causal Effect Ranking
Selects the subgoal with the largest causal effect on the final goal \(g_n\). When all subgoals are of AND type, this rule achieves optimality by adding only subgoals with active paths to the final goal into IS.
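As a rough illustration of the AND-type case, the sketch below keeps only candidates that have a directed path to the final goal. This is a coarser proxy than the rule itself, which ranks by estimated causal effect; the graph and the names `ancestors_of` and `select_with_path` are hypothetical, and the true parent sets are assumed known.

```python
def ancestors_of(parents, goal):
    """Return every subgoal that has a directed path to `goal`."""
    seen, stack = set(), [goal]
    while stack:
        node = stack.pop()
        for p in parents.get(node, []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def select_with_path(candidates, parents, final_goal):
    """Pick the first candidate that lies on some path to the final goal."""
    relevant = ancestors_of(parents, final_goal) | {final_goal}
    useful = [g for g in candidates if g in relevant]
    return useful[0] if useful else None


# Hypothetical graph: "flower" and "bouquet" have no path to "pickaxe".
parents = {
    "wood": [], "stone": [], "flower": [],
    "stick": ["wood"],
    "bouquet": ["flower"],
    "pickaxe": ["stick", "stone"],
}
relevant = ancestors_of(parents, "pickaxe")
choice = select_with_path(["flower", "stone", "wood"], parents, "pickaxe")
print(sorted(relevant), choice)
```

The distractor `flower` is skipped even though it comes first in the candidate list, because intervening on it cannot affect the final goal.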
Rule 2: Shortest Path Ranking
Inspired by A* search, it selects the subgoal with the minimum combined cost \(\mathsf{f}(g_i) = \mathsf{g}(g_i) + \mathsf{h}(g_i)\), where, as in A*, \(\mathsf{g}\) is the cost incurred so far and \(\mathsf{h}\) is an estimate of the remaining cost to the final goal. When the subgoal structure is a DAG that consists exclusively of OR-type subgoals, this rule follows the shortest path exactly, yielding the minimum training cost.
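The A*-style score can be illustrated on a small OR-type DAG. The sketch makes strong simplifying assumptions that are not taken from the paper: every edge has unit cost, \(\mathsf{g}\) is the hop distance from the nearest root, \(\mathsf{h}\) is the exact hop distance to the final goal, and the true graph is known. The names `hop_distances` and `select_min_f` are hypothetical.

```python
from collections import deque


def hop_distances(adjacency, sources):
    """Breadth-first hop distances from `sources` along `adjacency`."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def select_min_f(candidates, children, roots, final_goal):
    """Return the candidate minimising f = g + h under unit edge costs."""
    # Reverse the graph so h can be computed by searching backwards from the goal.
    parents = {}
    for u, vs in children.items():
        for v in vs:
            parents.setdefault(v, []).append(u)
    g_cost = hop_distances(children, roots)        # ASSUMPTION: g = hops from a root
    h_cost = hop_distances(parents, [final_goal])  # ASSUMPTION: h = hops to the goal
    scored = [(g_cost[c] + h_cost[c], c)
              for c in candidates if c in g_cost and c in h_cost]
    return min(scored)[1] if scored else None


# Hypothetical OR-type DAG with two routes: a -> c -> goal and a -> b -> d -> goal.
children = {"a": ["b", "c"], "b": ["d"], "c": ["goal"], "d": ["goal"]}
best = select_min_f(["b", "c"], children, roots=["a"], final_goal="goal")
print(best)
```

Here `c` scores 1 + 1 and `b` scores 1 + 2, so the selector follows the shorter route to the goal.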
Causal Discovery Algorithm: SSD (Subgoal Structure Discovery)
The state transition of subgoals is modeled as an Abstract Structural Causal Model (A-SCM):
where for AND subgoals, \(\theta_i = \bigwedge_{g_j \in PA_{g_i}} X_j^t\), and for OR subgoals, \(\theta_i = \bigvee_{g_j \in PA_{g_i}} X_j^t\).
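The activation condition \(\theta_i\) is easy to state in code. The `theta` function below follows the AND/OR definitions directly. The `next_value` helper is only an assumed transition, not the paper's structural equation: it supposes that an achieved subgoal stays achieved (Assumption 4.2) and that an unachieved one can flip to 1 only when \(\theta_i = 1\) and an attempt succeeds. Both function names and the `attempt_succeeds` flag are hypothetical.

```python
def theta(kind, parent_values):
    """Activation condition for one subgoal, given its parents' binary values."""
    if kind == "AND":
        return int(all(parent_values))  # every parent must be achieved
    if kind == "OR":
        return int(any(parent_values))  # one achieved parent is enough
    raise ValueError(f"unknown subgoal type: {kind!r}")


def next_value(x_now, kind, parent_values, attempt_succeeds=True):
    """ASSUMED transition, for illustration only (not the paper's equation)."""
    if x_now == 1:
        return 1  # an achieved subgoal is never lost (Assumption 4.2)
    return int(theta(kind, parent_values) == 1 and attempt_succeeds)


print(theta("AND", [1, 0]), theta("OR", [1, 0]))
print(next_value(0, "AND", [1, 1]), next_value(0, "OR", [0, 0]))
```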
The parent nodes are recovered by minimizing a loss function with sparsity regularization:
Theoretical guarantee (Theorem 8.4): There exists \(\lambda > 0\) such that the positive coefficients in the optimal solution \(\boldsymbol{\beta}^*\) correspond exactly to the parent nodes of \(X_i\).
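The idea of reading parents off the positive coefficients of a sparse fit can be illustrated with a stand-in objective. The sketch below does not use the SSD loss. It swaps in a plain non-negative lasso (squared error plus an L1 penalty, solved by projected gradient descent) on synthetic data for one OR-type subgoal whose parents are variables 0 and 2. The function name, the data generator, and the values of `lam`, `step`, and `iters` are all assumptions chosen for this toy case.

```python
import random


def fit_nonneg_lasso(X, y, lam=0.05, step=0.3, iters=500):
    """Minimise (1/2N)*||y - b0 - X@beta||^2 + lam*sum(beta) with beta >= 0.

    Solved by projected (proximal) gradient descent; the intercept b0 is
    neither penalised nor constrained. Returns (b0, beta).
    """
    n, d = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * d
    for _ in range(iters):
        resid = [b0 + sum(b * x for b, x in zip(beta, row)) - t
                 for row, t in zip(X, y)]
        grad0 = sum(resid) / n
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n
                for j in range(d)]
        b0 -= step * grad0
        # Gradient step, L1 shrinkage, then projection onto beta >= 0.
        beta = [max(0.0, b - step * (g + lam)) for b, g in zip(beta, grad)]
    return b0, beta


# Synthetic data: five binary candidate parents; the target is X_0 OR X_2.
rng = random.Random(0)
X = [[rng.randint(0, 1) for _ in range(5)] for _ in range(400)]
y = [float(row[0] or row[2]) for row in X]

b0, beta = fit_nonneg_lasso(X, y)
recovered = [j for j, b in enumerate(beta) if b > 1e-6]
print(recovered)
```

The penalty drives the coefficients of the three irrelevant variables to zero while the two true parents keep positive weights, which is the qualitative behaviour the theorem describes for a suitable \(\lambda\).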
Key Experimental Results
Theoretical Cost Analysis (Table 1)
| Graph Structure | HRC_h (Targeted) | HRC_b (Random) |
|---|---|---|
| Tree \(G(n,b)\) | \(O(\log^2(n) \cdot b)\) | \(\Omega(n^2 b)\) |
| Semi-Erdős–Rényi \(G(n,p)\) | \(O(n^{4/3+2c/3} \log n)\) | \(\Omega(n^2)\) |
On tree structures, the targeted strategy lowers the training cost from \(\Omega(n^2 b)\) to \(O(\log^2(n) \cdot b)\), an exponential improvement in the dependence on \(n\).
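To get a feel for the gap on trees, the two growth rates can be evaluated for a few sizes. This is only an order-of-magnitude illustration: it ignores the hidden constants, assumes a base-2 logarithm, and compares an upper bound with a lower bound. The function names are hypothetical.

```python
import math


def targeted_growth(n, b):
    """Growth term of the targeted bound on trees: log^2(n) * b (base 2 assumed)."""
    return math.log2(n) ** 2 * b


def random_growth(n, b):
    """Growth term of the random-selection lower bound on trees: n^2 * b."""
    return n ** 2 * b


for n in (16, 1024, 65536):
    t, r = targeted_growth(n, 2), random_growth(n, 2)
    print(f"n={n}: targeted ~ {t:.0f}, random ~ {r}, ratio ~ {r / t:.0f}")
```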
Synthetic Data Experiments (Figure 5)
- On semi-Erdős–Rényi graphs, both HRC_c and HRC_s (the causal effect and shortest path rules, respectively) significantly outperform the random baseline HRC_b.
- On tree structures, the causal effect rule is equivalent to the shortest path rule (as there is only one path to the final goal).
- The sparser the graph, the greater the advantage of the targeted strategy.
2D-Minecraft Environment (Figure 6)
| Method | Convergence Speed |
|---|---|
| HRC_h(SSD) | Fastest |
| HRC_b(SSD) | Second fastest |
| CDHRL | Slow |
| HAC / HER / OHRL / PPO | Slowest |
Causal Discovery Accuracy Comparison (Table 2)
| Method | SHD ↓ | Missing Edges | Extra Edges |
|---|---|---|---|
| SSD (Ours) | 12.3 | 6.0 | 6.3 |
| SDI (Ke et al.) | 19.8 | 4.2 | 15.6 |
SSD produces far fewer extra edges than SDI (6.3 vs. 15.6) at the cost of slightly more missing edges (6.0 vs. 4.2), reducing the overall SHD by about 38% (12.3 vs. 19.8).
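In Table 2 the SHD equals the sum of missing and extra edges, which the sketch below mirrors. Note the assumption this implies: a reversed edge counts as one missing plus one extra edge, whereas some SHD definitions count it once. The function name and the example edge sets are hypothetical; the final lines recompute the relative reduction from the table's values.

```python
def shd_breakdown(true_edges, est_edges):
    """Return (missing, extra, shd) for two sets of directed edges."""
    true_edges, est_edges = set(true_edges), set(est_edges)
    missing = len(true_edges - est_edges)  # true edges the estimate lacks
    extra = len(est_edges - true_edges)    # estimated edges that are not true
    return missing, extra, missing + extra


# Hypothetical example: one edge is missed and one spurious edge is added.
truth = {("wood", "stick"), ("stick", "pickaxe")}
estimate = {("wood", "stick"), ("wood", "pickaxe")}
print(shd_breakdown(truth, estimate))

# Relative SHD reduction using the values reported in Table 2.
reduction = (19.8 - 12.3) / 19.8
print(f"{reduction:.1%}")
```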
Highlights & Insights
- Causal Perspective on HRL: Equating subgoal achievement to a causal intervention (the do-operator), establishing an elegant bridge between HRL and causal inference.
- Theoretical Guarantees: Providing the first training cost complexity analysis for causal HRL, showing a reduction in cost from \(\Omega(n^2 b)\) to \(O(\log^2(n) \cdot b)\) on tree structures.
- Customized Causal Discovery: SSD is specifically tailored to the AND/OR subgoal structures of HRL, achieving superior accuracy compared to general-purpose algorithms.
- Complementary Dual-Ranking Rules: The causal effect rule suits AND-type subgoals and the shortest path rule suits OR-type subgoals; a hybrid rule is also provided.
Limitations & Future Work
- The resource variables are assumed to be discrete and binary; continuous or high-dimensional state spaces require additional disentangled representation learning.
- Subgoals are assumed to never be lost once achieved (Assumption 4.2), which does not hold in reversible environments.
- Causal discovery can only recover discoverable parent nodes (Def 8.2), meaning some parent-child relationships may be missed.
- Beyond synthetic graphs, the only environment evaluated is 2D-Minecraft; evaluations in more complex 3D environments or robotic manipulation tasks are missing.
- The set of environmental resource variables \(\mathcal{X}\) must be predefined in practice; autonomous discovery remains an open problem.
Related Work & Insights
- CDHRL (Hu et al., 2022): The first framework for causal HRL, which utilizes SDI for causal discovery but lacks prioritization.
- Nguyen et al., 2024: Performs causal discovery on state-action pairs, which is more challenging in large state spaces.
- HAC (Levy et al., 2017): Hierarchical Actor-Critic without causal structures.
- HER (Andrychowicz et al., 2017): Hindsight Experience Replay.
- Insight: Causal effect prioritization can be extended to other domains requiring prioritized exploration, such as active learning and experimental design.
Rating
- Novelty: ⭐⭐⭐⭐ (Causal intervention perspective + targeted exploration strategy + specialized causal discovery)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Theoretical validation + synthetic data + 2D-Minecraft, but lacks evaluations in more realistic scenarios)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure, with the "craftsman" example consistently integrated for ease of understanding)
- Value: ⭐⭐⭐⭐ (Establishes a theoretical foundation for causal HRL, with the targeted strategy yielding practical acceleration)