Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks

Conference: ICLR 2026
arXiv: 2602.22817
Code: To be confirmed
Area: LLM Alignment
Keywords: group-relative RL, advantage estimation, long-horizon agent, bias-variance tradeoff, context consistency

TL;DR

This paper identifies a "historical context inconsistency" problem in stepwise group-based RL methods (e.g., GRPO/GiGPO): steps within the same group may have different historical contexts, which biases advantage estimation. HGPO addresses this with hierarchical grouping and adaptive weighting, producing low-bias, balanced-variance advantage estimates. It improves over GiGPO on ALFWorld and WebShop (e.g., WebShop success rate from 73.24% to 78.12% at 1.5B) with negligible additional overhead (<0.001%).

Background & Motivation

Background: RL-based LLM agent training methods (e.g., GRPO, GiGPO) have shown strong performance on long-horizon tasks. The core idea is to group multiple steps from the same rollout and estimate advantages using relative signals within the group.

Limitations of Prior Work: In long-horizon tasks, different steps within the same rollout episode may have entirely different historical contexts (e.g., step 3 and step 10 face distinct combinations of environment states). Mixing steps with inconsistent contexts when computing advantages introduces systematic bias.

Key Challenge: Step-level advantage estimation is unbiased but high-variance; group-level estimation is low-variance but biased. How can an optimal balance between the two be achieved?

Goal: Design a hierarchical advantage estimation approach that constructs nested group structures based on historical context consistency, enabling controllable bias-variance tradeoff.

Key Insight: Define the \(k\)-step context operator \(\mathcal{C}_k\), and construct nested groups \(G_0^H \supseteq G_1^H \supseteq \cdots \supseteq G_K^H\) based on shared historical context over 0 to \(K\) steps.

Core Idea: Groups with greater context consistency yield more accurate (lower-bias) advantage estimates and should receive higher weights.

Method

Overall Architecture

HGPO inserts a hierarchical advantage estimation module into the standard GRPO/GiGPO pipeline. For each step, nested groups are constructed based on historical context at multiple levels; advantages are computed per level and then aggregated via adaptive weighting. No additional models or rollouts are required—the approach relies entirely on offline hashmap lookups.

Key Designs

Context-Aware Hierarchical Grouping:
- Function: Stratify steps by historical context consistency.
- Mechanism: \(G_k^H\) contains all steps that share identical history over the first \(k\) steps. \(k=0\) corresponds to the entire rollout (all steps share an empty history); \(k=K\) yields the finest-grained grouping.
- Implementation: State sequence hashes are stored in a hashmap, enabling \(O(1)\) lookup.
- Design Motivation: Groups at higher \(k\) have more consistent historical contexts, resulting in lower bias in advantage estimation.
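
The grouping step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_hierarchical_groups`, the use of state identifiers as the history, and the choice to key level \(k\) on the \(k\) most recent history entries are all assumptions made here for illustration.

```python
from collections import defaultdict


def build_hierarchical_groups(rollouts, K):
    """Group steps by shared history at levels 0..K (hypothetical helper).

    rollouts: list of trajectories, each a list of hashable state identifiers.
    Returns groups, where groups[k] maps a history key to the list of
    (rollout_index, step_index) pairs whose history matches that key.
    """
    groups = [defaultdict(list) for _ in range(K + 1)]
    for i, states in enumerate(rollouts):
        for t in range(len(states)):
            for k in range(K + 1):
                # Assumption: the level-k key is the (up to) k states that
                # precede step t. At k=0 the key is the empty tuple, so all
                # steps fall into a single group.
                key = tuple(states[max(0, t - k):t])
                groups[k][key].append((i, t))
    return groups


# Toy rollouts with made-up state names.
rollouts = [["a", "b", "c"], ["a", "b", "d"], ["a", "x", "c"]]
groups = build_hierarchical_groups(rollouts, K=2)
```

Each lookup is a dictionary access on a tuple key, which matches the hashmap-based \(O(1)\) lookup described above. In this toy input, the third steps of the first two rollouts share the level-2 key `("a", "b")`, while the third rollout's final step sits alone under `("a", "x")`.
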
Adaptive Weighting Advantage Estimation:
- Function: Aggregate advantage estimates across hierarchy levels.
- Core Formula: \(w_k = \frac{(k+1)^\alpha}{\sum_{j=0}^{K} (j+1)^\alpha}\), which for \(\alpha > 0\) assigns greater weight to higher-level (larger \(k\)) groups.
- \(\alpha\) controls the bias-variance tradeoff: \(\alpha \to 0\) reduces to uniform weighting across levels; \(\alpha \to \infty\) concentrates all weight on the finest-grained group (\(k=K\)).
- Theoretical Guarantee: The advantage estimates of HGPO interpolate between step-level (unbiased, high-variance) and oracle estimation.
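
The weighting rule can be sketched in a few lines. The weight formula follows the text; the aggregation as a weighted sum of mean-baseline advantages, and the names `level_weights` and `aggregate_advantage`, are assumptions made here for illustration (the paper may additionally normalize by the group standard deviation).

```python
def level_weights(K, alpha):
    """w_k = (k+1)^alpha / sum_j (j+1)^alpha for k = 0..K."""
    raw = [(k + 1) ** alpha for k in range(K + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def aggregate_advantage(step_return, returns_per_level, alpha):
    """Weighted sum of per-level advantages (hypothetical aggregation).

    returns_per_level[k] holds the returns of all steps in the level-k
    group that contains the current step.
    """
    K = len(returns_per_level) - 1
    weights = level_weights(K, alpha)
    advantage = 0.0
    for w, returns in zip(weights, returns_per_level):
        baseline = sum(returns) / len(returns)  # group mean as baseline
        advantage += w * (step_return - baseline)
    return advantage


uniform = level_weights(2, 0.0)   # alpha = 0: every level weighted equally
peaked = level_weights(4, 50)     # large alpha: weight concentrates on k = K
example = aggregate_advantage(1.0, [[0.0, 1.0], [1.0, 1.0]], alpha=0.0)
```

With `alpha=0.0` the two levels in `example` are weighted equally, so the result is the average of the coarse-level advantage (0.5) and the fine-level advantage (0.0).
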
Computational Overhead Control:
- Offline hashmap lookup requires no additional forward passes.
- Adds approximately 0.5 seconds per iteration (<0.001% of total training time).
- Compatible with any group-based RL method (GRPO, GiGPO, DAPO, etc.).

Key Experimental Results

Main Results

Method	ALFWorld In-Succ	ALFWorld Out-Succ	WebShop Score	WebShop Succ
GiGPO (1.5B)	93.29%	91.53%	86.80%	73.24%
HGPO (1.5B, K=4)	94.85%	92.12%	90.64%	78.12%
GiGPO (7B)	95.43%	92.79%	88.44%	72.50%
HGPO (7B, K=4)	95.96%	93.75%	90.49%	79.29%
GPT-4o	—	48.0%	—	—
Gemini-2.5-Pro	—	60.3%	—	—
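
The absolute gains of HGPO over GiGPO can be read off the table with a short calculation. The numbers below are copied from the rows above; the dictionary layout and the `gains` helper are illustrative only.

```python
METRICS = ("ALFWorld In-Succ", "ALFWorld Out-Succ", "WebShop Score", "WebShop Succ")

# Values (in %) copied from the main results table.
RESULTS = {
    "1.5B": {"GiGPO": (93.29, 91.53, 86.80, 73.24),
             "HGPO": (94.85, 92.12, 90.64, 78.12)},
    "7B": {"GiGPO": (95.43, 92.79, 88.44, 72.50),
           "HGPO": (95.96, 93.75, 90.49, 79.29)},
}


def gains(size):
    """Percentage-point gain of HGPO (K=4) over GiGPO for one model size."""
    base = RESULTS[size]["GiGPO"]
    ours = RESULTS[size]["HGPO"]
    return {m: round(o - b, 2) for m, b, o in zip(METRICS, base, ours)}


for size in RESULTS:
    print(size, gains(size))
```

On these figures the ALFWorld gains are under two percentage points for both model sizes, while the WebShop success-rate gains are the largest entries in each row.
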

Ablation Study

Configuration	WebShop Score (1.5B)
HGPO K=0 (= GiGPO)	86.80%
HGPO K=1	87.32%
HGPO K=2	88.92%
HGPO K=4	90.64%

Key Findings

- Smaller models benefit more: the 1.5B model gains 3.41% on average (K=2), while the 7B model gains 0.74%.
- Performance improves with larger \(K\), though with diminishing returns.
- HGPO surpasses closed-source models such as GPT-4o and Gemini-2.5-Pro on ALFWorld.
- Computational overhead is negligible (<0.001% increase in training time).

Highlights & Insights

- Valuable Problem Identification: "Historical context inconsistency" is a real and previously overlooked issue in group-based RL.
- Near-Zero-Cost Improvement: No additional models, rollouts, or GPUs are required; gains come purely from hashmap lookups and weighted aggregation.
- Plug-and-Play: Compatible with any group-based RL method, including GRPO, GiGPO, and DAPO.
- Theoretical Positioning: The analysis places HGPO between step-level and oracle estimation on the bias-variance spectrum, with \(\alpha\) selecting the operating point.

Limitations & Future Work

- Validation is limited to two benchmarks (ALFWorld and WebShop); broader coverage would strengthen the claims.
- Gains for the 7B model on ALFWorld are marginal, under one percentage point at K=4 (In-Succ 95.43% → 95.96%, Out-Succ 92.79% → 93.75%), suggesting that advantage estimation is already relatively accurate at scale.
- The approach assumes hashable environment states; applicability to continuous state spaces is not discussed.
- In-depth comparison with value-based advantage estimation methods (e.g., GAE) is absent.

Related Work & Insights

- vs. GRPO: GRPO groups at the outcome level and ignores step-level contextual differences.
- vs. GiGPO: GiGPO extends to the step level but still uses the full rollout as the group, retaining context inconsistency.
- vs. DAPO: DAPO focuses on exploration and clipping, which is orthogonal to HGPO and can be combined with it.
- The findings are relevant to any group-based method for RLHF or agent training, since each of them computes baselines over groups whose members may not share the same context.

Supplementary Technical Details

Illustrative Example of Context Inconsistency

In ALFWorld, a task such as "find an apple and place it in the refrigerator" may span multiple steps. Step 3 (opening a drawer) and step 8 (opening the refrigerator) may belong to the same rollout group yet face entirely different environment states. Using the intra-group mean as the baseline in this setting introduces bias.
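
A toy calculation makes the bias concrete. All numbers here are made up for illustration: steps taken in a "drawer" context are assumed to have low returns, and steps in a "refrigerator" context high returns.

```python
from statistics import mean

# Hypothetical returns of steps observed in two different contexts.
drawer_returns = [0.1, 0.2, 0.3]
fridge_returns = [0.8, 0.9, 1.0]

# Baseline when both contexts are pooled into one group.
pooled_baseline = mean(drawer_returns + fridge_returns)
# Baseline when only context-consistent steps are grouped.
fridge_baseline = mean(fridge_returns)

step_return = 0.9  # a typical refrigerator-context step
adv_pooled = step_return - pooled_baseline      # about 0.35
adv_consistent = step_return - fridge_baseline  # about 0.0
```

The step is merely average for its own context, yet the pooled baseline credits it with a clearly positive advantage. That gap is the systematic bias that context-consistent grouping is meant to remove.
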

Conceptual Comparison with GAE

Generalized Advantage Estimation (GAE) interpolates between TD(0) and Monte Carlo estimation via the \(\lambda\) parameter to control the bias-variance tradeoff. HGPO adopts a similar philosophy but operates at the group level rather than the temporal-step level, and requires no additional value function approximation.
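
For reference, the standard GAE estimator is written out below next to an HGPO-style aggregate. The second line is a sketch under assumptions: \(\hat{A}^{(k)}\) is shorthand introduced here for the group-relative advantage computed within the level-\(k\) group, and the weighted-sum form is inferred from the description of adaptive weighting rather than quoted from the paper.

\[
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l}
\]

\[
\hat{A}^{\mathrm{HGPO}} = \sum_{k=0}^{K} w_k\, \hat{A}^{(k)}, \qquad
w_k = \frac{(k+1)^\alpha}{\sum_{j=0}^{K} (j+1)^\alpha}
\]

In GAE, \(\lambda = 0\) keeps only the one-step TD residual \(\delta_t\) and \(\lambda = 1\) recovers the Monte Carlo return minus \(V(s_t)\). In the HGPO-style form, \(\alpha\) plays the analogous role of moving between uniform mixing over levels and the finest-grained group, without requiring a learned \(V\).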

Rating

- Novelty: ⭐⭐⭐⭐ The identified problem is valuable; the solution is elegant and zero-cost.
- Experimental Thoroughness: ⭐⭐⭐ Two benchmarks are sufficient to demonstrate effectiveness, though broader coverage is desirable.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; theoretical analysis is rigorous.
- Value: ⭐⭐⭐⭐ Plug-and-play improvement with practical significance for the group-based RL community.