Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents¶

Conference: ICLR 2026
arXiv: 2510.03253
Code: To be confirmed
Area: Agent / Alignment
Keywords: hierarchical DPO, preference learning, long-horizon agent, curriculum learning, action group

TL;DR¶

This paper proposes the HPL (Hierarchical Preference Learning) framework to address the granularity mismatch in preference learning for long-horizon LLM agents. HPL combines three-level DPO (trajectory-level, step-level, and action-group-level) with two-dimensional curriculum learning (subtask complexity × sample difficulty). With Qwen2.5-1.5B on ALFWorld, WebShop, and InterCode-SQL, it reaches an average score of 59.44, compared with 55.43 for ETO and 55.49 for IPR.

Background & Motivation¶

Background: Direct Preference Optimization (DPO) has become a dominant approach for LLM alignment, yet it suffers from a granularity mismatch in long-horizon agent tasks: trajectory-level signals are too coarse to identify critical decision points, while step-level signals have high variance.

Limitations of Prior Work: Existing agent preference learning methods either rely on outcome-level rewards (trajectory success/failure) or use step-level signals that require extensive rollouts to reduce variance. Each granularity has its own trade-offs, and no unified framework has been established.

Key Challenge: Coarse granularity prevents precise credit assignment; fine granularity yields high variance and low sample efficiency. An intermediate granularity that balances the two is needed.

Goal: Design a multi-granularity preference learning framework that simultaneously leverages trajectory-level, step-level, and action-group-level preference signals.

Key Insight: Group action sequences by semantic coherence (e.g., "navigate to the kitchen" as one group, "open the refrigerator and retrieve an item" as another), and perform preference comparison at the group level.

Core Idea: Three-level DPO provides complementary credit assignment signals, coupled with two-dimensional curriculum learning that guides training from easy to hard.

Method¶

Overall Architecture¶

HPL constructs preference pairs at three hierarchical levels from exploratory data generated by a reference policy: ① trajectory level, comparing complete trajectories → ② step level, running Monte Carlo rollouts from decision points and comparing the resulting sub-trajectories → ③ action-group level, comparing pairs of semantically coherent action groups. The three DPO losses are combined in a weighted sum, added to the \(L_{BC}\) term of the final loss, and scheduled via two-dimensional curriculum learning.
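The step-level construction can be illustrated with a minimal sketch. It assumes a hypothetical `rollout_fn(state, action)` that completes an episode after taking `action` and returns its final reward; the function names, the `min_gap` filter, and the toy environment are illustrative assumptions, not the paper's implementation.

```python
from statistics import mean
from typing import Callable, Optional, Tuple

# Hypothetical interface: finish the episode after taking `action` in `state`
# and return the final outcome reward.
RolloutFn = Callable[[str, str], float]


def mc_value(rollout_fn: RolloutFn, state: str, action: str, m: int = 5) -> float:
    """Monte Carlo estimate: mean final reward over m rollouts."""
    return mean([rollout_fn(state, action) for _ in range(m)])


def step_preference_pair(
    rollout_fn: RolloutFn,
    state: str,
    action_a: str,
    action_b: str,
    m: int = 5,
    min_gap: float = 0.0,
) -> Optional[Tuple[str, str]]:
    """Return (preferred, dispreferred), or None if the estimates are too close."""
    value_a = mc_value(rollout_fn, state, action_a, m)
    value_b = mc_value(rollout_fn, state, action_b, m)
    if abs(value_a - value_b) <= min_gap:
        return None  # no reliable preference at this decision point
    return (action_a, action_b) if value_a > value_b else (action_b, action_a)


# Deterministic toy environment, for illustration only.
def toy_rollout(state: str, action: str) -> float:
    return 1.0 if action == "open fridge" else 0.0


pair = step_preference_pair(toy_rollout, "kitchen", "open drawer", "open fridge")
```

In this toy setting `pair` is `("open fridge", "open drawer")`. With a stochastic environment, the estimate from only `m=5` rollouts would be noisy, which is the variance problem the group level is meant to reduce.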

Key Designs¶

Three-Level DPO Preference Signals:
- \(L_{traj\text{-}DPO}\): Compares complete trajectory pairs to provide a global signal.
- \(L_{step\text{-}DPO}\): Runs \(M=5\) Monte Carlo rollouts from each decision point and compares the resulting sub-trajectories.
- \(L_{group\text{-}DPO}\): Compares pairs of semantically coherent action groups.
- Final loss: \(L = L_{BC} + L_{traj\text{-}DPO} + L_{step\text{-}DPO} + L_{group\text{-}DPO}\), where \(L_{BC}\) is a behavior cloning loss (written here with unit weights).
- Theoretical guarantee (Proposition 1): Group-level DPO achieves a variance reduction of \(O(T/\log(1/\varepsilon))\) when \(k=\Theta(\log(1/\varepsilon))\). Here \(T\) is presumably the trajectory length, \(k\) the action-group length, and \(\varepsilon\) the error tolerance in the paper's bound.
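As a rough illustration of how the three levels can share one loss form, the sketch below applies the standard DPO objective to summed log-probabilities; the levels then differ only in which token span is summed (the whole trajectory, the suffix after a decision point, or one action group). The helper names, `beta=0.1`, and the `weights` argument are assumptions for illustration, and the paper's exact formulation may differ.

```python
import math
from typing import Sequence, Tuple


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def span_logp(token_logps: Sequence[float], start: int, end: int) -> float:
    """Sum of per-token log-probs over tokens [start, end) of a trajectory."""
    return sum(token_logps[start:end])


def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, dispreferred) pair.

    Inputs are summed log-probs of each span under the policy (pol_*)
    and the frozen reference model (ref_*).
    """
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return softplus(-margin)  # equals -log(sigmoid(margin))


def hpl_loss(l_bc: float, l_traj: float, l_step: float, l_group: float,
             weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Combine the four terms; unit weights reproduce the formula above."""
    w_traj, w_step, w_group = weights
    return l_bc + w_traj * l_traj + w_step * l_step + w_group * l_group


# Group-level example: the policy already favours the preferred group
# slightly more than the reference does, so the loss falls below log(2).
group_loss = dpo_loss(pol_w=-4.0, pol_l=-6.0, ref_w=-5.0, ref_l=-5.0)
```

When the policy and reference assign identical log-probabilities, the margin is zero and the loss equals \(\log 2\); training pushes the margin positive.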

Action Group Segmentation Strategies:
- Fixed-N: Splits each trajectory into a fixed number of groups (\(N=3\)).
- Fixed-K: Uses a fixed group length of \(K=3\) steps.
- Uncertainty-based: Places group boundaries where policy entropy exceeds its 80th-percentile threshold.
- Semantic (best): Uses GPT-4o as a semantic segmenter to group actions by subtask meaning.
- Design Motivation: Semantic segmentation yields the highest intra-group coherence, resulting in higher-quality DPO signals.
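The three rule-based strategies can be sketched as follows. This is a minimal illustration under stated assumptions: groups are lists of step indices, Fixed-N uses near-equal contiguous splits, and the uncertainty-based variant starts a new group at any step whose entropy is strictly above a nearest-rank 80th percentile. The semantic strategy is omitted because it requires an LLM call.

```python
from typing import List, Sequence


def fixed_n(num_steps: int, n: int = 3) -> List[List[int]]:
    """Split steps 0..num_steps-1 into n contiguous, near-equal groups."""
    if num_steps == 0:
        return []
    n = min(n, num_steps)  # never create empty groups
    bounds = [i * num_steps // n for i in range(n + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(n)]


def fixed_k(num_steps: int, k: int = 3) -> List[List[int]]:
    """Split steps into consecutive groups of k steps (last may be shorter)."""
    return [list(range(s, min(s + k, num_steps))) for s in range(0, num_steps, k)]


def uncertainty_based(entropies: Sequence[float],
                      percentile: int = 80) -> List[List[int]]:
    """Start a new group wherever entropy exceeds the percentile threshold."""
    if not entropies:
        return []
    ranked = sorted(entropies)
    rank = max(1, (percentile * len(ranked) + 99) // 100)  # nearest-rank rule
    threshold = ranked[rank - 1]
    groups = [[0]]
    for t in range(1, len(entropies)):
        if entropies[t] > threshold:
            groups.append([t])  # high-uncertainty step opens a new group
        else:
            groups[-1].append(t)
    return groups


# Hypothetical per-step policy entropies for a 10-step trajectory.
example_entropies = [0.1, 0.2, 0.9, 0.1, 0.3, 1.2, 0.2, 0.1, 0.4, 0.2]
example_groups = uncertainty_based(example_entropies)
```

For the example entropies the threshold is 0.4, so new groups start at steps 2 and 5, giving `[[0, 1], [2, 3, 4], [5, 6, 7, 8, 9]]`.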

Two-Dimensional Curriculum Learning:
- \(3\times3\) difficulty matrix: Y-axis = group length (subtask complexity); X-axis = reward gap \(\Delta R = \hat{r}(G_w) - \hat{r}(G_l)\) between the preferred group \(G_w\) and the dispreferred group \(G_l\) (sample discriminability; a larger gap makes the pair easier to tell apart).
- Phase 1: \(B_{1,1}\) (short + easy) → Phase 2: \(B_{1,1} \cup B_{1,2} \cup B_{2,1}\) → Phase 3: all 9 buckets.
- Design Motivation: The model first establishes basic preference understanding on simple samples, then progressively incorporates harder samples.
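The bucket schedule above can be sketched in a few lines. The cut points (`len_cuts`, `gap_cuts`), the sample dictionary layout, and the index order (length level, difficulty level) are assumptions for illustration; the summary does not state how the paper sets bucket boundaries.

```python
from typing import Dict, List, Set, Tuple

Bucket = Tuple[int, int]  # (length level, difficulty level), each in {1, 2, 3}


def level(value: float, cuts: Tuple[float, float]) -> int:
    """Return 1, 2, or 3 depending on how many cut points `value` exceeds."""
    return 1 + sum(value > c for c in cuts)


def bucket(group_len: int, delta_r: float,
           len_cuts: Tuple[float, float] = (3, 6),
           gap_cuts: Tuple[float, float] = (0.3, 0.6)) -> Bucket:
    """Map a preference pair to its cell in the 3x3 difficulty matrix."""
    length_level = level(group_len, len_cuts)
    # A larger reward gap is easier to discriminate, so invert the gap level.
    difficulty_level = 4 - level(delta_r, gap_cuts)
    return (length_level, difficulty_level)


ALL_BUCKETS: Set[Bucket] = {(i, j) for i in (1, 2, 3) for j in (1, 2, 3)}
PHASES: Dict[int, Set[Bucket]] = {
    1: {(1, 1)},
    2: {(1, 1), (1, 2), (2, 1)},
    3: ALL_BUCKETS,
}


def phase_samples(samples: List[dict], phase: int) -> List[dict]:
    """Keep only the pairs whose bucket is unlocked in the given phase."""
    allowed = PHASES[phase]
    return [s for s in samples if bucket(s["len"], s["delta_r"]) in allowed]


# Hypothetical preference pairs.
samples = [
    {"len": 2, "delta_r": 0.8},  # short, easy   -> (1, 1)
    {"len": 2, "delta_r": 0.5},  # short, medium -> (1, 2)
    {"len": 5, "delta_r": 0.9},  # medium, easy  -> (2, 1)
    {"len": 8, "delta_r": 0.1},  # long, hard    -> (3, 3)
]
counts = [len(phase_samples(samples, p)) for p in (1, 2, 3)]
```

Here `counts` is `[1, 3, 4]`: each phase is a superset of the previous one, so earlier easy pairs keep being replayed as harder ones are added.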

Key Experimental Results¶

Main Results (Qwen2.5-1.5B)¶

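| Method | ALFWorld (unseen) | WebShop (reward) | InterCode-SQL | Average |
|---|---|---|---|---|
| ETO | 66.42 | 56.57 | 57.67 | 55.43 |
| IPR | 66.67 | 57.76 | 57.17 | 55.49 |
| HPL (Semantic) | 74.13 | 60.74 | 58.50 | 59.44 |
| GPT-4o zero-shot | 36.43 | n/a | n/a | n/a |

Note: the Average column is not the mean of the three columns shown (ETO's three displayed scores average 60.22, not 55.43); it presumably aggregates additional metrics reported in the paper.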

Segmentation Strategy Comparison¶

| Strategy | Average Score |
|---|---|
| Semantic (GPT-4o) | 59.44 |
| Fixed-N (3) | 58.45 |
| Uncertainty | 56.95 |
| Fixed-K (3) | 56.74 |

Ablation Study¶

| Configuration | Effect |
|---|---|
| Full HPL | Best |
| w/o group-DPO | Performance degradation |
| w/o curriculum learning | Impaired learning on long groups and hard samples |
| w/o step-DPO | Coarser credit assignment |

Key Findings¶

Semantic segmentation outperforms the other strategies by 0.99 to 2.70 points in average score (59.44 vs. 56.74–58.45), indicating that semantic coherence matters for group-level DPO.
HPL surpasses GPT-4o zero-shot on ALFWorld unseen tasks (74.13 vs. 36.43), showing that a trained 1.5B model can substantially exceed a much larger closed-source model on this benchmark.
The three DPO levels are complementary: removing any single level degrades performance.
Curriculum learning is especially important for hard samples and long action groups.

Highlights & Insights¶

Action-group-level DPO sits at a "sweet spot" between the trajectory and step levels: its granularity is designed to match subtask boundaries.
Semantic segmentation beating both fixed and uncertainty-based segmentation suggests that where the group boundaries fall matters, not merely that actions are grouped.
The \(3\times3\) two-dimensional curriculum is practically well-designed, simultaneously accounting for task complexity and sample difficulty.
The theoretical guarantee (variance reduction of \(O(T/\log(1/\varepsilon))\)) provides rigorous mathematical support for the practical approach.

Limitations & Future Work¶

Relies on one-shot exploration data collected from a frozen reference policy rather than online RL.
The number of Monte Carlo rollouts is limited (\(M=5\)), leaving potentially high per-step estimation variance.
Semantic segmentation depends on GPT-4o, introducing additional cost and external dependency.
Adaptive granularity selection, i.e., choosing different granularities at different steps based on uncertainty, remains unexplored.
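To make the rollout-count limitation concrete, consider a small worked estimate under assumptions that are not taken from the paper: rollouts are independent, and each returns a binary outcome reward \(X_i \in \{0,1\}\) with success probability \(p\). The symbols \(X_i\), \(p\), and \(\hat{v}\) are introduced only for this illustration.

\[
\hat{v} = \frac{1}{M}\sum_{i=1}^{M} X_i, \qquad \operatorname{Var}(\hat{v}) = \frac{p(1-p)}{M}
\]

\[
M = 5,\; p = 0.5: \quad \operatorname{Var}(\hat{v}) = \frac{0.25}{5} = 0.05, \qquad \sqrt{\operatorname{Var}(\hat{v})} \approx 0.224
\]

Under these assumptions, a step-value estimate from five rollouts has a standard deviation of roughly 0.22 on a 0 to 1 scale, and halving it would take four times as many rollouts (\(M=20\)).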

Related Work & Insights¶

vs. ETO: ETO uses only trajectory-level signals and cannot precisely locate erroneous steps.
vs. GRPO/GiGPO: GRPO employs group-relative advantage estimation, where a "group" is a set of sampled rollouts, while HPL adopts group-level DPO, where a "group" is a segment of consecutive actions; the two approaches are complementary.
vs. RLHF: HPL avoids reward model training by learning directly from preference pairs.
This work can inspire the application of multi-granularity feedback in agent training.

Rating¶

Novelty: ⭐⭐⭐⭐ The combination of three-level DPO and two-dimensional curriculum is well-motivated and effective.
Experimental Thoroughness: ⭐⭐⭐⭐ Three benchmarks with comparisons across multiple segmentation strategies.
Writing Quality: ⭐⭐⭐⭐ Theory and experiments are well integrated.
Value: ⭐⭐⭐⭐ Provides a practical framework for long-horizon agent alignment.