# Learning from Demonstrations via Capability-Aware Goal Sampling
**Conference:** NeurIPS 2025 · **arXiv:** 2601.08731 · **Code:** GitHub · **Area:** Reinforcement Learning · **Keywords:** Imitation Learning, Curriculum Learning, Goal-Conditioned Reinforcement Learning, Capability-Awareness, World Model
## TL;DR
This paper proposes Cago, a method that dynamically tracks how far along expert demonstration trajectories the agent can reliably reach and adaptively samples intermediate goals near this capability frontier, constructing an implicit curriculum that guides learning in long-horizon, sparse-reward tasks.
## Background & Motivation
Background: Imitation learning trains agents from expert demonstrations via methods such as behavioral cloning (BC), GAIL, and inverse reinforcement learning, yet significant challenges remain on long-horizon, complex tasks.
Limitations of Prior Work:

- BC suffers from compounding errors.
- Distribution-matching methods (e.g., GAIL) perform "flat matching" in early training, failing to distinguish between mastered and unmastered segments.
- Reverse curriculum methods require resetting the agent to arbitrary states along demonstrations, which is unrealistic in real-world settings where quantities such as joint velocities are difficult to reproduce precisely.
Key Challenge: Existing methods do not account for the dynamic evolution of agent capability—they lack awareness of which portions of a task have been mastered and which remain challenging.
Goal: Construct an adaptive learning curriculum aligned with the agent's capability without requiring resets to arbitrary intermediate states.
Key Insight: Demonstrations are treated as structured roadmaps rather than direct imitation targets; the agent's current capability ceiling is continuously monitored to select intermediate goals.
Core Idea: Observation visit frequency is used to track the capability frontier, and goals that lie just beyond the agent's current capability are sampled to guide Go-Explore-style exploration.
## Method

### Overall Architecture
The framework operates as a three-step closed loop: (1) Observation Visit Tracking—recording the frequency with which the agent visits each position along demonstration trajectories; (2) Capability-Aware Goal Sampling—sampling intermediate goals near the capability frontier; (3) Go-Explore-style Training—a goal-conditioned policy first navigates to the sampled goal, a BC Explorer then continues exploration from that point, and the collected data trains both a World Model and the policy.
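A bird's-eye sketch of this loop in Python follows. All interfaces here are illustrative stand-ins rather than the authors' code: `env`, `pi_goal`, `pi_explorer`, and `train_step` are assumed callables, the `goal_reached` signal is hypothetical, and `sample_goal`/`update_visits` are fleshed out in the sketches under Key Designs below.

```python
import numpy as np

def cago_loop(env, demos, pi_goal, pi_explorer, train_step,
              sample_goal, update_visits, num_episodes=1000, max_steps=200):
    """One possible shape of Cago's closed loop (illustrative, not the
    authors' code). `demos` is a list of per-trajectory state arrays."""
    visit_counts = [np.zeros(len(t), dtype=int) for t in demos]
    buffer = []
    for _ in range(num_episodes):
        goal = sample_goal(demos, visit_counts)          # (2) frontier goal
        state = env.reset()                              # initial state only
        go_phase = True
        for _ in range(max_steps):
            # Go phase: head for the goal; afterwards the BC Explorer takes over.
            action = pi_goal(state, goal) if go_phase else pi_explorer(state)
            state, _, done, info = env.step(action)
            update_visits(state, demos, visit_counts)    # (1) track visits
            buffer.append((state, action))
            # Hypothetical signal: switch to the Explore phase once the goal is hit.
            go_phase = go_phase and not info.get("goal_reached", False)
            if done:
                break
        train_step(buffer)                               # (3) Dreamer-style update
```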
### Key Designs

**Observation Visit Tracking**

- Function: Maintains a dictionary \(\text{Dict}_{visit}\) recording the agent's visit frequency at each step of every demonstration trajectory.
- Design Motivation: Visit frequency directly reflects the agent's ability to reach the corresponding states.
- Mechanism: Updated at each environment step as \(\text{Dict}_{visit}[\tau^{(i)}][j] += 1\) whenever \(\text{sim}(s_t, s_j^{(i)}) \leq \epsilon\); the similarity supports L2 distance (state space) and MSE (visual environments). See the sketch below.
- Novelty: Requires resets only to the initial demonstration state, eliminating the need to reset to arbitrary intermediate states.
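A minimal sketch of this update for low-dimensional states with the L2 similarity (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def update_visits(state, demos, visit_counts, eps=0.05):
    """Increment the visit counter of every demo step the agent is near.

    demos:        list of arrays; demo i has shape (L_i, state_dim).
    visit_counts: list of int arrays, one counter per demo step.
    A demo step j counts as visited when ||s_t - s_j||_2 <= eps; for
    visual observations the paper uses MSE instead of L2 distance.
    """
    for traj, counts in zip(demos, visit_counts):
        dists = np.linalg.norm(traj - state, axis=1)  # distance to each demo step
        counts[dists <= eps] += 1
```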
**Capability-Aware Goal Sampling**

- Function: Samples goals of appropriate difficulty near the capability frontier.
- Design Motivation: Goals that are too easy provide no learning signal; goals that are too difficult cause divergence.
- Mechanism (see the sketch below):
  - Capability ceiling: \(j^* = \max\{j \mid \text{Dict}_{visit}[\tau^{(i)}][j] \geq \lambda_{visit}\}\)
  - Sampling window: \(\mathcal{G}_{cap}(\pi^G, \tau^{(i)}) = \{s_k \in \tau^{(i)} \mid |k - j^*| \leq \delta \cdot L_i\}\)
  - \(\lambda_{visit}\): visit-frequency threshold (e.g., 100); \(\delta\): window size (e.g., 10% of the trajectory length).
- Novelty: Unlike JSRL's uniform curriculum, this approach genuinely reflects the agent's current capability.
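A sketch of the frontier computation and window sampling for a single demonstration, reusing the hypothetical counters from the tracking sketch (drawing uniformly inside the window is an assumption):

```python
import numpy as np

def sample_capability_goal(traj, counts, lambda_visit=100, delta=0.1, rng=None):
    """Sample an intermediate goal near the capability frontier of one demo.

    The frontier j* is the furthest demo step visited at least lambda_visit
    times; a goal index is drawn from a window of half-width delta * L
    around it (uniform sampling within the window is an assumption).
    """
    rng = rng or np.random.default_rng()
    L = len(traj)
    reached = np.flatnonzero(counts >= lambda_visit)
    j_star = int(reached.max()) if reached.size else 0  # 0 before any mastery
    half = max(1, int(delta * L))
    lo, hi = max(0, j_star - half), min(L - 1, j_star + half)
    k = int(rng.integers(lo, hi + 1))
    return traj[k]
```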
**Go-Explore-style Data Collection**

- Function: Each episode is divided into a Go phase and an Explore phase.
- Design Motivation: The two-phase structure ensures the collected data is both close to the demonstration distribution and exploratory.
- Mechanism: In the Go phase, the goal-conditioned policy \(\pi^G(\cdot \mid s, g)\) attempts to reach the sampled goal; in the Explore phase, the BC Explorer \(\pi^E\) (a behavioral cloning policy) continues exploration from the reached state. See the sketch below.
- Novelty: The BC Explorer provides higher-quality exploration than random exploration.
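A sketch of one two-phase episode; `env`, `pi_goal`, and `pi_explorer` follow a gym-like interface and are stand-ins rather than the authors' API, and the phase-length budgets are assumed:

```python
def collect_episode(env, goal, pi_goal, pi_explorer, go_steps=150, explore_steps=50):
    """One Cago-style episode: Go phase toward the sampled goal, then a
    BC-Explorer phase continuing from wherever the Go phase ended."""
    trajectory = []
    state = env.reset()  # only the initial demo state is ever reset to

    # Go phase: the goal-conditioned policy tries to reach the sampled goal.
    for _ in range(go_steps):
        action = pi_goal(state, goal)
        state, _, done, _ = env.step(action)
        trajectory.append((state, action))
        if done:
            return trajectory

    # Explore phase: the BC Explorer continues, keeping the new data close
    # to the demonstration distribution while still pushing past the frontier.
    for _ in range(explore_steps):
        action = pi_explorer(state)
        state, _, done, _ = env.step(action)
        trajectory.append((state, action))
        if done:
            break
    return trajectory
```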
**World Model and Policy Training**

- Function: Trains the goal-conditioned policy on imagined trajectories from the World Model.
- Design Motivation: Data collected via Go-Explore stays close to the demonstration distribution, yielding a more accurate World Model in those regions.
- Mechanism: Built on the Dreamer framework; uses a temporal distance function \(D_t(s,g)\) as the reward, \(r^G(s,g) = -D_t(s,g)\). See the sketch below.
- Novelty: Theorem 1 proves that the BC Explorer effectively reduces the upper bound on model prediction error.
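A sketch of how the temporal-distance reward could label an imagined rollout; the Dreamer machinery itself is omitted, and `temporal_distance` stands in for the learned \(D_t\) estimator:

```python
def label_imagined_rollout(temporal_distance, imagined_states, goal):
    """Reward each imagined state with r^G(s, g) = -D_t(s, g): the fewer
    (predicted) steps remaining to the goal, the higher the reward."""
    return [-temporal_distance(s, goal) for s in imagined_states]
```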
**Goal Predictor**

- Function: Infers the final goal from the current observation at test time.
- Design Motivation: Demonstration trajectories are unavailable during testing.
- Mechanism: \(\mathcal{P}_\phi: s \mapsto \hat{g}\), trained by minimizing \(\|\mathcal{P}_\phi(s_t^{(i)}) - s_L^{(i)}\|_2^2\); the final policy is \(\pi(s) = \pi^G(s, \mathcal{P}_\phi(s))\). See the sketch below.
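A minimal PyTorch sketch of the predictor and its test-time use; the MLP architecture and sizes are assumptions, while only the MSE objective and \(\pi(s) = \pi^G(s, \mathcal{P}_\phi(s))\) come from the paper:

```python
import torch
import torch.nn as nn

class GoalPredictor(nn.Module):
    """Maps the current observation to a predicted final goal, so the
    policy can run without demonstrations at test time. A minimal MLP
    sketch; the paper's architecture details may differ."""

    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s):
        return self.net(s)

def predictor_loss(predictor, states, final_goals):
    """MSE between predicted and true final demo states:
    || P_phi(s_t) - s_L ||^2, averaged over the batch."""
    return ((predictor(states) - final_goals) ** 2).mean()

def deploy(pi_goal, predictor, state):
    """Test-time policy: pi(s) = pi_goal(s, P_phi(s))."""
    with torch.no_grad():
        goal = predictor(state)
    return pi_goal(state, goal)
```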
### Loss & Training
- World Model: supervised loss under the Dreamer framework.
- Policy: Actor-Critic with temporal distance reward.
- Goal Predictor: MSE regression loss.
- BC Explorer: behavioral cloning loss.
- Each task uses only 10–20 demonstrations.
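For completeness, a sketch of the BC Explorer objective, assuming continuous actions and a deterministic policy head with an MSE loss (the paper may use a likelihood-based BC loss instead):

```python
def bc_explorer_loss(pi_explorer, demo_states, demo_actions):
    """Behavioral cloning loss: match expert actions on demo states.
    MSE assumes continuous actions and a deterministic policy head."""
    return ((pi_explorer(demo_states) - demo_actions) ** 2).mean()
```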
## Key Experimental Results

### Main Results
MetaWorld Very Hard Tasks (Success Rate %, averaged over 8 seeds):
| Method | Disassemble | PickPlaceWall | ShelfPlace | StickPull | StickPush |
|---|---|---|---|---|---|
| Dreamer | ~10% | ~5% | ~10% | ~5% | ~15% |
| JSRL | ~25% | ~20% | ~30% | ~20% | ~30% |
| MoDem | ~40% | ~35% | ~40% | ~30% | ~45% |
| Cal-QL | ~15% | ~10% | ~15% | ~10% | ~20% |
| Cago | ~70% | ~60% | ~65% | ~55% | ~70% |
Adroit Dexterous Manipulation Tasks (Success Rate after 1M steps):
| Method | Door | Hammer | Pen |
|---|---|---|---|
| MoDem | ~60% | ~70% | ~55% |
| Cago | ~80% | ~85% | ~75% |
ManiSkill Hard Tasks: Cago is the only method capable of achieving high success rates under limited demonstrations.
### Ablation Study
Component Importance (Disassemble / StickPush / Pen, 5 seeds):
| Variant | Description | Performance |
|---|---|---|
| Cago (full) | Capability-aware sampling + BC Explorer | Best |
| Cago-FinalGoal | BC Explorer only, always targets final goal | Significant drop |
| Cago-StepBased | Goal sampled proportionally to training steps | Drop |
| Cago-NoExplorer | Capability-aware sampling only, no BC Explorer | Notable drop |
| Cago-RandomExplorer | Random exploration replaces BC Explorer | Drop |
### Key Findings
- Capability-aware goal sampling is the core contribution; removing it causes significant performance degradation.
- The normalized goal position progresses naturally from 0 to 1 over training, confirming the effectiveness of the adaptive curriculum.
- The BC Explorer is critical for data quality; random exploration performs considerably worse.
- The method functions effectively with as few as 10 demonstrations.
- The visual-input variant (Cago-Visual) achieves comparable performance, demonstrating strong generalization.
## Highlights & Insights
- Core insight of "capability-awareness": Unlike existing methods that assume fixed curricula or global distribution matching, Cago genuinely tracks the agent's dynamic learning state.
- Reset only to initial states: This is far more practical than reverse curriculum methods, as it avoids having to precisely reproduce state variables such as joint velocities.
- Dual validation via theory and experiment: Theorem 1 provides theoretical error-bound guarantees, and experiments span three major benchmarks.
- Goal-conditioned extension of the Go-Explore paradigm: The classic exploration strategy is elegantly combined with demonstration-guided learning.
## Limitations & Future Work
- The method relies on resets to the initial demonstration state, which is far less restrictive than arbitrary-state resets but remains a constraint.
- The similarity metric \(\text{sim}(\cdot,\cdot)\) and threshold \(\epsilon\) may require task-specific tuning.
- The generalization of the Goal Predictor to out-of-distribution scenarios warrants further investigation.
- Future work could explore integrating LLMs or VLMs as goal predictors for more abstract tasks.
## Related Work & Insights
- Comparison with JSRL: JSRL uses a predefined curriculum rather than capability-aware adaptation.
- Comparison with MoDem: MoDem achieves rapid convergence via demonstration oversampling but is limited in final performance.
- The Go-Explore paradigm is reinterpreted in Cago through goal conditioning and demonstration guidance.
- The Dreamer world model provides the infrastructure for imagination-based training under capability-aware sampling.
## Rating
- Novelty: ⭐⭐⭐⭐ Capability-aware goal sampling is intuitive and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three major benchmarks, 11 tasks, comprehensive ablations, and visual extension.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation, fluent method description, and complete theoretical analysis.
- Value: ⭐⭐⭐⭐ Substantial improvements on long-horizon sparse-reward tasks with strong practical utility.