Open-World Skill Discovery from Unsegmented Demonstration Videos

Conference: ICCV 2025 arXiv: 2503.10684 Code: craftjarvis.github.io/SkillDiscovery Area: Image Segmentation Keywords: Skill Discovery, Temporal Video Segmentation, Behavior Cloning, Open-World, Minecraft

TL;DR

Inspired by Event Segmentation Theory (EST) from cognitive science, this paper proposes the Skill Boundary Detection (SBD) algorithm: spikes in the prediction error of a pretrained unconditional action-prediction model mark skill boundaries in unsegmented demonstration videos, significantly improving the performance of conditional policies and hierarchical agents in Minecraft.

Background & Motivation

One of the key challenges in building open-world agents is learning atomic skills from long videos. Hierarchical agents typically adopt a "planner + controller" architecture: the planner decomposes high-level instructions into atomic skills, and the controller executes individual skills. Training such architectures requires segmenting long trajectories into individual skill clips, yet real-world demonstration videos are typically long and unsegmented.

Limitations of existing segmentation methods:

Random segmentation (fixed-length): does not guarantee that each segment contains a complete, independent skill, and the preset length may not match actual skill duration.

Reward-driven: fails to capture skills without associated rewards and incorrectly segments when rewards are obtained repeatedly.

Top-down (manually predefined skill sets): expensive and yields limited skill diversity.

Bottom-up (clustering/BPE): performs poorly in visually partially observable environments when relying solely on action sequences.

All methods depend on hand-crafted rules, motivating the need for a learning-based, adaptive approach.

Core insight (from EST in cognitive science): humans naturally segment continuous experience into discrete events when prediction errors in perceptual expectations rise. By analogy, in agents, a sudden increase in the prediction error of an unconditional policy signals a skill transition.

Method

Overall Architecture

A four-stage pipeline:

Stage I: Pretrain a Transformer-XL unconditional policy \(\pi_{unconditional}\) on unsegmented datasets via behavior cloning (action labels generated by an inverse dynamics model).

Stage II: Apply the SBD algorithm to segment long videos into atomic skill clips.

Stage III: Train a conditional policy (video-conditioned GROOT / text-conditioned STEVE-1) on the segmented dataset.

Stage IV: Combine the conditional policy with a vision-language model to construct a hierarchical agent.

Key Designs

  1. Skill Boundary Detection (SBD) Algorithm: At each timestep \(t\), the unconditional model predicts the action and computes the loss against the ground truth. When the loss exceeds the historical mean by a threshold GAP, the timestep is marked as a skill boundary. A sliding window simulates the model's memory, which is cleared at each detected boundary.

    • Core criterion: \(\text{loss} - \text{mean}(\text{loss\_history}) > \text{GAP}\)
    • The hyperparameter GAP is set to 18, balancing average trajectory length and semantic coherence.
  2. Theoretical Guarantee — Boundary Theorem on Prediction Probability: The theoretical foundation for skill-transition detection is established under three assumptions:

    • Skill consistency: \(P(\pi_{t+1} \neq \pi_t | o_{1:t+1}) < 1/K\) (skills do not switch frequently)
    • Skill confidence: \(P(\pi_t(a_t|o_{1:t}) > c) > 1 - \delta\) (the agent assigns high confidence to its actions)
    • Action divergence at skill transitions: upon switching skills, the agent executes actions with very low probability under the previous skill

    Theorem 3.4 proves that the relative prediction probability has a high lower bound when no transition occurs and a low upper bound when a transition occurs. When \(c > m\) and \((K-4)c^2 > 2\), the two bounds do not overlap, guaranteeing distinguishability.

  3. External Information Augmentation: An optional component that uses in-game logs (e.g., crafting events) to mark boundaries that are difficult to detect via loss alone. It serves as a supplement only when detection fails, and the core method remains effective on purely visual data.
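The boundary criterion in Key Design 1 can be sketched as a simple detection loop. This is a minimal illustrative sketch, not the paper's implementation: the function name, the `min_history` guard, and the toy loss sequence are assumptions; `losses` stands in for the per-timestep negative log-likelihood \(-\log \pi_{unconditional}(a_t | o_{1:t})\) produced by the pretrained policy.

```python
import numpy as np

def detect_skill_boundaries(losses, gap=18.0, min_history=1):
    """Sketch of Skill Boundary Detection (SBD).

    A timestep t is marked as a skill boundary when its prediction
    loss exceeds the running mean of losses accumulated since the
    last boundary by more than `gap` (the paper sets GAP = 18).
    The history, which simulates the model's memory, is cleared at
    each detected boundary.
    """
    boundaries = []
    history = []  # losses seen since the last boundary
    for t, loss in enumerate(losses):
        if len(history) >= min_history and loss - np.mean(history) > gap:
            boundaries.append(t)
            history = []  # clear simulated memory at the boundary
        history.append(loss)
    return boundaries

# Toy example: a loss spike at t=5 far above the running mean.
losses = [2.0, 2.1, 1.9, 2.0, 2.2, 25.0, 2.1, 2.0]
print(detect_skill_boundaries(losses, gap=18.0))  # -> [5]
```

In practice the paper computes these losses with a Transformer-XL policy over a sliding window; the thresholding logic above is the part the sketch captures.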

Loss & Training

Unconditional policy training: standard behavior cloning \(\min_\theta \sum_{t \in [1, T]} -\log \pi_{unconditional}(a_t | o_{1:t})\)

Prediction loss used by SBD: negative log-likelihood \(-\log P(a_t | o_{1:t})\)

Conditional policy training:

  • GROOT: a C-VAE encodes 128-frame video instructions → behavior cloning
  • STEVE-1: VPT model adapted to the MineCLIP latent space → text/video instruction following
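The behavior-cloning objective and the per-timestep NLL that SBD thresholds are the same quantity, summed versus kept per step. A minimal NumPy sketch, assuming a discrete action head (shapes and random inputs are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
T, num_actions = 8, 16
logits = rng.normal(size=(T, num_actions))     # unnormalized pi(a | o_{1:t})
actions = rng.integers(0, num_actions, size=T) # IDM-generated action labels

# Log-softmax over the action vocabulary.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# Per-timestep NLL: -log pi(a_t | o_{1:t}) -- the signal SBD monitors.
per_step_nll = -log_probs[np.arange(T), actions]

# Behavior-cloning loss: sum of per-timestep NLLs over the trajectory.
bc_loss = per_step_nll.sum()
```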

Key Experimental Results

Main Results: Atomic Skill Benchmark

| Policy | Instruction Type | Original Avg. | SBD Avg. | Relative Gain |
|---|---|---|---|---|
| GROOT | Video-conditioned | 9.5 | 25.4 | +63.7% |
| STEVE-1 | Image + Text | 46.9 | 71.9 | +52.1% |

Representative skill improvements:

| Skill | GROOT Original | GROOT SBD | Gain |
|---|---|---|---|
| hunt sheep | 26% | 54% | +107.7% |
| use bow | 30% | 80% | +166.7% |
| collect wood (find+collect) | 14.5 | 19.7 | +36.1% |

Long-Horizon Tasks: Hierarchical Agents

| Method | Wood | Food | Stone | Iron | Avg. Relative Gain |
|---|---|---|---|---|---|
| OmniJARVIS (Original) | 95% | 44% | 82% | 32% | - |
| OmniJARVIS (SBD) | 96% | 55% | 90% | 35% | +11.3% |

| Method | Diamond | Armor | Food | Avg. Relative Gain |
|---|---|---|---|---|
| JARVIS-1 (Original) | 8% | 12% | 39% | - |
| JARVIS-1 (SBD) | 10% | 19% | 62% | +20.8% |

Ablation Study

| Configuration | Avg. Success Rate | Notes |
|---|---|---|
| Random segmentation (fixed 128 frames) | Baseline | GROOT default |
| SBD (loss only) | Large improvement | Pure prediction-error detection |
| SBD (loss + external info) | Best | Combined with in-game event logs |

Key Findings

  • SBD-produced segmentations yield a length distribution closer to actual skill durations, whereas random segmentation clusters around a fixed length.
  • Loss spikes are highly correlated with skill boundaries, validating the applicability of EST theory in agent settings.
  • SBD remains effective on datasets without external information, confirming that the core mechanism is loss-based detection, with external information as an optional enhancement.
  • SBD can leverage YouTube videos to train instruction-following agents, reducing data annotation costs.

Highlights & Insights

  • Clear motivation inspired by cognitive science: EST theory → prediction-error-based skill boundary detection, with strong alignment between theory and intuition.
  • Theoretical guarantee: Theorem 3.4 establishes distinguishable bounds on prediction probability for skill-switching vs. non-switching cases, making this more than a purely empirical method.
  • Strong generality: SBD requires only a pretrained unconditional policy, with no need for additional annotation, reward signals, or predefined skill sets.
  • Plug-and-play: SBD can directly replace the segmentation step in existing methods (GROOT, STEVE-1, OmniJARVIS), yielding consistent improvements across all.

Limitations & Future Work

  • The GAP hyperparameter requires manual tuning and may need different values for different environments or datasets.
  • Performance is limited for skill transitions with subtle action changes (e.g., Minecraft crafting), where external information is required.
  • Validation is limited to the Minecraft environment; extension to robotic manipulation, autonomous driving, and other domains is needed.
  • Adaptive determination of the number of skills is not addressed (the current approach relies on post-processing to prune segment lengths).

Related Work & Context
  • EST theory (Zacks et al.) links human event segmentation to prediction errors, providing a cognitive science foundation for the computational approach.
  • Complements skill discovery in the Option Framework (Sutton 1999) within hierarchical reinforcement learning.
  • The long-sequence modeling capability of Transformer-XL underlies the effectiveness of SBD.

Rating

  • Novelty: ⭐⭐⭐⭐ — Integration of cognitive science theory, learning-based method, and theoretical proof.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple policies, multiple agents, and both short- and long-horizon tasks.
  • Practicality: ⭐⭐⭐⭐ — Plug-and-play; compatible with YouTube data.
  • Overall: ⭐⭐⭐⭐