Open-World Skill Discovery from Unsegmented Demonstration Videos¶
Conference: ICCV 2025 arXiv: 2503.10684 Code: craftjarvis.github.io/SkillDiscovery Area: Temporal Video Segmentation Keywords: Skill Discovery, Temporal Video Segmentation, Behavior Cloning, Open-World, Minecraft
TL;DR¶
Inspired by Event Segmentation Theory (EST) from cognitive science, this paper proposes the Skill Boundary Detection (SBD) algorithm, which uses prediction-error spikes from a pretrained unconditional action-prediction model to automatically identify skill boundaries in unsegmented demonstration videos, significantly improving the performance of conditional policies and hierarchical agents in Minecraft.
Background & Motivation¶
One of the key challenges in building open-world agents is learning atomic skills from long videos. Hierarchical agents typically adopt a "planner + controller" architecture: the planner decomposes high-level instructions into atomic skills, and the controller executes individual skills. Training such architectures requires segmenting long trajectories into individual skill clips, yet real-world demonstration videos are typically long and unsegmented.
Limitations of existing segmentation methods:
Random segmentation (fixed-length): does not guarantee that each segment contains a complete, independent skill, and the preset length may not match actual skill duration.
Reward-driven: fails to capture skills without associated rewards and incorrectly segments when rewards are obtained repeatedly.
Top-down (manually predefined skill sets): expensive and yields limited skill diversity.
Bottom-up (clustering/BPE): performs poorly in visually partially observable environments when relying solely on action sequences.
All methods depend on hand-crafted rules, motivating the need for a learning-based, adaptive approach.
Core insight (from EST in cognitive science): humans naturally segment continuous experience into discrete events when prediction errors in perceptual expectations rise. By analogy, a sudden spike in the prediction error of an agent's unconditional policy signals a skill transition.
Method¶
Overall Architecture¶
A four-stage pipeline:
Stage I: Pretrain a Transformer-XL unconditional policy \(\pi_{unconditional}\) on unsegmented datasets via behavior cloning (action labels generated by an inverse dynamics model).
Stage II: Apply the SBD algorithm to segment long videos into atomic skill clips.
Stage III: Train a conditional policy (video-conditioned GROOT / text-conditioned STEVE-1) on the segmented dataset.
Stage IV: Combine the conditional policy with a vision-language model to construct a hierarchical agent.
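For intuition, here is a minimal sketch of how the four stages compose. Every name in it (idm, pretrain_bc, sbd_segment, train_conditional, planner_vlm) is a hypothetical stand-in for the paper's components, not its actual API:

```python
# Hypothetical glue code for the four-stage pipeline; all component
# functions are assumed interfaces, injected as arguments.

def build_hierarchical_agent(videos, idm, pretrain_bc, sbd_segment,
                             train_conditional, planner_vlm):
    actions = idm(videos)                     # Stage I: inverse dynamics model labels actions
    pi_uncond = pretrain_bc(videos, actions)  # Stage I: Transformer-XL behavior cloning
    clips = sbd_segment(pi_uncond, videos, gap=18)  # Stage II: SBD segmentation
    pi_cond = train_conditional(clips)        # Stage III: GROOT / STEVE-1 training
    return planner_vlm, pi_cond               # Stage IV: planner + controller agent
```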
Key Designs¶
- Skill Boundary Detection (SBD) Algorithm: At each timestep \(t\), the unconditional model predicts the action and computes the loss against the ground-truth action. When the loss exceeds the historical mean by a threshold GAP, the timestep is marked as a skill boundary. A sliding window simulates the model's memory and is cleared at each detected boundary (see the sketch after this list).
- Core criterion: \(\text{loss} - \text{mean}(\text{loss\_history}) > \text{GAP}\)
- The hyperparameter GAP is set to 18, balancing average trajectory length and semantic coherence.
- Theoretical Guarantee (Boundary Theorem on Prediction Probability): The theoretical foundation for skill-transition detection rests on three assumptions:
- Skill consistency: \(P(\pi_{t+1} \neq \pi_t | o_{1:t+1}) < 1/K\) (skills do not switch frequently)
- Skill confidence: \(P(\pi_t(a_t|o_{1:t}) > c) > 1 - \delta\) (the agent assigns high confidence to its actions)
- Action divergence at skill transitions: upon switching skills, the agent executes actions whose probability under the previous skill is below a small bound \(m\)
Theorem 3.4 proves that the relative prediction probability has a high lower bound when no transition occurs and a low upper bound when a transition occurs. When \(c > m\) and \((K-4)c^2 > 2\), the two bounds do not overlap, guaranteeing distinguishability.
- External Information Augmentation: An optional component that uses in-game logs (e.g., crafting events) to mark boundaries that are hard to detect from loss alone. It supplements detection only where the loss signal fails; the core method remains effective on purely visual data.
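A minimal sketch of the SBD loop described above, assuming a `policy.nll(obs_window, action)` interface that returns \(-\log \pi_{unconditional}(a_t \mid o_{1:t})\) for the ground-truth action; the paper's exact memory and loss-history handling may differ:

```python
import numpy as np

def skill_boundary_detection(policy, observations, actions, gap=18.0):
    """Mark timestep t as a boundary when its prediction loss exceeds the
    mean of recent losses by more than `gap` (GAP = 18 in the paper)."""
    boundaries = []
    loss_history = []   # losses since the last boundary (paper: sliding-window memory)
    window_start = 0    # observations since the last detected boundary
    for t in range(len(actions)):
        # NLL of the ground-truth action under the unconditional policy
        # (assumed interface), conditioned only on the current memory window.
        loss = policy.nll(observations[window_start:t + 1], actions[t])
        if loss_history and loss - np.mean(loss_history) > gap:
            boundaries.append(t)      # loss spike => skill boundary
            loss_history.clear()      # reset loss history at the boundary
            window_start = t          # clear the model's memory window
        else:
            loss_history.append(loss)
    return boundaries
```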
Loss & Training¶
Unconditional policy training: standard behavior cloning, \(\min_\theta \sum_{t=1}^{T} -\log \pi_{unconditional}(a_t \mid o_{1:t})\)
Prediction loss used by SBD: negative log-likelihood \(-\log P(a_t | o_{1:t})\)
Conditional policy training:
- GROOT: a C-VAE encodes 128-frame video instructions, followed by behavior cloning
- STEVE-1: a VPT model adapted to the MineCLIP latent space for text/video instruction following
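Both the behavior-cloning objective and the SBD prediction loss reduce to the same per-step negative log-likelihood. A minimal PyTorch sketch, assuming the policy outputs per-step logits over a single discrete action space (the real Minecraft agents use a factored action space with multiple heads):

```python
import torch.nn.functional as F

def nll_per_step(policy, observations, actions):
    """Per-step NLL -log pi(a_t | o_{1:t}), returned as a length-T tensor.
    Assumes `policy(observations)` returns logits of shape [T, num_actions]."""
    logits = policy(observations)
    return F.cross_entropy(logits, actions, reduction="none")

def bc_loss(policy, observations, actions):
    # Behavior cloning minimizes the NLL summed over the trajectory.
    return nll_per_step(policy, observations, actions).sum()
```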
Key Experimental Results¶
Main Results: Atomic Skill Benchmark¶
| Policy | Instruction Type | Original avg | SBD avg | Relative Gain |
|---|---|---|---|---|
| GROOT | Video-conditioned | 9.5 | 25.4 | +63.7% |
| STEVE-1 | Image + Text | 46.9 | 71.9 | +52.1% |
Representative skill improvements:
| Skill | GROOT Original | GROOT SBD | Gain |
|---|---|---|---|
| hunt sheep | 26% | 54% | +107.7% |
| use bow | 30% | 80% | +166.7% |
| collect wood (find+collect) | 14.5 | 19.7 | +36.1% |
Long-Horizon Tasks: Hierarchical Agents¶
| Method | Wood | Food | Stone | Iron | Avg. Relative Gain |
|---|---|---|---|---|---|
| OmniJARVIS (Original) | 95% | 44% | 82% | 32% | - |
| OmniJARVIS (SBD) | 96% | 55% | 90% | 35% | +11.3% |
| Method | Diamond | Armor | Food | Avg. Relative Gain |
|---|---|---|---|---|
| JARVIS-1 (Original) | 8% | 12% | 39% | - |
| JARVIS-1 (SBD) | 10% | 19% | 62% | +20.8% |
Ablation Study¶
| Configuration | Avg. Success Rate | Notes |
|---|---|---|
| Random segmentation (fixed 128 frames) | Baseline | GROOT default |
| SBD (loss only) | Large improvement | Pure prediction-error detection |
| SBD (loss + external info) | Best | Combined with in-game event logs |
Key Findings¶
- SBD-produced segmentations yield a length distribution closer to actual skill durations, whereas random segmentation clusters around a fixed length.
- Loss spikes are highly correlated with skill boundaries, validating the applicability of EST theory in agent settings.
- SBD remains effective on datasets without external information, confirming that the core mechanism is loss-based detection, with external information as an optional enhancement.
- SBD can leverage YouTube videos to train instruction-following agents, reducing data annotation costs.
Highlights & Insights¶
- Clear motivation inspired by cognitive science: EST theory → prediction-error-based skill boundary detection, with strong alignment between theory and intuition.
- Theoretical guarantee: Theorem 3.4 establishes distinguishable bounds on prediction probability for skill-switching vs. non-switching cases, making this more than a purely empirical method.
- Strong generality: SBD requires only a pretrained unconditional policy, with no need for additional annotation, reward signals, or predefined skill sets.
- Plug-and-play: SBD can directly replace the segmentation step in existing methods (GROOT, STEVE-1, OmniJARVIS), yielding consistent improvements across all.
Limitations & Future Work¶
- The GAP hyperparameter requires manual tuning and may need different values for different environments or datasets.
- Performance is limited for skill transitions with subtle action changes (e.g., Minecraft crafting), where external information is required.
- Validation is limited to the Minecraft environment; extension to robotic manipulation, autonomous driving, and other domains is needed.
- Adaptive determination of the number of skills is not addressed (the current approach relies on post-processing to prune segment lengths).
Related Work & Insights¶
- EST theory (Zacks et al.) links human event segmentation to prediction errors, providing a cognitive science foundation for the computational approach.
- Complements skill discovery under the options framework (Sutton et al., 1999) in hierarchical reinforcement learning.
- The long-sequence modeling capability of Transformer-XL underlies the effectiveness of SBD.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Integration of cognitive science theory, learning-based method, and theoretical proof.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple policies, multiple agents, and both short- and long-horizon tasks.
- Practicality: ⭐⭐⭐⭐ — Plug-and-play; compatible with YouTube data.
- Overall: ⭐⭐⭐⭐