Open-World Skill Discovery from Unsegmented Demonstration Videos¶
Conference: ICCV 2025 arXiv: 2503.10684 Code: craftjarvis.github.io/SkillDiscovery Area: Temporal Video Segmentation Keywords: Skill Discovery, Temporal Video Segmentation, Behavior Cloning, Open-World, Minecraft
TL;DR¶
Inspired by Event Segmentation Theory (EST) from cognitive science, this paper proposes the Skill Boundary Detection (SBD) algorithm, which uses prediction-error spikes from a pretrained unconditional action-prediction model to automatically identify skill boundaries in unsegmented demonstration videos, significantly improving the performance of conditional policies and hierarchical agents in Minecraft.
Background & Motivation¶
One of the key challenges in building open-world agents is learning atomic skills from long videos. Hierarchical agents typically adopt a "planner + controller" architecture: the planner decomposes high-level instructions into atomic skills, and the controller executes individual skills. Training such architectures requires segmenting long trajectories into individual skill clips, yet real-world demonstration videos are typically long and unsegmented.
Limitations of existing segmentation methods:
Random segmentation (fixed-length): does not guarantee that each segment contains a complete, independent skill, and the preset length may not match actual skill duration.
Reward-driven: fails to capture skills without associated rewards and incorrectly segments when rewards are obtained repeatedly.
Top-down (manually predefined skill sets): expensive and yields limited skill diversity.
Bottom-up (clustering/BPE): performs poorly in visually partially observable environments when relying solely on action sequences.
All methods depend on hand-crafted rules, motivating the need for a learning-based, adaptive approach.
Core insight (from EST in cognitive science): humans naturally segment continuous experience into discrete events when prediction errors in perceptual expectations rise. By analogy, a sudden spike in the prediction error of an agent's unconditional policy signals a skill transition.
Method¶
Overall Architecture¶
A four-stage pipeline:
Stage I: Pretrain a Transformer-XL unconditional policy \(\pi_{unconditional}\) on unsegmented datasets via behavior cloning (action labels generated by an inverse dynamics model).
Stage II: Apply the SBD algorithm to segment long videos into atomic skill clips.
Stage III: Train a conditional policy (video-conditioned GROOT / text-conditioned STEVE-1) on the segmented dataset.
Stage IV: Combine the conditional policy with a vision-language model to construct a hierarchical agent.
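For intuition, here is a minimal sketch of how the four stages compose. Every name in it (idm, pretrain_bc, sbd_segment, train_conditional, planner_vlm) is a hypothetical stand-in for the paper's components, not its actual API:

```python
# Hypothetical glue code for the four-stage pipeline; all component
# functions are assumed interfaces, injected as arguments.

def build_hierarchical_agent(videos, idm, pretrain_bc, sbd_segment,
                             train_conditional, planner_vlm):
    actions = idm(videos)                     # Stage I: inverse dynamics model labels actions
    pi_uncond = pretrain_bc(videos, actions)  # Stage I: Transformer-XL behavior cloning
    clips = sbd_segment(pi_uncond, videos, gap=18)  # Stage II: SBD segmentation
    pi_cond = train_conditional(clips)        # Stage III: GROOT / STEVE-1 training
    return planner_vlm, pi_cond               # Stage IV: planner + controller agent
```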
Key Designs¶
- Skill Boundary Detection (SBD) Algorithm: At each timestep \(t\), the unconditional model predicts the action and computes the loss against the ground-truth action. When the loss exceeds the historical mean by a threshold GAP, the timestep is marked as a skill boundary. A sliding window simulates the model's memory and is cleared at each detected boundary (see the sketch after this list).
- Core criterion: \(\text{loss} - \text{mean}(\text{loss\_history}) > \text{GAP}\)
- The hyperparameter GAP is set to 18, balancing average trajectory length and semantic coherence.
- Theoretical Guarantee (Boundary Theorem on Prediction Probability): The theoretical foundation for skill-transition detection rests on three assumptions:
- Skill consistency: \(P(\pi_{t+1} \neq \pi_t | o_{1:t+1}) < 1/K\) (skills do not switch frequently)
- Skill confidence: \(P(\pi_t(a_t|o_{1:t}) > c) > 1 - \delta\) (the agent assigns high confidence to its actions)
- Action divergence at skill transitions: upon switching skills, the agent executes actions whose probability under the previous skill is below a small bound \(m\)
Theorem 3.4 proves that the relative prediction probability has a high lower bound when no transition occurs and a low upper bound when a transition occurs. When \(c > m\) and \((K-4)c^2 > 2\), the two bounds do not overlap, guaranteeing distinguishability.
- External Information Augmentation: An optional component that uses in-game logs (e.g., crafting events) to mark boundaries that are hard to detect from loss alone. It supplements detection only where the loss signal fails; the core method remains effective on purely visual data.
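A minimal sketch of the SBD loop described above, assuming a `policy.nll(obs_window, action)` interface that returns \(-\log \pi_{unconditional}(a_t \mid o_{1:t})\) for the ground-truth action; the paper's exact memory and loss-history handling may differ:

```python
import numpy as np

def skill_boundary_detection(policy, observations, actions, gap=18.0):
    """Mark timestep t as a boundary when its prediction loss exceeds the
    mean of recent losses by more than `gap` (GAP = 18 in the paper)."""
    boundaries = []
    loss_history = []   # losses since the last boundary (paper: sliding-window memory)
    window_start = 0    # observations since the last detected boundary
    for t in range(len(actions)):
        # NLL of the ground-truth action under the unconditional policy
        # (assumed interface), conditioned only on the current memory window.
        loss = policy.nll(observations[window_start:t + 1], actions[t])
        if loss_history and loss - np.mean(loss_history) > gap:
            boundaries.append(t)      # loss spike => skill boundary
            loss_history.clear()      # reset loss history at the boundary
            window_start = t          # clear the model's memory window
        else:
            loss_history.append(loss)
    return boundaries
```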
Loss & Training¶
Unconditional policy training: standard behavior cloning, \(\min_\theta \sum_{t=1}^{T} -\log \pi_{unconditional}(a_t \mid o_{1:t})\)
Prediction loss used by SBD: negative log-likelihood \(-\log P(a_t | o_{1:t})\)
Conditional policy training:
- GROOT: a C-VAE encodes 128-frame video instructions, followed by behavior cloning
- STEVE-1: a VPT model adapted to the MineCLIP latent space for text/video instruction following
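Both the behavior-cloning objective and the SBD prediction loss reduce to the same per-step negative log-likelihood. A minimal PyTorch sketch, assuming the policy outputs per-step logits over a single discrete action space (the real Minecraft agents use a factored action space with multiple heads):

```python
import torch.nn.functional as F

def nll_per_step(policy, observations, actions):
    """Per-step NLL -log pi(a_t | o_{1:t}), returned as a length-T tensor.
    Assumes `policy(observations)` returns logits of shape [T, num_actions]."""
    logits = policy(observations)
    return F.cross_entropy(logits, actions, reduction="none")

def bc_loss(policy, observations, actions):
    # Behavior cloning minimizes the NLL summed over the trajectory.
    return nll_per_step(policy, observations, actions).sum()
```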
Key Experimental Results¶
Main Results: Atomic Skill Benchmark¶
| Policy | Instruction Type | Original avg | SBD avg | Relative Gain |
|---|---|---|---|---|
| GROOT | Video-conditioned | 9.5 | 25.4 | +63.7% |
| STEVE-1 | Image + Text | 46.9 | 71.9 | +52.1% |
Representative skill improvements:
| Skill | GROOT Original | GROOT SBD | Gain |
|---|---|---|---|
| hunt sheep | 26% | 54% | +107.7% |
| use bow | 30% | 80% | +166.7% |
| collect wood (find+collect) | 14.5 | 19.7 | +36.1% |
Long-Horizon Tasks: Hierarchical Agents¶
| Method | Wood | Food | Stone | Iron | Avg. Relative Gain |
|---|---|---|---|---|---|
| OmniJARVIS (Original) | 95% | 44% | 82% | 32% | - |
| OmniJARVIS (SBD) | 96% | 55% | 90% | 35% | +11.3% |
| Method | Diamond | Armor | Food | Avg. Relative Gain |
|---|---|---|---|---|
| JARVIS-1 (Original) | 8% | 12% | 39% | - |
| JARVIS-1 (SBD) | 10% | 19% | 62% | +20.8% |
Ablation Study¶
| Configuration | Avg. Success Rate | Notes |
|---|---|---|
| Random segmentation (fixed 128 frames) | Baseline | GROOT default |
| SBD (loss only) | Large improvement | Pure prediction-error detection |
| SBD (loss + external info) | Best | Combined with in-game event logs |
Key Findings¶
- SBD-produced segmentations yield a length distribution closer to actual skill durations, whereas random segmentation clusters around a fixed length.
- Loss spikes are highly correlated with skill boundaries, validating the applicability of EST theory in agent settings.
- SBD remains effective on datasets without external information, confirming that the core mechanism is loss-based detection, with external information as an optional enhancement.
- SBD can leverage YouTube videos to train instruction-following agents, reducing data annotation costs.
Highlights & Insights¶
- Clear motivation inspired by cognitive science: EST theory → prediction-error-based skill boundary detection, with strong alignment between theory and intuition.
- Theoretical guarantee: Theorem 3.4 establishes distinguishable bounds on prediction probability for skill-switching vs. non-switching cases, making this more than a purely empirical method.
- Strong generality: SBD requires only a pretrained unconditional policy, with no need for additional annotation, reward signals, or predefined skill sets.
- Plug-and-play: SBD can directly replace the segmentation step in existing methods (GROOT, STEVE-1, OmniJARVIS), yielding consistent improvements across all.
Limitations & Future Work¶
- The GAP hyperparameter requires manual tuning and may need different values for different environments or datasets.
- Performance is limited for skill transitions with subtle action changes (e.g., Minecraft crafting), where external information is required.
- Validation is limited to the Minecraft environment; extension to robotic manipulation, autonomous driving, and other domains is needed.
- Adaptive determination of the number of skills is not addressed (the current approach relies on post-processing to prune segment lengths).
Related Work & Insights¶
- EST theory (Zacks et al.) links human event segmentation to prediction errors, providing a cognitive science foundation for the computational approach.
- Complements skill discovery under the options framework (Sutton et al., 1999) in hierarchical reinforcement learning.
- The long-sequence modeling capability of Transformer-XL underlies the effectiveness of SBD.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Integration of cognitive science theory, learning-based method, and theoretical proof.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple policies, multiple agents, and both short- and long-horizon tasks.
- Practicality: ⭐⭐⭐⭐ — Plug-and-play; compatible with YouTube data.
- Overall: ⭐⭐⭐⭐