
Watch and Learn: Learning to Use Computers from Online Videos

Conference: CVPR 2026
arXiv: 2510.04673
Code: N/A
Area: Pretraining
Keywords: Computer-Using Agent, Inverse Dynamics Model, Trajectory Generation, YouTube Tutorials, GUI Automation

TL;DR

Watch & Learn uses an inverse dynamics model (IDM) to automatically convert YouTube tutorial videos into executable UI trajectory data (53K+ trajectories with no manual annotation), improving CUA performance on OSWorld by +11.1% for Qwen 2.5VL-7B and +3.8% for UI-TARS-1.5-7B.

Background & Motivation

Background: Computer-using agents (CUAs) need large volumes of high-quality UI operation trajectories for training, but manual annotation costs roughly $1.45 per task.

Limitations of Prior Work: (a) manual annotation does not scale; (b) heuristic parsing methods are inaccurate (TongUI: 72.3%); (c) YouTube tutorial videos are a rich data source that prior work leaves underutilized.

Key Challenge: The web hosts vast numbers of tutorial videos demonstrating real computer operations, but existing tools cannot accurately extract action sequences from them.

Goal: Build a high-accuracy IDM that automatically extracts UI operation trajectories from videos, and validate that these trajectories effectively train CUAs.

Key Insight: Use a SigLIP-2 visual encoder + Transformer inverse dynamics model to infer user actions from consecutive screenshots.

Core Idea: An inverse dynamics model with 91.7% accuracy automatically annotates UI operations in YouTube tutorial videos, converting them into 53K executable trajectories.

Method

Overall Architecture

Four stages: (1) build a 600K+ state-transition corpus and train the IDM on it; (2) retrieve and filter tutorial videos from YouTube; (3) run the IDM to predict action sequences in the videos; (4) use the generated trajectories for ICL or SFT to enhance the CUA.
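
The paper releases no code (Code: N/A above), so the following Python sketch only illustrates the data flow; `build_trajectories` and its injected helpers are hypothetical names, not the authors' API.

```python
def build_trajectories(idm, videos, extract_frames, filter_static):
    """Stages 2-4: label retrieved tutorial videos with a trained IDM (stage 1).

    `videos` come from task-aware YouTube retrieval; `extract_frames` and
    `filter_static` (the optical-flow filter) are injected as callables here
    because the actual implementations are not public.
    """
    trajectories = []
    for video in videos:
        frames = filter_static(extract_frames(video))   # drop near-static frames
        # Stage 3: the IDM infers the action between each pair of consecutive frames.
        actions = [idm(prev, curr) for prev, curr in zip(frames, frames[1:])]
        trajectories.append(list(zip(frames, actions)))
    return trajectories                                 # stage 4: used for ICL or SFT
```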

Key Designs

  1. Inverse Dynamics Model (IDM):

    • Function: Infer the user action from two consecutive screenshot frames
    • Mechanism: SigLIP-2 visual encoder + 4-layer Transformer + 3 specialized heads (action classification / coordinate prediction / text generation); a sketch follows this list
    • Accuracy: 91.7% (vs. TongUI's 72.3%), which directly determines downstream performance
  2. Task-Aware Video Retrieval:

    • Function: Retrieve tutorial videos related to target tasks from YouTube
    • Mechanism: Covers 69 applications across 7 categories; uses optical flow to filter static frames
  3. Trajectory Generation and Quality Control:

    • Function: Convert videos into structured (state, action) trajectories
    • Scale: 53,125 high-quality trajectories
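
A minimal PyTorch sketch of the IDM as described above (SigLIP-2 encoder, 4-layer Transformer, three heads). The hidden size, action vocabulary, pooling strategy, and the simplified text head are all assumptions; the summary specifies only the high-level design.

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Predicts the user action taken between two consecutive screenshots."""

    def __init__(self, vision_encoder, d_model=768, n_actions=12, vocab_size=32000):
        super().__init__()
        self.encoder = vision_encoder                 # e.g., a SigLIP-2 image encoder
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(d_model, n_actions)  # click / type / scroll / ...
        self.coord_head = nn.Linear(d_model, 2)           # normalized (x, y) click target
        self.text_head = nn.Linear(d_model, vocab_size)   # typed-text tokens (simplified)

    def forward(self, frame_t, frame_t1):
        # Encode both frames into patch tokens and fuse them in one sequence.
        tokens = torch.cat([self.encoder(frame_t), self.encoder(frame_t1)], dim=1)
        fused = self.fusion(tokens)
        pooled = fused.mean(dim=1)                    # simple pooling; the paper may differ
        return (self.action_head(pooled),             # action-class logits
                self.coord_head(pooled).sigmoid(),    # coordinates in [0, 1]
                self.text_head(fused))                # per-token logits for text
```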

Loss & Training

  • The IDM is trained on the 600K-transition corpus of human interactions; a plausible multi-head loss is sketched below
  • Downstream usage: ICL (providing example trajectories) or SFT (fine-tuning the model)
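
The summary doesn't spell out the objective; assuming one loss term per head, training could look like the following sketch (loss weights, batch field names, and the masking scheme are guesses):

```python
import torch.nn.functional as F

def idm_loss(model, batch, w_coord=1.0, w_text=1.0):
    # Field names are illustrative: paired frames plus ground-truth action labels.
    action_logits, coords, text_logits = model(batch["frame_t"], batch["frame_t1"])
    loss = F.cross_entropy(action_logits, batch["action_id"])
    mask = batch["has_coord"]                  # only click-like actions carry coordinates
    if mask.any():
        loss = loss + w_coord * F.l1_loss(coords[mask], batch["xy"][mask])
    mask = batch["has_text"]                   # only typing actions carry text targets
    if mask.any():
        loss = loss + w_text * F.cross_entropy(
            text_logits[mask].flatten(0, 1),   # (steps * tokens, vocab), simplified
            batch["text_ids"][mask].flatten())
    return loss
```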

Key Experimental Results

Main Results: OSWorld-Verified

Model              Method   Baseline   +W&L     Gain
Gemini 2.5 Flash   ICL      19.0%      22.0%    +3.0%
Claude 4 Sonnet    ICL      43.9%      45.5%    +1.6%
Qwen 2.5VL-7B      SFT      1.9%       13.0%    +11.1%
UI-TARS-1.5-7B     SFT      27.3%      31.1%    +3.8%

Ablation Study: ICL Components (Gemini)

Config                  Success Rate
Frames only             19.0%
+ Actions               20.1%
+ Actions + Reasoning   22.0%
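
To make the three configurations concrete, here is one hypothetical way an exemplar trajectory could be serialized into an ICL prompt; the actual prompt format (and how reasoning strings are attached to steps) is not given in the summary.

```python
def format_icl_example(traj, with_actions=False, with_reasoning=False):
    """Serializes one W&L trajectory as an in-context exemplar (format is illustrative)."""
    lines = []
    for i, step in enumerate(traj):                  # each step: frame (+ action, reasoning)
        lines.append(f"Step {i}: <frame_{i}>")       # "Frames only" config
        if with_reasoning:                           # "+ Actions + Reasoning" config
            lines.append(f"Reasoning: {step['reasoning']}")
        if with_actions:                             # "+ Actions" config
            lines.append(f"Action: {step['action']}")
    return "\n".join(lines)
```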

Key Findings

  • IDM accuracy (91.7% vs. 72.3%) directly determines downstream performance
  • SFT gains far exceed ICL gains (Qwen: +11.1% vs. the best ICL gain of +3.0%)
  • Auto-generated trajectories generalize across OS platforms, where competing methods degrade

Highlights & Insights

  • YouTube tutorials as free CUA training data: No manual annotation needed; only an accurate IDM is required
  • IDM accuracy is critical: The 19.4-point accuracy gap over TongUI translates directly into downstream performance differences
  • ICL + reasoning chain is most effective: Providing not only example frames and actions but also reasoning processes yields the best results

Limitations & Future Work

  • YouTube video quality varies; optical flow filtering may discard valid content
  • IDM training relies on existing 600K annotated data; not fully zero-annotation
  • Only covers desktop OS operations; not extended to mobile platforms
  • vs. ShowUI/OS-Atlas: these rely on costly manual annotation; W&L automates annotation and therefore scales
  • vs. TongUI: its heuristic parsing reaches only 72.3% accuracy; W&L's IDM reaches 91.7%

Rating

  • Novelty: ⭐⭐⭐⭐ — Using IDM to automatically extract trajectories from videos is a novel data acquisition paradigm
  • Experimental Thoroughness: ⭐⭐⭐⭐ — OSWorld + multi-model validation + ICL/SFT dual modes
  • Writing Quality: ⭐⭐⭐⭐ — Clear framework
  • Value: ⭐⭐⭐⭐⭐ — Solves the CUA training data bottleneck with high practical value