Watch and Learn: Learning to Use Computers from Online Videos¶
Conference: CVPR2026 arXiv: 2510.04673 Code: Project Page Area: LLM Pretraining Keywords: computer-using agent, inverse dynamics model, video-to-trajectory, in-context learning, supervised fine-tuning, UI grounding
TL;DR¶
This paper proposes Watch & Learn (W&L), a framework that leverages an Inverse Dynamics Model (IDM) to automatically convert human computer-use tutorial videos from the internet into executable UI trajectory data. The system generates 53K+ high-quality trajectories that serve as either ICL demonstrations or SFT training data, significantly improving CUA performance across multiple models and platforms.
Background & Motivation¶
Background: Computer-Using Agents (CUAs) require large-scale multi-step human-computer interaction trajectories for training, but manual annotation is prohibitively expensive — OpenCUA's AgentNet dataset of 22K tasks took 6 months and over $32,000, with scaling to millions costing upwards of $500K.
Limitations of Prior Work: Manually annotated UI datasets are limited in scale and domain coverage, making it difficult to generalize to diverse and constantly evolving applications and operating systems. Exploration-based synthesis (e.g., BAGEL, OS-Genesis) introduces significant noise, while tutorial-driven synthesis relies on fragile LLM annotations that misalign with actual user behavior.
Key Challenge: A vast repository of human-operated tutorial videos exists on platforms like YouTube, naturally encoding cross-application task workflows, but no effective method exists to convert these into structured trajectories. Prior video-to-trajectory approaches such as MONDAY achieve only ~70% accuracy through cascaded pipelines, and TongUI's reliance on MLLM action annotation is similarly unreliable, with errors compounding across steps.
Goal: Develop a scalable, automatic pipeline to harvest internet tutorial videos and convert them into high-fidelity UI trajectories, applicable across operating systems and usable for both ICL and SFT paradigms.
Method¶
Overall Architecture¶
The W&L framework operates in three stages: (1) constructing a large-scale state-transition corpus and training an Inverse Dynamics Model (IDM); (2) task-aware retrieval of tutorial videos combined with IDM annotation to generate trajectories; (3) deploying trajectories as either ICL demonstrations or SFT training data to empower CUAs.
Key Designs¶
1. Inverse Dynamics Model (IDM)
- Core Idea: Given consecutive screenshot pairs \((O_t, O_{t+1})\), the model predicts the user action \(a_t\) that caused the transition — reducing trajectory recovery to single-step inverse dynamics prediction
- Action Space: 6 atomic operations — Click (with coordinates), Release, Scroll, Type (with text), Wait, Move (with coordinates); drag operations are composed from Click + Move + Release
- Architecture: SigLIP-2 vision encoder + 4-layer Transformer backbone + three prediction heads:
  - Action classification head: 6-class action-type classifier
  - Coordinate head: discretizes (x, y) into 0–999 classification bins (more stable than regression)
  - Language head: lightweight GPT-2 decoder for autoregressive text generation
- Training Data: Web pages sampled from Common Crawl entry points, automatically browsed to record 600K+ \((O_t, a_t, O_{t+1})\) triplets
- Training Objective: Multi-task cross-entropy loss with action-type-conditional branch activation, i.e., the coordinate and text heads are supervised only for action types that carry those fields (a minimal sketch of the model and loss follows this list)
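To make the three-headed design concrete, here is a minimal PyTorch-style sketch of such an IDM. The module wiring, hidden sizes, and target-dictionary fields are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ACTIONS = 6        # Click, Release, Scroll, Type, Wait, Move (per the paper's action space)
NUM_COORD_BINS = 1000  # (x, y) discretized into 0-999 bins and predicted as classification


class InverseDynamicsModel(nn.Module):
    """Predicts the action a_t that maps screenshot O_t to O_{t+1}.

    Hypothetical sketch: `vision_encoder` stands in for SigLIP-2 and
    `text_decoder` for the lightweight GPT-2 head; dimensions are placeholders.
    """

    def __init__(self, vision_encoder, text_decoder, d_model=768, n_layers=4):
        super().__init__()
        self.vision_encoder = vision_encoder  # SigLIP-2 image encoder
        layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(d_model, NUM_ACTIONS)        # action-type classifier
        self.coord_head = nn.Linear(d_model, 2 * NUM_COORD_BINS)  # x-bin and y-bin logits
        self.text_decoder = text_decoder                           # GPT-2 decoder for typed text

    def forward(self, obs_t, obs_t1):
        # Encode both screenshots and let the transformer reason over the transition.
        tokens = torch.cat([self.vision_encoder(obs_t), self.vision_encoder(obs_t1)], dim=1)
        h = self.backbone(tokens).mean(dim=1)  # pooled transition feature
        action_logits = self.action_head(h)
        x_logits, y_logits = self.coord_head(h).split(NUM_COORD_BINS, dim=-1)
        return action_logits, x_logits, y_logits, h


def idm_loss(action_logits, x_logits, y_logits, text_nll, target):
    """Multi-task cross-entropy with branch activation conditioned on the action type
    (shown for a single transition for clarity)."""
    loss = F.cross_entropy(action_logits, target["action_type"])
    if target["has_coords"]:  # Click / Move carry (x, y) coordinates
        loss = loss + F.cross_entropy(x_logits, target["x_bin"]) \
                    + F.cross_entropy(y_logits, target["y_bin"])
    if target["has_text"]:    # Type carries a text payload
        loss = loss + text_nll  # autoregressive NLL from the GPT-2 head
    return loss
```

Framing coordinate prediction as 1000-way classification per axis, rather than regression, mirrors the paper's note that discretized bins train more stably.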
2. Video Retrieval and Trajectory Generation
- Inference-time retrieval: Task description + initial screenshot → Gemini 2.5 Flash optimizes search query → YouTube Search API retrieves top-15 videos → filters out non-screencast/blurry segments → retains top-3
- Training-time retrieval: Covers 69 applications across 7 categories (productivity/programming/design/video editing/audio/system/science); Gemini generates diverse queries, yielding 53,125 tutorial videos
- Filtering: 1fps frame sampling; Gemini 2.5 Flash automatically removes non-screencast content, cropping/zooming artifacts, and blurry transitions
- Trajectory Annotation: the IDM predicts an action for each consecutive frame pair, assembling complete trajectories \(\tau = (O_0, a_0, O_1, a_1, \ldots, O_T, a_T, O_{T+1})\) (see the pipeline sketch after this list)
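A compact sketch of how retrieval, filtering, and IDM annotation could fit together. The `search_fn` and `sample_frames` helpers and the `llm`/`idm` interfaces are hypothetical stand-ins for the paper's Gemini-based retrieval/filtering and the trained IDM.

```python
def video_to_trajectory(task, initial_screenshot, idm, llm,
                        search_fn, sample_frames, top_k_videos=3, fps=1):
    """Hypothetical end-to-end sketch: retrieve tutorial videos for a task,
    filter frames, and let the IDM annotate consecutive frame pairs."""
    # 1. Task-aware retrieval: an LLM (Gemini 2.5 Flash in the paper) rewrites the
    #    task + initial screenshot into a search query; the top-15 results are fetched.
    query = llm.rewrite_query(task, initial_screenshot)
    candidates = search_fn(query, max_results=15)  # e.g., a YouTube Search API wrapper

    # 2. Filtering: drop non-screencast, cropped/zoomed, or blurry videos; keep top-3.
    videos = [v for v in candidates if llm.is_clean_screencast(v)][:top_k_videos]

    trajectories = []
    for video in videos:
        frames = sample_frames(video, fps=fps)  # 1 fps sampling per the paper
        frames = [f for f in frames if not llm.is_blurry_transition(f)]

        # 3. IDM annotation: predict the action between each consecutive frame pair,
        #    assembling tau = (O_0, a_0, O_1, ..., O_T, a_T, O_{T+1}).
        trajectory = [(o_t, idm.predict_action(o_t, o_t1))
                      for o_t, o_t1 in zip(frames[:-1], frames[1:])]
        trajectory.append((frames[-1], None))  # final observation carries no action
        trajectories.append(trajectory)
    return trajectories
```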
3. Trajectory Application Modes
- ICL: Each trajectory is converted into (observation, action, reasoning) triplet demonstrations, where reasoning is generated by Gemini 2.5 Flash as natural language explanations
- SFT: Annotated trajectories are aggregated into (state, action) sequence corpora and used for standard sequence-modeling fine-tuning (both conversions are sketched below)
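A minimal sketch of both conversions under assumed data structures: the interleaved message format, the `reasoner` interface, and the SFT record fields are illustrative, not the paper's exact formats.

```python
def trajectory_to_icl_demo(trajectory, task, reasoner):
    """Render one IDM-annotated trajectory as interleaved (observation, reasoning, action)
    demonstration steps for a multimodal prompt. `reasoner` stands in for Gemini 2.5 Flash,
    which the paper uses to generate the natural-language reasoning."""
    messages = [{"type": "text", "text": f"Demonstration for task: {task}"}]
    for step, (obs, action) in enumerate(trajectory):
        if action is None:  # trailing observation O_{T+1}
            break
        reasoning = reasoner.explain(task, obs, action)   # assumed helper
        messages.append({"type": "image", "image": obs})  # screenshot O_t
        messages.append({
            "type": "text",
            "text": f"Step {step} reasoning: {reasoning}\nStep {step} action: {action}",
        })
    return messages


def trajectory_to_sft_pairs(trajectory, task):
    """Flatten the same trajectory into (state, action) pairs for supervised fine-tuning."""
    return [{"task": task, "state": obs, "action": action}
            for obs, action in trajectory if action is not None]
```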
Key Experimental Results¶
Main Results¶
| Setting | Model | Method | Success Rate (%) |
|---|---|---|---|
| ICL | Gemini 2.5 Flash | Base | 19.0 |
| | | + W&L | 22.0 (+3.0) |
| | OpenAI o3 | Base | 21.8 |
| | | + TongUI | 21.1 (-0.7) |
| | | + W&L | 24.3 (+2.5) |
| | Claude 4 Sonnet | Base | 43.9 |
| | | + TongUI | 43.4 (-0.5) |
| | | + W&L | 45.5 (+1.6) |
| | Jedi (o3) | Base | 50.6 |
| | | + W&L | 52.8 (+2.2) |
| SFT | Qwen 2.5VL 7B | Base | 1.9 |
| | | + TongUI | 5.4 (+3.5) |
| | | + W&L | 13.0 (+11.1) |
| | UI-TARS-1.5-7B | Base | 27.3 |
| | | + TongUI | 23.8 (-3.5) |
| | | + W&L | 31.1 (+3.8) |
Results on OSWorld-Verified with a 50-step budget. W&L consistently improves over the base models and over TongUI-augmented variants in both the ICL and SFT paradigms.
WindowsAgentArena Results (15-step)¶
| Model | Success Rate (%) |
|---|---|
| UI-TARS-1.5-7B (zero-shot) | 18.1 |
| + TongUI SFT | 12.9 (-5.2) |
| + W&L SFT | 24.0 (+5.9) |
| OpenCUA-7B | 13.5 |
| UltraCUA-7B | 21.7 |
W&L achieves SOTA among 7B models, while TongUI annotation actually degrades performance.
Ablation Study¶
IDM Annotation Accuracy Comparison (held-out test set, 100 transitions per action type):
| Metric | Gemini 2.5 Flash | TongUI | W&L IDM |
|---|---|---|---|
| ActionType Acc. | 81.5% | 84.3% | 95.8% |
| Action Acc. | 70.5% | 72.3% | 91.7% |
ICL Component Ablation (OSWorld):
| Configuration | Gemini Flash | o3 | Claude Sonnet |
|---|---|---|---|
| No examples | 19.0 | 21.8 | 43.9 |
| + Frames | 18.4 | 21.8 | 43.9 |
| + Frames + Actions | 20.1 | 23.0 | 44.4 |
| + Frames + Actions + Reasoning | 22.0 | 24.3 | 45.5 |
Retrieval Strategy Ablation: random retrieval of tutorial videos gives o3 no gain (success rate stays at 21.8), whereas task-relevant retrieval adds +2.5 points, reaching 24.3.
Key Findings¶
- IDM annotation accuracy (91.7%) substantially surpasses TongUI (72.3%) and Gemini (70.5%), with particularly large advantages on location-dependent actions such as click and scroll
- TongUI annotation performs worse on Windows (TongUI's underlying UI-TARS was trained on Ubuntu), causing a 5.2-point SFT performance drop
- In the ICL setting, action labels and reasoning traces contribute incremental gains, indicating that trajectories convey procedural and causal knowledge beyond visual context
- Random retrieval does not harm performance (since annotations are inherently accurate), but task-relevant retrieval is necessary for significant improvements
Highlights & Insights¶
- Inverse dynamics modeling is the core innovation: reduces trajectory recovery from end-to-end generation to single-step prediction, drastically lowering learning difficulty while naturally generalizing across applications
- Exceptional scalability: 53K trajectories generated fully automatically without human annotation, at far lower cost than AgentNet-style approaches
- Dual application pathway: the same trajectory set supports both ICL and SFT, flexibly accommodating both closed-source and open-source models
- Cross-OS generalization: demonstrated effectiveness on both Ubuntu (OSWorld) and Windows (WAA), validating the platform robustness of IDM annotations
- The +11.1-point gain on Qwen 2.5VL 7B shows that this data can equip a general-purpose multimodal model with computer-operation capabilities it largely lacked
Limitations & Future Work¶
- The IDM supports only 6 atomic actions, with limited coverage of complex interactions (e.g., right-click context menus, multi-touch gestures, keyboard shortcut combinations)
- Dependence on YouTube video quality and availability; niche applications may lack tutorial coverage
- Filtering and retrieval rely on Gemini 2.5 Flash, introducing dependency on a commercial API
- The paper does not explore automatic segmentation of multi-task long videos; the current approach assumes one video corresponds to one trajectory
- Reinforcement learning is not explored (only ICL + SFT); the authors list RL as future work
- 1fps sampling may miss intermediate states of rapid operations (e.g., consecutive clicks, fast scrolling)
Related Work & Insights¶
- Exploration-based synthesis: BAGEL, NNetNav, Explorer, OS-Genesis — random exploration + retrospective annotation, producing noisy data
- Tutorial-driven synthesis: Synatra, AgentTrek (text tutorials), TongUI (multimodal tutorials + MLLM annotation) — broad coverage but fragile annotation
- Self-improving agents: OpenWebVoyager, WebRL, ZeroGUI — require no human data but are limited in task distribution
- IDM in robotics: VPT (Minecraft pretraining), DreamGen — inspired the IDM design in this work
- Agent ICL: Workflow abstraction + example selection, complementary to this paper's ICL approach
Rating¶
- Novelty: ⭐⭐⭐⭐ — Transferring inverse dynamics modeling from robotics to GUI agents is a genuinely original approach
- Experimental Thoroughness: ⭐⭐⭐⭐ — Dual benchmarks + ICL/SFT dual paradigms + multi-model evaluation + comprehensive ablations
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with coherent motivation-method-experiment logic
- Value: ⭐⭐⭐⭐⭐ — Opens a practical pathway for scalable CUA training data production from internet videos