
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

Conference: CVPR 2025
arXiv: 2501.09781
Code: Project Page
Area: Image Generation / Video Generation & Understanding
Keywords: Video Generation, Knowledge Learning, Latent Dynamics Models, Go AI, Robotic Control

TL;DR

VideoWorld explores whether purely visual video generation models can learn complex knowledge (rules, reasoning, planning) from unlabeled videos. It proposes a Latent Dynamics Model (LDM) to compress multi-step visual changes, achieving a professional 5-dan level in Go with only 300 million parameters.

Background & Motivation

Large language models have demonstrated powerful knowledge and reasoning capabilities through next-token prediction, but language cannot completely capture all knowledge of the real world. Organisms in nature mainly learn through visual observation:

  • Primates like chimpanzees learn foraging and social skills by observing adult behavior, without relying on language.
  • Existing works primarily focus on learning knowledge from text or labels, while research on learning from purely visual signals is limited.
  • Methods like UniPi utilize video generation for robotic control but still heavily rely on language instructions and are limited to simple commands.
  • ChessGPT, Othello-GPT, and others explore reasoning in board games, but they use state-annotated data instead of raw videos.
  • Core Problem: Can AI learn knowledge solely through visual inputs, just like chimpanzees learning from the environment?

Key Findings: (1) Pure video training is sufficient to learn knowledge (rules, reasoning, planning); (2) The representation of visual changes is crucial for knowledge acquisition.

Method

Overall Architecture

VideoWorld encodes video frames into discrete tokens using VQ-VAE (with MAGVITv2 + FSQ quantizer), and performs next-token prediction via an autoregressive Transformer based on the Llama architecture. The key innovation is the introduction of a Latent Dynamics Model (LDM), which compresses future multi-step visual changes into compact latent codes, interleaved with video tokens for autoregressive prediction. During inference, an Inverse Dynamics Model (IDM) maps the generated frames and latent codes to actionable steps.

Key Design 1: Latent Dynamics Model (LDM)

Function: Compresses multi-step visual changes into compact latent codes to enhance the efficiency and effectiveness of knowledge learning.

Mechanism: For each frame \(x_t\) and the subsequent \(H\) frames \(x_{t+1:t+H}\), visual features \(f_{t:t+H}\) are extracted by the encoder of a causal encoder-decoder. Learnable query embeddings \(\{q^h\}_{h=1}^H\) capture change information from \(f_{t:t+h}\) via attention, yielding continuous latent representations \(\tilde{z}_t^h\), which are quantized by FSQ to obtain discrete latent codes \(z_t^h\). The decoder uses \(f_t\) and \(\{z_t^h\}_{h=1}^H\) to reconstruct the subsequent frames. The training objective is to minimize the \(\ell_2\) distance between the reconstructed frames and the ground truth frames.
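The mechanism above can be written compactly as follows. This is a notational sketch rather than the paper's exact formulation: the symbols \(\mathrm{Attn}\), \(D\) (the decoder), and \(\mathcal{L}_{\text{LDM}}\), as well as the choice of a squared norm summed over the horizon, are assumptions introduced here for illustration.

\[
\tilde{z}_t^h = \mathrm{Attn}\big(q^h,\, f_{t:t+h}\big), \qquad
z_t^h = \mathrm{FSQ}\big(\tilde{z}_t^h\big), \qquad h = 1, \dots, H
\]

\[
\hat{x}_{t+1:t+H} = D\big(f_t,\, \{z_t^h\}_{h=1}^H\big), \qquad
\mathcal{L}_{\text{LDM}} = \sum_{h=1}^{H} \big\lVert \hat{x}_{t+h} - x_{t+h} \big\rVert_2^2
\]

Because the decoder only sees \(f_t\) and the quantized codes, any information about frames \(x_{t+1:t+H}\) has to pass through \(\{z_t^h\}\).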

Design Motivation: A single move in Go can be encoded with a few position tokens, but a video requires hundreds of tokens. LDM compresses key decision information into a compact representation while encoding look-ahead planning information. Quantization acts as an information bottleneck to prevent LDM from learning trivial copy-paste shortcuts.
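To make the bottleneck idea concrete, here is a minimal sketch of finite scalar quantization (FSQ) on a single latent vector. It assumes an odd number of levels per dimension and uses a tanh bound followed by rounding. The function names and the level configuration are hypothetical, and the straight-through gradient used during training is omitted.

```python
import math

def fsq_quantize(z, levels):
    """Snap each coordinate of z to one of levels[i] integer values (odd levels only)."""
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2           # e.g. 5 levels -> integers in [-2, 2]
        bounded = half * math.tanh(value)     # squash into the open interval (-half, half)
        codes.append(round(bounded))          # nearest integer level
    return codes

def code_to_index(codes, levels):
    """Map a per-dimension code to a single token id via mixed-radix encoding."""
    index = 0
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        index = index * num_levels + (code + half)  # shift code into [0, num_levels)
    return index

levels = [5, 5, 5]                       # hypothetical config: 5 * 5 * 5 = 125 possible codes
codes = fsq_quantize([0.1, 3.0, -3.0], levels)
print(codes, code_to_index(codes, levels))
```

With only 125 possible codes per latent in this toy configuration, the code cannot carry a copy of the future frames, which is the shortcut the quantization step is meant to prevent.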

Key Design 2: Seamless Integration of Autoregressive Transformer and LDM

Function: Unifies video frame tokens and LDM latent codes into a single autoregressive sequence for prediction.

Mechanism: The video decoder and LDM use different codebooks, and the vocabulary of the autoregressive Transformer is the union of the two. Discrete tokens of each frame and the corresponding latent codes \(\{z_t^h\}_{h=1}^H\) are combined into a sequence for autoregressive training. This allows the Transformer to utilize both fine-grained visual details captured by the visual encoder and compact temporal dynamics representations generated by the LDM.
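One simple way to realize a union vocabulary is to shift the LDM code ids past the visual token ids and interleave the two streams frame by frame. The sketch below assumes that offset scheme and assumes each frame's tokens are followed by that frame's latent codes. The function name, the vocabulary size, and the ordering are illustrative assumptions, not details confirmed by the source.

```python
def build_sequence(frame_tokens, latent_codes, visual_vocab_size):
    """Interleave per-frame visual tokens with per-frame LDM codes.

    frame_tokens: list of token-id lists, one per frame (ids in [0, visual_vocab_size)).
    latent_codes: list of LDM code-id lists, one per frame (ids start at 0).
    LDM ids are offset by visual_vocab_size so the two codebooks never collide.
    """
    if len(frame_tokens) != len(latent_codes):
        raise ValueError("need exactly one latent-code list per frame")
    sequence = []
    for tokens, codes in zip(frame_tokens, latent_codes):
        sequence.extend(tokens)                                    # visual tokens of frame t
        sequence.extend(visual_vocab_size + code for code in codes)  # shifted LDM codes of frame t
    return sequence

# Toy example: 2 frames, 2 visual tokens each, H = 2 latent codes each, visual vocab of 10.
print(build_sequence([[3, 7], [1, 2]], [[0, 4], [5, 1]], visual_vocab_size=10))
# → [3, 7, 10, 14, 1, 2, 15, 11]
```

The resulting flat list is what a next-token predictor would be trained on. Ids below `visual_vocab_size` decode to image patches and the rest decode to latent dynamics codes.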

Design Motivation: Combining rich visual information with compact change representations allows for more effective knowledge learning compared to using only videos or only state sequences.

Key Design 3: Inverse Dynamics Model (IDM) Mapping to Task Operations

Function: Translates the results of video generation into concrete task actions (such as placing Go stones or robotic control).

Mechanism: The IDM \(\pi\) consists of several MLP layers and is trained independently of the video generator, using a small amount of action-labeled video data. In the base framework it takes the form \(\pi(\cdot \mid x_t, \hat{x}_{t+1})\); with LDM integration it is extended to \(\pi(\cdot \mid x_t, \hat{x}_{t+1}, \{\hat{z}_t^h\}_{h=1}^H)\), using the temporal representations encoded by the LDM to improve the temporal consistency and accuracy of action prediction.
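The difference between the two conditioning schemes can be illustrated with a toy MLP. The sketch below is not the paper's IDM. It assumes frames are already summarized as small feature vectors, feeds latent codes as raw floats, uses untrained random weights, and picks hypothetical layer sizes, so it only shows how the input grows when the latent codes are appended.

```python
import random

def make_mlp(sizes, seed=0):
    """Create (weights, biases) for each layer with small random weights (untrained)."""
    rng = random.Random(seed)
    return [
        ([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def mlp_forward(layers, x):
    """Apply linear layers with ReLU between them (no activation on the output)."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

def idm_predict(layers, frame_feat, next_frame_feat, latent_codes=None):
    """Return the argmax action id given x_t, the predicted x_{t+1}, and optional LDM codes."""
    features = list(frame_feat) + list(next_frame_feat)
    if latent_codes is not None:
        features += [float(code) for code in latent_codes]  # append the H latent codes
    logits = mlp_forward(layers, features)
    return max(range(len(logits)), key=logits.__getitem__)

# Hypothetical sizes: 4-dim frame features, H = 3 latent codes, 5 possible actions.
base_idm = make_mlp([4 + 4, 16, 5])
ldm_idm = make_mlp([4 + 4 + 3, 16, 5])
x_t, x_next = [0.2, 0.1, 0.0, 0.5], [0.2, 0.1, 1.0, 0.5]
print(idm_predict(base_idm, x_t, x_next), idm_predict(ldm_idm, x_t, x_next, [12, 7, 3]))
```

In practice such a model would be fit on the small action-labeled set mentioned above, while the video generator itself stays label-free.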

Design Motivation: Decoupling the perception model from the action model allows the video generator to focus on learning knowledge representations, requiring only a small amount of annotated data for the IDM to perform action mapping.

Loss & Training

  • VQ-VAE training loss: standard reconstruction loss + FSQ quantization loss.
  • LDM training loss: \(\ell_2\) pixel reconstruction loss.
  • Autoregressive Transformer: cross-entropy next-token prediction loss.
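For reference, the next-token objective of the autoregressive Transformer has the standard form below. The notation is an assumption made here: \(s_1, \dots, s_N\) denotes the combined sequence of frame tokens and LDM latent codes, and \(p_\theta\) denotes the Transformer's predicted distribution over the joint vocabulary.

\[
\mathcal{L}_{\text{AR}} = -\sum_{i=1}^{N} \log p_\theta\big(s_i \mid s_{<i}\big)
\]

Since frame tokens and latent codes share one sequence, the same loss supervises both the visual prediction and the latent dynamics prediction.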

Key Experimental Results

Main Results: Video-GoBench Go Evaluation

| Agent | Input | Legal Rate | Action-Value | Acc | Elo |
|---|---|---|---|---|---|
| VideoWorld 300M | Video | 99.7% | 83.7% | 88.1% | 2317 |
| VideoWorld 150M | Video | 99.7% | 82.0% | 86.7% | 2218 |
| Transformer 300M (Video) | Video | 99.6% | 59.7% | 58.9% | 1998 |
| Transformer 300M (State) | State | 99.8% | 79.7% | 87.2% | 2308 |
| KataGO-human-5d (RL) | State | 100% | 83.5% | 83.7% | 2253 |

Ablation Study: Contribution of LDM

| Setting | Action-Value | Elo | Description |
|---|---|---|---|
| VideoWorld (Video+LDM) | 83.7% | 2317 | Ours (Full Model) |
| Baseline Transformer (Video-Only) | 59.7% | 1998 | Low efficiency of knowledge acquisition |
| State Sequence Transformer | 79.7% | 2308 | Lacks visual information |

Key Findings

  • VideoWorld 300M achieves a professional 5-dan level (Elo 2317), surpassing KataGO-human-5d (Elo 2253), without using search algorithms or reinforcement learning.
  • Adding the LDM raises Action-Value from 59.7% to 83.7% and Elo from 1998 to 2317 over the video-only Transformer baseline, indicating that compact change representations are critical for knowledge acquisition.
  • In CALVIN robot tasks, the performance of VideoWorld is close to the oracle model trained with ground truth action labels.
  • UMAP visualization shows that LDM latent codes learn meaningful Go patterns and robot movement directions.
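To get a feel for what the Elo gaps in the tables mean, the ratings can be converted into expected scores. The sketch below assumes the standard logistic Elo formula with a 400-point scale. The source does not state how its Elo values were computed, so treat the numbers as a rough reading aid only.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# VideoWorld 300M (2317) vs KataGO-human-5d (2253): a 64-point gap.
print(round(elo_expected_score(2317, 2253), 2))  # → 0.59

# VideoWorld 300M (2317) vs the video-only Transformer baseline (1998).
print(round(elo_expected_score(2317, 1998), 2))
```

Under this assumption, the 64-point margin over KataGO-human-5d corresponds to a modest edge, while the gap to the video-only baseline is much larger.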

Highlights & Insights

  • Pioneering Exploration: First to systematically verify that video generation models can learn complex reasoning and planning knowledge from purely visual data.
  • Small Model, Big Capability: The 300M parameter model achieves a professional 5-dan level in Go, implying that visual learning might be more efficient than language learning.
  • Generality of LDM: The Latent Dynamics Model is effective on two vastly different tasks: Go (discrete decisions) and robotics (continuous control).

Limitations & Future Work

  • The visual design of Go was deliberately simplified (texture removal), and the visual complexity of the real world has not yet been fully captured.
  • IDM still requires a small amount of action-labeled data, meaning fully unsupervised learning is not yet achieved.
  • It has not yet been validated on more complex real-world tasks (such as autonomous driving).
  • Future directions: better visual representations, large-scale pretraining, unified multi-task learning.
  • Compared to the latent action models of Genie and LAPO, VideoWorld's LDM supports multi-step look-ahead planning.
  • In sharp contrast to methods like ChessGPT that use annotated state data, learning from pure videos can meet or even exceed the performance of state-based learning.
  • Provides a new experimental paradigm for World Model research—directly evaluating the knowledge acquisition capabilities of video generation models.

Rating

⭐⭐⭐⭐⭐ Highly pioneering work. It provides the first systematic empirical evidence that video generation models can learn complex reasoning and planning knowledge from purely visual inputs. Reaching a professional 5-dan level in Go with only 300 million parameters is highly impressive. The LDM design is simple yet generic and offers useful inspiration for World Models and embodied AI.