MotionStream: Real-Time Video Generation with Interactive Motion Controls
Conference: ICLR 2026
arXiv: 2511.01266
Code: None
Area: Video Generation
Keywords: streaming video generation, motion control, causal distillation, attention sink, distribution matching distillation, real-time interaction
TL;DR
The authors propose MotionStream, the first real-time streaming video generation system with motion control. The method trains a bidirectional motion-controlled teacher with a lightweight track head, then distills it into a causal student via Self Forcing and DMD. Attention sinks and a rolling KV cache are used during distillation as well as at inference, so the context the student sees in training matches what it sees when streaming. The system reaches 17 FPS (29 FPS with Tiny VAE) at 480P on a single H100 GPU and supports infinite-length generation at constant speed.
Background & Motivation
Background: Motion-controlled video generation (e.g., Motion Prompting) has succeeded in generating high-quality trajectory-following videos. However, inference is extremely slow (e.g., 12 minutes for a 5-second video), non-causal (requiring complete control signals beforehand), and limited to finite lengths.
Limitations of Prior Work:
- Bidirectional attention in diffusion models requires knowledge of future trajectories, preventing real-time interaction.
- Causal distillation methods like CausVid suffer from severe drift (color shifts and quality degradation) outside the training horizon (>81 frames).
- ControlNet-style architectures double FLOPs, further slowing down inference.
- RoPE positions in sliding window attention grow unbounded, leading to fluctuations in latency and throughput.
Key Challenge: Interactive creation requires a "real-time + causal + infinite length" experience, which fundamentally conflicts with the "slow + bidirectional + finite length" paradigm of diffusion models.
Goal: Transform motion-controlled video generation from a "render-and-wait" mode to a "real-time creation" mode, where users see immediate results as they draw trajectories.
Key Insight: Three bottlenecks are addressed together: (1) a lightweight teacher architecture to reduce baseline overhead; (2) joint guidance embedding distillation to eliminate multiple NFEs per step; (3) attention sinks and training-time simulation of the inference distribution to eliminate long-video drift.
Core Idea: A pipeline consisting of "Efficient Teacher → Causal Distillation → Attention Sink Extrapolation Training" enables real-time, infinite streaming generation of motion-controlled videos.
Method
Overall Architecture
MotionStream aims to make motion-controlled video generation an immediate experience. Since diffusion models are inherently slow and non-causal, the authors use a two-stage pipeline. Stage 1 attaches a lightweight track head to Wan DiT to train a high-quality but bidirectional and slow teacher. Stage 2 performs causal adaptation and distills the teacher into a single-step causal student using Self Forcing-style DMD. During distillation, attention sinks and a rolling KV cache are introduced so that the context distribution during training matches streaming inference. The final student achieves 17 FPS (29 FPS with Tiny VAE) at 480P on a single H100, with generation speed independent of video length.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
IN["Input: 2D Trajectories + Text Prompt"] --> ENC["Lightweight Track Head<br/>Sinusoidal Trajectory Encoding"]
ENC -->|"Channel Concatenation into Wan DiT"| TEACHER["Stage 1: Bidirectional Motion Teacher<br/>Flow Matching Training<br/>(High Quality, Slow, Non-causal)"]
TEACHER -->|"Joint Text-Motion Guidance"| CAUSAL["Stage 2: Causal Adaptation"]
CAUSAL --> DMD["Joint Guidance Embedding Distillation<br/>Self Forcing DMD into Single-Step"]
DMD --> ROLL["Attention Sink + Rolling KV Cache<br/>Extrapolation Training (Train=Test Dist.)"]
ROLL --> STUDENT["Causal Student"]
STUDENT --> OUT["Output: Real-time Streaming Video<br/>480P · 17/29 FPS · Infinite Length"]
Key Designs
1. Lightweight Track Head and Sinusoidal Encoding: Injecting 2D Trajectories Efficiently
ControlNet-style branches double FLOPs, which is unacceptable for real-time systems. MotionStream instead assigns a unique \(d\)-dimensional sinusoidal position encoding \(\phi_n\) to each trajectory as an ID. This encoding is written into the spatial location of the track in each frame: \(c_m[t, \lfloor y_t^n/s \rfloor, \lfloor x_t^n/s \rfloor] = v[t,n] \cdot \phi_n\). After 4× temporal compression and a \(1\times1\times1\) convolution, this sparse motion map is concatenated with the video latent along the channel dimension. This is roughly 40× faster than RGB-VAE encoding (24.8 ms vs 1053 ms) and improves tracking accuracy (EPE 6.54 vs 8.57), because sinusoidal encodings provide clearer identity signals than raw pixels.
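The scatter step in the formula above can be illustrated with a minimal pure-Python sketch. Several details are assumptions made for illustration only: \(v[t,n]\) is read as a 0/1 visibility flag, \(s\) as the spatial stride, and `sinusoidal_id` uses a Transformer-style sin/cos layout that the summary does not specify. The function names are hypothetical and the temporal compression and convolution are omitted.

```python
import math

def sinusoidal_id(n, d, base=10000.0):
    """d-dimensional sinusoidal code for track index n (assumed Transformer-style layout)."""
    half = d // 2
    freqs = [base ** (-i / half) for i in range(half)]
    return [math.sin(n * f) for f in freqs] + [math.cos(n * f) for f in freqs]

def build_motion_map(tracks, visibility, T, H, W, stride, d):
    """Scatter v[t][n] * phi_n into a sparse (T, H//stride, W//stride, d) map.

    tracks[t][n] = (x, y) in pixel coordinates; visibility[t][n] is 0 or 1 (assumption).
    """
    h, w = H // stride, W // stride
    c_m = [[[[0.0] * d for _ in range(w)] for _ in range(h)] for _ in range(T)]
    for t in range(T):
        for n, (x, y) in enumerate(tracks[t]):
            row, col = int(y // stride), int(x // stride)
            if not (0 <= row < h and 0 <= col < w):
                continue  # the track left the frame, so nothing is written
            phi = sinusoidal_id(n, d)
            c_m[t][row][col] = [visibility[t][n] * p for p in phi]
    return c_m

# Two tracks over two frames; the second track is not visible in frame 1.
tracks = [[(8.0, 16.0), (40.0, 24.0)], [(12.0, 16.0), (44.0, 24.0)]]
visibility = [[1, 1], [1, 0]]
motion_map = build_motion_map(tracks, visibility, T=2, H=32, W=64, stride=8, d=4)
```

In this toy layout an invisible track writes an all-zero vector, which is indistinguishable from a cell with no track at all.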
2. Joint Text-Motion Guidance Embedding Distillation: Baking Guidance Costs into the Student
The teacher follows both trajectories and text through joint classifier-free guidance, with a text guidance weight \(w_t=3.0\) and a motion guidance weight \(w_m=1.5\). This requires three forward passes per denoising step. During distillation, the student does not mimic these three passes. Instead, the teacher's jointly guided prediction is used as the \(s_{\text{real}}\) target for DMD, while the \(s_{\text{fake}}\) branch uses only a single \(f_\psi(c_t,c_m)\) pass without CFG. This "bakes" the quality of the combined guidance into a single pass at inference. Retaining text guidance is vital: pure motion guidance creates rigid 2D translations, whereas text guidance adds natural secondary motions (e.g., a background rainbow appearing as an elephant walks).
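To make the three-pass cost concrete, here is a small sketch of how two guidance terms might be combined. The combination rule is an assumed form chosen only because it uses exactly three passes (both conditions, text dropped, motion dropped); it is not taken from the paper, and `joint_guidance` is a hypothetical helper operating on plain lists.

```python
def joint_guidance(v_full, v_no_text, v_no_motion, w_t=3.0, w_m=1.5):
    """Combine three teacher passes element-wise (assumed rule, for illustration).

    v_full      : prediction with text and motion conditions
    v_no_text   : prediction with the text condition dropped
    v_no_motion : prediction with the motion condition dropped
    """
    return [
        full + w_t * (full - no_text) + w_m * (full - no_motion)
        for full, no_text, no_motion in zip(v_full, v_no_text, v_no_motion)
    ]

# Toy two-element "velocity" vectors standing in for model outputs.
guided = joint_guidance([1.0, 2.0], [0.5, 2.0], [1.0, 1.0])
```

Whatever the exact rule, the teacher pays for three network evaluations per step here, while the distilled student is meant to produce a comparable output from one conditioned pass.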
3. Attention Sinks and Rolling KV Cache: Constant Speed and Drift Prevention
Causal distillation methods typically drift beyond the training horizon (>81 frames). MotionStream maintains a fixed-size KV cache made of \(S\) sink chunks (the initial frames) and \(W\) local window chunks. As new tokens are generated, the window rolls forward while the cache size remains constant. RoPE positions are assigned by position within the cache rather than by absolute time, which keeps latency constant. Critically, the student is trained with self-rollout using the exact same rolling KV cache and attention sinks. This aligns the training context distribution with inference and prevents color/content drift. Retaining the initial frames as anchors is motivated by the observation (Figure 3 of the paper) that many attention heads consistently attend to initial tokens, which function as global anchors.
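The cache policy described above can be sketched at chunk granularity as follows. This is a simplified illustration, not the authors' implementation: `RollingChunkCache` is a hypothetical class, chunks are represented by string labels instead of key/value tensors, and the integer returned with each chunk stands in for the intra-cache index from which RoPE positions would be derived.

```python
from collections import deque

class RollingChunkCache:
    """Fixed-size cache: the first `num_sink` chunks are kept forever,
    and only the most recent `window` chunks are kept after that."""

    def __init__(self, num_sink, window):
        self.num_sink = num_sink
        self.sink = []
        self.local = deque(maxlen=window)  # the oldest chunk is evicted automatically

    def append(self, chunk):
        if len(self.sink) < self.num_sink:
            self.sink.append(chunk)
        else:
            self.local.append(chunk)

    def context(self):
        """Cached chunks paired with intra-cache positions 0, 1, 2, ..."""
        return list(enumerate(self.sink + list(self.local)))

cache = RollingChunkCache(num_sink=1, window=2)
for chunk_id in range(6):
    cache.append(f"chunk{chunk_id}")
context = cache.context()
```

After six chunks the context holds the sink chunk plus the two most recent chunks, with positions 0 to 2. The position range never exceeds \(S + W\), however long the stream runs, which is the property the constant-latency claim relies on.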
Loss & Training
- Teacher training: flow matching loss \(\mathcal{L}_{\text{FM}} = \mathbb{E}_{z_0,z_1,t}[w_t \| v_\theta(z_{t'},t',c_t,c_m) - (z_1-z_0) \|^2]\), where \(w_t\) here is a timestep-dependent loss weight rather than the text guidance weight. Training runs in two stages: OpenVid-1M for 4.8K steps, then a synthetic-data finetune for 800 steps.
- Causal adaptation: regression on 4,000 teacher-generated ODE trajectories (2,000 steps).
- Self Forcing DMD: 1:5 generator-to-critic update ratio, converging in about 400 steps.
- Total hardware: 32×A100 for ~3 days (teacher) plus 20 hours (distillation).
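A single-sample version of the flow matching objective can be written in a few lines. The regression target \(z_1 - z_0\) comes from the loss above; the linear interpolation \(z_t = (1-t)\,z_0 + t\,z_1\) is the usual rectified-flow convention and is an assumption here, as is folding the conditions \(c_t, c_m\) into the hypothetical `v_theta` callable.

```python
def flow_matching_loss(v_theta, z0, z1, t, weight=1.0):
    """Weighted squared error between the predicted and the target velocity.

    v_theta : callable (z_t, t) -> list of floats; conditioning is assumed to be
              captured inside the callable
    z0, z1  : endpoint samples as equal-length lists of floats
    """
    z_t = [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]  # assumed linear path
    target = [b - a for a, b in zip(z0, z1)]               # z_1 - z_0
    pred = v_theta(z_t, t)
    return weight * sum((p - g) ** 2 for p, g in zip(pred, target))

z0, z1 = [0.0, 1.0], [1.0, 3.0]
loss_zero_model = flow_matching_loss(lambda z, t: [0.0, 0.0], z0, z1, t=0.5)
loss_oracle = flow_matching_loss(lambda z, t: [1.0, 2.0], z0, z1, t=0.5)
```

A predictor that always outputs zero pays the full squared norm of the target, while one that returns \(z_1 - z_0\) exactly reaches zero loss.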
Key Experimental Results
Motion Transfer: Reconstruction Quality
| Method | Backbone | FPS | PSNR↑ | LPIPS↓ | EPE↓ |
|---|---|---|---|---|---|
| Go-With-The-Flow | CogVideoX-5B | 0.60 | 15.62 | 0.490 | 41.99 |
| Diffusion-As-Shader | CogVideoX-5B | 0.29 | 15.80 | 0.483 | 40.23 |
| ATI | Wan 2.1-14B | 0.23 | 15.33 | 0.473 | 17.41 |
| MotionStream Teacher | Wan 2.1-1.3B | 0.79 | 16.61 | 0.427 | 5.35 |
| MotionStream Causal | Wan 2.1-1.3B | 16.7 | 16.20 | 0.443 | 7.80 |
New View Synthesis (LLFF Dataset)
| Method | Resolution | FPS | PSNR↑ | LPIPS↓ |
|---|---|---|---|---|
| DepthSplat | 576P | 1.40 | 13.9 | 0.30 |
| ViewCrafter | 576P | 0.26 | 14.0 | 0.30 |
| SEVA | 576P | 0.20 | 14.1 | 0.29 |
| MotionStream Teacher | 480P | 0.79 | 16.0 | 0.21 |
| MotionStream Causal | 480P | 16.7 | 15.7 | 0.23 |
Ablation Study: Attention Configurations
| Config | LPIPS↓ | EPE↓ | Latency (s, mean±std) | Throughput (FPS, mean±std) |
|---|---|---|---|---|
| c3s1w1 (Standard) | 0.464 | 25.34 | 0.70±0.01 | 16.92±0.80 |
| c3s0w1 (No sink) | 0.501 | 26.64 | 0.68±0.005 | 17.43±0.88 |
| c1s1w1 (chunk=1) | 0.597 | 76.21 | 0.30±0.01 | 13.26±1.36 |
| Sliding window | 0.480 | 28.09 | 0.80±0.08 | 14.96±1.42 |
Key Findings
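- MotionStream Causal runs roughly 28-73× faster than the listed baselines (16.7 FPS vs. 0.23-0.60 FPS) while achieving SOTA on DAVIS/Sora motion tracking metrics.
- Outperforms specialized 3D methods (DepthSplat/ViewCrafter/SEVA) in zero-shot camera control: against SEVA, the strongest baseline in the table, the causal student gains 1.6 PSNR and lowers LPIPS by 0.06.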
- Attention sinks are critical: removing them degrades LPIPS from 0.464 to 0.501 and causes visible color drift in long videos.
- Counter-intuitive finding: larger attention windows can degrade quality because attending to long-past history accumulates errors.
- Sliding window approaches show latency jitter (±0.08s vs ±0.01s) due to unstable computation from unbounded RoPE positions.
- Tiny VAE increases Wan 2.1 FPS from 16.7 to 29.5 with negligible quality loss.
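The speed and quality margins quoted in these findings can be rechecked from the tables above with a few lines of arithmetic. The numbers are copied from the motion-transfer and LLFF tables; the variable names are only illustrative.

```python
# FPS values from the motion-transfer table.
baseline_fps = {"Go-With-The-Flow": 0.60, "Diffusion-As-Shader": 0.29, "ATI": 0.23}
causal_fps = 16.7
teacher_fps = 0.79

speedups = {name: round(causal_fps / fps, 1) for name, fps in baseline_fps.items()}
speedup_vs_teacher = round(causal_fps / teacher_fps, 1)

# LLFF table: causal student vs. SEVA.
psnr_gain = round(15.7 - 14.1, 2)
lpips_change = round(0.23 - 0.29, 2)
```

On these table values the causal student is about 28× to 73× faster than the three baselines and about 21× faster than its own teacher.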
Highlights & Insights
- Paradigm shift: A speedup of one to two orders of magnitude moves motion-controlled video from "rendering" to "interactive creation."
- Cross-domain Attention Sinks: The "initial token concentration" observed in LLMs is successfully applied to video diffusion, using initial frames as anchors to prevent drift.
- Matched Distribution Training: Unlike methods like TalkingMachines, MotionStream uses the exact inference rolling KV cache during self-rollout, eliminating the train-test mismatch.
- Guidance Complementarity: Pure trajectory guidance leads to rigid motion, while joint guidance (\(w_t=3.0, w_m=1.5\)) ensures both natural movement and precise tracking.
Limitations & Future Work
- Fixed attention sinks are unsuitable for complete scene cuts; dynamic anchor refreshing is needed.
- Physically impossible trajectories still cause temporal inconsistency or appearance distortion.
- Wan 2.1 (1.3B) is more robust at maintaining structure than larger models like Wan 2.2 (5B).
- Trajectory disappearance: Models struggle to distinguish between occlusion and "no specified control" (both being zero values).
Related Work & Insights
- vs Motion Prompting: Both use 2D trajectories, but Motion Prompting is slow and offline (12 minutes for a 5-second video), whereas MotionStream is real-time and causal (29 FPS).
- vs Self Forcing: MotionStream solves the latency jitter and long-video drift issues found in the original unbounded sliding window framework by using attention sinks.
- vs TalkingMachines: MotionStream avoids the temporal discontinuity issues of synchronized denoising by using full causal simulation during training.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First real-time streaming motion-controlled video system.
- Experimental Thoroughness: ⭐⭐⭐⭐ Broad coverage across motion transfer, camera control, and ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear system design and insightful ablation analysis.
- Value: ⭐⭐⭐⭐⭐ Significant advancement for interactive video creation.