Pixel Motion Diffusion Is What We Need for Robot Control¶
Conference: CVPR 2026 arXiv: 2509.22652 Code: Available Area: Image Generation Keywords: Pixel motion diffusion, robot control, vision-language-action, optical flow representation, hierarchical diffusion policy
TL;DR¶
DAWN proposes a two-stage fully diffusion-based framework — a Motion Director that generates dense pixel motion fields as interpretable intermediate representations, and an Action Expert that converts these fields into executable robot action sequences — achieving SOTA on CALVIN (Avg Len 4.00), MetaWorld (Overall 65.4%), and real-world benchmarks, with substantially smaller model capacity and training data than competing methods.
Background & Motivation¶
Background: Vision-Language-Action (VLA) models leverage large-scale web data to achieve broad generalization, yet remain limited in motion perception and spatial reasoning. Existing motion-guided approaches follow two lines: (1) sparse pixel trajectories (General Flow, FLIP, Track2Act, etc.) that extract motion cues from keypoints or sparse points; (2) future RGB frame prediction (SuSIE, UniPi, VPP, etc.) that uses video diffusion models to generate future observations and then derive actions.
Limitations of Prior Work:
- Insufficient information from sparse trajectories: Tracking only a small number of keypoints fails to describe full-scene motion, losing critical spatial information in complex manipulation tasks.
- High cost of RGB frame prediction: Generating complete video frames in high-dimensional RGB space is computationally expensive and lacks explicit motion structure.
- Indirect extraction introduces complexity: Gen2Act first generates video and then tracks pixels to extract motion, introducing unnecessary indirection and error accumulation.
Key Insight: Rather than generating complete RGB frames and indirectly extracting motion, directly predicting dense pixel motion preserves full-scene motion information while greatly reducing generation complexity: pixel motion fields are structurally simpler and more learnable than RGB frames.
Goal: DAWN (Diffusion is All We Need) — a two-stage pipeline in which both stages use diffusion models: a Motion Director predicts dense pixel motion fields in latent space, and an Action Expert converts these fields into action sequences, forming a fully trainable, end-to-end, interpretable control pipeline.
Method¶
Overall Architecture¶
DAWN adopts a dual-diffusion hierarchical structure:
- Motion Director (high-level controller): Built upon a pretrained latent diffusion model (LDM), it takes multi-view images (static view + gripper view) and language instructions as input, and generates dense pixel motion fields \(\mathbf{F}'_{t,k} = [u, v, (u+v)/2]\).
- Action Expert (low-level controller): Built upon a diffusion Transformer, it jointly encodes pixel motion fields, visual observations, language instructions, and robot states, and denoises to generate executable action sequences.
The two modules are connected via structured pixel motion representations, maintaining modular upgradability while providing an intuitive and interpretable intermediate abstraction.
Key Designs¶
1. Dense Pixel Motion Representation¶
- Function: Encodes scene-level motion intent as a structured intermediate representation, bridging high-level language understanding and low-level action generation.
- Mechanism: The pixel motion from frame \(\mathbf{I}_t\) to \(\mathbf{I}_{t+k}\) is defined as \(\mathbf{F}_{t,k} = [u, v]\), where \(u, v \in \mathbb{R}^{H \times W}\) denote horizontal and vertical displacements, respectively. To reuse pretrained RGB diffusion models, this is encoded as a three-channel image \(\mathbf{F}'_{t,k} = [u, v, (u+v)/2]\).
- Design Motivation: (1) Dense motion retains more complete scene dynamics than sparse keypoints; (2) three-channel encoding enables direct transfer of RGB-pretrained diffusion models; (3) motion fields are lower-dimensional and more regularly structured than RGB frames, reducing generation difficulty.
- Training Label Generation: Ground-truth pixel motion is extracted from frame pairs \((\mathbf{I}_t, \mathbf{I}_{t+k})\) using the RAFT optical flow model.
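The three-channel encoding described above is simple to express in code. A minimal NumPy sketch follows; `flow_to_three_channel` is an illustrative name, not taken from the paper's released code:

```python
import numpy as np

def flow_to_three_channel(flow: np.ndarray) -> np.ndarray:
    """Encode a dense pixel-motion field F_{t,k} = [u, v] of shape (H, W, 2)
    as a three-channel 'image' F'_{t,k} = [u, v, (u+v)/2], so that an
    RGB-pretrained latent diffusion model can consume it directly."""
    u, v = flow[..., 0], flow[..., 1]
    return np.stack([u, v, (u + v) / 2.0], axis=-1)  # shape (H, W, 3)
```

In practice the channels would also be scaled to the value range the pretrained VAE expects; that normalization step is omitted here.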
2. Motion Director Architecture¶
- Function: Conditionally generates dense pixel motion fields.
- Mechanism: Built on a pretrained LDM, the current frame's VAE encoding (without added noise) is concatenated with Gaussian noise as U-Net input. Language embeddings (CLIP text encoder), gripper-view embeddings (CLIP visual encoder), and time offset \(k\) are injected via cross-attention.
- Design Motivation: The noiseless current-frame encoding serves as a spatial alignment condition; zero-initialized weights introduce gripper-view conditioning without disrupting pretrained model behavior at the start of training.
- Training Strategy: Only the U-Net denoiser weights are updated; VAE, CLIP, and other pretrained modules are frozen. MSE noise estimation loss is used.
3. Action Expert Architecture¶
- Function: Converts pixel motion fields into low-level executable robot action sequences.
- Mechanism: Multimodal encoding → cross-attention conditioned denoising Transformer → iterative denoising to generate action chunks. Encoder configurations: DINOv3 ConvNeXt-S encodes pixel motion and visual observations; T5-small encodes language instructions; a 2-layer MLP encodes robot states.
- Design Motivation: (1) A diffusion Transformer, rather than an MLP denoiser, better models complex dependencies among multimodal conditions; (2) freezing visual/text encoders leverages pretrained representations, and training only the denoiser and state encoder reduces data requirements.
- Loss & Training: MSE noise estimation loss; action chunks are predicted in action space.
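At inference time, the Action Expert's iterative denoising can be sketched as ancestral DDPM sampling over an action chunk. The linear beta schedule, chunk size, and `denoiser` callable are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def sample_action_chunk(denoiser, cond, chunk_len=8, act_dim=7,
                        n_steps=10, seed=0):
    """Denoise an action chunk of shape (chunk_len, act_dim) from Gaussian
    noise, conditioned on encoded motion/vision/language/state features."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal((chunk_len, act_dim))  # start from pure noise
    for t in reversed(range(n_steps)):
        eps_hat = denoiser(x, t, cond)             # predicted noise
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) \
            / np.sqrt(alphas[t])
        if t > 0:                                  # no noise at the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x
```

Here `cond` stands for the concatenated DINOv3 / T5-small / state-MLP embeddings attended to by the denoising Transformer.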
Training & Inference Pipeline¶
- Parallel Training: The Motion Director and Action Expert can be trained in parallel — the former uses RAFT optical flow as ground truth, the latter uses corresponding GT optical flow and actions. Optionally, the Action Expert can be further fine-tuned on actual Motion Director outputs.
- Inference Pipeline: Observation encoding → Motion Director (25 diffusion steps) → pixel motion field → Action Expert denoising → action sequence → execution → observation update → closed-loop.
- Modular Upgradability: The two modules can be independently replaced or upgraded, facilitating integration of future stronger visual or control models.
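The closed-loop inference pipeline above can be sketched end to end. All class and method names below are hypothetical stand-ins for the two modules, not the released API:

```python
class MotionDirector:
    """Stand-in for the high-level latent-diffusion stage."""
    def sample(self, obs, instruction, num_steps=25):
        return {"flow": None}  # dense pixel-motion field (placeholder)

class ActionExpert:
    """Stand-in for the low-level diffusion-Transformer stage."""
    def denoise(self, obs, flow, instruction, state):
        return [[0.0] * 7 for _ in range(8)]  # one action chunk (placeholder)

def closed_loop(env, director, expert, instruction, max_chunks=2):
    """Observe -> generate motion field (25 diffusion steps) -> denoise an
    action chunk -> execute -> re-observe, repeated in closed loop."""
    obs, state = env.reset()
    for _ in range(max_chunks):
        flow = director.sample(obs, instruction)
        chunk = expert.denoise(obs, flow, instruction, state)
        for action in chunk:
            obs, state = env.step(action)
    return obs
```

Because the two stages communicate only through the pixel-motion field, either stub can be swapped for a stronger model without touching the other, which is the modularity claim above.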
Key Experimental Results¶
Main Results¶
CALVIN ABC→D (no external robot data, Table 1):
| Method | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Avg Len ↑ |
|---|---|---|---|---|---|---|
| Diffusion Policy | 0.40 | 0.12 | 0.03 | 0.01 | 0.00 | 0.56 |
| Robo-Flamingo | 0.82 | 0.62 | 0.47 | 0.33 | 0.24 | 2.47 |
| MoDE | 0.92 | 0.79 | 0.67 | 0.56 | 0.45 | 3.39 |
| RoboUniview | 0.94 | 0.84 | 0.73 | 0.62 | 0.51 | 3.65 |
| Seer-Large | 0.93 | 0.85 | 0.76 | 0.69 | 0.60 | 3.83 |
| VPP | 0.96 | 0.88 | 0.78 | 0.71 | 0.60 | 3.93 |
| Enhanced DP (ours) | 0.82 | 0.67 | 0.53 | 0.41 | 0.35 | 2.78 |
| DAWN (ours) | 0.98 | 0.91 | 0.79 | 0.71 | 0.61 | 4.00 |
CALVIN ABC→D (with external data, Table 2):
| Method | Extra Data | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Avg Len ↑ |
|---|---|---|---|---|---|---|---|
| GR-1 | Ego4D | 0.85 | 0.71 | 0.60 | 0.50 | 0.40 | 3.06 |
| LTM | OpenX | 0.97 | 0.82 | 0.73 | 0.67 | 0.61 | 3.81 |
| MoDE | Multiple | 0.96 | 0.89 | 0.81 | 0.72 | 0.65 | 4.01 |
| VPP | Multiple | 0.97 | 0.91 | 0.87 | 0.82 | 0.77 | 4.33 |
| DreamVLA | DROID | 0.98 | 0.95 | 0.90 | 0.83 | 0.78 | 4.44 |
| DAWN (ours) | DROID | 0.98 | 0.92 | 0.81 | 0.75 | 0.64 | 4.10 |
MetaWorld 11-task Success Rate (Table 3; 8 of the 11 tasks shown, Overall averaged over all 11):
| Method | door-open | door-close | basketball | shelf-place | btn-press | faucet-close | hammer | assembly | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Diffusion Policy | 45.3 | 45.3 | 8.0 | 0.0 | 40.0 | 22.7 | 4.0 | 1.3 | 24.1 |
| ATM | 75.3 | 90.7 | 24.0 | 16.3 | 77.3 | 50.0 | 4.3 | 2.0 | 52.0 |
| LTM | 77.3 | 95.0 | 39.0 | 20.3 | 82.7 | 52.3 | 10.3 | 7.7 | 57.7 |
| DAWN (ours) | 94.7 | 97.3 | 42.0 | 24.7 | 92.0 | 76.3 | 12.7 | 10.7 | 65.4 |
Real-world lift-and-place experiment (Table 4, 20 random initializations per task):
| Method | Apple | Avocado | Banana | Grape | Kiwi | Orange | Inference Latency (ms) |
|---|---|---|---|---|---|---|---|
| Enhanced DP | 5→4 | 6→6 | 5→4 | 4→3 | 5→5 | 4→4 | 112.77 |
| π₀ | 10→9 | 6→6 | 5→3 | 8→5 | 5→3 | 8→7 | 571.89 |
| VPP | 16→14 | 15→15 | 15→14 | 17→17 | 15→15 | 16→14 | 190.55 |
| DAWN | 19→19 | 20→19 | 17→16 | 19→19 | 17→16 | 18→16 | 319.82 |
(Notation X→Y: X = number of successful lifts, Y = number of successful placements, out of 20 trials per task)
Ablation Study (CALVIN ABC→D, Table 6)¶
(a) Pixel motion vs. other intermediate representations:
| Setting | Avg Len |
|---|---|
| No intermediate representation (Action Expert only) | 2.78 |
| RGB target image | 3.21 |
| Pixel motion (U-Net trained from scratch) | 3.42 |
| Pixel motion (pretrained LDM) | 4.00 |
(b) Gripper-view conditioning:
| Setting | Avg Len |
|---|---|
| VPP without gripper view | 3.58 |
| DAWN without gripper view | 3.74 |
| DAWN with gripper view | 4.00 |
(c) Motion Director diffusion steps:
| Steps | 2 | 10 | 25 | 40 |
|---|---|---|---|---|
| Avg Len | 3.88 | 3.96 | 4.00 | 3.95 |
Bimanual manipulation (Table 5): DAWN achieves an action prediction MSE of 0.117 on Galaxea R1-Lite bimanual manipulation, outperforming Enhanced DP (0.128).
Key Findings¶
- DAWN achieves SOTA without external data (4.00 > VPP 3.93), demonstrating high data efficiency.
- Pixel motion substantially outperforms RGB target images (4.00 vs. 3.21), and transfer from pretrained LDM further improves performance (4.00 vs. 3.42).
- Strong semantic understanding: the model performs well on semantically similar but distinct task pairs (door-open 94.7% vs. door-close 97.3%).
- Motion Director achieves Avg Len 3.88 with only 2 diffusion steps, indicating that motion information is highly concentrated in the early denoising steps.
- In real-world settings, reliable transfer is achieved with only 1,000 episodes and 100k fine-tuning steps; DAWN's grasp failure rate is near zero.
- The framework generalizes effectively to bimanual manipulation scenarios.
Highlights & Insights¶
- Dense motion fields as a universal intermediate language: Compared to sparse trajectories and RGB frames, dense pixel motion preserves complete spatial information while reducing generation complexity; the three-channel encoding elegantly reuses RGB pretrained weights.
- Surprising effectiveness of pretrained transfer: An LDM trained on RGB images transfers efficiently to pixel motion generation (from-scratch 3.42 vs. pretrained 4.00), suggesting that motion fields and RGB images share meaningful structure in latent space.
- Modular design enables high data efficiency: Freezing pretrained encoders and training only the denoiser allows the model to match or exceed SOTA with significantly smaller model capacity and less training data than VLA methods.
- Interpretable intermediate representation: Pixel motion fields are directly visualizable, enabling users to understand the model's motion intent — a practically valuable property for safe robot deployment.
Limitations & Future Work¶
- Two-stage sequential inference introduces additional latency (319 ms vs. 113 ms for Enhanced DP), limiting real-time applicability.
- RAFT optical flow used as training labels may introduce noise in scenes with occlusion or large deformations.
- Performance with external data is lower than VPP/DreamVLA (4.10 vs. 4.33/4.44), leaving room to improve large-scale data utilization.
- Real-world experiments cover only lift-and-place and single-category bimanual tasks; complex long-horizon manipulation has not been validated.
- Motion is predicted at a single future offset \(k\), so the model lacks multi-step planning capability.
Related Work & Insights¶
- Gen2Act / FLIP: Indirectly extract motion trajectories from generated videos → DAWN bypasses video generation and directly predicts motion, achieving greater efficiency.
- Diffusion Policy: End-to-end diffusion action generation without motion intermediate abstraction → DAWN's Action Expert, conditioned on motion, yields a substantial gain (2.78→4.00).
- VPP: Extracts video diffusion features in RGB space as implicit motion representations → DAWN replaces these with explicit pixel motion, performing better without external data.
- π₀: Large-scale VLA flow-matching model → DAWN substantially outperforms it in real-world settings with a smaller model and less data.
- Insight: Pretrained image diffusion models can be viewed as general-purpose visual prediction engines whose capabilities extend well beyond RGB image generation, transferring to structured output tasks such as motion prediction.
Rating¶
| Dimension | Score (1–5) | Note |
|---|---|---|
| Novelty | 4 | Dense pixel motion + dual diffusion pipeline; three-channel encoding for reusing RGB pretrained weights is elegant |
| Technical Depth | 3.5 | Well-engineered system, though the core principle is relatively straightforward (LDM + optical flow GT) |
| Experimental Thoroughness | 4.5 | Three major benchmarks (CALVIN / MetaWorld / real world) + bimanual + detailed ablations |
| Writing Quality | 4 | Clear structure with thorough motivation and comparative analysis |
| Value | 4 | High data efficiency, interpretable, modular, and deployable |
| Overall | 4.0 | A practical framework achieving SOTA with smaller models and less data |