DAWN: Pixel Motion Diffusion is What We Need for Robot Control¶
Conference: CVPR 2026 arXiv: 2509.22652 Code: https://eronguyen.github.io/DAWN/ Area: Multimodal VLM / Robot Control Keywords: Pixel motion diffusion, vision-language-action, robot manipulation, two-stage diffusion, optical flow representation
TL;DR¶
This paper proposes DAWN, a two-stage fully diffusion-based vision-language-action framework. A Motion Director (latent diffusion model) generates dense pixel motion fields as interpretable intermediate representations, while an Action Expert (diffusion Transformer policy) translates pixel motion into executable robot actions. DAWN achieves state-of-the-art performance on the CALVIN benchmark (average length 4.00) and demonstrates strong generalization on real-world single-arm and dual-arm manipulation tasks.
Background & Motivation¶
While VLA (Vision-Language-Action) models have achieved notable progress, most directly map visual observations to actions without explicit modeling of motion intent. Some approaches employ video prediction as an intermediate step, but operating in RGB space increases learning difficulty. Pixel trajectories have proven effective as motion representations; however, existing methods rely on sparse pixel tracking or indirectly extract motion from generated videos—neither of which is as concise or efficient as directly predicting dense pixel motion fields.
Core Problem¶
How to design a structured, interpretable, and efficient intermediate motion representation that bridges high-level language intent and low-level robot actions?
Method¶
Overall Architecture¶
A two-stage diffusion architecture:

1. Motion Director: an LDM (Latent Diffusion Model) conditioned on the current observation and language instruction, outputting dense pixel motion fields.
2. Action Expert: a Diffusion Transformer policy conditioned on pixel motion, observations, language, and robot state, outputting action sequences.
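A minimal sketch of how the two stages could be chained at inference time is shown below; the `MotionDirector`/`ActionExpert` objects, their `sample(...)` interfaces, and the tensor shapes are hypothetical placeholders rather than the paper's actual API.

```python
# Sketch (assumed interfaces): two-stage DAWN inference, motion first, then actions.
import torch

@torch.no_grad()
def dawn_inference(motion_director, action_expert, obs, gripper_obs, instruction, state):
    # Stage 1: Motion Director denoises a latent into a dense pixel motion map,
    # conditioned on the current frame, gripper view, and language instruction.
    pixel_motion = motion_director.sample(
        frame=obs, gripper=gripper_obs, instruction=instruction, num_steps=25
    )  # (B, 3, H, W): [u, v, (u + v) / 2]

    # Stage 2: Action Expert denoises Gaussian noise into an action chunk,
    # conditioned on the predicted motion, observations, language, and robot state.
    actions = action_expert.sample(
        motion=pixel_motion, obs=obs, instruction=instruction, state=state
    )  # (B, T, action_dim)
    return actions
```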
Key Designs¶
- Motion Director (High-Level Motion Planning): Built on a pretrained Stable Diffusion U-Net. The current frame is VAE-encoded and concatenated with noise as the input. Language embeddings (CLIP text encoder), gripper-view embeddings (CLIP visual encoder), and the temporal offset \(k\) are injected via cross-attention. The decoded output is a 3-channel pixel motion map \(F'_{t,k} = [u, v, (u+v)/2]\). Ground-truth motion fields are generated with the RAFT optical flow model during training (see the motion-target sketch after this list). Only the U-Net parameters are updated; all encoders and the VAE are frozen.
- Action Expert (Low-Level Action Execution): A Transformer-based diffusion policy with four separate encoders for pixel motion (DINOv3 ConvNeXt-S), visual observations, language instructions (T5-small), and robot state (2-layer MLP). All condition tokens are concatenated and injected into the denoising Transformer via cross-attention, which iteratively denoises Gaussian noise into an action sequence (see the conditioning sketch after this list).
- Advantages of Pixel Motion as an Intermediate Representation:
    - More structured than RGB video prediction—directly encodes motion direction and magnitude.
    - More informative than sparse pixel tracking—provides dense coverage of the full scene.
    - Interpretable—motion fields can be directly visualized as "how the model intends the scene to move."
    - Embodiment-agnostic—pixel motion does not depend on specific robot joint configurations.
- Modular Design with Parallel Training: Motion Director and Action Expert can be trained in parallel (using RAFT optical flow as ground truth), with optional fine-tuning of the Action Expert on actual Motion Director outputs.
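As referenced in the Motion Director item above, here is a minimal sketch of how the 3-channel ground-truth motion target \(F'_{t,k} = [u, v, (u+v)/2]\) could be assembled from RAFT optical flow between frame \(t\) and frame \(t+k\), using torchvision's RAFT. The `pixel_motion_target` helper and the `max_mag` normalization are assumptions for illustration, not details from the paper.

```python
# Sketch (assumed helper): build the pixel motion target F'_{t,k} = [u, v, (u+v)/2]
# from RAFT optical flow. Scaling by max_mag is an assumption.
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).eval()

@torch.no_grad()
def pixel_motion_target(frame_t, frame_tk, max_mag=64.0):
    # frame_t, frame_tk: (B, 3, H, W) RGB tensors; H and W divisible by 8 (RAFT requirement).
    img1, img2 = weights.transforms()(frame_t, frame_tk)  # RAFT preprocessing
    flow = raft(img1, img2)[-1]                           # (B, 2, H, W) refined flow
    u, v = flow[:, 0], flow[:, 1]
    motion = torch.stack([u, v, (u + v) / 2], dim=1)      # 3-channel motion map
    return (motion / max_mag).clamp(-1.0, 1.0)            # scale to [-1, 1] for the LDM/VAE
```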
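For the Action Expert item, a sketch of how the four per-modality encoders could feed one concatenated context sequence for the denoising Transformer's cross-attention. The real backbones (DINOv3 ConvNeXt-S, T5-small) are stood in by linear projections over precomputed features; all dimensions, names, and shapes are assumptions.

```python
# Sketch (assumed wiring): concatenate per-modality condition tokens for the
# Action Expert's cross-attention. Feature dims and module names are assumptions.
import torch
import torch.nn as nn

class ActionExpertConditioner(nn.Module):
    def __init__(self, motion_dim=768, obs_dim=768, lang_dim=512, state_dim=14, d_model=512):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, d_model)  # stand-in for DINOv3 ConvNeXt-S features
        self.obs_proj = nn.Linear(obs_dim, d_model)        # stand-in for the visual observation encoder
        self.lang_proj = nn.Linear(lang_dim, d_model)      # stand-in for T5-small text features
        self.state_mlp = nn.Sequential(                    # 2-layer MLP on the robot state
            nn.Linear(state_dim, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, motion_feats, obs_feats, lang_feats, state):
        # motion_feats / obs_feats / lang_feats: (B, N_x, dim_x) token features; state: (B, state_dim)
        return torch.cat([
            self.motion_proj(motion_feats),
            self.obs_proj(obs_feats),
            self.lang_proj(lang_feats),
            self.state_mlp(state).unsqueeze(1),
        ], dim=1)  # (B, N_total, d_model) cross-attention context for the diffusion Transformer
```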
Loss & Training¶
- Both models are trained with an MSE noise-prediction (denoising) loss; see the training-step sketch after this list.
- Motion Director: 100k steps, batch size 16/GPU, 4× A6000 GPUs.
- Action Expert: 10k steps, batch size 64/GPU.
- AdamW optimizer, lr=1e-4, mixed-precision training.
- Inference uses 25 diffusion steps for the Motion Director.
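A generic DDPM-style training step for this objective might look as follows; the scheduler handling, model signature, and symbol names are standard epsilon-prediction boilerplate rather than the paper's code.

```python
# Sketch: standard epsilon-prediction (MSE) diffusion loss shared by both stages.
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, cond, alphas_cumprod):
    # x0: clean target (pixel-motion latent or action chunk); cond: conditioning tokens.
    B = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=x0.device)  # random timesteps
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(B, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # forward (noising) process
    pred_noise = model(x_t, t, cond)                         # network predicts the added noise
    return F.mse_loss(pred_noise, noise)
```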
Key Experimental Results¶
CALVIN ABC→D (No External Data)¶
| Method | 1st ↑ | 2nd ↑ | 3rd ↑ | 4th ↑ | 5th ↑ | Avg Len ↑ |
|---|---|---|---|---|---|---|
| Diffusion Policy | 0.40 | 0.12 | 0.03 | 0.01 | 0.00 | 0.56 |
| MoDE | 0.92 | 0.79 | 0.67 | 0.56 | 0.45 | 3.39 |
| Seer-Large | 0.96 | 0.89 | 0.80 | 0.71 | 0.60 | 3.96 |
| DAWN | 0.97 | 0.89 | 0.82 | 0.72 | 0.60 | 4.00 |
MetaWorld (11 Tasks)¶
| Method | Avg Success Rate ↑ |
|---|---|
| LTM | 57.7% |
| ATM | 52.0% |
| DAWN | 65.4% |
Real-World Single-Arm Manipulation (xArm7, 1000 episodes, 6 object pick-and-place categories)¶
| Method | Overall Success Rate | Inference Latency (ms) |
|---|---|---|
| Enhanced DP | Lower | 112.77 |
| \(\pi_0\) | Medium (frequent wrong grasps) | 571.89 |
| VPP | High | 190.55 |
| DAWN | Highest | 319.82 |
DAWN achieves the highest success rate across nearly all object categories with a very low mis-grasp rate.
Ablation Study¶
- Pixel motion vs. RGB target: Pixel motion (4.00) >> RGB target image (3.21) >> No intermediate representation (2.78).
- Pretrained vs. training from scratch: Pretrained LDM for pixel motion (4.00) > From scratch (3.42)—pretraining on image generation substantially benefits pixel motion prediction.
- Gripper view: Removing gripper view degrades performance to 3.74, indicating the importance of occlusion and hand-object interaction information.
- Diffusion steps: 2 steps (3.88) → 25 steps (4.00) → 40 steps (3.95); 25 steps is optimal.
- Dual-arm manipulation: Pixel motion similarly reduces action prediction MSE in dual-arm scenarios.
Highlights & Insights¶
- Dense pixel motion as a universal motion representation: More concise than RGB prediction, richer than sparse trajectories, and more general than keypoints.
- First adaptation of pretrained LDM for dense pixel motion generation: Effectively leverages large-scale image generation pretraining.
- Modular interpretability: The output of Motion Director can be directly visualized, providing transparent insight into model decisions.
- Data efficiency: Strong generalization achieved with only 1,000 real-world episodes, demonstrating the advantage of structured intermediate representations.
- Dual-arm extensibility: The approach is validated on the Galaxea R1-Lite dual-arm platform, confirming its generality.
Limitations & Future Work¶
- Inference latency is relatively high (319ms vs. 113ms for Enhanced DP) due to the two-stage diffusion pipeline.
- Ground-truth optical flow from RAFT may be inaccurate in textureless regions or under fast motion.
- When external data (DROID) is used, performance improves over the no-external-data setting but does not surpass VPP and DreamVLA at the top of the leaderboard.
- The method has only been validated on surface manipulation tasks; contact-rich tasks such as assembly remain unexplored.
Related Work & Insights¶
- vs. VLA (OpenVLA, RT-2, \(\pi_0\)): VLA methods perform end-to-end mapping without interpretable intermediate representations; DAWN provides transparent motion planning via pixel motion fields.
- vs. Gen2Act (video prediction → motion extraction): Gen2Act generates RGB video first and then tracks pixels to extract motion, introducing cascading errors across two steps; DAWN directly predicts motion in latent space.
- vs. VPP (video diffusion features): VPP extracts predictive features from a video diffusion model without explicitly generating motion; DAWN's motion fields are more interpretable and demonstrate stronger advantages in low-data regimes.
Insights and Connections:

- Pixel motion as a universal intermediate representation for robot control can be combined with the visual reasoning capabilities of VLMs, enabling models to understand and generate motion intent.
- The successful adaptation of pretrained image diffusion models to pixel motion prediction suggests broader applications of LDMs beyond RGB outputs.
- The modular design allows independent upgrades of high-level and low-level components, facilitating rapid iteration.
Rating¶
- Novelty: ⭐⭐⭐⭐ Dense pixel motion as a unified intermediate representation combined with a dual-diffusion architecture constitutes an effective and novel design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated on CALVIN, MetaWorld, real-world single-arm, and real-world dual-arm settings with comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ The paper is well-structured with detailed method descriptions.
- Value: ⭐⭐⭐⭐ Provides a concise and effective structured intermediate representation framework for robot learning.