# Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
**Conference:** ICCV 2025 · **arXiv:** 2503.19757 · **Code:** project page · **Area:** Robot Policy / VLA Models · **Keywords:** VLA, diffusion policy, DiT, in-context conditioning, cross-embodiment
## TL;DR
This paper proposes Dita (Diffusion Transformer Policy). Unlike prior methods that denoise from compressed embeddings with shallow heads, Dita adopts in-context conditioning: a single causal Transformer processes the full token sequence of language, image, timestep, and noisy-action tokens, so denoising attends directly to raw visual tokens. With only 334M parameters, Dita achieves state-of-the-art or competitive performance on SimplerEnv (zero-shot), LIBERO, CALVIN, and other benchmarks.
## Background & Motivation
- Background: Generalist robot policies have advanced through pretraining on large-scale cross-embodiment datasets such as OXE.
- Limitations of Prior Work: (1) Discretized actions (e.g., OpenVLA) limit adaptability to heterogeneous action spaces; (2) methods using MLP or small DiT diffusion heads (e.g., Octo, π₀) lack sufficient expressiveness under the diversity of large-scale data; (3) denoising on embeddings discards visual details from historical observations.
- Key Challenge: Heterogeneous cross-embodiment action spaces conflict with the need for a unified policy representation.
- Goal: Design an expressive, scalable generalist robot policy architecture.
- Key Insight: Place action denoising directly inside a causal Transformer that interacts with visual tokens.
- Core Idea: Action denoising should not operate on compressed embeddings but should instead perform in-context attention directly over raw visual patch tokens.
## Method
### Overall Architecture
CLIP encodes language → DINOv2 + Q-Former extract image features → concatenate [language, image, timestep, noisy action] token sequence → causal DiT denoising → output clean action chunk (16 steps).
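A minimal PyTorch sketch of this pipeline; all module names and dimensions here are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class DitaSketch(nn.Module):
    """Causal Transformer that denoises an action chunk in context (sketch)."""
    def __init__(self, d_model=768, action_dim=7, chunk=16):
        super().__init__()
        self.lang_proj = nn.Linear(512, d_model)     # CLIP text features -> model dim
        self.img_proj = nn.Linear(1024, d_model)     # DINOv2/Q-Former tokens -> model dim
        self.time_emb = nn.Embedding(1000, d_model)  # diffusion timestep embedding
        self.act_proj = nn.Linear(action_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=12)
        self.head = nn.Linear(d_model, action_dim)
        self.chunk = chunk

    def forward(self, lang_tokens, img_tokens, t, noisy_actions):
        # Concatenate [language, image, timestep, noisy action] into one sequence.
        seq = torch.cat([
            self.lang_proj(lang_tokens),
            self.img_proj(img_tokens),
            self.time_emb(t).unsqueeze(1),
            self.act_proj(noisy_actions),
        ], dim=1)
        # Causal mask: each token attends only to itself and earlier tokens,
        # so the trailing action tokens see all language/image tokens.
        L = seq.size(1)
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        out = self.backbone(seq, mask=mask)
        # Epsilon-prediction read off the last `chunk` (action) positions.
        return self.head(out[:, -self.chunk:])
```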
### Key Designs
**Design 1: In-Context Conditioning Diffusion** (sketched below)
- **Function:** Noisy action tokens and visual/language tokens are processed together within a single causal Transformer.
- **Mechanism:** Action tokens participate directly in the attention computation and can attend to every image patch token, capturing subtle action increments and environmental details.
- **Design Motivation:** Prior methods condition denoising on a single compressed embedding and lose spatial detail; in-context conditioning preserves the complete visual information.
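To make the contrast concrete, here is a shapes-only sketch (hypothetical tensors, not the paper's code) of diffusion-head conditioning versus in-context attention over raw patch tokens:

```python
import torch
import torch.nn.functional as F

B, P, A, D = 2, 256, 16, 768      # batch, image patches, action tokens, width
patches = torch.randn(B, P, D)    # raw DINOv2 patch tokens
noisy_act = torch.randn(B, A, D)  # noisy action tokens

# (a) Diffusion-head conditioning: one pooled embedding, spatial detail lost.
pooled = patches.mean(dim=1, keepdim=True)          # (B, 1, D)
head_input = torch.cat([pooled, noisy_act], dim=1)  # actions see a single vector

# (b) In-context conditioning: action tokens attend over every patch token.
attn_out = F.scaled_dot_product_attention(noisy_act, patches, patches)  # (B, A, D)
```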
**Design 2: End-to-End DINOv2 Fine-Tuning + Q-Former** (sketched below)
- **Function:** DINOv2 extracts multi-scale image features; a Q-Former queries the key visual features conditioned on the language instruction.
- **Mechanism:** DINOv2 pretrained on web data has a domain gap with robot data, and end-to-end fine-tuning bridges it. The Q-Former uses FiLM conditioning to select task-relevant information from the DINOv2 patch features, reducing computational cost.
- **Design Motivation:** Frozen visual encoders are insufficient for robotics, but full fine-tuning produces too many tokens, so Q-Former compression is needed.
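A minimal reconstruction of a FiLM-conditioned Q-Former, with all names and dimensions assumed rather than taken from the released code:

```python
import torch
import torch.nn as nn

class FiLMQFormer(nn.Module):
    """Learned queries cross-attend to DINOv2 patches; FiLM injects language."""
    def __init__(self, num_queries=32, d_model=768, lang_dim=512):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # FiLM: language features predict a per-channel scale (gamma) and shift (beta).
        self.film = nn.Linear(lang_dim, 2 * d_model)

    def forward(self, patch_tokens, lang_feat):
        # patch_tokens: (B, P, D) DINOv2 features; lang_feat: (B, lang_dim) CLIP text.
        B = patch_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        gamma, beta = self.film(lang_feat).chunk(2, dim=-1)
        # Modulate patch features with language before the queries read them out.
        x = gamma.unsqueeze(1) * patch_tokens + beta.unsqueeze(1)
        out, _ = self.cross_attn(q, x, x)  # (B, num_queries, D): compressed tokens
        return out
```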
**Design 3: Lightweight and Scalable Architecture** (sampling loop sketched below)
- **Function:** Achieves state-of-the-art performance with only 334M parameters.
- **Mechanism:** A LLaMA-style causal Transformer that requires no large VLM backbone (e.g., PaliGemma). DDPM training (1000 steps) with DDIM inference (20 steps).
- **Design Motivation:** Provides a clean, lightweight, open-source baseline that lowers the barrier to entry for the community.
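A hedged sketch of the DDPM-train / DDIM-infer split using the Hugging Face diffusers `DDIMScheduler`; the 1000/20 step counts come from the paper, while the scheduler choice and the model call (reusing the hypothetical signature from the architecture sketch above) are assumptions:

```python
import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(20)  # 20 DDIM steps at inference, as reported

@torch.no_grad()
def sample_actions(model, lang_tokens, img_tokens, chunk=16, action_dim=7):
    actions = torch.randn(1, chunk, action_dim)  # start the chunk from pure noise
    for t in scheduler.timesteps:
        # Predict the noise for the current noisy action chunk.
        eps = model(lang_tokens, img_tokens, t.unsqueeze(0), actions)
        actions = scheduler.step(eps, t, actions).prev_sample
    return actions
```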
### Loss & Training
Standard DDPM noise-prediction objective: \(\mathcal{L} = \mathbb{E}_{z_0,\,\epsilon,\,t}\,\|\epsilon - \epsilon_\theta(z_t, t, c)\|^2\), where \(c\) is the language/vision context. Training uses AdamW for 100K steps at batch size 8192 (32× A100); a 2-frame observation history predicts a 16-step action chunk. A training step is sketched below.
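A minimal training-step sketch under this objective, using a diffusers `DDPMScheduler`; the 1000-step schedule follows the paper, while the scheduler choice and the model signature (from the architecture sketch above) are assumptions:

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

ddpm = DDPMScheduler(num_train_timesteps=1000)  # 1000-step DDPM, as in the paper

def training_step(model, lang_tokens, img_tokens, clean_actions):
    # clean_actions: (B, 16, action_dim) ground-truth action chunk z_0.
    B = clean_actions.size(0)
    noise = torch.randn_like(clean_actions)          # epsilon
    t = torch.randint(0, 1000, (B,))                 # random diffusion timesteps
    noisy = ddpm.add_noise(clean_actions, noise, t)  # z_t
    eps_pred = model(lang_tokens, img_tokens, t, noisy)
    # min ||epsilon - epsilon_theta(z_t, t, c)||^2
    return F.mse_loss(eps_pred, noise)
```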
## Key Experimental Results
### Main Results
SimplerEnv zero-shot success rate (%); VM / VA = visual matching / variant aggregation.
| Method | coke_can (VM / VA) | move_near (VM / VA) |
|---|---|---|
| RT-1-X | 56.7 / 49.0 | 31.7 / 32.3 |
| OpenVLA | 16.3 / 54.5 | 46.2 / 47.7 |
| Dita | 83.7 / 85.5 | 76.0 / 73.0 |
LIBERO fine-tuning success rate (%)
| Method | SPATIAL | OBJECT | GOAL | LONG | Avg. |
|---|---|---|---|---|---|
| OpenVLA | 84.9 | 88.4 | 79.2 | 53.7 | 76.5 |
| Dita | 84.2 | 96.3 | 85.4 | 63.8 | 82.4 |
### Ablation Study
CALVIN ablation; Avg. Len. = average number of consecutive tasks completed (max 5).
| Configuration | CALVIN Avg. Len. |
|---|---|
| Diffusion head (non-in-context) | 3.16 |
| In-context Dita | 3.53 |
| No pretraining | 2.38 |
### Key Findings
- In-context conditioning significantly outperforms diffusion-head conditioning, especially on long-horizon tasks (LIBERO-LONG: +10.1 points over OpenVLA, 53.7 → 63.8).
- With only a third-person camera and 10-shot fine-tuning, the policy generalizes to novel real-world environments.
- A 334M-parameter model surpasses the 7B OpenVLA and other larger models, suggesting that architectural design matters more than parameter count.
## Highlights & Insights
- The core insight of in-context conditioning: action denoising needs access to raw visual details rather than compressed embeddings.
- The lightweight open-source baseline provides significant value to the community.
- The paradigm of cross-embodiment pretraining followed by 10-shot real-world fine-tuning is highly practical.
## Limitations & Future Work
- Only third-person cameras are used; incorporating wrist cameras or tactile sensing could further improve performance.
- The sensitivity of performance to the Q-Former query count is not analyzed in depth.
- Validation on bimanual manipulation scenarios is absent.
## Related Work & Insights
- Octo employs a diffusion head but with limited expressiveness; Dita demonstrates that internalizing denoising within the Transformer is superior.
- π₀ relies on a larger VLM, yet Dita achieves comparable performance with only 334M parameters.
- Insight: The key to robot policy learning may lie not in model size but in how actions and observations interact.
## Rating
| Dimension | Score |
|---|---|
| Novelty | ★★★★☆ |
| Practicality | ★★★★★ |
| Experimental Thoroughness | ★★★★★ |
| Writing Quality | ★★★★☆ |