Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

Conference: ICCV 2025 | arXiv: 2503.19757 | Code: Project | Area: Robot Policy / VLA Models | Keywords: VLA, diffusion policy, DiT, in-context conditioning, cross-embodiment

TL;DR

This paper proposes Dita (Diffusion Transformer Policy), which, unlike prior methods that denoise on compressed embeddings using shallow networks, adopts in-context conditioning to directly condition denoising on raw visual tokens. A causal Transformer processes the full token sequence of language, images, timesteps, and noisy actions. With 334M parameters, Dita achieves state-of-the-art or competitive performance on SimplerEnv zero-shot, LIBERO, CALVIN, and other benchmarks.

Background & Motivation

  • Background: Generalist robot policies have advanced through pretraining on large-scale cross-embodiment datasets such as OXE.
  • Limitations of Prior Work: (1) Discretized actions (e.g., OpenVLA) limit adaptability to heterogeneous action spaces; (2) methods using MLP or small DiT diffusion heads (e.g., Octo, π₀) lack sufficient expressiveness under the diversity of large-scale data; (3) denoising on embeddings discards visual details from historical observations.
  • Key Challenge: Heterogeneous cross-embodiment action spaces conflict with the need for a unified policy representation.
  • Goal: Design an expressive, scalable generalist robot policy architecture.
  • Key Insight: Place action denoising directly inside a causal Transformer that interacts with visual tokens.
  • Core Idea: Action denoising should not operate on compressed embeddings but should instead perform in-context attention directly over raw visual patch tokens.

Method

Overall Architecture

CLIP encodes language → DINOv2 + Q-Former extract image features → concatenate [language, image, timestep, noisy action] token sequence → causal DiT denoising → output clean action chunk (16 steps).

Key Designs

Design 1: In-Context Conditioning Diffusion
  • Function: Noisy action tokens and visual/language tokens are processed together within a single causal Transformer.
  • Mechanism: Action tokens participate directly in the attention computation and can attend to every image patch token, capturing subtle action increments and environmental details.
  • Design Motivation: Prior methods condition denoising on a single compressed embedding, losing spatial detail; in-context conditioning preserves the complete visual information.
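
A minimal sketch of the mechanism, assuming single-head causal self-attention with identity projections; the token layout, counts, and dimensions below are illustrative choices, not the paper's:

```python
import numpy as np

def causal_self_attention(x):
    """Single-head self-attention with a causal mask (learned projections omitted)."""
    seq, dim = x.shape
    scores = x @ x.T / np.sqrt(dim)
    causal = np.tril(np.ones((seq, seq), dtype=bool))
    scores = np.where(causal, scores, -np.inf)   # each token sees itself and earlier tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x, w

# Hypothetical token layout: [lang(4) | image patches(8) | timestep(1) | noisy actions(16)]
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4 + 8 + 1 + 16, 32))
out, attn = causal_self_attention(tokens)
# Every noisy-action token (positions 13..28) attends directly to every image
# patch token (positions 4..11) -- no compressed bottleneck in between.
```

Because the action tokens come after the visual tokens in the causal order, each denoising step reads the raw patch tokens directly, which is the property the paper credits for capturing fine environmental detail.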

Design 2: End-to-End DINOv2 Fine-Tuning + Q-Former
  • Function: DINOv2 extracts multi-scale features; a Q-Former queries key visual features conditioned on the language instruction.
  • Mechanism: DINOv2 pretrained on web data has a domain gap with robot data; end-to-end fine-tuning bridges this gap. The Q-Former uses FiLM conditioning to select task-relevant information from the DINOv2 patch features, reducing computational cost.
  • Design Motivation: Frozen visual encoders are insufficient for robotics, but full fine-tuning produces too many tokens, necessitating Q-Former compression.
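
A rough sketch of FiLM-conditioned query pooling in the spirit of this design (NumPy; all matrices are random stand-ins, and the patch/query counts and feature dimension are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
n_patches, n_queries = 256, 16                   # hypothetical token counts
patches = rng.standard_normal((n_patches, d))    # stand-in for DINOv2 patch features
lang = rng.standard_normal(d)                    # stand-in for the language embedding

# FiLM: the language embedding predicts a per-channel scale and shift
W_gamma = rng.standard_normal((d, d)) * 0.02
W_beta = rng.standard_normal((d, d)) * 0.02
gamma, beta = lang @ W_gamma, lang @ W_beta
patches_mod = (1 + gamma) * patches + beta       # task-conditioned features

# Cross-attention: learned queries pool 256 patches down to 16 tokens
queries = rng.standard_normal((n_queries, d))
scores = queries @ patches_mod.T / np.sqrt(d)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
compressed = w @ patches_mod                     # (16, 32): ~16x fewer visual tokens
```

The compression ratio is the point: the downstream causal Transformer attends over 16 language-selected tokens per image instead of 256 raw patches.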

Design 3: Lightweight and Scalable Architecture
  • Function: Achieves state-of-the-art performance with only 334M parameters.
  • Mechanism: LLaMA-style causal Transformer, requiring no large VLM (e.g., PaliGemma). DDPM training (1000 steps) + DDIM inference (20 steps).
  • Design Motivation: Provides a clean, lightweight, open-source baseline that lowers the barrier to entry for the community.
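
The DDPM-train / DDIM-infer split can be sketched as follows (NumPy; `eps_theta` is a toy placeholder for the causal DiT, and the linear noise schedule is a generic assumption, not necessarily the paper's):

```python
import numpy as np

def eps_theta(z_t, t, cond):
    """Placeholder noise predictor standing in for the Transformer."""
    return 0.1 * z_t

T = 1000                                   # DDPM training timesteps
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def ddim_sample(cond, shape, n_steps=20, rng=None):
    """Deterministic DDIM sampling using only 20 of the 1000 trained timesteps."""
    rng = rng or np.random.default_rng(0)
    ts = np.linspace(T - 1, 0, n_steps).astype(int)
    z = rng.standard_normal(shape)                     # start from pure noise
    for i, t in enumerate(ts):
        a_t = alphas_bar[t]
        a_prev = alphas_bar[ts[i + 1]] if i + 1 < n_steps else 1.0
        eps = eps_theta(z, t, cond)
        x0 = (z - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)       # predicted clean chunk
        z = np.sqrt(a_prev) * x0 + np.sqrt(1 - a_prev) * eps   # deterministic update
    return z

actions = ddim_sample(cond=None, shape=(16, 7))  # 16-step chunk; 7-DoF is an assumption
```

Subsampling 1000 training steps down to 20 inference steps is what keeps control-loop latency manageable.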

Loss & Training

Standard DDPM noise-prediction objective: \(\min_\theta \mathbb{E}_{t,\epsilon}\big[\|\epsilon - \epsilon_\theta(z_t, t, c)\|^2\big]\), where \(c\) denotes the language and visual conditioning tokens. AdamW optimizer, 100K steps, batch size 8192 (32× A100), 2-frame observation history → 16-step action chunk.
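
A minimal sketch of one training step under this objective (NumPy; `eps_theta` is a toy stand-in for the causal DiT, and the linear noise schedule is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def eps_theta(z_t, t, cond):
    """Placeholder for the Transformer's noise prediction."""
    return 0.05 * z_t

def ddpm_loss(actions, cond):
    """One training step: noise a clean action chunk, predict the noise, take MSE."""
    t = rng.integers(0, T)                              # sample a random timestep
    eps = rng.standard_normal(actions.shape)            # sample Gaussian noise
    z_t = np.sqrt(alphas_bar[t]) * actions + np.sqrt(1 - alphas_bar[t]) * eps
    return np.mean((eps - eps_theta(z_t, t, cond)) ** 2)

loss = ddpm_loss(rng.standard_normal((16, 7)), cond=None)  # 16-step chunk, 7-DoF assumed
```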

Key Experimental Results

Main Results

SimplerEnv Zero-Shot (Success Rate %)

Method coke_can (match/var) move_near (match/var)
RT-1-X 56.7 / 49.0 31.7 / 32.3
OpenVLA 16.3 / 54.5 46.2 / 47.7
Dita 83.7 / 85.5 76.0 / 73.0

LIBERO Fine-Tuning (Success Rate %)

Method SPATIAL OBJECT GOAL LONG Avg.
OpenVLA 84.9 88.4 79.2 53.7 76.5
Dita 84.2 96.3 85.4 63.8 82.4

Ablation Study

Configuration CALVIN Avg. Len
Diffusion head (non-in-context) 3.16
In-context Dita 3.53
No pretraining 2.38

Key Findings

  1. In-context conditioning significantly outperforms diffusion heads, especially on long-horizon tasks (LIBERO-LONG +10%).
  2. With only a third-person camera and 10-shot fine-tuning, the policy generalizes to novel real-world environments.
  3. A 334M-parameter model surpasses OpenVLA (7B) and other larger models, suggesting that architectural design matters more than scale.

Highlights & Insights

  1. The core insight of in-context conditioning — action denoising requires access to raw visual details rather than compressed embeddings.
  2. The lightweight open-source baseline provides significant value to the community.
  3. The paradigm of cross-embodiment pretraining followed by 10-shot real-world fine-tuning is highly practical.

Limitations & Future Work

  1. Only third-person cameras are used; incorporating wrist cameras or tactile sensing could further improve performance.
  2. The sensitivity of performance to the Q-Former query count is not thoroughly analyzed.
  3. Validation on bimanual manipulation scenarios is absent.

Comparison with Related Methods

  • Octo employs a diffusion head but with limited expressiveness; Dita demonstrates that internalizing denoising within the Transformer is superior.
  • π₀ relies on a larger VLM, yet Dita achieves comparable performance with only 334M parameters.
  • Insight: the key to robot policy learning may lie not in model size but in how actions and observations interact.

Rating

Dimension Score
Novelty ★★★★☆
Practicality ★★★★★
Experimental Thoroughness ★★★★★
Writing Quality ★★★★☆