
Dense Policy: Bidirectional Autoregressive Learning of Actions

  • Conference: ICCV 2025
  • arXiv: 2503.13217
  • Project Page / Code: https://selen-suyue.github.io/DspNet/
  • Area: Robot Manipulation
  • Keywords: autoregressive policy, bidirectional expansion, coarse-to-fine generation, diffusion policy, imitation learning

TL;DR

This paper proposes Dense Policy, a robot manipulation policy based on bidirectional autoregressive expansion, which achieves hierarchical coarse-to-fine action generation in logarithmic time and surpasses mainstream generative policies such as Diffusion Policy and ACT on both simulation and real-world tasks.

Background & Motivation

Action generation paradigms in imitation learning fall into two categories:

  • Holistic generative policies (ACT, Diffusion Policy): model the joint distribution of the action sequence and generate all actions at once. Strong performance, but high inference cost.
  • Autoregressive policies (ICRT, ARP): generate actions incrementally, token-by-token or chunk-by-chunk. Effective in language and vision, they nonetheless underperform in action prediction.

Autoregressive policies face three challenges in action generation:

  1. Next-token prediction struggles to capture long-range temporal dependencies.
  2. Next-chunk prediction has a limited attention span.
  3. Existing multi-scale methods (CARP) rely on discretization and codebook construction, resulting in insufficient precision.

Core observation: humans do not reason through action trajectories step by step during manipulation; instead, they first plan a few keyframes and then progressively refine them. This is analogous to the concept of "receptive fields" in vision — a coarse-to-fine perception process.

Method

Overall Architecture

Dense Policy adopts an encoder-only architecture. Starting from an initial zero vector, it recursively expands the action sequence level by level via a "Dense Process": \(P(A \mid O) = \prod_{i=1}^{n} P(A^i \mid A^{i-1}, A^{i-2}, \ldots, A^0, O)\)

The sequence length doubles at each level, reaching the target horizon \(T\) after \(\log_2 T\) recursive steps. For example, a horizon of \(T = 64\) is reached from the single-point seed in \(\log_2 64 = 6\) expansions.

Key Designs

  1. Bidirectional Expansion: The core mechanism is the Dense Process. Given the sparse keyframe actions \(A^n\) from the previous level (containing \(2^n\) action points), the sequence is upsampled to \(2^{n+1}\) points via linear interpolation:

    • Existing positions retain their original values.
    • New positions are set to the linear interpolation of their two neighbors: \(\tilde{a}_{t+j}^n = \frac{1}{2}\big(a_{t+j-T/2^{n+1}}^n + a_{t+j+T/2^{n+1}}^n\big)\)
    • Boundary positions are filled by copying the nearest original point.

The upsampled sequence is then processed by a 4-layer BERT Encoder with cross-attention to produce the next level \(A^{n+1} = \text{Enc}(A_{up}^n, O)\).

Key distinction from next-token and next-chunk prediction: bidirectional expansion simultaneously captures temporal dependencies in both directions, yielding more coherent action sequences. Inference complexity is \(O(\log T)\) rather than \(O(T)\).
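To make the Dense Process concrete, here is a minimal NumPy sketch of one bidirectional expansion step and the full coarse-to-fine rollout. The function names and the `encoder` callable are illustrative stand-ins, not the authors' implementation; the boundary rule follows the nearest-point copy described above.

```python
import numpy as np

def dense_upsample(actions: np.ndarray) -> np.ndarray:
    """Upsample a (2^n, d) action sequence to (2^(n+1), d).

    Even output slots keep the original action points; odd interior
    slots take the linear interpolation of their two neighbors; the
    last slot, which lacks a right neighbor, copies the nearest
    original point.
    """
    n, d = actions.shape
    up = np.empty((2 * n, d), dtype=actions.dtype)
    up[0::2] = actions                                # existing positions keep their values
    up[1:-1:2] = 0.5 * (actions[:-1] + actions[1:])   # interpolate interior neighbors
    up[-1] = actions[-1]                              # boundary: copy nearest original point
    return up

def dense_process(encoder, obs, horizon: int, act_dim: int) -> np.ndarray:
    """Coarse-to-fine rollout: start from A^0 = 0 (one zero action) and
    double the sequence length until the horizon T is reached.
    `encoder(up_actions, obs)` stands in for the shared BERT encoder
    that refines the upsampled sequence with cross-attention over O.
    """
    seq = np.zeros((1, act_dim))                  # A^0 = 0, the unbiased starting point
    while len(seq) < horizon:
        seq = encoder(dense_upsample(seq), obs)   # A^(n+1) = Enc(A^n_up, O)
    return seq
```

The while-loop executes \(\log_2 T\) times, which is exactly the logarithmic inference complexity claimed above.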

  2. Encoder-Only Architecture: A shared BERT Encoder handles all levels, integrating visual observation features into action representations via cross-attention (a minimal sketch follows this list). Advantages include:

    • Fewer parameters (compared to Diffusion Head and CVAE Head).
    • Faster inference (comparable to ACT, approximately 10× faster than Diffusion Policy).
    • Stable training (no variational inference from VAE or multi-step denoising from diffusion models).
  3. Flexible Visual Input:

    • 2D: ResNet18 + GroupNorm (RGB images).
    • 3D: Sparse Convolutional Network (point clouds).
    • Can be seamlessly replaced with other visual backbones (e.g., RISE's sparse convolutional network).
    • Partial proprioception masking during training to prevent positional memorization bias.
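As a rough illustration of the encoder-only head, here is a PyTorch sketch in which one weight-shared stack serves every level. Using `nn.TransformerDecoderLayer` (bidirectional self-attention plus cross-attention, no causal mask) as a stand-in for a BERT-style encoder with cross-attention, and the width/depth/head counts, are assumptions for illustration; positional encodings and the visual backbone are omitted.

```python
import torch
import torch.nn as nn

class DenseHead(nn.Module):
    """Shared action head applied at every level of the Dense Process."""

    def __init__(self, act_dim: int, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.proj_in = nn.Linear(act_dim, d_model)
        # A decoder layer = self-attention over action tokens plus
        # cross-attention into the observation memory. No tgt_mask is
        # passed in forward(), so the self-attention is bidirectional.
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.proj_out = nn.Linear(d_model, act_dim)

    def forward(self, up_actions: torch.Tensor, obs_feats: torch.Tensor) -> torch.Tensor:
        # up_actions: (B, 2^(n+1), act_dim) upsampled action sequence
        # obs_feats:  (B, K, d_model) observation tokens, e.g. from a
        # ResNet18+GroupNorm (2D) or sparse-conv (3D) backbone per the list above
        x = self.proj_in(up_actions)
        x = self.blocks(tgt=x, memory=obs_feats)
        return self.proj_out(x)   # refined actions A^(n+1)
```

Because the same stack is reused at every level, the parameter count is independent of the horizon, consistent with the parameter-efficiency claim above.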

Loss & Training

  • An L2 loss supervises the deviation between each level's predicted actions and the ground truth (a sketch follows this list).
  • The initial action vector is set to \(A^0 = \mathbf{0}\), providing an unbiased starting point.
  • Random masking of partial end-effector poses during proprioception training improves generalization.
  • Training uses the same number of iterations and expert demonstrations as the baseline methods.
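A minimal sketch of the per-level supervision, assuming per-level targets are built by uniformly subsampling the demonstrated trajectory; the L2 deviation per level is from the text, while the target construction in `level_targets` is an assumption:

```python
import numpy as np

def level_targets(gt: np.ndarray, num_points: int) -> np.ndarray:
    """Subsample the (T, d) ground-truth trajectory to num_points steps.
    Uniform striding is one plausible construction (assumption)."""
    idx = np.linspace(0, len(gt) - 1, num_points).round().astype(int)
    return gt[idx]

def hierarchical_l2(preds_per_level, gt: np.ndarray) -> float:
    """Sum the L2 deviation between each level's predicted actions
    and the correspondingly subsampled ground truth."""
    return sum(
        float(np.mean((pred - level_targets(gt, len(pred))) ** 2))
        for pred in preds_per_level
    )
```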

Key Experimental Results

Main Results (Tables)

Simulation (11 tasks, 3 benchmarks):

| Method | Adroit-Door | Adroit-Pen | DexArt-Laptop | DexArt-Toilet | MetaWorld-BinPick | MetaWorld-BoxClose | MetaWorld-Hammer | MetaWorld-PegInsert | MetaWorld-Disassemble | MetaWorld-ShelfPlace | MetaWorld-Reach | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DP3 | 62±4 | 43±6 | 81±2 | 71±3 | 34±30 | 42±3 | 76±4 | 69±7 | 69±4 | 17±10 | 24±1 | 53±7 |
| 3D Dense | 72±3 | 61±0 | 85±4 | 74±3 | 47±10 | 69±8 | 100±0 | 82±4 | 98±1 | 77±4 | 31±3 | 72±4 |
| DP | 37±2 | 13±2 | 31±4 | 26±8 | 15±4 | 30±5 | 15±6 | 34±7 | 43±7 | 11±3 | 18±2 | 25±5 |
| 2D Dense | 59±8 | 65±1 | 28±7 | 36±8 | 25±2 | 51±3 | 86±4 | 60±7 | 71±6 | 59±6 | 27±4 | 52±5 |

3D Dense Policy achieves an average success rate 19 percentage points higher than DP3; 2D Dense Policy surpasses Diffusion Policy by 27 percentage points.

Real World (4 tasks):

| Method | Put Bread | Open Drawer | Pour Balls (Complete) | Flower Arr. (Success) | Flower Arr. (Avg Flowers) |
|---|---|---|---|---|---|
| ACT | 35% | 10% | 20% | – | – |
| Diffusion Policy | 40% | 20% | 20% | – | – |
| RISE | 75% | 40% | 25% | 50% | 0.6/3.0 |
| 2D Dense | 55% | 20% | 25% | – | – |
| 3D Dense | 85% | 45% | 60% | 70% | 1.0/3.0 |

Ablation Study (Tables)

Comparison of autoregressive paradigms (learning efficiency and final performance):

| Paradigm | Door | Bin Picking | Shelf Place | Box Close |
|---|---|---|---|---|
| Next-Token | Lower | Lower | Lower | Lower |
| Next-Chunk | Medium | Medium | Medium | Medium |
| Bidirectional (Dense) | Highest | Highest | Highest | Highest |

Bidirectional prediction demonstrates higher learning efficiency and a higher performance ceiling across all four challenging tasks.

Inference time and parameter count comparison:

| Policy | Parameters (Action Head) | Inference Time |
|---|---|---|
| ACT | Larger | ~Same as Dense |
| DP (Diffusion) | Dense + 9.19M | ~10× Dense |
| Dense Policy | Smallest | Fastest |

Key Findings

  • Bidirectional dependency is critical: temporal steps in action sequences exhibit bidirectional dependencies that next-token prediction cannot capture.
  • Greater advantage on long-horizon tasks: the relative advantage of Dense Policy is most pronounced in Flower Arrangement (long-horizon, multi-object), achieving 70% vs. 50% success rate.
  • Improved action precision: strong performance on Peg Insert Side (contact-rich) and Pen Rotation (high-DoF dexterous manipulation).
  • More stable training: eliminating ACT's variational inference and Diffusion Policy's multi-step denoising avoids their respective training instabilities.
  • Logarithmic-time inference: \(O(\log T)\) complexity, significantly faster than next-token's \(O(T)\).

Highlights & Insights

  • Coarse-to-fine action generation paradigm: analogous to how humans plan keyframes before refining details, offering a novel perspective on action sequence modeling.
  • First application of bidirectional autoregression to action space: challenges the assumption that autoregressive = unidirectional, demonstrating the effectiveness of bidirectional expansion in continuous action spaces.
  • Efficiency of the encoder-only architecture: a shared encoder handles multi-level action representations, substantially reducing parameter count.
  • Logarithmic-complexity inference: achieves extremely fast inference without sacrificing performance.

Limitations & Future Work

  • Extension of Dense Policy to a general-purpose VLA (Vision-Language-Action) model remains unexplored.
  • Scalability and stability on large-scale foundation models have not been validated.
  • 2D Dense Policy is still limited by its 2D representation on tasks requiring complex spatial reasoning.
  • Occasional over-gripping issues arise in the ball-pouring task; adaptive error-correction capability warrants improvement.
  • The horizon \(T\) must be a power of two, limiting flexibility.

Connections & Takeaways

  • BERT's bidirectional contextual learning → bidirectional action expansion in Dense Policy.
  • Non-raster-order autoregressive image generation in MAR/SAR → bidirectional generation in action space.
  • Complements rather than replaces Diffusion Policy and ACT, highlighting the untapped potential of autoregressive policies.
  • Inspires the application of similar hierarchical expansion strategies to trajectory planning and motion generation.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Bidirectional autoregressive action generation is a genuinely new paradigm; the coarse-to-fine hierarchical design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 11 tasks across 3 simulation benchmarks plus 4 real-world tasks, with full 2D/3D coverage and thorough ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Method motivation is clear with well-articulated differentiation from prior work.
  • Value: ⭐⭐⭐⭐⭐ Opens a new direction for autoregressive policies in robotic manipulation.