
Dense Policy: Bidirectional Autoregressive Learning of Actions

  • Conference: ICCV 2025
  • arXiv: 2503.13217
  • Project Page / Code: https://selen-suyue.github.io/DspNet/
  • Area: Robot Manipulation
  • Keywords: autoregressive policy, bidirectional expansion, coarse-to-fine generation, diffusion policy, imitation learning

TL;DR

This paper proposes Dense Policy, a robot manipulation policy based on bidirectional autoregressive expansion, which achieves hierarchical coarse-to-fine action generation in logarithmic time and surpasses mainstream generative policies such as Diffusion Policy and ACT on both simulation and real-world tasks.

Background & Motivation

Action generation paradigms in imitation learning fall into two categories:

  • Holistic generative policies (ACT, Diffusion Policy): model the joint distribution of the action sequence and generate all actions at once. Strong performance, but high inference cost.
  • Autoregressive policies (ICRT, ARP): generate actions incrementally, token-by-token or chunk-by-chunk. Effective in language and vision, they nonetheless underperform in action prediction.

Autoregressive policies face three challenges in action generation:

  1. Next-token prediction struggles to capture long-range temporal dependencies.
  2. Next-chunk prediction has a limited attention span.
  3. Existing multi-scale methods (CARP) rely on discretization and codebook construction, resulting in insufficient precision.

Core observation: humans do not reason through action trajectories step by step during manipulation; instead, they first plan a few keyframes and then progressively refine them. This is analogous to the concept of "receptive fields" in vision — a coarse-to-fine perception process.

Method

Overall Architecture

Dense Policy adopts an encoder-only architecture. Starting from an initial zero vector, it recursively expands the action sequence level by level via a "Dense Process": \(P(A \mid O) = \prod_{i=1}^{n} P(A^i \mid A^{i-1}, A^{i-2}, \ldots, A^0, O)\)

The sequence length doubles at each level, reaching the target horizon \(T\) after \(\log_2 T\) recursive steps. For example, a horizon of \(T = 64\) is reached from the single-point seed in \(\log_2 64 = 6\) expansions.

Key Designs

  1. Bidirectional Expansion: The core mechanism is the Dense Process. Given the sparse keyframe actions \(A^n\) from the previous level (containing \(2^n\) action points), the sequence is upsampled to \(2^{n+1}\) points via linear interpolation:

    • Existing positions retain their original values.
    • New positions are set to the linear interpolation of their two neighbors: \(\tilde{a}_{t+j}^n = \frac{1}{2}\big(a_{t+j-T/2^{n+1}}^n + a_{t+j+T/2^{n+1}}^n\big)\)
    • Boundary positions are filled by copying the nearest original point.

The upsampled sequence is then processed by a 4-layer BERT Encoder with cross-attention to produce the next level \(A^{n+1} = \text{Enc}(A_{up}^n, O)\).

Key distinction from next-token and next-chunk prediction: bidirectional expansion simultaneously captures temporal dependencies in both directions, yielding more coherent action sequences. Inference complexity is \(O(\log T)\) rather than \(O(T)\).
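To make the Dense Process concrete, here is a minimal NumPy sketch of one bidirectional expansion step and the full coarse-to-fine rollout. The function names and the `encoder` callable are illustrative stand-ins, not the authors' implementation; the boundary rule follows the nearest-point copy described above.

```python
import numpy as np

def dense_upsample(actions: np.ndarray) -> np.ndarray:
    """Upsample a (2^n, d) action sequence to (2^(n+1), d).

    Even output slots keep the original action points; odd interior
    slots take the linear interpolation of their two neighbors; the
    last slot, which lacks a right neighbor, copies the nearest
    original point.
    """
    n, d = actions.shape
    up = np.empty((2 * n, d), dtype=actions.dtype)
    up[0::2] = actions                                # existing positions keep their values
    up[1:-1:2] = 0.5 * (actions[:-1] + actions[1:])   # interpolate interior neighbors
    up[-1] = actions[-1]                              # boundary: copy nearest original point
    return up

def dense_process(encoder, obs, horizon: int, act_dim: int) -> np.ndarray:
    """Coarse-to-fine rollout: start from A^0 = 0 (one zero action) and
    double the sequence length until the horizon T is reached.
    `encoder(up_actions, obs)` stands in for the shared BERT encoder
    that refines the upsampled sequence with cross-attention over O.
    """
    seq = np.zeros((1, act_dim))                  # A^0 = 0, the unbiased starting point
    while len(seq) < horizon:
        seq = encoder(dense_upsample(seq), obs)   # A^(n+1) = Enc(A^n_up, O)
    return seq
```

The while-loop executes \(\log_2 T\) times, which is exactly the logarithmic inference complexity claimed above.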

  2. Encoder-Only Architecture: A shared BERT Encoder handles all levels, integrating visual observation features into action representations via cross-attention (a minimal sketch follows this list). Advantages include:

    • Fewer parameters (compared to Diffusion Head and CVAE Head).
    • Faster inference (comparable to ACT, approximately 10× faster than Diffusion Policy).
    • Stable training (no variational inference from VAE or multi-step denoising from diffusion models).
  3. Flexible Visual Input:

    • 2D: ResNet18 + GroupNorm (RGB images).
    • 3D: Sparse Convolutional Network (point clouds).
    • Can be seamlessly replaced with other visual backbones (e.g., RISE's sparse convolutional network).
    • Partial proprioception masking during training to prevent positional memorization bias.
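As a rough illustration of the encoder-only head, here is a PyTorch sketch in which one weight-shared stack serves every level. Using `nn.TransformerDecoderLayer` (bidirectional self-attention plus cross-attention, no causal mask) as a stand-in for a BERT-style encoder with cross-attention, and the width/depth/head counts, are assumptions for illustration; positional encodings and the visual backbone are omitted.

```python
import torch
import torch.nn as nn

class DenseHead(nn.Module):
    """Shared action head applied at every level of the Dense Process."""

    def __init__(self, act_dim: int, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.proj_in = nn.Linear(act_dim, d_model)
        # A decoder layer = self-attention over action tokens plus
        # cross-attention into the observation memory. No tgt_mask is
        # passed in forward(), so the self-attention is bidirectional.
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.proj_out = nn.Linear(d_model, act_dim)

    def forward(self, up_actions: torch.Tensor, obs_feats: torch.Tensor) -> torch.Tensor:
        # up_actions: (B, 2^(n+1), act_dim) upsampled action sequence
        # obs_feats:  (B, K, d_model) observation tokens, e.g. from a
        # ResNet18+GroupNorm (2D) or sparse-conv (3D) backbone per the list above
        x = self.proj_in(up_actions)
        x = self.blocks(tgt=x, memory=obs_feats)
        return self.proj_out(x)   # refined actions A^(n+1)
```

Because the same stack is reused at every level, the parameter count is independent of the horizon, consistent with the parameter-efficiency claim above.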

Loss & Training

  • An L2 loss supervises the deviation between each level's predicted actions and the ground truth (a sketch follows this list).
  • The initial action vector is set to \(A^0 = \mathbf{0}\), providing an unbiased starting point.
  • Random masking of partial end-effector poses during proprioception training improves generalization.
  • Training uses the same number of iterations and expert demonstrations as the baseline methods.
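A minimal sketch of the per-level supervision, assuming per-level targets are built by uniformly subsampling the demonstrated trajectory; the L2 deviation per level is from the text, while the target construction in `level_targets` is an assumption:

```python
import numpy as np

def level_targets(gt: np.ndarray, num_points: int) -> np.ndarray:
    """Subsample the (T, d) ground-truth trajectory to num_points steps.
    Uniform striding is one plausible construction (assumption)."""
    idx = np.linspace(0, len(gt) - 1, num_points).round().astype(int)
    return gt[idx]

def hierarchical_l2(preds_per_level, gt: np.ndarray) -> float:
    """Sum the L2 deviation between each level's predicted actions
    and the correspondingly subsampled ground truth."""
    return sum(
        float(np.mean((pred - level_targets(gt, len(pred))) ** 2))
        for pred in preds_per_level
    )
```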

Key Experimental Results

Main Results (Tables)

Simulation (11 tasks, 3 benchmarks):

| Method | Adroit-Door | Adroit-Pen | DexArt-Laptop | DexArt-Toilet | MetaWorld-BinPick | MetaWorld-BoxClose | MetaWorld-Hammer | MetaWorld-PegInsert | MetaWorld-Disassemble | MetaWorld-ShelfPlace | MetaWorld-Reach | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DP3 | 62±4 | 43±6 | 81±2 | 71±3 | 34±30 | 42±3 | 76±4 | 69±7 | 69±4 | 17±10 | 24±1 | 53±7 |
| 3D Dense | 72±3 | 61±0 | 85±4 | 74±3 | 47±10 | 69±8 | 100±0 | 82±4 | 98±1 | 77±4 | 31±3 | 72±4 |
| DP | 37±2 | 13±2 | 31±4 | 26±8 | 15±4 | 30±5 | 15±6 | 34±7 | 43±7 | 11±3 | 18±2 | 25±5 |
| 2D Dense | 59±8 | 65±1 | 28±7 | 36±8 | 25±2 | 51±3 | 86±4 | 60±7 | 71±6 | 59±6 | 27±4 | 52±5 |

3D Dense Policy achieves an average success rate 19 percentage points higher than DP3; 2D Dense Policy surpasses Diffusion Policy by 27 percentage points.

Real World (4 tasks):

| Method | Put Bread | Open Drawer | Pour Balls (Complete) | Flower Arr. (Success) | Flower Arr. (Avg Flowers) |
|---|---|---|---|---|---|
| ACT | 35% | 10% | 20% | – | – |
| Diffusion Policy | 40% | 20% | 20% | – | – |
| RISE | 75% | 40% | 25% | 50% | 0.6/3.0 |
| 2D Dense | 55% | 20% | 25% | – | – |
| 3D Dense | 85% | 45% | 60% | 70% | 1.0/3.0 |

Ablation Study (Tables)

Comparison of autoregressive paradigms (learning efficiency and final performance):

| Paradigm | Door | Bin Picking | Shelf Place | Box Close |
|---|---|---|---|---|
| Next-Token | Lower | Lower | Lower | Lower |
| Next-Chunk | Medium | Medium | Medium | Medium |
| Bidirectional (Dense) | Highest | Highest | Highest | Highest |

Bidirectional prediction demonstrates higher learning efficiency and a higher performance ceiling across all four challenging tasks.

Inference time and parameter count comparison:

| Policy | Parameters (Action Head) | Inference Time |
|---|---|---|
| ACT | Larger | ~Same as Dense |
| DP (Diffusion) | Dense + 9.19M | ~10× Dense |
| Dense Policy | Smallest | Fastest |

Key Findings

  • Bidirectional dependency is critical: temporal steps in action sequences exhibit bidirectional dependencies that next-token prediction cannot capture.
  • Greater advantage on long-horizon tasks: the relative advantage of Dense Policy is most pronounced in Flower Arrangement (long-horizon, multi-object), achieving 70% vs. 50% success rate.
  • Improved action precision: strong performance on Peg Insert Side (contact-rich) and Pen Rotation (high-DoF dexterous manipulation).
  • More stable training: eliminating ACT's variational inference and Diffusion Policy's multi-step denoising avoids their respective training instabilities.
  • Logarithmic-time inference: \(O(\log T)\) complexity, significantly faster than next-token's \(O(T)\).

Highlights & Insights

  • Coarse-to-fine action generation paradigm: analogous to how humans plan keyframes before refining details, offering a novel perspective on action sequence modeling.
  • First application of bidirectional autoregression to action space: challenges the assumption that autoregressive = unidirectional, demonstrating the effectiveness of bidirectional expansion in continuous action spaces.
  • Efficiency of the encoder-only architecture: a shared encoder handles multi-level action representations, substantially reducing parameter count.
  • Logarithmic-complexity inference: achieves extremely fast inference without sacrificing performance.

Limitations & Future Work

  • Extension of Dense Policy to a general-purpose VLA (Vision-Language-Action) model remains unexplored.
  • Scalability and stability on large-scale foundation models have not been validated.
  • 2D Dense Policy is still limited by its 2D representation on tasks requiring complex spatial reasoning.
  • Occasional over-gripping issues arise in the ball-pouring task; adaptive error-correction capability warrants improvement.
  • The horizon \(T\) must be a power of two, limiting flexibility.

Connections & Takeaways

  • BERT's bidirectional contextual learning → bidirectional action expansion in Dense Policy.
  • Non-raster-order autoregressive image generation in MAR/SAR → bidirectional generation in action space.
  • Complements rather than replaces Diffusion Policy and ACT, highlighting the untapped potential of autoregressive policies.
  • Inspires the application of similar hierarchical expansion strategies to trajectory planning and motion generation.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Bidirectional autoregressive action generation is a genuinely new paradigm; the coarse-to-fine hierarchical design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 11 tasks across 3 simulation benchmarks plus 4 real-world tasks, with full 2D/3D coverage and thorough ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Method motivation is clear with well-articulated differentiation from prior work.
  • Value: ⭐⭐⭐⭐⭐ Opens a new direction for autoregressive policies in robotic manipulation.