STEP: Warm-Started Visuomotor Policies with Spatiotemporal Consistency Prediction

Conference: ICML 2026
arXiv: 2602.08245
Code: https://github.com/Kimho666/STEP
Area: Robotics / Embodied Intelligence / Diffusion Policy Acceleration
Keywords: diffusion policy, warm-start, spatiotemporal consistency, local contraction, velocity-aware perturbation

TL;DR

STEP attaches a lightweight Transformer predictor (previous action history + current observation → next action) to a diffusion policy and uses its output as a denoising warm-start. This cuts denoising from 100 steps to just 2. It also adds a defense against execution deadlock: when the action change is too small, a small amount of noise is injected. Across 9 simulation and 2 real-world tasks, STEP improves average success rate by 21.6% over BRIDGER (RoboMimic) and by 27.5% over DDIM (real robots).

Background & Motivation

Background: Diffusion Policy (DP) is the de facto standard for visuomotor control: it models action sequences as a generative distribution and produces action blocks by iteratively denoising Gaussian noise over 100 steps. It captures multimodality and long-range dependencies and achieves the highest success rates, but it also incurs the highest latency.

Limitations of Prior Work: Existing DP acceleration methods fall into three categories: (1) Numerical solvers (DDIM, DPM-Solver series) can cut 100 steps down to 4 or 2, but performance collapses at 2 steps (DDIM scores only 0.29 on Push-T); (2) Distillation/direct prediction (CP, OneDP, BRIDGER) replaces the denoising process with a small predictor, which lacks expressiveness and degrades on complex tasks; (3) Action reuse (RTI-DP, RNR-DP, Falcon) uses the previous action as a warm-start, which fails when the state changes rapidly. Each category addresses only one of "speed" and "accuracy".

Key Challenge: The crux of acceleration is "providing a good denoising starting point", which must satisfy both: spatial consistency (close to the target action manifold under current state) and temporal consistency (smooth transition from previous action). Existing methods satisfy at most one (BRIDGER: spatial only, Falcon: temporal only), which is insufficient.

Goal: (a) Design a warm-start that preserves DP expressiveness and achieves both spatial and temporal consistency; (b) Ensure stability even with only 2 denoising steps; (c) Prevent the robot from getting stuck in static-friction dead zones when an overly "smooth" warm-start is deployed in the real world.

Key Insight: Rather than replacing or distilling the original DP, attach a lightweight predictor that maps \((\mathbf o_t, \mathbf A_{t-H})\) to \(\hat{\mathbf A}_t\), use this prediction (plus a small amount of noise) as the starting point, and run reverse diffusion from an intermediate denoising step \(K' < K\) instead of from pure noise. This combines the speed of a warm-start with the multimodal generation of DP.

Core Idea: Use a conditional Transformer on "previous action block + current observation" to achieve initialization with both temporal (conditioned on previous actions) and spatial (conditioned on observation) consistency; add a velocity-aware perturbation mechanism to counteract real-world deadlocks; finally, use contraction-mapping theory to prove improved convergence of such starting points.

Method

Overall Architecture

Inference pipeline (Algorithm 1): (1) Observe \(\mathbf o_t\); if the action cache is filled with \(H\) steps, use the predictor to compute \(\hat{\mathbf A}_t=f_\theta(\mathbf o_t,\mathbf A_{cache})\); (2) Construct warm-start \(\tilde{\mathbf A}_{K'}=\sigma\hat{\mathbf A}_t+\sigma_t\boldsymbol\epsilon_t\), where \(K'\ll K\) is the intermediate denoising step; (3) Run reverse diffusion from \(K'\) to \(0\) to obtain the final action \(\mathbf A_t\) and execute; (4) Update cache with \(\mathbf A_t\) for the next cycle. During training, the predictor and DP are trained separately: DP uses standard noise prediction (Eq. 5), the predictor uses MSE; at inference, they are cascaded.
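
The four steps above can be sketched as a short control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_predictor` and `toy_reverse_step` are hypothetical stand-ins for \(f_\theta\) and one reverse-diffusion update, the function names are invented for this sketch, and the action cache is assumed to be already filled.

```python
import numpy as np


def warm_start(a_hat, sigma=1.0, sigma_t=0.0, rng=None):
    """Build the warm-start sample: sigma * a_hat + sigma_t * eps."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(a_hat.shape)
    return sigma * a_hat + sigma_t * eps


def step_inference(obs, cache, predictor, reverse_step, k_prime=2,
                   sigma=1.0, sigma_t=0.0, rng=None):
    """One control cycle: predict, warm-start, denoise from K' down to 0."""
    a_hat = predictor(obs, cache)                 # (1) predicted action block
    a_k = warm_start(a_hat, sigma, sigma_t, rng)  # (2) warm-start at step K'
    for k in range(k_prime, 0, -1):               # (3) reverse steps K', ..., 1
        a_k = reverse_step(a_k, obs, k)
    return a_k                                    # (4) caller stores this in the cache


# Hypothetical toy stand-ins: they only exercise the control flow.
def toy_predictor(obs, cache):
    return 0.5 * cache + 0.5 * obs


def toy_reverse_step(a_k, obs, k):
    return a_k + 0.5 * (obs - a_k)   # pulls the sample halfway toward `obs`


H, action_dim = 8, 2
obs = np.ones((H, action_dim))
cache = np.zeros((H, action_dim))
action = step_inference(obs, cache, toy_predictor, toy_reverse_step, k_prime=2)
cache = action  # update the cache for the next cycle
```

With the default \((\sigma,\sigma_t)=(1,0)\) the warm-start equals the predictor output, so the loop is deterministic; passing a nonzero `sigma_t` adds Gaussian noise before the reverse steps.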

Key Designs

  1. Spatiotemporal Consistency Predictor:

    • Function: A single forward pass yields an action starting point satisfying both temporal consistency (\(\|\tilde a_t-a_{t-1}\|\le\epsilon_t\)) and spatial consistency (\(\mathrm{dist}(\tilde a_t,\mathcal M(s_t))\le\epsilon_s\)).
    • Mechanism: \(f_\theta:\mathcal O\times\mathcal A^H\to\mathcal A^H\), implemented as a 2-layer cross-attention Transformer (actions as query, observations as key/value, 128-dim embedding), mapping \((\mathbf o_t,\mathbf A_{t-H})\) to \(\hat{\mathbf A}_t\); training objective is \(\mathcal L_{pred}=\mathbb E\|\hat{\mathbf A}_t-\mathbf A_t\|^2\), learning the conditional expectation \(\mathbb E[\mathbf A_t\mid\mathbf o_t,\mathbf A_{t-H}]\).
    • Design Motivation: Temporal consistency comes from conditioning on \(\mathbf A_{t-H}\) and spatial consistency from conditioning on \(\mathbf o_t\), so no extra regularization is needed and the design stays simple. Cross-attention is better suited than self-attention to mixing two heterogeneous sequences. Experiments (Fig. 3) show that 2 blocks suffice; additional blocks only increase latency without improving performance, so 2 blocks are used.
  2. Velocity-Aware Perturbation Injection:

    • Function: When the predictor outputs minimal action change (robot about to enter "static friction deadlock"), automatically injects noise to help the actuator overcome dead zones; otherwise, keeps the original signal.
    • Mechanism: Compute the action difference \(\Delta\mathbf A_t=\mathbf A_{cache}-\mathbf A_{t-2H}\) and detect stalling with the indicator \(\mathbb I_t=\mathbb I(\|\Delta\mathbf A_t\|<\epsilon_a)\). The warm-start scale \(\sigma\) and noise amplitude \(\sigma_t\) then switch between two modes (Eq. 14): in normal operation \((\sigma,\sigma_t)=(1,0)\) fully trusts the predictor; when stalling, \((\sigma,\sigma_t)=(\sigma_{scale},\sigma_{stall})\) shrinks the amplitude and injects small Gaussian noise. The reported settings are \(\epsilon_a=0.01\) and \(\sigma_{stall}=0.1\) in simulation, with a larger \(\sigma_{stall}\) on real robots.
    • Design Motivation: Real robots face "control dead zones + static friction" issues; predictor-learned "correct actions" work perfectly in simulation but cause motors to "refuse to move" in reality. Vanilla DDPM's randomness helps cross dead zones—this observation led to "on-demand randomness injection" as a switch, not always-on noise.
  3. Convergence Proof via Local Contraction Mapping:

    • Function: Provides theoretical explanation for why "good warm-start + few reverse steps" can stably converge to correct actions.
    • Mechanism: Unifies DDPM, DDIM, DPM-Solver as \(\mathbf A_{k-1}=\mu_k(\mathbf A_k,\mathbf o_t)+\boldsymbol\xi_k\) (Eq. 15); assuming the denoising network \(\epsilon_\theta\) is Lipschitz with constant \(L\) in the data manifold neighborhood \(\mathcal U\), the reverse mean \(\mu_k\) is also Lipschitz with coefficient \(c_k<1\) (Eq. 16); recursively, \(\|\tilde{\mathbf A}_0-\mathbf A_0\|\le\prod_{k=1}^{K'}c_k\|\tilde{\mathbf A}_{K'}-\mathbf A_{K'}\|\) (Eq. 18), so as long as the predictor brings the starting point into \(\mathcal U\), the error decays exponentially.
    • Design Motivation: Uses a unified contraction framework to explain "why starting denoising from an intermediate step is better than from pure noise", and proves it holds for DDIM / DPM-Solver, independent of the solver—key for elevating an engineering trick to a generalizable method.
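
The recursion behind Eq. 18 can be written out in two lines. This is a sketch under assumptions that the summary above does not state explicitly: both trajectories are driven by the same noise \(\boldsymbol\xi_k\) (so it cancels in the difference) and both stay inside \(\mathcal U\); the symbol \(c_{\max}\) is introduced here for illustration.

```latex
\begin{aligned}
\|\tilde{\mathbf A}_{k-1}-\mathbf A_{k-1}\|
  &= \|\mu_k(\tilde{\mathbf A}_k,\mathbf o_t)-\mu_k(\mathbf A_k,\mathbf o_t)\|
   \le c_k\,\|\tilde{\mathbf A}_k-\mathbf A_k\|, \\
\|\tilde{\mathbf A}_0-\mathbf A_0\|
  &\le \Big(\prod_{k=1}^{K'} c_k\Big)\,\|\tilde{\mathbf A}_{K'}-\mathbf A_{K'}\|
   \le c_{\max}^{K'}\,\|\tilde{\mathbf A}_{K'}-\mathbf A_{K'}\|,
  \qquad c_{\max}=\max_{k} c_k<1 .
\end{aligned}
```

As an illustrative example (the numbers are not from the paper), if \(c_1=c_2=0.5\) and \(K'=2\), the initial warm-start error shrinks by a factor of \(0.25\) after the two reverse steps.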

Loss & Training

  • Predictor: \(\mathcal L_{pred}=\mathbb E\|\hat{\mathbf A}_t-\mathbf A_t\|^2\), 100k steps.
  • DP: Uses default codebase configuration, unchanged.
  • Inference hyperparameters: \(K'\) is the denoising step at which reverse diffusion starts (STEP uses 2 or 4); in simulation the defaults are \(\sigma=1\) in normal mode and a stall noise amplitude of \(\sigma_{stall}=0.1\); real robots use a larger \(\sigma_{stall}\) to overcome dead zones.
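
The switch between the normal and stalling modes of the velocity-aware perturbation can be sketched as a small helper. This is an illustration only: the function name is hypothetical, the norm is assumed to be taken over the whole action block, and `sigma_scale=0.9` is a placeholder because the summary does not give a value for \(\sigma_{scale}\).

```python
import numpy as np


def perturbation_params(a_cache, a_prev, eps_a=0.01,
                        sigma_scale=0.9, sigma_stall=0.1):
    """Return (sigma, sigma_t) for the warm-start sigma * a_hat + sigma_t * eps.

    a_cache: most recent executed action block, shape (H, action_dim)
    a_prev:  the block before it (A_{t-2H} in the text), same shape
    sigma_scale is a placeholder value (assumption, not from the paper).
    """
    delta = a_cache - a_prev
    stalled = np.linalg.norm(delta) < eps_a   # indicator I_t
    if stalled:
        return sigma_scale, sigma_stall       # shrink amplitude, inject noise
    return 1.0, 0.0                           # trust the predictor fully


moving = perturbation_params(np.ones((8, 2)), np.zeros((8, 2)))
stuck = perturbation_params(np.zeros((8, 2)), np.zeros((8, 2)))
```

Here `moving` falls in the normal mode and `stuck` in the stalling mode, so noise is injected only when the action block has effectively stopped changing.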

Key Experimental Results

Main Results

State-based RoboMimic / Push-T (Table 2 excerpt). The paper's table reports Score (higher is better) and Time (ms, lower is better); the excerpt below lists scores only.

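| Method | Step | Push-T | Square | ToolHang |
|---|---|---|---|---|
| Vanilla DDPM | 100 | 0.94 | 0.94 | 0.68 |
| DDIM | 2 | 0.29 | 0.84 | 0.06 |
| DPM-Solver++ | 2 | 0.20 | 1.00 | 0 |
| BRIDGER | 2 | 0.37 | 0.84 | 0.08 |
| Falcon | 2 | 0.21 | 1.00 | 0 |
| STEP (Ours) | 2 | 0.49 | 0.96 | 0.64 |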
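
To make the 2-step comparison concrete, the snippet below computes how much of the 100-step DDPM score each 2-step method retains per task. The scores are copied from the state-based excerpt above; the helper name `retained` and the choice of methods shown are just for illustration.

```python
# Scores copied from the state-based table excerpt (higher is better).
ddpm_100 = {"Push-T": 0.94, "Square": 0.94, "ToolHang": 0.68}
two_step = {
    "DDIM": {"Push-T": 0.29, "Square": 0.84, "ToolHang": 0.06},
    "BRIDGER": {"Push-T": 0.37, "Square": 0.84, "ToolHang": 0.08},
    "Falcon": {"Push-T": 0.21, "Square": 1.00, "ToolHang": 0.0},
    "STEP": {"Push-T": 0.49, "Square": 0.96, "ToolHang": 0.64},
}


def retained(scores, reference):
    """Fraction of the reference score kept on each task."""
    return {task: scores[task] / reference[task] for task in reference}


ratios = {name: retained(scores, ddpm_100) for name, scores in two_step.items()}
for name, per_task in ratios.items():
    print(name, {task: round(value, 2) for task, value in per_task.items()})
```

On these numbers STEP retains most of the DDPM score on Square and ToolHang but only about half on Push-T, while the other 2-step methods lose almost all of it on ToolHang.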

Image-based RoboMimic (Table 3 excerpt): For vision input, long-horizon tasks like ToolHang show especially clear differences.

| Method | Step | Square | ToolHang |
|---|---|---|---|
| DDIM | 2 | 0.74 | 0.5 |
| BRIDGER | 2 | 0.92 | 0.72 |
| STEP (Ours) | 2 | – | (>BRIDGER, see paper) |

Core conclusion: STEP at 2 steps achieves +21.6% average over BRIDGER and +48.8% over Falcon (temporal-only) on RoboMimic; on real robots, +27.5% average success rate over DDIM; average episode execution time on real robots reduced by 59% due to velocity-aware perturbation.

Ablation Study

| Configuration | Key Metric | Notes |
|---|---|---|
| Full STEP (2 step) | Push-T 0.49 / Lift 1.0 / Square 0.96 | Spatiotemporal consistency + perturbation + intermediate step |
| No predictor (= DDIM) | Push-T 0.29 / Lift 0.80 / Square 0.84 | Starting point degrades to pure noise, performance collapses |
| Spatial only (BRIDGER) | Push-T 0.37 / Lift 1.0 / Square 0.84 | No temporal continuity, long-horizon tasks degrade |
| Temporal only (Falcon) | Push-T 0.21 / Square 1.00 / ToolHang 0 | No spatial consistency, ToolHang drops to 0 |
| Cross-attn block 1/2/4 | 2 is sweet spot | Fig. 3; 4 blocks increase latency without gain |

Key Findings

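  • 2-step inference is STEP's core selling point: Other methods collapse at 2 steps (Falcon scores 0 on ToolHang), while STEP stays close to 100-step DDPM on Square (0.96 vs. 0.94) and ToolHang (0.64 vs. 0.68), although Push-T still lags (0.49 vs. 0.94). This pushes the Pareto frontier (latency vs. success) to the lower right.
  • Both spatial and temporal consistency are essential: Table 1 classifies methods by temporal consistency (TC) and spatial consistency (SC), and every method with only one of the two collapses on at least one task at 2 steps (on state-based ToolHang, Falcon scores 0 and BRIDGER 0.08); only STEP avoids a near-zero score on every task shown.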
  • Sim-to-real gap: \(\sigma_{stall}\) suffices at 0.1 in simulation, but needs to be larger on real robots, highlighting "friction/dead zone" as a neglected sim-to-real bottleneck.
  • Predictor is very small: Only 128-dim, 2 cross-attention blocks, 100k training steps; extremely lightweight and easy to embed.
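
To show how little computation the predictor's core operation involves, here is a NumPy sketch of one cross-attention step with actions as queries and observations as keys/values at 128 dimensions. It is a simplified illustration, not the paper's architecture: it uses a single head, random weights, and omits residual connections, normalization, and feed-forward layers; the token counts `H` and `N` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(actions, obs, w_q, w_k, w_v):
    """Single-head cross-attention: action tokens query observation tokens."""
    q = actions @ w_q                          # (H, d)
    k = obs @ w_k                              # (N, d)
    v = obs @ w_v                              # (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (H, N) scaled dot products
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights                # (H, d), (H, N)


rng = np.random.default_rng(0)
H, N, d = 8, 4, 128                            # H action tokens, N observation tokens
actions = rng.standard_normal((H, d))          # embedded previous action block
obs = rng.standard_normal((N, d))              # embedded observation tokens
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out, weights = cross_attention(actions, obs, w_q, w_k, w_v)
```

Each output row is a weighted mix of observation features selected by one action token, which is how the observation (spatial) information is injected into the action (temporal) sequence.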

Highlights & Insights

  • Clear conceptual contribution: Explicitly formalizes "warm-start must be spatiotemporally consistent" (Eq. 7-8 + Table 1), providing a simple analytical dimension for DP acceleration.
  • Decoupled predictor + DP training: Retains DP's multimodal generation (no distillation or replacement), yet enables lightweight acceleration—a practical design pattern applicable to any DP backbone.
  • Velocity-aware perturbation: The "on-demand randomness injection" idea can transfer to any domain needing dynamic switching between deterministic prediction and exploration (e.g., imitation + RL hybrid training, model predictive control).
  • Contraction proof: Simple yet strong—provides a unified theoretical explanation for all "intermediate-step warm-start" methods, not just anecdotal evidence.

Limitations & Future Work

  • Temporal consistency relies on conditioning on previous \(\mathbf A_{t-H}\), which may introduce bias during abrupt state transitions (e.g., sudden obstacles); the perturbation mechanism compensates, but its trigger only considers action change, not observation change, making it coarse.
  • The predictor is a single forward pass, without explicit multimodal modeling; if the current state corresponds to multiple equivalent action modes, the predictor may "average out" to meaningless actions (the paper acknowledges this issue for BRIDGER-type methods).
  • Only validated in imitation learning; compatibility of warm-start with policy drift in true closed-loop RL remains untested.
  • Real robot \(\sigma_{stall}\) requires manual tuning, lacking adaptive strategies; a learned critic could be added to determine "whether perturbation is needed".

Comparison with Related Work

  • vs DDIM / DPM-Solver++: These only modify the solver, do not introduce warm-start, and collapse at 2 steps; STEP is orthogonal and can be applied to any solver.
  • vs BRIDGER (spatial-only): BRIDGER uses a predictor for the starting point but only considers current state, not time; STEP adds just previous \(\mathbf A_{t-H}\) and achieves 21.6% average improvement.
  • vs Falcon / RTI-DP (temporal-only): These assume smooth dynamics, but fail on tasks with rapid state changes (ToolHang / Push-T); STEP leverages observation conditioning to handle state transitions.
  • vs CP / OneDP (distillation): Distillation destroys DP's multimodal generation; STEP preserves DP, allowing "more denoising steps for complex tasks, just 2 for simple ones".

Rating

  • Novelty: ⭐⭐⭐⭐ Spatiotemporal consistency as a two-dimensional criterion + intermediate-step warm-start + velocity-aware perturbation, all simple yet effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 9 simulation + 2 real-world tasks × 8 baselines × state/image inputs, comprehensive ablation tables.
  • Writing Quality: ⭐⭐⭐⭐ Conceptual diagrams (Fig 1) + consistency table (Table 1) help readers quickly grasp the framework; contraction proof is concise and elegant.
  • Value: ⭐⭐⭐⭐⭐ Directly usable engineering solution, code open-sourced, a plug-and-play acceleration tool for all teams deploying diffusion policy in robotics.