BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning

Conference: NeurIPS 2025
arXiv: 2506.06072
Code: https://intuitive-robots.github.io/beast_website/
Area: Imitation Learning / Robotics
Keywords: Action Tokenizer, B-Spline, Parallel Decoding, Smooth Trajectory, Efficient Inference

TL;DR

BEAST parameterizes action sequences as B-splines: control points are estimated by ridge regression and uniformly quantized into a fixed-length token sequence. This yields 20× token compression (100 steps → 5 tokens), mathematically guaranteed \(C^0\) continuity across action chunks, the highest success rate on LIBERO-Long (86.4%), and an inference throughput of 617 Hz (2.14× faster than π₀ and 101× faster than OpenVLA).

Background & Motivation

Background: Action representations in imitation learning directly affect policy quality and inference efficiency. VQ-VAE requires separately trained codebooks; FAST uses BPE to produce variable-length sequences; per-step discretization (binning) yields token counts proportional to sequence length.

Limitations of Prior Work: (a) VQ-VAE codebook training is decoupled from policy training, leading to potential misalignment; (b) FAST's variable-length tokens are ill-suited for parallel decoding; (c) per-step binning offers low compression—100 steps require 100 tokens; (d) no existing method guarantees smooth transitions between action chunks (requiring temporal blending as post-processing).

Key Challenge: Simultaneously satisfying high compression (fewer tokens = faster decoding), fixed length (parallel decoding), smooth transitions (no discontinuities), and high accuracy is fundamentally challenging.

Goal: Design an action tokenizer that satisfies all of the above requirements.

Key Insight: B-splines naturally provide a continuous and smooth representation; the number of control points is fixed (equal to the token count) and independent of the number of sampled steps; they can be fitted rapidly via ridge regression; and clamping guarantees inter-chunk continuity.

Core Idea: Fit B-splines to action sequences → uniformly quantize control points into fixed-length tokens → clamp the starting point to guarantee \(C^0\) continuity across chunks → achieve 20× compression with mathematically guaranteed smoothness.

Method

Overall Architecture

Action sequence \(a_{1:T}\) (\(T\) steps × \(D\) degrees of freedom) → B-spline fitting (ridge regression \(\mathbf{c} = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T a\), Cox–de Boor basis functions) → \(N\) control points \(C \in \mathbb{R}^{D \times N}\) → uniform quantization to [0, 255] → interleaved flattening into fixed-length tokens → Transformer/VLM decoder generation → dequantization + B-spline reconstruction to recover continuous actions.
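
The fitting and reconstruction stages of this pipeline can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the uniform clamped knot vector, the value of `lam`, and the synthetic sine-wave actions are all assumptions made for the example. Control points are stored here as an `(N, D)` array, the transpose of \(C \in \mathbb{R}^{D \times N}\).

```python
import numpy as np

def clamped_knots(n_basis, degree):
    """Clamped uniform knot vector on [0, 1] with n_basis + degree + 1 knots."""
    inner = np.linspace(0.0, 1.0, n_basis - degree + 1)[1:-1]
    return np.concatenate([np.zeros(degree + 1), inner, np.ones(degree + 1)])

def bspline_basis(u, n_basis, degree):
    """Basis matrix Phi of shape (len(u), n_basis) via Cox-de Boor recursion."""
    t = clamped_knots(n_basis, degree)
    # Degree-0 basis: indicator of each half-open knot span [t_i, t_{i+1}).
    B = np.zeros((len(u), len(t) - 1))
    for i in range(len(t) - 1):
        B[:, i] = (u >= t[i]) & (u < t[i + 1])
    # Assign u == 1 to the last non-empty span so the right endpoint is covered.
    B[u == 1.0, n_basis - 1] = 1.0
    for p in range(1, degree + 1):
        B_next = np.zeros((len(u), len(t) - 1 - p))
        for i in range(len(t) - 1 - p):
            left = t[i + p] - t[i]
            right = t[i + p + 1] - t[i + 1]
            if left > 0:   # skip terms with zero-width support (repeated knots)
                B_next[:, i] += (u - t[i]) / left * B[:, i]
            if right > 0:
                B_next[:, i] += (t[i + p + 1] - u) / right * B[:, i + 1]
        B = B_next
    return B

def fit_control_points(actions, n_basis=10, degree=3, lam=1e-6):
    """Ridge regression c = (Phi^T Phi + lam I)^-1 Phi^T a, all DoFs at once."""
    T = actions.shape[0]
    Phi = bspline_basis(np.linspace(0.0, 1.0, T), n_basis, degree)
    gram = Phi.T @ Phi + lam * np.eye(n_basis)
    return np.linalg.solve(gram, Phi.T @ actions), Phi

# Synthetic example: T = 100 steps, D = 7 degrees of freedom (assumed values).
T, D = 100, 7
u = np.linspace(0.0, 1.0, T)
actions = np.stack([np.sin(2 * np.pi * u + k) for k in range(D)], axis=1)
C, Phi = fit_control_points(actions)       # C has shape (10, 7)
recon = Phi @ C                            # reconstructed (T, D) trajectory
mse = float(np.mean((recon - actions) ** 2))
print(C.shape, mse)
```

Because the Gram matrix is only \(N \times N\), the solve is cheap and shared across all degrees of freedom, which is one way to read the claim that the fit is fast and parallelizable.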

Key Designs

  1. B-Spline Parameterization + Ridge Regression:

    • Function: Compactly represent \(T\)-step action sequences using \(N\) control points.
    • Mechanism: Select \(N=10\) B-spline basis functions (degree 3), compute the basis matrix \(\Phi\) via Cox–de Boor recursion, and solve for control points via ridge regression. Each degree of freedom is solved independently (parallelizable).
    • Design Motivation: \(N \ll T\) (e.g., 10 vs. 100) achieves 10–20× compression. The local support property of B-splines ensures that modifying one control point affects only a local portion of the trajectory.
  2. Clamped B-Splines (Inter-Chunk Continuity):

    • Function: Mathematically guarantee discontinuity-free transitions between consecutive action chunks.
    • Mechanism: The first control point \(c_0\) of the current chunk is fixed to the last action value of the previous chunk. Its contribution is subtracted from the target to form the residual \(\hat{a} = a - c_0\Phi_0^P\), and the remaining control points are fitted to that residual.
    • Design Motivation: Existing methods rely on temporal blending to smooth inter-chunk transitions—this is post-processing rather than a guarantee. Clamped B-splines provide mathematical \(C^0\) continuity by construction.
  3. Uniform Quantization + Interleaved Flattening:

    • Function: Convert control points into discrete tokens processable by a Transformer.
    • Mechanism: Control point values are normalized to [0, 255] (8-bit uniform quantization) and arranged in an interleaved order by basis function, so that adjacent tokens correspond to different degrees of freedom but the same temporal segment.
    • Design Motivation: Interleaved arrangement allows the Transformer to exploit dependencies among different degrees of freedom within the same temporal segment.
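
The quantization and interleaving step in design 3 can be sketched as follows. This is an illustrative sketch only: the helper names and the normalization bounds `lo` and `hi` are assumptions (the summary does not say how the bounds are chosen), and the tiny \(2 \times 3\) control-point array is made up for the example.

```python
import numpy as np

def quantize(C, lo, hi, n_bins=256):
    """Map control points in [lo, hi] to integer tokens in [0, n_bins - 1]."""
    scaled = (np.clip(C, lo, hi) - lo) / (hi - lo)
    return np.rint(scaled * (n_bins - 1)).astype(np.int64)

def dequantize(tokens, lo, hi, n_bins=256):
    """Inverse map from tokens back to (approximate) control-point values."""
    return lo + tokens.astype(np.float64) / (n_bins - 1) * (hi - lo)

def interleave(tokens_DN):
    """Flatten (D, N) tokens so all DoFs of basis 0 come first, then basis 1, ..."""
    return tokens_DN.T.reshape(-1)

def deinterleave(flat, D, N):
    """Undo interleave: recover the (D, N) token array."""
    return flat.reshape(N, D).T

# Toy control points: D = 2 degrees of freedom, N = 3 basis functions.
C = np.array([[-1.0, 0.0, 1.0],
              [0.5, -0.5, 0.25]])
tokens = quantize(C, lo=-1.0, hi=1.0)
flat = interleave(tokens)   # order: dof0/basis0, dof1/basis0, dof0/basis1, ...
C_hat = dequantize(deinterleave(flat, D=2, N=3), lo=-1.0, hi=1.0)
max_err = float(np.max(np.abs(C_hat - C)))
print(flat, max_err)
```

With 256 bins the round-trip error per control point is at most half a bin width, \((hi - lo) / (2 \cdot 255)\), as long as values stay inside the chosen bounds.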

Loss & Training

  • Discrete tokens: cross-entropy loss; continuous variant (BEAST-CT): ELBO.
  • Compatible with multiple architectures including decoder-only Transformers + CLIP, ACT (CVAE), and Florence-2 VLM.
  • Supports both autoregressive and parallel decoding modes.

Key Experimental Results

Main Results

| Benchmark | Method | Success Rate | Rank / Comparison |
|---|---|---|---|
| LIBERO-Long | BEAST | 86.4% | #1 |
| LIBERO-Long | π₀ | 79.6% | #2 |
| LIBERO Average | BEAST | 92.5% | π₀: 94.2% (#1) |
| CALVIN ABC→D (5 tasks) | BEAST | 74.4% | Close to VPP (75.0%) |
| ALOHA Bimanual | BEAST-ACT | 70% | ACT: 49% (+21%) |
| Franka Challenge | BEAST-D | 76.57% | π₀: 53.43% |

Efficiency Comparison

| Method | Throughput (Hz) | Latency (s) | vs. BEAST |
|---|---|---|---|
| BEAST-F | 617.3 | 0.019 | |
| π₀ | 288.1 | 0.103 | 0.47× |
| FAST | | | |
| OpenVLA | 6.1 | 0.164 | 0.01× |
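
The speedup figures quoted in this summary follow directly from the throughput column. A small check, using only the numbers in the table (the dictionary keys and the `speedup` helper are names chosen for this example):

```python
# Throughput values in Hz, copied from the efficiency table.
throughput_hz = {"BEAST-F": 617.3, "pi0": 288.1, "OpenVLA": 6.1}

def speedup(fast, slow):
    """How many times higher the throughput of `fast` is than that of `slow`."""
    return throughput_hz[fast] / throughput_hz[slow]

for name in ("pi0", "OpenVLA"):
    s = speedup("BEAST-F", name)
    # 1 / s is the "vs. BEAST" column: the other method's throughput relative to BEAST-F.
    print(f"BEAST-F vs {name}: {s:.2f}x faster; {name} runs at {1 / s:.2f}x BEAST-F")
```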

Ablation Study

| Variant | CALVIN Avg. Length | Notes |
|---|---|---|
| BEAST-F (N=10) | 4.43 | Optimal |
| BEAST-F (N=5) | 3.88 | Too few basis functions (−12%) |
| BEAST-F (N=15) | 4.20 | Diminishing returns |
| Binning-F | 1.41 | 68% worse (no compression) |
| BEAST-CT (continuous) | 3.88 | Slightly below discrete |

Key Findings

  • 20× B-spline compression directly translates to inference speedup—617 Hz meets real-time control requirements.
  • The clamped design raises the ALOHA bimanual success rate by 21 percentage points (70% vs. 49% for ACT) by eliminating failures caused by inter-chunk discontinuities.
  • Ranks #1 on LIBERO-Long (longest sequences)—B-splines are particularly effective for long-horizon tasks.
  • Training convergence is also faster—80% success rate reached at 20K steps (vs. ~20% for π₀ at the same point).
  • 1D toy experiment: BEAST MSE 0.0004 vs. Binning 0.0215 (roughly 50× lower error).

Highlights & Insights

  • B-splines are an ideal action representation: fixed length + continuous smoothness + high compression + fast fitting—all four requirements satisfied simultaneously, with no training required.
  • Mathematical elegance of the clamped design: fixing the first control point guarantees inter-chunk continuity—a minimal constraint that yields a critical quality guarantee.
  • 101× speedup vs. OpenVLA underscores that action representation efficiency is critical for real-time deployment.
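
The continuity claim for the clamped design follows from the endpoint property of clamped B-spline bases. A short derivation, writing \(\Phi_i^P\) for the \(i\)-th degree-\(P\) basis function and \(\Phi_{1:}\) for the basis matrix with its first column removed (a label introduced here for illustration, not taken from the paper):

\[
y(u) = \sum_{i=0}^{N-1} c_i\,\Phi_i^P(u), \qquad \Phi_0^P(0) = 1,\quad \Phi_i^P(0) = 0 \ \ (i \ge 1) \;\Longrightarrow\; y(0) = c_0 .
\]

\[
\hat{a} = a - c_0\,\Phi_0^P, \qquad c_{1:N-1} = (\Phi_{1:}^T \Phi_{1:} + \lambda I)^{-1}\,\Phi_{1:}^T\,\hat{a} .
\]

Since \(c_0\) is set to the last action of the previous chunk, the new chunk starts exactly where the old one ended, whatever values the remaining control points take. This is \(C^0\) continuity only: the initial slope of a clamped spline is proportional to \(c_1 - c_0\), so matching velocities across chunks would need an additional constraint on \(c_1\).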

Limitations & Future Work

  • The number of basis functions \(N\) must be selected manually, depending on trajectory smoothness and sampling rate.
  • May underfit abrupt motions (e.g., collision responses)—B-splines are inherently biased toward smoothness.
  • Uniform quantization may be less accurate than adaptive quantization for high-precision tasks.
  • Real-world success rates (52–76%) still leave room for improvement.

Related Work & Comparison

  • vs. FAST (BPE): Variable-length tokens are unfavorable for parallel decoding; BEAST uses fixed-length tokens.
  • vs. VQ-VAE: Requires separately trained codebooks; BEAST requires no training (purely analytic).
  • vs. ACT: BEAST-ACT integrates B-splines into the ACT framework, improving ALOHA success rate by 21 percentage points (49% → 70%).
  • vs. RT-2/Octo/OpenVLA: These methods use per-step binning; BEAST achieves 4–8× compression.
  • vs. π₀: Flow matching generates continuous actions; BEAST achieves comparable effectiveness more simply via B-splines and discrete tokens.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ B-spline action tokenizer is original and elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Simulation + real robot + efficiency comparison + ablation.
  • Writing Quality: ⭐⭐⭐⭐⭐ Method derivation is clear; experiments are comprehensive.
  • Value: ⭐⭐⭐⭐⭐ Likely to become a standard action representation for robot imitation learning.

| Benchmark / Metric | Method | Result | Notes |
|---|---|---|---|
| CALVIN ABC-D | BEAST-D | 74.4% (5-task success rate) | SOTA |
| CALVIN ABCD-D | BEAST-D | 84.8% (5-task success rate) | SOTA |
| LIBERO-LONG | BEAST-F (0.77B) | Competitive | vs. π₀ (3.3B) |
| Inference Speed | BEAST | 617 Hz | vs. OpenVLA 6.1 Hz (101×) |
| Inference Latency | BEAST | 19 ms | vs. π₀ 40 ms (2.1×) |

Ablation Study

| Configuration | Key Finding | Notes |
|---|---|---|
| No. of control points N | 5–10 optimal | Too many → overfitting; too few → underfitting |
| B-spline degree P | P=3 optimal | Cubic spline is the standard choice |
| Compression ratio | 4–8× vs. per-step binning | 20× fewer tokens (toy task) |
| Parallel vs. autoregressive | Comparable accuracy, significantly faster | Action-level parallelism is feasible |
| Real-world | 52.86% (Franka), 70% (ALOHA) | Successful sim-to-real transfer |
| Training efficiency | 80% at 20K steps vs. π₀'s 20% | ~4× faster convergence |

Key Findings

  • 101× inference speedup is the most prominent result—arising from parallel decoding combined with compression.
  • B-spline smoothness guarantees eliminate inter-chunk action discontinuities—particularly important for high-frequency control (100+ Hz).
  • No tokenizer training is required—completely avoiding the problem of VQ tokenizers needing retraining on target domains.

Highlights & Insights

  • "The best tokenizer is one that requires no training": B-spline fitting is purely mathematical (ridge regression), avoiding the difficulty of joint tokenizer–policy optimization.
  • Correctly exploiting the continuity prior of actions: Prior work treats actions as discrete sequences, discarding the smoothness prior. B-splines encode it naturally.
  • Feasibility of parallel decoding: Text generation must be autoregressive (each word depends on prior words), but action generation need not be—inter-control-point dependencies can be handled internally by the model.
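
Why a fixed token count matters for parallel decoding can be illustrated with a toy forward-pass counter. Everything here is hypothetical: `DummyPolicy` stands in for a Transformer and emits random tokens, and the sequence length of \(10 \times 7\) assumes \(N = 10\) control points and 7 degrees of freedom. Only the number of forward passes is meaningful.

```python
import random

VOCAB = 256        # 8-bit quantization bins
SEQ_LEN = 10 * 7   # N control points x D degrees of freedom (assumed values)

class DummyPolicy:
    """Stand-in for a Transformer; counts forward passes instead of learning."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.forward_passes = 0

    def forward(self, prefix, n_positions):
        """Return one predicted token for each requested output position."""
        self.forward_passes += 1
        return [self.rng.randrange(VOCAB) for _ in range(n_positions)]

def decode_autoregressive(policy):
    """One forward pass per token, each conditioned on the tokens so far."""
    tokens = []
    for _ in range(SEQ_LEN):
        tokens += policy.forward(tokens, 1)
    return tokens

def decode_parallel(policy):
    """All positions predicted in a single pass; needs a known, fixed length."""
    return policy.forward([], SEQ_LEN)

ar_policy, par_policy = DummyPolicy(), DummyPolicy()
ar_tokens = decode_autoregressive(ar_policy)
par_tokens = decode_parallel(par_policy)
print(ar_policy.forward_passes, par_policy.forward_passes)
```

A variable-length tokenizer cannot use the single-pass path directly, because the number of output positions is not known before decoding begins.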