BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning¶

Conference: NeurIPS 2025
arXiv: 2506.06072
Code: https://intuitive-robots.github.io/beast_website/
Area: Imitation Learning / Robotics
Keywords: Action Tokenizer, B-Spline, Parallel Decoding, Smooth Trajectory, Efficient Inference

TL;DR¶

BEAST parameterizes action sequences as B-splines: control points are estimated by ridge regression and uniformly quantized into a fixed-length token sequence. The reported results are 20× token compression (100 steps → 5 tokens), \(C^0\) continuity across action chunks guaranteed by construction, the highest success rate on LIBERO-Long (86.4%), and an inference throughput of 617 Hz (2.14× faster than π₀ and 101× faster than OpenVLA).

Background & Motivation¶

Background: Action representations in imitation learning directly affect policy quality and inference efficiency. VQ-VAE requires separately trained codebooks; FAST uses BPE to produce variable-length sequences; per-step discretization (binning) yields token counts proportional to sequence length.

Limitations of Prior Work: (a) VQ-VAE codebook training is decoupled from policy training, leading to potential misalignment; (b) FAST's variable-length tokens are ill-suited for parallel decoding; (c) per-step binning offers low compression—100 steps require 100 tokens; (d) no existing method guarantees smooth transitions between action chunks (requiring temporal blending as post-processing).

Key Challenge: Simultaneously satisfying high compression (fewer tokens = faster decoding), fixed length (parallel decoding), smooth transitions (no discontinuities), and high accuracy is fundamentally challenging.

Goal: Design an action tokenizer that satisfies all of the above requirements.

Key Insight: B-splines naturally provide a continuous and smooth representation; the number of control points is fixed (equal to the token count) and independent of the number of sampled steps; they can be fitted rapidly via ridge regression; and clamping guarantees inter-chunk continuity.

Core Idea: Fit B-splines to action sequences → uniformly quantize control points into fixed-length tokens → clamp the starting point to guarantee \(C^0\) continuity across chunks → achieve 20× compression with mathematically guaranteed smoothness.

Method¶

Overall Architecture¶

Action sequence \(a_{1:T}\) (\(T\) steps × \(D\) degrees of freedom) → B-spline fitting (ridge regression \(\mathbf{c} = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T a\), Cox–de Boor basis functions) → \(N\) control points \(C \in \mathbb{R}^{D \times N}\) → uniform quantization to [0, 255] → interleaved flattening into fixed-length tokens → Transformer/VLM decoder generation → dequantization + B-spline reconstruction to recover continuous actions.

Key Designs¶

B-Spline Parameterization + Ridge Regression:
- Function: Compactly represent \(T\)-step action sequences using \(N\) control points.
- Mechanism: Select \(N=10\) B-spline basis functions (degree 3), compute the basis matrix \(\Phi\) via Cox–de Boor recursion, and solve for control points via ridge regression. Each degree of freedom is solved independently (parallelizable).
- Design Motivation: Because \(N \ll T\), each degree of freedom is compressed by a factor of \(T/N\): 10× for \(N=10\) and 20× for \(N=5\) at \(T=100\). The local support property of B-splines ensures that modifying one control point affects only a local portion of the trajectory.
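
The fitting step described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `bspline_basis` and `fit_control_points`, the clamped uniform knot vector, the normalized time grid on [0, 1], and the value of `lam` are all chosen for the example.

```python
import numpy as np

def bspline_basis(t, n_basis, degree):
    """Cox-de Boor basis matrix Phi of shape (len(t), n_basis) on [0, 1].

    Assumes a clamped, uniformly spaced knot vector (an illustrative choice).
    """
    t = np.asarray(t, dtype=float)
    inner = np.linspace(0.0, 1.0, n_basis - degree + 1)
    knots = np.concatenate([np.zeros(degree), inner, np.ones(degree)])
    n_intervals = len(knots) - 1
    # Degree 0: indicator of each half-open knot interval.
    B = np.zeros((len(t), n_intervals))
    for i in range(n_intervals):
        B[:, i] = (knots[i] <= t) & (t < knots[i + 1])
    B[t == 1.0, n_basis - 1] = 1.0  # close the last non-empty interval at t = 1
    # Raise the degree one step at a time (Cox-de Boor recursion).
    for d in range(1, degree + 1):
        B_next = np.zeros((len(t), n_intervals - d))
        for i in range(n_intervals - d):
            left = knots[i + d] - knots[i]
            right = knots[i + d + 1] - knots[i + 1]
            if left > 0:
                B_next[:, i] += (t - knots[i]) / left * B[:, i]
            if right > 0:
                B_next[:, i] += (knots[i + d + 1] - t) / right * B[:, i + 1]
        B = B_next
    return B

def fit_control_points(actions, n_basis=10, degree=3, lam=1e-6):
    """Ridge fit: actions has shape (T, D); returns control points (n_basis, D) and Phi."""
    T = actions.shape[0]
    Phi = bspline_basis(np.linspace(0.0, 1.0, T), n_basis, degree)
    gram = Phi.T @ Phi + lam * np.eye(n_basis)
    # A matrix right-hand side solves every degree of freedom independently.
    return np.linalg.solve(gram, Phi.T @ actions), Phi

# Synthetic 100-step, 2-DoF trajectory (made up for the example).
t = np.linspace(0.0, 1.0, 100)
actions = np.stack([np.sin(2 * np.pi * t), np.cos(np.pi * t)], axis=1)
control_points, Phi = fit_control_points(actions)
reconstruction = Phi @ control_points
max_error = np.abs(reconstruction - actions).max()
```

Here 100 steps per degree of freedom are summarized by 10 control points, and `Phi @ control_points` recovers the continuous trajectory at any sampling rate by evaluating the basis on a different time grid.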
Clamped B-Splines (Inter-Chunk Continuity):
- Function: Mathematically guarantee discontinuity-free transitions between consecutive action chunks.
- Mechanism: The first control point \(c_0\) of the current chunk is fixed to the last action value of the previous chunk. Its contribution is subtracted to form the residual \(\hat{a} = a - c_0\Phi_0^P\), where \(\Phi_0^P\) is the first degree-\(P\) basis function, and the remaining control points are fitted to this residual.
- Design Motivation: Existing methods rely on temporal blending to smooth inter-chunk transitions—this is post-processing rather than a guarantee. Clamped B-splines provide mathematical \(C^0\) continuity by construction.
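
The clamped fit can be written out as a worked equation. The indexing \(c_0, \dots, c_{N-1}\) and the symbol \(\Phi_{1:N-1}\) (the basis matrix with its first column removed) are notation introduced here for illustration; the ridge form mirrors the unclamped solution in the architecture overview.

\[
\hat{a} = a - c_0\,\Phi_0^P, \qquad
\mathbf{c}_{1:N-1} = \left(\Phi_{1:N-1}^T \Phi_{1:N-1} + \lambda I\right)^{-1} \Phi_{1:N-1}^T\, \hat{a}.
\]

For a clamped knot vector, the first basis function equals 1 at the start of the chunk and every other basis function equals 0 there, so the reconstructed trajectory begins exactly at \(c_0\), the last action of the previous chunk. This is the sense in which \(C^0\) continuity holds by construction rather than through post-hoc blending.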
Uniform Quantization + Interleaved Flattening:
- Function: Convert control points into discrete tokens processable by a Transformer.
- Mechanism: Control point values are normalized to [0, 255] (8-bit uniform quantization) and arranged in an interleaved order by basis function, so that adjacent tokens correspond to different degrees of freedom but the same temporal segment.
- Design Motivation: Interleaved arrangement allows the Transformer to exploit dependencies among different degrees of freedom within the same temporal segment.
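
The quantization and flattening steps can be sketched as below. The function names, the per-DoF min/max normalization bounds, and the sizes `D = 7`, `N = 10` are assumptions for illustration; the source only states that values are normalized to [0, 255] and interleaved by basis function.

```python
import numpy as np

def quantize(control_points, low, high, levels=256):
    """Map control points of shape (D, N) to integer tokens in [0, levels - 1]."""
    scaled = (control_points - low[:, None]) / (high - low)[:, None]
    return np.clip(np.rint(scaled * (levels - 1)), 0, levels - 1).astype(np.int64)

def dequantize(q, low, high, levels=256):
    """Inverse of quantize, up to half a quantization step."""
    return low[:, None] + q / (levels - 1) * (high - low)[:, None]

def interleave(q):
    """Flatten (D, N) so tokens are grouped by control point, then by DoF."""
    return q.T.reshape(-1)

def deinterleave(tokens, n_dof):
    """Recover the (D, N) layout from the interleaved token sequence."""
    return tokens.reshape(-1, n_dof).T

rng = np.random.default_rng(0)
D, N = 7, 10                                # assumed sizes for the example
C = rng.uniform(-1.0, 1.0, size=(D, N))     # stand-in control points
low, high = C.min(axis=1), C.max(axis=1)    # assumed per-DoF bounds
Q = quantize(C, low, high)
tokens = interleave(Q)                      # fixed length D * N
C_hat = dequantize(deinterleave(tokens, D), low, high)
step = (high - low) / 255                   # width of one quantization bin
```

With this layout the first `D` tokens hold every degree of freedom of the first control point, so neighbouring tokens describe the same temporal segment. The round-trip error per value is bounded by half a quantization step.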

Loss & Training¶

Discrete tokens: cross-entropy loss; continuous variant (BEAST-CT): ELBO.
Compatible with multiple architectures including decoder-only Transformers + CLIP, ACT (CVAE), and Florence-2 VLM.
Supports both autoregressive and parallel decoding modes.
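
For the discrete variant, the cross-entropy objective over a fixed-length token sequence can be sketched as follows. This is a generic formulation, not the authors' training code: the function name and the sizes (70 tokens, a 256-entry vocabulary) are assumptions for the example.

```python
import numpy as np

def token_cross_entropy(logits, targets):
    """Mean cross-entropy; logits has shape (L, V), integer targets has shape (L,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

L, V = 70, 256  # e.g. 10 control points x 7 DoF with 8-bit tokens (assumed sizes)
targets = np.arange(L) % V

# Uniform predictions give the maximum-entropy loss.
uniform_loss = token_cross_entropy(np.zeros((L, V)), targets)

# Confident, correct predictions drive the loss toward zero.
confident = np.zeros((L, V))
confident[np.arange(L), targets] = 50.0
confident_loss = token_cross_entropy(confident, targets)
```

Because the sequence length is fixed, all `L` positions can be scored in a single pass, which is what makes a parallel decoding mode possible alongside the autoregressive one.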

Key Experimental Results¶

Main Results¶

Benchmark	Method	Success Rate	Rank / Comparison
LIBERO-Long	BEAST	86.4%	#1
LIBERO-Long	π₀	79.6%	#2
LIBERO Average	BEAST	92.5%	π₀ 94.2% (#1)
CALVIN ABC→D (5 tasks)	BEAST	74.4%	Close to VPP 75.0%
ALOHA Bimanual	BEAST-ACT	70%	ACT 49% (+21 pp)
Franka Challenge	BEAST-D	76.57%	π₀ 53.43%

Efficiency Comparison¶

Method	Throughput (Hz)	Latency (s)	Throughput Relative to BEAST
BEAST-F	617.3	0.019	1×
π₀	288.1	0.103	0.47×
FAST	—	—	—
OpenVLA	6.1	0.164	0.01×
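
The speedup figures quoted for BEAST follow directly from the throughput column. A small check, with dictionary keys chosen only for this example:

```python
# Throughput values copied from the efficiency table (Hz).
throughput_hz = {"BEAST-F": 617.3, "pi0": 288.1, "OpenVLA": 6.1}

beast = throughput_hz["BEAST-F"]
# How many times faster BEAST-F is than each method.
speedup = {name: beast / hz for name, hz in throughput_hz.items()}
# Each method's throughput as a fraction of BEAST-F (the last table column).
relative = {name: hz / beast for name, hz in throughput_hz.items()}
```

Rounded to two decimals, `speedup["pi0"]` gives the 2.14× figure and `relative["pi0"]` the 0.47× entry; `speedup["OpenVLA"]` rounds to 101×.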

Ablation Study¶

Variant	CALVIN Avg. Length	Notes
BEAST-F (N=10)	4.43	Optimal
BEAST-F (N=5)	3.88	Too few basis functions (−12%)
BEAST-F (N=15)	4.20	Diminishing returns
Binning-F	1.41	68% worse (no compression)
BEAST-CT (continuous)	3.88	Below discrete (−12% vs. N=10)
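
The percentages in the notes column are relative drops in average length against the \(N=10\) baseline of 4.43, which can be verified from the table values:

\[
\frac{4.43 - 3.88}{4.43} \approx 12.4\%, \qquad
\frac{4.43 - 4.20}{4.43} \approx 5.2\%, \qquad
\frac{4.43 - 1.41}{4.43} \approx 68.2\%.
\]

The first value applies to the variants scoring 3.88, the second to \(N=15\), and the third to Binning-F.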

Key Findings¶

20× B-spline compression directly translates to inference speedup—617 Hz meets real-time control requirements.
BEAST-ACT gains 21 percentage points over ACT on ALOHA bimanual tasks (70% vs. 49%), attributed to the clamped design eliminating failures caused by inter-chunk discontinuities.
Ranks #1 on LIBERO-Long (longest sequences)—B-splines are particularly effective for long-horizon tasks.
Training convergence is also faster—80% success rate reached at 20K steps (vs. ~20% for π₀ at the same point).
1D toy experiment: BEAST MSE 0.0004 vs. Binning 0.0215 (roughly 54× lower error).

Highlights & Insights¶

B-splines are an ideal action representation: fixed length + continuous smoothness + high compression + fast fitting—all four requirements satisfied simultaneously, with no training required.
Mathematical elegance of the clamped design: fixing the first control point guarantees inter-chunk continuity—a minimal constraint that yields a critical quality guarantee.
101× speedup vs. OpenVLA underscores that action representation efficiency is critical for real-time deployment.

Limitations & Future Work¶

The number of basis functions \(N\) must be selected manually, depending on trajectory smoothness and sampling rate.
May underfit abrupt motions (e.g., collision responses)—B-splines are inherently biased toward smoothness.
Uniform quantization may be less accurate than adaptive quantization for high-precision tasks.
Real-world success rates (52–76%) still leave room for improvement.

Related Work & Insights¶

vs. FAST (BPE): Variable-length tokens are unfavorable for parallel decoding; BEAST uses fixed-length tokens.
vs. VQ-VAE: Requires separately trained codebooks; BEAST requires no training (purely analytic).
vs. ACT: BEAST-ACT integrates B-splines into the ACT framework, improving success rate by 21 percentage points (49% → 70%).
vs. RT-2/Octo/OpenVLA: These methods use per-step binning; BEAST achieves 4–8× compression.
vs. π₀: Flow matching generates continuous actions; BEAST achieves comparable effectiveness more simply via B-splines and discrete tokens.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ B-spline action tokenizer is original and elegant.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Simulation + real robot + efficiency comparison + ablation.
Writing Quality: ⭐⭐⭐⭐⭐ Method derivation is clear; experiments are comprehensive.
Value: ⭐⭐⭐⭐⭐ Likely to become a standard action representation for robot imitation learning.

Ablation Study¶

Configuration	Key Finding	Notes
No. of control points N	5–10 optimal	Too many → overfitting; too few → underfitting
B-spline degree P	P=3 optimal	Cubic spline is the standard choice
Compression ratio	4–8× vs. per-step binning	20× fewer tokens (toy task)
Parallel vs. autoregressive	Comparable accuracy, significantly faster	Action-level parallelism is feasible
Real-world	52.86% (Franka), 70% (ALOHA)	Successful sim-to-real transfer
Training efficiency	80% at 20K steps vs. π₀'s 20%	~4× higher success rate at the same training step

Key Findings¶

101× inference speedup is the most prominent result—arising from parallel decoding combined with compression.
B-spline smoothness guarantees eliminate inter-chunk action discontinuities—particularly important for high-frequency control (100+ Hz).
No tokenizer training is required—completely avoiding the problem of VQ tokenizers needing retraining on target domains.

Highlights & Insights¶

"The best tokenizer is one that requires no training": B-spline fitting is purely mathematical (ridge regression), avoiding the difficulty of joint tokenizer–policy optimization.
Correctly exploiting the continuity prior of actions: Prior work treats actions as discrete sequences, discarding the smoothness prior. B-splines encode it naturally.
Feasibility of parallel decoding: Text generation must be autoregressive (each word depends on prior words), but action generation need not be—inter-control-point dependencies can be handled internally by the model.
