# MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space
- Conference: ICCV 2025
- arXiv: 2503.15451
- Code: Project Page
- Area: Motion Generation · Diffusion Models · Autoregressive
- Keywords: streaming motion generation, causal latent space, diffusion head, autoregressive, text-to-motion
## TL;DR
This paper proposes MotionStreamer, which integrates a continuous causal latent space with a diffusion head into an autoregressive framework for text-conditioned streaming human motion generation, supporting online multi-turn generation and dynamic motion composition.
## Background & Motivation
Streaming motion generation requires a model to produce coherent human motion incrementally as text instructions arrive, which is critical for real-time applications in gaming, animation, and robotics. Existing methods face the following core challenges:
- Fixed-length limitation of diffusion models: conventional diffusion-based motion generators (e.g., MDM, MLD) require a predefined motion length, so they cannot respond dynamically to online text inputs and lack incremental generation capability.
- Latency and error accumulation in GPT-based methods: autoregressive methods built on discrete VQ tokens (e.g., T2M-GPT, MotionGPT) use non-causal tokenizers, which prevent online decoding of partial token sequences and introduce high latency; quantization errors from discretization also accumulate over long autoregressive rollouts.
- Limitations of fixed-window methods: real-time methods such as DART rely on fixed-window local motion primitives and cannot model variable-length historical context.
The core mechanism of MotionStreamer is to integrate a diffusion head into an autoregressive framework to predict continuous motion latent variables, and to introduce a causal motion compressor for online decoding, thereby achieving streaming generation, online responsiveness, and long-term consistency simultaneously.
## Method
### Overall Architecture
MotionStreamer consists of three core components (as shown in Fig. 2):
- Pre-trained text encoder: Uses T5-XXL to extract text features \(T_i \in \mathbb{R}^{1 \times d_t}\)
- Causal Temporal Autoencoder (Causal TAE): Encodes raw motion sequences into continuous causal latent variable sequences
- Diffusion autoregressive model: Based on a Transformer with a diffusion head, autoregressively predicts the next motion latent variable in the causal latent space
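To make the interaction of these components concrete, below is a minimal inference-loop sketch. Every component name (`text_encoder`, `transformer`, `diffusion_head.sample`, `decoder`) is a hypothetical stand-in rather than the released API, and the stop test anticipates the continuous stop condition described later:

```python
import torch

def stream_generate(text_encoder, transformer, diffusion_head, decoder,
                    prompt, z_end, tau=1.0, max_latents=256):
    """Streaming inference loop (sketch; all module names are hypothetical).

    Each step: the causal Transformer conditions on the text feature and all
    previously generated latents, the diffusion head denoises the next
    continuous latent, and the causal decoder emits new frames immediately.
    """
    text_feat = text_encoder(prompt)            # T_i, shape (1, d_t)
    latents = []
    for _ in range(max_latents):
        c = transformer(text_feat, latents)     # condition for the next position
        z = diffusion_head.sample(c)            # next continuous latent in R^{d_c}
        if torch.norm(z - z_end) < tau:         # continuous stop condition
            break
        latents.append(z)
        yield decoder(z)                        # online decoding of new frames
```

Writing the loop as a generator reflects the streaming setting: frames are yielded as soon as each latent is decoded, rather than after the whole sequence is finished.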
### Key Design 1: Causal Temporal Autoencoder (Causal TAE)
The Causal TAE employs 1D causal convolutions to construct encoder \(\mathcal{E}\) and decoder \(\mathcal{D}\), ensuring causality through a dedicated temporal padding scheme: for a convolutional layer with kernel size \(k_t\), stride \(s_t\), and dilation rate \(d_t\), \((k_t - 1) \times d_t + (1 - s_t)\) frames are padded at the beginning of the sequence.
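As a concrete illustration, here is a minimal PyTorch sketch of such a causally padded convolution; the class name and hyperparameters are illustrative, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution made causal by left-padding only (illustrative class).

    Padding follows the scheme above: (kernel_size - 1) * dilation + (1 - stride)
    frames are prepended, so no output frame depends on future inputs.
    """
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation + (1 - stride)
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size,
                              stride=stride, dilation=dilation)

    def forward(self, x):                # x: (batch, channels, frames)
        x = F.pad(x, (self.pad, 0))      # pad the past (left) side only
        return self.conv(x)
```

For example, with `kernel_size=3`, `stride=1`, `dilation=1`, two frames are prepended, so output frame \(t\) only sees input frames \(\le t\); stacking strided layers of this form produces the temporal downsampling rate \(l\).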
Given a motion sequence \(X = \{x_1, \ldots, x_N\}\) (\(x_t \in \mathbb{R}^{272}\)), the Causal TAE outputs continuous latent variables \(Z = \{z_1, \ldots, z_{N/l}\}\) (\(z_i \in \mathbb{R}^{d_c}\)), where \(l=4\) is the temporal downsampling rate and \(d_c=16\) is the latent dimension.
The motion representation is a 272-dimensional per-frame vector of SMPL-based features, \(x = \{\dot{r}^x, \dot{r}^z, \dot{r}^a, j^p, j^v, j^r\}\) (root linear and angular velocities plus per-joint positions, velocities, and 6D rotations), which can directly drive an SMPL character without post-processing.
### Key Design 2: Diffusion Autoregressive Generator
Each training sample is represented as \(S_i = (T_i, C_i, Z_i)\), where \(C_i\) is the historical motion latent and \(Z_i\) is the current motion latent. These are concatenated along the temporal axis and fed into a Transformer with causal masking to obtain intermediate latent variables \(\{c_i^1, \ldots, c_i^n\}\), which are then passed to a diffusion head (a lightweight MLP) to predict the motion latents.
The training loss follows the standard diffusion (noise-prediction) objective, conditioned on the Transformer output:

\[
\mathcal{L}(z_i) = \mathbb{E}_{\varepsilon, t}\left[ \left\| \varepsilon - \varepsilon_\theta\!\left( z_i^t \mid t, c_i \right) \right\|^2 \right]
\]

where \(\varepsilon \sim \mathcal{N}(0, I)\) is the sampled noise, \(t\) is the diffusion timestep, \(z_i^t\) is the noised version of the ground-truth latent \(z_i\), and \(\varepsilon_\theta\) is the diffusion head.
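A minimal sketch of this objective, assuming a DDPM-style cosine noise schedule; the schedule and the `denoiser(z_t, t, c)` signature are assumptions, not the paper's exact choices:

```python
import torch

def diffusion_head_loss(denoiser, c, z0, num_steps=1000):
    """Noise-prediction loss on continuous motion latents (sketch).

    c:  (B, n, d)   per-position Transformer outputs c_i
    z0: (B, n, d_c) ground-truth next-latent targets z_i
    """
    B, n, _ = z0.shape
    t = torch.randint(0, num_steps, (B, n), device=z0.device)     # timesteps
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_steps) ** 2    # noise level
    alpha_bar = alpha_bar.unsqueeze(-1)                           # (B, n, 1)
    eps = torch.randn_like(z0)                                    # target noise
    z_t = alpha_bar.sqrt() * z0 + (1.0 - alpha_bar).sqrt() * eps  # noised latent
    return ((eps - denoiser(z_t, t, c)) ** 2).mean()              # MSE on noise
```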
### Key Design 3: Two-Forward Training Strategy
To mitigate exposure bias in autoregressive training, a two-forward strategy is proposed: the first forward pass uses ground-truth latents; the second replaces a portion of the ground-truth latents with predictions from the first pass for a mixed forward pass, with gradients backpropagated only through the second pass. The replacement ratio is controlled by a cosine scheduler.
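A sketch of one training step under this strategy; the `model` API and masking details are assumptions, but the gradient flow (only through the second pass) follows the paper:

```python
import torch

def two_forward_step(model, text, gt_latents, replace_ratio):
    """Two-forward training pass (sketch; `model` API is hypothetical).

    Pass 1 runs teacher-forced on ground-truth latents without gradients;
    pass 2 swaps a scheduled fraction of the inputs for pass-1 predictions
    and is the only pass that contributes gradients. `replace_ratio` is
    annealed by a cosine scheduler over training.
    """
    with torch.no_grad():                              # first forward: no grads
        pred = model(text, gt_latents)                 # next-latent predictions
    keep = torch.rand(gt_latents.shape[:2], device=gt_latents.device)
    mask = (keep < replace_ratio).unsqueeze(-1)        # positions to replace
    mixed = torch.where(mask, pred, gt_latents)        # mixed teacher/self input
    return model.loss(text, mixed, target=gt_latents)  # second forward: backprop
```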
### Continuous Stop Condition
An "impossible pose" (an all-zero vector) is encoded as a reference terminal latent variable. Generation terminates when the distance between the generated latent and this reference falls below a threshold, enabling automatic determination of generation length.
### Loss & Training
Causal TAE training uses the \(\sigma\)-VAE loss augmented with a root joint loss:

\[
\mathcal{L}_{\text{TAE}} = \mathcal{L}_{\text{recon}} + \lambda_{\text{KL}}\, \mathcal{L}_{\text{KL}} + \lambda_{\text{root}}\, \mathcal{L}_{\text{root}}
\]

where \(\mathcal{L}_{\text{recon}}\) reconstructs the full motion sequence, \(\mathcal{L}_{\text{KL}}\) regularizes the latent distribution toward a standard Gaussian, and \(\mathcal{L}_{\text{root}}\) adds extra supervision on the root joint to stabilize the global trajectory.
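A sketch of this objective; the L1 reconstruction and the weight values are assumptions, and only the three-term structure follows the paper:

```python
import torch

def tae_loss(x, x_hat, mu, logvar, root_idx, w_kl=1e-2, w_root=1.0):
    """Causal TAE objective (sketch): reconstruction + KL toward N(0, I)
    as in a sigma-VAE, plus a penalty on the root joint channels.

    root_idx selects the root-related dimensions of the 272-dim features.
    """
    recon = (x - x_hat).abs().mean()                             # reconstruction
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()   # KL divergence
    root = (x[..., root_idx] - x_hat[..., root_idx]).abs().mean()
    return recon + w_kl * kl + w_root * root
```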
## Key Experimental Results
### Main Results: Text-to-Motion Generation (Tab. 1)
| Method | FID ↓ | R@3 ↑ | MM-Dist ↓ | Diversity → |
|---|---|---|---|---|
| Real motion | 0.002 | 0.914 | 15.151 | 27.492 |
| MDM | 23.454 | 0.764 | 17.423 | 26.325 |
| T2M-GPT | 12.475 | 0.838 | 16.812 | 27.275 |
| MoMask | 12.232 | 0.846 | 16.138 | 27.127 |
| Ours | 11.790 | 0.859 | 16.081 | 27.284 |
MotionStreamer achieves the best FID, R@3, and MM-Dist among all methods, and its Diversity is the closest to that of real motion (→ marks metrics where closer to the real-motion value is better).
### Long-Sequence Generation (Tab. 2, BABEL Dataset)
| Method | Sub-seq FID ↓ | Trans. FID ↓ | Peak Jerk (PJ) → | Area Under Jerk (AUJ) ↓ |
|---|---|---|---|---|
| DoubleTake | 23.937 | 51.232 | 0.48 | 1.83 |
| FlowMDM | 18.736 | 34.721 | 0.06 | 0.51 |
| VQ-LLaMA | 24.342 | 36.293 | 0.08 | 1.20 |
| Ours | 15.743 | 32.888 | 0.04 | 0.90 |
For long-horizon generation, MotionStreamer achieves the lowest sub-sequence FID and transition FID of all methods; FlowMDM remains stronger on AUJ (0.51 vs. 0.90).
### Ablation Study (Tab. 3)
| Compressor | Recon. FID ↓ | MPJPE ↓ | Gen. FID ↓ |
|---|---|---|---|
| VQ-VAE | 5.173 | 63.9 mm | 13.226 |
| AE | 0.001 | 1.7 mm | 43.828 |
| VAE (non-causal) | 2.092 | 26.2 mm | 19.902 |
| Causal TAE | 0.661 | 22.9 mm | 11.790 |
Key finding: the AE achieves the best reconstruction quality but the worst generation performance (due to lack of latent space regularization); the Causal TAE achieves the best overall balance between reconstruction and generation.
## Key Findings
- The continuous latent space avoids information loss from VQ discretization, effectively reducing error accumulation.
- The causal structure of the latent space naturally aligns with the causal masking used in autoregressive generation.
- First-frame latency experiments demonstrate that the Causal TAE achieves the lowest first-frame latency, which does not grow with sequence length.
## Highlights & Insights
- The continuous and causal latent space design is the key innovation, simultaneously addressing both VQ error accumulation and online decoding.
- The Two-Forward strategy effectively mitigates exposure bias in autoregressive training while preserving parallel training efficiency.
- The continuous stop condition is more elegant than binary classifier-based stopping, avoiding class imbalance issues.
- The framework supports a rich set of applications including multi-turn generation, long-sequence generation, and dynamic motion composition.
## Limitations & Future Work
- The reliance on the SMPL skeleton representation makes it difficult to generalize directly to non-humanoid characters.
- The historical context length is bounded by the Transformer sequence length.
- Semantic drift remains present to some degree in long-sequence generation.
## Related Work & Insights
- Diffusion-based motion generation: MDM, MLD, MotionDiffuse
- Autoregressive motion generation: T2M-GPT, MotionGPT, MoMask
- Real-time control: CAMDM, AMDM, DART, CLoSD
## Rating
- Novelty: ★★★★☆ — The combination of a causal latent space and a diffusion head is a novel design.
- Technical Depth: ★★★★☆ — The Two-Forward strategy and continuous stop condition are elegantly designed.
- Experimental Thoroughness: ★★★★☆ — Comprehensive validation across multiple benchmarks with detailed ablations.
- Writing Quality: ★★★★☆ — Well-structured with expressive figures and tables.