# MoLingo: Motion-Language Alignment for Text-to-Human Motion Generation
- Conference: CVPR 2026
- arXiv: 2512.13840
- Code: https://hynann.github.io/molingo/MoLingo.html
- Area: Human Understanding
- Keywords: text-driven motion generation, semantically aligned latent space, cross-attention conditioning, autoregressive diffusion, continuous latent space
## TL;DR
MoLingo achieves comprehensive state-of-the-art performance on text-to-human motion generation, across FID, R-Precision, and user studies, by combining a Semantic Alignment Encoder (SAE) with multi-token cross-attention text conditioning and performing masked autoregressive rectified flow in a continuous latent space.
## Background & Motivation
Background: Text-driven human motion generation is a key technology for computer animation, AR/VR, and human-computer interaction. Current mainstream approaches fall into two categories: (1) diffusion directly in pose space (e.g., MDM), and (2) encoding into a latent space prior to diffusion (e.g., MLD, MARDM). The latter further splits into compressing entire sequences into a single latent vector versus autoregressively generating multiple latent vectors.
Limitations of Prior Work: Diffusion in pose space struggles with complex joint distributions and is prone to preserving mocap noise as artifacts. Single-vector latent diffusion discards important temporal details. VQ-based methods introduce quantization error by mapping continuous motion to a finite codebook, reducing the realism of fine-grained motion. Furthermore, existing text conditioning strategies—such as single-token conditioning or AdaLN modulation—have limited expressive capacity, constraining text–motion alignment quality.
Key Challenge: How can one construct a diffusion-friendly latent space in which semantically similar motions are geometrically proximate? How can text conditioning be injected more effectively so that generated motions are more faithful to textual descriptions?
Goal: Answer two core questions: (1) what kind of latent space is best suited for motion diffusion, and (2) how text conditioning should be injected most effectively.
Key Insight: Inspired by works such as REPA in image generation, the paper trains a semantically aligned motion encoder using frame-level text labels, so that semantically similar latent vectors are closer in the latent space. It also finds that multi-token cross-attention substantially outperforms single-token conditioning.
Core Idea: Align the semantic structure of the motion latent space using frame-level text labels, combined with multi-token cross-attention conditioning, and perform masked autoregressive rectified flow over the continuous latent space for motion generation.
## Method

### Overall Architecture
MoLingo comprises two core components: (1) a Semantic Alignment Encoder (SAE) that encodes an \(N\)-frame motion sequence into \(l = N/h\) continuous latent vectors \(m_{1:l} \in \mathbb{R}^{l \times d}\); and (2) a masked autoregressive Transformer with a rectified flow MLP, conditioned on multi-token T5-encoded text representations, which iteratively denoises latent vectors and decodes them back into motion sequences. Training proceeds in two stages: the SAE is trained first, followed by the autoregressive generation model.
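As a rough shape walk-through of this pipeline (a sketch under assumed values: \(h = 4\) and \(d = 16\) match the configuration reported in the ablations, the 263-D per-frame features follow the HumanML3D representation, and all tensors are random stand-ins rather than the authors' code):

```python
import torch

N, h, d = 196, 4, 16       # frames, temporal downsampling factor, latent channels
l = N // h                 # number of continuous latent vectors (49 here)

motion = torch.randn(1, N, 263)   # input pose sequence, 263-D features per frame
latents = torch.randn(1, l, d)    # SAE output m_{1:l}: shape (1, 49, 16)
text = torch.randn(1, 32, 1024)   # multi-token T5-Large embeddings (length illustrative)
```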
### Key Designs
- Semantic Alignment Encoder (SAE):
  - Function: Encodes motion sequences into a semantically rich, diffusion-friendly continuous latent space.
  - Mechanism: Extends a standard VAE by introducing frame-level text semantic alignment. Using frame-level text labels from the BABEL dataset, the method collects temporally aligned text labels for each motion latent vector \(m_j\), encodes them with a frozen text encoder, and averages the projections to obtain a class token \(\kappa_j\). A cosine similarity loss \(\mathcal{L}_\text{sem} = \frac{1}{|\mathcal{I}|}\sum_{i \in \mathcal{I}}\left(1 - \frac{m_i \cdot \kappa_i}{\|m_i\| \|\kappa_i\|}\right)\) pulls semantically similar motion latent vectors closer together. To prevent over-alignment caused by abundant repeated labels in BABEL, the cosine similarity \(\Delta_i\) between adjacent class tokens is computed and samples exceeding a threshold \(\tau\) are filtered out (a loss sketch follows this list).
  - Design Motivation: Semantic alignment simplifies the diffusion process: semantically similar motions are closer in the latent space, reducing the mapping complexity the diffusion model must learn. A soft cosine loss is preferred over InfoNCE because human motion is continuous and ambiguous; the hard contrastive constraints of InfoNCE are overly rigid in this setting.
- Multi-Token Cross-Attention Text Conditioning:
  - Function: More effectively injects textual descriptions into the motion generation process.
  - Mechanism: Text is encoded by T5-Large and further enhanced through an \(l_\text{adapter}=6\)-layer Transformer encoder adapter for cross-modal interaction, yielding a multi-token text representation \(\mathbf{w} = \{w_1, \ldots, w_{l_\text{text}}\}\). In the decoder-only Transformer, self-attention and MLP layers operate solely on motion latent vectors, while cross-attention uses motion latent vectors as queries and text tokens as keys and values (a decoder-block sketch follows this list). Ablation studies show this scheme substantially outperforms single-token conditioning with AdaLN modulation.
  - Design Motivation: Single-token conditioning compresses all textual information into a single vector, which lacks expressiveness. Experiments demonstrate that multi-token cross-attention yields significant improvements in both FID and R-Precision (holding the T5 encoder and autoencoder fixed, switching AdaLN to cross-attention reduces FID from 0.077 to 0.051; the full cross-attention models reach FID 0.049).
- Masked Autoregressive Rectified Flow Generation:
  - Function: High-quality autoregressive motion generation in a continuous latent space.
  - Mechanism: The joint distribution of latent vectors is factored via the chain rule: \(p(m_1,\ldots,m_l) = \prod_i p(m_i \mid c, m_1,\ldots,m_{i-1})\). During training, a random subset of latent vectors is masked; a decoder-only Transformer produces conditioning vectors \(z_i\), which are fed into an MLP \(v_\theta\) that approximates the reverse distribution using a rectified flow objective: \(\mathcal{L} = \mathbb{E}\left[\|v_\theta(m_i^t, t, z_i) - (\epsilon - m_i)\|^2\right]\) (a training-step sketch follows this list). At inference, generation begins from fully masked tokens and proceeds iteratively, with classifier-free guidance (scale = 5.5) applied to enhance conditional fidelity.
  - Design Motivation: The autoregressive formulation preserves richer temporal detail than single-vector diffusion; the continuous latent space avoids the quantization error of VQ-based discrete methods; and rectified flow offers higher sampling efficiency than standard diffusion.
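A minimal PyTorch sketch of the semantic alignment loss, including the adjacent-class-token filtering; the function name, tensor layout, and the value of `tau` are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def semantic_alignment_loss(latents: torch.Tensor,
                            class_tokens: torch.Tensor,
                            tau: float = 0.9) -> torch.Tensor:
    """L_sem: pull each motion latent m_i toward its class token kappa_i.

    latents:      (l, d) motion latent vectors m_{1:l} from the SAE encoder
    class_tokens: (l, d) class tokens kappa_{1:l}, i.e. averaged projections of
                  the frame-level BABEL labels encoded by a frozen text encoder
    tau:          redundancy threshold (illustrative value, not from the paper)
    """
    # Delta_i: cosine similarity between adjacent class tokens. Indices whose
    # Delta exceeds tau are dropped to avoid over-alignment on the abundant
    # repeated labels in BABEL.
    delta = F.cosine_similarity(class_tokens[1:], class_tokens[:-1], dim=-1)
    keep = torch.cat([torch.ones(1, dtype=torch.bool, device=latents.device),
                      delta <= tau])

    m, kappa = latents[keep], class_tokens[keep]
    # Soft cosine objective: mean of (1 - cos(m_i, kappa_i)) over kept indices.
    return (1.0 - F.cosine_similarity(m, kappa, dim=-1)).mean()
```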
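The conditioning scheme can be sketched as one decoder block in which self-attention and the MLP see only motion latents while cross-attention reads the multi-token text representation; layer sizes, names, and the pre-norm arrangement are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MotionDecoderBlock(nn.Module):
    """Decoder block with multi-token text conditioning (illustrative sizes)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, motion: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # motion: (B, l, d_model) latent tokens; text: (B, l_text, d_model)
        # tokens w_{1:l_text} from T5-Large plus the adapter.
        x = self.norm1(motion)
        motion = motion + self.self_attn(x, x, x, need_weights=False)[0]
        # Cross-attention: queries = motion latents, keys/values = text tokens.
        x = self.norm2(motion)
        motion = motion + self.cross_attn(x, text, text, need_weights=False)[0]
        return motion + self.mlp(self.norm3(motion))
```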
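The rectified flow objective amounts to regressing a constant velocity along the straight line between a clean latent and Gaussian noise. A minimal training-step sketch, where the `v_theta` call signature is an assumption:

```python
import torch

def rectified_flow_loss(v_theta, m: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """One training step of the per-token rectified flow objective.

    v_theta: MLP head taking (noisy latent m_i^t, timestep t, conditioning z_i)
    m:       (B, d) clean latent vectors m_i
    z:       (B, d_z) conditioning vectors from the masked AR Transformer
    """
    eps = torch.randn_like(m)                       # Gaussian noise sample
    t = torch.rand(m.shape[0], 1, device=m.device)  # t ~ U(0, 1)
    m_t = (1 - t) * m + t * eps                     # straight-line interpolation
    target = eps - m                                # d m_t / dt along that line
    return ((v_theta(m_t, t, z) - target) ** 2).mean()
```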
### Loss & Training
The total SAE loss is \(\mathcal{L}_\text{SAE} = \mathcal{L}_\text{recon} + \lambda_\text{sem}\mathcal{L}_\text{sem} + \lambda_\text{KL}\mathcal{L}_\text{KL}\), where \(\mathcal{L}_\text{recon}\) comprises three terms: feature reconstruction, joint position, and joint velocity. The SAE is trained for 5,000 epochs with a batch size of 256. The autoregressive model is trained for approximately 800 epochs with EMA for training stability; 10% of text inputs are replaced with null prompts for CFG. Total training time is approximately 10 hours on 4 H100 GPUs.
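A hedged sketch of the classifier-free guidance used at sampling time; the guidance scale 5.5 and the 10% null-prompt replacement are from the paper, while the function shape and names (`cfg_velocity`, `z_null`) are illustrative assumptions:

```python
import torch

def cfg_velocity(v_theta, m_t, t, z_cond, z_null, scale: float = 5.5):
    """Classifier-free guidance on the rectified-flow velocity.

    Because 10% of prompts are replaced with a null prompt during training,
    the same network yields both conditional and unconditional predictions;
    z_null is the conditioning vector obtained from the null prompt.
    """
    v_cond = v_theta(m_t, t, z_cond)    # text-conditioned velocity
    v_uncond = v_theta(m_t, t, z_null)  # unconditional velocity
    return v_uncond + scale * (v_cond - v_uncond)
```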
## Key Experimental Results

### Main Results
| Method | FID ↓ | R-Precision Top-1 ↑ | CLIP-Score ↑ | MModality ↑ |
|---|---|---|---|---|
| MDM | 0.518 | 0.440 | 0.578 | 3.604 |
| MLD | 0.431 | 0.461 | 0.610 | 3.506 |
| MoMask | 0.116 | 0.490 | 0.637 | 1.309 |
| ACMDM-XL | 0.058 | 0.522 | 0.652 | 2.077 |
| DisCoRD | 0.053 | 0.506 | 0.645 | 1.303 |
| MoLingo (VAE) | 0.049 | 0.528 | 0.672 | 1.414 |
| MoLingo (SAE) | 0.066 | 0.544 | 0.686 | 1.226 |
Under the MARDM-67 evaluation protocol, MoLingo (VAE) achieves the best FID, while MoLingo (SAE) achieves the best text–motion alignment in terms of R-Precision and CLIP-Score.
### Ablation Study
| Conditioning | Text Encoder | Autoencoder | FID ↓ | R-Precision Top-1 ↑ |
|---|---|---|---|---|
| AdaLN | CLIP | AE | 0.114 | 0.500 |
| AdaLN | T5 | AE | 0.077 | 0.508 |
| CrossAttn | T5 | VAE | 0.049 | 0.528 |
| CrossAttn | T5 | AE | 0.051 | 0.533 |
| CrossAttn | T5 | SAE | 0.066 | 0.544 |
### Key Findings
- CrossAttn vs. AdaLN: Holding the T5 encoder and AE fixed, multi-token cross-attention reduces FID from 0.077 to 0.051 and improves R-Precision Top-1 from 0.508 to 0.533; the best cross-attention variants reach FID 0.049 and R-Precision 0.544, a substantial gain.
- Text Alignment Effect of SAE: SAE outperforms both AE and VAE across R-Precision and CLIP-Score, but yields a slightly higher FID (0.066 vs. 0.049), indicating that semantic alignment trades a modest distribution-matching cost for stronger text faithfulness.
- Cosine Loss vs. InfoNCE: InfoNCE yields an FID of 0.129, far worse than the cosine loss result of 0.066, because the continuity of human motion makes hard contrastive constraints overly rigid.
- User Study: Users prefer MoLingo in 83.75% of comparisons against DisCoRD, 77.70% against MoMask, and 84.70% against MotionStreamer.
- Latent Configuration: A 4× temporal downsampling factor combined with a 16-dimensional latent space constitutes the optimal configuration; further increasing latent dimensionality hurts performance.
## Highlights & Insights
- Semantically Aligned Latent Space: Using frame-level text labels to guide the structure of the motion latent space—so that semantically similar motions are geometrically proximate—is a transferable idea applicable to other sequence generation tasks (e.g., audio, trajectory generation) wherever aligned semantic labels are available.
- Soft Cosine vs. Hard Contrastive Constraints: For continuous, ambiguous signals such as motion, soft constraints outperform hard contrastive objectives. This insight has broader implications for representation learning on continuous signals.
- Substantial Advantage of Multi-Token Cross-Attention: Single-token conditioning discards too much information; multi-token representations preserve the structured semantics of text. While this has been validated in T2I generation (e.g., DALL-E 3), this paper successfully transfers the insight to T2M.
## Limitations & Future Work
- The method generates only body-level motion and does not include fine-grained hand motion, which is a notable limitation for real-world applications.
- The SAE relies on frame-level annotations from the BABEL dataset, which have limited coverage and high label redundancy; scaling to larger datasets may require automatic annotation pipelines.
- A trade-off exists between FID and R-Precision (SAE achieves the best R-Precision but not the best FID); whether both can be simultaneously optimized remains an open question.
- The evaluation metrics themselves are debatable, as different protocols yield inconsistent results, motivating the need for more robust evaluation frameworks.
## Related Work & Insights
- vs. MARDM: MARDM uses single-token CLIP conditioning with AdaLN; MoLingo replaces these with T5 text encoding, multi-token cross-attention, and the SAE. The ablations show each of these three changes contributes a distinct performance gain.
- vs. MotionStreamer: MotionStreamer employs a 272D representation to avoid IK artifacts; MoLingo operates on both 263D and 272D representations and achieves superior performance in both cases.
- vs. VQ-based Methods (MoMask, DisCoRD): VQ-based methods perform better on diversity (MModality), while continuous latent space methods demonstrate clear advantages in realism (FID) and text alignment.
- The work is inspired by REPA (representation alignment in image generation), transplanting the semantic alignment idea into the motion generation domain.
## Rating
- Novelty: ⭐⭐⭐⭐ The frame-level semantic alignment in SAE is the core innovation; multi-token conditioning has precedent in image generation but is effectively transferred here.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four evaluation protocols (MARDM-67, TMR-263, MS-272, user study) with extensive ablations.
- Writing Quality: ⭐⭐⭐⭐ Problem-driven exposition is clear, with two core questions guiding a coherent progression.
- Value: ⭐⭐⭐⭐ State-of-the-art results combined with a transferable latent space design paradigm that meaningfully advances the motion generation field.