MaskControl: Spatio-Temporal Control for Masked Motion Synthesis¶
Conference: ICCV 2025 arXiv: 2410.10780 Code: Project Page Area: Motion Generation / Controllable Generation Keywords: Masked Motion Model, Joint Control, Logits Optimization, Differentiable Sampling, Zero-Shot Goal Control
TL;DR¶
MaskControl is the first work to introduce spatial controllability into generative masked motion models. It manipulates the logits of the token classifier via two core components — Logits Regularizer (implicit alignment during training) and Logits Optimization (explicit optimization during inference) — simultaneously achieving high-quality motion generation (FID reduced by 77%) and high-precision joint control (average error 0.98 cm vs. TLControl's 1.08 cm).
Background & Motivation¶
Text-driven human motion generation has broad applications in animation, film, and VR/AR. However, textual descriptions cannot precisely specify the spatial positions of particular joints (e.g., pelvis, hands), so natural scene interaction and 3D spatial navigation remain challenging.
Limitations of Prior Work:
Sparse vs. dense control dilemma: Some models excel at sparse waypoint navigation (GMD, Trace and Pace) while others handle dense trajectory following (TLControl), but none supports both simultaneously.
Quality–precision trade-off: OmniControl supports arbitrary joint control but with low precision (average error 3.38 cm); TLControl achieves high precision via test-time optimization (1.08 cm) but at the cost of generation quality (FID 0.271).
Diffusion-model dependency: Existing controllable motion generation methods are almost exclusively based on diffusion models, suffering from redundancy in motion space, high computational overhead, and slow generation.
Lack of zero-shot generalization: Existing methods cannot adapt to arbitrary objective functions at inference time.
Core Insight: Masked motion models generate motion sequences by training a multi-class token classifier and sampling from the learned categorical distribution. The core idea of this paper is: by implicitly and explicitly manipulating the classifier's logits, the token distribution can be aligned to the input joint control signals, enabling precise control without sacrificing generation quality.
Method¶
Overall Architecture¶
MaskControl is built upon a generative masked motion model and comprises three core designs:
- Motion Tokenizer: Quantizes motion sequences into discrete tokens.
- Logits Regularizer: Implicitly perturbs logits during training to align with control signals.
- Logits Optimization: Explicitly optimizes logits at inference time to minimize control error.
- DES (Differentiable Expectation Sampling): A key technique that resolves the non-differentiability of categorical sampling.
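The tokenizer's core operation can be sketched as a nearest-neighbor lookup into a learned codebook. A minimal NumPy sketch, assuming a per-frame latent embedding; the shapes and toy data are illustrative, not the paper's actual encoder:

```python
import numpy as np

def quantize(frames, codebook):
    """Map each motion frame embedding to its nearest codebook entry.

    frames:   (T, d) array of encoder outputs, one per motion frame.
    codebook: (K, d) array of learned code vectors.
    Returns discrete token indices (T,) and the quantized embeddings (T, d).
    """
    # Squared Euclidean distance between every frame and every code.
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = d2.argmin(axis=1)          # discrete token per frame
    return tokens, codebook[tokens]     # embeddings fed to the decoder

# Toy usage: 4 frames built from known codes, plus small noise.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 3))
frames = codebook[[2, 2, 5, 7]] + 0.01 * rng.normal(size=(4, 3))
tokens, _ = quantize(frames, codebook)
print(tokens)  # token indices, one per frame
```

The discrete tokens are what the masked transformer classifies over; the decoder side of the tokenizer maps quantized embeddings back to motion.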
Key Designs¶
1. Logits Regularizer (Training Stage)¶
Inspired by ControlNet's approach of injecting control signals into pretrained models, this work introduces that design into masked generative models for the first time:
- Architecture: A trainable copy of the pretrained masked motion model, connected to the original model layer-by-layer via zero-initialized linear layers.
- Dual conditioning: Text \(W\) influences tokens via attention; joint control signals \(S\) are added directly to the token sequence via a projection layer.
- Motion consistency loss: Measures alignment between the generated motion and the joint control signals: \(L_s(e_c, s) = \frac{\sum_n \sum_j \sigma_{nj} \odot \|s_{nj} - R(D(e_c))\|}{\sum_n \sum_j \sigma_{nj}}\), where \(\sigma_{nj}\) is a binary indicator of whether joint \(j\) at frame \(n\) is controlled, \(D(\cdot)\) is the tokenizer decoder, and \(R(\cdot)\) converts local coordinates to global coordinates.
- Logits consistency loss: Computes negative log-likelihood over all positions (including unmasked positions): \(\mathcal{L}_{\text{logits}} = -\sum_{i=1}^{L} \log p(x_i \mid X_{\overline{M}}, W, S)\)
- Total loss: \(\mathcal{L} = \alpha \mathcal{L}_{\text{logits}} + (1-\alpha) L_s(e_c, s)\)
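The two training losses and their weighted combination can be sketched as follows. A minimal NumPy sketch, assuming the decoder \(D\) and local-to-global conversion \(R\) have already produced global joint positions; all shapes and the weight `alpha` are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def motion_consistency_loss(pred_joints, target_joints, mask):
    """L_s: masked mean distance between generated and target joint positions.

    pred_joints, target_joints: (N, J, 3) global joint positions.
    mask: (N, J) binary indicator of which (frame, joint) pairs are controlled.
    """
    err = np.linalg.norm(pred_joints - target_joints, axis=-1)  # (N, J)
    return (mask * err).sum() / np.maximum(mask.sum(), 1)

def logits_consistency_loss(logits, tokens):
    """NLL of the ground-truth tokens over ALL positions, masked or not."""
    p = softmax(logits)                                    # (L, K)
    return -np.log(p[np.arange(len(tokens)), tokens] + 1e-12).mean()

def total_loss(logits, tokens, pred_joints, target_joints, mask, alpha=0.5):
    """Weighted combination of the logits and motion consistency terms."""
    return (alpha * logits_consistency_loss(logits, tokens)
            + (1 - alpha) * motion_consistency_loss(pred_joints, target_joints, mask))
```

Note that the motion term only penalizes the (frame, joint) pairs selected by the control mask, while the logits term supervises every token position.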
2. Logits Optimization (Inference Stage)¶
Gradient-based optimization is applied at inference time to further improve control precision, by directly modifying logits to shift the token distribution:
Key advantages:
- Requires no additional pretraining and can handle arbitrary objective functions.
- Iterative optimization is performed at every step of the unmasking process.
- At the final step, codebook embeddings can be directly optimized: \(e_{m+1} = e_m - \eta \nabla_{e_m} L_s(e_m, s)\)
- Replacing \(L_s\) with any differentiable loss enables zero-shot goal control.
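The mechanism can be illustrated with a toy version of the optimization loop. In this sketch (an assumption for illustration, not the paper's implementation) the "motion" is just the expected codebook vector \(\mathbb{E}[x] = \mathrm{softmax}(\ell)\, C\), the control loss is a squared distance to a target, and the gradient with respect to the logits is computed analytically:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def optimize_logits(logits, codebook, target, eta=0.2, steps=300):
    """Shift a token distribution so its expected embedding matches a target.

    Toy stand-in for Logits Optimization: loss = ||softmax(l) @ C - target||^2.
    Gradient descent is performed directly on the logits l, moving probability
    mass toward tokens whose codebook vectors reduce the control error.
    """
    l = logits.copy()
    for _ in range(steps):
        p = softmax(l)
        e = p @ codebook                     # expected embedding (DES-style)
        residual = 2 * (e - target)          # dL/dE[x]
        grad_p = codebook @ residual         # dL/dp_k
        grad_l = p * (grad_p - p @ grad_p)   # apply the softmax Jacobian
        l -= eta * grad_l                    # gradient step on the logits
    return l
```

Swapping the squared-distance loss for any other differentiable objective is what enables zero-shot goal control.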
3. Differentiable Expectation Sampling (DES)¶
Both the Logits Regularizer and Optimization require backpropagating through categorical sampling, which is non-differentiable. DES addresses this with two techniques:
- Gumbel-Softmax reparameterization: Enables differentiable sampling via the Straight-Through estimator: \(p_\theta(x_k \mid X_{\overline{M}}, W, S) = \frac{\exp((\ell_k + g_k) / \tau)}{\sum_{j=1}^K \exp((\ell_j + g_j) / \tau)}\)
- Token expectation (replacing argmax): Uses a probability-weighted average of codebook vectors instead of a non-differentiable argmax lookup: \(\mathbb{E}[X_{\text{recon}}] = \sum_{k=1}^K p_\theta(x_k \mid X_{\overline{M}}, W, S) \cdot c_k\)
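The two DES steps can be sketched together. A minimal NumPy sketch of the forward pass only (NumPy has no autograd; in practice this would live in an autodiff framework, and `tau` is a tunable temperature):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Relaxed categorical sample: softmax over Gumbel-perturbed logits.

    logits: (L, K) classifier logits; tau: temperature (lower = closer to
    a hard one-hot sample). Returns soft one-hot probabilities (L, K).
    """
    rng = rng or np.random.default_rng()
    z = (logits + rng.gumbel(size=logits.shape)) / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def expected_token_embedding(logits, codebook, tau=1.0, rng=None):
    """DES: probability-weighted average of codebook vectors instead of argmax."""
    p = gumbel_softmax(logits, tau, rng)   # differentiable soft sample, (L, K)
    return p @ codebook                    # E[X_recon], (L, d)
```

Because the expectation is a weighted sum rather than an index lookup, gradients of a downstream motion loss can flow back through `p` into the logits.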
Loss & Training¶
| Loss | Role | Stage |
|---|---|---|
| \(\mathcal{L}_{\text{logits}}\) | Token classification consistency | Training |
| \(L_s\) | Motion–control signal alignment | Training + Inference |
| \(\mathcal{L}_{\text{VQ}}\) | Vector quantization reconstruction | Tokenizer |
Key Experimental Results¶
Main Results¶
Quantitative comparison on the HumanML3D dataset:
| Method | FID ↓ | Avg. Error (cm) ↓ | R-Precision ↑ | Traj. Error >50cm (%) ↓ | Zero-Shot Goals |
|---|---|---|---|---|---|
| GMD | 0.576 | 14.39 | 0.665 | 9.31 | - |
| OmniControl | 0.218 | 3.38 | 0.687 | 3.87 | ✗ |
| MotionLCM | 0.531 | 18.97 | 0.752 | 18.87 | ✗ |
| TLControl | 0.271 | 1.08 | 0.779 | 0.00 | ✗ |
| MaskControl | 0.061 | 0.98 | 0.809 | 0.00 | ✓ |
Key findings:
- FID reduced by 77% compared to the previous SOTA, TLControl (0.271 → 0.061), indicating a substantial improvement in motion quality.
- Average control error reduced to 0.98 cm, surpassing TLControl's 1.08 cm.
- Both trajectory error and positional error reduced to 0%.
- The only method supporting zero-shot objective-function control.
Ablation Study¶
Component analysis results:
| Configuration | FID ↓ | Avg. Error (cm) ↓ |
|---|---|---|
| No control | 0.095 | 63.18 |
| w/o Logits Regularizer | 0.142 | 2.18 |
| w/o Logits Optimization | 0.128 | 40.41 |
| Full model | 0.061 | 0.98 |
Key findings:
- Removing the Regularizer: control error remains acceptable (2.18 cm) but FID degrades most (0.142), demonstrating that the Regularizer is critical for generation quality.
- Removing the Optimization: FID is acceptable but control error spikes to 40.41 cm, demonstrating that Optimization is indispensable for precision.
- The two components are complementary: the Regularizer ensures quality; the Optimization ensures precision.
Zero-Shot Goal Control¶
On human–scene interaction (HSI) tasks (head height constraint, obstacle avoidance, region-bounded locomotion):
| Task | Method | Constraint Error ↓ | Failure Rate ↓ | FID ↓ |
|---|---|---|---|---|
| Head height constraint | ProgMoGen | 0.012 | 0.088 | 0.556 |
| Head height constraint | MaskControl | 0.000 | 0.000 | 0.246 |
MaskControl achieves zero constraint error across all three HSI tasks while significantly outperforming ProgMoGen in FID.
Highlights & Insights¶
- Paradigm innovation: The first work to introduce controllability into masked motion models, breaking the diffusion-model monopoly in controllable motion generation.
- Dual-stage optimization: The Regularizer provides a strong initialization; the Optimization further refines control — the two components work synergistically.
- Elegant design of DES: Using expectation approximation to resolve the non-differentiability of discrete sampling, serving as a bridge between the logits space and the motion space.
- Zero-shot generalization: Logits Optimization accepts arbitrary loss functions, eliminating the need to retrain for new tasks.
Limitations & Future Work¶
- Logits Optimization at inference requires multiple gradient iterations, increasing inference time.
- The temperature parameter \(\tau\) in Gumbel-Softmax requires careful tuning.
- Experiments are validated only on the HumanML3D dataset; performance on larger-scale motion datasets remains to be verified.
- Physical constraints (e.g., ground contact, collision detection) are not considered.
Related Work & Insights¶
- Text-driven motion generation: MDM, MoMask, MMM, and other masked/diffusion methods.
- Controllable motion generation: GMD (root joint) → OmniControl (arbitrary joints + ControlNet) → TLControl (test-time optimization).
- Masked generative models: MaskGIT (images) → MoMask/MMM (motion); this work introduces spatial control into this paradigm for the first time.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First to explore spatial controllability in masked motion models; the logits manipulation approach is original.
- Technical Depth: ⭐⭐⭐⭐⭐ — DES, dual-stage optimization, and zero-shot generalization are all complete and rigorous.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive multi-task, multi-metric evaluation with thorough ablation; limited to a single dataset.
- Value: ⭐⭐⭐⭐ — Supports arbitrary-joint, arbitrary-frame control and zero-shot goal control; broad application potential.