Miburi: Towards Expressive Interactive Gesture Synthesis

Conference: CVPR 2026
arXiv: 2603.03282
Code: Project Page
Area: Human Understanding
Keywords: Co-speech gesture generation, embodied dialogue agent, causal autoregression, real-time generation, residual vector quantization

TL;DR

Miburi is presented as the first online, causal framework for generating synchronized whole-body gestures and facial expressions in real time. It conditions directly on the internal token stream of the speech-text foundation model Moshi and generates motion tokens with a 2D causal Transformer.

Background & Motivation

Current LLM-based dialogue agents are not embodied and cannot produce expressive gestures. Existing co-speech gesture generation methods fall into two categories: (1) generative methods (diffusion/Transformer) produce natural and expressive gestures but require future speech context, which precludes real-time operation; (2) real-time systems (rule-based/lightweight networks) run online but produce stiff gestures with low diversity. The root difficulty is that causality (relying only on past inputs) and real-time operation (low latency) are hard to satisfy simultaneously. Traditional pipelines run LLM output → speech synthesis → audio encoding → gesture generation sequentially, so each stage adds latency.

Method

Overall Architecture

Miburi is built upon Moshi (a speech-text foundation model) and directly utilizes its internal speech/text token streams as conditioning signals. A body-part-aware gesture codec encodes motion into multi-level discrete tokens, which are then autoregressively generated by a 2D causal Transformer operating along both temporal and kinematic dimensions.

Key Designs

Body-Part-Aware Gesture Codecs: The full-body motion is decomposed into three regions: upper body + hands \(\mathbf{x}^u\), lower body + global displacement \(\mathbf{x}^l\), and facial expressions (FLAME parameters) \(\mathbf{x}^f\). Each region is independently encoded using a Residual VQ-VAE, whose encoder consists of downsampling 1D convolutions and a causal self-attention Transformer. The output is residually vector-quantized into multi-level tokens \(\mathbf{g}^b \in \mathbb{R}^{T \times K^b}\) (\(K^u=K^l=8, K^f=4\) levels). Each token represents 2 frames (0.08 seconds) of motion, aligned with Moshi's token rate. RVQ preserves fine-grained kinematic details more effectively than standard VQ-VAE.
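
The residual quantization step described above can be illustrated with a minimal sketch. It assumes a latent vector has already been produced by the encoder and shows only how a multi-level token is formed and decoded. The helper names (`nearest`, `rvq_encode`, `rvq_decode`, `make_codebooks`), the codebook size of 16, and the latent dimension of 4 are hypothetical choices for illustration and do not come from the paper; the codebooks are random rather than trained.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(latent, codebooks):
    """Quantize one latent vector into one token index per level."""
    residual = list(latent)
    tokens = []
    for codebook in codebooks:  # level 1 is quantized first, later levels see the leftover
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruct the latent as the sum of the selected entries across levels."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def make_codebooks(levels, size, dim, seed=0):
    """Random, untrained codebooks whose scale shrinks with level (illustrative only)."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0 / (2 ** k)) for _ in range(dim)]
             for _ in range(size)]
            for k in range(levels)]

# K^u = K^l = 8 levels in the text; size and dim are assumptions.
codebooks = make_codebooks(levels=8, size=16, dim=4)
latent = [0.3, -1.2, 0.7, 0.05]
tokens = rvq_encode(latent, codebooks)   # one index per level
recon = rvq_decode(tokens, codebooks)    # sum of the chosen entries
```

Each level quantizes what the previous levels left unexplained, which is why the first-level token carries the coarse motion and later levels carry refinements.
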
2D Causal Transformer: Temporal and kinematic dimensions are decoupled for prediction:
- Temporal Transformer \(\mathcal{T}_{\text{temporal}}\): 4 layers, 2 heads, causal self-attention (context of 25 tokens) + dual causal cross-attention (speech/text, context of 50 tokens), autoregressively predicting the first-level token \(\mathbf{g}_{(t,1)}\) at each time step. Embeddings across \(K\) levels are summed into a single input.
- Kinematic Transformer \(\mathcal{T}_{\text{kinematic}}\): 2 layers, 1 head, autoregressively predicting subsequent levels \(\mathbf{g}_{(t,k)}\) within a fixed time step \(t\), conditioned on temporal context \(\mathbf{h}_t\) and speech/text embeddings.

Compared to naively processing all \(T \cdot K\) tokens in a single stream, the 2D decomposition substantially reduces attention context length and inference latency.
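
The generation order implied by this decomposition can be sketched as a nested loop: one temporal step per time step, then a short pass over levels. The two predictor functions below are hypothetical arithmetic stand-ins, not neural networks, and exist only to make the control flow and the context sizes concrete. The window of 25 steps and the 0.08 s token duration are taken from the text; everything else is an assumption.

```python
from collections import deque

TEMPORAL_CONTEXT = 25     # past time steps seen by temporal self-attention (from the text)
SECONDS_PER_TOKEN = 0.08  # 2 frames per token (from the text)

def temporal_step(history, cond, vocab=16):
    """Hypothetical stand-in for the temporal Transformer.
    Sees only past steps and returns (first-level token, context h_t)."""
    h_t = sum(sum(step) for step in history) + cond
    return h_t % vocab, h_t

def kinematic_step(h_t, levels_so_far, cond, vocab=16):
    """Hypothetical stand-in for the kinematic Transformer: next level at a fixed t."""
    return (h_t + cond + 7 * len(levels_so_far) + sum(levels_so_far)) % vocab

def generate(conditioning, num_levels=8):
    """Causal generation: step t depends only on conditioning and tokens up to t."""
    history = deque(maxlen=TEMPORAL_CONTEXT)  # sliding window over past time steps
    output = []
    for cond in conditioning:                 # one conditioning token per time step
        first, h_t = temporal_step(history, cond)
        levels = [first]
        while len(levels) < num_levels:       # autoregress over levels 2..K
            levels.append(kinematic_step(h_t, levels, cond))
        history.append(levels)
        output.append(levels)
    return output

def context_lengths(window=TEMPORAL_CONTEXT, num_levels=8):
    """Self-attention context sizes: flattened single stream vs. 2D decomposition."""
    return {"single_stream": window * num_levels,
            "temporal": window,
            "kinematic": num_levels}

gesture_tokens = generate(list(range(30)))
sizes = context_lengths()
window_seconds = TEMPORAL_CONTEXT * SECONDS_PER_TOKEN  # about 2 seconds of motion
```

Under these assumptions, covering the same 25-step window in a single flattened stream with 8 levels would require attending over 200 tokens, whereas the decomposition attends over 25 temporal positions plus at most 8 levels.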

Expressiveness Enhancement Objectives:
- Contrastive InfoNCE Loss: Predicted tokens are mapped to differentiable latent representations via Gumbel-Softmax reparameterization: \(\mathbf{z} = \sum_k \text{GumbelSoftmax}(\tilde{\mathbf{o}}_k) \mathbf{C}_k\). An InfoNCE loss is applied over temporal segments to increase similarity between matching ground-truth–prediction pairs and decrease similarity for non-matching pairs.
- Voice Activity Loss: A binary classification head is attached to \(\mathbf{h}_t\) to distinguish listening from speaking states (BCE loss), suppressing phantom gestures during listening and enforcing speech-aligned expressive gestures during speaking.
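
A minimal sketch of the contrastive objective follows, assuming cosine similarity, a temperature of 0.1, and one pooled latent vector per temporal segment; none of these choices are stated in the summary above. The function names are hypothetical. The first part mirrors the formula for \(\mathbf{z}\) (a Gumbel-Softmax weighted mix of codebook rows, summed over levels); the second part is a standard InfoNCE over segments.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = rng.uniform(1e-9, 1.0 - 1e-9)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    return softmax(noisy)

def soft_latent(level_logits, codebooks, tau=1.0, rng=None):
    """z = sum_k GumbelSoftmax(o_k) C_k: a weighted mix of codebook rows per level."""
    rng = rng or random.Random(0)
    z = [0.0] * len(codebooks[0][0])
    for logits, codebook in zip(level_logits, codebooks):
        weights = gumbel_softmax(logits, tau, rng)
        for w, row in zip(weights, codebook):
            z = [zi + w * ri for zi, ri in zip(z, row)]
    return z

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def info_nce(pred_segments, gt_segments, temperature=0.1):
    """Mean cross-entropy where the positive for prediction i is ground truth i."""
    total = 0.0
    for i, pred in enumerate(pred_segments):
        sims = [cosine(pred, gt) / temperature for gt in gt_segments]
        m = max(sims)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_denominator - sims[i]
    return total / len(pred_segments)

# Two ground-truth segments; aligned predictions should score a lower loss.
gt = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(gt, gt)
swapped_loss = info_nce([[0.0, 1.0], [1.0, 0.0]], gt)
```

Because the relaxed sample is a weighted average of codebook rows rather than a hard index, gradients from the contrastive loss can flow back into the token logits.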

Loss & Training

Total loss: \(\mathcal{L} = \mathcal{L}_{\text{CE}} + \alpha \mathcal{L}_{\text{con}} + \beta \mathcal{L}_{\text{va}}\), where \(\alpha=0.1, \beta=0.01\).
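
As a small worked example of how the three terms combine, the sketch below computes a token cross-entropy, a voice-activity BCE on a single logit, and the weighted sum with \(\alpha=0.1\) and \(\beta=0.01\). The function names and the sample inputs are hypothetical; only the weights and the form of the sum come from the text.

```python
import math

ALPHA, BETA = 0.1, 0.01  # loss weights from the text

def cross_entropy(logits, target):
    """Negative log-probability of the target token under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def voice_activity_bce(logit, speaking):
    """Numerically stable BCE on one logit; speaking is 1, listening is 0."""
    return max(logit, 0.0) - logit * speaking + math.log1p(math.exp(-abs(logit)))

def total_loss(ce, con, va, alpha=ALPHA, beta=BETA):
    """L = L_CE + alpha * L_con + beta * L_va."""
    return ce + alpha * con + beta * va

ce = cross_entropy([2.0, 0.5, -1.0], target=0)
va = voice_activity_bce(1.2, speaking=1)
loss = total_loss(ce, con=0.8, va=va)  # con=0.8 is an arbitrary placeholder value
```

With these weights the cross-entropy term dominates, and the contrastive and voice-activity terms act as light regularizers.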

During inference, top-p nucleus sampling (temporal Transformer \(p=0.8\), kinematic Transformer \(p=0.95\), temperature 0.9) is applied to maintain diversity. Classifier-free guidance (CFG) is used with a scale of 1.5 in the single-speaker setting and 2.3 in the multi-speaker setting. KV caching enables efficient causal inference, and masking speech/text cross-attention for lower-body tokens reduces runtime.
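
The sampling procedure can be sketched as follows. The guidance formula used here, `uncond + scale * (cond - uncond)`, is the common formulation of classifier-free guidance and is an assumption, since the summary does not state how the guided logits are formed or at which stage guidance is applied. Function names and the example logits are hypothetical; the values of \(p\), temperature, and guidance scale are from the text.

```python
import math
import random

def apply_cfg(cond_logits, uncond_logits, scale):
    """Common CFG form (assumed): push logits away from the unconditional prediction."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

def top_p_sample(logits, p, temperature, rng):
    """Sample from the smallest set of tokens whose probability mass reaches p."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Collect tokens in order of decreasing probability until mass >= p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= p:
            break

    # Draw from the nucleus, renormalized by its total mass.
    r = rng.random() * mass
    acc = 0.0
    for i in nucleus:
        acc += probs[i]
        if r < acc:
            return i
    return nucleus[-1]

rng = random.Random(0)
cond = [2.0, 1.0, 0.5, -1.0]     # hypothetical conditional logits
uncond = [1.0, 1.0, 1.0, 1.0]    # hypothetical unconditional logits
guided = apply_cfg(cond, uncond, scale=2.3)                         # multi-speaker scale
first_level = top_p_sample(guided, p=0.8, temperature=0.9, rng=rng)   # temporal stage
later_level = top_p_sample(guided, p=0.95, temperature=0.9, rng=rng)  # kinematic stage
```

A guidance scale of 1 returns the conditional logits unchanged under this formulation, so the reported scales of 1.5 and 2.3 both strengthen the conditioning signal.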

Key Experimental Results

Main Results

Method	FGD↓	BeatAlign→	L1-Div→	Causal	Real-time
CaMN	0.736	0.176	6.73	✗	✗
RAG-Gesture	0.515	0.648	10.09	✗	✗
GestureLSM	0.537	0.481	8.41	✗	✓
GestureLSM (Causal)	2.792	0.684	9.11	✓	✓
MambaTalk	1.375	0.080	3.73	✗	✓
Miburi (+Face)	0.480	0.461	10.44	✓	✓

Ablation Study

Configuration	Key Metrics	Remarks
Moshi tokens vs. wav2vec	FGD 0.480 vs. 0.665, BeatAlign 0.461 vs. 0.363	Moshi internal features significantly outperform standard audio encoding
2D Transformer vs. single-stream	Better FGD/BeatAlign/Diversity, nearly half the latency	Decoupling temporal and kinematic dimensions is critical
Single-speaker vs. multi-speaker	FGD drops from 0.753 to 0.480	Model benefits substantially from larger data
Latency comparison	Miburi 34.9 ms vs. GestureLSM 144.7 ms	Lowest latency (A100)

Key Findings

Among the compared methods, Miburi is the only one that simultaneously satisfies causality, real-time operation, and expressiveness.
In the 23-speaker multi-speaker setting it achieves the best FGD (0.480), lower than every non-causal baseline in the main results table, with a BeatAlign of 0.461.
Naively converting existing methods to causal variants (GestureLSM Causal, MambaTalk Causal) leads to substantial performance degradation; for example, GestureLSM's FGD rises from 0.537 to 2.792 in its causal variant. This demonstrates the necessity of a dedicated causal architecture.
A user study shows that Miburi outperforms EMAGE and GestureLSM in motion naturalness and speech-motion alignment.

Highlights & Insights

New paradigm: By bypassing the conventional serial pipeline (LLM → TTS → audio encoder → gesture) and conditioning directly on the internal token stream of a speech model, Miburi avoids the latency that those sequential stages accumulate.
2D causal token prediction: The elegant decoupling of temporal and kinematic dimensions draws inspiration from RQ-Transformer ideas applied to motion generation.
Hierarchical RVQ encoding: Coarse large-scale motion and fine-grained finger gestures are distinguished; independent encoding per body region respects the differing associations between body parts and speech.
Gumbel-Softmax bridging discrete sampling and continuous losses: This enables contrastive learning to be trained end-to-end in a discrete token space.

Limitations & Future Work

User studies indicate a remaining perceptual gap compared to real motion capture data.
Facial expression quality (Facial-MSE 7.77) has room for improvement relative to dedicated facial animation models.
The association between lower-body motion and speech is weak; the current approach of simply masking cross-attention may be overly coarse.
Miburi depends on Moshi, requiring both Moshi and Miburi to run concurrently during inference, resulting in high overall system resource demands.
Evaluation is conducted solely on the BEAT2 dataset; generalization to other domains (e.g., sign language, virtual streamers) remains to be validated.

Related Work & Insights

Moshi: A full-duplex spoken dialogue model; Miburi leverages its internal token stream as conditioning signals.
Audio2Photoreal: Generates dyadic interactive gestures via diffusion + VQ, but operates offline and non-causally.
GestureLSM: A flow-matching real-time framework that is non-causal; naive causal conversion leads to large performance drops.
RQ-Transformer: The residual quantization autoregressive paradigm inspired Miburi's 2D Transformer design.
Insight: Conditioning downstream tasks on the internal representations of large models, rather than their outputs, is a paradigm worth broader adoption.

Rating

Novelty: ⭐⭐⭐⭐⭐ First system to combine causality, real-time operation, and expressiveness in gesture generation; a pioneering paradigm.
Experimental Thoroughness: ⭐⭐⭐⭐ Multi-speaker/single-speaker evaluation, user study, and rich ablations, though limited to a single dataset.
Writing Quality: ⭐⭐⭐⭐ Problem formulation is precise (clear distinction between causality and real-time), with clear figures.
Value: ⭐⭐⭐⭐⭐ A significant advance for embodied dialogue agents; the technical approach appears highly scalable.