Miburi: Towards Expressive Interactive Gesture Synthesis
Conference: CVPR 2026
arXiv: 2603.03282
Code: Project Page
Area: Human Understanding
Keywords: Co-speech gesture generation, embodied dialogue agent, causal autoregression, real-time generation, residual vector quantization
TL;DR
Miburi is proposed as the first online, causal framework for real-time, synchronized whole-body gesture and facial expression generation. It conditions directly on the internal token stream of the speech-text foundation model Moshi and decodes motion autoregressively with a 2D causal Transformer.
Background & Motivation
Current LLM-based dialogue agents lack embodiment and expressive gestures. Existing co-speech gesture generation methods fall into two categories: (1) generative methods (diffusion/Transformer) produce natural, expressive gestures but require future speech context, precluding real-time operation; (2) real-time systems (rule-based or lightweight networks) can run online but produce stiff, low-diversity gestures. The root cause is that simultaneously satisfying causality (relying only on past inputs) and real-time operation (low latency) is extremely challenging. Traditional pipelines run LLM output → speech synthesis → audio encoding → gesture generation sequentially, introducing substantial latency.
Method
Overall Architecture
Miburi is built upon Moshi (a speech-text foundation model) and directly utilizes its internal speech/text token streams as conditioning signals. A body-part-aware gesture codec encodes motion into multi-level discrete tokens, which are then autoregressively generated by a 2D causal Transformer operating along both temporal and kinematic dimensions.
Key Designs
- Body-Part-Aware Gesture Codecs: The full-body motion is decomposed into three regions: upper body + hands \(\mathbf{x}^u\), lower body + global displacement \(\mathbf{x}^l\), and facial expressions (FLAME parameters) \(\mathbf{x}^f\). Each region is independently encoded by a Residual VQ-VAE whose encoder consists of downsampling 1D convolutions and a causal self-attention Transformer. The output is residually vector-quantized into multi-level tokens \(\mathbf{g}^b \in \mathbb{R}^{T \times K^b}\) (\(K^u=K^l=8\), \(K^f=4\) levels). Each token represents 2 frames (0.08 seconds) of motion, aligned with Moshi's token rate. RVQ preserves fine-grained kinematic details more effectively than a standard VQ-VAE (see the quantization sketch after this list).
- 2D Causal Transformer: Temporal and kinematic dimensions are decoupled for prediction (see the decoding sketch after this list):
- Temporal Transformer \(\mathcal{T}_{\text{temporal}}\): 4 layers, 2 heads, causal self-attention (context of 25 tokens) + dual causal cross-attention (speech/text, context of 50 tokens), autoregressively predicting the first-level token \(\mathbf{g}_{(t,1)}\) at each time step. Embeddings across \(K\) levels are summed into a single input.
- Kinematic Transformer \(\mathcal{T}_{\text{kinematic}}\): 2 layers, 1 head, autoregressively predicting subsequent levels \(\mathbf{g}_{(t,k)}\) within a fixed time step \(t\), conditioned on temporal context \(\mathbf{h}_t\) and speech/text embeddings.
Compared to naively processing all \(T \cdot K\) tokens in a single stream, the 2D decomposition substantially reduces attention context length and inference latency.
- Expressiveness Enhancement Objectives (see the contrastive-loss sketch after this list):
- Contrastive InfoNCE Loss: Predicted tokens are mapped to differentiable latent representations via Gumbel-Softmax reparameterization: \(\mathbf{z} = \sum_k \text{GumbelSoftmax}(\tilde{\mathbf{o}}_k) \mathbf{C}_k\). An InfoNCE loss is applied over temporal segments to increase similarity between matching ground-truth–prediction pairs and decrease similarity for non-matching pairs.
- Voice Activity Loss: A binary classification head is attached to \(\mathbf{h}_t\) to distinguish listening from speaking states (BCE loss), suppressing phantom gestures during listening and enforcing speech-aligned expressive gestures during speaking.
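A minimal sketch of the residual quantization step inside each codec, assuming illustrative codebook sizes and latent dimensions (the summary specifies only the level counts \(K^u=K^l=8\), \(K^f=4\)):

```python
import torch

def residual_quantize(z, codebooks):
    """Residually quantize latents z: (T, D) with a list of K codebooks (V, D).

    Each level quantizes the residual left over by the previous levels, so
    later levels capture progressively finer kinematic detail (e.g. fingers).
    Returns per-level token indices (T, K) and the summed reconstruction.
    """
    residual, recon, indices = z, torch.zeros_like(z), []
    for C in codebooks:                          # K levels
        dists = torch.cdist(residual, C)         # (T, V) distance to each codeword
        idx = dists.argmin(dim=-1)               # nearest codeword per codec frame
        recon = recon + C[idx]
        residual = residual - C[idx]             # hand the remainder to the next level
        indices.append(idx)
    return torch.stack(indices, dim=-1), recon   # tokens g^b with shape (T, K^b)

# Toy usage: 25 codec frames (each covering 2 motion frames / 0.08 s),
# latent dim 256, K=8 levels, 512 codewords per level -- all assumed values.
codebooks = [torch.randn(512, 256) for _ in range(8)]
tokens, recon = residual_quantize(torch.randn(25, 256), codebooks)
```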
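The 2D factorization can be pictured as a nested decode loop. The sketch below is a hypothetical interface, not the paper's code: `temporal` and `kinematic` stand in for \(\mathcal{T}_{\text{temporal}}\) and \(\mathcal{T}_{\text{kinematic}}\), and greedy argmax replaces the top-p sampling used at inference:

```python
def decode_timestep(temporal, kinematic, embed, history, speech_ctx, text_ctx, K=8):
    """One timestep of 2D causal decoding (all interfaces are illustrative).

    temporal(history, speech_ctx, text_ctx) -> (h_t, logits for the level-1 token)
    kinematic(h_t, prev_sum)                -> logits for the next level's token
    embed(k, idx)                           -> embedding of token idx at level k
    """
    # Temporal Transformer: causal self-attention over past steps plus causal
    # cross-attention to Moshi's speech/text streams yields the context h_t
    # and the first-level token g_(t,1).
    h_t, logits = temporal(history, speech_ctx, text_ctx)
    tokens = [logits.argmax(-1)]

    # Kinematic Transformer: within timestep t, predict levels 2..K one by
    # one, conditioned on h_t and the already-decoded coarser levels.
    for k in range(1, K):
        prev_sum = sum(embed(j, tokens[j]) for j in range(k))
        tokens.append(kinematic(h_t, prev_sum).argmax(-1))

    # The K level embeddings are summed into a single vector fed back to the
    # temporal stream, so the two attention contexts scale as O(T) and O(K)
    # rather than O(T*K) for a naive single-stream decoder.
    next_input = sum(embed(j, tokens[j]) for j in range(K))
    return tokens, next_input
```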
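And a sketch of how the contrastive objective can be wired through Gumbel-Softmax so that gradients flow despite the discrete token space (temperatures, segment pooling, and tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def soft_latent(level_logits, codebooks, tau=1.0):
    """z = sum_k GumbelSoftmax(o_k) C_k: mix each level's codebook with
    differentiable one-hot-like weights instead of hard argmax indices.
    level_logits: (K, T, V); codebooks: K tensors of shape (V, D)."""
    z = 0.0
    for o_k, C_k in zip(level_logits, codebooks):
        w = F.gumbel_softmax(o_k, tau=tau, hard=False, dim=-1)  # (T, V)
        z = z + w @ C_k                                         # (T, D)
    return z

def infonce(z_pred, z_gt, temp=0.07):
    """InfoNCE over temporal segments: each predicted segment should be most
    similar to its own ground-truth segment (the diagonal) and dissimilar to
    all others. z_pred, z_gt: (S, D) segment-pooled latents."""
    sim = F.normalize(z_pred, dim=-1) @ F.normalize(z_gt, dim=-1).t() / temp
    labels = torch.arange(sim.size(0))  # positives lie on the diagonal
    return F.cross_entropy(sim, labels)
```

The voice activity term is comparatively simple: a linear head on \(\mathbf{h}_t\) trained with `F.binary_cross_entropy_with_logits` against a speaking/listening label.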
Loss & Training
Total loss: \(\mathcal{L} = \mathcal{L}_{\text{CE}} + \alpha \mathcal{L}_{\text{con}} + \beta \mathcal{L}_{\text{va}}\), where \(\alpha=0.1, \beta=0.01\).
During inference, top-p nucleus sampling (temporal Transformer \(p=0.8\), kinematic Transformer \(p=0.95\), temperature 0.9) maintains diversity. Classifier-free guidance is applied (scale 1.5 single-speaker, 2.3 multi-speaker). A KV cache enables efficient causal inference, and masking the speech/text cross-attention for lower-body tokens reduces runtime.
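For concreteness, a minimal top-p sampler with temperature and a standard classifier-free-guidance blend, using the hyperparameters quoted above (the exact CFG formulation is an assumption, not stated in this summary):

```python
import torch

def sample_top_p(logits, p=0.8, temperature=0.9):
    """Nucleus sampling: draw from the smallest set of tokens whose
    cumulative probability reaches p, after temperature scaling."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    keep = sorted_probs.cumsum(-1) - sorted_probs < p  # always keeps the top token
    kept = sorted_probs * keep
    choice = torch.multinomial(kept / kept.sum(), 1)
    return sorted_idx[choice]

def cfg_logits(cond, uncond, scale=1.5):
    """Classifier-free guidance: amplify the conditional prediction relative
    to the unconditional one (scale 1.5 single-speaker, 2.3 multi-speaker)."""
    return uncond + scale * (cond - uncond)
```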
Key Experimental Results
Main Results
| Method | FGD↓ | BeatAlign→ | L1-Div→ | Causal | Real-time |
|---|---|---|---|---|---|
| CaMN | 0.736 | 0.176 | 6.73 | ✗ | ✗ |
| RAG-Gesture | 0.515 | 0.648 | 10.09 | ✗ | ✗ |
| GestureLSM | 0.537 | 0.481 | 8.41 | ✗ | ✓ |
| GestureLSM (Causal) | 2.792 | 0.684 | 9.11 | ✓ | ✓ |
| MambaTalk | 1.375 | 0.080 | 3.73 | ✗ | ✓ |
| Miburi (+Face) | 0.480 | 0.461 | 10.44 | ✓ | ✓ |
Ablation Study
| Configuration | Key Metrics | Remarks |
|---|---|---|
| Moshi tokens vs. wav2vec | FGD 0.480 vs. 0.665, BeatAlign 0.461 vs. 0.363 | Moshi internal features significantly outperform standard audio encoding |
| 2D Transformer vs. single-stream | Better FGD/BeatAlign/Diversity, nearly half the latency | Decoupling temporal and kinematic dimensions is critical |
| Single-speaker vs. multi-speaker | FGD improves from 0.753 to 0.480 | The model benefits substantially from more training data |
| Latency comparison | Miburi 34.9 ms vs. GestureLSM 144.7 ms | Lowest latency (A100) |
Key Findings
- Miburi is the only method simultaneously satisfying causality, real-time operation, and expressiveness.
- It achieves state-of-the-art results under the 23-speaker multi-speaker setting (FGD 0.480, BeatAlign 0.461), surpassing all non-causal baselines.
- Naively converting existing methods to causal variants (GestureLSM Causal, MambaTalk Causal) leads to substantial performance degradation, demonstrating the necessity of a dedicated causal architecture.
- A user study shows that Miburi outperforms EMAGE and GestureLSM in motion naturalness and speech-motion alignment.
Highlights & Insights
- New paradigm: Bypassing the conventional LLM → TTS → audio encoder → gesture generator serial pipeline and conditioning directly on the speech model's internal token stream eliminates the latency bottleneck.
- 2D causal token prediction: The elegant decoupling of temporal and kinematic dimensions draws inspiration from RQ-Transformer ideas applied to motion generation.
- Hierarchical RVQ encoding: Coarse large-scale motion and fine-grained finger gestures are distinguished; independent encoding per body region respects the differing associations between body parts and speech.
- Gumbel-Softmax bridging discrete sampling and continuous losses: This enables contrastive learning to be trained end-to-end in a discrete token space.
Limitations & Future Work
- User studies indicate a remaining perceptual gap compared to real motion capture data.
- Facial expression quality (Facial-MSE 7.77) has room for improvement relative to dedicated facial animation models.
- The association between lower-body motion and speech is weak; the current approach of simply masking cross-attention may be overly coarse.
- Miburi depends on Moshi, requiring both Moshi and Miburi to run concurrently during inference, resulting in high overall system resource demands.
- Evaluation is conducted solely on the BEAT2 dataset; generalization to other domains (e.g., sign language, virtual streamers) remains to be validated.
Related Work & Insights
- Moshi: A full-duplex spoken dialogue model; Miburi leverages its internal token stream as conditioning signals.
- Audio2Photoreal: Generates dyadic interactive gestures via diffusion + VQ, but operates offline and non-causally.
- GestureLSM: A flow-matching real-time framework that is non-causal; naive causal conversion leads to large performance drops.
- RQ-Transformer: The residual quantization autoregressive paradigm inspired Miburi's 2D Transformer design.
- Insight: Conditioning downstream tasks on the internal representations of large models, rather than their outputs, is a paradigm worth broader adoption.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First fully qualified system combining causality, real-time operation, and expressiveness in gesture generation; a highly pioneering new paradigm.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-speaker/single-speaker evaluation, user study, and rich ablations, though limited to a single dataset.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is precise (clear distinction between causality and real-time operation), with clear figures.
- Value: ⭐⭐⭐⭐⭐ Significant advancement for embodied dialogue agents; the technical approach has high scalability.