GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation

Conference: ICCV 2025 · arXiv: 2507.22731 · Code: Project Page · Area: Human Understanding · Keywords: Co-speech gesture generation, diffusion model, Transformer, retrieval-augmented generation, semantic gesture

TL;DR

This paper proposes GestureHYDRA, a co-speech gesture synthesis system based on a Hybrid-Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation, capable of reliably activating semantically explicit gestures such as numerical and directional indications.

Background & Motivation

Co-speech gesture synthesis aims to generate human body gestures synchronized with speech, with broad applications in film, gaming, robotics, and virtual avatar production. Existing works suffer from two core problems:

Data scarcity: Most datasets only cover gestures in conversational settings; semantically explicit instructional gestures (e.g., using fingers to indicate quantities or directions) are extremely rare.

Many-to-many mapping difficulty: The complex many-to-many mapping between speech and gestures makes it difficult for models to reliably activate semantic gestures, occasionally producing unwanted gestures or failing to activate the intended ones.

Core motivation: To build a system that can reliably activate specific semantic gestures (e.g., numerical/directional) within co-speech generation, producing gestures that are not only natural and fluent but also convey explicit instructional information.

Method

Overall Architecture

The GestureHYDRA system consists of two core components:

  • Hybrid-Modality Diffusion Transformer (HM-DiT): A hybrid-modality diffusion Transformer backbone that jointly processes the audio and gesture modalities.
  • Cascaded-Synchronized RAG: A cascaded-synchronized retrieval-augmented generation strategy that ensures reliable activation of semantic gestures.

Key Designs

  1. Hybrid-Modality Diffusion Transformer (HM-DiT):

    • The system accepts two modality inputs: speech audio and body gestures.
    • Four masking strategies are designed to simulate different scenarios, each occurring with equal probability (a minimal sampling sketch follows this list):
      • Start-Only: Only the seed gesture is retained, corresponding to the standard co-speech generation setting.
      • Start-End: Conditions are provided at both the beginning and end, corresponding to motion in-betweening.
      • Random-Frame: Random frame masking to enhance global modeling capacity.
      • Random-Seg: Random segment masking to enhance continuous segment synthesis.
    • Training pipeline: noisy gestures pass through the Gesture Encoder; the resulting noisy features are combined with Key-Frame Encoder features, fused with the audio features, and fed into the Transformer for generation.
    • Fusion formula: \(\mathbf{G^F} = \mathbf{G^K} + \text{GAF}(\mathbf{A} \oplus \mathbf{G^K})\)
  2. Motion-Style Injective Transformer Layer:

    • Addresses cross-identity generalization, replacing conventional one-hot identity embeddings.
    • Two style injection layers are appended after standard self-attention and FFN.
    • Style injection combines dynamic and static components:
      • Dynamic component \(\mathbf{S_d}\): Motion style embeddings encoded from an external style reference sequence.
      • Static component \(\mathbf{S_s}\): An internally learnable motion memory bank that memorizes all motion styles observed during training.
    • Injection formula: \(\text{Att}_{style} = \text{softmax}(\frac{\mathbf{G^{F'}}\mathbf{S}^\top}{\sqrt{c}})\mathbf{S}\), where \(\mathbf{S}\) gathers the dynamic and static style embeddings (a sketch follows this list).
    • During training, a gesture sequence different from the ground truth is selected per identity as the style reference to avoid gesture leakage.
  3. Cascaded-Synchronized RAG:

    • Semantic gesture repository: Manually constructed per identity, containing 18 predefined gesture categories, each with at least one ~1-second clip and annotated keyframes.
    • Adaptive keyframe gesture injection:
      • ASR is used to identify semantically relevant phrases and their corresponding time spans.
      • Matching gesture keyframes (single frames rather than full clips) are retrieved from the repository, so that rhythm depends on the actual audio rather than the retrieved gesture.
      • An adaptive timestamp adjustment strategy based on audio-gesture consistency scores is proposed.
      • Binary search is used to find the optimal injection timestamp, ensuring synchronization between gestures and semantic phrases (a search sketch follows this list).
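
The masking, style-injection, and timestamp-search mechanisms above can be illustrated with short sketches. First, a minimal sketch of the four conditioning masks, assuming a clip of T frames and a boolean mask in which True marks frames kept as known conditions; the function name, seed length, and sampling ratios are illustrative assumptions rather than values from the paper, and the gated audio fusion (GAF) is not shown.

```python
import random

def sample_condition_mask(T: int, n_seed: int = 8) -> list[bool]:
    """Pick one of the four masking strategies with equal probability.

    True  = frame is kept as a known condition;
    False = frame must be synthesized by the diffusion model.
    """
    mask = [False] * T
    strategy = random.choice(["start_only", "start_end", "random_frame", "random_seg"])
    if strategy == "start_only":         # standard co-speech generation: seed gesture only
        mask[:n_seed] = [True] * n_seed
    elif strategy == "start_end":        # motion in-betweening: seed + ending frames kept
        mask[:n_seed] = [True] * n_seed
        mask[-n_seed:] = [True] * n_seed
    elif strategy == "random_frame":     # scattered frames kept, the rest predicted
        for i in random.sample(range(T), k=max(1, T // 4)):
            mask[i] = True
    else:                                # random_seg: one continuous segment predicted
        seg_len = random.randint(T // 4, T // 2)
        start = random.randint(0, T - seg_len)
        mask = [not (start <= i < start + seg_len) for i in range(T)]
    return mask
```

Second, a sketch of the style-injection attention defined by the formula in item 2, assuming the layer feature \(\mathbf{G^{F'}}\) has channel width c, the dynamic style embeddings \(\mathbf{S_d}\) come from an external style encoder, and the static memory bank \(\mathbf{S_s}\) is a learnable parameter; the memory size and the simple concatenation of the two style sources are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class StyleInjection(nn.Module):
    """Attention of gesture features over dynamic + static style tokens."""

    def __init__(self, c: int, n_memory: int = 64):
        super().__init__()
        # Static component S_s: learnable motion memory bank shared across identities.
        self.s_static = nn.Parameter(torch.randn(n_memory, c))

    def forward(self, g: torch.Tensor, s_dyn: torch.Tensor) -> torch.Tensor:
        # g: (B, T, c) layer feature G^{F'};  s_dyn: (B, M, c) dynamic style S_d.
        b = g.shape[0]
        s = torch.cat([s_dyn, self.s_static.unsqueeze(0).expand(b, -1, -1)], dim=1)
        attn = torch.softmax(g @ s.transpose(1, 2) / (g.shape[-1] ** 0.5), dim=-1)
        return attn @ s  # Att_style, same shape as g
```

Third, a hedged sketch of the adaptive timestamp search. It assumes a scoring function consistency(t) that returns an audio-gesture consistency score (higher is better, e.g. derived from ΔBC) for injecting the retrieved keyframe at time t, and that the score is roughly unimodal over the ASR-detected phrase span; both the scoring function and the unimodality assumption are illustrative, not the released implementation.

```python
from typing import Callable

def adaptive_injection_time(t_start: float, t_end: float,
                            consistency: Callable[[float], float],
                            tol: float = 0.01) -> float:
    """Binary-search the phrase span for the best-scoring injection timestamp."""
    lo, hi = t_start, t_end
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        step = 0.25 * (hi - lo)
        # Probe on both sides of the midpoint and keep the half whose
        # probe scores better (assumes a roughly unimodal score curve).
        if consistency(mid - step) < consistency(mid + step):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```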

Loss & Training

\[\mathcal{L} = \lambda_t \mathcal{L}_t + \lambda_{vec} \mathcal{L}_{vec} + \lambda_{kp} \mathcal{L}_{kp}\]
  • \(\mathcal{L}_t\): MSE reconstruction loss (\(\lambda_t=10\))
  • \(\mathcal{L}_{vec}\): Velocity loss based on L1 distance (\(\lambda_{vec}=1\))
  • \(\mathcal{L}_{kp}\): 3D keypoint loss based on L1 distance (\(\lambda_{kp}=1\))
  • The 3D keypoint loss is computed only on sparsely sampled frames (1/8 of the frames) because SMPL-X forward kinematics is slow.
  • Two-stage training: 120k steps of pretraining (without 3D keypoint loss) + 30k steps with 3D keypoint loss.
  • A DDIM sampler with 50 denoising steps is used at inference.
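
A minimal PyTorch-style sketch of this objective, assuming `x0_pred` and `x0_gt` are gesture tensors of shape (batch, frames, features) and `fk` is an SMPL-X forward-kinematics callable mapping poses to 3D keypoints; the names and the every-8th-frame subsampling are illustrative readings of the description above, not the authors' code.

```python
import torch.nn.functional as F

def gesture_loss(x0_pred, x0_gt, fk=None, use_kp_loss=True,
                 lambda_t=10.0, lambda_vec=1.0, lambda_kp=1.0):
    """Weighted sum of reconstruction, velocity, and 3D keypoint terms."""
    # L_t: MSE reconstruction loss on the predicted gesture sequence.
    l_t = F.mse_loss(x0_pred, x0_gt)
    # L_vec: L1 velocity loss on frame-to-frame differences.
    l_vec = F.l1_loss(x0_pred[:, 1:] - x0_pred[:, :-1],
                      x0_gt[:, 1:] - x0_gt[:, :-1])
    loss = lambda_t * l_t + lambda_vec * l_vec
    # L_kp: L1 loss on 3D keypoints from SMPL-X forward kinematics,
    # computed on a 1/8 subsample of frames and enabled only in the
    # second training stage.
    if use_kp_loss and fk is not None:
        loss = loss + lambda_kp * F.l1_loss(fk(x0_pred[:, ::8]), fk(x0_gt[:, ::8]))
    return loss
```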

Streamer Dataset

  • A large-scale Chinese semantic gesture dataset constructed specifically for this work.
  • Contains 281 streamers and 20,969 10-second clips in total.
  • Focuses on 18 predefined semantic gestures (numerical/directional, etc.) in live-streaming scenarios.
  • Includes test set splits for seen and unseen identities.

Key Experimental Results

Main Results (Tables)

Streamer Dataset:

| Method | FGD↓ | ΔBC↓ | SAR↑ | SMD-L1↓ | SMD-DTW↓ |
|---|---|---|---|---|---|
| **Seen Identity** | | | | | |
| TalkSHOW | 51.50 | 0.062 | 61.49% | 0.161 | 32.11 |
| Probtalk | 50.33 | 0.007 | 72.29% | 0.120 | 22.37 |
| DSG | 54.59 | 0.072 | 73.03% | 0.116 | 22.61 |
| Ours | 3.24 | 0.003 | 84.82% | 0.107 | 20.70 |
| **Unseen Identity** | | | | | |
| TalkSHOW | 75.35 | 0.085 | 31.81% | 0.210 | 41.00 |
| Probtalk | 63.74 | 0.030 | 66.08% | 0.174 | 33.26 |
| DSG | 61.94 | 0.091 | 68.77% | 0.160 | 30.77 |
| Ours | 15.43 | 0.027 | 81.36% | 0.143 | 27.73 |

SHOW Dataset: Ours achieves an FGD of 3.68, vs. 6.04 for TalkSHOW and 5.46 for Probtalk.

Ablation Study (Tables)

Component Ablation (Unseen Test Set):

| Setting | FGD↓ | SMD-L1↓ | SMD-DTW↓ |
|---|---|---|---|
| w/o mask strategy | 15.76 | 0.156 | 29.71 |
| w/o motion style | 20.31 | 0.155 | 29.51 |
| w/o 3D kp loss | 14.80 | 0.154 | 29.68 |
| Full model | 15.43 | 0.143 | 27.73 |

Adaptive Injection Analysis:

| Variant | SMD-L1↓ | SMD-DTW↓ |
|---|---|---|
| w/o Injection | 0.176 | 30.46 |
| Vanilla Injection | 0.155 | 27.45 |
| Adaptive Injection | 0.138 | 26.88 |

Key Findings

  • The proposed method achieves a substantial lead on FGD (Seen: 3.24 vs. second-best 50.33; Unseen: 15.43 vs. second-best 61.94), indicating that the feature distribution of generated gestures closely aligns with real data.
  • SAR (Semantic Activation Rate) reaches 84.82% (seen) and 81.36% (unseen), far surpassing baseline methods.
  • The hybrid masking training strategy contributes most to semantic gesture generation quality.
  • The motion-style injection module is critical for generalization; removing it raises FGD from 15.43 to 20.31.
  • Although the 3D keypoint loss has a limited effect on FGD, it significantly improves hand-surface interaction stability in downstream video generation.
  • Adaptive injection outperforms fixed-position injection by locating the optimal timestamp via binary search guided by ΔBC scores.

Highlights & Insights

  • The hybrid-modality design achieves two goals simultaneously: during training it jointly learns co-speech generation and motion in-betweening; at inference it supports flexible gesture editing operations (injection, interpolation, and replacement).
  • Strong practical value: semantic gesture demands in live-streaming scenarios are explicit and frequent, and this system directly addresses the pain point.
  • Core insight of the cascaded RAG strategy: injecting keyframes rather than full gesture clips allows the generated rhythm to depend on the actual audio rather than the retrieved gesture.
  • The dynamic + static style injection design balances generalization capability and personalization.
  • Comparisons with Semantic Gesticulator and SIGesture demonstrate that injecting keyframes (rather than directly merging retrieved results) is the superior strategy.
  • The proposed SAR and SMD evaluation metrics fill a gap in semantic gesture assessment.

Limitations & Future Work

  • The semantic gesture repository requires manual keyframe annotation, making extension to new identities costly.
  • The system only supports Chinese live-streaming scenarios; cross-lingual and cross-domain generalization remains to be validated.
  • The 18 predefined gesture categories offer limited coverage; richer gesture types require dataset expansion.
  • The binary search in RAG introduces additional computational overhead at inference.
  • The design of the motion-style injective layer is inspired by zero-shot talking face generation and could be extended to other motion generation tasks.

Rating

  • Novelty: ⭐⭐⭐⭐ The hybrid-modality diffusion architecture and cascaded RAG strategy are novel; the semantic gesture dataset fills an existing gap.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive quantitative and qualitative experiments with thorough ablations and newly proposed evaluation metrics.
  • Writing Quality: ⭐⭐⭐⭐ The paper presents a clear logical structure and a complete system design.
  • Value: ⭐⭐⭐⭐⭐ Demonstrates strong practical applicability in live-streaming and virtual avatar scenarios.