Understanding Co-speech Gestures in-the-wild¶
Conference: ICCV 2025 arXiv: 2503.22668 Code: Project Page Area: Audio & Speech Keywords: Co-speech gestures, tri-modal representation learning, gesture retrieval, gesture word spotting, active speaker detection
TL;DR¶
This paper proposes JEGAL — a joint gesture-audio-language tri-modal embedding space that learns co-speech gesture representations under weak supervision via a global phrase-level contrastive loss and a local gesture-word coupling loss. Three new gesture understanding tasks and benchmarks are introduced, and the method outperforms a range of baselines including large vision-language models.
Background & Motivation¶
Humans gesture while speaking — gesture is an integral component of human communication. Co-speech gestures span a rich spectrum: from beat gestures (rhythmic hand movements that emphasize specific words) to iconic gestures (gestures that convey semantic content, e.g., spreading one's arms to indicate "enormous"). By some widely cited (if debated) estimates, non-verbal cues account for 55% of total communication, yet machines' understanding of gesture semantics remains severely limited.
Learning associations between gestures and speech/text is exceptionally challenging for several reasons:
Sparse and ambiguous cross-modal correlations: Typically only a handful of words are clearly expressed through gesture, and the same utterance can manifest vastly different gestures across contexts or individuals.
High inter-personal and cross-cultural variability: Gestures depend on the speaker's emotion, cultural background, and social setting.
Semantically vacuous gestures: Beat gestures are aligned only with prosodic rhythm and carry no semantic content, making direct mapping to specific words infeasible.
Limitations of prior work: - GestSync learns gesture representations via audio-visual synchronization, but captures only low-level temporal correlations rather than high-level semantics. - GestureDiffuCLIP learns a gesture-text joint embedding but lacks word-level correspondences. - Large vision-language models (e.g., CLIP, LanguageBind) are not designed for gesture understanding and have limited capacity to handle long videos and gesture-specific features.
Method¶
Overall Architecture¶
JEGAL (Joint Embedding space for Gestures, Audio, and Language) learns tri-modal representations through two complementary contrastive learning objectives:
- Global phrase-level contrastive loss: Encourages the model to learn holistic semantic correspondences between gesture clips and speech/text segments.
- Local gesture-word coupling loss: Encourages the model to discover frame-level associations between gesture clips and specific words.
Key Designs¶
- Three modal encoders + fusion module
  - Gesture encoder \(\mathbb{G}\): A 3D convolutional layer (temporal receptive field of 5 frames in the first layer) followed by a Transformer encoder, producing frame-level features \(\mathbf{g}^T \in \mathbb{R}^{T \times d}\). The backbone is initialized from GestSync and frozen; only the Transformer head is trained. Facial regions are masked to prevent lip-motion information leakage.
  - Text encoder \(\mathbb{L}\): Multilingual RoBERTa XLM-Base extracts subword features, which are encoded and projected by a Transformer head to yield \(\mathbf{l}^w \in \mathbb{R}^{W \times d/2}\).
  - Speech encoder \(\mathbb{S}\): A 2D-CNN encodes mel-spectrograms, outputting \(\mathbf{s} \in \mathbb{R}^{T' \times d/2}\).
  - Fusion module: Subword text embeddings are aggregated into word-level embeddings \(\mathbf{l}^w\); speech features are aggregated into word-level features \(\mathbf{s}^w\) according to word boundaries; these are concatenated along the feature dimension to obtain the joint word-level representation \(\mathbf{c}^w \in \mathbb{R}^{W \times d}\).
During training, speech or text input is randomly zeroed out with 50% probability (modality dropout), encouraging the model to learn from both modalities in a balanced manner and enabling single-modality inference at test time.
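As a rough sketch, the fusion-with-modality-dropout step might look like the following numpy snippet (the exact dropout scheme — e.g., whether at most one stream is dropped per sample — and all shapes are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_with_modality_dropout(text_feats, speech_feats, p=0.5):
    """Zero out the text OR speech stream with probability p, then
    concatenate along the feature dimension: (W, d/2) + (W, d/2) -> (W, d).
    Sketch only: we assume at most one modality is dropped per sample,
    so the model can still infer from the surviving stream."""
    if rng.random() < p:
        if rng.random() < 0.5:
            text_feats = np.zeros_like(text_feats)
        else:
            speech_feats = np.zeros_like(speech_feats)
    return np.concatenate([text_feats, speech_feats], axis=-1)

c_w = fuse_with_modality_dropout(np.ones((4, 8)), np.ones((4, 8)))
```

At test time the same concatenation is used, with the absent modality simply zeroed, which is what makes single-modality inference possible.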
- Gesture-word alignment mechanism
Word boundaries are aligned to speech but not necessarily to gestures — a gesture may be longer, shorter, or temporally offset relative to the corresponding word. To address this, an attention-based pooling mechanism is designed: the temporal window of each word is expanded by \(p=10\) frames on each side, and the word embedding \(c^{w_i}\) is used to compute an attention-weighted aggregation over gesture frames within the expanded window:
\(g^{w_i} = \sum_{j=S}^{E} \frac{\exp(\gamma \cdot g^{T_j} \cdot c^{w_i})}{\sum_{k=S}^{E} \exp(\gamma \cdot g^{T_k} \cdot c^{w_i})} \cdot g^{T_j}\)
This enables the model to flexibly locate the true temporal extent of a gesture beyond the speech boundaries.
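The attention-weighted pooling above is straightforward to sketch in numpy (window size, temperature \(\gamma\), and feature dimensions here are illustrative, not the paper's values):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def word_gesture_pool(g_frames, c_word, start, end, p=10, gamma=5.0):
    """Attention-pool gesture frames for one word.
    g_frames: (T, d) frame-level gesture features; c_word: (d,) word embedding.
    The word's [start, end] frame window is widened by p frames on each side,
    then frames are weighted by their (scaled) similarity to the word."""
    T = g_frames.shape[0]
    s, e = max(0, start - p), min(T, end + p)
    window = g_frames[s:e]               # (E-S, d) expanded window
    weights = softmax(gamma * window @ c_word)  # attention over frames
    return weights @ window              # (d,) pooled gesture feature

g_wi = word_gesture_pool(np.random.default_rng(1).normal(size=(50, 16)),
                         np.ones(16), start=20, end=25)
```

Because the window extends beyond the word's speech boundaries, a gesture that leads or trails the word can still dominate the attention weights.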
- Dual training objectives
Global phrase-level contrastive loss: Frame-level gesture features and word-level speech-text features are average-pooled to obtain global embeddings \(\mathbf{g}\) and \(\mathbf{c}\), and an InfoNCE loss is applied:
\(\mathcal{L}_{seq} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\gamma \cdot \cos(g_i, c_i))}{\sum_{j=1}^{N} \exp(\gamma \cdot \cos(g_i, c_j))}\)
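The global loss above is a standard InfoNCE over batch-level cosine similarities; a minimal numpy sketch (batch size and temperature \(\gamma\) are placeholder values):

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def phrase_infonce(g, c, gamma=10.0):
    """Global phrase-level contrastive loss (sketch).
    g, c: (N, d) pooled gesture and speech-text embeddings.
    Matched pairs (g_i, c_i) are positives; all other c_j in the
    batch serve as negatives."""
    g, c = l2norm(g), l2norm(c)
    sims = gamma * g @ c.T                              # (N, N) scaled cosines
    logits = sims - sims.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                 # -1/N sum_i log p(i|i)

loss_aligned = phrase_infonce(np.eye(4), np.eye(4))
loss_shuffled = phrase_infonce(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

Perfectly aligned pairs drive the loss toward zero, while mismatched pairs are penalized heavily.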
Local gesture-word coupling loss: For each speech-text word \(c^{w_i}\), the most similar gesture frame \(g^{T_j}\) is identified, and a coupling score is computed as \(\lambda(g^T, c^w) = \frac{1}{W}\sum_{i=1}^{W}\max_{j}\cos(g^{T_j}, c^{w_i})\). The core assumption is that matched gesture–speech-text pairs exhibit stronger word-level coupling than mismatched pairs. A contrastive loss in the InfoNCE form is then applied analogously, with the coupling score taking the place of the cosine similarity.
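The coupling score \(\lambda\) itself reduces to a max-then-mean over a word-frame similarity matrix; a sketch with illustrative shapes (the real model computes this on learned features, with attention-pooled windows):

```python
import numpy as np

def coupling_score(g_frames, c_words):
    """lambda(g^T, c^w): for each word embedding, take the cosine
    similarity of its best-matching gesture frame, then average over
    the W words. Matched clips should score higher than mismatched.
    g_frames: (T, d) gesture frames; c_words: (W, d) word embeddings."""
    gn = g_frames / np.linalg.norm(g_frames, axis=-1, keepdims=True)
    cn = c_words / np.linalg.norm(c_words, axis=-1, keepdims=True)
    sims = cn @ gn.T                 # (W, T) word-frame cosine similarities
    return sims.max(axis=1).mean()   # mean over words of best-frame cosine

matched = coupling_score(np.eye(3), np.eye(3))
mismatched = coupling_score(np.eye(3), -np.eye(3))
```

The `max` over frames is what provides weak word-level supervision: only the best-matching frame contributes per word, so the model is never forced to align semantically vacuous (e.g., beat) gestures to specific words.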
Final loss: \(\mathcal{L} = \beta \cdot \mathcal{L}_{seq} + (1-\beta) \cdot \mathcal{L}_{couple}\)
Training Details¶
- Data: PATS (25 speakers, 162 hours) + a subset of MultiVSR (6,934 speakers, 556 hours), totalling approximately 720 hours from 7,000+ speakers.
- Preprocessing: Resampled to 25 FPS / 16 kHz; word-aligned transcriptions generated with WhisperX; low-gesture-activity samples filtered based on L2 distance of body keypoints.
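The low-gesture-activity filter might be sketched as follows (the keypoint set, any per-frame normalization, and the threshold are not specified here, so they are assumptions for illustration):

```python
import numpy as np

def gesture_activity(keypoints):
    """Mean L2 displacement of body keypoints between consecutive frames.
    keypoints: (T, K, 2) array of 2D joint positions over T frames.
    Clips whose score falls below some threshold would be treated as
    low-gesture and filtered out (threshold is an assumed hyperparameter)."""
    deltas = np.diff(keypoints, axis=0)          # (T-1, K, 2) frame-to-frame motion
    return np.linalg.norm(deltas, axis=-1).mean()  # mean per-joint L2 displacement

still = gesture_activity(np.zeros((10, 5, 2)))            # no motion at all
moving = gesture_activity(np.cumsum(np.ones((10, 5, 2)), axis=0))  # steady drift
```

A static speaker scores near zero while any sustained hand motion scores well above it, which is the property the filter relies on.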
- Optimizer: AdamW, lr=5e-5, weight decay=1e-4, betas=(0.9, 0.98).
- Gesture head: 6-layer Transformer; text head: 3 layers; hidden dim=512, FFN=2048, 8 attention heads.
Key Experimental Results¶
Main Results¶
Cross-modal retrieval (AVS-Ret benchmark, 500 diverse gesture clips):
| Method | Modality | S→G R@5↑ | S→G R@10↑ | S→G MR↓ | G→S R@5↑ | G→S R@10↑ | G→S MR↓ |
|---|---|---|---|---|---|---|---|
| GestSync (FT) | Audio | 10.0 | 18.2 | 70.5 | 11.6 | 16.6 | 82.5 |
| Clip4Clip (FT) | Text | 8.0 | 12.6 | 132.0 | 3.6 | 7.0 | 125.0 |
| JEGAL | T+A | 18.8 | 30.8 | 31.0 | 18.2 | 20.2 | 24.5 |
JEGAL substantially outperforms all baselines; multimodal fusion (T+A) markedly surpasses unimodal variants.
Gesture word spotting (AVS-Spot benchmark, 500 annotated clips):
| Method | Modality | Accuracy↑ |
|---|---|---|
| GestSync (FT) | Audio | 21.04 |
| GestureDiffuCLIP (FT) | Text | 19.50 |
| JEGAL (Text) | Text | 61.00 |
| JEGAL (T+A) | T+A | 63.60 |
JEGAL decisively outperforms all baselines on word spotting (63.6% vs. 21.04%), with the primary advantage attributable to the local gesture-word coupling loss.
Active speaker detection (AVS-Asd benchmark):
| Method | 2-person↑ | 4-person↑ | 6-person↑ |
|---|---|---|---|
| GestSync (FT) | 81.2 | 64.8 | 54.4 |
| JEGAL (T+A) | 76.8 | 57.8 | 48.0 |
GestSync achieves the best performance on this task due to its strong frame-level synchronization supervision; JEGAL ranks second.
Ablation Study¶
| Loss Configuration | Retrieval R@5↑ | Retrieval MR↓ | Spotting Acc↑ | ASD Acc↑ |
|---|---|---|---|---|
| Global contrastive only | 12.20 | 45 | 20.83 | 44.2 |
| Word coupling only | 8.50 | 76 | 52.46 | 14.8 |
| Global + word coupling | 18.80 | 31 | 63.60 | 48.0 |

| Fusion Strategy | Retrieval R@5↑ | Spotting Acc↑ | ASD Acc↑ |
|---|---|---|---|
| Independent pairwise contrastive (text) | 9.39 | 34.31 | 29.6 |
| Independent pairwise contrastive (audio) | 9.80 | 23.67 | 31.4 |
| Late fusion (average) | 17.00 | 56.04 | 41.2 |
| Late fusion (concatenation) | 18.80 | 63.60 | 48.0 |
Key Findings¶
- Speech and text capture complementary gesture signals: Text is more effective at word-level semantic correspondence (spotting 61.0% vs. 41.8%), whereas speech is more sensitive to stressed/emphasized words (stressed-word spotting gap of 39.4% vs. non-stressed-word gap of 14.8%).
- Both losses are indispensable: The global contrastive loss is critical for retrieval and ASD, while the word coupling loss is critical for spotting; their combination achieves the best performance across all tasks.
- Concatenation fusion outperforms average fusion and independent pairwise contrastive training — the model needs to integrate multiple information streams within a shared embedding space.
Highlights & Insights¶
- Word-level learning under weak supervision: Despite having only phrase-level supervision (with no annotation of which words are gesturally expressed), the max-coupling strategy in the coupling loss successfully learns word-level correspondences.
- Definition of three new tasks and benchmarks: A systematic evaluation framework for co-speech gesture understanding (retrieval / spotting / ASD) is established, providing a foundation for advancing the field.
- Face masking design: Masking facial regions prevents lip-motion information leakage, ensuring that the learned representations encode pure gesture signals.
Limitations & Future Work¶
- Training data is predominantly English; cross-lingual and cross-cultural generalization of gesture understanding remains unvalidated.
- The gesture encoder operates on RGB video, incurring high computational cost; future work could explore lightweight inputs based on 2D/3D keypoints.
- Only hand gestures are considered; other non-verbal signals such as head movements and facial expressions are not incorporated.
- On the active speaker detection task, JEGAL still lags behind GestSync, indicating room for improvement in frame-level synchronization alignment.
Related Work & Insights¶
- A fundamental distinction from sign language understanding: in sign language, gestures are the primary communication channel (with text as a translation), whereas co-speech gestures supplement spoken language (co-occurring with words but not translating them), necessitating entirely different modeling approaches.
- The max-coupling strategy in the coupling loss is conceptually analogous to Multiple Instance Learning (MIL), which learns instance-level information from bag-level labels only.
- The learned representations have potential applications in digital human gesture generation, language learning assistance, and privacy-preserving speaker detection.
Rating¶
- Novelty: ⭐⭐⭐⭐ A systematic tri-modal framework for co-speech gesture understanding with an elegantly designed word coupling loss
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three new benchmark tasks with comprehensive ablation studies and insightful analysis of speech vs. text contributions
- Writing Quality: ⭐⭐⭐⭐⭐ Clear motivation, excellent figures, and thorough analysis
- Value: ⭐⭐⭐⭐ Opens a systematic research direction for co-speech gesture understanding; the three benchmarks provide lasting value to the community