
CPEP: Contrastive Pose-EMG Pre-training Enhances Gesture Generalization on EMG Signals

Conference: NeurIPS 2025 · arXiv: 2509.04699 · Code: None · Area: Human Understanding / Gesture Recognition · Keywords: EMG signals, gesture recognition, contrastive learning, zero-shot classification, cross-modal alignment

TL;DR

This paper proposes the CPEP framework, which employs contrastive learning to align low-quality EMG signal representations with high-quality hand pose representations, endowing the EMG encoder with pose-awareness. CPEP is the first to achieve zero-shot recognition of unseen gestures from EMG signals, yielding a 21% improvement on in-distribution gesture classification and a 72% improvement on unseen gesture classification.

Background & Motivation

Background: Vision-based gesture recognition has matured considerably, yet remains constrained by power consumption and privacy concerns in wearable device deployments. Surface EMG (sEMG) signals are low-power and easy to integrate, making them well-suited for real-time gesture prediction on wearable platforms.

Limitations of Prior Work: (a) EMG signals exhibit low signal-to-noise ratios and high variability, limiting the effectiveness of conventional self-supervised pre-training; (b) supervised methods (e.g., emg2pose for pose regression) generalize poorly and cannot recognize unseen gestures or adapt to new users; (c) large-scale EMG data collection is costly and difficult.

Key Challenge: EMG is a "weak modality" from which high-quality representations are difficult to learn in isolation, whereas hand pose is a "strong modality" that encodes rich structural and semantic information. The core challenge is how to leverage the strong modality's prior knowledge to improve representations of the weak modality.

Goal: To enable the EMG encoder to learn pose-aware representations, supporting zero-shot gesture classification via embedding-space retrieval against pose references.

Key Insight: The framework adapts CLIP-style cross-modal contrastive pre-training to the EMG–pose setting with two design changes: unimodal encoders are pre-trained first to reduce the amount of paired data required, and the strong-modality (pose) encoder is frozen while only the weak-modality (EMG) encoder is trained.

Core Idea: Contrastive learning is used to pull EMG representations toward paired pose representations, enabling zero-shot gesture recognition without task-specific fine-tuning.

Method

Overall Architecture

CPEP consists of three stages: (1) MAE-based self-supervised pre-training of both EMG and pose encoders; (2) contrastive pre-training with the pose encoder frozen, aligning the [CLS] representations of both modalities via InfoNCE; (3) downstream evaluation via linear probing or zero-shot nearest-neighbor retrieval.

Key Designs

  1. Unimodal Encoder Pre-training (MAE):

    • Function: Pre-trains separate Transformer encoders for EMG and pose modalities independently.
    • Mechanism: Standard MAE with temporal patching and a mask ratio \(r=50\%\); only unmasked tokens are encoded, and the decoder reconstructs the full sequence: \(\mathcal{L}_{\text{MAE}} = \frac{1}{|\mathcal{M}|}\sum_{i\in\mathcal{M}} \|\psi(\phi(\{\mathbf{z}_j\}_{j\notin\mathcal{M}}))_i - \mathbf{z}_i\|_2^2\), where \(\phi\) is the encoder, \(\psi\) the decoder, and \(\mathcal{M}\) the set of masked token indices (see the first sketch after this list).
    • Design Motivation: Learning robust unimodal representations prior to contrastive alignment reduces the paired data requirements during the contrastive stage.
  2. Contrastive Pose-EMG Pre-training (CPEP):

    • Function: Freezes the pose encoder \(\mathcal{E}_p\) while training the EMG encoder together with a projection head \(h\).
    • Mechanism: EMG embeddings are computed as \(u_i = h(\mathcal{E}_x(x_i))_{[\text{CLS}]}\) and pose embeddings as \(v_i = (\mathcal{E}_p(p_i))_{[\text{CLS}]}\). After \(\ell_2\) normalization (denoted \(\tilde{u}, \tilde{v}\)), symmetric InfoNCE is applied: \(\mathcal{L}_{\text{CPEP}} = \frac{1}{2N}\sum_{i} [-\log\frac{\exp(s_{ii})}{\sum_j\exp(s_{ij})} - \log\frac{\exp(s_{ii})}{\sum_j\exp(s_{ji})}]\), where \(s_{ij} = \tilde{u}_i^\top\tilde{v}_j / \tau\) (see the second sketch after this list).
    • Design Motivation: Freezing the pose encoder is critical—jointly updating both encoders degrades pose representation quality and leads to training divergence, as confirmed by ablation experiments.
  3. Zero-Shot Classification Protocol:

    • Function: \(k\)-nearest neighbor voting in the embedding space.
    • Mechanism: Pose embeddings are pre-computed offline; for each EMG query, the top-\(k\) (\(k=10\)) nearest pose embeddings are retrieved, and a majority vote over the retrieved set \(\mathcal{R}_j\) determines the predicted label: \(\hat{y}_j = \text{mode}\{y(p) \mid p \in \mathcal{R}_j\}\) (see the third sketch after this list).
    • Design Motivation: Zero-shot performance validates that the EMG representations have internalized the structural information of hand poses.
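
A minimal sketch of the stage-1 MAE objective, assuming PyTorch and a generic encoder/decoder pair; the tensor layout, the `decoder` signature, and all names here are illustrative assumptions, not the paper's code (which is not released).

```python
import torch

def mae_loss(encoder, decoder, z, mask_ratio=0.5):
    """Mask 50% of temporal patch tokens, encode only the visible ones,
    reconstruct the full sequence, and score L2 error on masked positions.
    z: (B, T, D) batch of temporal patch embeddings."""
    B, T, D = z.shape
    num_masked = int(T * mask_ratio)
    # Random permutation per sample; the first num_masked indices are masked.
    perm = torch.rand(B, T, device=z.device).argsort(dim=1)
    masked_idx, visible_idx = perm[:, :num_masked], perm[:, num_masked:]

    batch = torch.arange(B, device=z.device).unsqueeze(1)
    visible = z[batch, visible_idx]          # (B, T - num_masked, D)

    latent = encoder(visible)                # phi: encodes visible tokens only
    recon = decoder(latent, visible_idx, T)  # psi: hypothetical decoder that
                                             # scatters mask tokens and predicts all T
    pred = recon[batch, masked_idx]
    target = z[batch, masked_idx]
    # Mean squared L2 norm over the masked set M, matching L_MAE above.
    return ((pred - target) ** 2).sum(-1).mean()
```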
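The stage-2 alignment loss admits a direct implementation; below is a sketch assuming PyTorch, with the learnable temperature initialized at \(\tau = 0.02\) as reported. `emg_cls` and `pose_cls` stand in for the projected EMG [CLS] embeddings (trainable branch) and the frozen pose encoder's [CLS] embeddings.

```python
import torch
import torch.nn.functional as F

# Learnable temperature, parameterized as log(1/tau) for stability; tau init = 0.02.
log_inv_tau = torch.nn.Parameter(torch.log(torch.tensor(1.0 / 0.02)))

def cpep_loss(emg_cls, pose_cls):
    """Symmetric InfoNCE between paired EMG and pose [CLS] embeddings.
    emg_cls:  (N, D) projected EMG embeddings u_i (gradients flow here)
    pose_cls: (N, D) pose embeddings v_i (frozen encoder, no gradients)."""
    u = F.normalize(emg_cls, dim=-1)     # \tilde{u}_i
    v = F.normalize(pose_cls, dim=-1)    # \tilde{v}_j
    s = (u @ v.t()) * log_inv_tau.exp()  # s_ij = \tilde{u}_i^T \tilde{v}_j / tau
    targets = torch.arange(s.size(0), device=s.device)
    # Row-wise (EMG->pose) and column-wise (pose->EMG) cross-entropy,
    # averaged to give the symmetric 1/(2N) loss above.
    return 0.5 * (F.cross_entropy(s, targets) + F.cross_entropy(s.t(), targets))
```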
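Finally, a sketch of the zero-shot retrieval protocol: pose references are embedded once offline, and each EMG query is labeled by majority vote over its \(k = 10\) nearest references under cosine similarity. Function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(emg_emb, pose_bank, pose_labels, k=10):
    """k-NN voting in the shared embedding space.
    emg_emb:     (Q, D) EMG query embeddings
    pose_bank:   (M, D) pre-computed pose reference embeddings
    pose_labels: (M,)   integer gesture label of each reference."""
    q = F.normalize(emg_emb, dim=-1)
    r = F.normalize(pose_bank, dim=-1)
    sim = q @ r.t()                     # cosine similarities, shape (Q, M)
    topk = sim.topk(k, dim=-1).indices  # R_j: indices of the k nearest references
    votes = pose_labels[topk]           # (Q, k) labels of the retrieved set
    # Majority vote = mode along the neighbor axis, matching y_hat_j above.
    return votes.mode(dim=-1).values
```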

Loss & Training

Three-stage training pipeline: EMG/Pose-MAE pre-training for 100 epochs each → CPEP contrastive pre-training for 100 epochs (batch size 256, learnable temperature \(\tau\) initialized at 0.02) → linear probing. Training is conducted on 4× V100 GPUs and takes approximately 4.5 hours per model.

Key Experimental Results

Main Results (Gesture Classification Accuracy)

| Method | LP In-Dist. | LP Unseen | ZS In-Dist. | ZS Unseen |
|---|---|---|---|---|
| emg2pose (baseline) | 0.647 | 0.312 | - | - |
| EMG-MAE | ~0.55 | ~0.30 | - | - |
| PoseT (supervised) | ~0.60 | ~0.35 | - | - |
| CPEP | 0.782 | 0.536 | 0.757 | 0.481 |
| Pose-MAE (upper bound) | ~0.85 | ~0.65 | - | - |

Ablation Study

| Configuration | LP In-Dist. | ZS In-Dist. | LP Unseen | ZS Unseen |
|---|---|---|---|---|
| EMG encoder frozen | 0.372 | 0.344 | 0.326 | 0.298 |
| EMG encoder randomly initialized | 0.748 | 0.701 | 0.479 | 0.454 |
| AvgPool (instead of [CLS]) | 0.761 | 0.711 | 0.518 | 0.454 |
| CPEP (full) | 0.782 | 0.757 | 0.536 | 0.481 |

Key Findings

  • MAE pre-training initialization is critical: random initialization leads to slower convergence and lower accuracy, and jointly training both encoders fails to converge.
  • [CLS] token outperforms AvgPool, indicating that global context is more effective for gesture recognition.
  • Freezing the EMG encoder yields drastically worse performance (0.372 vs. 0.782), confirming that fine-tuning the EMG encoder is necessary.
  • Longer EMG patches degrade performance, underscoring the need for fine-grained temporal modeling.

Highlights & Insights

  • First zero-shot gesture recognition framework for EMG: The zero-shot results surpass the baseline's linear probing performance (0.481 vs. 0.312 on unseen gestures), demonstrating that contrastive pre-training yields representations with genuine generalization capability.
  • The strong-modality-anchors-weak-modality paradigm is transferable to analogous settings such as IMU–video alignment and EEG–behavior alignment.

Limitations & Future Work

  • Validation is limited to a single dataset (emg2pose); generalizability to other EMG acquisition devices and protocols remains untested.
  • The gesture vocabulary is small (4+4 classes), whereas practical applications require recognition of dozens to hundreds of gestures.
  • No comparison is made against advanced contrastive learning methods such as SigLIP or CLAP.
  • As a workshop paper, the experimental scale is limited and statistical significance is not reported.
  • Online adaptation and few-shot fine-tuning scenarios are not explored.
  • Robustness to inter-subject EMG signal variability is insufficiently analyzed.
Comparison with Related Work

  • vs. emg2pose: Supervised pose regression offers limited generalization; CPEP's contrastive alignment produces structured embeddings that support zero-shot retrieval.
  • vs. CLIP: CPEP adopts the cross-modal contrastive learning paradigm but introduces key adaptations: pre-training the unimodal encoders to reduce paired-data requirements, and freezing the strong-modality (pose) encoder.
  • vs. NeuroPose/Vemg2pose: These baselines also employ Transformer architectures but are trained with supervised regression objectives, yielding embeddings of insufficient quality for retrieval-based classification.

Rating

  • Novelty: ⭐⭐⭐⭐ First application of CLIP-style contrastive pre-training to EMG–pose alignment for zero-shot gesture recognition.
  • Experimental Thoroughness: ⭐⭐⭐ Workshop paper; single dataset; limited gesture vocabulary.
  • Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear and method description is concise.
  • Value: ⭐⭐⭐⭐ Opens a new direction for zero-shot EMG-based gesture recognition.