SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction
Conference: ACL 2025
arXiv: 2411.16765
Code: http://shubert.pals.ttic.edu
Area: Self-Supervised Learning / Sign Language Processing
Keywords: Sign Language Representation Learning, Self-Supervised Pre-training, Multi-Stream Cluster Prediction, Masked Prediction, HuBERT
TL;DR
Proposes SHuBERT (Sign Hidden-Unit BERT), which adapts the masked cluster prediction paradigm of the self-supervised speech model HuBERT to sign language video. The model clusters hand, face, and body pose streams separately and jointly predicts the cluster labels of masked frames. Pre-trained on approximately 984 hours of ASL video, it achieves state-of-the-art (SOTA) results on public benchmarks for translation, isolated sign recognition, and fingerspelling detection.
Background & Motivation
Background: Sign language processing (translation/recognition) traditionally relies on task-specific models. Existing pre-training methods can be categorized into supervised pre-training (requiring large amounts of annotated data, e.g., 6,600 hours) and self-supervised pre-training (e.g., MAE). However, current self-supervised methods either learn context-independent frame/segment representations or model only a subset of modalities (e.g., hand only).
Limitations of Prior Work: (1) Sign language data is scarce, and annotation costs are extremely high; (2) Sign language is multi-channel, where hands, facial expressions, and body poses simultaneously convey semantic information, causing single-channel models to lose key information; (3) Existing self-supervised methods (like MAE in SSVP-SLT) require massive computational resources (64×A100 for 14 days of training) and only process 128-frame/8-second clips, failing to model long-range dependencies.
Key Challenge: A self-supervised representation learning method is needed that can simultaneously handle the multi-channel nature of sign language, model long-range context, and remain computationally efficient.
Goal: To learn unified, contextual, and multi-channel self-supervised representations for sign language videos.
Key Insight: HuBERT in the speech domain has successfully learned contextual speech representations via masked cluster prediction. Sign language and speech share similar challenges (no predefined tokens, variable-length units, no explicit boundaries), suggesting that the HuBERT paradigm can be adapted to multi-stream sign language inputs.
Core Idea: Four-stream input (left hand/right hand/face/body pose) \(\rightarrow\) independent k-means clustering per stream \(\rightarrow\) masked prediction to simultaneously predict cluster labels of the four streams \(\rightarrow\) a single pre-trained model applicable to multiple downstream tasks.
Method
Overall Architecture
Video \(\rightarrow\) MediaPipe keypoint extraction \(\rightarrow\) cropping hands/face/body pose \(\rightarrow\) DINOv2 feature extraction \(\rightarrow\) four-stream k-means clustering to generate pseudo-labels \(\rightarrow\) Transformer encoder for masked cluster prediction \(\rightarrow\) fine-tuning on downstream tasks.
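The pseudo-labeling step of this pipeline can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the helper names (`assign`, `kmeans`, `pseudo_labels`), the toy low-dimensional features, and `k=2` are assumptions chosen for readability, whereas the summary reports \(k=256\) clusters per stream on DINOv2 and pose features.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(frames, centroids):
    """Return the index of the nearest centroid for every frame."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(f, centroids[c]))
        for f in frames
    ]


def kmeans(frames, k, iterations=10, seed=0):
    """Plain Lloyd's algorithm; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iterations):
        labels = assign(frames, centroids)
        for c in range(k):
            members = [f for f, label in zip(frames, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def pseudo_labels(streams, k):
    """Cluster each stream independently; one label sequence per stream."""
    return {
        name: assign(frames, kmeans(frames, k))
        for name, frames in streams.items()
    }


# Toy usage: two streams, four frames each (values are made up).
toy_streams = {
    "left_hand": [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]],
    "body": [[1.0], [1.1], [5.0], [5.2]],
}
toy_labels = pseudo_labels(toy_streams, k=2)
```

Each stream gets its own codebook, so a frame ends up with one cluster id per stream; these ids are the targets that the masked prediction objective is trained on.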
Key Designs
- Four-Stream Feature Preprocessing:
  - Left/Right Hand: MediaPipe detects hand keypoints \(\rightarrow\) crop and resize to 224×224 \(\rightarrow\) DINOv2 (fine-tuned on hand data) extracts 384-dimensional features.
  - Face: MediaPipe detects face \(\rightarrow\) retain mouth and eye regions, grayscale the rest \(\rightarrow\) Gaussian blur (for privacy protection) \(\rightarrow\) DINOv2 (fine-tuned on facial data) extracts 384-dimensional features.
  - Body Pose: 7 upper-body keypoints (nose, shoulders, elbows, wrists) normalized to a 14-dimensional vector.
  - Design Motivation: Hand keypoint estimation is inaccurate in capturing hand shapes, making DINOv2 features superior; facial processing balances linguistic information retention and privacy protection.
- Self-Supervised Pre-Training (Masked Cluster Prediction):
  - Function: Predict the four-stream cluster labels for masked frames.
  - Mechanism: The four-stream features are linearly projected to 256 dimensions each \(\rightarrow\) concatenated to 1024 dimensions per frame \(\rightarrow\) random span masking (span = 3 frames \(\approx\) 200 ms, roughly the duration of a fingerspelled letter) \(\rightarrow\) 12-layer Transformer encoder \(\rightarrow\) four linear classification heads separately predict the k-means cluster labels (\(k=256\)) of the masked positions.
  - Design Motivation: Clustering each stream independently but predicting them jointly allows the model to learn cross-stream dependencies. In the How2Sign ablation, random masking (49.9 BLEURT) outperforms channel-wise masking (48.7) and purely temporal masking (49.1).
- Multi-Task Fine-Tuning:
  - Translation (SLT): SHuBERT + ByT5 decoder, trained in two stages (YouTube-ASL pre-training \(\rightarrow\) target dataset fine-tuning).
  - Isolated Sign Language Recognition (ISLR): SHuBERT + linear classification head.
  - Fingerspelling Detection: SHuBERT + binary classification head (detecting active fingerspelling).
  - All Transformer layer outputs are combined using learned layer weights.
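The masked cluster prediction objective described in the pre-training item can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the released training code: `mask_prob`, the equal weighting of the four per-stream losses, and the function names are hypothetical, and the Transformer that would produce the logits is omitted. Only the span length of 3 frames and the four prediction heads come from the description above.

```python
import math
import random

STREAMS = ("left_hand", "right_hand", "face", "body")


def sample_span_mask(num_frames, mask_prob=0.1, span=3, seed=0):
    """Pick random start frames and mask `span` consecutive frames from each.

    mask_prob is an assumed hyperparameter; the summary only states the span.
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < mask_prob:
            for t in range(start, min(start + span, num_frames)):
                mask[t] = True
    return mask


def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def masked_cluster_loss(logits, targets, mask):
    """Sum over streams of the mean cross-entropy at masked frames.

    logits[stream][t] is a list of k scores from that stream's head;
    targets[stream][t] is the k-means cluster id of frame t in that stream.
    """
    masked = [t for t, is_masked in enumerate(mask) if is_masked]
    if not masked:
        return 0.0
    total = 0.0
    for stream in STREAMS:
        stream_loss = sum(
            cross_entropy(logits[stream][t], targets[stream][t]) for t in masked
        )
        total += stream_loss / len(masked)  # equal stream weights (assumption)
    return total
```

Because all four heads read the same encoder output, the encoder has to infer a masked frame's hand, face, and body clusters from shared context, which is what encourages it to model cross-stream dependencies.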
Loss & Training
- Pre-training: 984 hours of ASL video, 8×A6000 GPUs, approx. 7 days, 400K steps.
- Adam optimizer, peak lr = 5e-4, cosine schedule + linear warmup.
- 86M parameters (12-layer Transformer, \(d=768\), \(h=12\) attention heads).
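A linear-warmup-plus-cosine schedule of the kind listed above can be written as a small function. The peak learning rate (5e-4) and the 400K total steps are taken from the list; `warmup_steps=32_000` and `final_lr=0.0` are placeholder assumptions, since the summary does not give them.

```python
import math


def learning_rate(step, peak_lr=5e-4, total_steps=400_000,
                  warmup_steps=32_000, final_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to final_lr.

    warmup_steps and final_lr are assumed values for illustration.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold final_lr after total_steps
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```

The rate rises linearly from zero, reaches the peak at the end of warmup, and then follows half a cosine period down to `final_lr` at the last step.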
Key Experimental Results
Main Results (Sign Language Translation on Public Data)
| Method | Self-Supervised | Pre-Training Data (hours) | How2Sign BLEU↑ | OpenASL BLEU↑ | FLEURS-ASL BLEU↑ |
|---|---|---|---|---|---|
| Uthus 2023 | × | 984h | 12.4 | - | - |
| SSVP (Rust 2024) | ✓ | 1054h | 15.5 | - | - |
| Tanzer 2024 | × | 3207h | 15.4 | - | 4.4 |
| Uni-Sign | × | 984h | 14.9 | 23.1* | - |
| SHuBERT | ✓ | 984h | 16.2 | 23.2 | 4.7 |
*Uni-Sign's pre-training data contains more than 72% of the OpenASL test set, so its OpenASL result is not directly comparable.
Ablation Study
| Configuration | How2Sign BLEURT |
|---|---|
| Random masking (Default) | 49.9 |
| Channel masking | 48.7 |
| Time masking | 49.1 |
| Frozen SHuBERT | 49.1 |
| Fine-tuned SHuBERT | 49.9 |
Key Findings
- Single Pre-trained Model Achieves Multi-Task SOTA: The exact same SHuBERT model achieves SOTA on all public benchmarks for translation, ISLR, and fingerspelling detection, demonstrating the generality of the representations.
- Computational Efficiency Advantage: SHuBERT pre-trains on 8×A6000 GPUs in about 7 days, versus 64×A100 GPUs for 14 days for SSVP (an approx. \(50 \times\) difference in computational cost), a gap attributed to the compact multi-stream feature representation.
- Strong Frozen SHuBERT: With the encoder frozen, translation quality drops only slightly (BLEURT 49.1 vs 49.9 when fine-tuned), indicating that the pre-trained representations are already strong without task-specific tuning.
- Necessity of Multi-Stream Joint Modeling: Joint prediction across all four streams outperforms separate modeling, successfully capturing cross-stream dependencies (e.g., the combination of hand shape and facial expression to express negation).
- Natural Sign Language > Translated Sign Language: The largest gain is observed on OpenASL, which contains natural ASL (+10 BLEU vs baseline), indicating that pre-training is more effective on domain-similar data.
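The hardware budgets quoted in the efficiency finding can be compared in raw GPU-days. This is back-of-the-envelope arithmetic that assumes every GPU is busy for the full duration; it deliberately ignores per-GPU throughput (an A100 is faster than an A6000), so the approx. \(50 \times\) cost figure cannot be derived from these counts alone.

```python
def gpu_days(num_gpus, days):
    """Raw GPU-days, assuming all GPUs run for the whole period."""
    return num_gpus * days


shubert_gpu_days = gpu_days(8, 7)    # 8 x A6000 for about 7 days
ssvp_gpu_days = gpu_days(64, 14)     # 64 x A100 for 14 days
raw_ratio = ssvp_gpu_days / shubert_gpu_days

print(shubert_gpu_days, ssvp_gpu_days, raw_ratio)  # prints: 56 896 16.0
```

The raw count is 56 versus 896 GPU-days, a 16× gap before any adjustment for the difference in GPU generation.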
Highlights & Insights
- Modality Migration from HuBERT to SHuBERT: Adapting the self-supervised speech paradigm to visual sign language is a natural yet non-trivial transfer. The key innovation is using multi-stream clustering and joint prediction to replace the single-stream setup in speech HuBERT.
- Privacy-Friendly Facial Representation: Grayscaling and blurring the face while retaining the mouth and eye regions balances privacy concerns with the retention of linguistic information.
- DINOv2 as a Sign Language Feature Extractor: Continued pre-training of DINOv2 on 5M hand/face crops yields features that capture hand shape and facial detail more accurately than keypoints.
Limitations & Future Work
- Validated only on ASL; generalization across diverse sign languages (e.g., DGS/BSL/CSL) remains to be explored.
- The model has 86M parameters (base model size); scaling up might yield further improvements.
- Relies on MediaPipe hand detection (approx. 95% accuracy); interpolation is required when detection fails.
- Combining the cluster prediction objective with auxiliary losses (e.g., contrastive learning, multi-task joint training) has not been explored.
- Facial blurring might lose subtle non-manual markers (e.g., eyebrow movements).
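One simple way to fill frames where hand detection fails is linear interpolation between the nearest detected frames. The sketch below assumes that scheme; the summary does not say which interpolation is used, and `interpolate_missing` is a hypothetical helper operating on a single scalar track (for example, one coordinate of a crop center), with `None` marking failed detections.

```python
def interpolate_missing(track):
    """Fill None entries linearly; gaps at the edges copy the nearest value."""
    known = [i for i, value in enumerate(track) if value is not None]
    if not known:
        raise ValueError("no detections to interpolate from")
    filled = list(track)
    for i in range(len(track)):
        if track[i] is not None:
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:        # gap at the start of the clip
            filled[i] = track[nxt]
        elif nxt is None:       # gap at the end of the clip
            filled[i] = track[prev]
        else:
            w = (i - prev) / (nxt - prev)
            filled[i] = (1 - w) * track[prev] + w * track[nxt]
    return filled
```

Long runs of missed detections would be filled with a straight line under this scheme, which is one reason detector accuracy is listed as a limitation.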
Related Work & Insights
- vs SSVP-SLT (MAE): SSVP reconstructs image pixels via MAE, which is computationally expensive (64×A100 for 14 days) and only handles 128-frame (8-second) clips. SHuBERT replaces pixel reconstruction with cluster prediction, which is far more efficient.
- vs SignBERT+: SignBERT+ only models hand pose, lacking facial and body information, and its pre-training data overlaps with downstream test data. SHuBERT jointly models four streams and keeps pre-training and test data fully separate.
- vs HuBERT (Speech): SHuBERT is a natural extension of HuBERT to sign language, with the core difference being multi-stream clustering and random span masking.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The transfer design from HuBERT to sign language is elegant, and the multi-stream cluster prediction is a successful adaptation tailored to sign language characteristics.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across six benchmarks spanning three major tasks, featuring detailed ablation studies and comparisons against approaches utilizing private datasets.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, well-defined motivation, and high-quality figures and tables.
- Value: ⭐⭐⭐⭐⭐ A breakthrough foundation model for sign language processing that is highly compute-efficient, open-source, and reproducible.