ACL 2025 Self-Supervised Learning Sign Language Representation Learning Self-Supervised Pre-training Multi-Stream Cluster Prediction Masked Prediction HuBERT

SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction

Conference: ACL 2025
arXiv: 2411.16765
Code: http://shubert.pals.ttic.edu
Area: Self-Supervised Learning / Sign Language Processing
Keywords: Sign Language Representation Learning, Self-Supervised Pre-training, Multi-Stream Cluster Prediction, Masked Prediction, HuBERT

TL;DR

Proposes SHuBERT (Sign Hidden-Unit BERT), which adapts the masked cluster prediction paradigm of the speech self-supervised model HuBERT to sign language video. Hand, face, and body pose streams are clustered separately, and the model jointly predicts the cluster labels of masked frames in every stream. Pre-trained on approximately 984 hours of ASL video, it achieves state-of-the-art (SOTA) results on public benchmarks for translation, isolated sign recognition, and fingerspelling detection.

Background & Motivation

Background: Sign language processing (translation/recognition) traditionally relies on task-specific models. Existing pre-training methods can be categorized into supervised pre-training (requiring large amounts of annotated data, e.g., 6,600 hours) and self-supervised pre-training (e.g., MAE). However, current self-supervised methods either learn context-independent frame/segment representations or model only a subset of modalities (e.g., hand only).

Limitations of Prior Work: (1) Sign language data is scarce, and annotation costs are extremely high; (2) Sign language is multi-channel, where hands, facial expressions, and body poses simultaneously convey semantic information, causing single-channel models to lose key information; (3) Existing self-supervised methods (like MAE in SSVP-SLT) require massive computational resources (64×A100 for 14 days of training) and only process 128-frame/8-second clips, failing to model long-range dependencies.

Key Challenge: A self-supervised representation learning method is needed that can simultaneously handle the multi-channel nature of sign language, model long-range context, and remain computationally efficient.

Goal: To learn unified, contextual, and multi-channel self-supervised representations for sign language videos.

Key Insight: HuBERT in the speech domain has successfully learned contextual speech representations via masked cluster prediction. Sign language and speech share similar challenges (no predefined tokens, variable-length units, no explicit boundaries), suggesting that the HuBERT paradigm can be adapted to multi-stream sign language inputs.

Core Idea: Four-stream input (left hand/right hand/face/body pose) \(\rightarrow\) independent k-means clustering per stream \(\rightarrow\) masked prediction to simultaneously predict cluster labels of the four streams \(\rightarrow\) a single pre-trained model applicable to multiple downstream tasks.
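
One way to write the joint objective implied by this pipeline, as a sketch: let \(M\) be the set of masked frame indices, \(\tilde{X}\) the masked input sequence, \(z_t^{(s)}\) the k-means label of frame \(t\) in stream \(s\), and \(p_s\) the output distribution of the prediction head for stream \(s\). The notation and the equal weighting of the four streams are assumptions made here for illustration, not details taken from the paper.

\[
\mathcal{L} = -\sum_{s \in \{\text{LH},\, \text{RH},\, \text{face},\, \text{body}\}} \; \sum_{t \in M} \log p_s\!\left(z_t^{(s)} \mid \tilde{X}, t\right)
\]

Because every head reads the same encoder output at frame \(t\), lowering this loss for one stream can draw on evidence from the other three.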

Method

Overall Architecture

Video \(\rightarrow\) MediaPipe keypoint extraction \(\rightarrow\) cropping hands/face/body pose \(\rightarrow\) DINOv2 feature extraction \(\rightarrow\) four-stream k-means clustering to generate pseudo-labels \(\rightarrow\) Transformer encoder for masked cluster prediction \(\rightarrow\) fine-tuning on downstream tasks.
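
The pseudo-label step in this pipeline can be sketched as follows. This is a minimal illustration, assuming plain Lloyd's k-means run independently on each stream. The function names, the toy feature sizes, and `k=4` are hypothetical choices for the demo (the summary states \(k=256\) on real features), and the paper's actual clustering setup may differ.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


def make_pseudo_labels(streams, k):
    """Cluster each stream on its own; return one label sequence per stream."""
    return {name: kmeans(feats, k)[1] for name, feats in streams.items()}


# Toy data: 40 frames, four streams with made-up feature sizes.
rng = random.Random(1)
dims = {"left_hand": 6, "right_hand": 6, "face": 6, "body": 14}
streams = {
    name: [[rng.gauss(0, 1) for _ in range(d)] for _ in range(40)]
    for name, d in dims.items()
}
labels = make_pseudo_labels(streams, k=4)
print({name: seq[:8] for name, seq in labels.items()})
```

Each frame ends up with four discrete targets, one per stream, which is what the masked prediction heads are later trained against.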

Key Designs

Four-Stream Feature Preprocessing:
- Left/Right Hand: MediaPipe detects hand keypoints \(\rightarrow\) crop and resize to 224×224 \(\rightarrow\) DINOv2 (fine-tuned on hand data) extracts 384-dimensional features.
- Face: MediaPipe detects face \(\rightarrow\) retain mouth and eye regions, grayscale the rest \(\rightarrow\) Gaussian blur (for privacy protection) \(\rightarrow\) DINOv2 (fine-tuned on facial data) extracts 384-dimensional features.
- Body Pose: 7 upper-body keypoints (nose, both shoulders, both elbows, both wrists), whose normalized (x, y) coordinates are flattened into a 14-dimensional vector.
- Design Motivation: Hand keypoint estimation is inaccurate in capturing hand shapes, making DINOv2 features superior; facial processing balances linguistic information retention and privacy protection.
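
To make the body pose stream concrete, here is a small sketch of turning 7 keypoints into a 14-dimensional vector. The summary does not say how the coordinates are normalized, so the scheme below (centering on the shoulder midpoint and dividing by shoulder width) is an assumption, and the keypoint names and `pose_vector` are hypothetical.

```python
import math

KEYPOINTS = [
    "nose", "l_shoulder", "r_shoulder",
    "l_elbow", "r_elbow", "l_wrist", "r_wrist",
]


def pose_vector(kp):
    """Flatten 7 (x, y) keypoints into a 14-dim vector.

    Assumed normalization: translate so the shoulder midpoint is the
    origin, then divide by the shoulder width.
    """
    lx, ly = kp["l_shoulder"]
    rx, ry = kp["r_shoulder"]
    cx, cy = (lx + rx) / 2, (ly + ry) / 2
    width = math.hypot(lx - rx, ly - ry) or 1.0  # guard against zero width
    vec = []
    for name in KEYPOINTS:
        x, y = kp[name]
        vec.extend([(x - cx) / width, (y - cy) / width])
    return vec


# Pixel coordinates for one made-up frame.
frame = {
    "nose": (320, 120), "l_shoulder": (400, 240), "r_shoulder": (240, 240),
    "l_elbow": (430, 360), "r_elbow": (210, 360),
    "l_wrist": (380, 300), "r_wrist": (260, 420),
}
vec = pose_vector(frame)
print(len(vec))  # prints 14
```

A normalization of this kind makes the vector independent of where the signer stands in the frame and how large they appear.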

Self-Supervised Pre-Training (Masked Cluster Prediction):
- Function: Predict the four-stream cluster labels for masked frames.
- Mechanism: The four-stream features are linearly projected to 256 dimensions each \(\rightarrow\) concatenated to 1024 dimensions per frame \(\rightarrow\) random span masking (span = 3 frames \(\approx\) 200 ms, roughly the duration of a fingerspelled letter) \(\rightarrow\) 12-layer Transformer encoder \(\rightarrow\) four linear classification heads separately predict the k-means cluster labels (\(k=256\)) of the masked positions.
- Design Motivation: Clustering each stream independently but predicting them jointly allows the model to learn cross-stream dependencies. Random masking is more effective than channel-wise or purely temporal masking.
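
The masking and loss described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the span-start probability is hypothetical (the summary gives only the span length), spans are allowed to overlap, the loss is averaged over masked frames and summed over streams with equal weight, and the `logits` argument stands in for the outputs of the four classification heads.

```python
import math
import random

SPAN = 3          # frames per masked span (from the text)
NUM_STREAMS = 4   # left hand, right hand, face, body
K = 256           # clusters per stream (from the text)


def span_mask(num_frames, start_prob, span=SPAN, seed=0):
    """Return a list of booleans; True marks a masked frame.

    Each frame is chosen as a span start with probability `start_prob`
    (hypothetical; the masking rate is not given in the summary).
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    for t in range(num_frames):
        if rng.random() < start_prob:
            for u in range(t, min(t + span, num_frames)):
                mask[u] = True
    return mask


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def masked_multistream_loss(logits, targets, mask):
    """Cross-entropy averaged over masked frames, summed over streams.

    logits[s][t] is a length-K list of scores for stream s at frame t;
    targets[s][t] is the k-means label for that stream and frame.
    """
    masked = [t for t, m in enumerate(mask) if m]
    if not masked:
        return 0.0
    total = 0.0
    for s in range(len(logits)):
        nll = sum(-log_softmax(logits[s][t])[targets[s][t]] for t in masked)
        total += nll / len(masked)
    return total


T = 30
mask = span_mask(T, start_prob=0.3)
rng = random.Random(1)
targets = [[rng.randrange(K) for _ in range(T)] for _ in range(NUM_STREAMS)]
uniform = [[[0.0] * K for _ in range(T)] for _ in range(NUM_STREAMS)]
print(sum(mask), masked_multistream_loss(uniform, targets, mask))
```

With uniform scores, each stream contributes \(\ln 256\) per masked frame, which is the chance-level value an untrained model would start from.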

Multi-Task Fine-Tuning:
- Translation (SLT): SHuBERT + ByT5 decoder, trained in two stages (YouTube-ASL pre-training \(\rightarrow\) target dataset fine-tuning).
- Isolated Sign Language Recognition (ISLR): SHuBERT + linear classification head.
- Fingerspelling Detection: SHuBERT + binary classification head (detecting active fingerspelling).
- All Transformer layer outputs are combined using learned layer weights.
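
The layer combination in the last bullet can be sketched as a weighted sum of per-layer features. The summary only says the weights are learned; normalizing them with a softmax over one scalar per layer is an assumption made here, and `weighted_layer_sum` is a hypothetical name.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scalars."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_outputs, layer_logits):
    """Combine per-layer features with softmax-normalized scalar weights.

    layer_outputs[l][t] is the feature vector of layer l at frame t.
    layer_logits holds one scalar per layer; in training these would be
    learnable parameters updated together with the task head.
    """
    weights = softmax(layer_logits)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    combined = [[0.0] * dim for _ in range(num_frames)]
    for w, layer in zip(weights, layer_outputs):
        for t in range(num_frames):
            for d in range(dim):
                combined[t][d] += w * layer[t][d]
    return combined


# Toy check: 3 layers, 2 frames, 2 dims; layer l outputs the constant l.
layers = [[[float(l)] * 2 for _ in range(2)] for l in range(3)]
out = weighted_layer_sum(layers, layer_logits=[0.0, 0.0, 0.0])
print(out)
```

Letting each task learn its own weights means translation, ISLR, and fingerspelling detection can each draw on different depths of the same encoder.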

Loss & Training

Pre-training: 984 hours of ASL video, 8×A6000 GPUs, approx. 7 days, 400K steps.
Adam optimizer, peak lr = 5e-4, cosine schedule + linear warmup.
86M parameters (12-layer Transformer, \(d=768\), \(h=12\)).
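
The learning-rate schedule can be sketched from the figures above. The peak rate and total step count come from the text; the warmup length and the decay floor of zero are not stated in the summary and are hypothetical values chosen for illustration.

```python
import math

PEAK_LR = 5e-4          # from the text
TOTAL_STEPS = 400_000   # from the text
WARMUP_STEPS = 32_000   # hypothetical; not stated in the summary


def lr_at(step, peak=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Linear warmup from 0 to `peak`, then cosine decay to 0 at `total`."""
    if step < warmup:
        return peak * step / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


for step in (0, 16_000, 32_000, 216_000, 400_000):
    print(step, f"{lr_at(step):.2e}")
```

The rate climbs linearly to the peak at the end of warmup, passes through half the peak midway through the decay phase, and reaches zero at the final step.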

Key Experimental Results

Main Results (Sign Language Translation on Public Data)

Method	Self-Supervised	Pre-Training Data (hours)	How2Sign BLEU↑	OpenASL BLEU↑	FLEURS-ASL BLEU↑
Uthus 2023	×	984h	12.4	-	-
SSVP (Rust 2024)	✓	1054h	15.5	-	-
Tanzer 2024	×	3207h	15.4	-	4.4
Uni-Sign	×	984h	14.9	23.1*	-
SHuBERT	✓	984h	16.2	23.2	4.7

*Uni-Sign's pre-training data contains more than 72% of the OpenASL test set, so its OpenASL score is not directly comparable.

Ablation Study

Configuration	How2Sign BLEURT
Random masking (Default)	49.9
Channel masking	48.7
Time masking	49.1
Frozen SHuBERT	49.1
Fine-tuned SHuBERT	49.9

Key Findings

Single Pre-trained Model Achieves Multi-Task SOTA: The exact same SHuBERT model achieves SOTA on all public benchmarks for translation, ISLR, and fingerspelling detection, demonstrating the generality of the representations.
Computational Efficiency Advantage: 8×A6000 for 7 days (56 GPU-days) vs. SSVP's 64×A100 for 14 days (896 GPU-days): a 16× gap in raw GPU-days and, by the reported estimate, approx. \(50 \times\) in computational cost. The savings come from multi-stream features and compact representations.
Strong Frozen SHuBERT: With the encoder frozen, translation quality drops by only 0.8 BLEURT (49.1 vs. 49.9), indicating that the pre-trained representations are usable without fine-tuning.
Necessity of Multi-Stream Joint Modeling: Joint prediction across all four streams outperforms separate modeling, successfully capturing cross-stream dependencies (e.g., the combination of hand shape and facial expression to express negation).
Natural Sign Language > Translated Sign Language: The largest gain is observed on OpenASL, which contains natural ASL (+10 BLEU vs baseline), indicating that pre-training is more effective on domain-similar data.
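
As a back-of-the-envelope check on the efficiency comparison above, raw GPU-days can be computed from the quoted hardware figures. This sketch deliberately ignores the per-GPU throughput difference between an A6000 and an A100, so it understates the gap in actual compute; `gpu_days` is a hypothetical helper.

```python
def gpu_days(num_gpus, days):
    """Raw GPU-days, ignoring differences between GPU models."""
    return num_gpus * days


shubert = gpu_days(8, 7)    # 8×A6000 for 7 days
ssvp = gpu_days(64, 14)     # 64×A100 for 14 days
print(shubert, ssvp, ssvp / shubert)  # prints 56 896 16.0
```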

Highlights & Insights

Modality Migration from HuBERT to SHuBERT: Adapting the self-supervised speech paradigm to visual sign language is a natural yet non-trivial transfer. The key innovation is using multi-stream clustering and joint prediction to replace the single-stream setup in speech HuBERT.
Privacy-Friendly Facial Representation: Grayscaling and blurring the face while retaining the mouth and eye regions balances privacy concerns with the retention of linguistic information.
DINOv2 as a Sign Language Feature Extractor: Continued pre-training of DINOv2 on 5M hand/face crops produces features that capture hand shape more accurately than estimated keypoints.

Limitations & Future Work

Validated only on ASL; generalization across diverse sign languages (e.g., DGS/BSL/CSL) remains to be explored.
The model has 86M parameters (base model size); scaling up might yield further improvements.
Relies on MediaPipe hand detection (approx. 95% accuracy); interpolation is required when detection fails.
Combining the masked prediction objective with auxiliary losses (e.g., contrastive learning, multi-task joint training) has not been explored.
Facial blurring might lose subtle non-manual markers (e.g., eyebrow movements).

vs SSVP-SLT (MAE): SSVP reconstructs image pixels via MAE, which is computationally expensive (64×A100 × 14 days) and only handles 128 frames. SHuBERT replaces pixel reconstruction with cluster prediction, which is far more efficient.
vs SignBERT+: SignBERT+ models only hand pose, lacking facial and body information, and its pre-training data overlaps with downstream test data. SHuBERT jointly models four streams and keeps pre-training and test data fully separate.
vs HuBERT (Speech): SHuBERT is a natural extension of HuBERT to sign language, with the core difference being multi-stream clustering and random span masking.

Rating

Novelty: ⭐⭐⭐⭐⭐ The transfer design from HuBERT to sign language is elegant, and the multi-stream cluster prediction is a successful adaptation tailored to sign language characteristics.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across six benchmarks spanning three major tasks, featuring detailed ablation studies and comparisons against approaches utilizing private datasets.
Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, well-defined motivation, and high-quality figures and tables.
Value: ⭐⭐⭐⭐⭐ A breakthrough foundation model for sign language processing that is computationally efficient, open-source, and reproducible.