
SignRep: Enhancing Self-Supervised Sign Representations

Conference: ICCV 2025
arXiv: 2503.08529
Area: Sign Language Understanding / Self-Supervised Learning
Keywords: sign language representation learning, self-supervised learning, masked autoencoder, skeleton priors, adversarial style loss, dictionary retrieval

TL;DR

This paper proposes SignRep, a scalable self-supervised sign language representation learning framework that incorporates sign-specific skeleton priors, feature regularization, and an adversarial style-invariant loss into Masked Autoencoder pretraining. Using only a single RGB modality, SignRep surpasses complex multi-modal and multi-branch methods, achieving state-of-the-art performance on three tasks: sign language recognition, dictionary retrieval, and sign language translation.

Background & Motivation

Sign language is a critical communication medium for millions of people worldwide. Accurate sign language understanding requires models to capture complex visual features including handshapes, movements, body posture, and facial expressions. Several key challenges exist in current sign language understanding:

Scarcity of annotated data: Sign languages vary across countries, and collecting labeled data is costly. Existing datasets typically contain no more than 2,000 distinct sign vocabulary items.

Large domain gap with general-purpose pretrained models: A common practice is to pretrain on action recognition datasets such as Kinetics and then fine-tune, but the temporal dynamics and fine-grained gestures in sign language videos differ substantially from general actions.

Excessive complexity of multi-modal/multi-branch architectures: State-of-the-art methods typically require RGB + skeleton multi-modal inputs or multi-branch ensembles, resulting in high computational complexity.

Limitations of skeleton-based models: Keypoint-based methods, while memory-efficient, generally underperform RGB-based methods, and keypoints are prone to missing detections or estimation errors.

Motivation: Can a simple single-modality self-supervised framework be designed that leverages sign-language-specific prior knowledge (e.g., skeleton information) during pretraining but does not rely on keypoints at inference, thereby combining the informativeness of skeleton priors with the performance advantages of RGB-based methods?

Method

Overall Architecture

SignRep builds upon the MAE framework with Hiera (Hierarchical Vision Transformer) as the backbone. The core improvements are:

  1. Replacing pixel reconstruction with sign prior reconstruction as the pretraining objective
  2. Introducing feature regularization (variance + covariance losses) to improve representation quality
  3. Adding an adversarial style-invariant loss to filter out background and appearance distractors
  4. Proposing a class probability distribution loss for downstream tasks, which injects retrieval information into supervised recognition

Key Design 1: Sign Priors

A sign language pose estimation model is used to extract six categories of priors as pretraining reconstruction targets:

Keypoint priors:
  • Hand keypoints \(\mathcal{P}^{\{h,k\}} \in \mathbb{R}^{21 \times 3}\): 21 3D keypoints normalized with the wrist as origin, capturing handshape and orientation
  • Whole-body keypoints \(\mathcal{P}^{\{b,k\}} \in \mathbb{R}^{61 \times 3}\): 61 3D keypoints (42 for both hands + 19 for the body), capturing the spatial position of the hands relative to the body

Joint angle priors:
  • Hand joint angles \(\mathcal{P}^{\{h,a\}} \in \mathbb{R}^{41 \times 2}\): 41 hand joint angles, each encoded as a (sin, cos) pair
  • Body joint angles \(\mathcal{P}^{\{b,a\}} \in \mathbb{R}^{22 \times 2}\): 22 upper-body joint angles, encoded the same way

Distance priors:
  • Fingertip distances \(\mathcal{P}^{\{h,d\}} \in \mathbb{R}^{5 \times 11 \times 3}\): distance matrix from the fingertips to a set of hand joints
  • Inter-hand interaction distances \(\mathcal{P}^{\{b,d\}} \in \mathbb{R}^{12 \times 22 \times 3}\): distances between the two hands and various body parts

Activity prior \(\mathcal{P}^{\{h,\text{act}\}} \in [0,1]\): indicates whether a hand is in an active state, determined via heuristic rules based on hand position and motion
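
To make these targets concrete, below is a minimal sketch of how such priors could be assembled from per-frame pose estimates. The array shapes follow the definitions above, but the function names, the reference-joint set for fingertip distances, and the activity heuristic are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def hand_keypoint_prior(hand_kpts):
    """Hand keypoint prior: 21 3D joints expressed relative to the wrist (joint 0)."""
    # hand_kpts: (21, 3) per-frame estimate from a pose model (hypothetical input).
    return hand_kpts - hand_kpts[0:1]                                   # (21, 3)

def angle_prior(angles_rad):
    """Joint-angle priors: each angle encoded as a (sin, cos) pair to avoid wrap-around."""
    # angles_rad: (J,) joint angles in radians, e.g. J=41 for a hand, J=22 for the body.
    return np.stack([np.sin(angles_rad), np.cos(angles_rad)], axis=-1)  # (J, 2)

def fingertip_distance_prior(hand_kpts, tip_ids=(4, 8, 12, 16, 20), ref_ids=None):
    """Fingertip distance prior: 3D offsets from each fingertip to a set of reference joints."""
    # ref_ids defaults to all joints here; the 5 x 11 x 3 shape above implies an 11-joint subset.
    ref_ids = list(range(len(hand_kpts))) if ref_ids is None else list(ref_ids)
    tips = hand_kpts[list(tip_ids)]                                     # (5, 3)
    refs = hand_kpts[ref_ids]                                           # (K, 3)
    return tips[:, None, :] - refs[None, :, :]                          # (5, K, 3)

def hand_activity_prior(wrist_track, min_motion=0.02, min_height=0.0):
    """Activity prior: a crude position/motion heuristic (placeholder rule, not the paper's)."""
    # wrist_track: (T, 3) wrist positions over the clip; y-up coordinates assumed.
    moving = np.linalg.norm(np.diff(wrist_track, axis=0), axis=-1).mean() > min_motion
    raised = wrist_track[:, 1].mean() > min_height
    return float(moving or raised)
```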

Key Design 2: Lightweight Sign Decoder

Unlike the pixel reconstruction decoder in standard MAE, SignRep employs a lightweight sign decoder:
  1. Average-pool the encoder output tokens to obtain \(z^{\text{avg}} \in \mathbb{R}^{1 \times D}\)
  2. Apply a 1D convolution + GELU + transposed convolution to upsample to \(T\) frames
  3. Attach a separate fully connected prediction head for each prior category

Key point: The decoder is discarded entirely for downstream tasks. Inference requires only the encoder and no keypoint extraction.
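
A minimal PyTorch sketch of such a decoder is given below; the layer widths, kernel sizes, and the set of prediction heads are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SignDecoder(nn.Module):
    """Lightweight decoder sketch: pooled encoder tokens -> T frames of prior predictions."""

    def __init__(self, dim=768, frames=16, prior_dims=None):
        super().__init__()
        # Output size per prior category (illustrative: flattened keypoint priors only).
        prior_dims = prior_dims or {"hand_kpt": 21 * 3, "body_kpt": 61 * 3}
        self.temporal_up = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),      # 1D conv on the pooled feature
            nn.GELU(),
            nn.ConvTranspose1d(dim, dim, kernel_size=frames),   # upsample 1 step -> T steps
        )
        # A separate fully connected prediction head for each prior category.
        self.heads = nn.ModuleDict({k: nn.Linear(dim, d) for k, d in prior_dims.items()})

    def forward(self, tokens):
        # tokens: (B, N, D) encoder outputs; average-pool to z_avg of shape (B, 1, D).
        z_avg = tokens.mean(dim=1, keepdim=True)
        z = self.temporal_up(z_avg.transpose(1, 2)).transpose(1, 2)   # (B, T, D)
        return {name: head(z) for name, head in self.heads.items()}  # per-frame predictions
```

Keeping the decoder deliberately small pushes the representational burden onto the encoder, which is the only component retained downstream.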

Key Design 3: Representation Regularization

Variance loss: Encourages representations to spread across the feature space, preventing collapse; \(\sigma_j\) denotes the standard deviation of feature dimension \(j\) over the batch:

\[\mathcal{L}_{\text{var}} = \sum_{j=1}^{D} \max(0, 1 - \sigma_j)\]

Covariance loss: Reduces redundant correlations across feature dimensions; \(\mathcal{C}\) denotes the covariance matrix of the batch representations:

\[\mathcal{L}_{\text{cov}} = \sum_{j \neq k} \mathcal{C}_{j,k}^2\]
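
A short sketch of both regularizers on a batch of encoder representations \(z \in \mathbb{R}^{B \times D}\) follows; the sums mirror the equations above (VICReg-style formulations often average over dimensions instead, a minor normalization choice).

```python
import torch

def variance_covariance_losses(z, eps=1e-4):
    """Variance and covariance regularizers for a batch of representations z: (B, D)."""
    z = z - z.mean(dim=0)                              # center each feature dimension
    std = torch.sqrt(z.var(dim=0) + eps)               # per-dimension standard deviation sigma_j
    var_loss = torch.clamp(1.0 - std, min=0.0).sum()   # sum_j max(0, 1 - sigma_j)

    b, _ = z.shape
    cov = (z.T @ z) / (b - 1)                          # (D, D) covariance matrix C
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum()                   # sum_{j != k} C_{j,k}^2
    return var_loss, cov_loss
```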

Key Design 4: Adversarial Style-Invariant Learning

The objective is to guide the encoder to learn sign language semantic features while filtering out background and appearance information.

Approach: Two clips \(A_1, A_2\) are cropped from the same video (sharing style), and a clip \(B\) is sampled from a different video (different style). Gram matrix style representations and content representations are extracted, and a discriminator is trained to distinguish matched from unmatched pairs. An adversarial loss then forces the encoder to produce style-invariant representations.
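
A simplified sketch of this setup is given below: a Gram-matrix style descriptor, a small pair discriminator, and the two opposing losses. The discriminator architecture, the exact pairing, and the use of binary cross-entropy are assumptions for illustration, not necessarily the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gram_style(feats):
    """Gram-matrix style descriptor from intermediate features: (B, N, D) -> (B, D*D)."""
    gram = torch.einsum("bnd,bne->bde", feats, feats) / feats.shape[1]
    return gram.flatten(1)

class PairDiscriminator(nn.Module):
    """Scores whether a (style, content) pair comes from the same video (illustrative sizes)."""
    def __init__(self, style_dim, content_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(style_dim + content_dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, style, content):
        return self.net(torch.cat([style, content], dim=-1)).squeeze(-1)

def style_adversarial_losses(disc, style_a1, style_b, content_a2):
    """Discriminator learns matched vs. unmatched pairs; the encoder is updated with adv_loss
    (computed on non-detached features) so its content representation stops leaking style cues."""
    matched = disc(style_a1.detach(), content_a2.detach())    # A1 and A2 come from the same video
    unmatched = disc(style_b.detach(), content_a2.detach())   # B comes from a different video
    d_loss = (F.binary_cross_entropy_with_logits(matched, torch.ones_like(matched))
              + F.binary_cross_entropy_with_logits(unmatched, torch.zeros_like(unmatched)))
    adv_logits = disc(style_b, content_a2)                    # encoder tries to fool the discriminator
    adv_loss = F.binary_cross_entropy_with_logits(adv_logits, torch.ones_like(adv_logits))
    return d_loss, adv_loss
```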

Loss & Training

Overall pretraining loss:

\[\mathcal{L}_{\text{final}} = \mathcal{L}_{\text{recon}} + w_{\text{var}} \mathcal{L}_{\text{var}} + w_{\text{cov}} \mathcal{L}_{\text{cov}} + w_{\text{adv}} \mathcal{L}_{\text{adv}}\]

Dictionary Retrieval and Class Probability Distribution

Retrieval: Clip representations are extracted via a sliding window; a hand-activity-weighted average serves as the final representation, with cosine similarity used for matching.
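
A minimal sketch of this retrieval step, assuming a hypothetical `encoder` callable that maps a clip to a feature vector and precomputed per-window hand-activity scores:

```python
import torch
import torch.nn.functional as F

def video_embedding(encoder, windows, activity):
    """Activity-weighted average of per-window clip features from a sliding window."""
    # windows: list of clips over one video; activity: (K,) predicted hand activity per window.
    feats = torch.stack([encoder(clip) for clip in windows])          # (K, D)
    weights = activity / activity.sum().clamp(min=1e-6)               # normalize the weights
    return (weights.unsqueeze(-1) * feats).sum(dim=0)                 # (D,) video-level embedding

def retrieve(query, dictionary):
    """Rank dictionary entries (N, D) by cosine similarity to the query embedding (D,)."""
    sims = F.cosine_similarity(query.unsqueeze(0), dictionary, dim=-1)
    return sims.argsort(descending=True)                              # best matches first
```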

Class probability distribution loss: A class-level probability distribution \(\phi \in \mathbb{R}^{C \times C}\) is constructed from the retrieval similarity matrix and used as the target of a KL-divergence regularizer that assists supervised recognition in downstream tasks.
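
A rough sketch of how such a regularizer could be wired up is shown below; the softmax-with-temperature construction of \(\phi\) and the per-sample soft-target lookup are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def class_probability_matrix(class_sims, temperature=0.1):
    """Turn a C x C class-to-class retrieval similarity matrix into row-wise distributions phi."""
    return F.softmax(class_sims / temperature, dim=-1)        # (C, C), each row sums to 1

def class_distribution_loss(logits, labels, phi):
    """KL regularizer: keep the model's predicted distribution close to phi[true class]."""
    log_probs = F.log_softmax(logits, dim=-1)                 # (B, C)
    soft_targets = phi[labels]                                # (B, C) soft labels per sample
    return F.kl_div(log_probs, soft_targets, reduction="batchmean")
```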

Key Experimental Results

Main Results: Sign Language Recognition (WLASL2000)

| Method | Type | Top-1 (per-instance) | Top-5 (per-instance) |
| --- | --- | --- | --- |
| ST-GCN | Skeleton | 34.40 | 66.57 |
| BEST | Skeleton | 46.25 | 79.33 |
| NLA-SLR (3-crop) | Multi-modal | 61.26 | 91.77 |
| StepNet (R+F) | Multi-modal | 61.17 | 91.94 |
| StepNet | RGB | 56.89 | 88.64 |
| SignRep | RGB | 61.05 | 90.27 |

For the first time, a single-modality RGB method matches multi-modal ensemble methods, outperforming the strongest single-modality baseline (StepNet, RGB) by 4.16 Top-1 points.

NMFs-CSL Chinese Sign Language Recognition

| Method | Top-1 | Top-5 |
| --- | --- | --- |
| StepNet (RGB) | 77.2 | 92.5 |
| NLA-SLR (multi-modal, 3-crop) | 83.7 | 98.5 |
| SignRep | 84.1 | 98.8 |

SignRep surpasses all methods, including multi-modal approaches, and outperforms the strongest single-modality baseline by roughly 7 Top-1 points.

ASL-Citizen Recognition

| Method | DCG | MRR | Rec@1 | Rec@5 |
| --- | --- | --- | --- | --- |
| I3D | 79.13 | 73.32 | 63.10 | 86.09 |
| SignRep | 90.84 | 88.05 | 81.37 | 96.11 |

Rec@1 exceeds the I3D baseline by about 18 points.

Dictionary Retrieval (No Downstream Training)

| Feature | WLASL DCG | WLASL Rec@1 | NMFs-CSL Rec@1 |
| --- | --- | --- | --- |
| HieraMAE-Kinetics | 13.21 | 2.08 | 3.96 |
| HieraMAE-YTSL | 14.06 | 2.57 | 7.57 |
| Hand joint angles | 30.61 | 9.42 | 18.13 |
| SignRep (weighted) | 57.93 | 29.92 | 63.04 |

Retrieval Rec@1 exceeds the raw hand joint-angle features by more than 3× on both WLASL and NMFs-CSL, demonstrating the strong generalization capability of the pretrained representations.

Ablation Study

| Configuration | WLASL Retrieval DCG |
| --- | --- |
| Angle priors only + masking | 45.1 |
| Angle + keypoint + distance + masking | 48.5 |
| All priors + masking + var/cov | 49.9 |
| All priors + masking + var/cov + adversarial | 50.7 |

Removing masking leads to a significant drop in retrieval performance (48.5 → 46.3), confirming that masked learning is critical for robust representations.

Sign Language Translation (As Frozen Feature Extractor)

| Backbone | Phoenix14T BLEU-4 | CSL-Daily BLEU-4 |
| --- | --- | --- |
| DinoV2 (LoRA) | 19.42 | 12.96 |
| SignRep (frozen) | 20.38 | 16.33 |

BLEU-4 improves by 3.37 on CSL-Daily, with SignRep entirely frozen, eliminating the computational overhead of LoRA fine-tuning.

Highlights & Insights

  1. The "use skeleton during pretraining, discard at inference" strategy is highly elegant: Incorporating skeleton information as a self-supervised learning target rather than an input allows the model to acquire sign language domain knowledge without depending on keypoint extractors at inference time.
  2. The six-category prior design is comprehensive: Keypoints (spatial structure) + joint angles (finger flexion) + distance matrices (inter-hand interaction) + activity detection cover all major dimensions of sign language expression.
  3. Adversarial style-invariant learning is well-motivated: Background and signer appearance are major sources of interference in sign language understanding; explicitly filtering them via adversarial training substantially improves generalization.
  4. The class probability distribution is an elegant retrieval-to-recognition transfer mechanism: Class-level relationships are extracted from unsupervised retrieval similarity and used as soft labels to assist supervised recognition.
  5. The experimental coverage is broad and convincing: Recognition (3 datasets) + retrieval (3 datasets) + translation (3 datasets), all handled by a single frozen model, provides strong empirical support.

Limitations & Future Work

  1. Pretraining relies on a sign-language-specific pose estimation model to extract priors; low-quality keypoints may introduce noise.
  2. Pretraining on YouTube-SL-25 primarily covers Western sign languages; generalization to other sign language communities remains unexplored.
  3. The work focuses exclusively on isolated sign words; modeling continuous sign language at the sentence level is limited.
  4. The heuristic rules used for activity priors are relatively simple and may not handle all edge cases.
  5. The performance gap with NLA-SLR on WLASL2000 is narrow; validation across additional datasets is needed to confirm the advantage.

Related Work

  • Supervised sign language recognition: I3D, ST-GCN, StepNet, NLA-SLR, etc.
  • Self-supervised skeleton-based learning: SignBERT, SignBERT+, BEST, Skeletor
  • Sign language translation pretraining: Sign2GPT, SignHiera, GFSLT-VLP
  • Video MAE: VideoMAE, Hiera MAE
  • Self-supervised representation regularization: VICReg, Barlow Twins

Rating

  • Novelty: ★★★★☆ — The idea of incorporating skeleton priors into MAE pretraining is novel; the systematic design of six prior categories and adversarial style-invariant learning demonstrate creative thinking.
  • Technical Depth: ★★★★☆ — Prior design is meticulous; the class probability distribution transfer from retrieval to recognition exhibits theoretical elegance.
  • Experimental Quality: ★★★★★ — Three major tasks, nine dataset configurations, thorough ablation studies; a single-modality model surpassing multi-modal methods is highly convincing.
  • Practicality: ★★★★★ — Single RGB model, no skeleton extraction required at inference, usable as a frozen feature extractor — highly practical.
  • Writing Clarity: ★★★★☆ — Overall clear, though the prior definition section involves heavy notation and presents a moderate barrier on first reading.