OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis

Conference: CVPR 2026
arXiv: 2602.22949
Code: JunukCha/OpenFS
Area: Human Body Understanding
Keywords: Fingerspelling Recognition, Sign Language Understanding, Implicit Signing-Hand Detection, Monotonic Alignment Loss, Diffusion-based Generation, OOV Generalization

TL;DR

This paper proposes OpenFS, a framework that achieves multi-hand fingerspelling recognition with implicit signing-hand detection via dual-level positional encoding, a signing-hand focusing loss, and a monotonic alignment loss. A frame-wise letter-conditioned diffusion generator is further designed to synthesize OOV training data. OpenFS achieves state-of-the-art performance on three benchmarks (ChicagoFSWild / ChicagoFSWildPlus / FSNeo) with inference speed over 100× faster than PoseNet.

Background & Motivation

Fingerspelling as an essential complement to sign language: Sign languages cannot easily create unique signs for every proper noun or neologism; fingerspelling (letter-by-letter spelling) is therefore indispensable for expressing technical terms, names, and new words, making its accurate recognition a critical bridge between the Deaf and hearing communities.

Signing-hand ambiguity: Existing methods rely on optical flow or hand motion magnitude for explicit signing-hand detection, but the non-signing hand sometimes exhibits larger motion, leading to detection errors, recognition failures, and training instability.

Peaky behavior of CTC loss: Methods that adopt CTC loss tend to make sparse predictions concentrated on a small number of frames (peaky behavior), providing insufficient supervision to the encoder and hindering the learning of discriminative hand pose representations.

Underestimated OOV problem: New words and neologisms emerge continuously, making generalization to unseen vocabulary critical; yet manually collecting fingerspelling data for new words is costly and requires expert signers.

Existing generative methods are unsuitable for fingerspelling: Text-to-motion models based on CLIP capture word-level semantics as a global condition and cannot model the fine-grained finger-joint movements and inter-letter transitions required by fingerspelling.

Lack of OOV evaluation benchmarks: No standardized benchmark previously existed for evaluating OOV fingerspelling recognition, making systematic assessment of model generalization impossible.

Method

Overall Architecture

OpenFS consists of three core components:

Multi-hand fingerspelling recognizer: A Transformer encoder–decoder architecture. The encoder receives normalized 2D single- or multi-hand pose sequences extracted by MediaPipe, embeds them via an MLP, adds dual-level positional encodings, and feeds them into Transformer encoder layers. The decoder receives letter sequences (with <start> and <end> tokens) and predicts the next letter via cross-attention.
Frame-wise letter-conditioned generator: A Transformer encoder combined with a diffusion mechanism. Noisy hand poses and frame-wise letter embeddings are concatenated frame-by-frame and fed into the encoder for iterative denoising to generate realistic fingerspelling pose sequences.
FSNeo benchmark: The generator is used to synthesize 1,635 unique words × 5 sequences = 8,175 samples for neologisms categorized by NEO-BENCH (lexical, morphological, and semantic neologisms).
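The frame-by-frame concatenation that feeds the generator can be sketched as follows. This is only an illustration of the conditioning idea, not the authors' code: the dimensions, the random embedding table, the 27-symbol label set, and the variable names are all assumptions (the 42-dimensional pose assumes a single hand with MediaPipe's 21 keypoints in 2D).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 5 frames, 21 keypoints x 2 coordinates, 16-dim letter embedding.
T, pose_dim, emb_dim = 5, 42, 16
vocab = 27  # assumption: 26 letters plus one blank id (index 26)

letter_table = rng.normal(size=(vocab, emb_dim))   # stand-in for a learned embedding table
frame_letters = np.array([0, 0, 26, 1, 1])         # one letter id per frame ("a a blank b b")
noisy_pose = rng.normal(size=(T, pose_dim))        # noisy pose sequence at some diffusion step

# Each frame sees its own letter, rather than one global word-level condition.
letter_cond = letter_table[frame_letters]          # (T, emb_dim)
denoiser_input = np.concatenate([noisy_pose, letter_cond], axis=1)  # (T, pose_dim + emb_dim)
```

The resulting per-frame vectors would then be passed to the Transformer encoder that performs the denoising; that network is omitted here.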

Key Designs

1. Dual-Level Positional Encoding

Hand identity encoding \(\tau\): All tokens belonging to the same hand share the same encoding, distinguishing different hands (left/right and different individuals).
Temporal positional encoding \(\eta\): Different hands in the same frame share the same temporal encoding, while different frames use different values, preserving temporal alignment and ordering.
Both encodings follow sinusoidal formulas and are added to the pose token embeddings before being fed into the encoder.
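A minimal sketch of the dual-level encoding, assuming one pose token per hand per frame and the standard Transformer sinusoidal formula. The token layout, the function names, and the two bases are assumptions; in particular, a different base is used for the hand encoding only so that "hand 1, frame 0" and "hand 0, frame 1" do not collapse to the same sum in this toy version.

```python
import numpy as np

def sinusoidal_encoding(positions, dim, base=10000.0):
    """Sinusoidal encoding: sin on even channels, cos on odd channels (dim must be even)."""
    positions = np.asarray(positions, dtype=np.float64)[:, None]   # (N, 1)
    freqs = np.exp(-np.log(base) * np.arange(0, dim, 2) / dim)     # (dim / 2,)
    enc = np.zeros((positions.shape[0], dim))
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc

def add_dual_level_pe(tokens, hand_ids, frame_ids):
    """tokens: (N, dim) pose embeddings; hand_ids, frame_ids: (N,) integer indices."""
    dim = tokens.shape[1]
    tau = sinusoidal_encoding(hand_ids, dim, base=100.0)   # hand identity encoding (assumed base)
    eta = sinusoidal_encoding(frame_ids, dim)              # temporal positional encoding
    return tokens + tau + eta

# Two hands over three frames -> six tokens.
hand_ids = np.array([0, 1, 0, 1, 0, 1])
frame_ids = np.array([0, 0, 1, 1, 2, 2])
tokens = np.zeros((6, 8))   # zero embeddings so the output shows the encodings alone
encoded = add_dual_level_pe(tokens, hand_ids, frame_ids)
```

Tokens of the same hand share `tau`, and tokens of the same frame share `eta`, which is the property the two bullet points above describe.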

2. Signing-Hand Focusing Loss (SF Loss) \(\mathcal{L}_{SF}\)

Average attention maps are extracted from each decoder cross-attention layer and aggregated into a hand-level attention distribution by hand identity.
Minimizing the entropy of this distribution encourages the decoder to focus on the dominant signing hand, achieving implicit signing-hand detection.
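The entropy term can be sketched as below. This is a simplified reading rather than the paper's exact formulation: it assumes a single attention map already averaged over heads, averages over letters before pooling by hand, and uses a hypothetical function name and `eps` value.

```python
import numpy as np

def signing_hand_focusing_loss(attn, hand_ids, num_hands, eps=1e-8):
    """attn: (L, N) cross-attention from L letters to N pose tokens (rows sum to 1).
    hand_ids: (N,) index of the hand each token belongs to."""
    token_mass = attn.mean(axis=0)                 # (N,) attention averaged over letters
    hand_mass = np.zeros(num_hands)
    np.add.at(hand_mass, hand_ids, token_mass)     # pool attention by hand identity
    p = hand_mass / (hand_mass.sum() + eps)        # hand-level attention distribution
    return float(-(p * np.log(p + eps)).sum())     # entropy: low when one hand dominates

hand_ids = np.array([0, 1, 0, 1])
spread = np.full((2, 4), 0.25)                     # attention split evenly across both hands
focused = np.array([[0.5, 0.0, 0.5, 0.0],
                    [0.5, 0.0, 0.5, 0.0]])         # all attention on hand 0

loss_spread = signing_hand_focusing_loss(spread, hand_ids, num_hands=2)
loss_focused = signing_hand_focusing_loss(focused, hand_ids, num_hands=2)
```

With two hands the entropy is largest (about ln 2) when attention is split evenly and close to zero when it sits on one hand, so gradient descent on this term pushes the decoder toward a single hand without any explicit detector.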

3. Monotonic Alignment Loss (MA Loss) \(\mathcal{L}_{MA}\)

A cumulative cross-attention map is constructed; finite differences are computed along the letter dimension, where positive values indicate that a subsequent letter attends more strongly to earlier frames than the preceding letter (violating temporal order).
Penalizing these positive deviations enforces monotonically increasing temporal attention, serving as a replacement for CTC loss.
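The construction above can be sketched as follows, assuming a single letters-by-frames attention map whose rows sum to 1. The function name and the choice to average the penalties are assumptions; the paper's normalization may differ.

```python
import numpy as np

def monotonic_alignment_loss(attn):
    """attn: (L, T) cross-attention, one row per letter, rows sum to 1 over T frames."""
    if attn.shape[0] < 2:
        return 0.0                               # a single letter cannot violate ordering
    cum = np.cumsum(attn, axis=1)                # cumulative attention over frames
    diff = cum[1:] - cum[:-1]                    # finite difference along the letter axis
    violations = np.maximum(diff, 0.0)           # positive: later letter leans on earlier frames
    return float(violations.mean())

in_order = np.eye(3)                             # letter i attends to frame i
reversed_order = np.eye(3)[::-1]                 # letter i attends to frame 2 - i

loss_in_order = monotonic_alignment_loss(in_order)        # 0.0
loss_reversed = monotonic_alignment_loss(reversed_order)  # 2 violations over 6 entries
```

For the in-order map every later letter accumulates attention no faster than the one before it, so nothing is penalized; for the reversed map two of the six difference entries are positive.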

4. Coarse-to-Fine Frame-Wise Letter Annotation

Coarse annotation: The cross-attention matrix of the trained recognizer is used; frames whose attention weights exceed a threshold are assigned to the corresponding letter, and conflicting frames are labeled as blank \(\phi\).
Fine annotation: The recognizer is frozen and a frame-wise annotation refiner (taking encoder features as input and predicting per-frame letters) is trained, with the blank class weight set to 0.1 to suppress its dominance.
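The coarse stage can be illustrated with a small thresholding sketch. It assumes attention has already been reduced to a letters-by-frames matrix; the threshold value, the `BLANK` index, and the decision to label frames that no letter claims as blank are assumptions, not details from the paper.

```python
import numpy as np

BLANK = -1  # hypothetical index standing in for the blank label

def coarse_frame_labels(attn, threshold=0.2):
    """attn: (L, T) cross-attention from letters to frames.
    Returns (T,) labels: the letter position claiming the frame, or BLANK."""
    above = attn > threshold                     # (L, T) letters whose weight exceeds the threshold
    claims = above.sum(axis=0)                   # number of letters claiming each frame
    labels = np.full(attn.shape[1], BLANK)
    single = claims == 1                         # frames claimed by exactly one letter
    labels[single] = above[:, single].argmax(axis=0)
    return labels                                # conflicting or unclaimed frames stay BLANK

attn = np.array([[0.6, 0.3, 0.1, 0.0],
                 [0.0, 0.3, 0.1, 0.6]])
labels = coarse_frame_labels(attn)               # frame 1 is a conflict, frame 2 is unclaimed
```

The labels here are letter positions within the word; mapping them back to characters is a lookup into the target string. The fine stage then trains a separate per-frame classifier on such labels, which is not shown.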

Loss & Training

\[\mathcal{L} = \mathcal{L}_{CE} + \lambda_{SF}\mathcal{L}_{SF} + \lambda_{MA}\mathcal{L}_{MA}\]

where \(\mathcal{L}_{CE}\) is the cross-entropy loss on next-letter prediction, \(\lambda_{SF} = 0.8\), and \(\lambda_{MA} = 1.0\). The generator is trained with an MSE loss.

Key Experimental Results

Main Results

Letter accuracy comparison on ChicagoFSWild (CFSW), ChicagoFSWildPlus (CFSWP), and FSNeo:

Method	CFSW	CFSWP	FSNeo
Shi et al. (2018)	57.5	58.3	-
Shi et al. (2019)	61.2	62.3	-
FSS-Net	52.5	64.4	-
PoseNet	61.6	61.0	61.2
Ours	75.4	70.5	80.5
PoseNet†	69.2	69.4	94.9
Ours†	77.7	74.6	97.6

† denotes use of additional synthetic training data.
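The gain from synthetic data can be read directly off the table; the short script below just subtracts the rows. The dictionary layout and names are illustrative, and the numbers are copied from the table above.

```python
benchmarks = ("CFSW", "CFSWP", "FSNeo")
letter_acc = {
    "PoseNet": {"real": (61.6, 61.0, 61.2), "real+synthetic": (69.2, 69.4, 94.9)},
    "Ours":    {"real": (75.4, 70.5, 80.5), "real+synthetic": (77.7, 74.6, 97.6)},
}

# Absolute letter-accuracy gain from adding synthetic training data.
gains = {
    method: {
        b: round(syn - real, 1)
        for b, real, syn in zip(benchmarks, rows["real"], rows["real+synthetic"])
    }
    for method, rows in letter_acc.items()
}
# PoseNet: CFSW +7.6, CFSWP +8.4, FSNeo +33.7
# Ours:    CFSW +2.3, CFSWP +4.1, FSNeo +17.1
```

Both methods gain most on FSNeo, the OOV benchmark, which is where the synthetic data targets unseen words.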

Inference speed comparison (868 CFSW samples, A40 GPU):

Method	Batch Size	Latency (s) ↓	Throughput (samples/s) ↑	Letters/s ↑	FPS ↑
PoseNet	1	4,282	0.2	1	6
Ours	1	39	22.0	106	962
Ours	32	6	149.8	725	6,356
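As a sanity check on the table, the speedup can be recomputed from the reported latencies. The sketch assumes that throughput means samples per second over the 868 test samples; the recomputed values differ slightly from the table because the latencies there are rounded.

```python
num_samples = 868
latency_s = {"PoseNet (bs=1)": 4282, "Ours (bs=1)": 39, "Ours (bs=32)": 6}
fps = {"PoseNet (bs=1)": 6, "Ours (bs=1)": 962, "Ours (bs=32)": 6356}

# Samples per second, recomputed from total latency.
throughput = {name: num_samples / t for name, t in latency_s.items()}

# Speedup of OpenFS over PoseNet at batch size 1.
latency_speedup = latency_s["PoseNet (bs=1)"] / latency_s["Ours (bs=1)"]   # about 109.8x
fps_speedup = fps["Ours (bs=1)"] / fps["PoseNet (bs=1)"]                   # about 160x
```

Both ratios are above 100, consistent with the "over 100× faster" claim at batch size 1.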

Ablation Study

Ablation of positional encoding and auxiliary losses (letter accuracy on CFSW):

Configuration	Acc.
Standard PE + no auxiliary loss	73.2
Standard PE + auxiliary loss	73.1
Dual-level PE + no auxiliary loss	74.8
Dual-level PE + auxiliary loss (full model)	75.4

Comparison of generator conditioning strategies (letter accuracy on generated sequences):

Conditioning Strategy	PoseNet Recognition	Ours Recognition
WC (word-level, CLIP)	19.9	23.3
LC (letter-level)	26.4	40.2
FWLC (frame-wise letter, Ours)	63.5	82.3

Key Findings

The auxiliary losses (SF + MA) help only when combined with dual-level positional encoding: with standard PE, accuracy is essentially unchanged (73.2→73.1), whereas with dual-level PE it rises from 74.8 to 75.4.
Implicit signing-hand detection achieves 99.9% accuracy with only one failure (which is also ambiguous to humans), far exceeding PoseNet's 90.4%.
Synthetic data improves not only OpenFS but also PoseNet substantially (CFSW +7.6, FSNeo +33.7), validating the generality of the generator.
Frame-wise letter conditioning (FWLC) significantly outperforms word-level (WC) and letter-level (LC) conditioning, as fingerspelling requires precise per-frame letter–pose correspondence.

Highlights & Insights

Implicit signing-hand detection replaces explicit detection; it is naturally realized through cross-attention via the SF loss with 99.9% accuracy, eliminating recognition failures caused by detection errors.
MA loss replaces CTC loss, addressing peaky behavior through cross-attention regularization and yielding semantically richer encoder representations.
End-to-end with no post-processing, achieving 962 FPS at batch size 1 and 6,356 FPS at batch size 32 on an A40 GPU, over 100× faster than PoseNet (6 FPS).
A complete recognition–generation closed loop: the recognizer's cross-attention generates frame-wise labels → trains the generator → synthetic data augments the recognizer, forming a positive feedback cycle.
The paper introduces FSNeo, the first OOV fingerspelling evaluation benchmark, filling a notable gap in the field.

Limitations & Future Work

Validation is limited to American Sign Language (ASL) fingerspelling; applicability to other sign language systems (e.g., two-handed British Sign Language fingerspelling) remains unknown.
The approach relies on MediaPipe for hand pose extraction; failures of the pose estimator propagate to recognition (end-to-end RGB approaches may be more robust in some scenarios).
The diffusion generator requires 50 denoising steps; generation speed is not reported and may be unsuitable for real-time data augmentation.
FSNeo is composed entirely of synthetic data, and a distribution gap with real-world OOV fingerspelling scenarios may exist.
Ablation experiments are conducted only on CFSW, without fully validating component contributions on CFSWP and FSNeo.

Related Work

Fingerspelling recognition: Shi et al. (2018/2019) established the ChicagoFSWild dataset series using CNN+LSTM with visual attention; PoseNet employs a Transformer encoder–decoder with re-ranking using single-hand pose input; FSS-Net focuses on fingerspelling detection for search and retrieval; HandReader is a multimodal framework fusing RGB and pose. This paper advances the field by addressing implicit signing-hand detection and replacing CTC loss.
Fingerspelling/motion generation: Text-to-motion models such as MDM use global CLIP conditioning, which is unsuitable for fingerspelling that requires letter-level fine-grained control; sign language generation research focuses on full-body motion but emphasizes the semantic expressiveness of hand joints. This paper proposes a frame-wise letter-conditioned diffusion generator specifically designed for the per-frame letter–pose correspondence inherent in fingerspelling.

Rating

Novelty: ⭐⭐⭐⭐ (Implicit signing-hand detection + MA loss replacing CTC + frame-wise conditioned generator — three coupled innovations forming a coherent system)
Experimental Thoroughness: ⭐⭐⭐⭐ (Three datasets, speed comparison, detailed ablations, generalization of synthetic data to other methods, and qualitative analysis)
Writing Quality: ⭐⭐⭐⭐⭐ (Clear structure, rich and intuitive figures, complete motivation–method–experiment logical chain)
Value: ⭐⭐⭐⭐ (Open-source code and data, new benchmark, deployment-friendly real-time speed, practical significance for the Deaf community)