Sign Language Recognition in the Age of LLMs¶
Conference: CVPR 2026
arXiv: 2604.11225
Code: https://github.com/VaJavorek/WLASL_LLM
Area: LLM / NLP (Other)
Keywords: sign language recognition, vision-language model, zero-shot, American Sign Language, benchmark
TL;DR¶
The first systematic evaluation of modern VLMs on zero-shot isolated sign language recognition (ISLR), revealing that open-source VLMs fall far behind specialized classifiers while large commercial models (GPT-5) demonstrate surprising potential.
Background & Motivation¶
Background: Sign language recognition has traditionally relied on task-specific supervised learning, requiring large amounts of annotated data and specialized architectures. Meanwhile, VLMs have demonstrated powerful multimodal reasoning capabilities, but their application to sign language remains largely unexplored.
Limitations of Prior Work: (1) Supervised methods are limited by annotated data availability and by generalization across signers and environments; (2) VLMs are primarily evaluated on natural images and videos, and the fine-grained gestural motion of sign language is not covered; (3) No systematic benchmark exists for VLM zero-shot sign language recognition.
Key Challenge: VLMs are highly general but not specifically trained on sign language data. The high-dimensional spatiotemporal complexity and subtle linguistic structure of sign language may exceed VLMs' zero-shot capabilities.
Core Idea: Return to the controlled ISLR setting to systematically evaluate multiple VLMs' zero-shot sign language recognition capabilities, analyzing the effects of prompting strategies and model scale.
Method¶
Overall Architecture¶
Evaluate multiple open-source and commercial VLMs on the WLASL300 benchmark (300 sign language vocabulary items) → Three evaluation paradigms: (1) standard multi-class classification, (2) zero-shot open-set prediction, (3) zero-shot binary classification (judging whether the sign in a video matches a specified word) → Analyze effects of prompting strategies, frame sampling, and model scale.
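As an illustration of this pipeline, here is a minimal sketch in Python. The frame sampling uses OpenCV's real API; `query_vlm` and the exact prompt wording are hypothetical placeholders standing in for whichever model-specific client (open-source checkpoint or commercial API) is being evaluated, since the paper's actual prompts and interfaces are not reproduced here.

```python
import cv2  # OpenCV for video decoding


def sample_frames(video_path: str, num_frames: int = 16):
    """Uniformly sample `num_frames` RGB frames from one sign clip."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


def build_prompt(level: str, candidates=None, target_word=None) -> str:
    """Prompt templates from least to most constrained (wording is illustrative)."""
    if level == "open":        # fully open-set prediction
        return "These frames show one American Sign Language sign. Which English word is being signed?"
    if level == "dataset":     # constrain the output space by naming the benchmark
        return "These frames show one isolated sign from the WLASL300 benchmark. Which English word is being signed?"
    if level == "candidates":  # constrain to a candidate vocabulary list
        return ("These frames show one ASL sign. Answer with exactly one word from this list: "
                + ", ".join(candidates))
    if level == "binary":      # binary matching: does the video show this word?
        return f"Does this video show the ASL sign for '{target_word}'? Answer yes or no."
    raise ValueError(level)


# `query_vlm` is a placeholder for the model-specific call
# (e.g. a local Qwen/LLaVA pipeline or a commercial API client):
# prediction = query_vlm(frames=sample_frames("sign.mp4", 64), prompt=build_prompt("open"))
```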
Key Designs¶
- Systematic Multi-Model Evaluation:
  - Function: Establish baselines for VLM zero-shot ISLR
  - Mechanism: Evaluate LLaVA-NeXT-Video, InternVL3.5, Qwen2.5/3-VL, BAGEL, GPT-5, Gemini, and other models with unified prompt templates and frame sampling strategies
  - Design Motivation: Provide the community with a clear reference for "what VLMs can achieve" in sign language
- Multi-Level Prompting Strategy:
  - Function: Explore the impact of information quantity on zero-shot performance
  - Mechanism: From fully open → specifying the dataset → providing a candidate vocabulary list, progressively constraining the output space; additionally tests binary classification (given a word, judge whether the video matches it) and synonym-tolerant evaluation
  - Design Motivation: A VLM's output space is far larger than a classifier's fixed category set; constraining the output space may significantly affect performance
- Synonym-Aware Evaluation:
  - Function: More fairly evaluate VLMs' semantic understanding
  - Mechanism: Obtain synonym lists for each ground-truth word from WordNet; predicted synonyms are also counted as correct (see the sketch after this list)
  - Design Motivation: VLMs may output semantically correct but differently worded predictions (e.g., "happy" vs "glad")
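The synonym-tolerant check could look roughly like the following, assuming WordNet is accessed via NLTK (the paper names WordNet but not a specific library, and the normalization details here are illustrative):

```python
from nltk.corpus import wordnet as wn  # requires a one-time nltk.download("wordnet")


def wordnet_synonyms(word: str) -> set[str]:
    """Collect lemma names from all WordNet synsets of `word`."""
    synonyms = {word.lower()}
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace("_", " ").lower())
    return synonyms


def is_correct(prediction: str, gold: str, use_synonyms: bool = True) -> bool:
    """Exact-match scoring, optionally relaxed to WordNet synonyms of the gold gloss."""
    pred = prediction.strip().lower()
    if pred == gold.lower():
        return True
    return use_synonyms and pred in wordnet_synonyms(gold)


# Top-1 accuracy over a list of (prediction, gold) pairs:
# acc = sum(is_correct(p, g) for p, g in results) / len(results)
```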
Loss & Training¶
Pure zero-shot evaluation with no training.
Key Experimental Results¶
Main Results¶
| Model | Top-1 | Top-1+Synonyms | Note |
|---|---|---|---|
| Specialized SOTA (DSLNet) | 89.97% | - | Supervised training |
| GPT-5 (64 frames) | 14.67% | 17.96% | Best commercial model |
| Qwen3-VL-30B | 2.40% | 3.59% | Best open-source model |
| LLaVA-NeXT-7B | 0.30% | 0.45% | Worst open-source model |
Ablation Study¶
| Prompting Strategy | GPT-5 Accuracy | Note |
|---|---|---|
| Open-set | 14.67% | Unconstrained |
| Specify dataset name | Slight improvement | Constrains output space |
| Provide candidate list | Significant improvement | Strongest constraint |
| Binary classification | ~30% precision | Partial visual-semantic alignment exists |
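For clarity on the binary-classification row, a small sketch of how the precision figure would be computed, assuming each video is paired with both its true gloss and distractor glosses (the exact pairing scheme is an assumption, not taken from the paper):

```python
def binary_precision(results: list[tuple[bool, bool]]) -> float:
    """results: list of (model_answered_yes, pair_is_truly_matching).

    Precision = fraction of the model's "yes" answers that fall on truly
    matching (video, word) pairs.
    """
    yes_answers = [is_match for said_yes, is_match in results if said_yes]
    return sum(yes_answers) / len(yes_answers) if yes_answers else 0.0
```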
Key Findings¶
- Open-source VLMs almost completely fail at zero-shot ISLR (< 3% Top-1), far below specialized classifiers
- GPT-5 significantly outperforms open-source models, indicating model scale and training data diversity are critical
- Binary classification experiments show VLMs do capture partial sign language-text semantic alignment
- Some models (e.g., Nemotron) honestly answer "I don't know," which lowers measured accuracy but more faithfully reflects their actual capability
Highlights & Insights¶
- Honest negative results: The paper does not shy away from VLMs' severe shortcomings on sign language, providing the community with a realistic baseline
- Scale effect is prominent: The massive gap between GPT-5 and open-source models suggests sign language may require more visual-motor pre-training data
Limitations & Future Work¶
- Only tested on WLASL300; larger vocabularies and continuous sign language are not covered
- Commercial API latency limits large-scale evaluation feasibility
- Future work could explore few-shot fine-tuning of VLMs or specialized sign language visual encoders
Related Work & Insights¶
- vs Traditional ISLR: ST-GCN and similar specialized methods perform extremely well under supervision, indicating task-specific training remains irreplaceable
- vs Elysium/ChatTracker: These MLLM trackers also require fine-tuning; pure zero-shot approaches are insufficient
Rating¶
- Novelty: ⭐⭐⭐ An evaluation study that is not a new method per se but fills a significant gap
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multi-model, multi-prompt, multi-evaluation paradigm — highly comprehensive
- Writing Quality: ⭐⭐⭐⭐ Clear structure with deep analysis
- Value: ⭐⭐⭐⭐ Provides an important VLM baseline for sign language AI research