skLEP: A Slovak General Language Understanding Benchmark

Conference: ACL 2025 (Findings)
arXiv: 2506.21508
Code: github.com/slovak-nlp/sklep
Area: NLP Understanding / Benchmark Evaluation
Keywords: Slovak, NLU benchmark, language understanding evaluation, multilingual models, low-resource languages

TL;DR

This paper introduces skLEP, the first comprehensive natural language understanding benchmark for Slovak, consisting of 9 multi-level tasks (word-level, sentence-pair-level, and document-level). Systematic evaluation of Slovak-specific, multilingual, and English models reveals that mDeBERTaV3 outperforms all Slovak-specific models in average performance.

Background & Motivation

Background: GLUE and SuperGLUE benchmarks have rapidly advanced English NLU models, prompting various languages to establish equivalent benchmarks, such as Polish KLEJ, Russian SuperGLUE, and Slovene SuperGLUE. However, Slovak, as a medium-resource language (approximately 5 million native speakers), has lacked a standardized NLU evaluation benchmark.

Limitations of Prior Work: Although several Slovak-specific models (e.g., SlovakBERT, FERNET-CC, HPLT BERT) have been released in recent years, there has been no unified evaluation standard for comparing them systematically with each other or with multilingual models. Existing scattered evaluations cover limited tasks (e.g., the SlovakBERT paper evaluated only a few tasks such as POS tagging and semantic similarity), and some datasets suffer from poor translation quality.

Key Challenge: Slovak-specific models continue to emerge, but they lack a "racetrack" for fair comparison. Meanwhile, certain tasks (such as textual entailment and natural language inference) lack native Slovak datasets altogether, requiring construction from scratch or translation from English.

Goal: (1) Build skLEP, a comprehensive NLU benchmark covering 9 tasks; (2) systematically evaluate 14 models fairly; and (3) open-source the evaluation toolkit and public leaderboard.

Key Insight: Integrate existing Slovak datasets and translate English datasets to fill missing task slots. Translation quality is safeguarded via specialized evaluation experiments (comparing 5 translation systems and performing post-editing by native speakers).

Core Idea: Fill the gap in Slovak NLU evaluation by establishing a standardized GLUE-like benchmark, and introduce Relative Error Reduction (RER) as an aggregate metric that compares models across tasks more fairly.

Method

Overall Architecture

skLEP consists of 9 tasks categorized into three levels: word-level tasks (POS tagging, two NER datasets), sentence-pair tasks (textual entailment, natural language inference, semantic similarity), and document-level tasks (hate speech detection, sentiment analysis, QA). For tasks lacking native Slovak versions, DeepL translation (assisted by native speaker post-editing) was employed, while the large-scale NLI dataset was translated using MADLAD-400-3B for cost control. Fourteen pre-trained models were fine-tuned and evaluated across all tasks.

Key Designs

Multi-level Task Coverage and Dataset Construction:
- Function: Provide comprehensive and diverse NLU evaluation.
- Mechanism: The nine tasks are evenly distributed across three granularity levels. Word-level includes UD (POS tagging based on the Slovak Dependency Treebank), UNER (named entity recognition using the Slovak subset of Universal NER, annotated for PER/ORG/LOC), and WikiGoldSK (another NER dataset from Slovak Wikipedia, additionally including MISC). Sentence-pair-level includes RTE (textual entailment, translated from English and reannotated), NLI (three-class classification based on XNLI), and STS (semantic similarity, 0-5 continuous scale). Document-level includes HS (hate speech, binary classification), SA (sentiment analysis, positive/negative binary classification), and QA (question answering, based on SK-QuAD).
- Design Motivation: A single task cannot comprehensively evaluate model capabilities. The three-level schema covers the full spectrum of language understanding, from fine-grained word structures to paragraph-level semantic reasoning, and each level contains three tasks to enhance the statistical reliability of the evaluation.
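
The task layout described above can be written down as a small registry. This is only an illustrative sketch: the `Task` dataclass and the `SKLEP_TASKS` name are hypothetical and are not taken from the skLEP toolkit; the task names, levels, and label sets are the ones listed in the Mechanism bullet.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical record describing one benchmark task."""
    name: str
    level: str        # "word", "sentence-pair", or "document"
    description: str


# Task names and levels as listed in the summary above.
SKLEP_TASKS = [
    Task("UD", "word", "POS tagging"),
    Task("UNER", "word", "NER (PER/ORG/LOC)"),
    Task("WikiGoldSK", "word", "NER (PER/ORG/LOC/MISC)"),
    Task("RTE", "sentence-pair", "textual entailment"),
    Task("NLI", "sentence-pair", "three-class inference"),
    Task("STS", "sentence-pair", "semantic similarity, 0-5 scale"),
    Task("HS", "document", "hate speech, binary"),
    Task("SA", "document", "sentiment, binary"),
    Task("QA", "document", "question answering"),
]

# Count how many tasks fall into each granularity level.
tasks_per_level = Counter(task.level for task in SKLEP_TASKS)
print(dict(tasks_per_level))
```

Grouping by `level` makes the three-by-three structure explicit and gives a natural place to attach per-task metrics later.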

Translation Quality Assurance Pipeline:
- Function: Ensure the reliability of translated datasets.
- Mechanism: Five translation systems (DeepL, GPT-4o, Google Translate, MADLAD-400-3B, and NLLB-3.3B) were compared systematically: 4 native speakers ranked the outputs and rated their fluency and adequacy. DeepL performed best overall (average rank 1.81, fluency 3.70/4, adequacy 3.73/4) and was chosen as the primary translator. For cost reasons, MADLAD-400-3B was used for the large-scale NLI corpus. All translated datasets underwent native speaker post-editing; verification experiments showed that post-editing improved translation quality in 28 out of 30 samples.
- Design Motivation: The greatest risk in translation-based benchmarks is that poor translation quality leads to unreliable evaluation conclusions. The dual protection of multi-system comparison and human post-editing minimizes this risk.

Relative Error Reduction (RER) Aggregate Metric:
- Function: Enable a fairer comparison of model performance across heterogeneous tasks.
- Mechanism: Absolute score ranges vary widely across tasks (e.g., UD F1 is typically >95, whereas QA F1 is around 75), so a direct average of absolute scores treats a point gained near the ceiling the same as a point gained on a much harder task. Taking SlovakBERT as the baseline, RER calculates the percentage of error reduction achieved by each model relative to the baseline on each task and averages these percentages. Thus, a 1-point improvement on UD from 95 to 96 (a 20% error reduction) is weighted more than a 1-point improvement on QA from 75 to 76 (a 4% error reduction), reflecting how much of the remaining error was actually removed.
- Design Motivation: Inspired by de Vries et al. (2023), RER rationalizes the aggregation of heterogeneous tasks, avoiding biases from simple arithmetic means.
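
A minimal sketch of the RER computation follows. It assumes that error is defined as 100 minus the task score and that the per-task percentages are averaged with equal weight; the summary does not give the paper's exact formula, so the function names and the `max_score` parameter are illustrative assumptions.

```python
def relative_error_reduction(score: float, baseline: float,
                             max_score: float = 100.0) -> float:
    """Percentage of the baseline's error that the model removes.

    Assumes error = max_score - score. Negative values mean the model
    makes more errors than the baseline.
    """
    baseline_error = max_score - baseline
    if baseline_error <= 0:
        raise ValueError("baseline has no error left to reduce")
    model_error = max_score - score
    return 100.0 * (baseline_error - model_error) / baseline_error


def average_rer(scores: dict, baselines: dict) -> float:
    """Unweighted mean of per-task RER over the tasks in `baselines`."""
    per_task = [relative_error_reduction(scores[t], baselines[t])
                for t in baselines]
    return sum(per_task) / len(per_task)


# The same 1-point gain counts very differently depending on the baseline.
print(relative_error_reduction(96.0, 95.0))  # gain near the ceiling
print(relative_error_reduction(76.0, 75.0))  # gain on a harder task
```

By construction the baseline model scores 0 on every task, which matches the 0.00 RER reported for SlovakBERT.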

Loss & Training

All models were fine-tuned on each task using the AdamW optimizer, linear learning rate decay, and a batch size of 12. An extensive hyperparameter grid search (learning rate, number of epochs, warmup ratio) was run on A100 GPUs. In total, 4,024 experimental runs were executed, consuming around 130 GPU-days.
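
The sketch below illustrates the two ingredients named above: a linear learning-rate schedule with warmup and a grid over learning rate, epochs, and warmup ratio. The grid values are hypothetical placeholders, not the values used in the paper, and `linear_schedule_lr` is an illustrative helper rather than part of the skLEP toolkit.

```python
from itertools import product


def linear_schedule_lr(step: int, total_steps: int,
                       peak_lr: float, warmup_ratio: float) -> float:
    """Linear warmup to peak_lr, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Hypothetical search space; the actual grid is not given in this summary.
GRID = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "epochs": [3, 5],
    "warmup_ratio": [0.0, 0.1],
}

# Cartesian product of all settings, one dict per fine-tuning run.
configs = [dict(zip(GRID, values)) for values in product(*GRID.values())]
print(len(configs), "configurations per model and task")
```

Multiplying the number of configurations by the number of models, tasks, and random seeds shows how quickly a full grid reaches thousands of runs.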

Key Experimental Results

Main Results

Model	Type	Avg Score	Avg RER	UD(F1)	RTE(Acc)	NLI(Acc)	QA(F1)
mDeBERTaV3-Base	Multilingual	85.17	+6.43	98.02	70.94	84.41	75.89
SlovakBERT	Slovak	83.95	0.00	98.04	65.20	82.75	74.36
HPLT BERT-sk	Slovak	82.96	-1.29	98.23	56.81	80.71	75.14
FERNET-CC	Slovak	84.27	+0.55	97.87	68.23	81.83	73.85
XLM-R-Large	Multilingual	83.36	-11.90	98.23	57.35	85.80	77.14
DeBERTaV3-Large	English	83.95	-10.22	97.55	74.30	85.13	74.73
ModernBERT-Base	English	72.82	-132.43	91.52	57.77	71.42	—
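
To see how the two aggregate columns can disagree, the snippet below ranks the models once by average score and once by average RER. The numbers are copied from the table above; the `RESULTS` dictionary is just an illustrative container.

```python
# (average score, average RER) per model, copied from the table above.
RESULTS = {
    "mDeBERTaV3-Base": (85.17, 6.43),
    "SlovakBERT": (83.95, 0.00),
    "HPLT BERT-sk": (82.96, -1.29),
    "FERNET-CC": (84.27, 0.55),
    "XLM-R-Large": (83.36, -11.90),
    "DeBERTaV3-Large": (83.95, -10.22),
    "ModernBERT-Base": (72.82, -132.43),
}

# Rank from best to worst under each aggregate.
by_score = sorted(RESULTS, key=lambda m: RESULTS[m][0], reverse=True)
by_rer = sorted(RESULTS, key=lambda m: RESULTS[m][1], reverse=True)

for model in RESULTS:
    print(f"{model:18s} score rank {by_score.index(model) + 1}, "
          f"RER rank {by_rer.index(model) + 1}")
```

SlovakBERT and DeBERTaV3-Large tie on average score but sit far apart on RER, and HPLT BERT-sk moves ahead of both large models once errors are measured relative to the baseline.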

Translation Quality Assessment

Translation System	Avg Rank↓	Fluency/4	Adequacy/4
DeepL	1.81	3.70	3.73
GPT-4o	1.85	3.62	3.73
Google Translate	2.05	3.57	3.67
MADLAD-400-3B	2.54	3.48	3.53
NLLB-3.3B	2.68	3.40	3.54

Key Findings

mDeBERTaV3 is the strongest model: With 276M parameters, it achieves the best average score (85.17) and the best average RER (+6.43), suggesting that the architectural advantages of multilingual DeBERTa (disentangled attention + replaced token detection) carry over well to medium-resource languages.
Slovak-specific models remain competitive: Although SlovakBERT is the earliest specialized model, it maintains strong performance on multiple tasks, particularly UD, HS, and SA. However, it lags behind mDeBERTaV3 on reasoning-intensive tasks: 65.20 vs. 70.94 on RTE and 82.75 vs. 84.41 on NLI.
ModernBERT performs unexpectedly poorly: Despite being a recent English model with a modernized architecture, ModernBERT-Base reaches an RER of -132.43 on Slovak, far below the baseline. This indicates that models trained purely on English can be severely deficient in non-English languages, regardless of their advanced architectures.
UD and SA are close to being "solved": Most models achieve an F1/Accuracy >90 on these two tasks, leaving limited room for improvement. Conversely, RTE and QA still have substantial room for improvement (the best scores in the table above are 74.30 on RTE and 77.14 on QA) and represent key directions for future research.
Translation quality is controllable: Post-editing verification experiments show that translation-induced labeling errors are restricted to single-digit percentages (RTE ~2%, NLI ~5%), which does not undermine the validity of the benchmark.

Highlights & Insights

Comprehensive Benchmark Ecosystem: The project provides not only datasets but also an open-source evaluation toolkit, HuggingFace integration, and a public leaderboard, lowering the barrier to entry for subsequent researchers. This "benchmark-as-a-service" philosophy is highly exemplary for other low-resource language communities.
Introduction of the RER Metric: Using relative error reduction instead of simple averages for cross-task comparison uncovers differences masked by absolute scores (e.g., DeBERTaV3-Large and SlovakBERT have identical average scores of 83.95 but RER values of -10.22 and 0.00), providing a more faithful reflection of models' overall capabilities.
Systematic Verification of Translation Quality: Rather than simply stating that the data was translated, the authors validated quality in several layers: comparing five translation systems, having native speakers post-edit the output, and measuring agreement with reannotated labels. The resulting methodology is highly rigorous.

Limitations & Future Work

The evaluation only covers encoder-only Transformer models, omitting generative LLMs (e.g., GPT-4, Llama), whose performance on NLU tasks remains to be explored.
Three of the nine tasks are translated. Although post-edited, they may still contain artifacts such as "translationese".
Slovak-specific models are all base-scale (<150M parameters), making comparisons with large-scale multilingual models (500M+) somewhat unfair.
A human baseline is missing to serve as an upper bound.
Some tasks share high similarity (e.g., both NLI and RTE involve reasoning relationships); task diversity could be further expanded.
Future iterations can incorporate more task types (e.g., coreference resolution, relation extraction) and non-text/multimodal data.

vs KLEJ (Polish): KLEJ and skLEP each comprise 9 tasks and cover comparable scopes. Both face dataset scarcity in low/medium-resource languages, but skLEP establishes a more comprehensive translation-quality validation pipeline.
vs bgGLUE (Bulgarian): bgGLUE is also a benchmark for a Slavic language. However, resource conditions vary between Slovak and Bulgarian, and skLEP contributes more novel datasets.
vs XTREME/XGLUE (Multilingual): Multilingual benchmarks are broad but exclude Slovak, and they primarily provide English training data (focusing on cross-lingual transfer). skLEP supports Slovak-native training data, facilitating more targeted evaluations.

Rating

Novelty: ⭐⭐⭐ Primarily a benchmark-construction effort; technical innovation is limited, but it fills an important gap.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely thorough, with 14 models evaluated across 9 tasks, 4,024 runs for hyperparameter tuning, and multi-tier translation quality validation.
Writing Quality: ⭐⭐⭐⭐⭐ Clearly organized, with consistently presented results and extremely detailed appendices.
Value: ⭐⭐⭐⭐ Highly valuable to the Slovak NLP community, supported by a complete open-source ecosystem.