Comparative Analysis of Multilingual Hate Speech Detection

Conference: ACL 2025
Code: None
Area: Multilingual Translation
Keywords: Hate Speech Detection, Multilingual NLP, Cross-lingual Transfer, Text Classification, Social Media Analysis

TL;DR

This paper systematically compares the performance of various LLMs and pre-trained language models on multilingual hate speech detection tasks, revealing the key bottlenecks of cross-lingual transfer and proposing enhancement strategies for low-resource languages.

Background & Motivation

Background: Hate speech detection is a core task in social media content moderation. English hate speech detection is relatively mature, but hate speech occurs in many other languages as well. Multilingual pre-trained models (e.g., mBERT, XLM-R) and multilingual LLMs (e.g., GPT-4, Llama) offer new possibilities for cross-lingual hate speech detection.

Limitations of Prior Work: Most existing studies focus on a single language or a few high-resource languages, lacking a systematic comparison across a large number of languages. The performance of different models varies drastically across different languages, yet there is a lack of a unified evaluation framework to analyze the root causes of these disparities. Furthermore, the cultural dependency of hate speech creates a semantic gap for cross-lingual transfer—the same expression may carry entirely different meanings in different cultures.

Key Challenge: Multilingual models perform well on high-resource languages but degrade sharply on low-resource ones, and explicit and implicit hate speech differ in how hard they are for models to detect.

Goal: (1) Establish a unified evaluation benchmark covering over 10 languages; (2) systematically compare performance differences between fine-tuned models and prompting-based LLMs; (3) analyze the causes of cross-lingual transfer failures and propose improvement strategies.

Key Insight: The authors aggregate multiple existing multilingual hate speech datasets and construct a comprehensive evaluation platform covering multiple languages and hate speech types after unifying the annotation schemes. They conduct a systematic analysis from three dimensions: model architecture, language characteristics, and data scale.

Core Idea: Through large-scale multi-dimensional comparative experiments, the authors discover that the core bottleneck of multilingual hate speech detection lies in data and cultural adaptation rather than model capabilities. Consequently, they propose low-resource language enhancement strategies based on translation augmentation and cultural context injection.

Method

Overall Architecture

This paper adopts a unified evaluation framework to compare three categories of methods on standardized multilingual hate speech datasets: (1) fine-tuned multilingual pre-trained models (mBERT, XLM-R, InfoXLM); (2) zero-shot/few-shot LLM prompting (GPT-4, Claude, Llama-3); (3) enhancement strategies (translation augmentation, cross-lingual training data mixing, cultural context injection). The evaluation covers over 10 languages, including English, German, Arabic, Hindi, Turkish, and Indonesian.

Key Designs

Unified Multilingual Evaluation Framework:
- Function: Unifies fragmented multilingual hate speech datasets into a comparable evaluation benchmark.
- Mechanism: Collects and integrates multiple public datasets, mapping their different annotation schemes to a unified three-level classification (non-hate / hate / severe hate). The data for each language is split into training/validation/test sets at an 8:1:1 ratio, keeping class distributions consistent across the splits. Each language gets its own preprocessing pipeline (tokenization, emoji handling, slang standardization, etc.).
- Design Motivation: The lack of a unified benchmark is a major obstacle hindering progress in this field, as different papers report results on different datasets, making direct comparison impossible.
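
The label unification and the 8:1:1 split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the dataset names, the raw labels in `LABEL_MAP`, and the helper names `unify` and `stratified_split` are all hypothetical, and a per-class (stratified) split is assumed as the way class distributions are kept consistent.

```python
import random
from collections import defaultdict

# Hypothetical mapping from dataset-specific labels to the unified scheme.
# Real datasets use their own label names; these are placeholders.
LABEL_MAP = {
    "dataset_a": {"none": "non-hate", "weak": "hate", "strong": "severe hate"},
    "dataset_b": {"0": "non-hate", "1": "hate", "2": "severe hate"},
}


def unify(dataset, raw_label):
    """Map a dataset-specific label to the unified three-level label."""
    return LABEL_MAP[dataset][raw_label]


def stratified_split(examples, ratios=(8, 1, 1), seed=0):
    """Split (text, label) pairs into train/val/test, class by class.

    Splitting each class separately keeps the class proportions of the
    three splits close to those of the full data.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    total = sum(ratios)
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = len(items) * ratios[0] // total
        n_val = len(items) * ratios[1] // total
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # remainder goes to the test split
    return train, val, test
```

Calling `stratified_split` once per language, after passing every raw label through `unify`, would give one train/validation/test triple per language under these assumptions.
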
Multi-dimensional Model Comparative Analysis:
- Function: Systematically analyzes performance differences of different models in multilingual hate speech detection from multiple angles.
- Mechanism: Designs four analysis dimensions: (a) language dimension: F1 differences of the same model across different languages; (b) model dimension: ranking of different models on the same language; (c) hate type dimension: the detection difficulty gap between explicit and implicit hate speech; (d) data scale dimension: how performance changes with training set size. Statistical significance of differences between models is confirmed using McNemar's test.
- Design Motivation: Multi-dimensional analysis can reveal interaction effects that single comparative experiments cannot capture, such as certain models being exceptionally weak at detecting implicit hate in specific languages.
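
McNemar's test compares two classifiers on the same test items by looking only at the items where exactly one of them is correct. The sketch below uses the exact (binomial) two-sided form; the summary does not say whether the authors use the exact or the chi-square variant, so that choice and the function name `mcnemar_exact` are assumptions.

```python
from math import comb


def mcnemar_exact(gold, pred_a, pred_b):
    """Exact two-sided McNemar test for two models on the same test set.

    Returns (b, c, p_value), where b counts items model A gets right and
    model B gets wrong, and c counts the reverse.
    """
    b = sum(1 for g, a, p in zip(gold, pred_a, pred_b) if a == g and p != g)
    c = sum(1 for g, a, p in zip(gold, pred_a, pred_b) if a != g and p == g)
    n = b + c
    if n == 0:
        # The two models never disagree in correctness: no evidence of a difference.
        return b, c, 1.0
    k = min(b, c)
    # Under the null hypothesis, discordant items fall on either side with p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

A small p-value indicates that the two models' error patterns differ by more than chance would explain, which is a stronger statement than comparing two F1 scores alone.
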
Low-resource Language Enhancement Strategies:
- Function: Improves the performance of models in detecting hate speech in low-resource languages.
- Mechanism: Proposes a combination of three enhancement strategies: (a) translation-based augmentation: translating annotated data from high-resource languages (English) into target low-resource languages via machine translation to expand the training set; (b) cross-lingual mixed training: mixing data from multiple languages in the training set for joint training to let models learn language-invariant hate features; (c) cultural context injection: adding descriptions of typical hate speech patterns in the target language/culture to the prompts to help LLMs understand culture-specific hate expressions.
- Design Motivation: Pure reliance on cross-lingual transfer ignores cultural differences. Translation augmentation can compensate for data scarcity, while cultural context injection can bridge semantic gaps.
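
Two of the strategies above, translation-based augmentation and cultural context injection, can be sketched as follows. Everything here is illustrative: the prompt wording, the `CULTURAL_NOTES` placeholder, and the function names are hypothetical, and `translate` stands for whatever machine translation system is available, passed in by the caller.

```python
LABELS = ("non-hate", "hate", "severe hate")

# Hypothetical placeholder; real notes would be curated per language and culture.
CULTURAL_NOTES = {
    "tr": "<curated description of typical hate speech patterns for this language>",
}


def build_prompt(text, lang, notes=CULTURAL_NOTES, examples=()):
    """Assemble a classification prompt with optional cultural context.

    `examples` is a sequence of (text, label) pairs used as few-shot
    demonstrations; leave it empty for the zero-shot setting.
    """
    parts = [f"Classify the post as one of: {', '.join(LABELS)}."]
    note = notes.get(lang)
    if note:
        parts.append(f"Cultural context for '{lang}': {note}")
    for ex_text, ex_label in examples:
        parts.append(f"Post: {ex_text}\nLabel: {ex_label}")
    parts.append(f"Post: {text}\nLabel:")
    return "\n\n".join(parts)


def augment_with_translations(target_data, english_data, translate):
    """Return the original target-language data plus translated English data.

    `translate` is any callable mapping an English string to the target
    language. Labels are carried over unchanged from the English examples.
    """
    translated = [(translate(text), label) for text, label in english_data]
    return list(target_data) + translated
```

Keeping the original target-language data in the augmented set matters: the ablation reports that translated data alone scores below the baseline.
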

Loss & Training

Fine-tuned models employ the standard cross-entropy loss with class weight balancing (as the hate class is usually the minority). LLM evaluation uses both zero-shot and 5-shot configurations, with manually optimized prompt templates. All experiments are run 3 times and averaged to mitigate randomness.
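
As a rough illustration of class-weighted cross-entropy, the sketch below derives inverse-frequency weights and applies them to per-example losses. The summary does not state which weighting scheme the authors use, so the inverse-frequency formula and both function names are assumptions.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the sum of weights.

    `probs` is a list of dicts mapping each class to its predicted probability.
    """
    numerator = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator
```

With 8 non-hate and 2 hate examples, this scheme gives the hate class a weight of 2.5 and the non-hate class 0.625, so an error on a minority-class example contributes four times as much to the loss.
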

Key Experimental Results

Main Results

Model	English F1	German F1	Arabic F1	Hindi F1	Turkish F1	Average F1
mBERT (Fine-tuned)	78.3	74.1	68.5	62.3	66.8	70.0
XLM-R-Large (Fine-tuned)	82.7	79.4	73.2	67.8	72.1	75.0
GPT-4 (Zero-shot)	76.5	72.3	65.8	58.2	63.4	67.2
GPT-4 (5-shot)	80.1	76.8	70.4	63.5	68.9	71.9
XLM-R + Translation Augmentation	83.1	80.2	76.8	72.4	75.3	77.6
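
The Average F1 column and the per-language gaps between models can be recomputed directly from the table. The snippet below only restates the reported numbers; the helper names `average` and `gap` are illustrative.

```python
LANGS = ["English", "German", "Arabic", "Hindi", "Turkish"]

# F1 scores copied from the main results table, in the order of LANGS.
F1 = {
    "mBERT (Fine-tuned)": [78.3, 74.1, 68.5, 62.3, 66.8],
    "XLM-R-Large (Fine-tuned)": [82.7, 79.4, 73.2, 67.8, 72.1],
    "GPT-4 (Zero-shot)": [76.5, 72.3, 65.8, 58.2, 63.4],
    "GPT-4 (5-shot)": [80.1, 76.8, 70.4, 63.5, 68.9],
    "XLM-R + Translation Augmentation": [83.1, 80.2, 76.8, 72.4, 75.3],
}


def average(model):
    """Unweighted mean F1 over the five reported languages, to one decimal."""
    scores = F1[model]
    return round(sum(scores) / len(scores), 1)


def gap(model_a, model_b, lang):
    """F1 of model_a minus F1 of model_b on one language, to one decimal."""
    i = LANGS.index(lang)
    return round(F1[model_a][i] - F1[model_b][i], 1)


for model in F1:
    print(f"{model}: average F1 = {average(model)}")
```

For example, `gap("XLM-R-Large (Fine-tuned)", "GPT-4 (Zero-shot)", "English")` gives 6.2, while the same comparison against 5-shot GPT-4 gives 2.6.
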

Ablation Study

Configuration	Low-resource Average F1	Note
XLM-R Baseline	67.8	Fine-tuned on target language data only
+ Translation Augmentation	72.4	English translated data helps significantly
+ Cross-lingual Mixed Training	74.1	Multilingual mixing yields further improvements
+ Cultural Context Injection	75.3	Cultural information is particularly effective for implicit hate
Translation Augmentation Only (No original data)	65.2	Translated data cannot completely replace original data
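
Reading the "+" rows as cumulative, each strategy's incremental contribution is the difference from the previous row. The short calculation below makes those steps explicit; the cumulative reading is an assumption based on the row labels.

```python
# Low-resource average F1 from the ablation table, in cumulative order.
STEPS = [
    ("XLM-R Baseline", 67.8),
    ("+ Translation Augmentation", 72.4),
    ("+ Cross-lingual Mixed Training", 74.1),
    ("+ Cultural Context Injection", 75.3),
]
TRANSLATION_ONLY = 65.2  # translated data with no original target-language data


def incremental_gains(steps):
    """Gain of each configuration over the one before it, to one decimal."""
    return [
        (name, round(score - prev_score, 1))
        for (_, prev_score), (name, score) in zip(steps, steps[1:])
    ]


gains = incremental_gains(STEPS)
translation_only_delta = round(TRANSLATION_ONLY - STEPS[0][1], 1)

for name, gain in gains:
    print(f"{name}: {gain:+.1f} F1")
print(f"Translation only vs. baseline: {translation_only_delta:+.1f} F1")
```

The gains come out as +4.6, +1.7, and +1.2 F1, while training on translated data alone sits 2.6 F1 below the baseline.
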

Key Findings

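Fine-tuned XLM-R-Large outperforms GPT-4 in both the zero-shot and 5-shot settings in every reported language, and the gap narrows in high-resource languages: in English it is 6.2 F1 over zero-shot GPT-4 and only 2.6 F1 over 5-shot GPT-4, versus 9.6 and 4.3 F1 in Hindi.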
Performance on low-resource languages (Hindi, Turkish) is 10-20 F1 points lower than on English, with linguistic distance being the primary factor.
F1 on implicit hate (sarcasm, metaphor) is 15-25 points lower than on explicit hate, a challenge faced by all models.
Translation-based augmentation provides the largest boost for low-resource languages (+4.6 F1), though translation quality is a bottleneck—high-quality translation engines yield better results.

Highlights & Insights

The unified multilingual evaluation framework fills a gap in the field, providing a reproducible comparative benchmark for subsequent research, which may contribute more to the community than proposing a new model.
The proposal of the cultural context injection strategy is highly insightful—hate speech is fundamentally a cultural phenomenon, which purely linguistic approaches struggle to fully address.
The experimental finding that fine-tuned medium-sized models still outperform zero-shot large language models offers critical practical guidance for real-world deployment.

Limitations & Future Work

The mapping process in the unified annotation scheme may introduce noise, as the original annotation quality varies across different datasets.
Cultural context injection relies on manually curated cultural knowledge, which limits its scalability.
Multimodal hate speech (combining text and images), an increasingly common expression of hate on social media, is not covered.
Test sets for some low-resource languages are small, which may make the reported results less stable.

Related Work

vs HateDay: HateDay focuses on analyzing the temporal distribution of hate speech, whereas this work concentrates on evaluating model capabilities across different languages. The two are complementary.
vs ImpliHateVid: That work targets implicit hate in videos, while this paper focuses on text; however, the challenges of implicit hate detection are shared.
vs Original XLM-R Paper: This work further reveals the cross-lingual transfer capabilities and limitations of XLM-R on the specific task of hate speech detection.

Rating

Novelty: ⭐⭐⭐ Limited methodological innovation; the primary contributions lie in the evaluation framework and empirical analysis.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Highly systematic and comprehensive analysis, covering multiple languages, models, and dimensions.
Writing Quality: ⭐⭐⭐⭐ Well-organized analytical framework with data-supported conclusions.
Value: ⭐⭐⭐⭐ Highly valuable reference for the multilingual hate speech detection community; the unified benchmark is of great significance.