Comparative Analysis of Multilingual Hate Speech Detection¶
Conference: ACL 2025
Code: None
Area: Multilingual Translation
Keywords: Hate Speech Detection, Multilingual NLP, Cross-lingual Transfer, Text Classification, Social Media Analysis
TL;DR¶
This paper systematically compares the performance of various LLMs and pre-trained language models on multilingual hate speech detection tasks, revealing the key bottlenecks of cross-lingual transfer and proposing enhancement strategies for low-resource languages.
Background & Motivation¶
Background: Hate speech detection is a core task in social media content moderation. English hate speech detection is relatively mature, but in practice hate speech occurs in many languages. Multilingual pre-trained models (e.g., mBERT, XLM-R) and multilingual LLMs (e.g., GPT-4, Llama) offer new possibilities for cross-lingual hate speech detection.
Limitations of Prior Work: Most existing studies focus on a single language or a few high-resource languages and lack a systematic comparison across many languages. Model performance varies drastically across languages, yet no unified evaluation framework exists to analyze the root causes of these disparities. Furthermore, hate speech is culturally dependent, which creates a semantic gap for cross-lingual transfer: the same expression may carry entirely different meanings in different cultures.
Key Challenge: Multilingual models perform excellently on high-resource languages but experience a sharp performance decline on low-resource languages, and different types of hate speech (explicit vs. implicit) present varying levels of challenge to the models.
Goal: (1) Establish a unified evaluation benchmark covering over 10 languages; (2) systematically compare performance differences between fine-tuned models and prompting-based LLMs; (3) analyze the causes of cross-lingual transfer failures and propose improvement strategies.
Key Insight: The authors aggregate multiple existing multilingual hate speech datasets and construct a comprehensive evaluation platform covering multiple languages and hate speech types after unifying the annotation schemes. They conduct a systematic analysis from three dimensions: model architecture, language characteristics, and data scale.
Core Idea: Through large-scale multi-dimensional comparative experiments, the authors discover that the core bottleneck of multilingual hate speech detection lies in data and cultural adaptation rather than model capabilities. Consequently, they propose low-resource language enhancement strategies based on translation augmentation and cultural context injection.
Method¶
Overall Architecture¶
This paper adopts a unified evaluation framework to compare three categories of methods on standardized multilingual hate speech datasets: (1) fine-tuned multilingual pre-trained models (mBERT, XLM-R, InfoXLM); (2) zero-shot/few-shot LLM prompting (GPT-4, Claude, Llama-3); (3) enhancement strategies (translation augmentation, cross-lingual training data mixing, cultural context injection). The evaluation covers over 10 languages, including English, German, Arabic, Hindi, Turkish, and Indonesian.
Key Designs¶
- Unified Multilingual Evaluation Framework:
- Function: Unifies fragmented multilingual hate speech datasets into a comparable evaluation benchmark.
- Mechanism: Collects and integrates multiple public datasets, mapping their different annotation schemes to a unified three-level classification (non-hate / hate / severe hate). The data for each language is split into training/validation/test sets at an 8:1:1 ratio, with class distributions kept consistent across the splits. Each language gets its own preprocessing pipeline (tokenization, emoji handling, slang standardization, etc.).
- Design Motivation: The lack of a unified benchmark is a major obstacle hindering progress in this field, as different papers report results on different datasets, making direct comparison impossible.
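The label unification and 8:1:1 split described for the unified framework can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the source label names in `LABEL_MAP`, the helper names `unify` and `split_8_1_1`, and the choice to split each class separately (one simple way to keep class distributions consistent) are all assumptions.

```python
import random
from collections import defaultdict

# Hypothetical mapping from dataset-specific labels to the unified
# three-level scheme (non-hate / hate / severe hate).
LABEL_MAP = {
    "normal": "non-hate",
    "none": "non-hate",
    "offensive": "hate",
    "hateful": "hate",
    "threat": "severe hate",
}

def unify(examples):
    """Map (text, source_label) pairs to (text, unified_label); skip unknown labels."""
    return [(text, LABEL_MAP[label]) for text, label in examples if label in LABEL_MAP]

def split_8_1_1(examples, seed=0):
    """Split each class 8:1:1 so all three splits keep similar class proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    train, val, test = [], [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_train = len(items) * 8 // 10  # integer arithmetic avoids float rounding
        n_val = len(items) // 10
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # remainder goes to the test split
    return train, val, test
```

Splitting per class is only one option; for very small classes the remainder handling above pushes leftover examples into the test split, which may or may not be what a real benchmark wants.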
- Multi-dimensional Model Comparative Analysis:
- Function: Systematically analyzes performance differences of different models in multilingual hate speech detection from multiple angles.
- Mechanism: Designs four analysis dimensions: (a) language dimension: F1 differences of the same model across languages; (b) model dimension: ranking of different models on the same language; (c) hate type dimension: the gap in detection difficulty between explicit and implicit hate speech; (d) data scale dimension: performance as a function of training set size. The statistical significance of differences between models is assessed with McNemar's test.
- Design Motivation: Multi-dimensional analysis can reveal interaction effects that single comparative experiments cannot capture, such as certain models being exceptionally weak at detecting implicit hate in specific languages.
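For readers unfamiliar with McNemar's test, the sketch below shows how two models can be compared on the same test set using only the examples where exactly one of them is correct. The summary does not say which variant the authors used; the exact (binomial) two-sided form and the function name `mcnemar_exact` are assumptions made for illustration.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions of two models.

    Returns (b, c, p_value), where b counts examples only model A gets right
    and c counts examples only model B gets right.
    """
    b = sum(a == y and p != y for y, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != y and p == y for y, a, p in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree in correctness
    k = min(b, c)
    # Under the null hypothesis, each discordant pair favors A or B with prob. 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

Because the test uses paired per-example outcomes rather than aggregate F1, it requires that both models be evaluated on exactly the same test examples.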
- Low-resource Language Enhancement Strategies:
- Function: Improves the performance of models in detecting hate speech in low-resource languages.
- Mechanism: Proposes a combination of three enhancement strategies: (a) translation-based augmentation: translating annotated data from high-resource languages (English) into target low-resource languages via machine translation to expand the training set; (b) cross-lingual mixed training: mixing data from multiple languages in the training set for joint training to let models learn language-invariant hate features; (c) cultural context injection: adding descriptions of typical hate speech patterns in the target language/culture to the prompts to help LLMs understand culture-specific hate expressions.
- Design Motivation: Pure reliance on cross-lingual transfer ignores cultural differences. Translation augmentation can compensate for data scarcity, while cultural context injection can bridge semantic gaps.
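Two of the enhancement strategies above, translation-based augmentation and cultural context injection, can be sketched as small data and prompt transformations. Everything here is hypothetical: the function names, the prompt wording, and the `translate` callable (a stand-in for whatever machine translation system is used) are illustrative assumptions, not the authors' templates or pipeline.

```python
def augment_with_translation(target_data, english_data, translate):
    """Append machine-translated English examples to the target-language training set.

    `translate` is a caller-supplied function mapping English text to the
    target language; labels are carried over unchanged.
    """
    translated = [(translate(text), label) for text, label in english_data]
    return list(target_data) + translated

def build_prompt(text, language, cultural_notes=None, shots=()):
    """Assemble a classification prompt; `cultural_notes` is the injected context."""
    parts = [
        f"Classify the following {language} post as one of: "
        "non-hate, hate, severe hate."
    ]
    if cultural_notes:
        parts.append("Cultural context:")
        parts += [f"- {note}" for note in cultural_notes]
    for example_text, example_label in shots:  # optional few-shot examples
        parts.append(f"Post: {example_text}\nLabel: {example_label}")
    parts.append(f"Post: {text}\nLabel:")
    return "\n".join(parts)
```

Passing `cultural_notes=None` and `shots=()` yields a plain zero-shot prompt, so the same builder covers the zero-shot, few-shot, and context-injected configurations.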
Loss & Training¶
Fine-tuned models employ the standard cross-entropy loss with class weight balancing (as the hate class is usually the minority). LLM evaluation uses both zero-shot and 5-shot configurations, with manually optimized prompt templates. All experiments are run 3 times and averaged to mitigate randomness.
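A minimal sketch of class-weighted cross-entropy, written in plain Python to keep it self-contained. The summary does not state how the class weights are computed; the inverse-frequency formula below and both function names are assumptions. The loss is normalized by the sum of the weights of the examples in the batch.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """w_c = N / (K * n_c): rarer classes receive larger weights."""
    counts = Counter(labels)
    n = len(labels)
    return [
        n / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]

def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of -log softmax(logits)[label] over a batch."""
    total, norm = 0.0, 0.0
    for z, y in zip(logits, labels):
        m = max(z)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(v - m) for v in z))
        total += weights[y] * (log_sum - z[y])
        norm += weights[y]
    return total / norm
```

With class indices 0 = non-hate, 1 = hate, 2 = severe hate (an arbitrary ordering chosen here), a batch dominated by class 0 gives the two hate classes proportionally larger weights, so errors on them contribute more to the loss.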
Key Experimental Results¶
Main Results¶
| Model | English F1 | German F1 | Arabic F1 | Hindi F1 | Turkish F1 | Average F1 |
|---|---|---|---|---|---|---|
| mBERT (Fine-tuned) | 78.3 | 74.1 | 68.5 | 62.3 | 66.8 | 70.0 |
| XLM-R-Large (Fine-tuned) | 82.7 | 79.4 | 73.2 | 67.8 | 72.1 | 75.0 |
| GPT-4 (Zero-shot) | 76.5 | 72.3 | 65.8 | 58.2 | 63.4 | 67.2 |
| GPT-4 (5-shot) | 80.1 | 76.8 | 70.4 | 63.5 | 68.9 | 71.9 |
| XLM-R + Translation Augmentation | 83.1 | 80.2 | 76.8 | 72.4 | 75.3 | 77.6 |
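The Average F1 column and the per-language gaps to English can be recomputed directly from the table. The snippet below only copies the reported numbers; the helper names are made up for this check, and the average is taken over the five listed languages.

```python
# F1 scores copied from the table: English, German, Arabic, Hindi, Turkish.
LANGUAGES = ["English", "German", "Arabic", "Hindi", "Turkish"]
SCORES = {
    "mBERT (Fine-tuned)": [78.3, 74.1, 68.5, 62.3, 66.8],
    "XLM-R-Large (Fine-tuned)": [82.7, 79.4, 73.2, 67.8, 72.1],
    "GPT-4 (Zero-shot)": [76.5, 72.3, 65.8, 58.2, 63.4],
    "GPT-4 (5-shot)": [80.1, 76.8, 70.4, 63.5, 68.9],
    "XLM-R + Translation Augmentation": [83.1, 80.2, 76.8, 72.4, 75.3],
}

def average_f1(model):
    """Unweighted mean over the five listed languages, rounded to one decimal."""
    row = SCORES[model]
    return round(sum(row) / len(row), 1)

def gap_to_english(model):
    """How many F1 points each non-English language trails English."""
    row = SCORES[model]
    return {lang: round(row[0] - f1, 1) for lang, f1 in zip(LANGUAGES[1:], row[1:])}

def gap_between(model_a, model_b, language):
    """F1 difference (model_a minus model_b) on one language."""
    i = LANGUAGES.index(language)
    return round(SCORES[model_a][i] - SCORES[model_b][i], 1)
```

For example, `gap_between("XLM-R-Large (Fine-tuned)", "GPT-4 (5-shot)", "English")` gives the English gap between the fine-tuned model and few-shot prompting.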
Ablation Study¶
| Configuration | Low-resource Average F1 | Note |
|---|---|---|
| XLM-R Baseline | 67.8 | Fine-tuned on target language data only |
| + Translation Augmentation | 72.4 | English translated data helps significantly |
| + Cross-lingual Mixed Training | 74.1 | Multilingual mixing yields further improvements |
| + Cultural Context Injection | 75.3 | Cultural information is particularly effective for implicit hate |
| Translation Augmentation Only (No original data) | 65.2 | Translated data cannot completely replace original data |
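Assuming the rows marked with "+" are cumulative (each adds one strategy on top of the previous row), the per-strategy gains follow from simple differences. The snippet copies the reported values and is only a reading aid; the variable and function names are illustrative.

```python
# (configuration, low-resource average F1), copied from the ablation table.
CUMULATIVE = [
    ("XLM-R Baseline", 67.8),
    ("+ Translation Augmentation", 72.4),
    ("+ Cross-lingual Mixed Training", 74.1),
    ("+ Cultural Context Injection", 75.3),
]
TRANSLATION_ONLY = 65.2  # trained on translated data without original data

def incremental_gains(rows):
    """F1 gain of each configuration over the row directly above it."""
    return {
        name: round(f1 - prev_f1, 1)
        for (_, prev_f1), (name, f1) in zip(rows, rows[1:])
    }

gains = incremental_gains(CUMULATIVE)
total_gain = round(CUMULATIVE[-1][1] - CUMULATIVE[0][1], 1)
translation_only_delta = round(TRANSLATION_ONLY - CUMULATIVE[0][1], 1)
```

Under this reading, translation augmentation accounts for most of the total improvement, while training on translated data alone falls below the baseline.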
Key Findings¶
- Fine-tuned XLM-R-Large outperforms prompted GPT-4 in every listed language. The gap narrows in high-resource languages: in English it is 6.2 F1 over zero-shot GPT-4 (82.7 vs. 76.5) and only 2.6 F1 over 5-shot GPT-4 (82.7 vs. 80.1).
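- Performance on low-resource languages (Hindi, Turkish) is 10-20 F1 points lower than on English for the baseline models (for fine-tuned XLM-R-Large, 14.9 points for Hindi and 10.6 for Turkish); the authors identify linguistic distance as the primary factor.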
- F1 on implicit hate (sarcasm, metaphor) is 15-25 points lower than on explicit hate, a challenge shared by all models.
- Translation-based augmentation provides the largest boost for low-resource languages (+4.6 F1), though translation quality is a bottleneck—high-quality translation engines yield better results.
Highlights & Insights¶
- The unified multilingual evaluation framework fills a gap in the field, providing a reproducible comparative benchmark for subsequent research, which may contribute more to the community than proposing a new model.
- The proposal of the cultural context injection strategy is highly insightful—hate speech is fundamentally a cultural phenomenon, which purely linguistic approaches struggle to fully address.
- The experimental finding that fine-tuned medium-sized models still outperform zero-shot large language models offers critical practical guidance for real-world deployment.
Limitations & Future Work¶
- The mapping process in the unified annotation scheme may introduce noise, as the original annotation quality varies across different datasets.
- Cultural context injection relies on manually curated cultural knowledge, which limits its scalability.
- Multimodal hate speech (combining text and images), an increasingly common expression of hate on social media, is not covered.
- Test sets for some low-resource languages are small, which may make the reported results unstable.
Related Work & Insights¶
- vs HateDay: HateDay focuses on analyzing the temporal distribution of hate speech, whereas this work concentrates on evaluating model capabilities across different languages. The two are complementary.
- vs ImpliHateVid: That work targets implicit hate in videos, while this paper focuses on text; however, the challenges of implicit hate detection are shared.
- vs Original XLM-R Paper: This work further reveals the cross-lingual transfer capabilities and limitations of XLM-R on the specific task of hate speech detection.
Rating¶
- Novelty: ⭐⭐⭐ Limited methodological innovation; the primary contributions lie in the evaluation framework and empirical analysis.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Highly systematic and comprehensive analysis, covering multiple languages, models, and dimensions.
- Writing Quality: ⭐⭐⭐⭐ Well-organized analytical framework with data-supported conclusions.
- Value: ⭐⭐⭐⭐ Highly valuable reference for the multilingual hate speech detection community; the unified benchmark is of great significance.