Comparative Analysis of Multilingual Hate Speech Detection

Conference: ACL 2025
Code: None
Area: Multilingual Translation
Keywords: Hate Speech Detection, Multilingual NLP, Cross-lingual Transfer, Text Classification, Social Media Analysis

TL;DR

This paper systematically compares the performance of various LLMs and pre-trained language models on multilingual hate speech detection tasks, revealing the key bottlenecks of cross-lingual transfer and proposing enhancement strategies for low-resource languages.

Background & Motivation

Background: Hate speech detection is a core task in social media content moderation. English hate speech detection is relatively mature, but hate speech occurs in many other languages as well. Multilingual pre-trained models (e.g., mBERT, XLM-R) and multilingual LLMs (e.g., GPT-4, Llama) offer new possibilities for cross-lingual hate speech detection.

Limitations of Prior Work: Most existing studies focus on a single language or a few high-resource languages and lack a systematic comparison across many languages. Model performance varies drastically across languages, yet no unified evaluation framework exists to analyze the root causes of these disparities. Furthermore, hate speech is culturally dependent, which creates a semantic gap for cross-lingual transfer: the same expression may carry entirely different meanings in different cultures.

Key Challenge: Multilingual models perform well on high-resource languages but degrade sharply on low-resource languages, and different types of hate speech (explicit vs. implicit) differ in how hard they are for models to detect.

Goal: (1) Establish a unified evaluation benchmark covering over 10 languages; (2) systematically compare performance differences between fine-tuned models and prompting-based LLMs; (3) analyze the causes of cross-lingual transfer failures and propose improvement strategies.

Key Insight: The authors aggregate multiple existing multilingual hate speech datasets and construct a comprehensive evaluation platform covering multiple languages and hate speech types after unifying the annotation schemes. They conduct a systematic analysis from three dimensions: model architecture, language characteristics, and data scale.

Core Idea: Through large-scale multi-dimensional comparative experiments, the authors discover that the core bottleneck of multilingual hate speech detection lies in data and cultural adaptation rather than model capabilities. Consequently, they propose low-resource language enhancement strategies based on translation augmentation and cultural context injection.

Method

Overall Architecture

This paper adopts a unified evaluation framework to compare three categories of methods on standardized multilingual hate speech datasets: (1) fine-tuned multilingual pre-trained models (mBERT, XLM-R, InfoXLM); (2) zero-shot/few-shot LLM prompting (GPT-4, Claude, Llama-3); (3) enhancement strategies (translation augmentation, cross-lingual training data mixing, cultural context injection). The evaluation covers over 10 languages, including English, German, Arabic, Hindi, Turkish, and Indonesian.

Key Designs

  1. Unified Multilingual Evaluation Framework:

    • Function: Unifies fragmented multilingual hate speech datasets into a comparable evaluation benchmark.
    • Mechanism: Collects and integrates multiple public datasets, mapping their different annotation schemes to a unified three-level classification (non-hate / hate / severe hate). The data for each language is split into training/validation/test sets at an 8:1:1 ratio, with class distributions kept consistent across the splits. Each language gets its own preprocessing pipeline (tokenization, emoji handling, slang standardization, etc.).
    • Design Motivation: The lack of a unified benchmark is a major obstacle hindering progress in this field, as different papers report results on different datasets, making direct comparison impossible.
  2. Multi-dimensional Model Comparative Analysis:

    • Function: Systematically analyzes performance differences of different models in multilingual hate speech detection from multiple angles.
    • Mechanism: Designs four analysis dimensions: (a) language dimension: F1 differences of the same model across languages; (b) model dimension: ranking of different models on the same language; (c) hate type dimension: the gap in detection difficulty between explicit and implicit hate speech; (d) data scale dimension: how performance changes with training set size. The statistical significance of differences between models is checked with McNemar's test.
    • Design Motivation: Multi-dimensional analysis can reveal interaction effects that single comparative experiments cannot capture, such as certain models being exceptionally weak at detecting implicit hate in specific languages.
  3. Low-resource Language Enhancement Strategies:

    • Function: Improves the performance of models in detecting hate speech in low-resource languages.
    • Mechanism: Proposes a combination of three enhancement strategies: (a) translation-based augmentation: translating annotated data from high-resource languages (English) into target low-resource languages via machine translation to expand the training set; (b) cross-lingual mixed training: mixing data from multiple languages in the training set for joint training to let models learn language-invariant hate features; (c) cultural context injection: adding descriptions of typical hate speech patterns in the target language/culture to the prompts to help LLMs understand culture-specific hate expressions.
    • Design Motivation: Pure reliance on cross-lingual transfer ignores cultural differences. Translation augmentation can compensate for data scarcity, while cultural context injection can bridge semantic gaps.
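
The significance check named in the second design can be illustrated with a short sketch. The summary only says that McNemar's test is used; the exact binomial variant below, the function name `mcnemar_exact`, and the toy labels are assumptions made for illustration, not the authors' implementation. The test looks only at the examples where two models disagree: `b` counts cases where model A is right and model B is wrong, `c` counts the reverse.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value), where
      b = number of examples model A gets right and model B gets wrong,
      c = number of examples model A gets wrong and model B gets right.
    Under the null hypothesis, each disagreement is equally likely to
    favour either model, so min(b, c) follows Binomial(b + c, 0.5).
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        # The two models never disagree, so there is no evidence of a difference.
        return b, c, 1.0
    k = min(b, c)
    # Probability of seeing at most k disagreements in the rarer direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy labels (hypothetical): 0 = non-hate, 1 = hate.
    gold = [1, 1, 1, 1, 0, 0, 0, 0]
    model_a = [1, 1, 1, 1, 0, 0, 0, 1]
    model_b = [1, 0, 0, 1, 0, 0, 1, 1]
    print(mcnemar_exact(gold, model_a, model_b))
```

The exact form is chosen here because disagreement counts on small low-resource test sets can be too low for the chi-square approximation to be reliable.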

Loss & Training

Fine-tuned models employ the standard cross-entropy loss with class weight balancing (as the hate class is usually the minority). LLM evaluation uses both zero-shot and 5-shot configurations, with manually optimized prompt templates. All experiments are run 3 times and averaged to mitigate randomness.
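
As a rough illustration of class-weighted cross-entropy, the sketch below uses inverse-frequency weights and a weight-normalized mean. The summary does not state which weighting formula the authors use, so the formula, the function names, and the toy label counts are all assumptions.

```python
import math
from collections import Counter


def balanced_class_weights(labels, num_classes):
    """Inverse-frequency weights: n_samples / (num_classes * class_count).

    Assumes every class index in range(num_classes) occurs at least once.
    """
    counts = Counter(labels)
    n = len(labels)
    return [n / (num_classes * counts[c]) for c in range(num_classes)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits_batch, labels, weights):
    """Weighted mean of per-example negative log-likelihoods.

    Each example's loss is scaled by the weight of its gold class, and the
    sum is divided by the total weight, so rare classes count for more.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, y in zip(logits_batch, labels):
        probs = softmax(logits)
        weighted_sum += weights[y] * -math.log(probs[y])
        weight_total += weights[y]
    return weighted_sum / weight_total


if __name__ == "__main__":
    # Toy label distribution (hypothetical): 0 = non-hate, 1 = hate, 2 = severe hate.
    labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
    weights = balanced_class_weights(labels, num_classes=3)
    logits = [[2.0, 0.5, -1.0]] * len(labels)  # a model that always favours class 0
    print(weights)
    print(weighted_cross_entropy(logits, labels, weights))
```

With these weights, a model that always predicts the majority class is penalized more heavily than under unweighted cross-entropy, which is the reason for weighting when the hate class is the minority.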

Key Experimental Results

Main Results

| Model | English F1 | German F1 | Arabic F1 | Hindi F1 | Turkish F1 | Average F1 |
|---|---|---|---|---|---|---|
| mBERT (Fine-tuned) | 78.3 | 74.1 | 68.5 | 62.3 | 66.8 | 70.0 |
| XLM-R-Large (Fine-tuned) | 82.7 | 79.4 | 73.2 | 67.8 | 72.1 | 75.0 |
| GPT-4 (Zero-shot) | 76.5 | 72.3 | 65.8 | 58.2 | 63.4 | 67.2 |
| GPT-4 (5-shot) | 80.1 | 76.8 | 70.4 | 63.5 | 68.9 | 71.9 |
| XLM-R + Translation Augmentation | 83.1 | 80.2 | 76.8 | 72.4 | 75.3 | 77.6 |

Ablation Study

| Configuration | Low-resource Average F1 | Note |
|---|---|---|
| XLM-R Baseline | 67.8 | Fine-tuned on target language data only |
| + Translation Augmentation | 72.4 | English translated data helps significantly |
| + Cross-lingual Mixed Training | 74.1 | Multilingual mixing yields further improvements |
| + Cultural Context Injection | 75.3 | Cultural information is particularly effective for implicit hate |
| Translation Augmentation Only (No original data) | 65.2 | Translated data cannot completely replace original data |

Key Findings

  • XLM-R in the fine-tuned setting systematically outperforms zero-shot LLMs, though the gap is smaller in high-resource languages: in English it is 6.2 F1 against zero-shot GPT-4 (82.7 vs. 76.5) and shrinks to 2.6 F1 against 5-shot GPT-4 (82.7 vs. 80.1).
  • Performance on low-resource languages (Hindi, Turkish) is roughly 10-20 F1 points lower than on English (10.6 to 18.3 points for the non-augmented models in the main results), with linguistic distance identified as the primary factor.
  • F1 on implicit hate (sarcasm, metaphor) is 15-25 points lower than on explicit hate, a challenge faced by all models.
  • Translation-based augmentation provides the largest single boost for low-resource languages (+4.6 F1, from 67.8 to 72.4 in the ablation), though translation quality is a bottleneck: higher-quality translation engines yield better results.
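
The figures quoted in these findings can be re-derived from the two tables. The short script below is only an arithmetic check; the variable names are illustrative, and the numbers are copied from the Main Results and Ablation Study tables.

```python
# F1 scores from the Main Results table, in the order of `languages`.
languages = ["English", "German", "Arabic", "Hindi", "Turkish"]
f1 = {
    "mBERT (Fine-tuned)": [78.3, 74.1, 68.5, 62.3, 66.8],
    "XLM-R-Large (Fine-tuned)": [82.7, 79.4, 73.2, 67.8, 72.1],
    "GPT-4 (Zero-shot)": [76.5, 72.3, 65.8, 58.2, 63.4],
    "GPT-4 (5-shot)": [80.1, 76.8, 70.4, 63.5, 68.9],
    "XLM-R + Translation Augmentation": [83.1, 80.2, 76.8, 72.4, 75.3],
}

# Per-model average over the five listed languages.
averages = {model: round(sum(s) / len(s), 1) for model, s in f1.items()}

# Gap between fine-tuned XLM-R and each GPT-4 setting on English.
english = languages.index("English")
xlmr_en = f1["XLM-R-Large (Fine-tuned)"][english]
gap_zero_shot = round(xlmr_en - f1["GPT-4 (Zero-shot)"][english], 1)
gap_five_shot = round(xlmr_en - f1["GPT-4 (5-shot)"][english], 1)

# Drop from English to each low-resource language, per model.
drop_vs_english = {
    model: {
        lang: round(s[english] - s[languages.index(lang)], 1)
        for lang in ("Hindi", "Turkish")
    }
    for model, s in f1.items()
}

# Cumulative ablation steps (low-resource average F1) and their increments.
ablation = [
    ("XLM-R Baseline", 67.8),
    ("+ Translation Augmentation", 72.4),
    ("+ Cross-lingual Mixed Training", 74.1),
    ("+ Cultural Context Injection", 75.3),
]
increments = {
    name: round(score - ablation[i][1], 1)
    for i, (name, score) in enumerate(ablation[1:])
}

if __name__ == "__main__":
    print(averages)
    print(gap_zero_shot, gap_five_shot)
    print(drop_vs_english)
    print(increments)
```

The increments show that each added strategy contributes less than the previous one, with translation augmentation accounting for most of the total gain over the baseline.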

Highlights & Insights

  • The unified multilingual evaluation framework fills a gap in the field, providing a reproducible comparative benchmark for subsequent research, which may contribute more to the community than proposing a new model.
  • The cultural context injection strategy is insightful: hate speech is fundamentally a cultural phenomenon, which purely linguistic approaches struggle to fully address.
  • The experimental finding that fine-tuned medium-sized models still outperform zero-shot large language models offers critical practical guidance for real-world deployment.

Limitations & Future Work

  • The mapping process in the unified annotation scheme may introduce noise, as the original annotation quality varies across different datasets.
  • Cultural context injection relies on manually curated cultural knowledge, which limits its scalability.
  • Multimodal hate speech (combining text and images), an increasingly common expression of hate on social media, is not covered.
  • Test sets for some low-resource languages are small, which may make the results for those languages unstable.

Related Work

  • vs. HateDay: HateDay focuses on analyzing the temporal distribution of hate speech, whereas this work concentrates on evaluating model capabilities across different languages. The two are complementary.
  • vs. ImpliHateVid: That work targets implicit hate in videos, while this paper focuses on text; however, the challenges of implicit hate detection are shared.
  • vs. Original XLM-R Paper: This work further reveals the cross-lingual transfer capabilities and limitations of XLM-R on the specific task of hate speech detection.

Rating

  • Novelty: ⭐⭐⭐ Limited methodological innovation; the primary contributions lie in the evaluation framework and empirical analysis.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Highly systematic and comprehensive analysis, covering multiple languages, models, and dimensions.
  • Writing Quality: ⭐⭐⭐⭐ Well-organized analytical framework with data-supported conclusions.
  • Value: ⭐⭐⭐⭐ Highly valuable reference for the multilingual hate speech detection community; the unified benchmark is of great significance.