Silencing Empowerment, Allowing Bigotry: Auditing the Moderation of Hate Speech on Twitch

Conference: ACL 2025
arXiv: 2506.07667
Code: Yes
Area: Social Computing
Keywords: Hate speech detection, Content moderation, Algorithmic auditing, Twitch, AutoMod

TL;DR

This study audits Twitch's automated content moderation tool, AutoMod, at scale by sending over 107,000 messages through it. It finds that AutoMod flags only 22% of hateful content even under its strictest settings, relies heavily on slurs as detection signals, and incorrectly blocks up to 89.5% of educational or empowering content.

Background & Motivation

Online platforms face immense pressure to moderate massive volumes of user-generated content, with real-time live-streaming platforms like Twitch requiring even lower moderation latency. While platforms increasingly deploy machine learning-based automated moderation systems, little is known about their actual effectiveness.

Choosing Twitch for this audit offers three key advantages:

1. Twitch has a massive and widely active user base.
2. Live streams can be configured in an "isolated" mode, visible only to the research team, enabling controlled experiments.
3. The AutoMod tool is highly configurable, offering moderation options across different categories and severity levels, and returns explicit moderation reasons.

Research questions:

- How effective is AutoMod at flagging hateful content?
- How specific and effective are different filters in detecting various types of hate speech?
- Are moderation rates consistent across different target groups?

Method

Overall Architecture

The auditing pipeline consists of three stages:

1. Bot setup and data curation.
2. Large-scale logging of moderation decisions.
3. Analysis of AutoMod moderation decisions.

Key Designs

Platform Selection and Experimental Setup
- Surveyed 43 of the largest user-generated content platforms and ultimately selected Twitch.
- Created isolated live-streaming channels to ensure that the tested content was visible only to the research team.
- Utilized three authenticated bots: a sender bot, a receiver bot, and a PubSub bot.
- Adhered to Twitch chat rate limits to prevent message duplication or omission.
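
The rate-limit constraint in the last bullet can be illustrated with a sliding-window limiter. This is only a sketch under stated assumptions, not the authors' implementation: `SlidingWindowLimiter`, `send_all`, and the `send` callable are hypothetical names, and the actual limit values should be taken from Twitch's documentation rather than from this example.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_messages` sends in any `window_seconds` span."""

    def __init__(self, max_messages, window_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._sent_at = deque()  # timestamps of recent sends, oldest first

    def wait_for_slot(self):
        """Block until one more message may be sent, then record the send."""
        now = self._clock()
        # Forget sends that have already left the window.
        while self._sent_at and now - self._sent_at[0] >= self.window_seconds:
            self._sent_at.popleft()
        if len(self._sent_at) >= self.max_messages:
            # Window is full: wait until the oldest send expires.
            self._sleep(self.window_seconds - (now - self._sent_at[0]))
            now = self._clock()
            self._sent_at.popleft()
        self._sent_at.append(now)


def send_all(messages, send, limiter):
    """Send messages one at a time, in order, without exceeding the limit.

    `send(index, text)` is a hypothetical stand-in for the chat client call.
    """
    for index, text in enumerate(messages):
        limiter.wait_for_slot()
        send(index, text)
```

Passing an explicit index to `send` makes it straightforward to match each outgoing message with the moderation decision logged for it, which is one way to detect duplicated or dropped messages.
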

Dataset Selection: Four datasets covering both explicit and implicit hate speech were utilized.

| Dataset | Source | Characteristics |
| --- | --- | --- |
| SBIC | Real comments | Offensive ratings included |
| IHC | Real implicit hate | Predominantly implicit hate |
| ToxiGen | Synthetic implicit hate | Generated by LLMs |
| DynaHate | Synthetic adversarial | Designed to trick classifiers |

Formalization of Audit Design
- Define the moderation system as \(\mathcal{S} = (\mathcal{F}, \mathcal{C})\), where \(\mathcal{F}\) is the set of filtering functions and \(\mathcal{C}\) is the set of corresponding criteria.
- Each filter \(\mathcal{F}_i: T \to \{0, 1\}\) operates with a filtering severity parameter \(\alpha\).
- Audited four categories: Disability, Sex/Gender/Sexual Orientation (SSG), Misogyny, and Race/Ethnicity/Religion (RER).
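
The formalization above can be made concrete with a small sketch. Everything here is illustrative: `ModerationSystem` and `make_keyword_filter` are hypothetical names, the keyword filter is a toy stand-in for AutoMod's unknown internals, and the rule that a message is held when any filter fires is an assumption rather than something stated in the summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A filter F_i maps (text, severity alpha) to 1 (flag) or 0 (allow).
FilterFn = Callable[[str, int], int]


@dataclass
class ModerationSystem:
    """S = (F, C): filters keyed by the category (criterion) they enforce."""
    filters: Dict[str, FilterFn]

    def decide(self, text: str, alpha: int) -> Dict[str, int]:
        """Return each filter's 0/1 decision for one message."""
        return {name: fn(text, alpha) for name, fn in self.filters.items()}

    def is_flagged(self, text: str, alpha: int) -> bool:
        """Assumption: a message is held if at least one filter fires."""
        return any(self.decide(text, alpha).values())


def make_keyword_filter(terms_by_level: Dict[int, List[str]]) -> FilterFn:
    """Toy filter: terms listed under level k become active once alpha >= k."""
    def filter_fn(text: str, alpha: int) -> int:
        lowered = text.lower()
        active = [term
                  for level, terms in terms_by_level.items()
                  if level <= alpha
                  for term in terms]
        return int(any(term in lowered for term in active))
    return filter_fn


# Placeholder terms only; real category lexicons are not public.
system = ModerationSystem({
    "Disability": make_keyword_filter({1: ["strongterm"], 3: ["mildterm"]}),
    "Misogyny": make_keyword_filter({1: ["otherterm"]}),
})
```

Keeping the per-filter decisions in `decide` separate from the aggregate `is_flagged` mirrors the audit's interest in which filter fired, not just whether a message was held.
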

Case Study Design
- Counterfactual analysis: Substituted slurs into false negative samples to observe changes in moderation decisions.
- Policy compliance evaluation: Tested AutoMod's handling of educational and empowering content.
- Robustness testing: Applied 6 types of meaning-preserving perturbations to sensitive words.
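
As a rough illustration of the robustness test, the sketch below applies meaning-preserving perturbations to one sensitive word in a message. The summary does not list the six perturbation types used in the paper, so the three shown here (character substitution, spacing, and letter repetition) are illustrative assumptions, and all function names are hypothetical.

```python
import re

# Illustrative character substitutions; not taken from the paper.
LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"})


def leetspeak(word):
    """Swap selected lowercase letters for look-alike digits."""
    return word.translate(LEET)


def space_out(word):
    """Insert a space between every character."""
    return " ".join(word)


def repeat_last(word, times=3):
    """Repeat the final character so it appears `times` times in total."""
    return word + word[-1] * (times - 1)


PERTURBATIONS = {
    "leetspeak": leetspeak,
    "space_out": space_out,
    "repeat_last": repeat_last,
}


def perturb_message(text, sensitive_word, perturb):
    """Apply `perturb` to each whole-word, case-insensitive match."""
    pattern = re.compile(r"\b" + re.escape(sensitive_word) + r"\b",
                         re.IGNORECASE)
    return pattern.sub(lambda match: perturb(match.group(0)), text)


def all_variants(text, sensitive_word):
    """Return one perturbed copy of `text` per perturbation type."""
    return {name: perturb_message(text, sensitive_word, fn)
            for name, fn in PERTURBATIONS.items()}
```

Each variant would then be sent through the same pipeline as the original message, and recall compared before and after perturbation.
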

Loss & Training

This work is an empirical analytical study and does not involve model training. The core methodology is black-box algorithmic auditing.

Key Experimental Results

Main Results (Overall Performance of AutoMod)

| Dataset | Accuracy | Precision | Recall | TNR | F1 |
| --- | --- | --- | --- | --- | --- |
| SBIC | 0.73 | 0.42 | 0.19 | 0.91 | 0.26 |
| DynaHate | 0.49 | 0.54 | 0.41 | 0.59 | 0.47 |
| ToxiGen | 0.53 | 0.86 | 0.07 | 0.98 | 0.13 |
| IHC | 0.52 | 0.70 | 0.06 | 0.97 | 0.12 |
| Overall | 0.55 | 0.56 | 0.22 | 0.84 | 0.32 |

Under the strictest settings, the overall recall of AutoMod is only 22%. The recall on implicit hate datasets (ToxiGen, IHC) drops to as low as 6%–7%.
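
For reference, the metrics in the table follow the standard confusion-matrix definitions, with "hateful" as the positive class and TNR denoting the true negative rate. The helper below is a minimal sketch with hypothetical function names; the final line checks that the reported overall F1 is consistent with the reported overall precision and recall.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return safe_div(2 * precision * recall, precision + recall)


def confusion_metrics(tp, fp, tn, fn):
    """Compute the table's metrics from raw counts.

    tp: hateful and flagged      fn: hateful but allowed
    fp: benign but flagged       tn: benign and allowed
    """
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "tnr": safe_div(tn, tn + fp),
        "f1": f1_score(precision, recall),
    }


# Overall row: precision 0.56 and recall 0.22 give F1 of about 0.32.
print(round(f1_score(0.56, 0.22), 2))  # 0.32
```

Because the table reports rounded precision and recall, recomputing F1 from them can differ from the reported value by about 0.01 on individual rows.
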

Filter Analysis

| Filter | Overall Recall | Pre-filtering Rate |
| --- | --- | --- |
| Disability | 10.6% | 6.1% |
| Misogyny | 19.0% | 1.5% |
| RER | 12.3% | 22.0% |
| SSG | 17.5% | 54.8% |

Key Case Study Results

| Experiment | Result |
| --- | --- |
| Counterfactual (after adding slurs) | Recall increased from ~20% to 100% |
| Misclassification rate of educational content (α=2) | 89.5% |
| Misclassification rate of educational content (α=4) | 98.5% |
| After meaning-preserving perturbations | Recall dropped from 100% to 4% |

Key Findings

- AutoMod relies heavily on slurs as hate detection signals: adding slurs causes recall to surge to 100%, confirming keyphrase dependency.
- It struggles significantly with implicit hate, flagging only 7% on ToxiGen and 6% on IHC.
- 89.8% of false negatives contain no vulgarity (i.e., they are implicit hate), while 73% of false positives contain vulgar words.
- The effectiveness of adjusting the filtering level \(\alpha\) is marginal, with an increase of only 1.1% from \(\alpha=1\) to \(\alpha=4\).
- Significant disparities exist across target groups; up to 98% of hate speech targeting people with mental disabilities evades moderation.
- AutoMod lacks conversational-level context awareness (altering message sequences does not affect moderation decisions).
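
The target-group disparity can be quantified as a per-group evasion rate, that is, the share of hateful messages aimed at a group that were not flagged. The sketch below assumes a hypothetical log format of `(target_group, was_flagged)` pairs covering hateful messages only; the group labels and records are made up for illustration.

```python
from collections import defaultdict


def evasion_rate_by_group(records):
    """Return, per target group, the fraction of hateful messages not flagged.

    `records` is an iterable of (target_group, was_flagged) pairs and must
    contain only messages labeled hateful in the source dataset.
    """
    totals = defaultdict(int)
    missed = defaultdict(int)
    for group, was_flagged in records:
        totals[group] += 1
        if not was_flagged:
            missed[group] += 1
    return {group: missed[group] / totals[group] for group in totals}


# Made-up example log, not data from the paper.
example_log = [
    ("group_a", False), ("group_a", False), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False),
]
rates = evasion_rate_by_group(example_log)
# rates["group_a"] is 0.75 and rates["group_b"] is 0.5
```

The evasion rate is simply one minus the per-group recall, so it can be compared directly against the recall figures reported elsewhere in this summary.
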

Highlights & Insights

- Methodological Contribution: Establishes a comprehensive black-box auditing framework for content moderation systems, which is generalizable to other platforms.
- Revealing Systemic Flaws: Provides strong evidence that AutoMod relies heavily on lexical matching rather than semantic understanding.
- Gap Between Policy and Practice: While Twitch's community guidelines explicitly require contextual consideration, AutoMod lacks this capability in practice.
- Social Impact: The system "allows bigotry and silences empowerment," a finding precisely encapsulated in the paper's title.

Limitations & Future Work

- The audit is restricted to a single platform (Twitch) without comparing it directly against other platforms' moderation systems.
- Messages were transmitted sequentially in a non-conversational manner, failing to fully simulate real conversational contexts.
- Only English text was analyzed, omitting multilingual and multimodal content.
- The audit did not evaluate Twitch's "Smart Detection" feature or its video and audio moderation capabilities.
- The impact of in-group/out-group dynamics on moderation decisions was not investigated.

Related Work

- Hartmann et al. (2025): Concurrent work evaluating multiple moderation APIs (e.g., OpenAI), finding similar issues (poor performance in implicit hate detection, false positives on educational content).
- DynaHate (Vidgen et al., 2021b): An adversarial hate speech corpus construction method, useful for further stress testing.
- Sap et al. (2020): Introduces the SBIC dataset, on which open-source classifiers perform significantly better than AutoMod.

Rating

- Novelty: ⭐⭐⭐⭐ First systematic and large-scale audit of Twitch's AutoMod.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive, covering 4 datasets, over 107,000 messages, individual filter analyses, and case studies.
- Writing Quality: ⭐⭐⭐⭐⭐ Highly structured flow, with a clear formal description of the research design and experimental pipeline.
- Value: ⭐⭐⭐⭐⭐ Provides crucial empirical evidence for improving content moderation systems and carries significant social impact.