Behind Closed Words: Creating and Investigating the forePLay Annotated Dataset for Polish Erotic Discourse

Conference: ACL 2025
arXiv: 2412.17533
Code: GitHub
Area: Other
Keywords: content moderation, erotic content detection, Polish NLP, annotation, dataset

TL;DR

This work constructs forePLay (24,768 sentences, 5 categories), the first Polish erotic content detection dataset, and proposes a multidimensional annotation framework covering ambiguity, violence, and socially unacceptable behaviors. Evaluation shows that language-specific Polish models significantly outperform multilingual models, and that fine-tuned Transformer encoders handle the imbalanced categories best.

Background & Motivation

Background: The demand for online content moderation has surged, but existing tools are primarily designed for English and show limited effectiveness for morphologically rich languages like Polish.

Limitations of Prior Work: (a) Erotic content detection datasets are scarce and mostly in English; (b) Existing datasets mostly rely on simple binary classification, failing to capture fine-grained categories such as eroticism, violence, and socially unacceptable behaviors; (c) Safety filters of LLMs perform poorly on non-English erotic content.

Key Challenge: The expression of erotic content in morphologically complex languages like Polish is diverse, making simple binary classification schemes and English-centric tools ineffective for detection.

Goal: To provide a high-quality annotated dataset and model benchmarks for automatic erotic content detection in Polish.

Key Insight: Sampling from online fiction and Polish literary works, the authors design a 5-class mutually exclusive annotation scheme: erotic/ambiguous/violence/unacceptable/neutral.

Core Idea: Use a multidimensionally annotated Polish erotic content dataset to reveal the advantages of language-specific models in content moderation.

Method

Overall Architecture

Data sources: 69% online amateur fiction and 31% Polish literary works (including LGBTQ+ narratives), totaling 905 text units.

Annotation: 6 annotators (3 male, 3 female); each sentence is labeled independently by 3 of them, and the final label is set by majority voting followed by super-annotator arbitration.

Model evaluation: fine-tuned encoder models (HerBERT, RoBERTa), Polish LLMs (PLLuM series, Bielik), and general LLMs (GPT-4o, Llama-3.1, etc.) in zero-/few-shot settings.
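The aggregation step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the summary does not say exactly when the super-annotator steps in, so the sketch assumes arbitration happens only when no label reaches a strict majority. The names `aggregate_label` and `placeholder_arbiter` are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def aggregate_label(
    votes: Sequence[str],
    arbitrate: Callable[[Sequence[str]], str],
) -> str:
    """Return the strict-majority label, or defer to the arbiter if there is none."""
    if not votes:
        raise ValueError("at least one vote is required")
    label, count = Counter(votes).most_common(1)[0]
    # Assumption: a label wins only with more than half of the votes.
    if 2 * count > len(votes):
        return label
    return arbitrate(votes)


def placeholder_arbiter(votes: Sequence[str]) -> str:
    """Hypothetical stand-in for the human super-annotator's decision."""
    return "ambiguous"


# Two of three annotators agree, so the majority label is kept.
agreed = aggregate_label(["erotic", "erotic", "neutral"], placeholder_arbiter)
# Three-way disagreement, so the decision is passed to the arbiter.
disputed = aggregate_label(["erotic", "neutral", "ambiguous"], placeholder_arbiter)
```

In a real annotation workflow the arbiter would be a human decision recorded per sentence; the callable here only marks where that decision plugs in.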

Key Designs

Five-class Annotation Scheme:
- erotic (25.68%): descriptions of sexual activity, explicit erotic suggestions
- ambiguous (5.43%): content that may trigger sexual associations in specific contexts but is neutral in itself
- violence (0.28%): sexual harassment/rape/non-consensual violence
- unacceptable (0.47%): illegal/taboo behaviors such as pedophilia, bestiality, incest, etc.
- neutral (68.14%): other content
- Hierarchy rule: unacceptable > violence > erotic > ambiguous > neutral
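One way to read the hierarchy rule is as a tie-breaker that keeps the scheme mutually exclusive: when a sentence fits several categories, the highest-priority one is kept. The sketch below encodes that reading; how the authors actually apply the rule is not detailed in this summary, and the function name `resolve` is hypothetical.

```python
from typing import Iterable

# Priority order taken from the hierarchy rule, highest first.
PRIORITY = ["unacceptable", "violence", "erotic", "ambiguous", "neutral"]
RANK = {label: position for position, label in enumerate(PRIORITY)}


def resolve(candidates: Iterable[str]) -> str:
    """Pick the highest-priority label among the categories that apply."""
    candidates = list(candidates)
    if not candidates:
        raise ValueError("at least one candidate label is required")
    # A lower rank means a higher priority; unknown labels raise KeyError.
    return min(candidates, key=RANK.__getitem__)


both_erotic_and_violent = resolve(["erotic", "violence"])
suggestive_only = resolve(["neutral", "ambiguous"])
```

Under this reading a sentence describing non-consensual sexual activity is labeled violence rather than erotic, which is consistent with the ordering above.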
Annotation Quality Control:
- Cohen's Kappa was mostly in the range of 0.66–0.72 (Krippendorff's Alpha = 0.716 after excluding the outlier annotator Fem1)
- A maximum of 2 stories per author to avoid overfitting to specific writing styles
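For readers unfamiliar with the agreement statistic, the sketch below computes pairwise Cohen's Kappa from scratch. The two label lists are toy data invented for illustration and are not taken from forePLay.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's Kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels (invented): the annotators agree on 3 of 4 sentences.
annotator_1 = ["erotic", "erotic", "neutral", "neutral"]
annotator_2 = ["erotic", "neutral", "neutral", "neutral"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the observed agreement is 0.75 and the chance agreement is 0.5, so Kappa is 0.5; values in the 0.66–0.72 range reported above therefore indicate agreement well beyond chance.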

Four Dataset Configurations

Basic: binary classification (neutral/erotic)
Core: 3-class classification (+ambiguous)
Extended: 4-class classification (+violence and unacceptable, merged into one class)
Full: complete 5-class classification
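The four configurations can be viewed as projections of the Full label set. The sketch below is an assumption-laden illustration: the summary does not say what happens to sentences whose label is absent from a coarser configuration, so here they are dropped (mapped to `None`). The merged class name `violence_or_unacceptable` and the function `project` are hypothetical.

```python
from typing import Dict, Optional

FULL_LABELS = ["neutral", "erotic", "ambiguous", "violence", "unacceptable"]
MERGED = "violence_or_unacceptable"  # hypothetical name for the merged class

CONFIGS: Dict[str, Dict[str, str]] = {
    "basic": {"neutral": "neutral", "erotic": "erotic"},
    "core": {"neutral": "neutral", "erotic": "erotic", "ambiguous": "ambiguous"},
    "extended": {
        "neutral": "neutral",
        "erotic": "erotic",
        "ambiguous": "ambiguous",
        "violence": MERGED,
        "unacceptable": MERGED,
    },
    "full": {label: label for label in FULL_LABELS},
}


def project(label: str, config: str) -> Optional[str]:
    """Map a Full label into a configuration.

    Assumption: labels outside the configuration are dropped (None).
    """
    return CONFIGS[config].get(label)


class_counts = {name: len(set(mapping.values())) for name, mapping in CONFIGS.items()}
```

If the dataset instead folds the missing classes into an existing one, only the dictionaries need to change; the class counts per configuration (2, 3, 4, 5) stay the same.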

Key Experimental Results

Encoder Models (Fine-tuning)

| Model | Basic F1 | Core F1 | Full F1 |
|---|---|---|---|
| HerBERT-Large | 0.939 | 0.738 | 0.648 |
| RoBERTa-Base | 0.944 | 0.738 | 0.707 |
| RoBERTa-Large | 0.943 | 0.748 | 0.664 |

LLM Zero-/Few-shot Comparison

| Model | Basic F1 | Core F1 | Full F1 |
|---|---|---|---|
| GPT-4o (0-shot) | 0.888 | 0.640 | 0.340 |
| PLLuM-Mistral-12B | 0.894 | 0.656 | 0.401 |
| PLLuM-Mixtral-8x7B | 0.874 | 0.647 | - |
| Bielik-11B (5-shot) | 0.868 | 0.607 | 0.480 |

Key Findings

Language-specific Polish encoder models consistently outperform general multilingual LLMs (RoBERTa-Base Full F1=0.707 vs GPT-4o=0.340).
As classification granularity increases (Basic \(\rightarrow\) Full), the performance of all models drops significantly.
The ambiguous category has the lowest annotation consistency and remains the primary detection bottleneck.
Polish-specific LLMs (PLLuM) outperform general LLMs with the same architecture in zero-shot settings.
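The drop from Basic to Full is easier to interpret with the metric spelled out. The summary reports "F1" without naming the averaging scheme; the sketch below assumes macro-averaging, under which a single missed rare class pulls the score down sharply even when accuracy stays high. The labels and predictions are toy data invented for illustration.

```python
from typing import Sequence


def class_f1(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 for one class; defined as 0.0 when the class is never seen or predicted."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    return sum(class_f1(y_true, y_pred, label) for label in labels) / len(labels)


# Toy data (invented): the single "violence" sentence is predicted as neutral.
y_true = ["neutral"] * 8 + ["erotic", "violence"]
y_pred = ["neutral"] * 8 + ["erotic", "neutral"]
labels = ["neutral", "erotic", "violence"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, labels)
```

In this toy case accuracy is 0.9 while the macro-averaged score falls to roughly 0.65, because the missed rare class contributes an F1 of zero to the average.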

Highlights & Insights

The multidimensional annotation scheme (particularly the distinction of ambiguous and violence/unacceptable) is closer to real-world content moderation requirements than simple binary classification.
It shows that small fine-tuned, language-specific models can significantly outperform large general-purpose LLMs on content moderation.
The dataset intentionally covers LGBTQ+ content, avoiding systematic omission of such narratives.

Limitations & Future Work

Samples in the violence and socially unacceptable categories are extremely sparse (\(< 1\%\)), making it difficult for models to learn.
It features sentence-level annotation only, lacking document-level context.
It is limited to Polish; while the methodology is transferable, the data itself is not directly applicable to other languages.
Human label variation (HLV) remains high on the ambiguous category.
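To make the sparsity concrete, the reported class shares can be converted into approximate sentence counts. The figures below are derived from the rounded percentages and the 24,768-sentence total given in this summary, so they are estimates rather than the dataset's official counts.

```python
TOTAL_SENTENCES = 24_768

# Class shares in percent, as reported in the annotation scheme.
SHARES = {
    "erotic": 25.68,
    "ambiguous": 5.43,
    "violence": 0.28,
    "unacceptable": 0.47,
    "neutral": 68.14,
}

# Approximate counts; rounding of the percentages makes these estimates only.
approx_counts = {
    label: round(TOTAL_SENTENCES * pct / 100) for label, pct in SHARES.items()
}

for label, count in sorted(approx_counts.items(), key=lambda item: item[1]):
    print(f"{label:>12}: ~{count}")
```

The two rarest classes come out at roughly 69 (violence) and 116 (unacceptable) sentences before any train/test split, which explains why models struggle to learn them.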

Comparison with Related Work

vs Jigsaw/BeaverTails: English-centric + binary classification, while forePLay provides Polish + 5-class classification.
vs CENSORCHAT: Designed for dialog system monitoring, whereas forePLay is tailored for text content detection.
vs Llama Guard: A general-purpose safety classifier, whereas this work shows that language-specific models perform better in non-English scenarios.

Rating

Novelty: ⭐⭐⭐⭐ The first Polish erotic content detection dataset with a distinctive annotation scheme.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparison among encoders, LLMs, open-source, and closed-source models.
Writing Quality: ⭐⭐⭐⭐ Detailed annotation pipeline description and thorough ethical considerations.
Value: ⭐⭐⭐ Directly useful for the Polish NLP community, though the cross-lingual transferability of the method is only moderate.