BIASEDTALES-ML: A Multilingual Dataset for Analyzing Narrative Attribute Distributions in LLM-Generated Stories¶
Conference: ACL 2026 arXiv: 2604.17008 Code: https://huggingface.co/spaces/Linyuana/BIASEDTALES-ML Area: AIGC Detection Keywords: multilingual bias, narrative generation, social attribute distribution, cross-lingual consistency, children's stories
TL;DR¶
BiasedTales-ML constructs a corpus of ~350K LLM-generated children's stories across 8 languages using a full-permutation prompt design, and applies a distributional analysis framework to show that social attribute distributions in narratives vary significantly across languages, so English-centric evaluation fails to capture bias patterns in multilingual settings.
Background & Motivation¶
Background: LLMs are increasingly used to generate narrative content (especially children's stories), which implicitly conveys notions about social roles, occupations, and environments. Existing social bias research primarily focuses on English short-text tasks (e.g., sentence completion, classification).
Limitations of Prior Work: (1) Short-text bias evaluation cannot capture biases expressed indirectly through characters, scenes, and plot structures in long-form narratives; (2) existing bias benchmarks (e.g., StereoSet, BBQ) are static classification tasks disconnected from real generation scenarios; (3) virtually no work has systematically studied cross-lingual consistency of bias in multilingual narrative generation.
Key Challenge: RLHF and other safety alignment techniques are primarily developed on English data and Western norms, but model bias behavior in other languages may be entirely different — conclusions of "safe" from English evaluation may not hold in low-resource languages.
Goal: (1) Construct a large-scale multilingual parallel narrative corpus; (2) propose a systematic narrative-level social attribute distribution analysis framework; (3) empirically study cross-lingual bias consistency.
Key Insight: Children's stories are chosen as a controlled yet expressive narrative domain — encouraging positive and imaginative content while requiring models to make structured choices about characters, settings, and social roles.
Core Idea: Generate parallel stories across 8 languages through full-permutation prompt design (systematically varying nationality × religion × social class × parental role × child gender), and analyze bias using distributional metrics rather than instance-level annotation.
Method¶
Overall Architecture¶
Three-stage pipeline: (1) Prompt design and localization: construct standardized prompt templates, localized into 8 target languages by native speakers; (2) Large-scale parallel generation: use 3 LLMs to generate stories across all prompt configurations (5 independent samples per configuration); (3) Narrative feature extraction and distributional analysis: use LLM extractors to extract character traits, environments, and cultural references from stories, then compare distributions using statistical metrics.
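The configuration arithmetic behind stage (2) can be reproduced with a short sketch. The dimension sizes below follow the paper, but the concrete attribute values are placeholders (except the two social-class labels, which appear in the ablation results):

```python
from itertools import product

# Dimension sizes follow the paper; concrete values are placeholders,
# except the two class labels, which appear in the ablation table.
nationalities = [f"nationality_{i}" for i in range(27)]
religions = [f"religion_{i}" for i in range(6)]
social_classes = ["working_class", "affluent"]
parental_roles = [f"parental_role_{i}" for i in range(3)]
child_genders = [f"child_gender_{i}" for i in range(3)]

configs = list(product(nationalities, religions, social_classes,
                       parental_roles, child_genders))
print(len(configs))  # 27 * 6 * 2 * 3 * 3 = 2916

# Each configuration is generated in 8 languages by 3 models, 5 samples each:
total_stories = len(configs) * 8 * 3 * 5
print(total_stories)  # 349,920, i.e. the reported ~350K
```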
Key Designs¶
- Full-Permutation Prompt Design:
- Function: Construct controlled cross-lingual comparative experiments
- Mechanism: Systematically combine 27 nationalities × 6 religions × 2 social classes × 3 parental roles × 3 child genders = 2,916 unique prompt configurations, generated across 8 languages × 3 models with 5 samples per configuration, totaling ~350K stories. Language selection covers languages without grammatical gender (EN/ZH/JA/KO), with grammatical gender (ES/RU/AR), and a low-resource language (Swahili)
- Design Motivation: Full-permutation design allows separating the effects of language medium from cultural content, avoiding language-specific patterns that translation benchmarks might mask
- LLM-based Narrative Feature Extractor:
- Function: Extract structured social attribute representations from long-form stories
- Mechanism: Use Qwen-3-14B to extract three-dimensional representations \(E = (A_{\text{adj}}, V_{\text{env}}, C_{\text{cul}})\) from each story \(S\): character descriptive adjectives (e.g., brave, obedient), environment keywords (e.g., forest, kitchen), and cultural references (e.g., menorah, dates). Human validation on 800 stories achieves 85.6% precision with Cohen's \(\kappa = 0.618\)
- Design Motivation: Narrative bias is expressed indirectly through character descriptions and scene settings, requiring structured extraction beyond surface keywords
- Multi-Dimensional Distributional Bias Metrics:
- Function: Quantify and compare cross-lingual social attribute distribution differences
- Mechanism: Four complementary metrics: (1) Directional bias \(S_C = \ln(P(C|g_m)/P(C|g_f))\) measures the association direction between specific attribute categories and gender; (2) JSD measures overall distributional divergence; (3) cosine similarity measures cross-lingual bias pattern consistency; (4) valid story rate (VSR) controls generation quality
- Design Motivation: No single metric can fully characterize bias — direction, magnitude, cross-lingual consistency, and generation quality require multi-dimensional synthesis
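The first three metrics can be sketched in a few lines (the fourth, VSR, is just the fraction of generations that parse as valid stories). This is a minimal stdlib sketch, assuming attribute counts per gender condition are already available; the adjective counts below are invented toy numbers, not the paper's data:

```python
import math
from collections import Counter

def normalize(counts):
    """Turn raw attribute counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def directional_bias(counts_m, counts_f, category):
    """S_C = ln(P(C|g_m) / P(C|g_f)); positive means category C is more
    associated with male-conditioned stories, negative with female."""
    p_m = counts_m[category] / sum(counts_m.values())
    p_f = counts_f[category] / sum(counts_f.values())
    return math.log(p_m / p_f)

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, bounded in [0, 1]) between two
    distributions given as dicts over a shared support."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a.get(k, 0.0) * math.log2(a[k] / m[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def cosine_similarity(p, q):
    """Cosine similarity between two attribute count vectors."""
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0) * q.get(k, 0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)

# Invented toy adjective counts for male- vs. female-conditioned stories.
male_adjs = Counter({"brave": 40, "curious": 30, "kind": 30})
female_adjs = Counter({"brave": 20, "curious": 25, "kind": 55})

print(directional_bias(male_adjs, female_adjs, "kind"))  # negative: skews female
print(js_divergence(normalize(male_adjs), normalize(female_adjs)))
print(cosine_similarity(male_adjs, female_adjs))
```

In cross-lingual comparison, JSD would be computed between the same condition's distributions in two languages, and cosine similarity between per-language bias-score vectors.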
Loss & Training¶
Pure evaluation/analysis work with no model training involved. Uses the vLLM inference framework with higher sampling temperature to encourage narrative diversity.
Key Experimental Results¶
Main Results¶
| Analysis Dimension | Key Finding | Model |
|---|---|---|
| Directional bias | Communality descriptions skew toward female stories across all languages; intellect descriptions skew toward male stories in Arabic/Russian | 8B model |
| Grammatical gender effect | Llama-3.1-8B shows higher JSD (greater bias divergence) in gendered languages; Qwen-3-8B shows no significant difference | - |
| Cross-lingual consistency | Qwen-3 shows high cross-lingual cosine similarity (consistent); Llama-3 shows large bias pattern differences between English and low-resource languages | - |
| Small model effect | 1B model bias directionality approaches zero, reflecting not better safety but insufficient vocabulary diversity that collapses outputs to generic patterns | Llama-3.2-1B |
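The small-model row is worth making concrete: when a model's vocabulary collapses to a few generic words used at similar rates under every prompt condition, the log-ratio metric is driven toward zero regardless of fairness. A toy illustration (all rates invented):

```python
import math

def directional_bias(p_male, p_female):
    """S_C = ln(P(C|g_m) / P(C|g_f)) for a single attribute category C."""
    return math.log(p_male / p_female)

# Larger model: a rich vocabulary lets gendered associations surface.
# Suppose "adventurous" appears in 12% of male- and 4% of female-conditioned
# stories (invented rates):
s_large = directional_bias(0.12, 0.04)   # ln(3): strong male skew

# 1B model: outputs collapse to a few generic adjectives ("nice", "happy")
# used at nearly identical rates under every prompt condition:
s_small = directional_bias(0.34, 0.33)   # near zero: "unbiased" by collapse

print(s_large, s_small)
```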
Ablation Study¶
| Config | Effect | Note |
|---|---|---|
| Gender condition | Male → outdoor/activity words; Female → home/relationship words | Consistent across languages |
| Social class condition | Working class → practical/labor words; Affluent → leisure/aesthetic words | Qwen-3 data |
| Low-resource language | Swahili: low VSR, high JSD | Especially pronounced in 1B models |
Key Findings¶
- Bias patterns observed in English cannot be simply extrapolated to other languages, especially low-resource languages
- The relationship between model scale and bias is non-monotonic: smaller models are not "safer" but "more mediocre" (vocabulary diversity bottleneck)
- The effect of grammatical gender on bias divergence varies by model and is not a universal rule
- Qwen-3 shows higher cross-lingual consistency than Llama-3, possibly reflecting differences in multilingual coverage of training data
Highlights & Insights¶
- Full-permutation experimental design is the paper's greatest highlight: by systematically varying each social attribute dimension, the influence of each factor can be precisely isolated. This methodology is transferable to any NLP evaluation involving multi-factor analysis
- The finding that "small model bias appears low but is actually due to capability limitations" is very important: it warns against using surface distributional uniformity to assert safety, as vocabulary poverty can also produce uniform distributions
- Distribution-level bias analysis (rather than instance-level annotation) is better suited for large-scale generation scenarios, avoiding the unscalability of per-sample annotation
Limitations & Future Work¶
- All stories are generated by LLMs and cannot directly reflect bias patterns in human narratives
- Feature extraction relies on LLMs, which may introduce extraction bias themselves
- While 8 languages are representative, many low-resource languages remain uncovered
- Analysis is limited to the distributional level, without delving into individual story quality or actual impact on children
Related Work & Insights¶
- vs Biased Tales (Rooein et al., 2025): The latter covers only English + a few languages; BiasedTales-ML extends to 8-language full-permutation design
- vs StereoSet/BBQ: Static classification benchmarks; this paper analyzes bias closer to real scenarios through long-form generation
- vs Yong et al., 2025: The latter studies cross-lingual transfer of safety interventions; this paper complements with representational safety analysis in non-adversarial scenarios
Rating¶
- Novelty: ⭐⭐⭐⭐ Large-scale multilingual narrative bias analysis is a novel research direction; the full-permutation design methodology is valuable
- Experimental Thoroughness: ⭐⭐⭐⭐ 350K stories, 8 languages, 3 models, multi-dimensional analysis, but lacks comparison with human narratives
- Writing Quality: ⭐⭐⭐⭐ Clear structure, rich visualizations, but discussion section is somewhat generic