CulturalBench: A Robust, Diverse, and Challenging Cultural Benchmark by Human-AI CulturalTeaming
Conference: ACL 2025
arXiv: 2410.02677
Code: HuggingFace
Area: Cultural Knowledge Evaluation / LLM Benchmarks
Keywords: Cultural Knowledge, Human-AI Co-red Teaming, Multi-region Coverage, Mode-seeking Bias, True/False Evaluation
TL;DR
CulturalBench is built with a Human-AI CulturalTeaming pipeline and comprises 1,696 human-written cultural knowledge questions, each verified independently by five annotators, spanning 45 global regions and 17 themes. On CulturalBench-Hard (True/False format), even the strongest model (OpenAI o1) reaches only 61.5% accuracy, far below human performance of 92.4%. The results reveal models' mode-seeking tendency on multi-answer questions and uneven cultural knowledge across regions.
Background & Motivation
Uneven cultural representation in LLMs is a long-standing issue, but constructing high-quality cultural knowledge benchmarks faces several challenges:
Insufficient Robustness of Existing Benchmarks:
- Inadequate quality verification: Most benchmarks only perform quality checks during intermediate data collection steps rather than verifying the entire final dataset.
- Over-reliance on web data sources: Sources like Wikipedia might have already been seen by models during pre-training.
- Risk of bias propagation in LLM-generated benchmarks.

Narrow Topic Coverage:
- Most benchmarks rely on pre-defined topics (e.g., food, dating), which fail to capture cultural elements unique to different regions.
- They cover only 1-12 topics and therefore lack diversity.

Limitations of Evaluation Formats:
- Multiple-choice formats let heuristic methods (e.g., embedding similarity between options and country names) reach 40.4% accuracy, well above the 25% random baseline, without using the question content at all.
- Models might be guessing rather than demonstrating true cultural understanding.
CulturalBench aims to address these issues by constructing a robust, diverse, and challenging benchmark.
Method
Overall Architecture
The CulturalTeaming data collection pipeline consists of three steps:
1. Red-teaming data collection (human-AI collaboration)
2. Human quality verification (five-person independent verification)
3. Majority vote filtering
Key Designs
1. Human-AI Red-Teaming Data Collection
- Function: Guides human annotators to iteratively propose cultural questions that challenge models.
- Mechanism:
- Question Construction: Annotators brainstorm culture-related scenarios based on their own cultural experiences (e.g., "Singaporeans using tissues to reserve seats"), and an AI assistant converts these scenarios into structured multiple-choice questions with four options.
- Question Verification & Refinement: Annotators challenge an AI validator on an interactive platform using the constructed questions. The platform provides refinement strategies and examples (e.g., "question reversal") to make the questions more challenging.
- Internal Filtering: Researchers filter out questions unrelated to specific regions from over 3,600 questions, retaining more than 3,000.
- Design Motivation: Adopts the concept of AI safety red-teaming to collect challenging data through human-AI competition.
- Discovery-based Topic Approach: Does not pre-define topic sets, encouraging annotators to freely explore based on their personal experiences.
2. Five-Person Independent Human Quality Verification
- Function: Every question is verified by 5 independent annotators.
- Mechanism:
- Recruit annotators via the Prolific platform, requiring each annotator's nationality and primary residence before age 18 to match the region associated with the question.
- Adopt a multi-label selection setting: Annotators can select multiple correct answers.
- Provide additional "no correct option" and "no relevant knowledge" options to prevent guessing.
- Design Motivation: The correctness of cultural knowledge is difficult to verify, necessitating expert-level human verification for the entire final dataset.
- Majority vote threshold: \(\ge 4/5\) annotator agreement.
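The verification and filtering rule above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the vote encoding (one set of selected option letters per annotator, with an empty set standing for "no correct option" or "no relevant knowledge"), and the choice to apply the 4-of-5 threshold per option are all assumptions.

```python
from collections import Counter

def majority_vote_labels(votes, min_agree=4):
    """Return the options selected by at least `min_agree` annotators.

    `votes` holds one set per annotator with the option letters that
    annotator marked as correct (multi-label selection). An empty set
    stands for "no correct option" / "no relevant knowledge" (assumed encoding).
    """
    counts = Counter(option for vote in votes for option in vote)
    return sorted(opt for opt, n in counts.items() if n >= min_agree)

def keep_question(votes, min_agree=4):
    """Keep a question only if some option reaches the agreement threshold."""
    return len(majority_vote_labels(votes, min_agree)) > 0

# Hypothetical annotations from five verifiers.
votes_single = [{"B"}, {"B"}, {"B"}, {"B", "C"}, set()]
votes_multi = [{"A", "D"}, {"A", "D"}, {"A", "D"}, {"A"}, {"A", "D"}]
votes_split = [{"A"}, {"B"}, {"A"}, {"C"}, set()]
```

Under these assumptions, `votes_single` yields a single-mode question with answer B, `votes_multi` yields a multi-mode question with answers A and D, and `votes_split` is filtered out.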
3. Dual-Format Benchmark Construction
CulturalBench-Easy (Multiple-Choice Questions):
- 1,696 four-option multiple-choice questions.
- Single-mode questions (one correct answer): Used directly.
- Multi-mode questions (multiple correct answers): Restructured into compound options (e.g., "A. (i) and (iv)") with instructions to "select all that apply".
CulturalBench-Hard (True/False):
- \(1,696 \times 4 = 6,784\) binary classification questions.
- Each of the four options from the original question becomes a True/False question.
- A question is considered correctly answered only if all four decisions are correct.
- Random baseline: \(0.5^4 = 6.25\%\)
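The Hard construction and its all-or-nothing scoring can be sketched as below. This is an illustrative sketch under stated assumptions, not the released evaluation code: the function names, the dictionary layout of an item, and the placeholder question are hypothetical.

```python
def to_true_false_items(question, options, correct):
    """Expand one multiple-choice question into one True/False item per option."""
    return [
        {"question": question, "option": text, "label": key in correct}
        for key, text in options.items()
    ]

def hard_question_correct(predictions, items):
    """A question counts as correct only if every True/False decision matches."""
    return len(predictions) == len(items) and all(
        pred == item["label"] for pred, item in zip(predictions, items)
    )

def hard_accuracy(all_predictions, all_items):
    """Fraction of questions whose binary decisions are all correct."""
    solved = sum(
        hard_question_correct(preds, items)
        for preds, items in zip(all_predictions, all_items)
    )
    return solved / len(all_items)

# Placeholder question with a single correct option, B.
items = to_true_false_items(
    "placeholder question",
    {"A": "option a", "B": "option b", "C": "option c", "D": "option d"},
    correct={"B"},
)

# Random guessing gets each of the four independent decisions right with p = 0.5.
random_baseline = 0.5 ** 4  # 0.0625, i.e. 6.25%
```

A model that answers `[False, True, False, False]` solves this question, while one wrong decision (for example `[False, True, True, False]`) scores zero for the whole question, which is what pushes the random baseline down from 25% to 6.25%.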
Topic Discovery
Through GPT-4o classification, 17 topics are identified, falling into three major categories:
- Daily Life: Food, workplace, etc.
- Social Etiquette: Greetings, social norms, etc.
- Broader Society: Celebrations, religion, etc.
Annotators from different regions focus on different topics: Italians lean toward food (38.9%), while Israelis focus on religion (23.8%).
Key Experimental Results
Main Results: Selected Models on CulturalBench-Easy and CulturalBench-Hard (29 LLMs Evaluated)
| Model | CulturalBench-Easy | CulturalBench-Hard |
|---|---|---|
| Human | 92.4% | 92.4% |
| Random | 25.0% | 6.25% |
| OpenAI o1 | 89.6% | 61.5% |
| GPT-4o | - | 60.4% |
| Claude 3.5 Sonnet | - | ~56% |
| Llama-3.1-70B | - | 54.6% |
| Llama-3.1-8B | - | 36.0% |
| GPT-3.5 Turbo | - | 34.5% |
| Cohere Aya-8b | - | 28.7% |
The gap between the best model and humans on the Hard version is 30.9 percentage points.
Ablation Study: Question Type Analysis
| Question Type | Model Average Accuracy | Best Model (o1) | Human |
|---|---|---|---|
| Single-mode (1 correct answer, \(N=1554\)) | 49.6% | ~65% | ~95% |
| Multi-mode (multiple correct answers, \(N=142\)) | 20.9% | ~20% | ~89% |
| Gap (percentage points) | 28.7 | 45.5 | 6.1 |
Models' performance drops precipitously on multi-answer questions, whereas human performance decreases only slightly.
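The per-type breakdown amounts to grouping per-question outcomes and subtracting the group accuracies. The sketch below shows that arithmetic; the record layout, the function name, and the toy outcomes are assumptions for illustration, and only the 49.6% and 20.9% figures come from the table above. The same grouping would apply to a per-region breakdown.

```python
from collections import defaultdict

def accuracy_by_group(records, key):
    """Average the boolean `correct` field within each value of `key`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        hits[record[key]] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}

# Made-up per-question outcomes, for illustration only.
toy_records = [
    {"type": "single", "correct": True},
    {"type": "single", "correct": True},
    {"type": "single", "correct": True},
    {"type": "single", "correct": False},
    {"type": "multi", "correct": True},
    {"type": "multi", "correct": False},
]

by_type = accuracy_by_group(toy_records, "type")
toy_gap_points = 100 * (by_type["single"] - by_type["multi"])

# Gap from the reported model averages, in percentage points.
reported_gap_points = round(49.6 - 20.9, 1)
```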
Regional Performance Differences
| Region | Model Average Accuracy |
|---|---|
| North America | 57.9% |
| Northern Europe | 51.8% |
| South Asia | 51.5% |
| South America | 41.5% |
| Eastern Europe | 41.5% |
| Middle East / Western Asia | 37.8% |
Heuristic Baseline Analysis
| Method | CulturalBench-Easy Accuracy |
|---|---|
| Random Guessing | 25.0% |
| Option vs. Country Name Embedding Similarity | 40.4% |
| Best Model | 89.6% |
Without reading the question at all, picking the option most similar to the country name reaches 40.4% accuracy, which indicates that the multiple-choice format of the Easy version contains shortcuts.
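A question-blind heuristic of this kind can be sketched as follows. The embedding model behind the reported baseline is not specified here, so `toy_embed` (character-trigram counts) is only a stand-in, and the function names and the example options are hypothetical.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: character-trigram counts (NOT the paper's embedding model)."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def heuristic_choice(country, options, embed=toy_embed):
    """Pick the option most similar to the country name, ignoring the question text."""
    target = embed(country)
    return max(options, key=lambda key: cosine(embed(options[key]), target))

# Made-up options; the question itself is never passed in.
example_options = {
    "A": "a handshake",
    "B": "a Japanese-style bow",
    "C": "a hug",
    "D": "a wave",
}
choice = heuristic_choice("Japan", example_options)
```

In this toy case only option B shares character trigrams with the country name, so the heuristic selects it without any access to the question. A real embedding model can be swapped in through the `embed` parameter.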
Key Findings
- CulturalBench-Hard is highly challenging: The best model achieves only 61.5%, far below the human performance of 92.4%.
- Multiple-choice format contains shortcuts: The embedding similarity heuristic achieves 40.4%, indicating that the Easy version may overestimate LLMs' cultural knowledge.
- Models' mode-seeking tendencies: Average model accuracy on multi-answer questions is 28.7 percentage points lower than on single-answer questions (20.9% vs. 49.6%), as models tend to commit to a single most likely answer.
- Positive correlation with model size: Within the same family, larger models perform better.
- Imbalanced regional performance: North America, Northern Europe, and South Asia show better performance compared to South America, Eastern Europe, and the Middle East.
- Lack of cultural advantage for local providers: Models from regional providers (Qwen and DeepSeek for East Asia, Mistral for Western Europe) do not outperform GPT-4o.
- Performance ceiling: Gains between successive versions within the same model family are shrinking, which may indicate an approaching performance bottleneck.
Highlights & Insights
- Human-AI CulturalTeaming Pipeline: Creatively applies the concept of AI safety red-teaming to cultural knowledge benchmark construction.
- Five-way Comprehensive Verification: 100% of final questions are verified by five independent annotators, whereas most comparable benchmarks only check quality at intermediate collection steps.
- Discovery-based Topic Approach: Does not pre-set topics, allowing annotators to freely explore, thereby capturing 17 diverse themes.
- Well-designed Hard Version: The True/False format reduces the heuristic shortcuts available in multiple-choice questions.
- Multi-answer Questions Reveal Mode-Seeking Bias: Exposes a fundamental weakness of LLMs in handling cultural diversity.
Limitations & Future Work
- English Only: The performance of models on cultural knowledge in local languages is not evaluated, potentially omitting scenarios of "understanding the language but not the culture".
- Small Verifier Sample Size: In some underrepresented regions (e.g., Bangladesh), active annotators on Prolific number fewer than 30, limiting recruitment to just 5 people.
- Coarse Country/Region Granularity: Cultural diversity within the same country (e.g., Wales vs. England in the UK) is not fully captured.
- Annotator Representativeness Issues: Due to limitations of the Prolific platform, certain cultural perspectives might be over- or under-represented.
- No Multimodal Testing: Limited to text-only formats, omitting visual cultural knowledge.
Related Work & Insights
- The paper systematically compares CulturalBench with cultural benchmarks such as FORK, BertaQA, CVQA, NormAd, and BLEnD.
- CulturalBench compares favorably on three dimensions: verification coverage (100% of final questions), theme diversity (17 themes), and difficulty (best model at 61.5% on the Hard version).
- The human-AI collaborative red-teaming paradigm can be generalized to the construction of other highly subjective evaluation benchmarks.
- The True/False evaluation format also serves as a reference for assessing other multiple-choice benchmarks.
Rating
- Novelty: ⭐⭐⭐⭐ (The CulturalTeaming pipeline is novel, and the Hard version design is clever)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (29 models, regional analysis, question type analysis, heuristic baseline analysis, temporal version analysis)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure, comprehensive analysis, and sufficient comparison with related work)
- Value: ⭐⭐⭐⭐⭐ (High-quality open-source benchmark, reveals systematic weaknesses in LLM cultural knowledge, and the methodology is reusable)