MultiMM: Cultural Bias Matters - Cross-Cultural Benchmark for Multimodal Metaphors
Conference: ACL 2025
arXiv: 2506.06987
Code: GitHub
Area: Multimodal VLM
Keywords: Multimodal Metaphor, Cultural Bias, Cross-Cultural, Sentiment Analysis, Metaphor Detection
TL;DR
Proposes MultiMM, the first cross-cultural multimodal metaphor dataset containing 8,461 Chinese-English advertisement image-text pairs with fine-grained annotations, and designs the SEMD model integrating sentiment features to enhance metaphor detection.
Background & Motivation
Background: Metaphors are ubiquitous in communication, with approximately one-third of sentences containing metaphors. Multimodal metaphors are more expressive than unimodal ones, yet existing datasets primarily originate from Western cultural backgrounds.
Limitations of Prior Work: Cultural bias in training data leads to overestimated model performance and poor generalization in non-Western cultural scenarios. There is a lack of benchmark datasets for cross-cultural multimodal metaphor research.
Key Challenge: Although conceptual metaphor mappings are largely universal, their specific linguistic and visual expressions depend heavily on cultural background (e.g., "dinosaur" connotes "outdated" in English but "ugly" in Chinese).
Goal: Construct the first cross-cultural multimodal metaphor dataset to reveal the impact of cultural bias on metaphor processing.
Key Insight: Collect image-text pairs from Chinese and English advertisements, and annotate three dimensions: metaphor occurrence, source/target domain relationships, and sentiment categories.
Core Idea: The influence of cultural bias on multimodal metaphor processing has been severely underestimated, and sentiment information can serve as a bridge for cross-cultural metaphor understanding.
Method
Overall Architecture
The dataset contains 8,461 Chinese-English advertisement image-text pairs (4,397 Chinese, 4,064 English). The accompanying SEMD model has three branches: ViT extracts image features, mBERT extracts text features, and a sentiment analysis module extracts sentiment features. The three feature vectors are concatenated and fused for classification.
Key Designs
- Data Collection and Annotation: Chinese samples were collected through Baidu search; English samples were drawn from public advertisement datasets. The annotation scheme covers metaphor occurrence (literal/metaphorical), source/target domain vocabulary, and sentiment category (positive/neutral/negative). Eight experts annotated the data, reaching Fleiss' \(\kappa = 0.73\) for metaphor occurrence and \(0.82\) for sentiment.
- SEMD Model: A three-branch architecture in which ViT encodes images, mBERT encodes text, and a sentiment analysis module extracts sentiment features (NRCLex emotion scores plus VADER sentiment scores). The three feature sets are fused by concatenation and passed through a feed-forward network to obtain the final prediction.
- Cross-Cultural Analysis: Source-domain vocabulary is distributed differently in Chinese and English advertisements. English favors lions and eagles (strength and freedom), while Chinese favors dragons and pandas (power and national pride). The same source domain may also carry different sentiments in different cultures.
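The agreement figures in the annotation bullet above are Fleiss' \(\kappa\) values. As a minimal sketch of how such a value is computed, the function below takes a table of per-item category counts. The helper name `fleiss_kappa` and the toy table (8 raters, 4 items, 2 categories) are illustrative assumptions, not data from MultiMM.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Every item must be rated by the same number of raters (at least 2).
    The value is undefined when all ratings fall into a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Per-item agreement: share of rater pairs that chose the same category.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    return (p_observed - p_expected) / (1 - p_expected)


# Toy table (hypothetical): columns are [metaphorical, literal], 8 raters per item.
toy_counts = [[8, 0], [7, 1], [1, 7], [0, 8]]
print(fleiss_kappa(toy_counts))  # 0.75 for this toy table
```

Values around 0.7 to 0.8, as reported for MultiMM, are conventionally read as substantial agreement.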
Evaluation Strategy
Evaluated on two tasks: metaphor detection and sentiment analysis, reporting Accuracy, Macro Precision, and Macro F1. Compared with 18 baseline models (8 text-based, 3 vision-based, 7 multimodal).
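To make the reported metrics concrete, here is a minimal pure-Python sketch of accuracy, macro precision, and macro F1. Macro averaging computes each score per class and then takes the unweighted mean, so a minority class counts as much as the majority class. The function name `macro_scores` and the toy labels are hypothetical; the paper's own evaluation code may differ.

```python
def macro_scores(y_true, y_pred):
    """Return accuracy, macro precision, and macro F1 for two label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, f1s = [], []

    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Scores default to 0.0 when a denominator would be zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        f1s.append(f1)

    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "macro_precision": sum(precisions) / len(labels),
        "macro_f1": sum(f1s) / len(labels),
    }


# Toy example (hypothetical labels, not MultiMM data).
gold = ["metaphorical", "metaphorical", "metaphorical", "literal"]
pred = ["metaphorical", "metaphorical", "literal", "literal"]
print(macro_scores(gold, pred))
```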
Key Experimental Results
Main Results (Metaphor Detection F1%)
| Model | English | Chinese |
|---|---|---|
| mBERT (Text) | 64.00 | 65.52 |
| ViT (Image) | 71.67 | 69.04 |
| CMGCN (Multimodal) | 79.04 | 74.91 |
| GPT-4o (Multimodal LLM) | 64.00 | 67.00 |
| LLaVA | 73.31 | 69.84 |
| Qwen2.5-VL-72B | 59.12 | 67.66 |
| SEMD (Ours) | 80.16 | 77.79 |
Ablation Study (Fusion Method + Sentiment Features)
| Sentiment Features | Fusion Method | English F1% | Chinese F1% |
|---|---|---|---|
| None | concat | 78.64 | 74.84 |
| Yes | add | 77.76 | 74.37 |
| Yes | max | 78.59 | 74.37 |
| Yes | concat | 80.88 | 77.39 |
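The three fusion methods compared in the ablation can be sketched as follows. This is a simplified illustration, not the authors' implementation: plain Python lists stand in for the image, text, and sentiment feature vectors, a single linear layer stands in for the feed-forward network, and the names `fuse` and `predict_metaphor` are hypothetical. Note that `add` and `max` need all three vectors to share one dimension, whereas `concat` keeps every feature and triples the width.

```python
import math


def fuse(image_vec, text_vec, senti_vec, mode="concat"):
    """Combine three feature vectors with one of the ablated fusion modes."""
    if mode == "concat":
        # Keeps all features; output length is the sum of the input lengths.
        return list(image_vec) + list(text_vec) + list(senti_vec)
    if not (len(image_vec) == len(text_vec) == len(senti_vec)):
        raise ValueError("'add' and 'max' need vectors of equal length")
    if mode == "add":
        return [i + t + s for i, t, s in zip(image_vec, text_vec, senti_vec)]
    if mode == "max":
        return [max(i, t, s) for i, t, s in zip(image_vec, text_vec, senti_vec)]
    raise ValueError(f"unknown fusion mode: {mode}")


def predict_metaphor(fused, weights, bias=0.0):
    """Sigmoid over a single linear layer (a stand-in for the fusion network)."""
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused vector length")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy 2-dimensional features (illustrative values only).
image, text, senti = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
print(fuse(image, text, senti, "concat"))  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(fuse(image, text, senti, "add"))     # [9.0, 12.0]
print(fuse(image, text, senti, "max"))     # [5.0, 6.0]
```

One plausible reading of the ablation is that element-wise `add` and `max` discard which branch a value came from, while `concat` lets the downstream network weight each branch separately.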
Key Findings
- The visual modality generally outperforms the textual modality (e.g., ViT at 71.67/69.04 vs. mBERT at 64.00/65.52 F1 on English/Chinese), indicating that advertising images carry rich metaphorical features.
- The multimodal models CMGCN (79.04/74.91) and SEMD (80.16/77.79) outperform the unimodal baselines, whereas large multimodal LLMs lag behind them (GPT-4o: 64.00/67.00; Qwen2.5-VL-72B: 59.12/67.66).
- Positive sentiments dominate in metaphorical content (74.75% in English, 56.42% in Chinese), and English advertisements present more extreme sentiment.
- Translation degrades metaphor detection performance, confirming the importance of cultural context in metaphor understanding.
Highlights & Insights
- Introduces the first cross-cultural multimodal metaphor dataset, addressing the lack of Eastern cultural data in metaphor research.
- Shows that sentiment, as a feature shared across cultures, can effectively improve metaphor understanding.
- Large multimodal LLMs do not dominate metaphor detection, which suggests that metaphor understanding benefits from task-specific designs.
Limitations & Future Work
- Only covers Chinese and English cultures, without extending to more diverse cultural backgrounds.
- The SEMD model architecture is simple (concat fusion); more complex cross-modal interactions can be explored.
- Sentiment features rely on existing tools (NRCLex/VADER), potentially introducing tool bias.
Related Work & Insights
- Complementary to prior work such as MultiMET, expanding from monocultural to cross-cultural analysis.
- Prompts researchers to attend to cultural bias issues in NLP systems.
- The sentiment-metaphor correlation is worth exploring in broader tasks.
Additional Technical Details
- SEMD metaphor detection prediction: \(P_{Meta} = \text{Sigmoid}(\text{Fusion}(\text{concat}(I_i, T_i, S_i)))\)
- Sentiment analysis prediction (excluding sentiment feature input): \(P_{Senti} = \text{Sigmoid}(\text{Fusion}(\text{concat}(I_i, T_i)))\)
- Model hyperparameters: embedding dimension of 768, dropout of 0.3, maximum text length of 30 tokens, batch size of 64, and learning rate between \(3 \times 10^{-5}\) and \(5 \times 10^{-4}\).
- Dataset split: train/validation/test = 80%/10%/10% (Chinese: 3517/440/440, English: 3251/406/407).
- Fleiss' \(\kappa\): Metaphor 0.73, Target Domain 0.70, Source Domain 0.66, Sentiment 0.82.
- Cultural difference findings: English metaphorical advertisements have almost no negative sentiment (0.69%), whereas Chinese metaphorical advertisements have 3.17%.
- Translation experiments: SEMD F1 drops from 80.16 to 78.57 after EN \(\rightarrow\) CN translation, and from 77.79 to 75.53 after CN \(\rightarrow\) EN translation.
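The split sizes listed above can be checked with a few lines of arithmetic. The rounding rule below (floor the 80% training share, then halve the remainder, with any odd sample going to the test set) is an assumption that happens to reproduce the reported counts; the source does not state how the authors rounded. `split_sizes` is a hypothetical helper.

```python
def split_sizes(n_total):
    """Return (train, validation, test) sizes for an 80/10/10 split."""
    n_train = n_total * 8 // 10   # floor of 80%, using integer arithmetic
    remainder = n_total - n_train
    n_val = remainder // 2        # half of the remaining ~20%
    n_test = remainder - n_val    # test set absorbs any odd sample
    return n_train, n_val, n_test


print(split_sizes(4397))  # Chinese subset: (3517, 440, 440)
print(split_sizes(4064))  # English subset: (3251, 406, 407)
```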
Rating
- Novelty: ⭐⭐⭐⭐ The first cross-cultural multimodal metaphor benchmark, offering a valuable task definition.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive analysis across multiple dimensions with 18 baselines.
- Writing Quality: ⭐⭐⭐⭐ Rich in data analysis, though the model design section is relatively straightforward.
- Value: ⭐⭐⭐⭐ Advances awareness of cultural bias and promotes more equitable NLP systems.