
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

Conference: ICLR 2026 arXiv: 2505.14462 Code: https://jiaangli.github.io/ravenea Area: Information Retrieval Keywords: Retrieval-Augmented Generation, Cultural Understanding, Multimodal Benchmark, Visual Question Answering, Image Captioning

TL;DR

This paper introduces Ravenea, the first benchmark for evaluating multimodal retrieval-augmented cultural understanding. It comprises 1,868 instances and 11,396 human-ranked Wikipedia documents, spanning 11 categories across 8 countries. The benchmark evaluates 7 multimodal retrievers and 17 VLMs, finding that culture-aware RAG yields average improvements of 6% on cVQA and 11% on cIC.

Background & Motivation

Background: VLMs perform well on general vision-language tasks but fall short in understanding cultural nuances—such as the ritual significance of traditional attire or region-specific symbols and customs. While RAG has been shown to effectively enhance cultural understanding in text-only settings, its application to multimodal cultural scenarios remains largely unexplored.

Limitations of Prior Work: (a) Existing multimodal cultural datasets primarily test VLMs' memorized cultural knowledge rather than their ability to understand culture in realistic scenarios. (b) It is unclear whether current multimodal retrievers can reliably retrieve culturally relevant documents. (c) VLM performance varies dramatically across countries and cultures, reflecting a pronounced cultural bias toward Western cultures.

Key Challenge: VLMs are increasingly deployed in culturally sensitive contexts such as education and assistive technologies, yet their cultural blind spots risk causing misunderstanding or reinforcing cultural bias—and no systematic benchmark exists to evaluate and address this capability.

Goal: (a) Construct a benchmark specifically designed to evaluate multimodal RAG for cultural understanding; (b) assess the cultural retrieval capability of existing retrievers; (c) quantify the gains that RAG provides to VLMs on cultural understanding tasks.

Key Insight: Building on two existing cultural datasets, CVQA and CCUB, the authors perform BM25-based initial retrieval followed by human re-ranking annotation, attaching culturally relevant Wikipedia documents to each image and constructing a retrieval-augmented evaluation pipeline.

Core Idea: By constructing a multimodal RAG benchmark grounded in human-annotated culturally relevant documents, the paper reveals substantial improvements in VLM cultural understanding enabled by culture-aware retrieval.

Method

Overall Architecture

The data construction pipeline proceeds as follows: (1) culturally relevant images and QA/caption pairs are sourced from CVQA/CCUB; (2) GPT-4o generates cultural descriptions to serve as queries; (3) BM25 retrieves the top-10 documents from 6 million Wikipedia articles; (4) human annotators label and re-rank the retrieved results for cultural relevance. The evaluation pipeline feeds retrieved cultural documents into VLMs to complete either cVQA or cIC tasks.
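As a rough illustration of stages (2) and (3), the sketch below runs BM25 over a toy document list with the rank_bm25 package; the corpus, the query text, and helper names are placeholders standing in for the paper's ~6M-article Wikipedia corpus and GPT-4o-generated descriptions, not the authors' released pipeline.

```python
# Minimal sketch: a cultural description is used as a text query for BM25
# retrieval over Wikipedia articles. Corpus and query are illustrative only.
from rank_bm25 import BM25Okapi

def tokenize(text: str) -> list[str]:
    return text.lower().split()

# Hypothetical corpus of (title, article_text) pairs.
corpus = [
    ("Hanbok", "Hanbok is the traditional attire of Korea worn on festive days ..."),
    ("Eiffel Tower", "The Eiffel Tower is a wrought-iron lattice tower in Paris ..."),
]
bm25 = BM25Okapi([tokenize(text) for _, text in corpus])

def retrieve_top_k(cultural_description: str, k: int = 10) -> list[tuple[str, float]]:
    """Return the top-k article titles with their BM25 scores."""
    scores = bm25.get_scores(tokenize(cultural_description))
    ranked = sorted(zip((title for title, _ in corpus), scores), key=lambda x: -x[1])
    return ranked[:k]

# Example query: a short description of the image's cultural content.
print(retrieve_top_k("traditional Korean attire worn during festivals"))
```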

Key Designs

  1. Three-Dimensional Cultural Relevance Annotation:

    • Function: Decomposes "cultural relevance" into three independently verifiable binary dimensions.
    • Mechanism: Each image–document pair is annotated along three dimensions: (a) country association (True/False/Uncertain), (b) cultural content relevance, and (c) visual element relevance. The three dimensions are assessed independently to reduce annotation ambiguity.
    • Design Motivation: "Cultural relevance" as a monolithic concept is too vague. Decomposing it improves annotation consistency (Cohen's \(\kappa = 0.83\)) and enables finer-grained analysis.
  2. Culture-Aware Contrastive (CAC) Learning:

    • Function: Fine-tunes CLIP/SigLIP to improve cultural retrieval capability.
    • Mechanism: A combination of three losses — \(\mathcal{L}_{\text{CAC}} = \frac{1}{3}(\mathcal{L}_{\text{Culture Classify}} + \mathcal{L}_{\text{Rank}} + \mathcal{L}_{\text{Diversity}})\). The classification loss uses sigmoid binary cross-entropy to determine whether a document is culturally relevant; the ranking loss applies margin ranking to ensure relevant documents score higher than irrelevant ones; the diversity loss prevents the text embeddings of positive samples from collapsing.
    • Design Motivation: Standard contrastive learning does not distinguish cultural relevance; explicit cultural supervision signals are needed to guide the retriever.
  3. RegionScore Evaluation Metric:

    • Function: Quantifies whether generated captions contain correct geographic or cultural region references.
    • Mechanism: Checks whether the target country name or its corresponding adjective/demonym appears in the generated caption. A simple binary match: \(R(\mathbf{g}^{(i)}, I_i) = 1\) if the correct region term appears in the caption.
    • Design Motivation: Existing metrics (ROUGE-L, CIDEr, BERTScore, CLIPScore) show weak or even negative correlation with human judgments of cultural accuracy. RegionScore achieves a Kendall \(\tau\) of 0.442 with human judgments—statistically significant and substantially higher than all other metrics.
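A minimal sketch of the RegionScore check described above; the country-to-demonym table here is a small illustrative assumption, not the paper's full term list.

```python
# RegionScore sketch: binary check for the target country name or its
# demonym/adjective in the generated caption.
REGION_TERMS = {
    "india": {"india", "indian"},
    "nigeria": {"nigeria", "nigerian"},
    "mexico": {"mexico", "mexican"},
}

def region_score(caption: str, country: str) -> int:
    """Return 1 if any region term for `country` appears in the caption, else 0."""
    tokens = set(caption.lower().replace(",", " ").replace(".", " ").split())
    return int(bool(REGION_TERMS[country.lower()] & tokens))

assert region_score("A woman wearing a traditional Mexican dress.", "Mexico") == 1
assert region_score("A woman wearing a colorful dress.", "Mexico") == 0
```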

Loss & Training

CAC training fine-tunes CLIP/SigLIP encoders on Ravenea annotation data, combining the three losses with equal weights. Annotation quality is ensured through multiple rounds of independent labeling plus meta-checker verification (98.2% acceptance rate), with annotators trained via detailed guidelines and mock tests.
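A hedged PyTorch sketch of how the three equally weighted terms could be combined over one image and its candidate documents; the margin value, the use of cosine similarities as classification logits, and the exact form of the diversity penalty are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def cac_loss(img_emb: torch.Tensor, doc_embs: torch.Tensor,
             relevance: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """img_emb: (d,) and doc_embs: (n, d), both L2-normalized;
    relevance: (n,) binary cultural-relevance labels from the annotations."""
    sims = doc_embs @ img_emb          # cosine similarities, shape (n,)
    rel = relevance.bool()

    # (1) Classification: sigmoid BCE on whether each document is culturally relevant.
    l_cls = F.binary_cross_entropy_with_logits(sims, relevance.float())

    # (2) Margin ranking: relevant documents should outscore irrelevant ones by `margin`.
    pos, neg = sims[rel], sims[~rel]
    l_rank = (F.relu(margin - (pos.unsqueeze(1) - neg.unsqueeze(0))).mean()
              if len(pos) and len(neg) else sims.new_zeros(()))

    # (3) Diversity: keep positive document embeddings from collapsing together
    #     by penalizing their average pairwise similarity.
    if rel.sum() > 1:
        pos_embs = doc_embs[rel]
        gram = pos_embs @ pos_embs.T
        n_pos = pos_embs.shape[0]
        l_div = (gram.sum() - gram.diagonal().sum()) / (n_pos * (n_pos - 1))
    else:
        l_div = sims.new_zeros(())

    return (l_cls + l_rank + l_div) / 3.0

# Toy usage with random normalized embeddings and three labeled candidates.
img = F.normalize(torch.randn(512), dim=0)
docs = F.normalize(torch.randn(3, 512), dim=1)
labels = torch.tensor([1, 1, 0])
print(cac_loss(img, docs, labels))
```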

Key Experimental Results

Main Results

Retrieval performance (7 retrievers):

Retriever                MRR↑    P@1↑    nDCG@5↑
CLIP-L/14 (frozen)       75.44   60.87   78.09
SigLIP2 (frozen)         68.62   54.66   71.44
LLaVA-OV-7B              58.85   37.48   60.34
Ravenea-CLIP (ours)      82.17   72.05   84.09
Ravenea-SigLIP (ours)    70.95   57.14   73.92
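For reference, the ranking metrics in this table can be computed from per-query relevance lists as in the sketch below; it assumes binary relevance in rank order, whereas the benchmark's nDCG is computed against the human-ranked annotations.

```python
import math

def mrr(relevances: list[list[int]]) -> float:
    """Mean reciprocal rank of the first relevant document per query."""
    rr = []
    for rel in relevances:
        rank = next((i + 1 for i, r in enumerate(rel) if r), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)

def precision_at_1(relevances: list[list[int]]) -> float:
    """Fraction of queries whose top-ranked document is relevant."""
    return sum(rel[0] for rel in relevances) / len(relevances)

def ndcg_at_k(relevances: list[list[int]], k: int = 5) -> float:
    """Normalized discounted cumulative gain over the top-k positions."""
    def dcg(rel):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rel[:k]))
    scores = []
    for rel in relevances:
        ideal = dcg(sorted(rel, reverse=True))
        scores.append(dcg(rel) / ideal if ideal > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy example: two queries, binary relevance of retrieved documents in rank order.
ranked = [[0, 1, 0, 0, 1], [1, 0, 0, 0, 0]]
print(mrr(ranked), precision_at_1(ranked), ndcg_at_k(ranked, k=5))
```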

Downstream tasks (17 VLMs, with vs. without RAG):

  • cVQA average improvement: +6%
  • cIC average improvement: +11% (RegionScore)
  • Lightweight models benefit more from RAG
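To illustrate the retrieval-augmented setup used for these downstream numbers, the sketch below prepends top-ranked documents to a cVQA prompt; the prompt wording and option formatting are assumptions, not the paper's exact template, and the VLM call itself is omitted.

```python
# Hypothetical prompt assembly for retrieval-augmented cVQA: top-k retrieved
# cultural documents are placed before the multiple-choice question.
def build_cvqa_prompt(docs: list[str], question: str, options: list[str], k: int = 3) -> str:
    context = "\n\n".join(docs[:k])
    choices = "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n{choices}\n"
        "Answer with the letter of the correct option."
    )

prompt = build_cvqa_prompt(
    docs=["Hanbok is the traditional attire of Korea worn on festive days ..."],
    question="What is the garment shown in the image called?",
    options=["Kimono", "Hanbok", "Sari", "Cheongsam"],
)
print(prompt)
```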

Ablation Study

  • Retriever type: Contrastive architectures (CLIP/SigLIP) are naturally suited to retrieval; generative models (LLaVA, VL-T5) are not.
  • Cultural fine-tuning: Ravenea-CLIP P@1 improves from 60.87 to 72.05 (+11.18 points), demonstrating the value of cultural supervision signals.
  • Cross-country variation: VLM performance varies substantially across countries; each model exhibits distinct "cultural preferences".
  • Metric comparison: RegionScore achieves the highest correlation with human judgments (\(\tau = 0.442\)); traditional metrics show negative correlation.

Key Findings

  • The fine-tuned contrastive retriever (Ravenea-CLIP) achieves state-of-the-art results on all metrics, with P@1 improving by more than 11 points.
  • Cultural RAG provides greater benefit to lightweight models—external knowledge compensates more substantially for the knowledge gaps of smaller models.
  • Different VLMs exhibit distinct "cultural preferences"—certain models understand specific national cultures significantly better than others.
  • Conventional automatic evaluation metrics fail to capture cultural accuracy; RegionScore is a meaningful, if preliminary, alternative.
  • Generative retrieval models (LLaVA-OV-7B) unexpectedly underperform discriminative models (CLIP) on cultural retrieval, likely due to a mismatch between their training objective and retrieval requirements.

Highlights & Insights

  • Filling a Gap: This is the first benchmark to systematically evaluate multimodal RAG for cultural understanding. The large-scale experimental setup (7 retrievers × 17 VLMs × 8 countries × 2 tasks) yields comprehensive empirical findings.
  • RegionScore Insight: A simple region-word match outperforms complex semantic metrics in reflecting cultural accuracy—this "simpler is better" finding exposes a blind spot in the existing evaluation paradigm with respect to cultural dimensions.
  • Simplicity and Effectiveness of Cultural Fine-Tuning: Three straightforward contrastive losses suffice to improve retrieval P@1 by more than 11 points, suggesting that explicit cultural supervision signals, rather than larger models, are the key ingredient.
  • Cross-Cultural Bias Analysis: Each VLM exhibits a unique pattern of cultural bias, with important implications for fairness research—future work should develop calibration methods targeted at cultural bias.

Limitations & Future Work

  • Only 8 countries are covered; with 200+ countries in the world, many cultures (e.g., African, Middle Eastern, Pacific Islander) are unrepresented.
  • Wikipedia as the sole external knowledge source introduces bias, as Wikipedia's coverage is itself uneven across different cultures.
  • RegionScore only checks whether the correct country or region term is mentioned and cannot assess the accuracy of cultural details (e.g., whether the specific meaning of a ritual is correctly described).
  • Retrieval is conducted exclusively using English documents; cross-lingual cultural retrieval is not explored.
  • Although annotation quality is high, annotators may themselves hold biased understandings of certain cultures.
  • The cVQA task uses a multiple-choice format, which may not reflect open-ended cultural reasoning ability.

Comparison & Practical Implications

  • vs. CVQA (Romero et al., 2025): CVQA provides only QA pairs without external knowledge; Ravenea extends it with human-ranked Wikipedia documents to support RAG evaluation.
  • vs. CCUB (Liu et al., 2023): CCUB focuses on cultural descriptions for text-to-image generation; Ravenea reverses the task direction (image→text) and incorporates retrieval augmentation.
  • vs. Seo et al. (2025): Their work studies RAG for cultural understanding in a text-only setting; Ravenea extends this to the multimodal domain.
  • Practical Implications: In any multimodal system operating in culturally sensitive contexts—such as cultural heritage preservation or multicultural educational assistance—explicit culture-aware retrieval augmentation warrants serious consideration.

Rating

  • Novelty: ⭐⭐⭐⭐ First multimodal RAG benchmark for cultural understanding, filling an important gap
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Large-scale evaluation across 7 retrievers × 17 VLMs with multi-dimensional analysis
  • Writing Quality: ⭐⭐⭐⭐ Well-organized, though the dataset construction section is slightly verbose
  • Value: ⭐⭐⭐⭐ Sustained value for VLM cultural fairness research, though limited to 8 countries