VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding

Conference: ACL 2026
arXiv: 2601.07986
Code: https://github.com/yha9806/VULCA-Bench
Area: multimodal_vlm
Keywords: Multicultural evaluation, Vision-language models, Art criticism, Cultural understanding, Cross-cultural fairness

TL;DR

VULCA-Bench moves VLM evaluation from "object recognition" toward "understanding symbols, history, and aesthetic philosophy." It spans 8 cultural traditions, pairs 7,410 images with bilingual expert critiques, and organizes cultural understanding into a five-level (L1-L5) framework. In the pilot evaluation, every tested model loses 31-40 percentage points of dimension coverage when moving from L1-L2 to L3-L5 reasoning.

Background & Motivation

Background: Mainstream evaluations for multimodal VLMs have long focused on object recognition, scene description, VQA, hallucination detection, and chart/document Q&A. These benchmarks can measure L1-level visual perception and some factual Q&A but rarely require models to interpret the cultural symbolism, historical genres, and aesthetic concepts behind an image.

Limitations of Prior Work: Cultural datasets have begun to emerge, but many still adopt QA or identification formats, which easily compress cultural understanding into factual recall. Art-related datasets like WikiArt, OmniArt, and ArtEmis cover works and styles but lack expert-level critiques, cultural dimension annotations, and cross-cultural hierarchical diagnostics. More critically, many existing datasets provide insufficient coverage of non-Western traditions, making models appear to "understand art" while they may only be familiar with Western visual vocabulary.

Key Challenge: Cultural understanding is not a single ability but a hierarchy of capabilities deepening from visual surfaces to philosophical interpretations. A model's ability to recognize plum blossoms, ink strokes, and composition does not imply it understands the symbolic resilience of plum blossoms in Chinese painting, the tradition of the "Four Gentlemen," or aesthetic concepts like "Qi-yun Sheng-dong" (spirit resonance) and "Yi-jing" (artistic conception). Current benchmarks mix these levels, allowing shallow visual capabilities to mask deep cultural deficiencies.

Goal: The authors aim to construct a cross-cultural, reproducible, and diagnostic benchmark for VLM cultural understanding. It must be sufficiently scaled and capable of distinguishing L1-L2 visual/technical analysis from L3-L5 symbolic, historical, and philosophical/aesthetic reasoning while maintaining methodological fairness across different cultures.

Key Insight: The paper selects "art criticism" as the task medium because art images naturally contain visual forms, material techniques, cultural symbols, historical contexts, and aesthetic philosophies. Compared to multiple-choice or short Q&A, generative expert critiques more effectively reveal whether a model can organize high-level cultural interpretations rather than just reciting keywords.

Core Idea: Replace single visual Q&A metrics with "Cross-cultural expert critiques + Five-level cultural understanding dimensions + Balanced evaluation subsets," allowing a VLM’s cultural understanding to be diagnosed by hierarchy, culture, and dimension.

Method

The methodology centers on benchmark construction and validation. The authors first define a hierarchical framework for cultural understanding, then collect open art images across 8 cultural traditions, have experts write bilingual (Chinese and English) critiques, and annotate cultural dimensions. Finally, they run a pilot evaluation on several VLMs to check whether the dataset exposes weaknesses in high-level cultural understanding.

Overall Architecture

The input to VULCA-Bench is an artwork and its metadata; the output is an expert critique covering five levels, accompanied by explicit cultural dimension labels. The complete process involves four steps: first, collecting images and metadata based on open museum collections; second, defining L1-L5 dimension tables for each cultural tradition; third, having experts from corresponding cultural backgrounds write critiques in Chinese and English and annotate covered_dimensions; fourth, using the Dimension Coverage Rate (DCR) to diagnose whether model-generated critiques cover these cultural dimensions.

The data covers 8 cultural traditions: Western, Chinese, Japanese, Korean, Islamic, Indian, Mural, and Hermitage. The full version contains 7,410 image-critique pairs with a total of 225 culture-specific dimensions. It also provides subsets such as Balanced, Balanced-Pilot, Gold, and Human for full evaluation, fairness analysis, and manual calibration.
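
A balanced subset of this kind can be drawn by stratified sampling, taking the same number of records from every culture. The sketch below is only an illustration of that idea, not the authors' sampling code: the `culture` field, the toy corpus sizes, and the choice of seven culture names are assumptions. The per-culture size of 48 matches the Balanced-Pilot figure reported in the pilot evaluation.

```python
import random
from collections import defaultdict

def balanced_subset(records, per_culture, seed=0):
    """Draw the same number of records from every culture (stratified sampling)."""
    by_culture = defaultdict(list)
    for rec in records:
        by_culture[rec["culture"]].append(rec)
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = []
    for culture in sorted(by_culture):
        pool = by_culture[culture]
        if len(pool) < per_culture:
            raise ValueError(f"{culture}: only {len(pool)} records, need {per_culture}")
        subset.extend(rng.sample(pool, per_culture))
    return subset

# Hypothetical toy corpus: the sizes are invented to mimic an imbalanced full set,
# and the seven culture names are picked arbitrarily for illustration.
toy_sizes = {"Western": 400, "Chinese": 300, "Japanese": 60, "Korean": 50,
             "Islamic": 55, "Indian": 70, "Mural": 48}
toy_records = [{"id": f"{c}-{i}", "culture": c}
               for c, n in toy_sizes.items() for i in range(n)]

pilot = balanced_subset(toy_records, per_culture=48)
print(len(pilot))  # 7 cultures x 48 records = 336
```

Sampling without replacement under a fixed seed keeps the subset reproducible, and the explicit size check fails loudly when a culture has too few records rather than silently under-sampling it.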

Key Designs

  1. Five-level Cultural Understanding Framework:

    • Function: Deconstruct "cultural understanding" into diagnostic capability levels to avoid conflating object recognition with philosophical/aesthetic understanding.
    • Mechanism: Drawing from Panofsky’s iconological method, the framework divides capabilities into L1 Visual Perception, L2 Technical Analysis, L3 Cultural Symbolism, L4 Historical Context, and L5 Philosophical/Aesthetic. L1-L2 primarily rely on image observation and material/technical knowledge, while L3-L5 require knowledge of symbolic traditions, art history lineages, and indigenous aesthetic theories.
    • Design Motivation: Many VLM benchmarks only test whether a model can "see," not whether it can "interpret." The five-level framework turns evaluation results into a profile rather than a single score, showing directly at which level a model begins to fail.
  2. Cultural Symmetry Principle:

    • Function: Ensure that different cultures are not forced into Western standards but retain indigenous aesthetic vocabulary under the same protocol.
    • Mechanism: The authors pursue symmetry in schema and annotation protocols rather than exact sample equality. Each culture utilizes the L1-L5 framework, uniform quality thresholds, and expert review, but dimensions for each culture reflect its own theories, such as "Qi-yun" and "Yi-jing" in Chinese painting, wabi-sabi in Japanese art, and rasa in Indian art.
    • Design Motivation: Forcing identical dimensions across all cultures would erase cultural differences, while allowing complete freedom would hinder cross-comparison. Cultural Symmetry balances comparability and indigeneity.
  3. Bilingual Expert Critiques and DCR Diagnosis:

    • Function: Provide trainable, evaluatable, and auditable cultural interpretation texts for each image.
    • Mechanism: Each expert critique must contain at least 150 characters in Chinese and 100 words in English, and must cover at least 70% of the cultural dimensions. Each record explicitly stores covered_dimensions. During evaluation, DCR approximates the share of cultural dimensions that a critique covers, formulated as \(DCR(c,k)=|D_k^c|/|D_k|\), where \(D_k\) is the dimension set of culture \(k\) and \(D_k^c\) is the subset of those dimensions covered by critique \(c\).
    • Design Motivation: Bilingual critiques preserve cultural terminology while providing English accessibility. Explicit dimension labels transform the benchmark from a collection of free text into one with reproducible diagnostic labels.
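
The quality thresholds in item 3 can be expressed as a simple record check. The sketch below is a minimal illustration under stated assumptions: only `covered_dimensions` is a field name taken from the text, while `CritiqueRecord`, `critique_zh`, `critique_en`, and `passes_thresholds` are hypothetical names, and the 70% figure is assumed to be measured against the dimension table of the record's own culture.

```python
from dataclasses import dataclass, field

@dataclass
class CritiqueRecord:
    # Hypothetical schema; only `covered_dimensions` is named in the source text.
    culture: str
    critique_zh: str
    critique_en: str
    covered_dimensions: list = field(default_factory=list)

def passes_thresholds(rec, culture_dimensions,
                      min_zh_chars=150, min_en_words=100, min_coverage=0.70):
    """Check the length and coverage thresholds described for expert critiques."""
    if not culture_dimensions:
        raise ValueError("culture_dimensions must not be empty")
    zh_chars = len("".join(rec.critique_zh.split()))  # count non-whitespace characters
    en_words = len(rec.critique_en.split())           # whitespace-delimited words
    covered = set(rec.covered_dimensions) & set(culture_dimensions)
    coverage = len(covered) / len(culture_dimensions)
    return (zh_chars >= min_zh_chars
            and en_words >= min_en_words
            and coverage >= min_coverage)

# Toy data: ten placeholder dimensions and filler text of exactly the minimum length.
dims = [f"dim_{i}" for i in range(10)]
rec_ok = CritiqueRecord("Chinese", "墨" * 150, " ".join(["word"] * 100), dims[:8])
rec_low = CritiqueRecord("Chinese", "墨" * 150, " ".join(["word"] * 100), dims[:6])
print(passes_thresholds(rec_ok, dims), passes_thresholds(rec_low, dims))
```

Intersecting `covered_dimensions` with the culture's own table means labels from another culture's table do not count toward coverage.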

Loss & Training

The paper does not propose a new training loss; its contributions are evaluation metrics and data construction protocols. The core diagnostic metric is the Dimension Coverage Rate (DCR), which estimates whether a model critique touches on each cultural dimension using keywords, synonym dictionaries, embedding similarity, and NLI verification. In the pilot, all models generated English critiques, and the authors report L1-L2 DCR, L3-L5 DCR, the gap between the two, and overall DCR.
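
To make the metric concrete, here is a minimal sketch of the keyword and synonym stage of DCR only. The embedding-similarity and NLI stages are omitted, so this is not the authors' full pipeline. The function name, the dimension identifiers, and the term lists are all hypothetical.

```python
def dimension_coverage_rate(critique, dimension_terms):
    """Return (DCR, covered dimensions) using keyword/synonym matching only.

    DCR = |covered dimensions| / |all dimensions of the culture|.
    `dimension_terms` maps each dimension id to its keywords and synonyms.
    """
    if not dimension_terms:
        raise ValueError("dimension_terms must not be empty")
    text = critique.lower()
    covered = {
        dim for dim, terms in dimension_terms.items()
        if any(term.lower() in text for term in terms)  # crude substring match
    }
    return len(covered) / len(dimension_terms), covered

# Hypothetical four-dimension table, invented for illustration.
toy_dimensions = {
    "L1_composition": ["composition", "layout"],
    "L2_brushwork": ["brushwork", "ink wash"],
    "L3_plum_symbolism": ["resilience", "four gentlemen"],
    "L5_spirit_resonance": ["qi-yun", "spirit resonance"],
}
critique = ("The sparse composition and dry brushwork depict a plum branch, "
            "a symbol of resilience in winter.")

dcr, covered = dimension_coverage_rate(critique, toy_dimensions)
print(dcr, sorted(covered))  # 0.75 and the three matched dimension ids
```

The toy critique matches three of four dimensions and misses the L5 one. Substring matching cannot detect an interpretation that is expressed without the listed terms, which is presumably why the reported pipeline adds embedding similarity and NLI checks on top.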

Key Experimental Results

Main Results

The pilot evaluation was conducted on the Balanced-Pilot subset (48 samples per culture across 7 cultures, 336 samples in total). Results were highly consistent: every model scored substantially higher on L1-L2 than on L3-L5, indicating that models can describe visuals and techniques but struggle with deep cultural symbolism and philosophical aesthetics. All values are percentages; the level gap is in percentage points.

| Model | L1-L2 DCR | L3-L5 DCR | Level Gap ΔL | Total DCR |
|---|---|---|---|---|
| Gemini-2.5-Pro | 89.2 | 58.1 | 31.1 | 72.4 |
| Qwen3-VL-235B | 85.6 | 54.3 | 31.3 | 68.7 |
| GPT-4o | 87.1 | 46.8 | 40.3 | 65.3 |
| Claude-Sonnet-4.5 | 84.3 | 48.2 | 36.1 | 64.8 |
| GLM-4V-Flash | 78.4 | 40.7 | 37.7 | 58.2 |
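
The Level Gap column is the difference between the two DCR columns, ΔL = DCR(L1-L2) − DCR(L3-L5). The short check below re-derives it from the reported figures; the variable and function names are illustrative only.

```python
# (L1-L2 DCR, L3-L5 DCR) per model, copied from the pilot results table.
pilot_dcr = {
    "Gemini-2.5-Pro": (89.2, 58.1),
    "Qwen3-VL-235B": (85.6, 54.3),
    "GPT-4o": (87.1, 46.8),
    "Claude-Sonnet-4.5": (84.3, 48.2),
    "GLM-4V-Flash": (78.4, 40.7),
}

def level_gap(low_level_dcr, high_level_dcr):
    """Gap in percentage points between L1-L2 and L3-L5 coverage."""
    return round(low_level_dcr - high_level_dcr, 1)

gaps = {model: level_gap(*scores) for model, scores in pilot_dcr.items()}
for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{model}: {gap}")
print(min(gaps.values()), max(gaps.values()))  # bounds of the 31-40 point range
```

Sorting by gap rather than by total DCR changes the picture slightly: GPT-4o ranks third on Total DCR but has the largest gap, so a single aggregate score would hide where it loses coverage.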

The dataset's scale and quality-control statistics are summarized below. Beyond the full set, the authors provide several subsets so that evaluation does not depend on a single sample set.

| Item | Value / Description | Meaning |
|---|---|---|
| Full Samples | 7,410 image-critique pairs | Supports aggregate benchmarking and training |
| Cultural Traditions | 8 | Western, Chinese, Japanese, Korean, Islamic, Indian, Dunhuang, Hermitage |
| Cultural Dimensions | 225 | Approx. 25-30 dimensions per culture |
| Bilingual Completeness | 100% | Every sample has both ZH and EN critiques |
| Cultural Fact Accuracy | 98% | Estimated by sampled expert audits |
| Balanced-Pilot | 336 samples, 7 cultures | Used for fair, low-cost pilot evaluation |

Ablation Study

The paper does not include a traditional ablation of model training. Instead, it runs several analyses of data quality, evaluation robustness, and few-shot behavior to show that the benchmark signal is not an artifact of sample length, random sampling, or proprietary embeddings.

| Analysis Item | Result | Description |
|---|---|---|
| Balanced vs Full Ranking Consistency | Spearman ρ=0.94, 95% CI [0.87, 0.98] | Small balanced subsets predict full rankings well |
| DCR Correlation with Human Dimensions | Pearson r=0.82 | DCR serves as a coarse-grained diagnostic signal |
| Keyword Hit Precision vs Expert | Approx. 78% | Noisy but sufficient for dataset-level checks |
| OpenAI embedding vs BGE | Consistency rate 86% vs 84% | Conclusions do not rely on proprietary embeddings |
| Few-shot Prompting | DeepSeek-VL2 3-shot dropped 41.3%, GPT-4o dropped 15.5% | Directly providing critiques doesn't necessarily improve understanding |
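
The ranking-consistency check compares how models are ordered on the balanced subset versus the full set. The sketch below shows the standard Spearman formula for untied scores, ρ = 1 − 6Σd² / (n(n² − 1)). It is not the authors' analysis script: the balanced scores reuse the Total DCR column from the pilot table, while the full-set scores are invented purely for illustration, so the result does not reproduce the reported ρ=0.94.

```python
def ranks(values):
    """Rank values from highest (rank 1) to lowest; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(x, y):
    """Spearman rank correlation for two equally long score lists without ties."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Balanced-subset scores: Total DCR from the pilot table, in table order.
balanced_scores = [72.4, 68.7, 65.3, 64.8, 58.2]
# HYPOTHETICAL full-set scores, invented here to show one swapped pair.
full_scores = [70.1, 69.0, 63.2, 64.0, 57.5]

rho = spearman_rho(balanced_scores, full_scores)
print(round(rho, 2))
```

With five models, a single swapped adjacent pair already lowers ρ to 0.9, which is a reminder that rank correlations over a handful of systems are coarse and that the confidence interval matters as much as the point estimate.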

Key Findings

  • The most stable finding is the level gap: all five pilot models drop 31-40 percentage points from L1-L2 to L3-L5, suggesting that "cultural depth" is not a natural byproduct of general visual perception.
  • Error types cluster into three categories: mentioning cultural terms without visual grounding, applying modern historical concepts to early works, and confusing adjacent cultural traditions (e.g., misidentifying Safavid Persian miniatures as Mughal/Rajput styles).
  • Few-shot results are intriguing: using culturally matched expert critiques as examples actually degraded some models' performance, possibly due to long-context attention dilution, imitation of format over reasoning, or expert templates restricting generative flexibility.

Highlights & Insights

  • The primary value of this paper is not just "another art dataset" but the deconstruction of cultural understanding into a hierarchical diagnostic problem, preventing a total score from masking specific model failures.
  • The Cultural Symmetry design is practical: it acknowledges the natural imbalance in sample sizes across cultures but ensures that smaller categories are not overshadowed during comparison through identical protocols and balanced subsets.
  • Expert critiques serve both as evaluation targets and as trainable resources. For future cultural VLM fine-tuning, VULCA-Bench provides supervised signals with dimension labels.
  • While DCR is coarse, it makes large-scale rapid diagnosis feasible; it can be integrated with LLM judges and human scoring into a multi-layer evaluation system.

Limitations & Future Work

  • Western and Chinese data account for 82% of the full set, reflecting the reality of museum digitization and expert resources, which increases estimation variance for minority cultures. Serious cross-cultural comparisons should prioritize the balanced subset and report confidence intervals.
  • L5 philosophical aesthetics are inherently more subjective; the authors observed higher revision rates in L5 audits than in L1-L2 audits. Future work requires stronger psychometric calibration beyond simple dimension coverage.
  • The bilingual design primarily focuses on Chinese and English; preservation of indigenous terms in Japanese, Korean, Arabic, Sanskrit/Hindi, etc., remains limited. Truly multicultural benchmarks should eventually expand to native-language critiques.
  • DCR remains a coarse, keyword/synonym-driven diagnostic vulnerable to missing implicit interpretations or being influenced by surface-level terminology. A more robust direction involves introducing expert-calibrated, judge-based rubrics.

Related Work & Insights

  • vs MME / SEED-Bench / POPE: These focus on visual perception, object hallucination, and general VQA, whereas VULCA-Bench focuses on cultural symbols and aesthetics in art, targeting higher-level evaluation.
  • vs CulturalBench / CulturalVQA / GIMMICK: The latter are mostly QA or identification tasks testing cultural facts and biases; VULCA-Bench uses generative critiques, which are closer to open-ended interpretation capabilities.
  • vs WikiArt / OmniArt / ArtEmis: These excel in style, category, or emotion but lack cross-cultural expert critiques and hierarchical dimension labels. VULCA-Bench suggests that art understanding benchmarks need an "expert interpretation structure" rather than just images and labels.
  • Inspiration for Future Research: L1-L5 can be treated as a training curriculum—first training models on visual/technical grounding, then introducing RAG or knowledge graphs for L3-L5, and finally using expert judges to calibrate the quality of cultural interpretations.

Rating

  • Novelty: ⭐⭐⭐⭐☆ Clear problem definition using hierarchical frameworks for multicultural art critiques.
  • Experimental Thoroughness: ⭐⭐⭐⭐☆ Data quality, pilot, and robustness analyses are comprehensive, though DCR is still relatively coarse.
  • Writing Quality: ⭐⭐⭐⭐☆ Clear structure with sufficient detail on data construction, though the core protocols are spread across many tables and could be presented more concisely.
  • Value: ⭐⭐⭐⭐⭐ Direct utility for multimodal cultural understanding, fairness, and art VLM evaluation.