Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration

Conference: ACL 2025
arXiv: 2406.16469
Code: Yes (Dataset)
Area: Multimodal VLMs
Keywords: Cultural Understanding, VLM Benchmark, Korean Culture, Human-VLM Collaboration, Multiple-Choice Visual Question Answering

TL;DR

This paper proposes a semi-automated framework for constructing cultural VLM benchmarks. Through human-VLM collaboration, multiple-choice VQA samples are generated to construct the K-Viscuit dataset (657 questions) focusing on Korean culture, revealing a significant gap between open-source and closed-source VLMs in cultural understanding.

Background & Motivation

Current VLMs are primarily trained on Western-centric datasets (such as COCO and VQAv2), leading to poor performance in non-Western cultural scenarios. Building culture-aware VLM benchmarks faces the following challenges:

High Manual Annotation Cost: Manually creating VQA samples for each culture is time-consuming and resource-intensive.

Cognitive Fixation: Human annotators tend to generate a limited variety of questions, restricting data diversity.

Difficulty in Cross-Cultural Scaling: The construction methods of existing cultural benchmarks (e.g., MaRVL, GD-VCR, CVQA) are difficult to scale efficiently to new cultures.

Core Motivation: Can VLM generation capabilities be leveraged to assist human annotators, simultaneously improving efficiency and increasing question diversity, while ensuring cultural accuracy through human verification?

Method

Overall Architecture

The construction of K-Viscuit consists of four stages:

Concept Categorization: Referencing the Intercontinental Dictionary Series (IDS), 10 core concept categories are defined: Food, Drink, Play, Festival, Religion, Tool, Clothing, Heritage, Architecture, and Agriculture.
Image Selection: Native Korean annotators collect CC-licensed images from Wikimedia Commons, with each specific item appearing a maximum of twice in the same category.
Question Generation: Question-answer pairs are generated by a VLM (GPT-4-Turbo) prompted with human-written demonstrations.
Human Verification: Native Korean speakers review the generated samples for quality and cultural relevance.
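The image-selection rule in stage 2 (each specific item appears at most twice within a category) can be sketched as a simple filter. This is a minimal illustration, not the authors' tooling: the function name `cap_items_per_category`, the record fields, and the sample file names are all hypothetical.

```python
from collections import Counter


def cap_items_per_category(candidates, max_per_item=2):
    """Keep at most `max_per_item` images of the same item within a category.

    `candidates` is a list of dicts with hypothetical keys
    "file", "category", and "item"; order of arrival decides which are kept.
    """
    seen = Counter()
    kept = []
    for image in candidates:
        key = (image["category"], image["item"])
        if seen[key] < max_per_item:
            seen[key] += 1
            kept.append(image)
    return kept


# Placeholder records for illustration only (not taken from the dataset).
candidates = [
    {"file": "img_001.jpg", "category": "Food", "item": "item_a"},
    {"file": "img_002.jpg", "category": "Food", "item": "item_a"},
    {"file": "img_003.jpg", "category": "Food", "item": "item_a"},
    {"file": "img_004.jpg", "category": "Festival", "item": "item_a"},
]
selected = cap_items_per_category(candidates)
print([image["file"] for image in selected])
```

The third "Food" image of the same item is dropped, while the same item name under a different category is counted separately.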

Key Designs

1. Two-Tiered Question Design

Type	Description	Count	Avg. Word Length
Type 1 - Visual Recognition	Evaluates basic visual information (e.g., item identification)	237	10.1
Type 2 - Cultural Knowledge Application	Requires deeper cultural reasoning or multi-step inference	420	15.5

For each image, 1 Type 1 question and 1–4 Type 2 questions are created. Key advantages of this classification:
- Type 1 tests the model's capability to recognize culture-specific visual elements.
- Type 2 evaluates the depth of cultural understanding beyond simple recognition.

2. AI-Assisted Annotation Process

The VLM (GPT-4-Turbo) receives the following inputs to generate question-answer pairs:
- Target image
- Human-annotated demonstration exemplars (at least 3 per concept category)
- Detailed annotation guidelines
- Image-specific background knowledge descriptions
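How these inputs might be combined into a single text prompt can be sketched as follows. This is an assumption-laden illustration: the function `build_generation_prompt`, the field names, and the prompt wording are hypothetical and are not the authors' actual prompt. The target image itself would be attached separately through whatever multimodal API is used.

```python
def build_generation_prompt(category, demonstrations, guidelines, background,
                            min_demos=3):
    """Assemble the text part of a question-generation prompt (hypothetical layout).

    The source states that at least 3 human demonstrations are provided per
    concept category, so fewer than `min_demos` is rejected.
    """
    if len(demonstrations) < min_demos:
        raise ValueError(f"need at least {min_demos} demonstrations")

    parts = [
        f"Annotation guidelines:\n{guidelines}",
        f"Concept category: {category}",
        "Demonstrations:",
    ]
    for i, demo in enumerate(demonstrations, start=1):
        options = "; ".join(demo["options"])
        parts.append(
            f"[{i}] Q: {demo['question']}\n"
            f"    Options: {options}\n"
            f"    Answer: {demo['answer']}"
        )
    parts.append(f"Background knowledge for the target image:\n{background}")
    # Assumed instruction wording; mirrors the 1 Type 1 + 1-4 Type 2 design.
    parts.append(
        "Write one Type 1 question and one to four Type 2 questions, "
        "each with four highly similar options."
    )
    return "\n\n".join(parts)


# Placeholder demonstration content, for illustration only.
demo = {"question": "<question>", "options": ["<a>", "<b>", "<c>", "<d>"], "answer": "<a>"}
prompt = build_generation_prompt("Food", [demo] * 3, "<guidelines>", "<background>")
print(prompt)
```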

Key Constraint: The guidelines emphasize that the four options must be highly similar to one another, so that a model cannot find the answer simply by eliminating implausible distractors. This principle is also followed in human demonstrations.

3. Rigorous Human Verification

Verification checks not only factual correctness but also details such as:
- Whether the questions genuinely reflect the expected cultural nuances.
- Whether Type 2 questions actually require cultural knowledge (rather than relying solely on visual recognition).
- Whether the distractors among the options are sufficiently misleading.

Many samples generated by the VLM that were factually accurate but lacked cultural depth were discarded, ensuring the cultural resonance of the dataset.

4. English Text but Testing Cultural Understanding

All text is written in English, intentionally decoupling multicultural understanding from multilingual capability. For Korean concepts lacking English equivalents, standard Romanization transcription is applied.

Loss & Training

This work introduces a benchmark and does not involve model training. Evaluation utilizes the standard multiple-choice VQA paradigm:
- Input = Image + Question + Four Options (alphabetically ordered) + Output format instruction
- Metric: Accuracy
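The evaluation setup above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the option labels A to D, the instruction wording, and the helper names `format_mcq_prompt` and `accuracy` are hypothetical, and the image would be passed to the model alongside this text.

```python
def format_mcq_prompt(question, options):
    """Build the text part of a four-option multiple-choice prompt.

    Labels A-D and the output-format instruction are assumed wording,
    not the exact prompt used in the paper.
    """
    if len(options) != 4:
        raise ValueError("expected exactly four options")
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {option}" for letter, option in zip("ABCD", options)]
    lines.append("Answer with the letter of the correct option only.")
    return "\n".join(lines)


def accuracy(predictions, answers):
    """Percentage of predictions whose first letter matches the gold label."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(
        pred.strip().upper()[:1] == gold
        for pred, gold in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


print(format_mcq_prompt("<question>", ["<opt 1>", "<opt 2>", "<opt 3>", "<opt 4>"]))
print(accuracy(["A", "b", "C.", "D"], ["A", "B", "D", "D"]))
```

Taking only the first character of the model output is a simplification; real outputs often need more robust answer extraction.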

Key Experimental Results

Main Results

Performance of different VLMs on K-Viscuit (Table 2 Summary):

Model	Overall Acc	Food	Play	Festival	Clothing	Architecture
InstructBLIP-7B	50.84	40.85	38.46	53.19	62.16	60.55
LLaVA-1.6-13B	57.08	45.07	36.54	68.09	70.27	69.72
Llama-3.2-11B	68.04	61.27	50.00	72.34	75.68	69.72
Claude-3-opus	70.02	62.68	59.62	72.34	78.38	67.89
GPT-4-Turbo	80.82	73.94	78.85	85.11	86.49	79.82
GPT-4o	89.50	88.73	86.54	95.74	91.89	91.74

Analysis by Question Type (Table 3):

Model	Type 1 (Visual)	Type 2 (Knowledge)	Overall
InstructBLIP-7B	45.57	53.81	50.84
Llama-3.2-11B	69.20	67.38	68.04
GPT-4o	92.41	87.86	89.50

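Interesting Finding: Most models performed better on Type 2 than on Type 1 (e.g., InstructBLIP-7B: 53.81 vs. 45.57), hinting that visually recognizing items in cultural contexts is inherently challenging. Of the rows excerpted above, Llama-3.2-11B and GPT-4o are exceptions, scoring slightly higher on Type 1.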
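As a consistency check, the Overall column in the table above should equal the question-count-weighted average of the two type accuracies, using the counts from the question design table (237 Type 1, 420 Type 2). A short sketch, with variable names chosen for illustration:

```python
# Question counts per type, from the two-tiered design table.
N_TYPE1, N_TYPE2 = 237, 420

# (Type 1 acc, Type 2 acc, reported overall acc), copied from the table above.
rows = {
    "InstructBLIP-7B": (45.57, 53.81, 50.84),
    "Llama-3.2-11B": (69.20, 67.38, 68.04),
    "GPT-4o": (92.41, 87.86, 89.50),
}


def weighted_overall(type1_acc, type2_acc):
    """Accuracy over all 657 questions, weighted by per-type question counts."""
    total = N_TYPE1 + N_TYPE2
    return (type1_acc * N_TYPE1 + type2_acc * N_TYPE2) / total


for model, (t1, t2, reported) in rows.items():
    recomputed = weighted_overall(t1, t2)
    print(f"{model}: recomputed {recomputed:.2f}, reported {reported:.2f}")
```

All three recomputed values agree with the reported overall accuracy to two decimal places.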

Ablation Study

Human Evaluation (Figure 5):
- Native Koreans' average accuracy: 80.2 (SD: 2.69)
- Non-Koreans' average accuracy: 47.0 (SD: 5.95)
- GPT-4-Turbo is comparable to native Koreans, validating the effectiveness of VLM-assisted annotation.

Korean Input Test (Table 4):
- Korean-only input generally does not improve performance.
- Gemini-1.5-Pro shows improvement under bilingual English+Korean inputs (\(81.58 \rightarrow 83.41\)).

Visual Dependency Analysis (Figure 7):
- After replacing real images with Gaussian noise images, the accuracy of all models drops drastically.
- Llama-3.2-11B drops the most, while Molmo-7B-D drops the least.
- This confirms K-Viscuit indeed requires visual understanding.
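The noise-image control can be sketched with the standard library alone. The summary does not specify the noise parameters or image size, so the mean, standard deviation, and dimensions below are assumptions, and `gaussian_noise_image` is a hypothetical helper rather than the authors' code.

```python
import random


def gaussian_noise_image(width, height, mean=127.5, std=50.0, seed=0):
    """Return a height x width grid of RGB tuples sampled from a Gaussian.

    Values are rounded and clipped to the valid 8-bit range [0, 255].
    Mean and std are illustrative choices, not taken from the paper.
    """
    rng = random.Random(seed)

    def pixel():
        return tuple(
            min(255, max(0, round(rng.gauss(mean, std)))) for _ in range(3)
        )

    return [[pixel() for _ in range(width)] for _ in range(height)]


noise = gaussian_noise_image(width=8, height=4)
print(len(noise), len(noise[0]), noise[0][0])
```

In the control condition, an image like this replaces the real photo while the question and options stay unchanged, so any remaining accuracy reflects what a model can answer from text alone.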

Retrieval-Augmented Generation (Table 7, Food category):

Model	No Retrieval	Retrieval-Augmented	Oracle Document
LLaVA-1.6-7B	43.66	68.31	78.87
GPT-4-Turbo	73.94	78.17	88.73
GPT-4o	88.73	83.10	92.25

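External knowledge retrieval substantially improves open-source models (LLaVA-1.6-7B: 43.66 to 68.31), but a strong closed-source model can instead be distracted by low-quality retrieval results: GPT-4o drops from 88.73 to 83.10 with retrieval, even though it reaches 92.25 when given the oracle document.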
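The size of each effect can be read off the table directly. The sketch below only recomputes differences from the reported numbers; the variable names are illustrative.

```python
# (no retrieval, retrieval-augmented, oracle document), Food category, from the table above.
rag_results = {
    "LLaVA-1.6-7B": (43.66, 68.31, 78.87),
    "GPT-4-Turbo": (73.94, 78.17, 88.73),
    "GPT-4o": (88.73, 83.10, 92.25),
}

deltas = {}
for model, (baseline, retrieved, oracle) in rag_results.items():
    deltas[model] = {
        "retrieval_gain": round(retrieved - baseline, 2),
        "oracle_gain": round(oracle - baseline, 2),
    }
    print(model, deltas[model])
```

Retrieval helps LLaVA-1.6-7B by 24.65 points and GPT-4-Turbo by 4.23 points, but costs GPT-4o 5.63 points; every model gains with the oracle document, which points to retrieval quality rather than external knowledge itself as the problem.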

Key Findings

Massive Gap Between Closed and Open-Source: GPT-4o (89.5%) outperforms the best open-source model Llama-3.2-11B (68.0%) by 21.5 percentage points.
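"Play" Category is Most Challenging: Most models perform worst in this category (highest 86.54%, open-source highest only 50%). Among the columns shown in the main table, GPT-4-Turbo is the exception, scoring lower on Food (73.94%) than on Play (78.85%).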
Visual Recognition \(\neq\) Easy: Type 1 questions are actually harder than Type 2 for several open-source models (e.g., InstructBLIP-7B), because recognizing culture-specific items requires encountering sufficient culturally diverse samples during training.
Generative Setup is Harder: LLaVA-1.6-13B drops from 45.07% in multiple-choice (the same value as its Food-category accuracy in the main table) to 36.25% in the generative setting.
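The first two findings can be checked against the main results table. The sketch below copies the five category columns shown there (the table is a summary, so other categories are not included) and recomputes the closed/open gap and each model's weakest category.

```python
categories = ["Food", "Play", "Festival", "Clothing", "Architecture"]

# model: (overall accuracy, per-category accuracies in the order above)
scores = {
    "InstructBLIP-7B": (50.84, [40.85, 38.46, 53.19, 62.16, 60.55]),
    "LLaVA-1.6-13B": (57.08, [45.07, 36.54, 68.09, 70.27, 69.72]),
    "Llama-3.2-11B": (68.04, [61.27, 50.00, 72.34, 75.68, 69.72]),
    "Claude-3-opus": (70.02, [62.68, 59.62, 72.34, 78.38, 67.89]),
    "GPT-4-Turbo": (80.82, [73.94, 78.85, 85.11, 86.49, 79.82]),
    "GPT-4o": (89.50, [88.73, 86.54, 95.74, 91.89, 91.74]),
}

# Gap between GPT-4o and Llama-3.2-11B, in percentage points.
gap = round(scores["GPT-4o"][0] - scores["Llama-3.2-11B"][0], 2)
print("gap:", gap)

# Weakest of the five listed categories for each model.
worst_category = {
    model: categories[per_cat.index(min(per_cat))]
    for model, (_, per_cat) in scores.items()
}
print(worst_category)
```

The gap comes out at 21.46 points, which rounds to the 21.5 quoted above. Play is the weakest listed category for five of the six models; for GPT-4-Turbo it is Food.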

Highlights & Insights

Practicality of the Semi-Automated Framework: Human-VLM collaborative annotation substantially reduces costs, while VLM recommendations enhance question diversity, addressing the issue of human cognitive fixation.
Ingenious Option Design: Highly similar distractors (2,129 unique options out of 2,628) effectively prevent models from scoring through elimination.
Comprehensive Multi-Dimensional Analysis: In-depth analysis is provided from various angles, including human evaluation, language impact, visual dependency, retrieval-augmented generation, and generative evaluation.
Transferable Framework: Although the study focuses on Korean culture, the framework design can be directly applied to other cultures.
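The option counts quoted under "Ingenious Option Design" can be sanity-checked with simple arithmetic: 657 questions with four options each give the 2,628 total, of which 2,129 are unique strings.

```python
num_questions = 657
options_per_question = 4
unique_options = 2129

total_options = num_questions * options_per_question
unique_share = unique_options / total_options

print(total_options)            # 2628
print(f"{unique_share:.1%}")    # 81.0%
```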

Limitations & Future Work

Image Selection Still Requires Human Effort: Complete automation of dataset generation is not yet feasible.
Sensitivity to Option Ordering: VLMs are sensitive to option ordering; while random shuffling mitigates this, it does not completely resolve it.
Limited Cultural Coverage: The 657 samples only cover a subset of Korean culture.
Only Evaluating English Capability: Completely separating cross-cultural and cross-lingual aspects remains an idealized assumption.
Unexplored Tuning Directions: The paper only tests retrieval-augmented generation and does not explore the performance of fine-tuning open-source models on cultural data.
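The option-shuffling mitigation mentioned under "Sensitivity to Option Ordering" can be sketched as below. This is a generic illustration, not the authors' evaluation code: `shuffle_options` and the placeholder option strings are hypothetical.

```python
import random


def shuffle_options(options, answer_index, seed=None):
    """Randomly reorder options and return the new position of the gold answer."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # order[k] is the original index now sitting at position k.
    new_answer_index = order.index(answer_index)
    return shuffled, new_answer_index


options = ["<opt 1>", "<opt 2>", "<opt 3>", "<opt 4>"]
shuffled, new_index = shuffle_options(options, answer_index=2, seed=7)
print(shuffled, "gold is now option", "ABCD"[new_index])
```

Averaging accuracy over several seeds reduces the influence of any single ordering, although, as noted above, it does not remove position bias entirely.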

Related Work & Insights

MaRVL (Liu et al., 2021): Multilingual visual reasoning dataset covering 5 languages and cultures.
CVQA (Romero et al., 2024): Comprehensive multilingual VQA benchmark, compared in this paper on its Korean subset.
CLIcK (Kim et al., 2024): Cultural knowledge benchmark for Korean LLMs (text-only).
IDS (Key and Comrie, 2015): Intercontinental Dictionary Series, providing a framework for cross-cultural concept selection.
Insight: The human-AI collaborative annotation paradigm (initial AI generation followed by human filtering and refinement) can be generalized to other annotation tasks requiring expert domain knowledge.

Rating

Dimension	Score (1-5)
Novelty	3.5
Practicality	4
Experimental Thoroughness	4.5
Writing Quality	4
Overall Rating	4

The framework is well-designed, with a comprehensive and in-depth analysis. Specifically, the extended analyses on retrieval augmentation and generative evaluation are highly valuable. As a benchmark paper, the dataset size is relatively small (657 questions), but the question quality is high and the experiments are thorough.