MINERVA-Cultural: A Benchmark for Cultural and Multilingual Long Video Reasoning¶
Conference: CVPR 2026 arXiv: 2601.10649 Code: Coming soon Area: Video Understanding / Multicultural Benchmarking Keywords: Video QA, cross-cultural understanding, multilingual reasoning, long video, evidence graph error analysis
TL;DR¶
This paper introduces MINERVA-Cultural, a benchmark comprising 2,400 manually annotated video reasoning questions spanning 18 language/region locales, and reveals severe deficiencies in cultural visual perception among state-of-the-art Video-LLMs through evidence graphs and an iterative error isolation strategy (best model Gemini-2.5-Pro: 45.07% vs. human: 95.22%).
Background & Motivation¶
- Background: Video understanding has advanced substantially, with long video comprehension emerging as a focal research area. Benchmarks such as EgoSchema, LongVideoBench, and MLVU have driven model progress, and frontier models including GPT-5 and Gemini-2.5 achieve strong performance on standard benchmarks.
- Limitations of Prior Work: (a) Existing video benchmarks are dominated by Western content and English, introducing significant evaluation bias; (b) cross-cultural benchmarks such as ViMUL-Bench rely on automatic translation while their visual content still centers on Western concepts; (c) evaluation focuses solely on final answer correctness, ignoring specific failure modes within the reasoning process.
- Key Challenge: Training data for current models is dominated by Euro-American and English content, leading to severely inadequate understanding of low-resource languages and cultures (e.g., Tamil, Telugu). Moreover, simple accuracy metrics cannot reveal precisely where in the reasoning chain a model fails.
- Goal: (a) Construct a genuinely multicultural, multilingual video reasoning benchmark annotated entirely by local experts; (b) provide human reasoning chains as diagnostic tools; (c) develop fine-grained error analysis methods to localize the root causes of model failures.
- Key Insight: Every question is required to demand "visual-cultural understanding" as a skill, tightly coupling perception with cultural knowledge. Human reasoning processes are modeled as directed acyclic graphs (DAGs), enabling iterative isolation and classification of errors.
- Core Idea: Expose and quantify systemic deficiencies of Video-LLMs in cultural visual perception through a long-video reasoning benchmark fully annotated by local experts from 18 regions (with no reliance on translation), combined with an evidence-graph-based iterative analysis methodology.
Method¶
Overall Architecture¶
MINERVA-Cultural consists of two components: (1) a benchmark dataset — 540 videos and 2,400 questions covering 18 language/region locales across 6 cultural domains, each question accompanied by a manually authored multi-step reasoning chain; and (2) a diagnostic methodology — reasoning chains are formalized as evidence graphs, and iterative error isolation is applied to precisely localize model failure points.
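To make the dataset component concrete, here is a minimal sketch of how one benchmark entry might be represented. The field names and types are illustrative assumptions, not the authors' released schema (the official code is still marked as coming soon):

```python
from dataclasses import dataclass

# Hypothetical layout for one of the 2,400 questions; all field names are
# assumptions for illustration, since the official release is still pending.
@dataclass
class MinervaCulturalEntry:
    video_id: str               # one of the 540 curated videos
    locale: str                 # e.g. "ta-IN", "ko-KR" (18 locales in total)
    domain: str                 # one of the 6 cultural domains
    question: str               # authored natively by a local cultural expert
    reference_answer: str
    reasoning_chain: list[str]  # manually written multi-step rationale,
                                # later decomposed into an evidence graph
```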
Key Designs¶
- Culture-Centric Curation:
  - Function: Ensures that every question genuinely requires deep understanding of visual-cultural content to answer correctly.
  - Mechanism: A four-stage pipeline: (1) Cultural video selection: local reviewers filter YouTube videos according to a cultural taxonomy, requiring native-language audio, culturally salient scenes, and duration exceeding one minute; (2) Difficulty calibration: 10% of samples are annotated first to verify that questions are sufficiently challenging for LLMs (not answerable from a single frame, audio alone, or common sense); (3) Correctness calibration: an independent reviewer answers each question without access to the reference answer; disagreements are resolved through revision until consensus is reached; (4) Final annotation and audit.
  - Design Motivation: Avoids the pseudo-multiculturalism prevalent in other benchmarks that combine automatic translation with Western imagery. Each question must require at least two reasoning skills, one of which must be visual-cultural understanding.
- Evidence Graph (see the data-structure sketch after this list):
  - Function: Formalizes unstructured human reasoning chains into directed acyclic graphs to precisely localize reasoning failure points.
  - Mechanism: An LLM decomposes each human reasoning chain into atomic evidence nodes, covering visual observations (with timestamps), external knowledge retrieval, and logical inference steps. Prerequisite dependency edges are established between nodes, such that an error at a given node blocks all downstream reasoning. On average, each question requires 5.0 atomic evidence nodes, and 63% of evidence nodes are anchored to specific video timestamps.
  - Design Motivation: Simple accuracy cannot distinguish between "failure to perceive cultural elements" and "logical reasoning errors." The graph structure captures causal dependencies as well as spatiotemporal relationships.
- Iterative Error Isolation (see the loop sketch after this list):
  - Function: Exhaustively identifies all model failure modes, preventing early errors from masking subsequent failures.
  - Mechanism: A three-stage loop: (1) Traversal: the model's reasoning is compared against human evidence along the evidence graph; traversal halts upon encountering a missing evidence node; (2) Error labeling: "divergences" (where the model takes a reasonable alternative path, occurring in only 2% of cases) are distinguished from "errors" (labeled according to a taxonomy: temporal localization, spatial localization, attribute misidentification, hallucination, etc.); (3) Prompted correction and re-evaluation: the model receives a corrective prompt, evaluated nodes are pruned from the graph, and unevaluated nodes are re-assessed. This loop continues until all nodes are evaluated (99.7% of cases are resolved within 5 iterations).
  - Design Motivation: A single-pass error analysis can only detect the first error, causing subsequent reasoning errors to be masked. The iterative approach uncovers an additional 22% of errors, including 78 reasoning errors that were originally obscured by perception errors.
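A minimal Python sketch of the evidence-graph structure described above, assuming nodes are atomic evidence units and edges run from prerequisites to dependents. The class and method names are illustrative, not the authors' released tooling:

```python
from collections import defaultdict, deque

# Minimal sketch of an evidence graph; the API is an assumption for
# illustration, not the paper's released code.
class EvidenceGraph:
    def __init__(self):
        self.nodes = {}                    # node_id -> metadata
        self.children = defaultdict(list)  # prerequisite -> dependents

    def add_node(self, node_id, kind, description, timestamp=None):
        # kind: "visual" (timestamped observation), "knowledge", or "inference"
        self.nodes[node_id] = {"kind": kind, "description": description,
                               "timestamp": timestamp}

    def add_edge(self, prerequisite, dependent):
        self.children[prerequisite].append(dependent)

    def topological_order(self):
        """Kahn's algorithm: yields prerequisites before their dependents."""
        indegree = {n: 0 for n in self.nodes}
        for dependents in self.children.values():
            for d in dependents:
                indegree[d] += 1
        order = []
        queue = deque(n for n, deg in indegree.items() if deg == 0)
        while queue:
            node = queue.popleft()
            order.append(node)
            for child in self.children[node]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    queue.append(child)
        return order

    def blocked_by(self, failed_node):
        """All downstream nodes blocked by a failure, mirroring the rule
        that an error at a node invalidates everything depending on it."""
        blocked, queue = set(), deque([failed_node])
        while queue:
            for child in self.children[queue.popleft()]:
                if child not in blocked:
                    blocked.add(child)
                    queue.append(child)
        return blocked
```

Under this structure, a perception error at a timestamped visual node immediately marks every inference node downstream of it as blocked, which is exactly the masking effect the iterative procedure has to work around.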
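And a schematic reconstruction of the three-stage isolation loop. Here `judge_node` and `reprompt` are caller-supplied callables standing in for LLM calls, so the sketch captures the control flow rather than the exact prompts:

```python
def isolate_errors(graph, model_reasoning, judge_node, reprompt, max_iters=5):
    """Iterative error isolation (schematic). judge_node(node, reasoning)
    returns "match", "divergence", or an error label such as
    "hallucination"; reprompt(errors, reasoning) returns the model's
    corrected reasoning after it sees a corrective prompt."""
    remaining = graph.topological_order()   # unevaluated, prerequisites first
    errors = []
    for _ in range(max_iters):              # 99.7% of cases resolve within 5
        evaluated = []
        for node_id in remaining:
            verdict = judge_node(graph.nodes[node_id], model_reasoning)
            evaluated.append(node_id)
            if verdict in ("match", "divergence"):
                continue                    # reasonable alternatives pass
            errors.append((node_id, verdict))
            break                           # traversal halts at a missing node
        remaining = [n for n in remaining if n not in evaluated]  # prune
        if not remaining:                   # all nodes evaluated: done
            break
        model_reasoning = reprompt(errors, model_reasoning)
    return errors
```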
Loss & Training¶
This is a benchmark paper; no model training is involved. Evaluation employs an LLM judge (Gemini-2.5-Flash) that scores open-ended responses on a three-point scale (0–2).
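A minimal sketch of how such a judging call might be wired up; the prompt wording and the `call_llm` wrapper are assumptions, since the paper's exact judge prompt is not reproduced here:

```python
JUDGE_PROMPT = """You are grading an open-ended answer to a video question.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Score 2 if fully correct, 1 if partially correct, 0 if incorrect.
Reply with a single digit."""

def judge_answer(question, reference, answer, call_llm):
    # call_llm wraps the judge model (Gemini-2.5-Flash in the paper);
    # the prompt wording above is illustrative, not the authors' prompt.
    reply = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, answer=answer))
    return int(reply.strip()[0])  # parse the 0/1/2 grade
```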
Key Experimental Results¶
Main Results¶
Model performance across 18 regions (accuracy %):
| Model | Aggregate | Highest Region | Lowest Region |
|---|---|---|---|
| Qwen-2.5-VL | 12.75 | en-GB (25.70) | ta-IN (3.60) |
| Qwen-3-VL | 21.50 | en-GB (34.58) | te-IN (12.40) |
| Claude-Sonnet-4 | 23.36 | en-GB (29.91) | te-IN (14.40) |
| GPT-5-mini | 36.64 | ko-KR (51.90) | ta-IN (16.40) |
| GPT-5 | 42.20 | id-ID (56.34) | te-IN (23.60) |
| Gemini-2.5-Flash | 35.84 | de-DE (51.90) | ta-IN (20.00) |
| Gemini-2.5-Pro | 45.07 | ko-KR (64.29) | te-IN (28.00) |
| Human Baseline | 95.22 | it-IT (98.24) | de-DE (90.51) |
Ablation Study¶
| Analysis Dimension | Key Finding |
|---|---|
| Audio vs. video-only | Adding audio yields an average gain of 4.32% (zh-TW: +8.15%, id-ID: +7.09%) |
| Thinking budget (tokens) | Accuracy increases from 35.9% to 45.9% as budget grows from 128 to 2k tokens, then saturates |
| Frame count (1→512) | Monotonically increasing with diminishing returns, indicating temporal reasoning is required |
| Error type analysis | 75% of errors are attributable to cultural visual perception (temporal localization + spatial localization + attribute misidentification + hallucination) |
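For the error-type row, a tiny sketch of how the perception share might be computed from isolation labels; the label strings are assumptions matching the taxonomy named above:

```python
from collections import Counter

# Labels treated as cultural visual perception errors (per the taxonomy
# above); the exact strings are assumptions for illustration.
PERCEPTION_LABELS = {"temporal localization", "spatial localization",
                     "attribute misidentification", "hallucination"}

def perception_share(error_labels):
    """Fraction of all labeled errors attributable to visual perception."""
    counts = Counter(error_labels)
    total = sum(counts.values())
    perception = sum(c for label, c in counts.items()
                     if label in PERCEPTION_LABELS)
    return perception / total if total else 0.0
```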
Key Findings¶
- Large human–machine gap: The strongest model, Gemini-2.5-Pro (45.07%), lags behind human performance (95.22%) by approximately 50 percentage points.
- Pronounced cultural disparity: The same model shows a 36-percentage-point gap between Korean (ko-KR: 64.29%) and Telugu (te-IN: 28.00%), revealing severe cultural bias.
- Indian-language locales are the weakest: ta-IN (31.60%), te-IN (28.00%), and mr-IN (38.72%) all fall far below English-locale performance.
- 75% of errors stem from cultural visual perception: Models fail not due to insufficient reasoning ability but because they cannot "see" or "interpret" culturally specific visual elements (attire, rituals, symbols, etc.).
- Iterative error analysis is essential: 22% of errors are missed in a single-pass analysis, particularly reasoning errors masked by perception errors.
- Perception errors in low-resource languages are 1.4× those in high-resource languages: Cultural perception errors in ar-EG and ta-IN exceed those in en-GB and ja-JP by approximately 40%.
Highlights & Insights¶
- Key finding that "cultural understanding is a visual perception problem": Models do not fail because they lack reasoning capability; they fail because they cannot perceive culturally specific visual elements. This pinpoints the direction for improvement — diversifying pretraining data culturally rather than merely enhancing reasoning capacity.
- Evidence graph + iterative error isolation as a diagnostic methodology: This approach provides substantially deeper analytical insight than simple accuracy metrics and is transferable to any benchmark involving multi-step reasoning (e.g., mathematical reasoning, code generation).
- High-standard fully manual annotation: Every question is annotated by a local cultural expert and calibrated by an independent reviewer, eliminating translation artifacts. This is a resource-intensive but highly valuable contribution to the research community.
Limitations & Future Work¶
- Limited scale: 2,400 questions across 18 regions yield approximately 130 questions per region, which may be insufficient to cover all cultural scenarios.
- Dependence on LLM Judge: Although majority voting mitigates the issue, LLM-based evaluation may itself carry biases toward certain languages.
- Incomplete cultural coverage: Parts of Africa, South America, and Southeast Asian minority language communities are not included.
- Future directions: (a) Expansion to additional regions and languages; (b) development of culturally aware visual pretraining data strategies; (c) open-sourcing automated diagnostic tools based on evidence graphs; (d) exploration of strategies to balance cultural distributions in pretraining data.
Related Work & Insights¶
- vs. ViMUL-Bench: ViMUL-Bench covers 14 languages but incorporates non-cultural videos and partially translated annotations. MINERVA-Cultural is annotated entirely in native languages by local experts, with video content that is uniformly culturally specific.
- vs. MINERVA: MINERVA provides reasoning chains and error taxonomies; MINERVA-Cultural extends this foundation by adding a multicultural dimension and an evidence-graph-based analysis methodology, enabling finer-grained diagnosis.
- vs. LongVideoBench / MLVU: These benchmarks target general video understanding without addressing cultural or multilingual dimensions; MINERVA-Cultural fills this gap.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First large-scale, fully manual multicultural multilingual video reasoning benchmark; evidence graph analysis methodology is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 7 models, 18 regions, multi-dimensional analysis (audio, frame count, thinking budget, error types), and human baseline.
- Writing Quality: ⭐⭐⭐⭐⭐ — Motivation is well-grounded; annotation pipeline is described in detail; analysis is deep and insightful.
- Value: ⭐⭐⭐⭐⭐ — Exposes cultural bias in AI systems, defines a critical direction for model improvement, and carries significant implications for fair AI and global deployment.