CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity

Conference: ACL 2026
arXiv: 2604.11632
Code: https://github.com/Big-Sid/CARTBENCH-Chinese-Artwork-Benchmark
Area: Multimodal VLM/Cultural Understanding
Keywords: Chinese Art, Museum Benchmark, Vision-Language Models, Appreciation Ability, Authenticity Discrimination

TL;DR

This paper constructs CArtBench, a multi-task benchmark built on the collections of the Palace Museum, to evaluate four capabilities of VLMs in Chinese art understanding: evidence-based QA, structured appreciation, defensible re-interpretation, and authenticity discrimination. Even the strongest models degrade significantly on evidence association and style-period reasoning, and authenticity discrimination stays near chance level.

Background & Motivation

Background: VLMs are increasingly utilized as general-purpose multimodal assistants; however, their evaluation is dominated by internet images and Western-centric concepts. Although Chinese and culture-focused benchmarks have expanded, they primarily concentrate on short-text recognition and basic QA.

Limitations of Prior Work: (1) Existing benchmarks lack assessments of expert-level explanatory capabilities, which require depth of understanding anchored in culture and supported by explicit visual evidence. (2) Many visual conventions in Chinese art are period-sensitive; curator-level understanding requires linking observable cues to historical contexts. (3) Authenticity judgment is a core workflow in cultural heritage, yet the capabilities of current VLMs in this area have never been evaluated.

Key Challenge: VLMs may perform well in short-text QA, but high accuracy might mask severe deficiencies in deep capabilities such as evidence association, structured appreciation, and authenticity identification.

Goal: Construct a unified benchmark to comprehensively evaluate curator-level capabilities of VLMs in Chinese art understanding.

Key Insight: Align Wikidata entities of Palace Museum collections with authoritative catalog pages to build a museum benchmark spanning multiple dynasties and five major art categories.

Core Idea: Expand from short-text QA to four progressive task levels: evidence-anchored QA, structured appreciation, defensible interpretation, and authenticity discrimination, revealing systematic failure modes of VLMs in cultural understanding.

Method

Overall Architecture

CArtBench is constructed via a three-stage pipeline: (1) retrieving image collections of the Palace Museum from Wikidata; (2) aligning collections with official catalog descriptions; (3) expert-guided filtering and classification. Four complementary tasks are instantiated based on the constructed data.

Key Designs

1. CuratorQA: Dissecting "High Accuracy" with Difficulty Stratification and Question Classification

General VLMs often achieve high total scores on Chinese art, yet total accuracy alone fails to distinguish true understanding from surface-pattern recognition. CuratorQA thus splits 14,421 questions (covering 1,589 artworks) along two axes: difficulty levels P1 (requiring only visual evidence) and P2 (requiring art knowledge for reasoning), and six question types: subject recognition, scene classification, composition format, technique style, iconographic detection, and style-period reasoning.

This two-dimensional split localizes each failure: lower P1 scores point to weak visual evidence association, while lower P2 or style-period scores point to weak cultural-context reasoning. The Q&A pairs were generated by GPT-5.2 and reviewed by experts; a spot check of 1,000 entries found an error rate of only 0.47%, which supports the credibility of the annotations despite large-scale generation.
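The two-axis breakdown can be sketched as a grouped accuracy computation. This is a minimal illustration, not the authors' evaluation code: the record fields (`difficulty`, `qtype`, `pred`, `gold`) and the toy records are assumptions made up for the example.

```python
from collections import defaultdict


def stratified_accuracy(records):
    """Return (accuracy per (difficulty, question type) cell, overall accuracy)."""
    if not records:
        raise ValueError("records must not be empty")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        key = (r["difficulty"], r["qtype"])
        totals[key] += 1
        hits[key] += int(r["pred"] == r["gold"])
    per_cell = {key: hits[key] / totals[key] for key in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_cell, overall


# Toy records, invented for illustration; these are not benchmark data.
records = [
    {"difficulty": "P1", "qtype": "subject", "pred": "A", "gold": "A"},
    {"difficulty": "P1", "qtype": "subject", "pred": "B", "gold": "B"},
    {"difficulty": "P2", "qtype": "style_period", "pred": "A", "gold": "C"},
    {"difficulty": "P2", "qtype": "style_period", "pred": "D", "gold": "D"},
]

per_cell, overall = stratified_accuracy(records)
for key, acc in sorted(per_cell.items()):
    print(key, f"{acc:.2f}")
print("overall", f"{overall:.2f}")
```

In this toy case the overall accuracy looks reasonable while the style-period cell sits at 0.50, which is the kind of gap that a single aggregate number would hide.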

2. CatalogCaption: Probing Comprehensive Abilities with Four-Part Structured Text

Evidence-based QA only tests "point-like" recognition, whereas curator-level understanding is best demonstrated by the ability to weave visual observations, techniques, history, and aesthetics into a coherent piece of appreciation text. CatalogCaption selects 86 artworks and requires models to generate structured appreciation text containing four paragraphs: basic information, technical analysis, historical background, and aesthetic evaluation, which are then scored against authoritative Palace Museum catalog descriptions.

Long-text generation is significantly more difficult than multiple-choice questions: the model must draw on visual understanding and cultural knowledge at the same time and organize them into prose an expert would accept. This exposes shortfalls such as a high QA score paired with an inability to write a competent appreciation.
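A format compliance check for the four-part structure might look like the sketch below. The section headings, the `Heading:` delimiter convention, and the sample text are all assumptions for illustration; the summary does not specify the exact output format the benchmark requires.

```python
# Hypothetical section labels; the benchmark's actual labels may differ.
REQUIRED_SECTIONS = (
    "Basic Information",
    "Technical Analysis",
    "Historical Background",
    "Aesthetic Evaluation",
)


def split_sections(text):
    """Map each required heading to its body, or return None if the format is violated."""
    positions = []
    for name in REQUIRED_SECTIONS:
        idx = text.find(name + ":")
        if idx == -1:
            return None  # a required section is missing
        positions.append(idx)
    if positions != sorted(positions):
        return None  # sections appear out of order
    sections = {}
    for i, name in enumerate(REQUIRED_SECTIONS):
        start = positions[i] + len(name) + 1
        end = positions[i + 1] if i + 1 < len(positions) else len(text)
        body = text[start:end].strip()
        if not body:
            return None  # a section heading with no content
        sections[name] = body
    return sections


# Made-up sample output, not taken from the benchmark.
sample = (
    "Basic Information: Hanging scroll, ink on silk.\n"
    "Technical Analysis: Fine outline brushwork.\n"
    "Historical Background: Attributed to a court workshop.\n"
    "Aesthetic Evaluation: Restrained and balanced."
)

parsed = split_sections(sample)
print(parsed is not None)
print(split_sections("Basic Information: only one section"))
```

Splitting the output first also lets each paragraph be compared against the matching part of a reference description, should a per-section score be wanted.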

3. ReInterpret: Testing Non-Conventional yet Defensible Interpretations using Classic Anchors

While QA and appreciation test the "repetition of existing knowledge," ReInterpret measures a more difficult step: whether the model can offer novel interpretations that go beyond convention while still respecting the image and its cultural context. It selects 25 classic Chinese artworks frequently used in art education and training as anchors. These works have extensive, mature discussion and established interpretations, which makes them well suited to testing whether a model can diverge from standard narratives without detaching from visual evidence.

The evaluation uses a two-stage questionnaire modeled on the Torrance Tests of Creative Thinking (TTCT). The first stage is a plausibility gate that filters out outputs with severe misinterpretations, violations of art-history consensus, or factual fabrications. The second stage is manual scoring (1–5) on five dimensions: interpretation novelty, integrative coherence, evidence reasoning, elaborative expressiveness, and creative insight. The experiments show that the bottleneck is "defensibility" rather than expression quality: models improved their scores mainly by passing the first-stage gate more reliably, not by widening the gap in interpretation quality at the second stage.
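The gate-then-score protocol can be sketched as follows. How the paper aggregates the two stages is not stated here, so reporting the gate pass rate and the mean second-stage score separately is an assumption, as are the field names and the toy ratings.

```python
from statistics import mean

# Short labels for the five second-stage dimensions named above.
DIMENSIONS = ("novelty", "coherence", "evidence", "expressiveness", "insight")


def score_model(responses):
    """Return (gate pass rate, mean 1-5 quality over responses that passed the gate)."""
    passed = [r for r in responses if r["passed_gate"]]
    pass_rate = len(passed) / len(responses)
    if not passed:
        return pass_rate, None  # nothing reached the second stage
    quality = mean(mean(r["scores"][d] for d in DIMENSIONS) for r in passed)
    return pass_rate, quality


def rated(value):
    """Toy helper: a response that passed the gate with the same rating on every dimension."""
    return {"passed_gate": True, "scores": {d: value for d in DIMENSIONS}}


rejected = {"passed_gate": False, "scores": None}

# Invented ratings for two hypothetical models; not results from the paper.
model_a = [rated(3), rated(3), rejected, rejected]
model_b = [rated(3), rated(3), rated(3), rated(3)]

print(score_model(model_a))
print(score_model(model_b))
```

Keeping the two numbers apart makes the reported pattern visible: in the toy data both models have the same second-stage quality, and they differ only in how often they pass the gate.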

4. ConnoisseurPairs: A Diagnostic Stress Test Using Visually Similar Authentic-Fake Pairs

Authenticity judgment is a core aspect of cultural heritage work and represents deep reasoning that transcends surface recognition, yet it has never been included in VLM evaluations. ConnoisseurPairs constructs 10 pairs of visually highly similar authentic-fake artworks, requiring the model to judge which is authentic based on holistic consistency and subtle clues.

While small in scale, this task serves as a diagnostic stress test: it directly probes whether a model can infer authenticity from faint signals like brushwork, composition, and material, much like a connoisseur. In experiments, all models performed near random levels, exposing a blind spot in current VLMs regarding deep visual reasoning.
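To see what "near random" means with only 10 pairs, a short binomial calculation helps. The sketch assumes each pair is an independent two-way forced choice, so that guessing succeeds with probability 0.5; the summary does not describe the exact answer format, so treat this as an assumption.

```python
from math import comb


def tail_prob(n, k, p=0.5):
    """Probability of at least k successes in n independent trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_pairs = 10
for k in range(5, n_pairs + 1):
    print(f"P(at least {k}/{n_pairs} correct by guessing) = {tail_prob(n_pairs, k):.4f}")

# Smallest number of correct pairs that pure guessing reaches with probability below 5%.
threshold = min(k for k in range(n_pairs + 1) if tail_prob(n_pairs, k) < 0.05)
print("threshold:", threshold)
```

Under this assumption a guesser gets 8 or more pairs right about 5.5% of the time (56/1024), so a model needs at least 9 of 10 correct before its result is clearly distinguishable from chance. This is why the task is better read as a diagnostic probe than as a ranking.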

Loss & Training

No model training is involved. Evaluation follows a unified protocol that combines automatic metrics, format compliance checks, and expert scoring.

Key Experimental Results

Main Results

CuratorQA Overall Accuracy (9 VLMs)

| Model | Overall Accuracy | QA6 (Style-Period Reasoning) |
| --- | --- | --- |
| Qwen3-VL-235B | 0.84 | 0.56 |
| Qwen3-VL-30B | 0.80 | 0.42 |
| Qwen2.5-VL-72B | 0.81 | 0.53 |
| Qwen2.5-VL-32B | 0.80 | 0.53 |
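The drop from overall accuracy to style-period accuracy can be read directly off these four rows. The snippet below only restates the reported numbers and subtracts them; the variable names are illustrative.

```python
# (overall accuracy, QA6 style-period accuracy) as reported for the four listed models.
results = {
    "Qwen3-VL-235B": (0.84, 0.56),
    "Qwen3-VL-30B": (0.80, 0.42),
    "Qwen2.5-VL-72B": (0.81, 0.53),
    "Qwen2.5-VL-32B": (0.80, 0.53),
}

# Rounded to two decimals to avoid floating-point noise in the differences.
gaps = {model: round(overall - qa6, 2) for model, (overall, qa6) in results.items()}
for model, gap in gaps.items():
    print(f"{model}: gap = {gap:.2f}")

largest = max(gaps, key=gaps.get)
print("largest gap:", largest)
```

Every listed model loses at least 0.27 in accuracy on style-period reasoning, and the overall scores, which differ by only 0.04, conceal a spread of 0.14 on QA6.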

Ablation Study

  • High overall accuracy masks significant performance drops in evidence association (QA5) and style-period reasoning (QA6).
  • Long-text appreciation (CatalogCaption) falls far short of expert reference levels.
  • Authenticity discrimination (ConnoisseurPairs) is near random for all models, highlighting the extreme difficulty of connoisseur-level reasoning.

Key Findings

  • High VLM scores in short-text recognition can hide severe deficiencies in evidence grounding and cultural reasoning.
  • Style-period reasoning is the most difficult subtask, with the strongest model achieving only 56%.
  • Authenticity discrimination shows near-random performance, indicating a lack of connoisseur-level visual reasoning in current VLMs.
  • Significant performance differences exist across different art categories.

Highlights & Insights

  • The first museum-grade Chinese art VLM benchmark, spanning four levels: recognition, appreciation, interpretation, and authenticity.
  • Alignment with authoritative Palace Museum catalogs ensures the authority of the data.
  • The unique design of the authenticity discrimination task directly targets the blind spots of deep reasoning in VLMs.
  • The evaluation protocol is rigorously designed, combining automatic metrics with expert scoring.

Limitations & Future Work

  • ReInterpret and ConnoisseurPairs are small in scale (25 artworks and 10 pairs, respectively), so they serve as diagnostic assessments rather than large-scale evaluations.
  • Data primarily originates from the Palace Museum, which may introduce collection bias.
  • The cost of expert annotation for authenticity discrimination is extremely high, making it difficult to scale.
  • Future work could extend to more museums and diverse artistic traditions.

Related Work & Insights

  • CArtBench complements culture-aware benchmarks such as CVLUE and CulturalVQA while extending evaluation to the expert level.
  • Its tasks are complementary to ArtEmis (emotion) and MuseumQA (facts).
  • It establishes more rigorous evaluation standards for AI applications in the cultural heritage domain.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The first museum-grade Chinese art VLM benchmark covering authenticity.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation of 9 VLMs across four tasks.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with well-motivated task designs.