CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity¶
Conference: ACL 2026 arXiv: 2604.11632 Code: https://github.com/Big-Sid/CARTBENCH-Chinese-Artwork-Benchmark Area: Multimodal VLM / Cultural Understanding Keywords: Chinese art, museum benchmark, vision-language models, connoisseurship, authenticity verification
TL;DR¶
This paper introduces CArtBench — a multi-task benchmark grounded in the Palace Museum collection — to evaluate VLMs across four capabilities in Chinese art understanding (evidence-anchored QA, structured connoisseurship, defensible reinterpretation, and authenticity verification). Even the strongest models exhibit significant performance drops in evidence association and style-period reasoning, while authenticity verification approaches random-chance performance.
Background & Motivation¶
Background: VLMs are increasingly deployed as general-purpose multimodal assistants, yet their evaluation is dominated by web-sourced images and Western-centric concepts. Chinese and culturally focused benchmarks have grown in number but remain concentrated on short-text recognition and QA.
Limitations of Prior Work: (1) Existing benchmarks lack evaluation of expert-level interpretive capabilities — i.e., deep understanding requiring cultural grounding and explicit visual evidence; (2) Many visual conventions in Chinese art are period-sensitive, and curatorial understanding requires linking observable cues to historical context; (3) Authenticity judgment is a core workflow in cultural heritage, yet VLM capabilities in this area have never been benchmarked.
Key Challenge: VLMs may perform well on short-text QA, but high accuracy can mask severe deficiencies in deeper capabilities such as evidence association, structured connoisseurship, and authenticity verification.
Goal: To construct a unified benchmark that comprehensively evaluates VLMs' understanding of Chinese art at a curatorial level.
Key Insight: Aligning Wikidata entities of Palace Museum objects with authoritative catalogue pages enables the construction of a museum-grade benchmark spanning multiple dynasties and five major art categories.
Core Idea: Extend beyond short-text QA to four progressively demanding task tiers — evidence-anchored QA, structured connoisseurship, defensible reinterpretation, and authenticity verification — to expose systematic failure patterns of VLMs in cultural understanding.
Method¶
Overall Architecture¶
CArtBench is constructed through a three-stage pipeline: (1) retrieving Palace Museum image collections from Wikidata; (2) aligning objects with official catalogue descriptions; and (3) expert-guided filtering and categorization. Four complementary tasks are instantiated from the curated data.
Key Designs¶
- CuratorQA (Curatorial Question Answering):
- Function: Evaluates evidence-anchored recognition and reasoning in VLMs.
- Mechanism: 14,421 questions covering 1,589 artworks, divided into two difficulty levels — P1 (visual evidence only) and P2 (requiring art knowledge) — across six question types: subject identification, scene classification, compositional format, technique and style, iconographic detection, and style-period reasoning. QA pairs are generated by GPT-4o and expert-reviewed, with an error rate of only 0.47% across 1,000 audited items.
- Design Motivation: The P1/P2 difficulty stratification and six-category typology enable precise localization of model capability gaps.
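The gap-localization idea behind this stratification can be sketched as a simple per-group aggregation. The record fields below (`qtype`, `level`, `correct`) are illustrative assumptions, not the benchmark's actual data schema:

```python
from collections import defaultdict

# Illustrative CuratorQA result records; the field names ("qtype",
# "level", "correct") are assumptions, not the benchmark's real schema.
results = [
    {"qtype": "QA1", "level": "P1", "correct": True},
    {"qtype": "QA6", "level": "P2", "correct": False},
    {"qtype": "QA6", "level": "P2", "correct": True},
    {"qtype": "QA6", "level": "P2", "correct": False},
]

def accuracy_by(records, key):
    """Per-group accuracy, so a weak sub-skill (e.g. style-period
    reasoning) is not hidden by a high overall score."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["correct"])
    return {g: hits[g] / totals[g] for g in totals}

print(accuracy_by(results, "qtype"))  # accuracy per question type
print(accuracy_by(results, "level"))  # P1 vs. P2 difficulty
```

Slicing the same predictions along both axes is what exposes failures like QA6 lagging far behind the overall score.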
- CatalogCaption (Structured Connoisseurship):
- Function: Evaluates VLMs' ability to generate four-part expert-level appreciation texts.
- Mechanism: 86 artworks, for which models must produce structured texts comprising basic information, technical analysis, historical context, and aesthetic evaluation, compared against authoritative catalogue descriptions.
- Design Motivation: Long-form generation is a more demanding task than QA, requiring models to synthesize visual understanding with cultural knowledge.
- ReInterpret (Defensible Reinterpretation):
  - Function: Evaluates VLMs' ability to produce reinterpretations of artworks that remain defensible against the visual evidence.
  - Mechanism: 25 instances; like ConnoisseurPairs, a small-scale diagnostic probe rather than a large-scale evaluation.
- ConnoisseurPairs (Authenticity Verification):
- Function: Evaluates VLMs' ability to discriminate between visually similar authentic and forged works.
- Mechanism: 10 pairs of visually similar authentic–imitation works, requiring models to identify the genuine piece based on holistic consistency and subtle cues. This serves as a diagnostic stress test.
- Design Motivation: Authenticity discrimination is a core connoisseurial skill, testing whether VLMs can move beyond surface-level recognition to deep-level reasoning.
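With only 10 pairs, separating genuine skill from lucky guessing is statistically hard; a quick binomial calculation (my own illustration, not from the paper) shows why this tier can only serve as a diagnostic:

```python
from math import comb

def p_at_least(k: int, n: int, p: float = 0.5) -> float:
    """One-sided P(X >= k) for X ~ Binomial(n, p): the probability of
    picking the genuine work in k or more of n pairs by pure guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Even 7/10 correct is not clearly above chance (p ~ 0.17), so the task
# can only flag gross failures, consistent with its framing as a
# diagnostic stress test rather than a scored leaderboard.
print(round(p_at_least(7, 10), 3))  # → 0.172
```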
Loss & Training¶
No model training is involved. Evaluation follows a unified protocol combining automated metrics, format compliance checking, and expert scoring.
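The format-compliance step of such a protocol might look like the sketch below, applied to CatalogCaption's four-part structure. The section headers and colon-separator convention are assumptions for illustration, not the benchmark's actual output format:

```python
import re

# Assumed section headers for a CatalogCaption response; the real
# benchmark's headers and separators may differ.
REQUIRED_SECTIONS = [
    "Basic Information",
    "Technical Analysis",
    "Historical Context",
    "Aesthetic Evaluation",
]

def is_format_compliant(text: str) -> bool:
    """True only if every required section header appears in order,
    each introduced by a colon and followed by non-empty content."""
    pos = 0
    for header in REQUIRED_SECTIONS:
        # Accept ASCII ":" or fullwidth "：" after the header.
        m = re.search(re.escape(header) + r"\s*[::]\s*\S", text[pos:])
        if m is None:
            return False
        pos += m.end()
    return True

ok = ("Basic Information: hanging scroll, ink on silk.\n"
      "Technical Analysis: fine-line brushwork with layered washes.\n"
      "Historical Context: attributed to a Song-dynasty court painter.\n"
      "Aesthetic Evaluation: austere, monumental composition.")
print(is_format_compliant(ok))                      # → True
print(is_format_compliant("Basic Information: x"))  # → False
```

A check like this gates the long-form outputs before automated metrics and expert scoring are applied, so models are not rewarded for fluent but unstructured responses.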
Key Experimental Results¶
Main Results¶
CuratorQA Overall Accuracy (selected models from the 9 evaluated VLMs)
| Model | Overall Accuracy | QA6 (Style-Period Reasoning) |
|---|---|---|
| Qwen3-VL-235B | 0.84 | 0.56 |
| Qwen3-VL-30B | 0.80 | 0.42 |
| Qwen2.5-VL-72B | 0.81 | 0.53 |
| Qwen2.5-VL-32B | 0.80 | 0.53 |
Analysis¶
- High overall accuracy masks significant performance drops in evidence association (QA5) and style-period reasoning (QA6).
- Long-form connoisseurship (CatalogCaption) falls considerably short of expert reference quality.
- Authenticity verification (ConnoisseurPairs) approaches random-chance performance across all models, highlighting the extreme difficulty of connoisseur-level reasoning.
Key Findings¶
- High scores on short-text recognition may conceal severe deficiencies in evidence association and cultural reasoning.
- Style-period reasoning is the most challenging sub-task; even the strongest model reaches only 56%.
- Authenticity verification yields near-random performance, indicating that current VLMs lack connoisseur-level visual reasoning.
- Substantial performance variation exists across art categories.
Highlights & Insights¶
- The first museum-grade Chinese art VLM benchmark spanning four tiers: recognition, connoisseurship, interpretation, and authenticity verification.
- Alignment with authoritative Palace Museum catalogues ensures data credibility.
- The authenticity verification task is uniquely designed to probe blind spots in VLMs' deep reasoning.
- The evaluation protocol is rigorously designed, combining automated metrics with expert scoring.
Limitations & Future Work¶
- ReInterpret and ConnoisseurPairs are small in scale (25 and 10 instances, respectively), serving as diagnostic rather than large-scale evaluations.
- Data are primarily sourced from the Palace Museum, which may introduce collection bias.
- Expert annotation for authenticity verification is extremely costly and difficult to scale.
- Future work may extend coverage to additional museums and broader artistic traditions.
Related Work & Insights¶
- Complements culturally aware benchmarks such as CVLUE and CulturalVQA, while operating at a deeper, expert-level evaluation tier.
- Forms a task-level complement to ArtEmis (affective response) and MuseumQA (factual QA).
- Establishes a more rigorous evaluation standard for AI applications in cultural heritage.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First museum-grade Chinese art VLM benchmark encompassing authenticity verification.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across 9 VLMs and four task types.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with well-motivated task design.