CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity

Conference: ACL 2026
arXiv: 2604.11632
Code: https://github.com/Big-Sid/CARTBENCH-Chinese-Artwork-Benchmark
Area: Multimodal VLM / Cultural Understanding
Keywords: Chinese art, museum benchmark, vision-language models, connoisseurship, authenticity verification

TL;DR

This paper introduces CArtBench, a multi-task benchmark grounded in the Palace Museum collection, to evaluate VLMs on four capabilities in Chinese art understanding: evidence-anchored QA, structured connoisseurship, defensible reinterpretation, and authenticity verification. Even the strongest models drop sharply on evidence association and style-period reasoning (the best model scores 0.84 overall but 0.56 on style-period reasoning), and authenticity verification is close to random chance.

Background & Motivation

Background: VLMs are increasingly deployed as general-purpose multimodal assistants, yet their evaluation is dominated by web-sourced images and Western-centric concepts. Chinese and culturally focused benchmarks have grown in number but remain concentrated on short-text recognition and QA.

Limitations of Prior Work: (1) Existing benchmarks do not evaluate expert-level interpretation, i.e., deep understanding that requires cultural grounding and explicit visual evidence; (2) Many visual conventions in Chinese art are period-sensitive, so curatorial understanding requires linking observable cues to historical context, which short-text recognition tasks do not probe; (3) Authenticity judgment is a core workflow in cultural heritage, yet VLM capabilities in this area had not previously been benchmarked.

Key Challenge: VLMs may perform well on short-text QA, but high accuracy can mask severe deficiencies in deeper capabilities such as evidence association, structured connoisseurship, and authenticity verification.

Goal: To construct a unified benchmark that comprehensively evaluates VLMs at the curatorial level of Chinese art understanding.

Key Insight: Aligning Wikidata entities of Palace Museum objects with authoritative catalogue pages enables the construction of a museum-grade benchmark spanning multiple dynasties and five major art categories.

Core Idea: Extend beyond short-text QA to four progressively demanding task tiers — evidence-anchored QA, structured connoisseurship, defensible reinterpretation, and authenticity verification — to expose systematic failure patterns of VLMs in cultural understanding.

Method

Overall Architecture

CArtBench is constructed through a three-stage pipeline: (1) retrieving Palace Museum image collections from Wikidata; (2) aligning objects with official catalogue descriptions; and (3) expert-guided filtering and categorization. Four complementary tasks are instantiated from the curated data.
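
The alignment and filtering stages can be illustrated with a minimal sketch. The record types, the title-based join key, the placeholder IDs and URLs, and the rule-based `curate` filter are all hypothetical stand-ins: this summary does not say how objects are matched to catalogue pages, and the actual third stage relies on expert judgment rather than fixed rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WikidataObject:      # hypothetical record for stage 1 output
    qid: str
    title: str
    image_url: str


@dataclass(frozen=True)
class CatalogueEntry:      # hypothetical record for an official catalogue page
    title: str
    category: str
    description: str


def normalize(title: str) -> str:
    """Keep only letters and digits, lowercased, so titles compare robustly."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def align(objects, catalogue):
    """Stage 2 sketch (assumed join key: normalized title)."""
    by_title = {normalize(e.title): e for e in catalogue}
    pairs = []
    for obj in objects:
        entry = by_title.get(normalize(obj.title))
        if entry is not None:
            pairs.append((obj, entry))
    return pairs


def curate(pairs, allowed_categories, min_description_chars=20):
    """Stage 3 stand-in: a rule-based filter in place of expert review."""
    return [
        (obj, entry)
        for obj, entry in pairs
        if entry.category in allowed_categories
        and len(entry.description) >= min_description_chars
    ]


# Placeholder data, invented for illustration only.
objects = [
    WikidataObject("Q0000001", "Along the River", "https://example.org/a.jpg"),
    WikidataObject("Q0000002", "Unmatched Vase", "https://example.org/b.jpg"),
]
catalogue = [
    CatalogueEntry("along the river!", "painting", "Handscroll, ink and colour on silk."),
]
kept = curate(align(objects, catalogue), allowed_categories={"painting"})
print([obj.qid for obj, _ in kept])
```

Objects without a catalogue match are dropped at the alignment step, so only entries backed by a catalogue description reach the filtering stage.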

Key Designs

CuratorQA (Curatorial Question Answering):
- Function: Evaluates evidence-anchored recognition and reasoning in VLMs.
- Mechanism: 14,421 questions covering 1,589 artworks, divided into two difficulty levels, P1 (visual evidence only) and P2 (requiring art knowledge), and six question types: subject identification, scene classification, compositional format, technique and style, iconographic detection, and style-period reasoning. QA pairs are generated by GPT-4o and then expert-reviewed; an audit of 1,000 items found an error rate of 0.47%.
- Design Motivation: The P1/P2 difficulty stratification and six-category typology enable precise localization of model capability gaps.
CatalogCaption (Structured Connoisseurship):
- Function: Evaluates VLMs' ability to generate four-part expert-level appreciation texts.
- Mechanism: 86 artworks, for which models must produce structured texts comprising basic information, technical analysis, historical context, and aesthetic evaluation, compared against authoritative catalogue descriptions.
- Design Motivation: Long-form generation is a more demanding task than QA, requiring models to synthesize visual understanding with cultural knowledge.
ReInterpret (Defensible Reinterpretation):
- Function: Evaluates VLMs' ability to produce defensible reinterpretations of artworks.
- Mechanism: 25 instances; like ConnoisseurPairs, it serves as a diagnostic rather than a large-scale evaluation.
ConnoisseurPairs (Authenticity Verification):
- Function: Evaluates VLMs' ability to discriminate between visually similar authentic and forged works.
- Mechanism: 10 pairs of visually similar authentic–imitation works, requiring models to identify the genuine piece based on holistic consistency and subtle cues. This serves as a diagnostic stress test.
- Design Motivation: Authenticity discrimination is a core connoisseurial skill; the task tests whether VLMs can move beyond surface-level recognition to reasoning over subtle, holistic cues.
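
The P1/P2 and question-type stratification described for CuratorQA amounts to grouping scored items by a label and computing accuracy within each group. A minimal sketch follows, assuming each item stores its difficulty level, question type, gold answer, and model prediction, and that scoring is exact match; the field names and toy records are hypothetical.

```python
from collections import defaultdict


def grouped_accuracy(records, key):
    """Accuracy per value of `key` (e.g. "level" or "qtype")."""
    hits, totals = defaultdict(int), defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        hits[group] += int(rec["prediction"] == rec["answer"])
    return {group: hits[group] / totals[group] for group in totals}


# Toy records, invented for illustration only.
records = [
    {"level": "P1", "qtype": "subject identification", "answer": "A", "prediction": "A"},
    {"level": "P1", "qtype": "subject identification", "answer": "B", "prediction": "B"},
    {"level": "P2", "qtype": "style-period reasoning", "answer": "C", "prediction": "A"},
    {"level": "P2", "qtype": "style-period reasoning", "answer": "D", "prediction": "D"},
]
by_level = grouped_accuracy(records, "level")
by_type = grouped_accuracy(records, "qtype")
overall = sum(r["prediction"] == r["answer"] for r in records) / len(records)
```

In this toy data the overall accuracy is 0.75 while the style-period group sits at 0.5, which is the kind of gap a per-type breakdown is meant to surface.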

Loss & Training

No model training is involved. Evaluation follows a unified protocol combining automated metrics, format compliance checking, and expert scoring.
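
As an illustration of what a format compliance check could look like for the four-part CatalogCaption output, here is a minimal sketch. It assumes the model writes each part as a `Name: text` line using the four section names listed above; that layout, the `check_format` helper, and the sample text are assumptions for illustration, not the paper's actual protocol.

```python
import re

# Section names follow the four parts listed for CatalogCaption; the
# "Name: text" layout is an assumed output format, not the paper's.
REQUIRED_SECTIONS = (
    "Basic Information",
    "Technical Analysis",
    "Historical Context",
    "Aesthetic Evaluation",
)


def check_format(text: str) -> dict:
    """Report missing section headers and whether the found headers are in order."""
    positions = {}
    for name in REQUIRED_SECTIONS:
        match = re.search(
            rf"^\s*{re.escape(name)}\s*:", text, flags=re.IGNORECASE | re.MULTILINE
        )
        if match:
            positions[name] = match.start()
    missing = [name for name in REQUIRED_SECTIONS if name not in positions]
    found = [positions[name] for name in REQUIRED_SECTIONS if name in positions]
    in_order = found == sorted(found)
    return {"missing": missing, "in_order": in_order, "compliant": not missing and in_order}


# Sample output, invented for illustration only.
sample = """Basic Information: Hanging scroll, ink on paper.
Technical Analysis: Dry brushwork with layered washes.
Historical Context: Associated with literati painting.
Aesthetic Evaluation: Sparse composition with strong rhythm."""
report = check_format(sample)
```

A check like this only verifies structure; judging whether the content matches the catalogue reference is left to the automated metrics and expert scoring.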

Key Experimental Results

Main Results

CuratorQA Overall Accuracy (9 VLMs evaluated; 4 shown)

| Model | Overall Accuracy | QA6 (Style-Period Reasoning) |
|---|---|---|
| Qwen3-VL-235B | 0.84 | 0.56 |
| Qwen3-VL-30B | 0.80 | 0.42 |
| Qwen2.5-VL-72B | 0.81 | 0.53 |
| Qwen2.5-VL-32B | 0.80 | 0.53 |
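
The gap between overall accuracy and style-period reasoning can be read directly off the rows above. The short calculation below uses only the reported numbers; the variable names are illustrative.

```python
# (overall accuracy, QA6 accuracy) as reported in the table above.
results = {
    "Qwen3-VL-235B": (0.84, 0.56),
    "Qwen3-VL-30B": (0.80, 0.42),
    "Qwen2.5-VL-72B": (0.81, 0.53),
    "Qwen2.5-VL-32B": (0.80, 0.53),
}

# Absolute drop from overall accuracy to QA6, rounded to two decimals.
gaps = {model: round(overall - qa6, 2) for model, (overall, qa6) in results.items()}
largest_gap_model = max(gaps, key=gaps.get)

for model, gap in gaps.items():
    print(f"{model}: {gap:.2f}")
```

Every listed model loses at least 0.27 on QA6 relative to its overall score, and the largest drop (0.38) belongs to Qwen3-VL-30B.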

Ablation Study

- High overall accuracy masks significant performance drops in evidence association (QA5) and style-period reasoning (QA6); for example, Qwen3-VL-235B scores 0.84 overall but 0.56 on QA6.
- Long-form connoisseurship (CatalogCaption) falls considerably short of expert reference quality.
- Authenticity verification (ConnoisseurPairs) approaches random-chance performance across all models, highlighting the difficulty of connoisseur-level reasoning.
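
To put "random-chance performance" on 10 pairs in perspective, an exact binomial calculation helps. The sketch below assumes each pair is a two-way choice, so a guessing model is correct with probability 0.5, and that pairs are independent; both are assumptions about the task format rather than details stated in this summary.

```python
from math import comb


def p_at_least(k: int, n: int = 10, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more correct by guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def min_correct_to_beat_chance(n: int = 10, alpha: float = 0.05) -> int:
    """Smallest k whose one-sided p-value under guessing falls below alpha."""
    for k in range(n + 1):
        if p_at_least(k, n) < alpha:
            return k
    return n + 1  # no score is significant at this alpha


threshold = min_correct_to_beat_chance()
print(threshold, p_at_least(8), p_at_least(9))
```

Under these assumptions, 8 of 10 correct still has a one-sided p-value of 56/1024 (about 0.055), so a model needs 9 of 10 before its score is distinguishable from guessing at the 0.05 level. This is consistent with treating ConnoisseurPairs as a diagnostic stress test rather than a fine-grained ranking.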

Key Findings

- High scores on short-text recognition may conceal severe deficiencies in evidence association and cultural reasoning.
- Style-period reasoning is the most challenging sub-task; even the strongest model reaches only 0.56 accuracy.
- Authenticity verification yields near-random performance, indicating that current VLMs lack connoisseur-level visual reasoning.
- Performance varies substantially across art categories.

Highlights & Insights

- The first museum-grade Chinese art VLM benchmark spanning four tiers: recognition, connoisseurship, interpretation, and authenticity verification.
- Alignment with authoritative Palace Museum catalogues ensures data credibility.
- The authenticity verification task is uniquely designed to probe blind spots in VLMs' deep reasoning.
- The evaluation protocol is rigorously designed, combining automated metrics with expert scoring.

Limitations & Future Work

- ReInterpret and ConnoisseurPairs are small in scale (25 and 10 instances, respectively), serving as diagnostic rather than large-scale evaluations.
- Data are primarily sourced from the Palace Museum, which may introduce collection bias.
- Expert annotation for authenticity verification is extremely costly and difficult to scale.
- Future work may extend coverage to additional museums and broader artistic traditions.

Related Work & Insights

- CArtBench complements culturally aware benchmarks such as CVLUE and CulturalVQA, while operating at a deeper, expert-level evaluation tier.
- It forms a task-level complement to ArtEmis (affective response) and MuseumQA (factual QA).
- It establishes a more rigorous evaluation standard for AI applications in cultural heritage.

Rating

Novelty: ⭐⭐⭐⭐⭐ First museum-grade Chinese art VLM benchmark encompassing authenticity verification.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across 9 VLMs and four task types.
Writing Quality: ⭐⭐⭐⭐ Clear structure with well-motivated task design.