AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models

Conference: ACL 2025
arXiv: 2406.09295
Code: https://github.com/THUDM/AlignMMBench
Area: Multimodal VLM / Evaluation Benchmark
Keywords: VLM Evaluation, Chinese Multimodal, alignment benchmark, CritiqueVLM, prompt robustness

TL;DR

Proposes AlignMMBench, the first multimodal alignment evaluation benchmark for Chinese visual contexts, covering 13 tasks across 3 major categories, with 1054 images and 4978 QA pairs (single-turn and multi-turn dialogues). The authors also train CritiqueVLM, a ChatGLM3-6B-based evaluator whose scores agree with human ratings more closely than GPT-4's.

Background & Motivation

Background: VLMs exhibit strong visual understanding capabilities after SFT and RLHF alignment training. Chinese VLMs (e.g., QwenVL, CogVLM, InternVL) have approached GPT-4o on public leaderboards.

Limitations of Prior Work: Existing benchmarks (e.g., MME, MMBench, MMMU) primarily evaluate fundamental capabilities using closed-form formats such as yes/no or multiple-choice questions. They lack fine-grained, open-ended evaluation of alignment performance, and few are tailored to Chinese visual scenes.

Key Challenge: Chinese multimodal corpora are scarce and difficult to annotate (due to higher ambiguity in Chinese contexts), requiring iterative validation by multiple annotators; moreover, significant differences in image features and cultural backgrounds between Chinese and English mean that English-only datasets cannot comprehensively evaluate Chinese VLMs.

Goal: To build a high-quality Chinese multimodal alignment evaluation benchmark covering three major dimensions—perception and understanding, reasoning and analysis, and dialogue context—while addressing the difficulty of evaluating open-ended questions.

Key Insight: Manually collecting Chinese images from real-world scenarios and internet resources, designing prompt paraphrase strategies to generate semantically equivalent but differently phrased variant questions, and training a dedicated evaluator to replace GPT-4.

Core Idea: Build a reproducible and controllable Chinese multimodal alignment evaluation system using carefully curated Chinese visual scenes, prompt paraphrase strategies, and a rule-calibrated small evaluator.

Method

Overall Architecture

AlignMMBench consists of three core components:
- Evaluation Dataset: 1054 images, 4978 QA pairs, covering 13 tasks across 3 major categories.
- Evaluator CritiqueVLM: An automatic scoring model fine-tuned based on ChatGLM3-6B.
- Alignment Score Metric: Measures model robustness under different prompt variants.

Key Designs

Dataset Construction (3 Categories, 13 Tasks):
- Function: Covers perception & understanding (description, recognition, counting, OCR, memory, knowledge), reasoning & analysis (reasoning, charts, programming, comparison, writing), and dialogue context (coherent/incoherent multi-turn dialogues).
- Mechanism: ① Define task types $ \rightarrow $ ② Collect images from Chinese websites like Baidu via web scrapers $ \rightarrow $ ③ Manually filter low-quality images $ \rightarrow $ ④ Create seed questions $ \rightarrow $ ⑤ Paraphrase via LLM to generate semantically equivalent variants $ \rightarrow $ ⑥ Manually annotate reference answers $ \rightarrow $ ⑦ Conduct two-stage quality review.
- Design Motivation: Chinese multimodal corpora are extremely scarce and cannot leverage existing datasets like VQAv2 as in English, requiring construction from scratch; dialogue context tasks (especially incoherent ones) evaluate the VLM's ability to detect errors, which is critical in practical applications.
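The three-category, 13-task layout described above can be sketched as a small schema. This is only an illustration: the task names are copied from this summary rather than from the official repository, and `TAXONOMY`, `BenchmarkItem`, and `task_count` are hypothetical names, not part of the released code.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Category -> task names, as listed in this summary (assumed labels,
# not identifiers taken from the AlignMMBench repository).
TAXONOMY: Dict[str, List[str]] = {
    "perception_understanding": [
        "description", "recognition", "counting", "ocr", "memory", "knowledge",
    ],
    "reasoning_analysis": [
        "reasoning", "charts", "programming", "comparison", "writing",
    ],
    "dialogue_context": ["coherent", "incoherent"],
}


@dataclass
class BenchmarkItem:
    """One QA pair; `history` holds earlier turns for dialogue tasks."""
    image_path: str
    category: str
    task: str
    question: str
    reference_answer: str
    history: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject items whose task does not belong to the stated category.
        if self.task not in TAXONOMY.get(self.category, []):
            raise ValueError(f"task {self.task!r} not in category {self.category!r}")


def task_count(taxonomy: Dict[str, List[str]]) -> int:
    """Total number of tasks across all categories."""
    return sum(len(tasks) for tasks in taxonomy.values())
```

Validating the category/task pair at construction time is one way to catch mislabeled items early during the manual review stages.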
Prompt Paraphrase Strategy:
- Function: Uses LLMs to paraphrase each seed question into multiple stylistically diverse but semantically equivalent variants (expanding from 1054 seeds to 4978 QA pairs).
- Mechanism: Real-world users express the same intent in various ways; this evaluates the model's robustness to different formulations.
- Design Motivation: Introduces the "alignment score"—the consistency of the model's performance under different phrasings of the same question—to measure alignment stability.
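A minimal sketch of the expansion step, assuming each variant inherits the seed's reference answer and that the seed itself counts as one variant (neither detail is confirmed by this summary). The `paraphrase` callable stands in for the LLM call; `expand_seed` and `toy_paraphrase` are hypothetical names.

```python
from typing import Callable, Dict, List


def expand_seed(
    seed_id: str,
    seed_question: str,
    reference_answer: str,
    paraphrase: Callable[[str], List[str]],
) -> List[Dict[str, str]]:
    """Turn one seed question into several QA rows sharing one reference answer."""
    candidates = [seed_question] + paraphrase(seed_question)
    seen = set()
    rows = []
    for text in candidates:
        text = text.strip()
        # Skip empty strings and exact duplicates returned by the paraphraser.
        if not text or text in seen:
            continue
        seen.add(text)
        rows.append(
            {"seed_id": seed_id, "question": text, "reference_answer": reference_answer}
        )
    return rows


def toy_paraphrase(question: str) -> List[str]:
    """Stand-in for an LLM paraphraser; the last entry duplicates the seed."""
    return [f"Please answer: {question}", f"Could you tell me: {question}", question]
```

Keeping `seed_id` on every row is what later allows scores to be grouped per seed when measuring robustness.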
CritiqueVLM Evaluator:
- Function: Fine-tuned on ChatGLM3-6B, taking the question + reference answer + model response as input, and outputting a score of 1-10 along with a CoT explanation.
- Mechanism: Combines a general prompt (scoring scale, criteria, output format) with task-specific prompts (category-level scoring keypoints), then runs SFT on human-annotated scoring data that includes CoT explanations.
- Training Details: 32 x A800 GPUs, batch_size=128, 1000 iterations, with loss decreasing from 3.8 to 0.3.
- Design Motivation: The GPT-4 API is a black box, expensive, and unstable due to updates; CritiqueVLM is open-source, controllable, and shows higher consistency with human ratings than GPT-4 (reducing MAE by 34.8%).
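The evaluator's input and output handling might look roughly like the sketch below. The prompt wording, the `Score: N` output convention, and the names `GENERAL_PROMPT`, `build_prompt`, and `parse_score` are all assumptions made for illustration; the actual CritiqueVLM prompt and output format are not given in this summary.

```python
import re
from typing import Optional

# Hypothetical general prompt; the real wording is not reproduced here.
GENERAL_PROMPT = (
    "You are a grader. Rate the response from 1 to 10 against the reference "
    "answer, explain your reasoning, then end with 'Score: N'."
)


def build_prompt(task_hint: str, question: str, reference: str, response: str) -> str:
    """Concatenate the general prompt, a task-specific hint, and the grading inputs."""
    return "\n".join(
        [
            GENERAL_PROMPT,
            f"Task-specific keypoints: {task_hint}",
            f"Question: {question}",
            f"Reference answer: {reference}",
            f"Model response: {response}",
        ]
    )


# Accept both ASCII and full-width colons, since graders may answer in Chinese.
SCORE_RE = re.compile(r"score\s*[:：]\s*(\d+)", re.IGNORECASE)


def parse_score(output: str) -> Optional[int]:
    """Extract an integer score in [1, 10]; return None if absent or out of range."""
    match = SCORE_RE.search(output)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None
```

Returning `None` for malformed output, rather than a default score, keeps parsing failures visible so they can be counted or retried.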
Alignment Score:
- Function: Quantifies the scoring stability of the model under different prompt variants.
- Mechanism: Calculates the inverse of the score variance across multiple variants of the same question; higher values indicate greater robustness.
- Design Motivation: Models with high performance but low robustness are unreliable in real-world applications.
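As a rough sketch of the metric, the snippet below groups scores by seed question, computes the population variance within each group, and returns the inverse of the mean variance. How the paper aggregates across questions is not stated in this summary, so averaging the variances first and adding a small `eps` to avoid division by zero are both assumptions; `alignment_score` is a hypothetical name.

```python
from collections import defaultdict
from statistics import pvariance
from typing import Iterable, Tuple


def alignment_score(records: Iterable[Tuple[str, float]], eps: float = 1e-6) -> float:
    """Inverse of the mean per-seed score variance; higher means more stable.

    `records` yields (seed_id, score) pairs, one per prompt variant.
    """
    by_seed = defaultdict(list)
    for seed_id, score in records:
        by_seed[seed_id].append(score)

    # Variance is only meaningful for seeds with at least two variants.
    variances = [pvariance(scores) for scores in by_seed.values() if len(scores) > 1]
    if not variances:
        raise ValueError("need at least one seed with two or more scored variants")

    mean_variance = sum(variances) / len(variances)
    return 1.0 / (mean_variance + eps)
```

For example, a model scoring 7, 7, 8 across three phrasings of one question would receive a much higher value than one scoring 2, 9, 5 on the same phrasings.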

Evaluation Metric System

Consistency between CritiqueVLM and human scoring is measured using 6 metrics: MAE, Pearson/Spearman/Kendall correlation coefficients, Fuzzy Accuracy, and Strict Accuracy.
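Two of these metrics, MAE and the Pearson coefficient, have standard definitions and can be sketched directly. The accuracy variants depend on how scores are grouped into ranges, which this summary does not specify, so `bucket_accuracy` below takes the grouping as a caller-supplied function; the function names and any bucket mapping used with them are illustrative assumptions.

```python
from math import sqrt
from typing import Callable, Hashable, Sequence


def mae(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Mean absolute error between evaluator scores and human scores."""
    if len(pred) != len(gold) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)


def pearson(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Pearson correlation coefficient, computed from its definition."""
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_p = sum(pred) / len(pred)
    mean_g = sum(gold) / len(gold)
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    if var_p == 0 or var_g == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_p * var_g)


def bucket_accuracy(
    pred: Sequence[int],
    gold: Sequence[int],
    bucket: Callable[[int], Hashable],
) -> float:
    """Fraction of items whose evaluator and human scores fall in the same bucket."""
    if len(pred) != len(gold) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, g in zip(pred, gold) if bucket(p) == bucket(g))
    return hits / len(pred)
```

Passing the identity function as `bucket` gives exact-match accuracy, while a coarser mapping gives a more lenient figure.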

Key Experimental Results

CritiqueVLM Evaluator Comparison

Evaluator	MAE (↓)	Pearson (↑)	Fuzzy Acc (↑)	Strict Acc (↑)
ChatGLM3-6B	2.424	0.230	0.350	0.285
ChatGPT	1.720	0.572	0.427	0.347
GPT-4	1.256	0.839	0.677	0.565
CritiqueVLM	0.818	0.846	0.747	0.646
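The relative MAE reduction can be checked directly from the table. The short script below uses only the MAE column above; `relative_reduction` is a helper name introduced here.

```python
# MAE values copied from the evaluator comparison table.
MAE = {
    "ChatGLM3-6B": 2.424,
    "ChatGPT": 1.720,
    "GPT-4": 1.256,
    "CritiqueVLM": 0.818,
}


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional drop in error when moving from `baseline` to `new`."""
    return (baseline - new) / baseline


for name in ("ChatGLM3-6B", "ChatGPT", "GPT-4"):
    drop = relative_reduction(MAE[name], MAE["CritiqueVLM"])
    print(f"CritiqueVLM vs {name}: {drop:.2%} lower MAE")
```

Against GPT-4 this gives (1.256 - 0.818) / 1.256 ≈ 0.3487, which is 34.8% when truncated to one decimal place and 34.9% when rounded.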

VLM Benchmark Leaderboard (Partial)

Model	Params	Avg Score	Perception & Understanding	Reasoning & Analysis	Dialogue Context	Align. Score
Qwen2-VL	72B	6.51	~7.0	~5.5	~5.8	1.54
Claude	-	6.51	~7.1	~5.6	~6.3	1.45
GPT-4o	-	6.41	~6.9	~5.8	~5.4	1.18
CogVLM2	19B	5.81	~6.5	~4.7	~5.4	1.49
InternVL-Chat	26B	5.62	~6.0	~4.8	~5.4	1.12
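To see how average score and alignment score can order models differently, the rows above can be ranked on each column in turn. The sketch uses only the five rows of this partial table, so it says nothing about models not listed; `rank_by` is a helper introduced here.

```python
# (model, average score, alignment score), copied from the partial leaderboard.
ROWS = [
    ("Qwen2-VL", 6.51, 1.54),
    ("Claude", 6.51, 1.45),
    ("GPT-4o", 6.41, 1.18),
    ("CogVLM2", 5.81, 1.49),
    ("InternVL-Chat", 5.62, 1.12),
]


def rank_by(rows, column):
    """Model names sorted by the given column, highest first (ties keep table order)."""
    return [row[0] for row in sorted(rows, key=lambda row: row[column], reverse=True)]


by_average = rank_by(ROWS, 1)
by_alignment = rank_by(ROWS, 2)

for model in by_average:
    print(
        f"{model}: rank {by_average.index(model) + 1} by average, "
        f"rank {by_alignment.index(model) + 1} by alignment score"
    )
```

In this subset CogVLM2 moves from fourth by average score to second by alignment score, while GPT-4o drops from third to fourth.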

Key Findings

All VLMs perform relatively well on perception and understanding (average score of 5.07) but poorly on reasoning and analysis (average score of 4.38).
Incoherent scenarios in multi-turn dialogues score significantly lower than coherent scenarios, demonstrating that VLMs struggle to detect errors in preceding dialogue.
English-centric VLMs (such as Phi-3-Vision) exhibit a significant performance drop on the Chinese benchmark.
High-performing models are not necessarily highly robust: GPT-4o achieves high performance but has a lower alignment score (1.18), whereas Qwen2-VL excels in both.

Highlights & Insights

First Chinese Visual Alignment Benchmark: Fills the gap in Chinese multimodal evaluation, utilizing data from real-world Chinese scenarios with tasks spanning both single-turn and multi-turn dialogues.
Small-Model Evaluator Outperforming GPT-4: With only 6B parameters, CritiqueVLM agrees with human ratings more closely than GPT-4 after rule calibration and SFT, suggesting that a domain-specific fine-tuned evaluator can beat a much larger general-purpose model at this task while remaining controllable and economical.
Alignment Score Metric: Incorporates robustness into the evaluation, assessing stability rather than only peak scores; this methodology can be generalized to other evaluation benchmarks.
Prompt Paraphrase Strategy: Cost-effectively scales the evaluation size and introduces a robustness dimension.

Limitations & Future Work

The data scale is relatively limited (1054 images); the task coverage can be further expanded.
CritiqueVLM is based on ChatGLM3-6B (an older model); upgrading to a newer base model may yield further improvements.
The construction method for incoherent scenarios in dialogue context tasks is not detailed.
The evaluation focuses solely on Chinese, leaving cross-lingual comparative analysis (Chinese-English) unexplored.
Some images are sourced from web scrapers, requiring continuous attention to copyright compliance.
Lacks theoretical analysis regarding the alignment score (e.g., why certain models are more robust).

vs MMBench: MMBench includes both English and Chinese versions but only utilizes multiple-choice questions; AlignMMBench evaluates deeper alignment capability using open-ended questions.
vs LLaVABench / MM-Vet: These are English open-ended benchmarks evaluated by GPT; AlignMMBench targets Chinese and employs a self-trained evaluator.
vs TouchStone: TouchStone is also an open-ended VLM evaluation but only in English and without dialogue context tasks.

Rating

Novelty: ⭐⭐⭐⭐ The first Chinese multimodal alignment benchmark, with innovations in CritiqueVLM and the alignment score.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluates 15+ models, with comprehensive comparative validation for the evaluator.
Writing Quality: ⭐⭐⭐⭐ Clear data construction pipeline and comprehensive evaluation methodology.
Value: ⭐⭐⭐⭐ Holds significant practical value for the Chinese VLM community; CritiqueVLM can be directly reused.