CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution
Conference: AAAI 2026 · arXiv: 2511.21717 · Code: https://github.com/bytedance/CrossCheck-Bench · Area: Interpretability · Keywords: multimodal conflict detection, compositional reasoning, diagnostic evaluation, vision-language models, hierarchical benchmark
TL;DR
CrossCheck-Bench is a three-level hierarchical benchmark comprising 15k adversarial QA samples. It diagnoses compositional reasoning failures of VLMs in multimodal conflict resolution via 7 atomic capabilities and 15 tasks, revealing systematic performance degradation from perception (L1) to reasoning (L3) and exposing the limitations of conventional prompting strategies.
Background & Motivation
Background: In open-domain multimodal content, visual and textual cues frequently contradict each other, e.g., an e-commerce page displaying a luxury brand logo paired with a suspiciously low price, or a sportswear image accompanied by a formal-wear description. Humans can intuitively detect such inconsistencies, yet existing VLMs are predominantly trained and evaluated on aligned image-text pairs.
Limitations of Prior Work: Existing benchmarks (VCR, MMMU, MathVista, etc.) primarily assess compositional tasks where modalities are mutually reinforcing, implicitly assuming visual-textual consistency. Inconsistency-detection benchmarks such as MMIR are limited to predefined error types and lack fine-grained capability diagnostics. No benchmark systematically tests whether models can verify the logical compatibility of multimodal signals.
Key Challenge: VLMs may confidently affirm incompatible cues, producing outputs that are logically inconsistent with the input evidence. This capability gap poses tangible risks in real-world deployment scenarios such as product authenticity verification and content moderation.
Goal: To systematically evaluate and diagnose the ability of VLMs to detect, analyze, and resolve cross-modal inconsistencies.
Key Insight: A diagnostic framework is designed around a three-level hierarchy (perception → integration → reasoning) and seven atomic capabilities, with evaluation samples constructed by injecting contradictions into real-world data.
Core Idea: Through hierarchical capability decomposition and cascading failure analysis, the work uncovers a fundamental issue in VLM multimodal conflict reasoning: models may appear successful at the perception layer while failing systematically at the reasoning layer.
Method
Overall Architecture
The CrossCheck-Bench construction pipeline consists of three stages: (1) Cue encoding — real-world e-commerce data spanning 30+ product categories and 5 languages is aggregated into Multimodal Cue Graphs (MCGs), where each MCG contains (entity, modality, attribute, value) quadruples; (2) QA composition — three-level hierarchical QA pairs are generated by sampling 1–n cues from each MCG; (3) Quality control — a three-step cycle of expert review, model-based filtering, and difficulty balancing, requiring 450+ expert hours.
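To make the cue representation concrete, here is a minimal sketch of an MCG as a flat quadruple store with controllable contradiction injection. This is not the authors' released code: `Cue`, `MCG`, `inject_contradiction`, and `sample_cues` are illustrative names, and the paper's actual graph structure may differ.

```python
import random
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Cue:
    """One verifiable fact: an (entity, modality, attribute, value) quadruple."""
    entity: str      # e.g. "sneaker_01"
    modality: str    # "image" or "text"
    attribute: str   # e.g. "brand", "price", "color"
    value: str

@dataclass
class MCG:
    """Multimodal Cue Graph: the set of verifiable cues from one listing."""
    cues: list = field(default_factory=list)

    def inject_contradiction(self, target: Cue, conflicting_value: str) -> "MCG":
        # Swap exactly one cue's value so the two modalities no longer agree,
        # yielding an adversarial sample whose conflict is known and controllable.
        swapped = Cue(target.entity, target.modality, target.attribute, conflicting_value)
        return MCG([swapped if c == target else c for c in self.cues])

def sample_cues(mcg: MCG, n: int) -> list:
    # QA composition draws 1..n cues per question (one for L1; more for L2/L3).
    return random.sample(mcg.cues, k=min(n, len(mcg.cues)))

# Toy listing: image and text agree on the brand until we inject a conflict.
mcg = MCG([
    Cue("sneaker_01", "image", "brand", "Nike"),
    Cue("sneaker_01", "text",  "brand", "Nike"),
    Cue("sneaker_01", "text",  "price", "129.99"),
])
adversarial = mcg.inject_contradiction(mcg.cues[1], "Adidas")
```

The point of the single value swap is that each injected contradiction stays precisely localized and verifiable, which is what makes the resulting QA samples diagnostic rather than merely adversarial.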
Key Designs
- Three-Level Diagnostic Hierarchy and Seven Atomic Capabilities:
- Function: Decomposes multimodal conflict detection ability into fine-grained units that can be measured independently and analyzed in combination.
- Mechanism: Seven atomic capabilities are defined—A1 visual grounding, A2 entity recognition, A3 attribute comparison, A4 multi-frame reasoning, A5 numerical plausibility, A6 region-constrained OCR, and A7 rule-based logic—organized into three cognitive levels: L1 Perception (single capability), L2 Integration (2–3 capability combinations), and L3 Reasoning (multi-step inference + rule verification). Each level builds upon the previous, forming cascading dependencies.
- Design Motivation: Enables model failures to be traced back to specific capability deficits, distinguishing "perception failures" from "reasoning failures" for precise failure attribution; a minimal encoding of this taxonomy is sketched after this list.
- Multimodal Cue Graph (MCG) and Adversarial QA Generation:
- Function: Constructs structured factual representations from real-world data and generates evaluation samples with injected contradictions.
- Mechanism: MCG construction involves entity extraction (YOLOv8-L + GroundingDINO + visual embedding ensemble + fine-tuned Qwen3-8B for text recognition), attribute extraction (rule templates + GPT-4o augmentation), and cross-validation (GPT-4o + 15% human review yielding 98.2% accuracy). The resulting 22.8k MCGs contain an average of 12.7 verifiable cues. QA generation employs a hybrid strategy: L1 template-driven (45+ rule templates), L2 model-assisted (GPT-4o generation + human refinement), and L3 expert handcrafted.
- Design Motivation: Ensures authenticity of evaluation data (sourced from real e-commerce scenarios), controllability of contradiction injection (via precise MCG manipulation), and reliability of quality assurance (through a three-step verification pipeline).
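As referenced in the first design above, here is a minimal sketch of the capability taxonomy and the failure-attribution idea. The capability names and the L1/L2/L3 definitions come from the paper; the `level_of` heuristic and `attribute_failures` are our simplifying assumptions about how such a diagnosis could be computed.

```python
from collections import Counter

# Seven atomic capabilities (A1-A7), as defined in the benchmark.
CAPABILITIES = {
    "A1": "visual grounding",
    "A2": "entity recognition",
    "A3": "attribute comparison",
    "A4": "multi-frame reasoning",
    "A5": "numerical plausibility",
    "A6": "region-constrained OCR",
    "A7": "rule-based logic",
}

def level_of(required: frozenset, multi_step: bool = False) -> str:
    """Heuristic level assignment: L1 = one capability, L2 = 2-3 combined,
    L3 = multi-step inference plus rule verification (A7)."""
    if multi_step and "A7" in required:
        return "L3 Reasoning"
    if len(required) == 1:
        return "L1 Perception"
    return "L2 Integration"

def attribute_failures(records) -> dict:
    """Per-capability error rate over (required capabilities, answered correctly)
    records -- the attribution idea: trace a miss back to its capability mix."""
    errors, totals = Counter(), Counter()
    for required, correct in records:
        for cap in required:
            totals[cap] += 1
            errors[cap] += int(not correct)
    return {cap: errors[cap] / totals[cap] for cap in sorted(totals)}

# Toy usage: items fail whenever region-constrained OCR (A6) is in the mix.
records = [
    (frozenset({"A2"}), True),        # L1: entity recognition alone
    (frozenset({"A2", "A3"}), True),  # L2: recognize + compare
    (frozenset({"A3", "A6"}), False), # L2: compare + OCR
    (frozenset({"A5", "A6"}), False), # L2: plausibility + OCR
]
print(attribute_failures(records))    # {'A2': 0.0, 'A3': 0.5, 'A5': 1.0, 'A6': 1.0}
```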
Loss & Training
This paper presents a benchmark evaluation study and involves no model training. Evaluation uses a hybrid scoring protocol: exact matching for deterministic multiple-choice questions, and semantic judgment by GPT-4o for open-ended responses. The proposed MM-CoT (Multimodal interleaved Chain-of-Thought) operates in two stages: Stage 1 generates candidate answers and extracts visual element annotations with bounding boxes; Stage 2 feeds the annotation-augmented input together with the reasoning trace back into the model for iterative inference.
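Below is a minimal sketch of the hybrid scoring protocol and the two-stage MM-CoT loop as described above. Everything here is schematic: `vlm`, `draw_boxes`, and `judge` stand for caller-supplied functions, and the prompt wording is ours, not the authors'.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class StageOutput:
    answer: str
    reasoning: str
    boxes: list  # [(label, x1, y1, x2, y2), ...] visual element annotations

def hybrid_score(pred: str, gold: str, is_mcq: bool,
                 judge: Callable[[str, str], bool]) -> bool:
    """Hybrid protocol: exact match for deterministic multiple-choice items,
    an LLM judge (GPT-4o in the paper) for open-ended responses."""
    if is_mcq:
        return pred.strip().upper() == gold.strip().upper()
    return judge(pred, gold)

def mm_cot(vlm: Callable[..., StageOutput],
           draw_boxes: Callable,  # overlays boxes on the image (assumed helper)
           image, question: str) -> str:
    """Two-stage Multimodal interleaved Chain-of-Thought (schematic).

    Stage 1: draft an answer, a reasoning trace, and bounding boxes for the
    visual elements the answer relies on.
    Stage 2: feed the box-annotated image plus the Stage-1 trace back into
    the model and ask it to re-verify each grounded element against the text.
    """
    stage1 = vlm(image, question + "\nAnswer, explain your reasoning, and give "
                                   "bounding boxes for every visual element you use.")
    annotated = draw_boxes(image, stage1.boxes)
    stage2 = vlm(annotated, question + "\nPrevious reasoning:\n" + stage1.reasoning +
                                       "\nRe-check each marked region against the "
                                       "text, then give a final answer.")
    return stage2.answer
```

The design choice worth noting is that Stage 2 receives grounding as pixels (drawn boxes) rather than as text coordinates, forcing the model to re-attend to the regions its own reasoning cited.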
Key Experimental Results
Main Results
All values are accuracy in percent; ~ marks approximate values as reported.

| Model | Avg. Accuracy | L1 Perception | L2 Integration | L3 Reasoning |
|---|---|---|---|---|
| Human | 95.2 | 94.5–98.1 | 85.6–97.8 | 82.1–94.3 |
| GPT-4.1 | 76.8 | 85.3 | ~80 | 75.7 |
| Gemini-2.5-Pro | 76.2 | 80.9 | ~83.7 | ~70.2 |
| InternVL3-78B | 71.5 | 74.4 | ~74.0 | ~64.0 |
| Qwen2.5-VL-72B | 69.9 | 75.1 | ~69.1 | ~63.6 |
| MiMo-VL-7B | 65.3 | 62.9 | ~66.1 | ~46.7 |
Ablation Study
| Prompting Strategy | A5 Numerical Plausibility (%) | A6 Region-Constrained OCR (%) | A7 Rule-Based Logic (%) |
|---|---|---|---|
| Base (Vanilla) | 61.2 | 58.7 | 49.1 |
| CoT | 62.0 | 56.3 | 50.8 ↑ |
| SoM | 62.4 | 60.9 ↑ | 48.6 |
| CoT + SoM | 61.8 | 59.3 | 50.1 |
| CSFT (500 samples) | 63.5 ↑ | 60.2 | 49.5 |
| MM-CoT (Ours) | 65.3 ↑ | 61.7 ↑ | 53.5 ↑ |
Key Findings
- All models exhibit consistent performance degradation from L1 to L3: GPT-4.1 drops from 85.3% to 75.7%, with more severe degradation observed in open-source models (MiMo-VL: L1 62.9% → L3 46.7%); a quick arithmetic check of these drops follows this list.
- Composition of atomic capabilities induces accuracy drops of 12%–35%: individual capabilities perform adequately but collapse under combination.
- Model scaling is effective for lower-level tasks (Qwen2.5-VL 7B→72B yields ~9-point L1 improvement) but yields diminishing or negative returns at higher-level reasoning.
- The gap between human performance and the best model exceeds 18 points on average accuracy, and remains substantial on L3 reasoning tasks (≈88% vs. ~76%).
- Conventional CoT and SoM prompting provide only marginal or even negative gains, whereas MM-CoT significantly outperforms them through iterative interleaving of visual grounding and symbolic reasoning.
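As referenced in the first finding above, here is a small check of the L1→L3 drops, computed directly from the main-results table (values transcribed verbatim, including the approximate ~ entries; the script itself is ours).

```python
# Per-level accuracy (%) from the main results table; "~" values used as printed.
results = {
    "GPT-4.1":        {"L1": 85.3, "L3": 75.7},
    "Gemini-2.5-Pro": {"L1": 80.9, "L3": 70.2},
    "InternVL3-78B":  {"L1": 74.4, "L3": 64.0},
    "Qwen2.5-VL-72B": {"L1": 75.1, "L3": 63.6},
    "MiMo-VL-7B":     {"L1": 62.9, "L3": 46.7},
}

for model, acc in results.items():
    print(f"{model:>15}: L1 -> L3 drop = {acc['L1'] - acc['L3']:.1f} pts")
# GPT-4.1 loses 9.6 points; MiMo-VL-7B loses 16.2, roughly 1.7x as much,
# consistent with the "more severe degradation in open-source models" finding.
```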
Highlights & Insights
- Cascading failure analysis reveals deep-seated issues: models answer a sample's L1 probes correctly yet fail the L2/L3 probes built on the same cues, indicating that surface-level perception masks a collapse at the reasoning layer.
- The capability decomposition design makes failures attributable and diagnosable, rather than reducing performance to a single aggregate score.
- The finding of diminishing returns from model scaling on compositional reasoning suggests that pure scaling cannot resolve reasoning bottlenecks.
- MM-CoT offers a promising improvement direction by enhancing cross-modal verification through iterative reasoning-grounding feedback loops.
Limitations & Future Work
- Data is primarily sourced from e-commerce scenarios; domain diversity should be extended to news, social media, and other settings.
- MCG construction relies on GPT-4o and expert annotation, making it costly and difficult to scale automatically.
- Only zero-shot QA protocols are evaluated; performance under few-shot or fine-tuning settings remains unexplored.
- MM-CoT requires two inference calls, doubling inference overhead.
- Difficulty calibration for L3 tasks may be susceptible to model consensus bias, given only 18% expert coverage.
Related Work & Insights
This work complements traditional alignment-oriented benchmarks such as VCR and MMMU, extending VLM evaluation from "understanding consistent information" to the new dimension of "detecting inconsistent information." SpaCE-10 decomposes spatial intelligence into 10 atomic skills but does not consider conflict scenarios; VLM2-Bench addresses cross-image matching rather than intra-input conflicts—CrossCheck-Bench fills the gap of "hierarchical conflict diagnosis." For future VLM design, the findings suggest the need for training data and evaluation mechanisms specifically targeting conflict reasoning.
Rating
- Novelty: ⭐⭐⭐⭐ First hierarchical multimodal conflict diagnosis benchmark with a well-motivated problem formulation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 13 models, 15k samples, and multi-dimensional analysis comprehensively.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, rich figures and tables, and intuitive data presentation.
- Value: ⭐⭐⭐⭐ Provides important tools and directions for VLM reliability evaluation and improvement.