PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies¶

Conference: ICLR 2026
arXiv: 2510.16505
Code: Project Page
Area: Multimodal Evaluation / Scientific Documents
Keywords: Multimodal Inconsistency, Peer Review, Scientific Papers, LMM Benchmark, JSON Debiasing

TL;DR¶

This work introduces PRISMM-Bench, the first benchmark grounded in genuine reviewer-annotated multimodal inconsistencies in scientific papers. Mining 18,009 ICLR open reviews yields 384 cross-modal inconsistencies, which are evaluated across three tasks (identification, remediation, and paired matching) using a JSON-structured answer representation that reduces linguistic bias. Among 21 state-of-the-art LMMs, the best reaches only 53.9% average accuracy, exposing severe deficiencies in cross-modal reasoning over scientific documents.

Background & Motivation¶

Background: Large multimodal models (LMMs) are increasingly employed to assist scientific research—interpreting figures, summarizing papers, and detecting errors. A fundamental question, however, remains unresolved: can LMMs genuinely understand and reason over the complex multimodal structure of scientific papers spanning text, figures, and equations?

Limitations of Prior Work:

- Existing document QA benchmarks (DocVQA, ChartQA, etc.) evaluate individual modalities in isolation, ignoring cross-modal dependencies among text, figures, and formulas.
- Synthetic datasets (e.g., MMIR) inject artificial errors, which tend to be overly conspicuous and fail to represent the subtle, domain-knowledge-demanding inconsistencies found in real-world scientific writing.
- Multiple-choice evaluation suffers from severe linguistic bias: models can achieve well above chance accuracy by reading answer choices alone without the question (e.g., Gemini 2.5 Flash reaches 57.6% without context).

Key Challenge: A benchmark that is simultaneously authentic and systematic is needed to assess cross-modal reasoning, yet real inconsistencies are rare, scattered, and costly to verify; moreover, evaluation itself is compromised by linguistic shortcuts.

Goal: (1) How can real cross-modal inconsistencies be systematically collected? (2) How can evaluation tasks be designed to be fair and unbiased?

Key Insight: Open peer review provides a natural solution—inconsistencies flagged by reviewers in real papers constitute expert-level annotations that are organically produced and unpredictable.

Core Idea: Reviewer criticisms are the best test cases for multimodal reasoning.

Method¶

Overall Architecture: Six-Stage Construction Pipeline¶

PRISMM-Bench is constructed through six stages:

(1) Review Acquisition: 18,009 ICLR 2024/2025 reviews are scraped from OpenReview, restricted to rejected or withdrawn papers without rebuttals to ensure inconsistencies remain uncorrected.
(2) LLM Filtering: Mistral Nemo at low temperature filters down to 6,056 candidate inconsistency mentions.
(3) Human Annotation: a custom web annotation tool is used to verify each entry, labeling inconsistency type, modalities involved, and location metadata, yielding 384 inconsistencies across 353 papers and 15 categories.
(4) LMM Task Generation: Gemini 2.5 Flash automatically generates four-way multiple-choice questions.
(5) Human Verification: errors in the automatically generated questions are corrected.
(6) LLM Debiasing: natural-language answers are converted to JSON format to eliminate linguistic shortcuts.
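
As a rough illustration of what the human annotation stage produces, the sketch below models one verified entry as a small record and counts entries per paper. The class name, field names, category strings, and sample values are all hypothetical; the source only states that inconsistency type, modalities involved, and location metadata are labeled.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class InconsistencyRecord:
    """One human-verified inconsistency (field names are assumptions)."""
    paper_id: str        # hypothetical identifier for the paper
    category: str        # one of the benchmark's 15 categories (names not given here)
    modalities: tuple    # modalities involved, e.g. ("figure", "text")
    location: str        # free-form location metadata


def inconsistencies_per_paper(records):
    """Count how many verified inconsistencies each paper contributes."""
    return Counter(record.paper_id for record in records)


# Made-up sample data, for illustration only.
records = [
    InconsistencyRecord("paper-A", "figure-text mismatch", ("figure", "text"), "Figure 2 / Section 4.1"),
    InconsistencyRecord("paper-A", "table-text mismatch", ("table", "text"), "Table 1 / Section 5"),
    InconsistencyRecord("paper-B", "equation-text mismatch", ("equation", "text"), "Eq. 3 / Section 3.2"),
]
counts = inconsistencies_per_paper(records)
print(counts)
```

A per-paper count of this kind is one way to see why 384 inconsistencies can come from only 353 papers: some papers contribute more than one entry.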

Key Design 1: Three-Task Progressive Evaluation Framework¶

Three four-option multiple-choice tasks of increasing difficulty are evaluated under up to three levels of context granularity, giving seven evaluation configurations in total (Match appears as a single configuration in the results table):

Inconsistency Identification (Ident, 384 items): Given paper context, answer "What inconsistency exists across these sections?" This tests detection ability.
Inconsistency Remediation (Remedy, 384 items): Answer "What action is needed to fix the inconsistency?" This requires deeper reasoning.
Paired Matching (Match, 192 items): Given a visual element, identify which of four candidates conflicts with it. This tests pure visual cross-modal reasoning.

Three context granularity levels: Focused (key excerpts only) → Page (full page rendered at 144 DPI) → Document (entire paper concatenated into five images), with increasing difficulty.

Design Motivation: The three tasks progress from detection to remediation to relational reasoning, while the three context levels move from noise-free to highly distracting, jointly covering the full capability spectrum of scientific document understanding.
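
One way to arrive at seven configurations is sketched below: Ident and Remedy are each crossed with the three context levels, and Match contributes a single configuration. This breakdown is an assumption inferred from the column names in the results table (Ident-Focused, Remedy-Focused, Match, Ident-Page, Ident-Doc); the source states only the total.

```python
# Tasks assumed to be evaluated at every context level.
TASKS_WITH_CONTEXT = ("Ident", "Remedy")
CONTEXT_LEVELS = ("Focused", "Page", "Document")


def evaluation_configs():
    """Enumerate (task, context_level) pairs under the assumed 2 x 3 + 1 layout."""
    configs = [(task, level) for task in TASKS_WITH_CONTEXT for level in CONTEXT_LEVELS]
    configs.append(("Match", None))  # Match is assumed to have no context variants
    return configs


configs = evaluation_configs()
for task, level in configs:
    print(task if level is None else f"{task}-{level}")
```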

Key Design 2: JSON-Structured Debiasing Answer Representation¶

To counter models exploiting choice-only shortcuts, answer options are converted from natural language to structured JSON:

Ident task: Evidence–Claim JSON format (evidence + assertion).
Remedy task: Target–Action JSON format (target element + corrective action).
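
A minimal sketch of what such structured options might look like is given below. The key names (`evidence`, `claim`, `target`, `action`) and the sample strings are assumptions chosen to mirror the format names; the exact schema used by the benchmark is not given here.

```python
import json


def to_evidence_claim(evidence, claim):
    """Serialize an Ident option as Evidence-Claim JSON (hypothetical keys)."""
    return json.dumps({"evidence": evidence, "claim": claim}, sort_keys=True)


def to_target_action(target, action):
    """Serialize a Remedy option as Target-Action JSON (hypothetical keys)."""
    return json.dumps({"target": target, "action": action}, sort_keys=True)


# Made-up option content, for illustration only.
ident_option = to_evidence_claim(
    evidence="Figure 3 y-axis",
    claim="reported unit differs from the unit stated in the text",
)
remedy_option = to_target_action(
    target="Figure 3 y-axis label",
    action="change unit to match the text",
)
print(ident_option)
print(remedy_option)
```

Because every option shares the same keys and a terse value style, surface cues such as sentence length or phrasing carry less signal about which option is correct.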

The core mechanism is to remove stylistic cues (length variation, phrasing habits, positional patterns) while retaining only semantic content. The visual dependency ratio \(R\) quantifies the effect:

\[R = \frac{Acc_{\text{with\_context}} - Acc_{\text{without\_context}}}{1 - Acc_{\text{without\_context}}}\]

A higher \(R\) indicates greater reliance on visual evidence. Human \(R = 69.0\%\), while the best model achieves only \(R = 53.5\%\), indicating that humans rely more genuinely on visual reasoning than current models.
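
The ratio can be computed directly from the two accuracies. The sketch below implements the formula as written, with accuracies expressed as fractions in [0, 1]; the function name and the sample inputs are illustrative and not taken from the source.

```python
def visual_dependency_ratio(acc_with_context, acc_without_context):
    """R = (Acc_with - Acc_without) / (1 - Acc_without), accuracies as fractions."""
    if not 0.0 <= acc_without_context < 1.0:
        raise ValueError("acc_without_context must be in [0, 1)")
    return (acc_with_context - acc_without_context) / (1.0 - acc_without_context)


# Illustrative inputs: 70% accuracy with context, 40% from answer choices alone.
r_example = visual_dependency_ratio(0.70, 0.40)
print(f"R = {r_example:.3f}")

# A model that gains nothing from the context has R = 0.
r_no_gain = visual_dependency_ratio(0.40, 0.40)
```

With these inputs, half of the headroom left by the choice-only baseline is recovered by looking at the context, so R is about 0.5.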

Key Experimental Results¶

Main Results: 21 LMMs Benchmarked (Accuracy %)¶

Model	Params	Ident-Focused	Remedy-Focused	Match	Ident-Page	Ident-Doc	Avg.
Gemma 3 4B	4B	27.9	29.9	39.6	25.0	26.6	27.8
InternVL3.5 8B (R)	8B	49.5	35.9	45.8	38.3	36.7	37.7
Ovis2 34B	34B	50.0	41.1	37.0	40.6	33.3	38.7
GLM 4.5V 106B (R)	106B	51.8	43.2	52.1	45.8	40.9	42.6
GPT-5 minimal (R)	—	53.6	43.5	63.0	47.1	40.9	44.0
Gemini 2.5 Pro (R)	—	65.9	61.2	66.7	54.7	39.8	52.8
GPT-5 high (R)	—	63.8	54.4	70.3	58.1	46.9	53.9

Ablation Study: Effect of Disabling CoT Reasoning (Ident-Focused)¶

Model	Reasoning On	Reasoning Off	Relative Drop
GLM 4.5V 106B	51.8%	43.2%	−16.6%
InternVL3.5 8B	49.5%	40.6%	−18.0%
InternVL3.5 38B	54.4%	40.4%	−25.7%
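
The drop values in this table are relative to the reasoning-on accuracy rather than absolute percentage-point differences. The short check below recomputes both quantities from the table rows; the function name is illustrative.

```python
def relative_drop(acc_on, acc_off):
    """Relative accuracy loss (in %) when reasoning is disabled."""
    return (acc_on - acc_off) / acc_on * 100.0


# (reasoning on, reasoning off) accuracies in %, copied from the table above.
ablation = {
    "GLM 4.5V 106B": (51.8, 43.2),
    "InternVL3.5 8B": (49.5, 40.6),
    "InternVL3.5 38B": (54.4, 40.4),
}

for model, (acc_on, acc_off) in ablation.items():
    print(f"{model}: {acc_on - acc_off:.1f} pp absolute, "
          f"{relative_drop(acc_on, acc_off):.1f}% relative")
```

In absolute terms the three drops are 8.6, 8.9, and 14.0 percentage points.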

JSON Debiasing Effect (User Study Subset)¶

Model	NL w/o Context	JSON w/o Context	Visual Dep. R, NL (%)	Visual Dep. R, JSON (%)
InternVL3.5 38B	53.7%	25.3%	22.5	38.1
Gemini 2.5 Pro	70.1%	37.3%	43.8	45.2
Human	27.5%	—	69.0	—
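
The table reports choice-only accuracy and R but not the with-context accuracy. Assuming R is computed exactly as in the formula for the visual dependency ratio, on this same subset, the with-context accuracy can be back-solved as Acc_with = Acc_without + R (1 − Acc_without). The sketch below does this for each row; the resulting numbers are derived values, not figures reported in the source.

```python
def implied_with_context_accuracy(acc_without, r):
    """Invert R = (Acc_with - Acc_without) / (1 - Acc_without) for Acc_with."""
    return acc_without + r * (1.0 - acc_without)


# (model, answer format, accuracy without context, R), as fractions from the table.
rows = [
    ("InternVL3.5 38B", "NL", 0.537, 0.225),
    ("InternVL3.5 38B", "JSON", 0.253, 0.381),
    ("Gemini 2.5 Pro", "NL", 0.701, 0.438),
    ("Gemini 2.5 Pro", "JSON", 0.373, 0.452),
    ("Human", "NL", 0.275, 0.690),
]

implied = {
    (model, fmt): implied_with_context_accuracy(acc_without, r)
    for model, fmt, acc_without, r in rows
}
for (model, fmt), acc in implied.items():
    print(f"{model} ({fmt}): implied with-context accuracy = {acc:.1%}")
```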

Key Findings¶

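Even the strongest model, GPT-5 (high), achieves only 53.9% average accuracy, falling far short of what would be required for a reliable scientific assistant.
Performance generally degrades from Focused → Page → Document, indicating that long-document distraction is a critical bottleneck. The one exception in the main results table is Gemma 3 4B, whose Ident scores sit near the 25% chance level at every context level.
Remedy-Focused scores are lower than Ident-Focused scores for every model in the main results table except Gemma 3 4B (29.9 vs. 27.9), consistent with "fixing" requiring deeper reasoning than "detecting."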
Enabling CoT reasoning improves performance by 5–14 percentage points on average, highlighting the importance of structured reasoning for scientific document understanding.
17% of ICLR 2025 submissions contain at least one reviewer-flagged inconsistency, demonstrating that cross-modal inconsistency is a pervasive problem.
High-resolution specialist models (VILA HD 4K, InternLM XC 2.5) show no advantage under extended context.

Highlights & Insights¶

"Reviewer Criticism as Test Case" Data Philosophy: Rather than artificially injecting errors, the benchmark exploits problems naturally identified by experts during peer review, maximizing ecological validity and proximity to real-world application scenarios.
Elegant Simplicity of JSON Debiasing: The idea of style homogenization—borrowed from NLP security—is transferred to multimodal evaluation, using uniform structured representations to eliminate answer-style variation and addressing a systemic problem in MCQ-based benchmarking.
Sustainable Live Benchmark: The pipeline can be applied to new conference review data, continuously generating fresh samples and fundamentally avoiding data contamination.
Scale vs. Architecture: Gemma 3 12B achieves 63.5% on the Match task, surpassing many 70B+ models, suggesting that architectural design matters more than raw parameter count.

Limitations & Future Work¶

Coverage is limited to the AI domain (ICLR 2024/2025); inconsistencies in chemistry, biology, physics, and other fields may exhibit different characteristics.
Samples are biased toward rejected papers; persistent inconsistencies in accepted papers remain unevaluated.
The 384-sample scale is limited, providing insufficient statistical power for fine-grained per-category analyses.
The benchmark evaluates identification of inconsistencies at known locations, without assessing the ability to proactively search for them across an entire paper.

Related Work & Insights¶

vs. MMIR (Yan et al., 2025): MMIR uses synthetically injected inconsistencies, enabling easier scaling but at the cost of realism; PRISMM-Bench uses genuine reviewer annotations, which are harder to collect but offer higher ecological validity. The two approaches are complementary.
vs. QASA / SciDQA: The former is text-only QA; the latter has a similar data source but lacks visual elements. PRISMM-Bench is unique in combining both authentic sourcing and multimodal evaluation.
Insights: Future work could extend the pipeline to arXiv preprints and reviews from additional venues to construct a large-scale cross-domain version. Integration with automated review tools (e.g., AI reviewers) could establish a closed-loop system for proactive inconsistency detection.

Rating¶

⭐⭐⭐⭐⭐ (5/5)

Overall assessment: This is the first benchmark grounded in authentic reviewer-annotated inconsistencies. It pairs that data with JSON debiasing and evaluates 21 models across three tasks and three context levels, an exceptionally thorough experimental design. Because the pipeline can be rerun on new review data, it supports sustainable expansion and contributes reusable infrastructure for evaluating scientific AI assistants, setting a new standard for multimodal benchmarking.