# VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?
- Conference: ICLR 2026
- arXiv: 2603.07888
- Code: GitHub / Dataset
- Area: Multimodal VLM
- Keywords: VLM, Comparative Reasoning, Benchmark, Subtle Differences, Multi-Image
## TL;DR
This paper introduces VLM-SubtleBench, a benchmark for evaluating vision-language models on subtle-difference comparative reasoning, covering 10 difference types and 6 image domains (natural, gaming, industrial, aerial, medical, and synthetic). It reveals a gap of more than 30 percentage points between VLMs and humans on spatial, temporal, and viewpoint reasoning tasks.
## Background & Motivation
Distinguishing subtle visual differences is a core human cognitive ability, widely applied in industrial inspection, medical diagnosis, remote sensing analysis, and related fields. Existing VLM benchmarks have two critical shortcomings:
Insufficient subtlety: Benchmarks such as MLLM-CompBench feature image pairs with obvious differences (low DINOv3 similarity), which state-of-the-art VLMs like GPT-4o can already solve with ease.
Insufficient domain coverage: Most benchmarks are limited to natural images and do not cover specialized domains such as industrial, medical, or aerial imagery.
Core Problem: How far are VLMs from human-level performance on tasks requiring fine-grained comparative reasoning?
## Method
### Benchmark Design
Image domains covered (6):
- Natural scenes
- Gaming environments
- Aerial imagery
- Industrial inspection
- Medical imaging
- Synthetic primitives

Difference types covered (10):
- Attribute (color/size/shape)
- State (damage/status change)
- Emotion (facial expression)
- Temporal (temporal order)
- Spatial (spatial position)
- Existence (object appearance/disappearance)
- Quantity (count differences)
- Quality (image quality)
- Viewpoint (perspective change)
- Action (action differences)
### Dataset Construction
The benchmark comprises 13K triplets (image pair + question + answer), with at least 1K samples per difference type.
Key construction strategies:
- Attribute: MVTec AD defect pairs + COCO object color editing + medical X-ray comparisons
- Temporal/Viewpoint: frame pairs sampled from videos (YT8M, VLM4D, CameraBench) + manual annotation and verification
- Spatial: translation/rotation actions from VLM4D 4D annotations
- Existence: LEVIR-MCI remote-sensing change detection + synthetic addition/deletion
- Quality: best- and worst-quality frames manually selected from video sequences
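The summary does not give the dataset's exact schema; as a concrete anchor, here is a minimal sketch of what one triplet could look like as a Python dataclass. All field names (`image_a`, `difference_type`, etc.) are assumptions for illustration, not the released format.

```python
from dataclasses import dataclass

# Hypothetical record layout for one VLM-SubtleBench triplet.
# Field names are illustrative; the released dataset may differ.
@dataclass
class SubtleBenchTriplet:
    image_a: str                    # path or URL of the first image
    image_b: str                    # path or URL of the second image
    question: str                   # comparative question about the pair
    choices: list[str]              # candidate answers
    answer: str                     # ground-truth choice
    difference_type: str            # one of the 10 types, e.g. "spatial"
    domain: str                     # one of the 6 domains, e.g. "industrial"
    description: str | None = None  # human difference caption (10% of the test set)

example = SubtleBenchTriplet(
    image_a="pairs/0001_a.png",
    image_b="pairs/0001_b.png",
    question="Which object changed position between the two images?",
    choices=["the red box", "the blue cylinder", "nothing moved"],
    answer="the blue cylinder",
    difference_type="spatial",
    domain="synthetic",
)
```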
### Difference Description Annotation
Human-annotated difference descriptions were additionally collected for 1,200 image pairs (10% of the test set) to support difference-captioning evaluation.
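The captioning metric itself is not specified in this summary; as one plausible way to use these references, the sketch below scores a generated difference caption against a human annotation with sentence-level BLEU from NLTK. The metric choice and the `caption_bleu` helper are illustrative assumptions.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

# Score a model-generated difference caption against one human reference.
# BLEU is only a plausible stand-in; the paper's exact metric may differ.
def caption_bleu(reference: str, hypothesis: str) -> float:
    smooth = SmoothingFunction().method1  # avoids zero scores on short captions
    return sentence_bleu(
        [reference.split()],   # list of tokenized reference captions
        hypothesis.split(),    # tokenized model caption
        smoothing_function=smooth,
    )

print(caption_bleu(
    "the blue cylinder moved to the left side of the table",
    "the cylinder shifted to the left on the table",
))
```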
### Dataset Statistics
- Test set: 11.7K samples
- Validation set: 1.3K samples
- Every difference type includes data from the natural domain
## Key Experimental Results
### Model Evaluation
Accuracy (%) per difference type (AT = Attribute, ST = State, EM = Emotion, TM = Temporal, SP = Spatial, EX = Existence, QN = Quantity, QL = Quality, VP = Viewpoint, AC = Action):

| Model | AT | ST | EM | TM | SP | EX | QN | QL | VP | AC | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Random | 35.9 | 50.0 | 50.0 | 50.0 | 36.6 | 23.2 | 48.9 | 50.0 | 42.1 | 50.0 | 43.3 |
| Human | 92.0 | 93.0 | 93.0 | 93.0 | 95.0 | 97.0 | 97.0 | 99.0 | 98.0 | 98.0 | 95.5 |
| LLaVA-NeXT-7B | 37.0 | 51.3 | 51.8 | 47.4 | 37.3 | 25.6 | 49.5 | 48.0 | 43.7 | 46.9 | 43.6 |
| Qwen2.5-VL-7B | 46.5 | 63.7 | 87.8 | 50.2 | 39.5 | 73.8 | 58.0 | 70.9 | 47.5 | 69.3 | 59.4 |
| Qwen2.5-VL-72B | - | - | - | - | - | - | - | - | - | - | ~65 |
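To make the table's construction concrete, here is a minimal per-type accuracy aggregation in Python; the record format and the `per_type_accuracy` helper are illustrative assumptions, not the paper's evaluation harness.

```python
from collections import defaultdict

# Aggregate per-difference-type accuracy (%) from
# (difference_type, prediction, gold) records.
def per_type_accuracy(records: list[tuple[str, str, str]]) -> dict[str, float]:
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for diff_type, prediction, gold in records:
        total[diff_type] += 1
        correct[diff_type] += int(prediction.strip().lower() == gold.strip().lower())
    scores = {t: 100.0 * correct[t] / total[t] for t in total}
    scores["AVG"] = sum(scores.values()) / len(scores)  # macro average over types
    return scores

records = [("spatial", "B", "B"), ("spatial", "A", "C"), ("emotion", "A", "A")]
print(per_type_accuracy(records))  # {'spatial': 50.0, 'emotion': 100.0, 'AVG': 75.0}
```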
### Key Findings
- Large human–machine gap: Even GPT-5 and Gemini-2.5-pro lag behind humans by more than 30 percentage points on spatial, temporal, and viewpoint reasoning.
- Limited effect of prompting strategies: Techniques such as CoT, grid layout, and image overlay yield only marginal improvements.
- High sensitivity to difficulty factors: Object size and quantity significantly affect VLM performance.
- Large open-source vs. closed-source gap: LLaVA-NeXT-7B performs at chance level (43.6 average vs. 43.3 for random guessing).
- Emotion recognition as a relative strength: Qwen2.5-VL-7B achieves 87.8 on Emotion, approaching human-level performance.
### Prompting Strategy Analysis
| Strategy | Effect |
|---|---|
| Chain-of-Thought | Marginal improvement |
| Two-step reasoning | Limited gains |
| Grid overlay | Slight help |
| Pixel difference highlighting | Effective for certain types |
| Horizontal concatenation | Inconsistent results |
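Pixel-difference highlighting is the most mechanical of these strategies to reproduce. Below is a minimal sketch using NumPy and Pillow, assuming the two images are aligned and the same size; the threshold and red overlay are arbitrary illustrative choices, not the paper's exact procedure.

```python
import numpy as np
from PIL import Image

# Tint pixels that differ between two aligned, same-size images red;
# the overlay can then be shown to the VLM alongside the original pair.
def highlight_pixel_diff(path_a: str, path_b: str, threshold: int = 30) -> Image.Image:
    a = np.asarray(Image.open(path_a).convert("RGB"), dtype=np.int16)
    b = np.asarray(Image.open(path_b).convert("RGB"), dtype=np.int16)
    diff = np.abs(a - b).max(axis=-1)   # per-pixel max channel difference
    mask = diff > threshold             # boolean map of changed pixels
    out = a.astype(np.uint8).copy()
    out[mask] = [255, 0, 0]             # paint changed pixels red
    return Image.fromarray(out)

highlight_pixel_diff("pair_a.png", "pair_b.png").save("diff_overlay.png")
```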
### Comparison with MLLM-CompBench
Image pairs in VLM-SubtleBench exhibit substantially higher DINOv3 similarity than those in MLLM-CompBench (>0.8 vs. <0.6), confirming the greater subtlety of the differences.
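As a reference point, this kind of pair similarity can be computed with off-the-shelf DINO embeddings. The sketch below uses DINOv2 from Hugging Face as a readily available stand-in for the paper's DINOv3, comparing CLS-token embeddings with cosine similarity.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Cosine similarity between DINO CLS embeddings of two images.
# DINOv2 is used here as a stand-in for the paper's DINOv3.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def dino_similarity(path_a: str, path_b: str) -> float:
    images = [Image.open(p).convert("RGB") for p in (path_a, path_b)]
    inputs = processor(images=images, return_tensors="pt")
    cls = model(**inputs).last_hidden_state[:, 0]   # CLS token per image
    cls = torch.nn.functional.normalize(cls, dim=-1)
    return float(cls[0] @ cls[1])

print(dino_similarity("pair_a.png", "pair_b.png"))  # near 1.0 for subtle pairs
```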
## Highlights & Insights
- Fills an important gap: The first comprehensive benchmark focused on subtle-difference comparative reasoning.
- Multi-domain coverage: The only comparative reasoning benchmark that simultaneously covers specialized domains including industrial, medical, and aerial imagery.
- Systematic analysis: In-depth ablation studies on prompting strategies and difficulty factors.
- High practical value: Directly targets critical weaknesses of VLMs in real-world applications.
## Limitations & Future Work
- Some image pairs for certain difference types are generated through editing, which may introduce unnatural artifacts.
- The medical domain covers only chest X-rays; domain coverage could be further expanded.
- The human baseline is measured on a 10% sample of the test set, which may limit its statistical robustness.
- Synthetic primitive scenes are relatively simple and do not fully reflect the complexity of real-world applications.
- The evaluation focuses solely on final answer correctness, without in-depth analysis of the reasoning process.
## Related Work & Insights
- Multi-image benchmarks: BLINK (Fu et al., 2024) evaluates low-level visual perception; MuirBench (Wang et al., 2025) covers 12 types of multi-image tasks.
- Comparative reasoning benchmarks: MLLM-CompBench (Kil et al., 2024) evaluates 8 difference types but with conspicuous differences.
- Difference description: Img-Diff, OneDiff, DiffTell, and others focus on difference captioning.
- Domain-specific: MIMIC-Diff-VQA (medical), GeoBench (remote sensing).
## Rating
- Novelty: ⭐⭐⭐⭐ — Focusing on subtle difference comparative reasoning represents a novel perspective.
- Practicality: ⭐⭐⭐⭐⭐ — Directly serves high-value evaluation scenarios such as industrial inspection and medical diagnosis.
- Clarity: ⭐⭐⭐⭐ — Benchmark design and experimental analysis are clear and systematic.
- Significance: ⭐⭐⭐⭐ — Reveals fundamental deficiencies of VLMs in fine-grained visual reasoning.