VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?

  • Conference: ICLR 2026
  • arXiv: 2603.07888
  • Code: GitHub / Dataset
  • Area: Multimodal VLM
  • Keywords: VLM, Comparative Reasoning, Benchmark, Subtle Differences, Multi-Image

TL;DR

This paper introduces VLM-SubtleBench, a benchmark for evaluating vision-language models on comparative reasoning over subtle visual differences, covering 10 difference types and 6 image domains (natural, gaming, industrial, aerial, medical, and synthetic). It reveals a gap of more than 30 percentage points between VLMs and humans on spatial, temporal, and viewpoint reasoning tasks.

Background & Motivation

Distinguishing subtle visual differences is a core human cognitive ability, widely applied in industrial inspection, medical diagnosis, remote sensing analysis, and related fields. Existing VLM benchmarks have two critical shortcomings:

Insufficient subtlety: Benchmarks such as MLLM-CompBench feature image pairs with obvious differences (low DINOv3 similarity), which state-of-the-art VLMs like GPT-4o can already solve with ease.

Insufficient domain coverage: Most benchmarks are limited to natural images and do not cover specialized domains such as industrial, medical, or aerial imagery.

Core Problem: How far are VLMs from human-level performance on tasks requiring fine-grained comparative reasoning?

Method

Benchmark Design

Image domains covered (6): natural scenes, gaming environments, aerial imagery, industrial inspection, medical imaging, and synthetic primitives.

Difference types covered (10):

  • Attribute (color/size/shape), State (damage/status change), Emotion (facial expression)
  • Temporal (temporal order), Spatial (spatial position), Existence (object appearance/disappearance)
  • Quantity (count differences), Quality (image quality), Viewpoint (perspective change), Action (action differences)

Dataset Construction

A total of 13K triplets (image pairs + questions + answers), with at least 1K samples per difference type.
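
The release format of the triplets is not spelled out in this summary; as a rough illustration, one sample could be represented with a schema like the following (all field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class SubtleBenchTriplet:
    """One benchmark sample: an image pair plus a question and answer.

    Field names are hypothetical; the actual release may differ.
    """
    image_a: str          # path or URL of the first image
    image_b: str          # path or URL of the second image
    question: str         # e.g. "Which image shows the damaged part?"
    answer: str           # gold answer, e.g. "Image B"
    difference_type: str  # one of the 10 types, e.g. "temporal"
    domain: str           # one of the 6 domains, e.g. "industrial"
```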

Key construction strategies:

  • Attribute: MVTEC-AD defect pairs + COCO object color editing + medical X-ray comparisons
  • Temporal/Viewpoint: frame pairs sampled from videos (YT8M, VLM4D, CameraBench) + manual annotation and verification (see the sampling sketch after this list)
  • Spatial: translation/rotation actions from VLM4D 4D annotations
  • Existence: LEVIR-MCI remote sensing change detection + synthetic addition/deletion
  • Quality: best and worst quality frames manually selected from video sequences
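
For the video-derived types, frame pairs are sampled from videos and then manually verified. A minimal sketch of the sampling step with OpenCV; the `gap_seconds` parameter and the lack of any automatic filtering are assumptions, since the paper's exact sampling criteria are not given here:

```python
import cv2

def sample_frame_pair(video_path: str, gap_seconds: float = 1.0):
    """Grab two frames `gap_seconds` apart from a video.

    A sketch of the video-derived construction step only; the paper
    adds manual annotation and verification on top of sampling.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS is unknown
    ok1, frame_a = cap.read()                 # first frame
    for _ in range(int(fps * gap_seconds)):   # skip ahead by the gap
        cap.grab()
    ok2, frame_b = cap.read()                 # second frame
    cap.release()
    if not (ok1 and ok2):
        raise ValueError("video too short for the requested gap")
    return frame_a, frame_b
```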

Difference Description Annotation

Human-annotated difference descriptions were additionally collected for 1,200 image pairs (10% of the test set) to support difference-captioning evaluation.

Dataset Statistics

  • Test set: 11.7K
  • Validation set: 1.3K
  • Each difference type includes samples from the natural-image domain

Key Experimental Results

Model Evaluation

| Model | AT | ST | EM | TM | SP | EX | QN | QL | VP | AC | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random | 35.9 | 50.0 | 50.0 | 50.0 | 36.6 | 23.2 | 48.9 | 50.0 | 42.1 | 50.0 | 43.3 |
| Human | 92.0 | 93.0 | 93.0 | 93.0 | 95.0 | 97.0 | 97.0 | 99.0 | 98.0 | 98.0 | 95.5 |
| LLaVA-NeXT-7B | 37.0 | 51.3 | 51.8 | 47.4 | 37.3 | 25.6 | 49.5 | 48.0 | 43.7 | 46.9 | 43.6 |
| Qwen2.5-VL-7B | 46.5 | 63.7 | 87.8 | 50.2 | 39.5 | 73.8 | 58.0 | 70.9 | 47.5 | 69.3 | 59.4 |
| Qwen2.5-VL-72B | - | - | - | - | - | - | - | - | - | - | ~65 |

Column abbreviations: AT = Attribute, ST = State, EM = Emotion, TM = Temporal, SP = Spatial, EX = Existence, QN = Quantity, QL = Quality, VP = Viewpoint, AC = Action.
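
The per-type scores are plain accuracies. A sketch of how one table row could be computed from model predictions; the record format and the macro averaging over types are assumptions, not the paper's documented protocol:

```python
from collections import defaultdict

def per_type_accuracy(records):
    """Compute accuracy per difference type plus a macro average.

    `records` is assumed to be an iterable of dicts with keys
    "difference_type", "prediction", and "answer".
    """
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        t = r["difference_type"]
        total[t] += 1
        correct[t] += int(r["prediction"] == r["answer"])
    acc = {t: 100.0 * correct[t] / total[t] for t in total}
    acc["AVG"] = sum(acc.values()) / len(acc)  # macro average over types
    return acc
```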

Key Findings

  1. Large human–machine gap: Even GPT-5 and Gemini-2.5-pro lag behind humans by more than 30 percentage points on spatial, temporal, and viewpoint reasoning.
  2. Limited effect of prompting strategies: Techniques such as CoT, grid layout, and image overlay yield only marginal improvements.
  3. High sensitivity to difficulty factors: Object size and quantity significantly affect VLM performance.
  4. Large open-source vs. closed-source gap: the weakest open-source model evaluated, LLaVA-NeXT-7B, scores at the level of the random baseline (43.6 vs. 43.3).
  5. Emotion recognition as a relative strength: Qwen2.5-VL-7B achieves 87.8 on Emotion, approaching human-level performance.

Prompting Strategy Analysis

| Strategy | Effect |
| --- | --- |
| Chain-of-Thought | Marginal improvement |
| Two-step reasoning | Limited gains |
| Grid overlay | Slight help |
| Pixel difference highlighting | Effective for certain types |
| Horizontal concatenation | Inconsistent results |
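
Two of these strategies are purely image-side preprocessing. A minimal sketch with Pillow and NumPy; the threshold, highlight color, and alignment assumption are arbitrary illustrative choices, not the paper's exact settings:

```python
import numpy as np
from PIL import Image

def hconcat(img_a: Image.Image, img_b: Image.Image) -> Image.Image:
    """Horizontal concatenation: paste the two images side by side."""
    h = max(img_a.height, img_b.height)
    canvas = Image.new("RGB", (img_a.width + img_b.width, h), "white")
    canvas.paste(img_a, (0, 0))
    canvas.paste(img_b, (img_a.width, 0))
    return canvas

def highlight_pixel_diff(img_a: Image.Image, img_b: Image.Image,
                         threshold: int = 30) -> Image.Image:
    """Pixel-difference highlighting: tint changed pixels red on image B.

    Assumes the two images are aligned and the same size; the threshold
    is an arbitrary per-pixel intensity cutoff.
    """
    a = np.asarray(img_a.convert("RGB"), dtype=np.int16)
    b = np.asarray(img_b.convert("RGB"), dtype=np.int16)
    mask = np.abs(a - b).max(axis=-1) > threshold  # changed pixels
    out = b.astype(np.uint8).copy()
    out[mask] = [255, 0, 0]                        # paint them red
    return Image.fromarray(out)
```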

Comparison with MLLM-CompBench

Image pairs in VLM-SubtleBench exhibit substantially higher DINOv3 similarity than those in MLLM-CompBench (>0.8 vs. <0.6), confirming the greater subtlety of the differences.
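
This similarity filter can be reproduced in spirit by embedding both images with a DINO-family encoder and taking the cosine similarity of the pooled features. A sketch using a DINOv2 checkpoint from Hugging Face transformers as a stand-in, since the paper reports DINOv3 similarity and the exact checkpoint is not given here:

```python
import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image

# DINOv2 used as a stand-in; the paper measures DINOv3 similarity.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def dino_similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity between pooled DINO embeddings of two images."""
    images = [Image.open(p).convert("RGB") for p in (path_a, path_b)]
    inputs = processor(images=images, return_tensors="pt")
    feats = model(**inputs).pooler_output              # shape (2, hidden)
    feats = torch.nn.functional.normalize(feats, dim=-1)
    return float(feats[0] @ feats[1])

# In VLM-SubtleBench's terms, pairs above ~0.8 count as subtle.
```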

Highlights & Insights

  1. Fills an important gap: The first comprehensive benchmark focused on subtle difference comparative reasoning.
  2. Multi-domain coverage: The only comparative reasoning benchmark that simultaneously covers specialized domains including industrial, medical, and aerial imagery.
  3. Systematic analysis: In-depth ablation studies on prompting strategies and difficulty factors.
  4. High practical value: Directly targets critical weaknesses of VLMs in real-world applications.

Limitations & Future Work

  1. Some image pairs for certain difference types are generated through editing, which may introduce unnatural artifacts.
  2. The medical domain covers only chest X-rays; domain coverage could be further expanded.
  3. The human baseline is based on 10% sampling, which may lack statistical robustness.
  4. Synthetic primitive scenes are relatively simple and do not fully reflect the complexity of real-world applications.
  5. The evaluation focuses solely on final answer correctness, without in-depth analysis of the reasoning process.

Related Work

  • Multi-image benchmarks: BLINK (Fu et al., 2024) evaluates low-level visual perception; MuirBench (Wang et al., 2025) covers 12 types of multi-image tasks.
  • Comparative reasoning benchmarks: MLLM-CompBench (Kil et al., 2024) evaluates 8 difference types but with conspicuous differences.
  • Difference description: Img-Diff, OneDiff, DiffTell, and others focus on difference captioning.
  • Domain-specific: MIMIC-Diff-VQA (medical), GeoBench (remote sensing).

Rating

  • Novelty: ⭐⭐⭐⭐ — Focusing on subtle difference comparative reasoning represents a novel perspective.
  • Practicality: ⭐⭐⭐⭐⭐ — Directly serves high-value evaluation scenarios such as industrial inspection and medical diagnosis.
  • Clarity: ⭐⭐⭐⭐ — Benchmark design and experimental analysis are clear and systematic.
  • Significance: ⭐⭐⭐⭐ — Reveals fundamental deficiencies of VLMs in fine-grained visual reasoning.