HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?
Conference: ICCV 2025
arXiv: 2504.18406
Code: Project Page
Area: Multimodal VLM
Keywords: High-Resolution Image Understanding, VLM Benchmark, Vision-Language Models, Multimodal Evaluation, Needle-in-a-Haystack
TL;DR
This paper introduces HRScene, a benchmark covering 25 real-world scenarios and 2 diagnostic datasets, with resolutions spanning 1K to 35K. An evaluation of 28 VLMs shows an average accuracy of only ~50% on the real-world high-resolution tasks (the best model reaches ~62%), along with significant regional performance divergence and a pronounced lost-in-middle problem.
Background & Motivation
High-resolution image (HRI) understanding is critical in domains such as pathology, autonomous driving, and document understanding. Although VLMs such as Gemini, Claude, and GPT claim to support high-resolution inputs, a serious evaluation gap persists:
- Lack of benchmarks: Benchmarks used in mainstream VLM reports (MMMU, VQAv2, AI2D, etc.) have average resolutions below 1K, making them unsuitable for HRI evaluation.
- Narrow scenario coverage: Existing HRI datasets focus on specific scenarios (e.g., long-range imagery) or specific resolutions (e.g., 8K).
- Insufficient diagnostics: Existing multimodal needle-in-a-haystack (NIAH) tests primarily address long-context text or low-resolution multi-image settings, lacking diagnostics for regional utilization in HRI.
Goal: To construct a unified, comprehensive, and practical HRI benchmark that systematically evaluates VLMs' high-resolution understanding capabilities and identifies their core deficiencies.
Method

Overall Architecture

HRScene consists of two main components:

- 25 real-world scenario datasets: resolutions from 1K to 35K, spanning 8 major categories.
- 2 synthetic diagnostic datasets: designed to precisely localize VLM deficiencies.
Key Designs

- Taxonomy:
  - 8 major categories: daily photos, urban planning, scanned documents, artwork, sub-images, remote sensing, medical diagnosis, and research understanding.
  - 25 specific scenarios: ranging from microscopy to radio telescopes, covering diverse camera types.
  - Multiple capability tests: counting, temporal/semantic reasoning, holistic judgment, visual retrieval, spatial relationships, and small-object detection.
  - 6 datasets require domain expert knowledge; the other 19 belong to the general domain.
- Data Collection and Re-annotation:
  - Data collected from 25 existing sources; 8 datasets re-annotated by 10 graduate-level annotators.
  - All images have resolution ≥ 1024×1024.
  - Distractor options constructed for 6 datasets (at least 4 options per sample).
  - Numerical answer options generated automatically via random offsets (see the sketch after this list).
  - An additional 750-sample human performance subset collected as an upper bound.
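A minimal sketch of the random-offset idea for numerical options. The paper only states that numerical options are generated via random offsets; the offset range, integer arithmetic, and function name here are assumptions for illustration.

```python
import random

def numeric_distractors(answer: int, n_options: int = 4, seed: int = 0) -> list[int]:
    """Build a multiple-choice option set around a numerical ground truth
    by adding random offsets (the offset range is an assumed design choice)."""
    rng = random.Random(seed)
    options = {answer}
    scale = max(abs(answer), 2)  # keep offsets meaningful for small answers
    while len(options) < n_options:
        offset = rng.randint(1, scale // 2 + 1) * rng.choice([-1, 1])
        options.add(answer + offset)
    shuffled = list(options)
    rng.shuffle(shuffled)
    return shuffled

# Example: four options for a counting question whose answer is 37.
print(numeric_distractors(37))  # e.g., [42, 28, 37, 31]
```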
- WhiteBackground NIAH Diagnostic:
  - VQAv2 images (the needle) are placed at varying row/column positions in an N×N white grid (the haystack).
  - Evaluates performance variation across spatial positions to detect regional divergence.
  - Grid sizes range from 1×1 to 10×10 (construction sketched below).
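A minimal sketch of how such a haystack could be assembled with PIL. The N×N white canvas and the swept (row, column) needle position follow the paper's description; the cell size (512 px) and the resize policy are assumptions.

```python
from PIL import Image

def white_background_haystack(needle: Image.Image, n: int, row: int, col: int,
                              cell: int = 512) -> Image.Image:
    """Paste one needle image into an n x n all-white grid at (row, col)."""
    canvas = Image.new("RGB", (n * cell, n * cell), "white")  # the haystack
    canvas.paste(needle.resize((cell, cell)), (col * cell, row * cell))
    return canvas

# Example: a 5x5 haystack with the needle at the center cell (row=2, col=2),
# the kind of placement where lost-in-middle shows up.
# grid = white_background_haystack(Image.open("vqav2_sample.jpg"), 5, 2, 2)
```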
- ComplexGrid NIAH Diagnostic:
  - Visually similar distractors, retrieved via image-retrieval tools, are combined with the needle into a larger grid.
  - The model must identify the row and column of the needle.
  - Evaluates VLMs' ability to retrieve the correct image among multiple hard distractors (see the sketch below).
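A companion sketch for ComplexGrid, under the same cell-size assumption. Distractor retrieval itself (the paper uses image-retrieval tools to find visually similar images) is out of scope here; the sketch only shows the tiling and the ground-truth (row, column) label.

```python
import random
from PIL import Image

def complex_grid(needle: Image.Image, distractors: list[Image.Image],
                 n: int, cell: int = 512, seed: int = 0):
    """Tile the needle among n*n - 1 visually similar distractors and
    return the composite plus the needle's 1-indexed (row, col)."""
    rng = random.Random(seed)
    assert len(distractors) >= n * n - 1, "need n*n - 1 distractors"
    target = rng.randrange(n * n)            # flat index of the needle
    canvas = Image.new("RGB", (n * cell, n * cell))
    pool = iter(distractors)
    for idx in range(n * n):
        img = needle if idx == target else next(pool)
        r, c = divmod(idx, n)
        canvas.paste(img.resize((cell, cell)), (c * cell, r * cell))
    row, col = divmod(target, n)
    # The VLM is then asked to report the needle's row and column.
    return canvas, (row + 1, col + 1)
```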
Dataset Scale and Splits
| Statistic | Count |
|---|---|
| Total samples | 7,068 |
| Re-annotated | 2,005 |
| Annotated from scratch | 384 |
| val | 750 (the human-annotated subset) |
| testmini | 1,000 |
| test | 5,323 |
Key Experimental Results

Main Results (Tables)
Overall Performance on Real-World Datasets:
| Model | Art | Daily | Medical | Paper | Remote | Research | Sub-Img | Urban | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2-VL 7B | 69.46 | 64.20 | 40.40 | 64.62 | 50.60 | 36.69 | 71.42 | 40.17 | 56.65 |
| InternVL2 40B | 74.35 | 62.67 | 38.10 | 70.89 | 44.16 | 43.15 | 74.10 | 44.40 | 58.45 |
| Qwen2-VL 72B | 75.85 | 66.20 | 43.69 | 78.13 | 52.48 | 39.36 | 74.89 | 44.66 | 61.85 |
| Gemini2.0 Flash | 76.46 | 62.27 | 51.94 | 75.12 | 47.59 | 34.85 | 68.62 | 44.54 | 59.82 |
| GPT-4o | 69.13 | 55.90 | 22.63 | 66.80 | 44.05 | 35.38 | 65.13 | 41.72 | 52.91 |
| Human | 75.33 | 77.75 | 23.81 | 88.75 | 58.33 | 48.50 | 90.00 | 55.25 | 64.72 |
| 28-model Avg | 61.54 | 53.18 | 36.64 | 58.17 | 41.75 | 36.08 | 60.60 | 37.84 | 49.68 |
Ablation Study (Tables)
WhiteBackground NIAH Diagnostic — Regional Divergence Analysis:
| Model | 1×1 Perf | 3×3 Perf | 3×3 Region↓ | 5×5 Perf | 5×5 Region↓ | 10×10 Perf | 10×10 Region↓ |
|---|---|---|---|---|---|---|---|
| Qwen2-VL 7B | 85.93 | 84.22 | 5.30 | 83.14 | 6.52 | 79.91 | 10.56 |
| Qwen2-VL 72B | 84.13 | 84.51 | 5.62 | 84.04 | 6.62 | 84.56 | 9.61 |
| GPT-4o-mini | 68.66 | 60.69 | 13.77 | 52.53 | 19.59 | 32.94 | 33.65 |
| DeepSeek-VL2 | 72.06 | 49.71 | 15.75 | 34.29 | 23.37 | 23.95 | 23.30 |
| InternVL2 40B | 84.53 | 83.42 | 4.57 | 80.02 | 8.84 | 74.95 | 13.18 |
(Perf = accuracy (%) averaged over all needle positions; Region = standard deviation of accuracy across grid positions, so lower is better. A minimal computation sketch follows.)
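A minimal sketch of the Region metric, assuming it is a (population) standard deviation over per-position accuracies; the toy grid is illustrative, not the paper's data.

```python
import numpy as np

def region_divergence(acc_grid: np.ndarray) -> float:
    """Region metric: std of accuracy (%) across needle positions; lower is better."""
    return float(np.std(acc_grid))

# Toy 3x3 grid whose center cell lags, mimicking lost-in-middle:
acc = np.array([[85.0, 84.0, 86.0],
                [83.0, 70.0, 84.0],
                [86.0, 85.0, 85.0]])
print(round(region_divergence(acc), 2))  # ~4.72
```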
Key Findings
- Significant overall gap: The average accuracy of 28 VLMs is only 49.68%; even the strongest model, Qwen2-VL 72B, reaches only 61.85%.
- Large category disparities: Medical (36.64%) and Research (36.08%) are the weakest categories, while Paper (58.17%) and Art (61.54%) perform relatively better.
- Human vs. model: Human average is 64.72%, but only 23.81% on Medical (requiring expert knowledge); humans reach 90% on Sub-Img while models average only 60%.
- Regional Divergence: As grid size increases, most models degrade substantially (e.g., GPT-4o-mini drops from 68.66% at 1×1 to 32.94% at 10×10), whereas Qwen2-VL 72B remains nearly unaffected.
- Scale is not always decisive: Qwen2-VL 7B outperforms many larger models (e.g., LLaVA-Next 34B) on most metrics.
- Lost-in-middle phenomenon: VLMs recognize images placed at central grid positions less accurately than those at peripheral positions.
Highlights & Insights
- HRScene is the most comprehensive HRI benchmark to date: 25 scenarios, resolutions spanning 1K to 35K, covering both expert and general domains.
- The two diagnostic datasets are elegantly designed, quantitatively exposing two core VLM deficiencies: regional divergence and the lost-in-middle effect.
- Qwen2-VL 72B shows near-zero performance degradation with increasing resolution in the WhiteBackground NIAH (Region metric consistently below 10), suggesting a superior internal resolution processing strategy.
- Human performance on Medical is only 23.81%, yet the model average is 36.64%, indicating that models have already surpassed non-expert humans in certain specialist domains.
Limitations & Future Work
- The benchmark is predominantly multiple-choice, offering limited evaluation of open-ended responses.
- Diagnostic datasets rely on white backgrounds and visually similar distractors; more naturalistic composite scenes could further challenge models.
- Some datasets have relatively small sample sizes (e.g., Galaxy and Grass), warranting stronger statistical validation.
- High-resolution video understanding (e.g., high-resolution video frame sequences) is not addressed.
- Test set answers are not publicly released and require online platform submission, which may hinder rapid iterative research.
Related Work & Insights
- HRScene complements MME-Realworld and HR-Bench by providing broader scenario coverage.
- Extending NIAH testing from text/multi-image settings to single high-resolution images is a natural and important generalization.
- Qwen2-VL's dynamic-resolution processing strategy (mapping native-resolution inputs to a variable number of visual tokens) warrants further investigation, given its markedly superior resistance to resolution-induced degradation.
- Comparative analysis of VLM high-resolution processing architectures (dual-encoder vs. tiling strategies) offers useful guidance for model design; a generic tiling sketch follows.
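For concreteness, a simplified, generic illustration of the tiling idea (in the spirit of LLaVA-NeXT-style AnyRes or InternVL-style dynamic tiling), not any specific model's implementation; the tile size, per-side cap, and resize policy are all assumptions.

```python
from PIL import Image

def tile_image(img: Image.Image, tile: int = 336, max_side_tiles: int = 8) -> list[Image.Image]:
    """Cut a high-resolution image into fixed-size crops that a vision
    encoder processes independently (typically alongside a global thumbnail)."""
    w, h = img.size
    cols = min(max(1, round(w / tile)), max_side_tiles)
    rows = min(max(1, round(h / tile)), max_side_tiles)
    # Resize so the image exactly covers the chosen grid (a common simplification).
    img = img.resize((cols * tile, rows * tile))
    return [img.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
            for r in range(rows) for c in range(cols)]

# A 3840x2160 image yields an 8x6 grid of 336 px tiles here (after the
# per-side cap); tile and token counts are exactly what explodes at 35K inputs.
```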
Rating
- Novelty: ⭐⭐⭐⭐ First large-scale unified HRI benchmark with cleverly designed diagnostic datasets.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 28 models, 27 datasets, human performance comparison, and diagnostic analysis.
- Writing Quality: ⭐⭐⭐⭐ Clear taxonomy and precise articulation of identified issues.
- Value: ⭐⭐⭐⭐⭐ Provides a much-needed systematic evaluation tool for high-resolution VLM understanding.