
ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps

Conference: CVPR 2026 | arXiv: 2505.18675 | Code: Project Page | Area: Multimodal VLM | Keywords: Visual reasoning, transit maps, MLLM evaluation, spatial reasoning, reinforcement fine-tuning

TL;DR

This paper introduces the ReasonMap benchmark, built from high-resolution transit maps of 30 cities and comprising 1,008 QA pairs, to systematically evaluate the fine-grained visual understanding and spatial reasoning capabilities of 16 MLLMs. The work reveals a counter-intuitive phenomenon, namely that base variants of open-source models consistently outperform their reasoning counterparts, and establishes a GRPO-based reinforcement fine-tuning baseline.

Background & Motivation

Existing MLLM reasoning benchmarks exhibit notable blind spots:

  • Math/Logic benchmarks (MathVQA, MMMU, MathVerse): visual understanding plays a limited role.
  • Fine-grained visual benchmarks (VBench, VisualPuzzles): require detailed perception but rarely involve spatial planning and reasoning.
  • Spatial reasoning benchmarks (CityBench, MapBench): relatively coarse-grained and often rely on external tools (map APIs) to bypass genuine visual reasoning.

Core Problem: Benchmarks that simultaneously require fine-grained visual understanding (identifying station names, line colors/numbers) and spatial reasoning (planning transfer routes) remain absent.

Transit maps serve as an ideal test medium — they are information-dense, structured, require precise spatial interpretation, and are closely tied to real-world applications (navigation, urban planning).

Method

Overall Architecture

The ReasonMap construction pipeline consists of three stages:

  1. Data collection and preprocessing: High-resolution transit maps from 30 cities across 13 countries are collected; line and station information is extracted via MLLM + human correction and standardized into JSON (Metro Data; a data-format sketch follows this list).
  2. QA pair construction: Two stations are randomly selected from each map to generate short-form questions (fixed template) and long-form questions (two templates); reference routes are collected via Google Maps/Amap APIs.
  3. Quality control: Correctness verification, diversity assurance, and difficulty balancing (map difficulty: easy/medium/hard, 10 maps each; question difficulty stratified by number of transfers).
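For concreteness, below is a minimal sketch of what one standardized Metro Data record and one derived QA pair might look like. The field names and example values are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical illustration of one standardized "Metro Data" record; the field
# names and example values are assumptions, not the dataset's actual schema.
metro_data = {
    "city": "Beijing",
    "country": "China",
    "difficulty": "hard",                 # map-level difficulty: easy / medium / hard
    "lines": [
        {
            "name": "Line 1",
            "color": "red",
            "stations": ["Pingguoyuan", "Gucheng", "..."],
        },
    ],
}

# A QA pair then links two randomly selected stations on the same map with a
# reference route retrieved from a routing API (Google Maps / Amap in the paper).
qa_pair = {
    "question_type": "short",             # short-form (fixed template) or long-form
    "source": "Pingguoyuan",
    "target": "Gucheng",
    "reference_route": [
        {"line": "Line 1", "board": "Pingguoyuan", "alight": "Gucheng"},
    ],
    "num_transfers": 0,                   # used to stratify question difficulty
}
```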

Key Designs

  1. Two-tier evaluation framework (a scoring sketch follows this list):
     • Correctness evaluation (Accuracy): Verifies consistency of the departure/arrival stations, line names, and intermediate stops in the answer; all checks must pass for a response to be considered correct.
     • Quality evaluation (Map Score): Assesses route quality even when the answer is not fully correct; awards 1 point for matching station names and 2 points for matching line names, plus credit from an intermediate-stop count comparison or set IoU, with the maximum score capped per question type. Correct answers receive additional bonus points, ensuring correct responses always score higher than incorrect ones.

  2. Difficulty-aware weighting: Evaluation metrics incorporate difficulty weighting, assigning greater weight to harder samples to prevent models from achieving inflated scores by solving only easy instances.

  3. GRPO reinforcement fine-tuning baseline: Based on Qwen2.5-VL-3B/7B-Instruct, the paper designs an accuracy reward (a binary signal from the correctness evaluation) and a format reward (encouraging parseable outputs), with generalization validated under a cross-city setting.
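The following is a minimal sketch of how the two-tier evaluation and difficulty weighting could fit together. The point values follow the description above, but the capping and bonus rules are simplified, and the function names, data shapes, and difficulty weights are assumptions rather than the paper's exact implementation.

```python
def correctness(pred: dict, ref: dict) -> bool:
    """Correctness check: departure/arrival stations, line names, and
    intermediate stops must all match for the answer to count as correct."""
    return (
        pred["source"] == ref["source"]
        and pred["target"] == ref["target"]
        and pred["lines"] == ref["lines"]
        and pred["stops"] == ref["stops"]
    )


def map_score(pred: dict, ref: dict, cap: float = 5.0, bonus: float = 1.0) -> float:
    """Quality evaluation (Map Score): partial credit even for incorrect answers."""
    score = 0.0
    # 1 point per matching endpoint station name
    score += (pred["source"] == ref["source"]) + (pred["target"] == ref["target"])
    # 2 points per matching line name
    score += 2.0 * len(set(pred["lines"]) & set(ref["lines"]))
    # intermediate stops compared via set IoU
    union = set(pred["stops"]) | set(ref["stops"])
    if union:
        score += len(set(pred["stops"]) & set(ref["stops"])) / len(union)
    score = min(score, cap)          # cap depends on question type in the paper
    if correctness(pred, ref):
        score += bonus               # bonus keeps correct answers above incorrect ones
    return score


def weighted_accuracy(samples: list) -> float:
    """Difficulty-aware accuracy: harder samples receive larger weights."""
    weights = {"easy": 1.0, "medium": 2.0, "hard": 3.0}   # assumed weight values
    total = sum(weights[s["difficulty"]] for s in samples)
    correct = sum(weights[s["difficulty"]] for s in samples
                  if correctness(s["pred"], s["ref"]))
    return correct / total if total else 0.0
```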

Loss & Training

  • GRPO optimization: AdamW, initial learning rate \(1.0 \times 10^{-6}\), KL divergence coefficient \(1.0 \times 10^{-3}\)
  • 8 responses sampled per query, global batch size 16
  • Training and test cities are completely non-overlapping (cross-city generalization validation); a reward and configuration sketch follows this list.
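Below is a hedged sketch of the two reward signals and the reported hyperparameters. The <answer>-tag format and the function names are assumptions; only the hyperparameter values come from the paper.

```python
import re


def format_reward(response: str) -> float:
    """Format reward: encourage outputs the evaluator can parse.
    The <answer></answer> tag scheme here is an assumption; the paper only
    states that parseable outputs are rewarded."""
    return 1.0 if re.search(r"<answer>.*?</answer>", response, re.S) else 0.0


def accuracy_reward(is_correct: bool) -> float:
    """Accuracy reward: binary signal from the correctness check of the
    two-tier evaluation (see the scoring sketch above)."""
    return 1.0 if is_correct else 0.0


# Hyperparameters reported for the GRPO baseline (Qwen2.5-VL-3B/7B-Instruct):
GRPO_CONFIG = {
    "optimizer": "AdamW",
    "learning_rate": 1.0e-6,    # initial learning rate
    "kl_coeff": 1.0e-3,         # KL divergence coefficient
    "num_generations": 8,       # responses sampled per query
    "global_batch_size": 16,
}
```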

Key Experimental Results

Main Results

Performance on ReasonMap (weighted accuracy); a representative subset of the 16 evaluated MLLMs is shown:

| Model | Type | Short Acc | Long Acc | Map Score (S/L) |
|---|---|---|---|---|
| OpenAI o3 | Closed-source Reasoning | 63.02% | 59.11% | 9.53 / 17.96 |
| Gemini-2.5-Flash | Closed-source Reasoning | 46.09% | 29.86% | 7.64 / 9.98 |
| Doubao-415 | Closed-source Reasoning | 43.14% | 46.09% | 7.33 / 14.67 |
| OpenAI 4o | Closed-source Base | 41.15% | 42.80% | 6.84 / 13.57 |
| Qwen2.5-VL-72B | Open-source Base | 26.65% | 24.22% | 5.09 / 8.80 |
| InternVL3-78B | Open-source Base | 25.35% | 19.62% | 4.80 / 7.50 |
| QvQ-72B-Preview | Open-source Reasoning | 9.03% | 4.25% | 1.59 / 1.55 |
| Skywork-R1V | Open-source Reasoning | 6.86% | 3.21% | 2.11 / 3.11 |

Ablation Study

GRPO reinforcement fine-tuning effectiveness (cross-city generalization):

| Model | Short Acc | Long Acc | Map Score (S/L) |
|---|---|---|---|
| Qwen2.5-VL-3B | 8.68% | 7.99% | 2.75 / 3.70 |
| +RL | 11.46% (↑2.78) | 10.50% (↑2.51) | 3.81 / 6.09 |
| Qwen2.5-VL-7B | 13.28% | 7.12% | 4.01 / 5.74 |
| +RL | 26.22% (↑12.94) | 26.04% (↑18.92) | 5.52 / 9.52 |

Visual masking experiment (text-only input):

  • Most models show significant performance degradation (Qwen2.5-VL-72B: 26.65% → 16.41%; Doubao-415: 43.14% → 21.53%).
  • Smaller models (Qwen2.5-VL-3B) show a slight improvement (8.68% → 9.38%), suggesting greater reliance on prior knowledge rather than genuine visual reasoning.

Key Findings

  • Counter-intuitive phenomenon: Among open-source models, base variants consistently outperform reasoning variants (e.g., Qwen2.5-VL-72B 26.65% vs. QvQ-72B 9.03%), whereas closed-source reasoning variants outperform their base counterparts (o3 63.02% vs. 4o 41.15%).
  • Root cause analysis: Open-source reasoning models tend to introduce "visual confusion" during repeated self-verification — correctly identifying a route initially but overwriting it with an incorrect answer during self-reflection. Closed-source reasoning models possess stronger visual grounding, enabling self-correction within the reasoning chain.
  • Model scaling laws remain valid: Larger models within the same family achieve higher accuracy with fewer tokens.
  • The 7B model yields the largest gain after reinforcement fine-tuning (+18.92%), with a concurrent reduction in token usage.

Highlights & Insights

  • Exposing MLLM blind spots: This work is the first to systematically demonstrate the severe deficiencies of current MLLMs on spatial reasoning tasks that require genuine visual grounding.
  • The base vs. reasoning reversal phenomenon provides an important clue for understanding the effect of RL fine-tuning on visual reasoning.
  • Refined evaluation framework design: The two-tier evaluation separating correctness from quality, combined with difficulty-aware weighting, is more informative than simple answer comparison.
  • High-resolution challenge: The average map resolution of 5839×5449 far exceeds typical VQA benchmarks, testing models' ability to process information-dense visual inputs.

Limitations & Future Work

  • The data scale is relatively limited (1,008 QA pairs / 30 cities); extending to more cities and transportation modes would enhance generalization evaluation.
  • Only metro/light rail systems are evaluated, excluding buses, walking, and other multimodal transportation.
  • Station name languages in certain cities may affect model OCR performance, though this has not been rigorously quantified.
  • Even the strongest closed-source model (o3) achieves only 63% accuracy, indicating high task difficulty but also potentially suggesting ambiguity in some data instances.
  • Reinforcement fine-tuning is validated only on 3B/7B models; the benefits for larger models remain unknown.
Related Work & Outlook

  • Comparison with MapBench/CityBench: These benchmarks are coarser-grained or rely on external APIs; ReasonMap requires purely visual reasoning.
  • Comparison with MathVerse: MathVerse reinforces visual dependency by generating multiple visual/textual variants; ReasonMap achieves this naturally through information-dense high-resolution maps.
  • RL fine-tuning trend: The success of GRPO in textual reasoning is extending to multimodal reasoning; ReasonMap provides an effective training and evaluation scenario.
  • Inspiration: The benchmark design methodology is transferable to domains such as architectural floor plan understanding and circuit diagram reasoning, which similarly require fine-grained visual perception combined with spatial reasoning.

Rating

  • Novelty: ⭐⭐⭐⭐ — Using transit maps as a visual reasoning testbed is a creative choice with a well-designed evaluation framework, though the benchmark construction methodology itself is not entirely novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Systematic evaluation of 16 models + visual masking ablation + RL training baseline + detailed error analysis.
  • Writing Quality: ⭐⭐⭐⭐ — Well-structured with clearly articulated findings and information-rich tables.
  • Value: ⭐⭐⭐⭐ — Reveals critical shortcomings of MLLMs in fine-grained visual reasoning and provides the community with a valuable evaluation tool.