RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning

Conference: ICLR 2026
arXiv: 2510.02240
Code: Project Page
Area: Reinforcement Learning
Keywords: Multimodal Large Language Models, Visual Reasoning, Sparse Rewards, Multi-Stage RL, Metro Route Planning

TL;DR

This paper proposes the RewardMap framework, which addresses the sparse reward problem in fine-grained visual reasoning through difficulty-aware detail reward design and a multi-stage RL curriculum that progresses from simple perception to complex reasoning.

Background & Motivation

Fine-grained visual reasoning (e.g., metro route planning) poses a core challenge for multimodal large language models (MLLMs). The ReasonMap benchmark reveals that even state-of-the-art MLLMs struggle with spatial reasoning in structured, information-dense visual scenes.

Directly applying standard RL methods (e.g., GRPO) to such complex tasks runs into a sparse reward bottleneck:
- Success signals are provided only at the end of long reasoning chains (final answer correct or incorrect).
- Task difficulty further amplifies sparsity: most sampled rollouts yield reward \(r_i \approx 0\).
- In GRPO, when every rollout in a group receives the same reward (for example, when all samples fail), the group-relative advantage \(\hat{A}_i\) collapses to zero, so the gradient signal is weak and convergence is difficult.
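The vanishing-advantage effect described above can be illustrated with a minimal sketch. It assumes the common GRPO-style normalization (reward minus group mean, divided by group standard deviation); the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are illustrative choices, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts sampled for the same prompt.

    Assumption: advantage = (r_i - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout fails: all rewards are equal, so every advantage is zero
# and the group contributes no gradient.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A single success is enough to produce a non-zero learning signal.
one_success = group_relative_advantages([0.0, 0.0, 0.0, 1.0])

# Partial credit differentiates rollouts even when none is fully correct.
partial_credit = group_relative_advantages([0.1, 0.4, 0.0, 0.25])

print(all_failed)
print(one_success)
print(partial_credit)
```

The third case is the one that dense or partial-credit rewards aim for: the rollouts are ranked against each other even though no final answer is right.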

Conventional SFT provides dense supervision but cannot equip models with long-chain decision-making capabilities. The root cause is a mismatch between task complexity and the density of supervision signals.

The paper's starting point is two-fold: (1) constructing the ReasonMap-Plus dataset as a dense-reward cold-start source; and (2) designing a multi-stage RL curriculum progressing from easy to hard, transitioning from perception to reasoning.

Method

Overall Architecture

RewardMap consists of two core components: (1) difficulty-aware reward design that augments format and correctness rewards with detail rewards; and (2) a multi-stage GRPO training curriculum that progresses from simple VQA to complex route planning.

Key Designs

ReasonMap-Plus Dataset Construction:
- Function: Constructs 4,018 VQA questions covering 5 extended question types across 30 cities in 13 countries.
- Mechanism: Designs three major question categories (global counting, local counting, and judgment), with answers generated automatically from Metro Data.
- Design Motivation: VQA questions are simple and reward-dense, making them suitable as an RL cold-start to develop the model's foundational visual understanding.
Difficulty-Aware Detail Rewards:
- Function: Introduces partial-credit rewards on top of correctness rewards.
- Mechanism: The total reward is \(R = W_{\text{difficulty}}(R_{\text{format}} + R_{\text{correctness}} + \alpha \times R_{\text{detail}})\), where:
  - Detail rewards assign credit or penalty separately for start/end stations, line names, transfer stations, and segment counts.
  - The difficulty weight is \(W_{\text{difficulty}} = W_{\text{map}} + W_{\text{question}}\), combining map complexity with the number of transfers.
- Design Motivation: Alleviates sparse rewards in planning tasks, enabling learning from partially correct information even when the final answer is wrong.
Multi-Stage GRPO Curriculum:
- Function: Divides training into multiple stages following a global curriculum principle.
- Mechanism: Judgment questions → Counting questions → Planning questions (visual understanding → visual reasoning). Samples are shuffled randomly within each stage.
- Design Motivation: (1) Lower-level tasks have dense rewards, supporting effective cold-start; (2) Gradually bridging perception and reasoning avoids training collapse when facing difficult tasks directly; (3) Local randomness prevents overfitting to a fixed curriculum trajectory.
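The difficulty-aware reward from the list above can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the `Route` container, the `check` helper, the credit and penalty magnitudes, the exact-match correctness test, and the default `alpha` are all hypothetical. Only the overall form \(R = W_{\text{difficulty}}(R_{\text{format}} + R_{\text{correctness}} + \alpha \times R_{\text{detail}})\) with \(W_{\text{difficulty}} = W_{\text{map}} + W_{\text{question}}\) comes from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    # Hypothetical parsed answer: one (line, board_station, alight_station) per segment.
    segments: list = field(default_factory=list)

    @property
    def origin(self):
        return self.segments[0][1] if self.segments else None

    @property
    def destination(self):
        return self.segments[-1][2] if self.segments else None

    @property
    def lines(self):
        return [seg[0] for seg in self.segments]

    @property
    def transfers(self):
        # Stations where the rider changes lines: the end of every segment but the last.
        return [seg[2] for seg in self.segments[:-1]]


def check(ok, credit=1.0, penalty=-0.5):
    """Illustrative credit/penalty magnitudes (assumed, not from the paper)."""
    return credit if ok else penalty


def detail_reward(pred, gold):
    """Partial credit for each independently verifiable component of the route."""
    return (
        check(pred.origin == gold.origin)
        + check(pred.destination == gold.destination)
        + check(pred.lines == gold.lines)
        + check(pred.transfers == gold.transfers)
        + check(len(pred.segments) == len(gold.segments))
    )


def total_reward(pred, gold, well_formatted, w_map, w_question, alpha=0.5):
    """R = W_difficulty * (R_format + R_correctness + alpha * R_detail)."""
    r_format = 1.0 if well_formatted else 0.0
    r_correct = 1.0 if pred.segments == gold.segments else 0.0  # assumed exact match
    w_difficulty = w_map + w_question
    return w_difficulty * (r_format + r_correct + alpha * detail_reward(pred, gold))


gold = Route([("Line 1", "A", "C"), ("Line 2", "C", "F")])
wrong_line = Route([("Line 1", "A", "C"), ("Line 3", "C", "F")])

# The final answer is wrong, yet the reward is non-zero because most components match.
print(total_reward(wrong_line, gold, True, w_map=1.0, w_question=0.5))
print(total_reward(gold, gold, True, w_map=1.0, w_question=0.5))
```

Because each component is scored on its own, a rollout with one wrong line name still outranks a rollout that gets everything wrong, which is what gives GRPO a usable ranking within a group.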

Loss & Training

The standard policy gradient objective of GRPO is used, driven by group-relative advantages. The key distinctions lie in the reward function design (three-level rewards with difficulty weighting) and the data scheduling strategy (multi-stage curriculum). The cold-start phase uses RL directly rather than SFT, ensuring alignment between the reward signal and task objective from the outset.
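The data scheduling strategy can be sketched as a stage-ordered list with shuffling inside each stage. The stage order follows the text (judgment, then counting, then planning); the sample schema with a `"type"` key, the function name `build_curriculum`, the fixed seed, and the merging of global and local counting into a single stage are assumptions made for illustration.

```python
import random

# Global easy-to-hard order taken from the text; stage labels are assumed names.
STAGE_ORDER = ["judgment", "counting", "planning"]


def build_curriculum(samples, seed=0):
    """Return samples ordered by stage, shuffled randomly within each stage."""
    rng = random.Random(seed)
    schedule = []
    for stage in STAGE_ORDER:
        stage_samples = [s for s in samples if s["type"] == stage]
        rng.shuffle(stage_samples)  # local randomness inside a stage
        schedule.extend(stage_samples)
    return schedule


samples = [
    {"id": 0, "type": "planning"},
    {"id": 1, "type": "judgment"},
    {"id": 2, "type": "counting"},
    {"id": 3, "type": "judgment"},
    {"id": 4, "type": "planning"},
    {"id": 5, "type": "counting"},
]

for sample in build_curriculum(samples):
    print(sample["type"], sample["id"])
```

The global order is deterministic while the order inside a stage depends on the seed, which mirrors the combination of a fixed curriculum and local randomness described in the design motivation.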

Key Experimental Results

Main Results (Qwen2.5-VL-7B-Instruct)

| Method | ReasonMap Weighted Acc. (short / long questions) | ReasonMap-Plus Weighted Acc. |
| --- | --- | --- |
| Base Model | 13.28% / 7.12% | 44.21% |
| +RL (GRPO) | 26.22% / 26.04% | 44.64% |
| +RL (REINFORCE++) | 27.17% / 27.60% | - |
| +RewardMap (Full) | Best | Best |

Ablation Study

| Configuration | Key Metric | Note |
| --- | --- | --- |
| Format + Correctness reward only | Baseline performance | Difficult to learn under sparse rewards |
| + Detail reward | Significant improvement | Partial credit alleviates sparsity |
| + Difficulty weight | Further improvement | Hard examples contribute more learning signal |
| + Multi-stage curriculum | Best performance | Cold-start strategy is effective |

Key Findings

- Models trained with RewardMap achieve an average improvement of 3.47% across 6 external benchmarks, demonstrating strong generalization.
- RL cold-start outperforms SFT cold-start, avoiding the overfitting and cognitive rigidity induced by SFT.
- Among reference models, GPT-5 achieves 59.98%/62.50% on ReasonMap, reflecting the extreme difficulty of this task.
- Seed1.5-VL and GPT-4o achieve 73.58% and 64.42% on ReasonMap-Plus, respectively.

Highlights & Insights

- Valuable problem formulation: Metro route planning serves as a natural testbed for MLLM visual reasoning, combining practical utility with scientific significance.
- RL cold-start: Using RL rather than SFT for the cold start is an insightful design choice, since it avoids the mismatch between the SFT loss function and the reward signal.
- Elegant detail reward design: The structured nature of planning tasks (start station, end station, transfer stations, etc.) allows independent verification of sub-components, enabling principled reward decomposition.

Limitations & Future Work

- The detail reward design relies on task-specific structural information; adapting it to other visual reasoning tasks requires redesign.
- Hyperparameters for difficulty weighting (\(\gamma_e, \gamma_m, \gamma_h, \beta_0, \beta_1\)) must be specified in advance.
- Experiments are currently conducted only on the Qwen2.5-VL model family; generalization to other architectures remains unexplored.

- ReasonMap (Feng et al., 2025b) serves as the benchmark and data foundation for this work.
- GRPO (Shao et al., 2024) provides the RL optimization framework.
- Curriculum RL (Parashar et al., 2025) inspires the multi-stage design via its easy-to-hard strategy.
- Insight: For complex reasoning tasks, reward engineering may be more critical than algorithmic innovation.

Rating

- Novelty: ⭐⭐⭐⭐ The idea of using multi-stage RL cold-start in place of SFT is novel, though individual components are relatively standard.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-benchmark evaluation including external generalization, with ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Problem motivation is clear and framework diagrams are well-presented.
- Value: ⭐⭐⭐⭐ Provides a practical solution for RL training of MLLMs in visual reasoning.