
RegionReasoner: Region-Grounded Multi-Round Visual Reasoning

Conference: ICLR 2026 arXiv: 2602.03733 Code: RegionReasoner Area: Image Segmentation Keywords: multi-round reasoning, region grounding, reinforcement-learning, GRPO, VLM, referring segmentation

TL;DR

This paper proposes RegionReasoner, a reinforcement-learning-based framework for multi-round visual reasoning. It pairs a reference-annotation reward, which enforces explicit citation of reference-region coordinates in the reasoning trace, with a global-local consistency reward that keeps those traces semantically coherent, yielding significant gains in multi-round localization and segmentation accuracy on the newly constructed RegionDial-Bench.

Background & Motivation

Limitations of Prior Work

State of the field:

1. Existing VLM reasoning is primarily single-step or operates in pure text space, lacking iterative visual-context refinement.
2. VisionReasoner provides single-round structured reasoning but does not propagate region references across turns.
3. SegLLM supports multi-round interactive segmentation but lacks verifiable reasoning traces and RL signals.
4. Naively stacking single-round reasoning yields fragile reference propagation and hard-to-detect coordinate hallucinations.
5. As dialogue turns increase, semantic drift emerges between global descriptions and local evidence.
6. No prior benchmark evaluates multi-round reasoning precision and consistency.

Method

Structured Output: Each turn generates four tagged blocks: <scene>, <focus>, <think>, <answer>
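The four tags above can be pulled out of a raw model turn with a simple parser. A minimal sketch — the tag names follow the paper, but `parse_turn` and the example turn are illustrative:

```python
import re

# Tag names follow the paper; parse_turn and the example turn are hypothetical.
TAGS = ("scene", "focus", "think", "answer")

def parse_turn(output: str) -> dict:
    """Extract the content of each tagged block from one model turn."""
    blocks = {}
    for tag in TAGS:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", output, re.DOTALL)
        blocks[tag] = m.group(1).strip() if m else None
    return blocks

turn = ("<scene>a kitchen counter</scene><focus>the left mug</focus>"
        "<think>the mug at [120, 45, 210, 130] is left of the sink</think>"
        "<answer>[120, 45, 210, 130]</answer>")
print(parse_turn(turn)["answer"])  # [120, 45, 210, 130]
```

Because every block is machine-parsable, the reward functions below can be computed directly from the generated text.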

Reference-Grounded Thinking:

  • The reasoning trace <think> must explicitly cite reference bounding-box coordinates.
  • Reference reward \(R_{ref}\): a correct-citation score combined with a penalty for hallucinated coordinates (penalty weight \(\eta = 0.5\)).
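A minimal sketch of how such a reference reward could be computed. The citation-vs-hallucination decomposition and \(\eta = 0.5\) follow the description above; the IoU-based matching criterion, its threshold, the normalization, and the function names are assumptions:

```python
# Illustrative sketch, not the paper's exact formulation: cited boxes that
# match a ground-truth reference score +1, unmatched ("hallucinated") boxes
# are penalized with weight eta = 0.5, and the result is length-normalized.
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def reference_reward(cited_boxes, gt_boxes, eta=0.5, thr=0.5):
    """Correct-citation score minus eta per hallucinated coordinate."""
    if not cited_boxes:
        return 0.0
    matched = sum(1 for c in cited_boxes if any(iou(c, g) >= thr for g in gt_boxes))
    hallucinated = len(cited_boxes) - matched
    return (matched - eta * hallucinated) / len(cited_boxes)

# One correct citation, one hallucination: (1 - 0.5) / 2
print(reference_reward([(0, 0, 10, 10), (50, 50, 60, 60)], [(0, 0, 10, 10)]))  # 0.25
```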

Global-Local Consistency Reward:

  • Keywords are extracted from <scene> and <focus>, and their asymmetric overlap with <think> is computed.
  • Spatial, comparative, and localization vocabulary priors \(\ell(h_t)\) are incorporated.
  • \(R_{cons} = w_s \cdot \text{Ov}(s_t, h_t) + w_f \cdot \text{Ov}(f_t, h_t) + w_\ell \cdot \ell(h_t)\)
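The formula can be sketched as follows. The asymmetric-overlap structure and the shape of \(R_{cons}\) follow the paper; the toy keyword extraction, the stopword and spatial vocabularies, and the weight values are illustrative assumptions:

```python
# Illustrative sketch: naive tokenization stands in for the paper's
# lemmatization + stopword removal + noun filtering; vocabularies and
# weights w_s, w_f, w_l are made-up placeholders.
STOPWORDS = {"the", "a", "an", "of", "is", "on", "in", "and", "to"}
SPATIAL = {"left", "right", "above", "below", "inside", "overlap", "near", "behind"}

def keywords(text):
    return {w for w in text.lower().split() if w not in STOPWORDS}

def overlap(src, trace):
    """Asymmetric overlap Ov: fraction of src keywords found in the trace."""
    return len(src & trace) / len(src) if src else 0.0

def spatial_prior(trace):
    """l(h_t): bonus for spatial/comparative/localization vocabulary."""
    return min(1.0, len(trace & SPATIAL) / 2)

def consistency_reward(scene, focus, think, w_s=0.4, w_f=0.4, w_l=0.2):
    s, f, h = keywords(scene), keywords(focus), keywords(think)
    return w_s * overlap(s, h) + w_f * overlap(f, h) + w_l * spatial_prior(h)

r = consistency_reward("man near table", "red cup left side",
                       "red cup left of man near table")
print(r)  # 0.4*1.0 + 0.4*0.75 + 0.2*1.0 = 0.9
```

Because the overlap is asymmetric (scene/focus keywords checked against the trace, not vice versa), the trace is free to add reasoning vocabulary without penalty as long as it covers the stated scene and focus.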

Training: Based on GRPO, initialized from Qwen2.5-VL-7B, trained on 4×H100 for approximately 10 hours.
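GRPO's core trick — normalizing rewards within a group of responses sampled for the same prompt, instead of training a separate value critic — can be sketched as:

```python
import statistics

def group_advantages(rewards):
    """Group-relative advantage: z-score each reward within its sampling group.

    Sketch of GRPO's critic-free advantage estimate; the epsilon guard for
    zero-variance groups is an implementation assumption.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Two high-reward and two low-reward rollouts for one prompt.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))
```

In this setting each rollout's reward would combine the format, reference \(R_{ref}\), and consistency \(R_{cons}\) terms before normalization.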

RegionDial-Bench:

  • Multi-round dialogues constructed from RefCOCO+/RefCOCOg.
  • RefCOCO+ Multi-turn: 715 images / 2,355 turns; RefCOCOg: 1,580 images / 4,405 turns.
  • Supports per-turn evaluation for both detection (AP50) and segmentation (gIoU).
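Per-turn gIoU in referring segmentation is commonly the mean of per-sample mask IoUs at a given turn; a minimal sketch under that assumption (representing masks as sets of foreground pixels is purely illustrative):

```python
# Assumed gIoU definition: mean of per-sample mask IoUs at one dialogue turn.
# Masks are modeled as sets of (row, col) foreground pixels for simplicity.
def mask_iou(pred, gt):
    union = len(pred | gt)
    return len(pred & gt) / union if union else 1.0

def giou_at_turn(preds, gts):
    """gIoU for one turn: average mask IoU over all samples at that turn."""
    ious = [mask_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)

pred = {(0, 0), (0, 1)}
gt = {(0, 1), (1, 1)}
print(giou_at_turn([pred], [gt]))  # 1/3: one shared pixel, three in the union
```

AP50 on the detection side follows the standard convention of average precision at an IoU threshold of 0.5.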

Key Experimental Results

7-Round Detection (RefCOCO+ Multi-turn, AP↑)

| Method | R1 | R2 | R3 | R4 | R5 | R6 | R7 | Avg |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 65.5 | 49.0 | 48.1 | 36.5 | 30.0 | 38.2 | 25.9 | 49.9 |
| Seg-Zero-7B | 90.5 | 71.2 | 73.6 | 59.6 | 48.8 | 58.2 | 48.2 | 73.1 |
| VisionReasoner-7B | 88.3 | 74.7 | 75.8 | 64.2 | 56.3 | 57.3 | 47.0 | 74.8 |
| RegionReasoner-7B | 89.3 | 83.2 | 81.6 | 69.6 | 61.9 | 69.1 | 64.7 | 80.7 |

7-Round Segmentation (RefCOCO+ Multi-turn, gIoU↑)

| Method | R1 | R2 | R3 | R4 | R5 | R6 | R7 | Avg |
|---|---|---|---|---|---|---|---|---|
| Seg-Zero-7B | 78.6 | 62.8 | 64.0 | 51.6 | 42.4 | 50.8 | 46.7 | 64.0 |
| SegLLM-7B | 71.1 | 71.7 | 70.4 | 58.7 | 41.9 | 39.2 | 30.3 | 60.7 |
| VisionReasoner-7B | 75.6 | 65.0 | 65.9 | 54.9 | 46.6 | 48.9 | 40.8 | 64.3 |
| RegionReasoner-7B | 76.4 | 73.1 | 72.0 | 58.8 | 51.3 | 59.4 | 54.6 | 69.6 |

Ablation Study

| Reward Configuration | RefCOCO+ AP Avg | RefCOCOg gIoU Avg | Note |
|---|---|---|---|
| Base rewards only | 74.8 | 64.3 | VisionReasoner baseline |
| + Reference reward \(R_{ref}\) | 77.5 | 66.8 | Reduces coordinate hallucination |
| + Consistency reward \(R_{cons}\) | 76.9 | 66.2 | Stabilizes weak spatial scenes |
| + Both combined | 80.7 | 69.6 | Complementary effects optimal |

Key Findings

  • Greatest gains in later rounds: Detection AP improvements of +5.6/+11.8/+17.7 at R5/R6/R7 vs. VisionReasoner, indicating that reference propagation and consistency constraints effectively suppress error accumulation.
  • The two rewards are complementary: the reference reward primarily reduces coordinate hallucinations and improves region reuse/correction; the consistency reward stabilizes reasoning semantics in scenes with weak spatial cues.
  • SegLLM performs reasonably at R1–R3 but degrades sharply at R7 (30.3 gIoU), as the absence of structured reasoning traces leads to uncontrolled behavior in long dialogues.
  • Training completes in approximately 10 hours on 4×H100; constrained decoding is applied at inference to guarantee format validity.
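The format guarantee mentioned above can be approximated post hoc with a validity check. The paper uses constrained decoding at inference, so the regex below only illustrates the kind of tag-and-JSON pattern being enforced; the exact pattern is an assumption:

```python
import re

# Hypothetical validator mirroring what constrained decoding would guarantee:
# the four tags in order, with a bracketed coordinate payload in <answer>.
TURN_PATTERN = re.compile(
    r"^<scene>.*?</scene>\s*<focus>.*?</focus>\s*"
    r"<think>.*?</think>\s*<answer>\s*\[.*?\]\s*</answer>$",
    re.DOTALL,
)

def is_valid_turn(text: str) -> bool:
    return TURN_PATTERN.match(text.strip()) is not None

ok = ("<scene>kitchen</scene><focus>counter</focus>"
      "<think>mug at [1,2,3,4]</think><answer>[1,2,3,4]</answer>")
print(is_valid_turn(ok))                       # True
print(is_valid_turn("<scene>kitchen</scene>"))  # False
```

With constrained decoding proper, invalid sequences are never generated in the first place, so such a check serves only as a safety net or a unit test.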

Highlights & Insights

  • Verifiable reasoning traces: Bounding box references within reasoning traces can be automatically parsed and audited—every conclusion is supported by traceable spatial evidence.
  • Precisely complementary reward signals: The reference reward ensures "what region is mentioned is actually attended to," while the consistency reward ensures "scene description, local description, and reasoning remain semantically coherent."
  • Multi-round stability: Performance degradation is substantially smaller than all baselines; RegionReasoner retains 64.7 AP at R7, compared to only 47.0 for VisionReasoner.
  • Unified detection and segmentation: No task-specific heads are used; detection employs bounding box JSON and segmentation employs point_2d JSON, both within the same framework and training procedure.
  • RegionDial-Bench: The first multi-round reasoning benchmark to jointly cover detection and segmentation, supporting per-turn evaluation and reference propagation analysis.

Limitations & Future Work

  • The benchmark scale is relatively small (the RefCOCO+ Multi-turn split contains only 715 images / 2,355 turns), and generalization to larger-scale and more diverse scenarios remains to be verified.
  • The keyword matching approach (lemmatization + stopword removal + noun filtering) is coarse and may miss genuine consistency in semantically rich but lexically diverse scenes.
  • Validation is limited to the 7B scale; larger models (e.g., 72B) may achieve multi-round stability without such structured constraints.
  • Constrained decoding increases inference complexity, and strict enforcement of JSON formats and tag patterns may limit generation flexibility.
  • The vocabulary priors for spatial relations (left/right/inside/overlap, etc.) are manually defined and may have insufficient coverage.
Comparison with Related Work

  • vs. VisionReasoner: A strong baseline for single-round structured reasoning; RegionReasoner extends it to the multi-round setting while inheriting its tag structure and base rewards.
  • vs. SegLLM: Supports multi-round segmentation interaction with conversational supervision but lacks explicit reasoning traces and RL signals—this work addresses both the verifiability and learning signal gaps.
  • vs. Vision-R1/VLM-R1/Pixel Reasoner: Concurrent work on RL-enhanced VLM reasoning, but all operate in single-round settings; RegionReasoner introduces multi-round operation with region annotation.
  • GRPO: The adopted policy-optimization algorithm; its group-relative, critic-free advantage estimation makes it better suited than PPO to RL fine-tuning of large models.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of reference-grounded reasoning and global-local consistency rewards is novel and practically motivated.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Detection + segmentation, fine-grained per-round analysis, ablations, and comparison with multiple baselines.
  • Writing Quality: ⭐⭐⭐⭐ Formalization is complete and the pipeline description is clear.
  • Value: ⭐⭐⭐⭐ Opens a new direction in multi-round visual reasoning; both the benchmark and the method constitute independent contributions.