
RegionReasoner: Region-Grounded Multi-Round Visual Reasoning

Conference: ICLR 2026
arXiv: 2602.03733
Code: RegionReasoner
Area: Image Segmentation
Keywords: multi-round reasoning, region grounding, reinforcement learning, GRPO, VLM, referring segmentation

TL;DR

This paper proposes RegionReasoner, a reinforcement-learning-based multi-round visual reasoning framework. A reference reward enforces explicit citation of reference region coordinates in the reasoning trace, and a global-local consistency reward keeps the scene description, focus description, and reasoning semantically coherent. On the newly constructed RegionDial-Bench (RefCOCO+ Multi-turn), it raises average detection AP from 74.8 to 80.7 and average segmentation gIoU from 64.3 to 69.6 over VisionReasoner-7B.

Background & Motivation

Limitations of Prior Work

State of the field:

1. Existing VLM reasoning is primarily single-step or operates in pure text space, lacking iterative visual context refinement.
2. VisionReasoner provides single-round structured reasoning but does not propagate region references across turns.
3. SegLLM supports multi-round interactive segmentation but lacks verifiable reasoning traces or RL signals.
4. Naively stacking single-round reasoning leads to fragile reference propagation and coordinate hallucinations that are difficult to detect.
5. As dialogue turns increase, semantic drift emerges between global descriptions and local evidence.
6. No evaluation benchmark exists for multi-round reasoning precision and consistency.

Method

Structured Output: Each turn generates four tagged blocks: <scene> → <focus> → <think> → <answer>
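The four-block format can be checked mechanically. The following is a minimal sketch, assuming each block is closed with a matching `</tag>` (the summary only shows the opening tags) and that the blocks appear in exactly the order listed above; `parse_turn` is a hypothetical helper, not part of the released code.

```python
import re
from typing import Dict, Optional

# Block order taken from the summary: scene -> focus -> think -> answer.
TAGS = ("scene", "focus", "think", "answer")

# Assumption: every block has a closing tag and only whitespace separates blocks.
_TURN_RE = re.compile(
    r"\s*".join(rf"<{t}>(?P<{t}>.*?)</{t}>" for t in TAGS),
    re.DOTALL,
)


def parse_turn(text: str) -> Optional[Dict[str, str]]:
    """Return the body of each block, or None if the format is violated."""
    match = _TURN_RE.fullmatch(text.strip())
    if match is None:
        return None
    return {t: match.group(t).strip() for t in TAGS}
```

A parser of this kind returns `None` for outputs that skip a block or reorder them, which is one way a format check could be wired into a reward or an audit script.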

Reference-Grounded Thinking:

- The reasoning trace <think> must explicitly cite reference bounding box coordinates.
- Reference reward \(R_{ref}\): correct citation score + hallucinated coordinate penalty (\(\eta=0.5\)).
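To make the shape of such a reward concrete, here is a rough sketch. The summary does not give the exact formula, so everything below is an assumption made for illustration: boxes are cited as `[x1, y1, x2, y2]` integers, a citation counts as correct only on exact match, the citation score is the fraction of reference boxes cited, and the penalty is \(\eta\) times the fraction of cited boxes that match no reference.

```python
import re
from typing import List, Set, Tuple

Box = Tuple[int, int, int, int]

# Assumption: coordinates are written as [x1, y1, x2, y2] with integer values.
_BOX_RE = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def cited_boxes(think: str) -> Set[Box]:
    """Collect every box-like coordinate list mentioned in the reasoning trace."""
    return {tuple(int(v) for v in m.groups()) for m in _BOX_RE.finditer(think)}


def reference_reward(think: str, references: List[Box], eta: float = 0.5) -> float:
    """Hypothetical R_ref: citation score minus eta-weighted hallucination rate."""
    cited = cited_boxes(think)
    refs = set(references)
    # Fraction of reference boxes that the trace cites (exact match, an assumption).
    citation = len(cited & refs) / len(refs) if refs else 0.0
    # Fraction of cited boxes that correspond to no reference box.
    hallucination = len(cited - refs) / len(cited) if cited else 0.0
    return citation - eta * hallucination
```

With one reference box, a trace that cites it and also mentions one unrelated box would score 1.0 - 0.5 * 0.5 = 0.75 under these assumptions.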

Global-Local Consistency Reward:

- Keywords are extracted from <scene> and <focus>, and their asymmetric overlap with <think> is computed.
- Spatial, comparative, and localization vocabulary priors \(\ell(h_t)\) are incorporated.
- \(R_{cons} = w_s \cdot \text{Ov}(s_t, h_t) + w_f \cdot \text{Ov}(f_t, h_t) + w_\ell \cdot \ell(h_t)\)
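The formula above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: keyword extraction is reduced to lowercase tokens minus a tiny stopword set (the paper uses lemmatization, stopword removal, and noun filtering), \(\text{Ov}(a, h)\) is taken to be the fraction of keywords of \(a\) that also occur in \(h\), \(\ell(h)\) is 1 if the trace contains any term from a small hand-picked spatial lexicon, and the weights are placeholder values.

```python
import re

# Assumption: toy stopword list and spatial lexicon, chosen only for illustration.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "on", "in", "to", "and", "with", "at"}
SPATIAL_TERMS = {"left", "right", "inside", "overlap", "above", "below", "behind"}


def keywords(text: str) -> set:
    """Lowercase alphabetic tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def overlap(source: str, think: str) -> float:
    """Asymmetric overlap: share of source keywords that reappear in the trace."""
    src = keywords(source)
    if not src:
        return 0.0
    return len(src & keywords(think)) / len(src)


def spatial_prior(think: str) -> float:
    """1.0 if the trace uses at least one spatial term, else 0.0."""
    return 1.0 if keywords(think) & SPATIAL_TERMS else 0.0


def consistency_reward(scene, focus, think, w_s=0.4, w_f=0.4, w_l=0.2):
    """R_cons = w_s * Ov(s, h) + w_f * Ov(f, h) + w_l * l(h); weights are hypothetical."""
    return (w_s * overlap(scene, think)
            + w_f * overlap(focus, think)
            + w_l * spatial_prior(think))
```

The overlap is asymmetric because it is normalized by the source keywords only: a long reasoning trace is not penalized for containing words that the scene or focus description never used.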

Training: Based on GRPO, initialized from Qwen2.5-VL-7B, trained on 4×H100 for approximately 10 hours.
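For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. A minimal sketch of that normalization follows; the use of the population standard deviation and the `eps` constant are assumptions of this sketch, and the reward values in the usage are made up.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and spread of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to one prompt, each with a scalar total reward.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that score above the group mean receive positive advantages and are reinforced; if every response in a group receives the same reward, all advantages are zero and that group contributes no gradient signal.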

RegionDial-Bench:

- Multi-round dialogues constructed from RefCOCO+/RefCOCOg.
- RefCOCO+ Multi-turn: 715 images / 2,355 turns; RefCOCOg Multi-turn: 1,580 images / 4,405 turns.
- Supports per-turn evaluation for both detection (AP50) and segmentation (gIoU).
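Per-turn evaluation amounts to grouping predictions by round index before aggregating. The sketch below illustrates this for detection, assuming one predicted box and one ground-truth box per turn and treating the detection score as the percentage of turns whose box IoU reaches 0.5; the benchmark's actual AP50 protocol may differ, and the record layout is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def per_round_hit_rate(
    records: Iterable[Tuple[int, Box, Box]], threshold: float = 0.5
) -> Dict[int, float]:
    """records holds (round_index, predicted_box, ground_truth_box) triples."""
    hits = defaultdict(list)
    for round_index, pred, gt in records:
        hits[round_index].append(box_iou(pred, gt) >= threshold)
    # Percentage of turns in each round that meet the IoU threshold.
    return {r: 100.0 * sum(v) / len(v) for r, v in sorted(hits.items())}
```

The same grouping applies to segmentation: gIoU is conventionally the mean of per-sample mask IoUs, so replacing the thresholded hit with the raw IoU and the boxes with masks gives a per-round gIoU.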

Key Experimental Results

7-Round Detection (RefCOCO+ Multi-turn, AP↑)

Method	R1	R2	R3	R4	R5	R6	R7	Avg
Qwen2.5-VL-7B	65.5	49.0	48.1	36.5	30.0	38.2	25.9	49.9
Seg-Zero-7B	90.5	71.2	73.6	59.6	48.8	58.2	48.2	73.1
VisionReasoner-7B	88.3	74.7	75.8	64.2	56.3	57.3	47.0	74.8
RegionReasoner-7B	89.3	83.2	81.6	69.6	61.9	69.1	64.7	80.7

7-Round Segmentation (RefCOCO+ Multi-turn, gIoU↑)

Method	R1	R2	R3	R4	R5	R6	R7	Avg
Seg-Zero-7B	78.6	62.8	64.0	51.6	42.4	50.8	46.7	64.0
SegLLM-7B	71.1	71.7	70.4	58.7	41.9	39.2	30.3	60.7
VisionReasoner-7B	75.6	65.0	65.9	54.9	46.6	48.9	40.8	64.3
RegionReasoner-7B	76.4	73.1	72.0	58.8	51.3	59.4	54.6	69.6

Ablation Study

Reward Configuration	RefCOCO+ AP Avg	RefCOCO+ gIoU Avg	Note
Base rewards only	74.8	64.3	VisionReasoner baseline
+ Reference reward \(R_{ref}\)	77.5	66.8	Reduces coordinate hallucination
+ Consistency reward \(R_{cons}\)	76.9	66.2	Stabilizes weak spatial scenes
+ Both combined	80.7	69.6	Complementary effects optimal

Key Findings

Greatest gains in later rounds: Detection AP improvements of +5.6/+11.8/+17.7 at R5/R6/R7 vs. VisionReasoner, indicating that reference propagation and consistency constraints effectively suppress error accumulation.
The two rewards are complementary: the reference reward primarily reduces coordinate hallucinations and improves region reuse/correction; the consistency reward stabilizes reasoning semantics in scenes with weak spatial cues.
SegLLM performs reasonably at R1–R3 but degrades sharply at R7 (30.3 gIoU), as the absence of structured reasoning traces leads to uncontrolled behavior in long dialogues.
Training completes in approximately 10 hours on 4×H100; constrained decoding is applied at inference to guarantee format validity.
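The per-round gains quoted in the first finding can be recomputed directly from the detection table. The short script below does that arithmetic; the numbers are copied from the RefCOCO+ Multi-turn detection rows for VisionReasoner-7B and RegionReasoner-7B, and the variable names are only for illustration.

```python
ROUNDS = ["R1", "R2", "R3", "R4", "R5", "R6", "R7"]

# Detection AP per round, copied from the table above.
vision_reasoner = [88.3, 74.7, 75.8, 64.2, 56.3, 57.3, 47.0]
region_reasoner = [89.3, 83.2, 81.6, 69.6, 61.9, 69.1, 64.7]

# Absolute AP gain of RegionReasoner over VisionReasoner in each round.
gains = {
    name: round(ours - base, 1)
    for name, base, ours in zip(ROUNDS, vision_reasoner, region_reasoner)
}

# AP lost between the first and the seventh round for each model.
drop_vision = round(vision_reasoner[0] - vision_reasoner[-1], 1)
drop_region = round(region_reasoner[0] - region_reasoner[-1], 1)

print(gains)
print(drop_vision, drop_region)
```

The R5, R6, and R7 entries reproduce the +5.6, +11.8, and +17.7 figures, and the R1-to-R7 drop is 41.3 AP for VisionReasoner versus 24.6 AP for RegionReasoner.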

Highlights & Insights

Verifiable reasoning traces: Bounding box references within reasoning traces can be automatically parsed and audited—every conclusion is supported by traceable spatial evidence.
Complementary reward signals: The reference reward ensures that any region mentioned in the reasoning is one that was actually attended to, while the consistency reward ensures that the scene description, local description, and reasoning remain semantically coherent.
Multi-round stability: On RefCOCO+ Multi-turn detection, performance degrades substantially less across rounds than for any baseline; RegionReasoner retains 64.7 AP at R7, compared to only 47.0 for VisionReasoner.
Unified detection and segmentation: No task-specific heads are used; detection employs bounding box JSON and segmentation employs point_2d JSON, both within the same framework and training procedure.
RegionDial-Bench: The first multi-round reasoning benchmark to jointly cover detection and segmentation, supporting per-turn evaluation and reference propagation analysis.
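Because detection and segmentation share one JSON answer format, a single parser can serve both tasks. The sketch below illustrates the idea; `point_2d` is the field named above, while `bbox_2d`, the list-of-objects layout, and the `parse_answer` helper are assumptions made for this example.

```python
import json
from typing import List, Tuple


def parse_answer(answer: str) -> Tuple[List[tuple], List[tuple]]:
    """Split a JSON answer into bounding boxes and point prompts."""
    boxes, points = [], []
    for item in json.loads(answer):
        if "bbox_2d" in item:      # assumed key for detection output
            boxes.append(tuple(item["bbox_2d"]))
        if "point_2d" in item:     # key named in the summary for segmentation
            points.append(tuple(item["point_2d"]))
    return boxes, points


boxes, points = parse_answer('[{"bbox_2d": [10, 20, 50, 60], "point_2d": [30, 40]}]')
```

Under this layout, a detection metric would consume `boxes` directly, while `points` would be passed on as prompts for producing a mask.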

Limitations & Future Work

The benchmark is relatively small (the RefCOCO+ Multi-turn split contains only 715 images / 2,355 turns, and the RefCOCOg Multi-turn split 1,580 images / 4,405 turns), so generalization to larger and more diverse scenarios remains to be verified.
The keyword matching approach (lemmatization + stopword removal + noun filtering) is coarse and may miss genuine consistency in semantically rich but lexically diverse scenes.
Validation is limited to the 7B scale; larger models (e.g., 72B) may achieve multi-round stability without such structured constraints.
Constrained decoding increases inference complexity, and strict enforcement of JSON formats and tag patterns may limit generation flexibility.
The vocabulary priors for spatial relations (left/right/inside/overlap, etc.) are manually defined and may have insufficient coverage.

Related Work

vs. VisionReasoner: A strong baseline for single-round structured reasoning; RegionReasoner extends it to the multi-round setting while inheriting its tag structure and base rewards.
vs. SegLLM: Supports multi-round segmentation interaction with conversational supervision but lacks explicit reasoning traces and RL signals—this work addresses both the verifiability and learning signal gaps.
vs. Vision-R1/VLM-R1/Pixel Reasoner: Concurrent work on RL-enhanced VLM reasoning, but all operate in single-round settings; RegionReasoner introduces multi-round operation with region annotation.
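vs. GRPO: The adopted policy optimization algorithm. GRPO computes advantages from the relative rewards within a group of sampled responses rather than from a learned value critic as in PPO, which lowers the memory and compute cost of RL fine-tuning for large models.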

Rating

Novelty: ⭐⭐⭐⭐ The combination of reference-grounded reasoning and global-local consistency rewards is novel and practically motivated.
Experimental Thoroughness: ⭐⭐⭐⭐ Detection + segmentation, fine-grained per-round analysis, ablations, and comparison with multiple baselines.
Writing Quality: ⭐⭐⭐⭐ Formalization is complete and the pipeline description is clear.
Value: ⭐⭐⭐⭐ Opens a new direction in multi-round visual reasoning; both the benchmark and the method constitute independent contributions.