# FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning

- Conference: NeurIPS 2025
- arXiv: 2510.21311
- Code: https://iiau-zhanglu.github.io/FINERS/
- Area: Image Segmentation
- Keywords: small object segmentation, MLLM, reinforcement learning, GRPO, coarse-to-fine, high-resolution, UAV
## TL;DR

FineRS is a two-stage MLLM reinforcement-learning framework comprising Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR), coupled via a locate-informed retrospective reward. Evaluated on the newly constructed high-resolution UAV dataset FineRS-4k, it performs reasoning and segmentation of ultra-small objects with a gIoU of 55.1% (surpassing the retrained Seg-Zero† by 8.5 points) while simultaneously supporting VQA (MVQA 83.3%).
## Background & Motivation
Background: Methods such as LISA integrate MLLMs with SAM to enable reasoning segmentation; however, they are designed for standard-resolution images and large-scale objects, failing severely on ultra-small objects (area ratio <0.1%) in high-resolution images — LISA 7B achieves only 9.0% gIoU on FineRS-4k.
Limitations of Prior Work: Existing high-resolution understanding methods (SEAL, DC2, MLLMs-Know) capture fine-grained details via tiling or attention mechanisms, but most operate in a training-free manner, lack precise localization capability, and cannot produce pixel-level masks.
Limitations of Vision RFT: Seg-Zero introduces GRPO into segmentation but is constrained by resolution, does not support multi-task unification (joint VQA and segmentation), and employs insufficiently flexible reward designs.
Key Challenge: A unified framework is needed to handle ultra-small objects in 4K images, support instruction-guided segmentation, open-ended VQA, and multiple-choice VQA simultaneously, and achieve data-efficient training through reinforcement learning.
## Method

### Overall Architecture: Two-Stage Coarse-to-Fine
FineRS is built upon Qwen2.5-VL-7B and adopts a two-stage pipeline:
Stage 1: Global Semantic Exploration (GSE)
- Input: 1920×1080 high-resolution image + user instruction
- Output: text response \(A^{pre}\) + coarse localization region \(B_r^{pre}\) (fixed 256×256 size, optimizing only the center offset)
- Function: comprehends instruction semantics in a global view and predicts the approximate target location
Stage 2: Localized Perceptual Refinement (LPR)
- Input: 512×512 local crop guided by GSE output + instruction
- Output: precise bounding box \(B^{pre}\) + two points \(P_1^{pre}, P_2^{pre}\)
- The predicted box and points are subsequently fed into frozen SAM2 to generate the segmentation mask
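The coarse-to-fine flow can be summarized in a few lines. Below is a minimal sketch under stated assumptions: `gse_model`, `lpr_model`, and `sam2_predictor` are hypothetical callables standing in for the two Qwen2.5-VL stages and the frozen SAM2 wrapper, and the crop geometry is simplified to a single 512×512 window centered on GSE's predicted region.

```python
import numpy as np

def finers_inference(image, instruction, gse_model, lpr_model, sam2_predictor,
                     crop_size=512):
    """Sketch of FineRS's two-stage inference; helper callables are
    hypothetical stand-ins, not the authors' API."""
    H, W = image.shape[:2]

    # Stage 1 (GSE): global view -> text response + coarse region center.
    answer, (cx, cy) = gse_model(image, instruction)

    # Take a 512x512 local crop around the predicted center, clamped to bounds.
    x0 = int(np.clip(cx - crop_size // 2, 0, max(W - crop_size, 0)))
    y0 = int(np.clip(cy - crop_size // 2, 0, max(H - crop_size, 0)))
    crop = image[y0:y0 + crop_size, x0:x0 + crop_size]

    # Stage 2 (LPR): local view -> precise box + two prompt points (crop coords).
    box, (p1, p2) = lpr_model(crop, instruction)

    # Map prompts back to full-image coordinates for SAM2.
    box_global = [box[0] + x0, box[1] + y0, box[2] + x0, box[3] + y0]
    points_global = [(p1[0] + x0, p1[1] + y0), (p2[0] + x0, p2[1] + y0)]

    # Frozen SAM2 converts the box + point prompts into the final mask.
    mask = sam2_predictor(image, box=box_global, points=points_global)
    return answer, box_global, mask
```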
### Training: GRPO Reinforcement Learning
Training employs the GRPO algorithm in two sequential steps without relying on large-scale supervised data:
- Train LPR first: learns precise localization on randomly cropped local images, using 6 reward signals — Box IoU, Box L1, Point L1, JSON format, Think format, and QA accuracy.
- Then train GSE: leverages the trained LPR to select the optimal coarse region as pseudo-GT, using 7 reward signals — Region IoU, Region L1, region size, Box-in-region, JSON format, Think format, and QA accuracy.
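GRPO itself needs no learned critic: for each sample a group of responses is rolled out, each is scored (assumed here to be a simple sum of the reward terms), and advantages are the group-normalized scores. A minimal sketch of that bookkeeping, with `box_iou` and `box_in_region` illustrating two of the reward terms named above; this is generic GRPO machinery, not the authors' code.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def box_in_region(box, region):
    """Box-in-region reward: 1 if the predicted box lies fully inside the coarse region."""
    return float(box[0] >= region[0] and box[1] >= region[1]
                 and box[2] <= region[2] and box[3] <= region[3])

def grpo_advantages(group_rewards):
    """Group-relative advantage: z-score each rollout's summed reward
    within its sampling group, so no learned value critic is needed."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)
```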
### Key Designs: Locate-informed Retrospective Reward
- Problem: Coarse regions predicted by GSE lack explicit GT annotations.
- Solution: For each sample, \(n\) randomly offset candidate coarse regions are generated (each covering the GT box); the trained LPR then predicts on each region, and the region yielding the highest LPR prediction IoU is selected as the pseudo-GT \(B_r^{gt}\) for GSE.
- Effect: LPR's localization capability retrospectively supervises GSE's exploration behavior, closing the optimization loop across the two stages (see the sketch below).
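A minimal sketch of the pseudo-GT selection, reusing `box_iou` from above and assuming a hypothetical `lpr_model`; the candidate-sampling details are simplified relative to the paper.

```python
import random

def select_pseudo_gt_region(image, instruction, gt_box, lpr_model, box_iou,
                            n=8, region_size=256):
    """Sample n randomly offset coarse regions that each cover the GT box,
    run the trained LPR on every crop, and keep the region on which LPR
    localizes the target best. Assumes the GT box fits inside a
    region_size x region_size window (true for small objects)."""
    x1, y1, x2, y2 = map(int, gt_box)
    H, W = image.shape[:2]
    best_iou, best_region = -1.0, None
    for _ in range(n):
        # Random top-left corner such that the region fully contains the GT box.
        rx = random.randint(max(0, x2 - region_size), min(x1, W - region_size))
        ry = random.randint(max(0, y2 - region_size), min(y1, H - region_size))
        crop = image[ry:ry + region_size, rx:rx + region_size]
        pred_box, _ = lpr_model(crop, instruction)
        # Score LPR's prediction against the GT box in full-image coordinates.
        pred_global = [pred_box[0] + rx, pred_box[1] + ry,
                       pred_box[2] + rx, pred_box[3] + ry]
        iou = box_iou(pred_global, gt_box)
        if iou > best_iou:
            best_iou = iou
            best_region = (rx, ry, rx + region_size, ry + region_size)
    return best_region  # used as the pseudo-GT B_r^{gt} in GSE's region rewards
```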
### Unified Multi-task Response Reward
\(R_{response}\) unifies three task types:
- Instruction-guided segmentation: a score of 1 is assigned if the response contains "is detected/found"
- Multiple-choice VQA: exact match against the option
- Open-ended VQA: fuzzy matching with similarity >0.8
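Read literally, \(R_{response}\) reduces to task-conditioned string matching. A minimal sketch; the paper does not specify which fuzzy matcher is used, so `difflib.SequenceMatcher` is a stand-in.

```python
from difflib import SequenceMatcher

def response_reward(task_type: str, response: str, target: str) -> float:
    """Unified multi-task response reward, sketched from the description above."""
    if task_type == "segmentation":
        # Instruction-guided segmentation: reward acknowledging the target.
        return 1.0 if ("is detected" in response or "is found" in response) else 0.0
    if task_type == "mc_vqa":
        # Multiple-choice VQA: exact match against the ground-truth option.
        return 1.0 if response.strip().lower() == target.strip().lower() else 0.0
    if task_type == "open_vqa":
        # Open-ended VQA: fuzzy string match with similarity > 0.8.
        sim = SequenceMatcher(None, response.lower(), target.lower()).ratio()
        return 1.0 if sim > 0.8 else 0.0
    raise ValueError(f"unknown task type: {task_type}")
```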
## FineRS-4k Dataset
- Source: YouTube and self-collected drone videos at 3840×2160 resolution
- Scale: 4,563 high-resolution images, 8,411 small object instances, 12,132 text–mask annotation pairs
- Splits: Train 8,956 / Validation 749 / Test 2,427
- Object scales: Small (>0.055%), Extra Small (0.017%–0.055%), Extra-Extra Small (<0.017%)
- Task types: instruction-guided segmentation 39%, multiple-choice VQA 30.5%, open-ended VQA 30.5%
- Attribute dimensions: color, shape, position, and others
- Annotation pipeline: 14 volunteers with pairwise cross-checking + final quality review by 4 senior annotators
Compared with existing datasets: V* contains only 191 samples without masks; HR-Bench has only 200 samples without masks; refCOCOg and ReasonSeg target standard resolution. FineRS-4k is the first high-resolution small-object dataset that provides both QA and mask annotations.
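For concreteness, the scale splits above translate directly into a thresholding rule on the instance's mask-area-to-image-area ratio; the helper below is illustrative, not part of any dataset toolkit.

```python
def scale_bucket(mask_area: float, image_area: float) -> str:
    """Assign a FineRS-4k scale split from the instance's area ratio."""
    ratio = mask_area / image_area
    if ratio < 0.00017:   # < 0.017%
        return "Extra-Extra Small (XXS)"
    if ratio < 0.00055:   # 0.017% - 0.055%
        return "Extra Small (XS)"
    return "Small (S)"    # > 0.055%

# A 40x40-pixel object in a 3840x2160 frame has an area ratio of ~0.019%:
print(scale_bucket(40 * 40, 3840 * 2160))  # -> Extra Small (XS)
```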
## Key Experimental Results

### Main Results: FineRS-4k Test Set
| Method | gIoU | cIoU | MVQA | OVQA |
|---|---|---|---|---|
| LISA 7B (zero-shot) | 9.0 | 2.4 | 0.0 | 5.5 |
| LISA++ 7B | 12.3 | 5.2 | 6.7 | 8.2 |
| MLLMs-Know 13B + LISA 13B | 17.9 | 12.8 | 52.6 | 48.8 |
| Seg-Zero 7B (zero-shot) | 32.1 | 6.6 | – | – |
| LISA† 7B (retrained) | 12.1 | 9.6 | 23.9 | 22.0 |
| Seg-Zero† 7B (retrained) | 46.6 | 38.6 | – | – |
| FineRS 7B | 55.1 | 46.5 | 83.3 | 56.7 |
Key observations:
- FineRS surpasses retrained Seg-Zero† by 8.5 points in gIoU and 7.9 points in cIoU.
- FineRS is the only method that achieves strong performance on both segmentation and VQA simultaneously.
- The advantage is most pronounced on XXS (Extra-Extra Small) objects: gIoU 47.2 vs. Seg-Zero† 31.7 (+15.5).
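Throughout these tables, gIoU and cIoU are assumed to follow the standard reasoning-segmentation definitions (per-image mean IoU vs. dataset-cumulative IoU); a sketch:

```python
import numpy as np

def giou_ciou(pred_masks, gt_masks):
    """gIoU averages per-image IoU; cIoU divides the dataset-wide cumulative
    intersection by the cumulative union (standard definitions, assumed to
    match the paper)."""
    ious, inter_total, union_total = [], 0, 0
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union > 0 else 1.0)
        inter_total += inter
        union_total += union
    giou = float(np.mean(ious))
    ciou = inter_total / union_total if union_total > 0 else 1.0
    return giou, ciou
```

Because cIoU pools intersections and unions over all pixels before dividing, it weighs each prediction by area, which is one reason the two metrics can diverge sharply on tiny objects.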
### Cross-dataset Generalization (without additional fine-tuning)
| Dataset | FineRS | Runner-up |
|---|---|---|
| V* Overall | 77.5 | SEAL 75.4 |
| HR-Bench 4K Avg | 63.8 | DC2 50.0 |
| HR-Bench 8K Avg | 58.1 | DC2 40.8 |
FineRS achieves substantial improvements on non-UAV scenes as well, demonstrating strong generalizability.
### Ablation Study
| Setting | gIoU | cIoU | MVQA | OVQA |
|---|---|---|---|---|
| Full FineRS | 55.1 | 46.5 | 83.3 | 56.7 |
| w/o Retrospective Reward | 54.0 | 44.0 | 82.3 | 53.0 |
| w/o LPR random region augmentation | 53.9 | 46.7 | 83.7 | 61.9 |
| w/o QA Acc. Reward | 52.8 | 45.7 | – | – |
| w/o Box Size Reward | 51.0 | 43.4 | 56.5 | 40.8 |
| w/o Box-in-region Reward | 50.1 | 42.0 | 56.5 | 40.8 |
- The retrospective reward has the largest impact on cIoU (removing it costs 2.5 points), validating the effectiveness of cross-stage closed-loop optimization.
- Box Size and Box-in-region rewards have a substantial effect on VQA performance (removing them drops MVQA from 83.3 to 56.5), indicating that region constraints are critical for multi-task unification.
## Highlights & Insights
Strengths:
- The two-stage coarse-to-fine design effectively bypasses the resolution bottleneck of MLLMs, enabling a 7B model to handle 4K small objects.
- The retrospective reward is elegantly designed, providing effective supervision for coarse regions without additional annotation.
- The unified framework jointly outputs text responses and masks, integrating reasoning and segmentation.
- Data-efficient — approximately 9K training samples suffice to outperform large-scale SFT methods.
Limitations:
- The two-stage sequential pipeline increases inference latency; inference speed is not reported in the paper.
- The GSE coarse region size is fixed at 256×256, limiting adaptability to extreme scale variations.
- The dataset focuses exclusively on UAV aerial-view scenes; applicability to other high-resolution domains (e.g., medical imaging) remains unverified.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — First to combine two-stage coarse-to-fine with GRPO reinforcement learning; the cross-stage closed-loop design of the retrospective reward is highly original.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Covers a self-constructed dataset, multiple public benchmarks, and comprehensive ablations, but lacks inference efficiency analysis.
- Writing Quality: ⭐⭐⭐⭐ — Method descriptions are clear and figures are well presented.
- Value: ⭐⭐⭐⭐⭐ — Fills a critical gap in MLLM-based high-resolution small-object segmentation; both the dataset and the method offer long-term research value.