# InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition
- Conference: NeurIPS 2025
- arXiv: 2505.15818
- Code: https://VoyagerXvoyagerx.github.io/InstructSAM/
- Area: Remote Sensing Object Recognition / Open-Vocabulary Segmentation
- Keywords: Remote sensing, instruction-oriented, training-free, SAM2, binary integer programming, object counting, open-vocabulary detection/segmentation
## TL;DR
This paper introduces a new task — Instruction-oriented Counting, Detection, and Segmentation (InstructCDS) — along with the EarthInstruct remote sensing benchmark covering three settings (open-vocabulary, open-ended, and open-subcategory). It proposes InstructSAM, a training-free framework that uses an LVLM to parse instructions and predict counts, SAM2 to generate mask proposals, and CLIP to compute similarities. A Binary Integer Programming (BIP) formulation then performs optimal mask-label assignment under counting constraints, achieving near-constant inference time while outperforming task-specific baselines.
## Background & Motivation
Background: Remote sensing object recognition plays an important role in sustainable development goals, including wildlife monitoring, poverty estimation, and disaster relief. CLIP-driven open-vocabulary detection and segmentation have seen growing adoption in the remote sensing domain.
Limitations of Prior Work:

- Existing open-vocabulary methods rely on explicit category instructions and cannot handle implicit reasoning (e.g., inferring which subcategories fall under "vehicles").
- Fixed category lists are inherently incomplete for remote sensing, since aerial views expose a wide diversity of object types.
- Traditional detectors depend on confidence thresholds for filtering, which are unavailable in zero-shot scenarios.
- Directly prompting LVLMs to generate bounding boxes one by one leads to inference time that scales linearly with the number of objects.
Core Idea: Decompose the problem into three steps — LVLM counting (\(O(1)\) inference time) + SAM2 mask proposals + BIP optimal matching — requiring no training, no thresholds, and near-constant inference time.
## InstructCDS Task & EarthInstruct Benchmark

### Three Settings
| Setting | Description | Example Instruction |
|---|---|---|
| Open-Vocabulary | User specifies target categories | "Detect football fields and parking lots" |
| Open-Ended | Detect all visible objects | "Detect all objects in the image" |
| Open-Subcategory | Detect all subcategories under a parent class | "Detect all sports venues" |
### EarthInstruct Benchmark
- Built upon two remote sensing datasets — NWPU-VHR-10 and DIOR — covering 20 categories.
- The two datasets differ in annotation conventions and spatial resolution, requiring models to interpret dataset-specific instructions.
- For example, vehicles are not annotated in low-resolution NWPU-VHR-10 images; in DIOR, airports are only annotated when fully visible.
### Evaluation Metric Innovations
- Counting metrics: Precision/Recall/F1 based on TP/FP/FN (replacing MAE/RMSE; normalized and capable of distinguishing over- vs. under-counting).
- Confidence-free detection metrics: mF1 + mAP\(_{nc}\) (independent of confidence score ranking), suitable for generative detectors.
- Semantic matching: Cosine similarity > 0.95 via GeoRSCLIP text encoder is treated as category equivalence.
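To make the counting metrics above concrete, here is a minimal sketch assuming the common per-category convention TP = min(pred, gt), FP = max(pred - gt, 0), FN = max(gt - pred, 0); the helper name and this exact derivation are assumptions for illustration, not the paper's code:

```python
# Sketch of counting Precision/Recall/F1; assumes per-category
# TP = min(pred, gt), FP = max(pred - gt, 0), FN = max(gt - pred, 0).
def counting_f1(pred_counts: dict, gt_counts: dict) -> float:
    tp = fp = fn = 0
    for cat in set(pred_counts) | set(gt_counts):
        p, g = pred_counts.get(cat, 0), gt_counts.get(cat, 0)
        tp += min(p, g)          # correctly counted instances
        fp += max(p - g, 0)      # over-counting
        fn += max(g - p, 0)      # under-counting
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Over-counting lowers precision, under-counting lowers recall:
print(counting_f1({"airplane": 5}, {"airplane": 3}))  # P=0.6, R=1.0 -> F1=0.75
```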
## Method

### InstructSAM Framework (Three-Step Pipeline)

#### Step 1: LVLM Instruction Parsing and Counting
GPT-4o or Qwen2.5-VL-7B serves as the counter. Given the image and a structured JSON prompt (including dataset-specific instructions), it outputs target categories \(\{cat_j\}\) and counts \(\{num_j\}\).
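A minimal illustration of parsing the counter's output, assuming a flat {category: count} JSON schema (the paper's exact prompt and schema may differ):

```python
# Illustrative parsing of the LVLM counter's structured output; the
# {category: count} schema here is an assumption for demonstration.
import json

response = '{"airplane": 3, "storage tank": 12, "tennis court": 2}'
counts = json.loads(response)        # maps cat_j -> num_j
categories = list(counts.keys())     # {cat_j}
nums = list(counts.values())         # {num_j}
```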
#### Step 2: Class-Agnostic Mask Proposals via SAM2
SAM2-hiera-large automatically generates mask proposals \(\{mask_i\}_{i=1}^N\) over a regular point grid. An additional mask generation pass is applied to cropped image regions to improve small-object recall.
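A minimal sketch of this step using the official `sam2` package's automatic mask generator; the config/checkpoint paths and `points_per_side` value are placeholders, not the paper's exact settings:

```python
# Class-agnostic mask proposals over a regular point grid with SAM2.
# Paths and grid density are illustrative placeholders.
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

model = build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")
mask_generator = SAM2AutomaticMaskGenerator(model, points_per_side=32)

image = np.array(Image.open("scene.png").convert("RGB"))
proposals = mask_generator.generate(image)  # dicts with "segmentation", "bbox", ...
# A second pass on cropped regions (not shown) can improve small-object recall.
```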
#### Step 3: Counting-Constrained Mask-Label Matching (Core Contribution)
A semantic similarity matrix \(S \in \mathbb{R}^{N \times M}\) is computed (image–text cosine similarity via GeoRSCLIP between mask crops and category names), and the following Binary Integer Programming problem is formulated over binary assignment variables \(x_{ij} \in \{0,1\}\) (mask \(i\) assigned to category \(j\)):

\[
\max_{x} \; \sum_{i=1}^{N} \sum_{j=1}^{M} S_{ij}\, x_{ij}
\]
Subject to:

- Each mask is assigned to at most one category: \(\sum_{j=1}^{M} x_{ij} \leq 1, \; \forall i\)
- When proposals are sufficient, each category receives exactly its predicted count: \(\sum_{i=1}^{N} x_{ij} = num_j, \; \forall j\)
- When proposals are insufficient, all are assigned: \(\sum_{i=1}^{N} \sum_{j=1}^{M} x_{ij} = N\)
The problem is solved efficiently via the PuLP solver. The BIP formulation elegantly integrates three sources of information: visual (CLIP mask embeddings), semantic (category text embeddings), and quantitative (LVLM counting constraints).
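A minimal PuLP sketch of this matching step, assuming a precomputed similarity matrix `S` (N masks by M categories) and the LVLM's per-category counts; variable and function names are illustrative, not taken from the paper's code:

```python
# Counting-constrained mask-label matching as Binary Integer Programming.
# Assumes S[i, j] holds the CLIP similarity between mask i and category j.
import numpy as np
import pulp

def match_masks_to_labels(S: np.ndarray, counts: list) -> np.ndarray:
    N, M = S.shape
    prob = pulp.LpProblem("mask_label_matching", pulp.LpMaximize)
    x = [[pulp.LpVariable(f"x_{i}_{j}", cat="Binary") for j in range(M)]
         for i in range(N)]
    # Objective: total similarity of selected (mask, category) assignments.
    prob += pulp.lpSum(float(S[i, j]) * x[i][j]
                       for i in range(N) for j in range(M))
    # Each mask gets at most one category.
    for i in range(N):
        prob += pulp.lpSum(x[i][j] for j in range(M)) <= 1
    if N >= sum(counts):
        # Enough proposals: each category receives exactly its predicted count.
        for j in range(M):
            prob += pulp.lpSum(x[i][j] for i in range(N)) == counts[j]
    else:
        # Too few proposals: assign every mask to some category.
        prob += pulp.lpSum(x[i][j] for i in range(N) for j in range(M)) == N
    prob.solve(pulp.PULP_CBC_CMD(msg=False))  # CBC is PuLP's bundled solver
    return np.array([[round(x[i][j].value()) for j in range(M)]
                     for i in range(N)])
```

The resulting assignment matrix directly yields per-category detections and masks, with no confidence threshold anywhere in the loop.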
### Key Advantages
- Training-free: No task-specific training data for remote sensing is required.
- Threshold-free: Counting constraints replace confidence thresholds, avoiding the threshold selection problem in zero-shot scenarios.
- Near-constant inference time: The LVLM outputs only counts (few tokens) rather than generating bounding boxes one by one; SAM2 mask proposals are independent of the number of objects.
## Key Experimental Results

### Open-Vocabulary Setting (Zero-Shot)
| Method | NWPU Cnt-F1↑ | NWPU Box-F1↑ | NWPU Mask-F1↑ | DIOR Cnt-F1↑ | DIOR Box-F1↑ | DIOR Mask-F1↑ |
|---|---|---|---|---|---|---|
| Grounding DINO | 14.9 | 14.0 | - | 10.7 | 6.0 | - |
| OWLv2 | 39.4 | 27.2 | - | 23.4 | 14.3 | - |
| Qwen2.5-VL | 68.0 | 36.4 | - | 52.0 | 27.8 | - |
| InstructSAM-Qwen | 73.2 | 38.9 | 23.7 | 59.3 | 24.7 | 24.0 |
| InstructSAM-GPT4o | 83.0 | 41.8 | 26.1 | 79.9 | 29.1 | 28.1 |
InstructSAM-GPT4o achieves a substantial lead in counting F1 (83.0 vs. 68.0 for Qwen2.5-VL on NWPU) while simultaneously providing segmentation outputs.
### Open-Ended Setting
| Method | NWPU Cnt-F1↑ | NWPU Box-F1↑ | DIOR Cnt-F1↑ | DIOR Box-F1↑ |
|---|---|---|---|---|
| Qwen2.5-VL | 48.6 | 32.0 | 36.6 | 21.7 |
| GeoPixel | 40.8 | 29.9 | 21.4 | 13.8 |
| LAE-Label | 46.2 | 27.3 | 23.3 | 11.5 |
| InstructSAM-GPT4o | 57.4 | 31.3 | 47.9 | 22.1 |
InstructSAM consistently outperforms remote sensing-specific models (e.g., SkySenseGPT, EarthDial, GeoPixel) in the open-ended setting as well.
### Open-Subcategory Setting
| Method | NWPU-S F1↑ | NWPU-T F1↑ | DIOR-S F1↑ | DIOR-T F1↑ |
|---|---|---|---|---|
| Qwen2.5-VL | 32.4 | 42.2 | 34.0 | 39.2 |
| GPT4o+OWL | 19.8 | 65.9 | 27.6 | 70.9 |
| InstructSAM-GPT4o | 46.9 | 44.2 | 40.9 | 38.3 |
InstructSAM achieves a large margin on the "sports venues" (S) subcategory, while general-purpose detectors perform better on "vehicles" (T) — attributed to the richness of vehicle categories in OWLv2's training data.
### Efficiency Comparison
- InstructSAM reduces output token count by 89% compared to Qwen2.5-VL.
- Total runtime is reduced by >32%.
- Inference time remains near-constant and does not scale with the number of objects.
## Highlights & Insights
- BIP matching replaces thresholds — object recognition is formulated as a constrained optimization problem, eliminating the need for threshold tuning in zero-shot scenarios.
- Counting as a global constraint — the LVLM's global perspective yields accurate counts that constrain mask assignment.
- Training-free generalization — three foundation models (LVLM + SAM2 + CLIP) are composed without any remote sensing-specific fine-tuning.
- New task + new benchmark — the InstructCDS three-setting formulation, EarthInstruct benchmark, and confidence-free evaluation metrics constitute substantial contributions.
## Limitations & Future Work
- LVLM counting accuracy is the performance ceiling — GPT-4o achieves only 83% counting F1, and errors propagate to final detection results.
- The framework relies on GeoRSCLIP's remote sensing-specific pretraining — transferring to natural image domains requires replacing the CLIP backbone.
- The BIP formulation assumes each mask belongs to at most one category — heavily overlapping instances cannot be handled.
- SAM2 may miss very small objects (<10 px); the cropping strategy partially alleviates but does not fully resolve this issue.
- Inference depends on the GPT-4o API, introducing cost and network dependency.
## Related Work & Insights
- The BIP matching approach generalizes to any scenario where category counts are known and detection/segmentation proposals are available.
- The paradigm of replacing confidence thresholds with counting constraints is broadly applicable to all generative detection models.
- The dataset-specific instruction design in EarthInstruct is worth adopting in other domains, where annotation conventions vary across datasets.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ BIP matching with counting constraints as a threshold replacement represents a genuine paradigm shift.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across three settings with comparisons against diverse baselines.
- Writing Quality: ⭐⭐⭐⭐⭐ Problem definition is precise, the framework is elegant, and the metric design is carefully motivated.
- Value: ⭐⭐⭐⭐⭐ The training-free, threshold-free paradigm has broad impact potential.