# InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition
- Conference: NeurIPS 2025
- arXiv: 2505.15818
- Code: https://VoyagerXvoyagerx.github.io/InstructSAM/
- Area: Remote Sensing Object Recognition / Open-Vocabulary Segmentation
- Keywords: Remote sensing, instruction-oriented, training-free, SAM2, binary integer programming, object counting, open-vocabulary detection/segmentation
## TL;DR
This paper introduces a new task — Instruction-oriented Counting, Detection, and Segmentation (InstructCDS) — along with the EarthInstruct remote sensing benchmark covering three settings (open-vocabulary, open-ended, and open-subcategory). It proposes InstructSAM, a training-free framework that uses an LVLM to parse instructions and predict counts, SAM2 to generate mask proposals, and CLIP to compute similarities. A Binary Integer Programming (BIP) formulation then performs optimal mask-label assignment under counting constraints, achieving near-constant inference time while outperforming task-specific baselines.
## Background & Motivation
Background: Remote sensing object recognition plays an important role in sustainable development goals, including wildlife monitoring, poverty estimation, and disaster relief. CLIP-driven open-vocabulary detection and segmentation have seen growing adoption in the remote sensing domain.
Limitations of Prior Work:

- Existing open-vocabulary methods rely on explicit category instructions and cannot handle implicit reasoning (e.g., inferring which subcategories fall under "vehicles").
- Fixed category lists are inherently incomplete for remote sensing, since aerial views expose a wide diversity of object types.
- Traditional detectors depend on confidence thresholds for filtering, which are unavailable in zero-shot scenarios.
- Directly prompting LVLMs to generate bounding boxes one by one leads to inference time that scales linearly with the number of objects.
Core Idea: Decompose the problem into three steps — LVLM counting (\(O(1)\) inference time) + SAM2 mask proposals + BIP optimal matching — requiring no training, no thresholds, and near-constant inference time.
## InstructCDS Task & EarthInstruct Benchmark

### Three Settings
| Setting | Description | Example Instruction |
|---|---|---|
| Open-Vocabulary | User specifies target categories | "Detect football fields and parking lots" |
| Open-Ended | Detect all visible objects | "Detect all objects in the image" |
| Open-Subcategory | Detect all subcategories under a parent class | "Detect all sports venues" |
### EarthInstruct Benchmark
- Built upon two remote sensing datasets — NWPU-VHR-10 and DIOR — covering 20 categories.
- The two datasets differ in annotation conventions and spatial resolution, requiring models to interpret dataset-specific instructions.
- For example, vehicles are not annotated in low-resolution NWPU-VHR-10 images; in DIOR, airports are only annotated when fully visible.
### Evaluation Metric Innovations
- Counting metrics: Precision/Recall/F1 based on TP/FP/FN (replacing MAE/RMSE; normalized and capable of distinguishing over- vs. under-counting).
- Confidence-free detection metrics: mF1 + mAP\(_{nc}\) (independent of confidence score ranking), suitable for generative detectors.
- Semantic matching: Cosine similarity > 0.95 via GeoRSCLIP text encoder is treated as category equivalence.
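To make the counting metrics above concrete, here is a minimal sketch assuming the common per-category convention TP = min(pred, gt), FP = max(pred - gt, 0), FN = max(gt - pred, 0); the helper name and this exact derivation are assumptions for illustration, not the paper's code:

```python
# Sketch of counting Precision/Recall/F1; assumes per-category
# TP = min(pred, gt), FP = max(pred - gt, 0), FN = max(gt - pred, 0).
def counting_f1(pred_counts: dict, gt_counts: dict) -> float:
    tp = fp = fn = 0
    for cat in set(pred_counts) | set(gt_counts):
        p, g = pred_counts.get(cat, 0), gt_counts.get(cat, 0)
        tp += min(p, g)          # correctly counted instances
        fp += max(p - g, 0)      # over-counting
        fn += max(g - p, 0)      # under-counting
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Over-counting lowers precision, under-counting lowers recall:
print(counting_f1({"airplane": 5}, {"airplane": 3}))  # P=0.6, R=1.0 -> F1=0.75
```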
## Method

### InstructSAM Framework (Three-Step Pipeline)

#### Step 1: LVLM Instruction Parsing and Counting
GPT-4o or Qwen2.5-VL-7B serves as the counter. Given the image and a structured JSON prompt (including dataset-specific instructions), it outputs target categories \(\{cat_j\}\) and counts \(\{num_j\}\).
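A minimal illustration of parsing the counter's output, assuming a flat {category: count} JSON schema (the paper's exact prompt and schema may differ):

```python
# Illustrative parsing of the LVLM counter's structured output; the
# {category: count} schema here is an assumption for demonstration.
import json

response = '{"airplane": 3, "storage tank": 12, "tennis court": 2}'
counts = json.loads(response)        # maps cat_j -> num_j
categories = list(counts.keys())     # {cat_j}
nums = list(counts.values())         # {num_j}
```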
#### Step 2: Class-Agnostic Mask Proposals via SAM2
SAM2-hiera-large automatically generates mask proposals \(\{mask_i\}_{i=1}^N\) over a regular point grid. An additional mask generation pass is applied to cropped image regions to improve small-object recall.
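A minimal sketch of this step using the official `sam2` package's automatic mask generator; the config/checkpoint paths and `points_per_side` value are placeholders, not the paper's exact settings:

```python
# Class-agnostic mask proposals over a regular point grid with SAM2.
# Paths and grid density are illustrative placeholders.
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

model = build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")
mask_generator = SAM2AutomaticMaskGenerator(model, points_per_side=32)

image = np.array(Image.open("scene.png").convert("RGB"))
proposals = mask_generator.generate(image)  # dicts with "segmentation", "bbox", ...
# A second pass on cropped regions (not shown) can improve small-object recall.
```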
#### Step 3: Counting-Constrained Mask-Label Matching (Core Contribution)
A semantic similarity matrix \(S \in \mathbb{R}^{N \times M}\) is computed (image–text cosine similarity via GeoRSCLIP between mask crops and category names), and the following Binary Integer Programming problem is formulated over binary assignment variables \(x_{ij} \in \{0,1\}\) (mask \(i\) assigned to category \(j\)):

\[
\max_{x} \; \sum_{i=1}^{N} \sum_{j=1}^{M} S_{ij}\, x_{ij}
\]
Subject to:

- Each mask is assigned to at most one category: \(\sum_{j=1}^{M} x_{ij} \leq 1, \; \forall i\)
- When proposals are sufficient, each category receives exactly its predicted count: \(\sum_{i=1}^{N} x_{ij} = num_j, \; \forall j\)
- When proposals are insufficient, all are assigned: \(\sum_{i=1}^{N} \sum_{j=1}^{M} x_{ij} = N\)
The problem is solved efficiently via the PuLP solver. The BIP formulation elegantly integrates three sources of information: visual (CLIP mask embeddings), semantic (category text embeddings), and quantitative (LVLM counting constraints).
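A minimal PuLP sketch of this matching step, assuming a precomputed similarity matrix `S` (N masks by M categories) and the LVLM's per-category counts; variable and function names are illustrative, not taken from the paper's code:

```python
# Counting-constrained mask-label matching as Binary Integer Programming.
# Assumes S[i, j] holds the CLIP similarity between mask i and category j.
import numpy as np
import pulp

def match_masks_to_labels(S: np.ndarray, counts: list) -> np.ndarray:
    N, M = S.shape
    prob = pulp.LpProblem("mask_label_matching", pulp.LpMaximize)
    x = [[pulp.LpVariable(f"x_{i}_{j}", cat="Binary") for j in range(M)]
         for i in range(N)]
    # Objective: total similarity of selected (mask, category) assignments.
    prob += pulp.lpSum(float(S[i, j]) * x[i][j]
                       for i in range(N) for j in range(M))
    # Each mask gets at most one category.
    for i in range(N):
        prob += pulp.lpSum(x[i][j] for j in range(M)) <= 1
    if N >= sum(counts):
        # Enough proposals: each category receives exactly its predicted count.
        for j in range(M):
            prob += pulp.lpSum(x[i][j] for i in range(N)) == counts[j]
    else:
        # Too few proposals: assign every mask to some category.
        prob += pulp.lpSum(x[i][j] for i in range(N) for j in range(M)) == N
    prob.solve(pulp.PULP_CBC_CMD(msg=False))  # CBC is PuLP's bundled solver
    return np.array([[round(x[i][j].value()) for j in range(M)]
                     for i in range(N)])
```

The resulting assignment matrix directly yields per-category detections and masks, with no confidence threshold anywhere in the loop.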
### Key Advantages
- Training-free: No task-specific training data for remote sensing is required.
- Threshold-free: Counting constraints replace confidence thresholds, avoiding the threshold selection problem in zero-shot scenarios.
- Near-constant inference time: The LVLM outputs only counts (few tokens) rather than generating bounding boxes one by one; SAM2 mask proposals are independent of the number of objects.
## Key Experimental Results

### Open-Vocabulary Setting (Zero-Shot)
| Method | NWPU Cnt-F1↑ | NWPU Box-F1↑ | NWPU Mask-F1↑ | DIOR Cnt-F1↑ | DIOR Box-F1↑ | DIOR Mask-F1↑ |
|---|---|---|---|---|---|---|
| Grounding DINO | 14.9 | 14.0 | - | 10.7 | 6.0 | - |
| OWLv2 | 39.4 | 27.2 | - | 23.4 | 14.3 | - |
| Qwen2.5-VL | 68.0 | 36.4 | - | 52.0 | 27.8 | - |
| InstructSAM-Qwen | 73.2 | 38.9 | 23.7 | 59.3 | 24.7 | 24.0 |
| InstructSAM-GPT4o | 83.0 | 41.8 | 26.1 | 79.9 | 29.1 | 28.1 |
InstructSAM-GPT4o achieves a substantial lead in counting F1 (83.0 vs. 68.0 for Qwen2.5-VL on NWPU) while simultaneously providing segmentation outputs.
### Open-Ended Setting
| Method | NWPU Cnt-F1↑ | NWPU Box-F1↑ | DIOR Cnt-F1↑ | DIOR Box-F1↑ |
|---|---|---|---|---|
| Qwen2.5-VL | 48.6 | 32.0 | 36.6 | 21.7 |
| GeoPixel | 40.8 | 29.9 | 21.4 | 13.8 |
| LAE-Label | 46.2 | 27.3 | 23.3 | 11.5 |
| InstructSAM-GPT4o | 57.4 | 31.3 | 47.9 | 22.1 |
InstructSAM consistently outperforms remote sensing-specific models (e.g., SkySenseGPT, EarthDial, GeoPixel) in the open-ended setting as well.
### Open-Subcategory Setting
| Method | NWPU-S F1↑ | NWPU-T F1↑ | DIOR-S F1↑ | DIOR-T F1↑ |
|---|---|---|---|---|
| Qwen2.5-VL | 32.4 | 42.2 | 34.0 | 39.2 |
| GPT4o+OWL | 19.8 | 65.9 | 27.6 | 70.9 |
| InstructSAM-GPT4o | 46.9 | 44.2 | 40.9 | 38.3 |
InstructSAM achieves a large margin on the "sports venues" (S) subcategory, while general-purpose detectors perform better on "vehicles" (T) — attributed to the richness of vehicle categories in OWLv2's training data.
### Efficiency Comparison
- InstructSAM reduces output token count by 89% compared to Qwen2.5-VL.
- Total runtime is reduced by >32%.
- Inference time remains near-constant and does not scale with the number of objects.
## Highlights & Insights
- BIP matching replaces thresholds — object recognition is formulated as a constrained optimization problem, eliminating the need for threshold tuning in zero-shot scenarios.
- Counting as a global constraint — the LVLM's global perspective yields accurate counts that constrain mask assignment.
- Training-free generalization — three foundation models (LVLM + SAM2 + CLIP) are composed without any remote sensing-specific fine-tuning.
- New task + new benchmark — the InstructCDS three-setting formulation, EarthInstruct benchmark, and confidence-free evaluation metrics constitute substantial contributions.
## Limitations & Future Work
- LVLM counting accuracy is the performance ceiling — GPT-4o achieves only 83% counting F1, and errors propagate to final detection results.
- The framework relies on GeoRSCLIP's remote sensing-specific pretraining — transferring to natural image domains requires replacing the CLIP backbone.
- The BIP formulation assumes each mask belongs to at most one category — heavily overlapping instances cannot be handled.
- SAM2 may miss very small objects (<10 px); the cropping strategy partially alleviates but does not fully resolve this issue.
- Inference depends on the GPT-4o API, introducing cost and network dependency.
## Related Work & Insights
- The BIP matching approach generalizes to any scenario where category counts are known and detection/segmentation proposals are available.
- The paradigm of replacing confidence thresholds with counting constraints is broadly applicable to all generative detection models.
- The dataset-specific instruction design in EarthInstruct is worth adopting in other domains, where annotation conventions vary across datasets.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ BIP matching with counting constraints as a threshold replacement represents a genuine paradigm shift.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across three settings with comparisons against diverse baselines.
- Writing Quality: ⭐⭐⭐⭐⭐ Problem definition is precise, the framework is elegant, and the metric design is carefully motivated.
- Value: ⭐⭐⭐⭐⭐ The training-free, threshold-free paradigm has broad impact potential.