SenseSearch: Empowering Vision-Language Models with High-Resolution Agentic Search-Reasoning via Reinforcement Learning
Conference: CVPR 2026
Paper: CVF Open Access
Code: https://github.com/OpenSenseNova/SenseNova-MARS (Available)
Area: Multimodal VLM / Agent
Keywords: agentic VLM, multi-tool coordination, reinforcement learning, high-resolution vision, search-augmented reasoning
TL;DR
SenseSearch enables a 7B VLM to autonomously coordinate three tools (text search, image search, and image crop) during multi-turn reasoning. Through two-stage training (cold-start SFT followed by reinforcement learning with the authors' BN-GSPO algorithm), the model learns to handle both knowledge-intensive questions and high-resolution, fine-grained perception, outperforming the Qwen2.5-VL-7B base model by 19.18 points on the new HR-MMSearch benchmark (38.52 vs. 19.34).
Background & Motivation
Background: VLMs are limited by the static knowledge in their training corpora and struggle to analyze small objects in high-resolution images. To mitigate the former, recent works such as Search-R1 and MMSearch-R1 equip models with external tools (text/image search) and train agentic search capabilities via end-to-end RL (the GRPO/DeepSeek-R1 paradigm). To mitigate the latter, the "Thinking with Images" paradigm (OpenAI-o3, DeepEyes, Pixel Reasoner) lets models repeatedly inspect images in pixel space using cropping/zooming tools.
Limitations of Prior Work: These two research lines remain disjoint. Search-based agents acquire context at the whole-image level and cannot answer questions that require fine-grained perception of small objects (e.g., one covering 5% of the image area). Conversely, pixel-reasoning agents have only image manipulation tools and cannot help when external knowledge (long-tail or real-time information) is required. Even approaches such as DeepMMSearch-R1, which use cropping as a pre-processing step for image search, only chain tools in a fixed order rather than coordinating them adaptively.
Key Challenge: Real-world problems that are both knowledge-intensive and visually complex require both capabilities at once: precisely locating and magnifying key visual clues in high-resolution images, then issuing external searches based on those clues. These steps must alternate across multi-turn reasoning, which single-tool or fixed-pipeline systems cannot do.
Goal: Construct an end-to-end agentic VLM capable of adaptively scheduling multiple tools during reasoning, alongside an evaluation benchmark to test "high-resolution + search-driven" capabilities.
Key Insight: Treat image cropping as an "action" at the same level as searching within a unified action space. This allows the model to decide in each turn whether to zoom in, search text, or reverse-search images. To stabilize training, cold-start SFT is used to teach protocols, followed by a refined RL algorithm for strategy optimization.
Core Idea: Integrate search and fine-grained perception into a single agent using a "three-tool unified action space + cold-start SFT + Batch Normalized Sequence-level Policy Optimization (BN-GSPO)," enabling task-adaptive tool orchestration.
Method
Overall Architecture
SenseSearch models the problem as a multi-turn interaction: given a natural language question \(q\) and an initial (typically 4K high-resolution) image \(I_0\), the policy VLM first outputs a <think> block, then selects one of four actions: text search, image search, image crop, or final answer. The text/images returned by tools are appended to the interaction history \(\mathcal{T}_t\), forming a growing trajectory until the model provides an <answer>. Trajectories failing to yield a valid answer within \(T\) turns are marked incorrect.
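The interaction loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Step`, `Trajectory`, `rollout`, `toy_policy`, and `toy_tools` are hypothetical, images are stood in for by plain strings, and stopping on an invalid action is an assumption (the summary only says that each turn must contain a valid action).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# The four actions named in the text; the string labels are assumptions.
ACTIONS = ("text_search", "image_search", "image_crop", "answer")


@dataclass
class Step:
    thought: str   # content of the <think> block
    action: str    # one of ACTIONS
    argument: str  # search query, bbox string, or final answer text


@dataclass
class Trajectory:
    question: str
    steps: List[Step] = field(default_factory=list)
    observations: List[str] = field(default_factory=list)
    answer: Optional[str] = None  # stays None if no valid answer is produced


def rollout(question: str,
            policy: Callable[[Trajectory], Step],
            tools: Dict[str, Callable[[str], str]],
            max_turns: int = 10) -> Trajectory:
    """Run one multi-turn episode; an unanswered trajectory counts as incorrect."""
    traj = Trajectory(question)
    for _ in range(max_turns):
        step = policy(traj)
        if step.action not in ACTIONS:
            break  # assumption: an invalid action ends the episode without an answer
        traj.steps.append(step)
        if step.action == "answer":
            traj.answer = step.argument
            break
        # Tool output is appended to the history seen by the next turn.
        traj.observations.append(tools[step.action](step.argument))
    return traj


def toy_policy(traj: Trajectory) -> Step:
    """Scripted stand-in for the VLM: crop, then search, then answer."""
    script = [
        Step("the logo is tiny, zoom in", "image_crop", "0.40,0.30,0.55,0.45"),
        Step("need the founding year", "text_search", "ExampleBrand founding year"),
        Step("the search result is enough", "answer", "1950"),
    ]
    return script[len(traj.steps)]


toy_tools = {
    "text_search": lambda q: f"summary for: {q}",
    "image_search": lambda q: f"titles for: {q}",
    "image_crop": lambda box: f"cropped region {box}",
}

result = rollout("When was the brand on the suit founded?", toy_policy, toy_tools)
```

Here `result.answer` is the scripted answer and `result.observations` holds the two tool outputs; a policy that never emits an answer leaves `answer` as `None` after `max_turns` turns.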
Capabilities come from two-stage training: first, cold-start SFT on approximately 3,000 multi-turn trajectories teaches the interaction protocol and tool-call format; second, BN-GSPO reinforcement learning refines tool-calling and reasoning strategies. The reward combines "accuracy + format compliance" and is judged by GPT-4o. Evaluation covers existing search and perception benchmarks as well as the new HR-MMSearch benchmark.
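To make the "accuracy + format compliance" reward concrete, here is a hedged sketch. The weighting (`format_weight=0.1`), the rule that a malformed final turn earns zero, and the function names are all assumptions; the summary does not give the actual formula. An exact-match function stands in for the GPT-4o judge.

```python
import re
from typing import Callable

# A final turn is assumed to be a <think> block followed by an <answer> block.
FINAL_TURN = re.compile(r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL)


def exact_match(prediction: str, reference: str) -> bool:
    """Stand-in judge; the paper uses GPT-4o instead."""
    return prediction.strip().lower() == reference.strip().lower()


def reward(final_turn: str,
           reference: str,
           judge: Callable[[str, str], bool] = exact_match,
           format_weight: float = 0.1) -> float:
    """Hypothetical reward: weighted sum of accuracy and format compliance."""
    match = FINAL_TURN.fullmatch(final_turn)
    if match is None:
        return 0.0  # assumption: no parseable answer means no reward at all
    accuracy = 1.0 if judge(match.group(2), reference) else 0.0
    return (1.0 - format_weight) * accuracy + format_weight
```

Under these assumptions a well-formed but wrong answer still earns the small format bonus, which nudges the policy toward protocol compliance without rewarding wrong answers much.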
```mermaid
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Raw QA Pool <br/>FVQA + Pixel-Reasoner + Expert Annotations"] --> B["Cold-start Data Pipeline <br/>Mining → Synthesis → Validation"]
B --> C["Cold-start SFT <br/>Learning Tool Protocols"]
C --> D["BN-GSPO RL <br/>Two-layer Normalized Advantage"]
D --> E["Three-tool Agentic Search-Reasoning Cycle <br/>text/image search + image crop"]
E -->|Multi-turn iteration| F["Final Answer"]
E --> G["HR-MMSearch Benchmark Evaluation"]
```
Key Designs
1. Unified Agentic Search-Reasoning Action Space: Coordinating "Seeing" and "Searching"
To address the gap between search agents (which fail on small objects) and pixel-reasoning agents (which lack external knowledge), SenseSearch promotes image cropping to a first-class action. In each turn, the model observes \(\mathcal{T}_t\) and chooses one of: ① text search (the Serper API retrieves the top-5 results, which Qwen3-32B summarizes to prevent context overflow); ② image search (Serper retrieves similar images and returns the top-5 titles and thumbnails); ③ image crop (given a normalized bbox in \([0,1]^4\), crops a region for fine-grained analysis); ④ final answer. Each turn must contain one reasoning step and one valid action. This allows adaptive sequences, e.g., cropping a racing driver's suit to read a logo that covers about 5% of the image, then text-searching for the brand's founding year.
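As an illustration of the crop action, the sketch below maps a normalized bbox to pixel coordinates. The `(x1, y1, x2, y2)` corner ordering and the function name `bbox_to_pixels` are assumptions; the summary only states that the bbox is normalized to \([0,1]^4\).

```python
import math
from typing import Tuple


def bbox_to_pixels(bbox: Tuple[float, float, float, float],
                   width: int,
                   height: int) -> Tuple[int, int, int, int]:
    """Convert a normalized (x1, y1, x2, y2) box to integer pixel bounds."""
    x1, y1, x2, y2 = bbox
    if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
        raise ValueError(f"invalid normalized bbox: {bbox}")
    # Floor the top-left corner and ceil the bottom-right corner so the
    # pixel box never cuts into the requested region.
    left, top = math.floor(x1 * width), math.floor(y1 * height)
    right, bottom = math.ceil(x2 * width), math.ceil(y2 * height)
    return left, top, right, bottom


box = bbox_to_pixels((0.25, 0.5, 0.5, 0.75), width=3840, height=2160)
```

The returned tuple follows the `(left, upper, right, lower)` ordering that Pillow's `Image.crop` expects, so it could be passed to that method directly.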
2. Two-stage Training + Three-stage Cold-start Pipeline: Protocol First, Optimization Second
Pure RL in multi-tool scenarios often falls into a "learning trap" where the model bypasses new tools in favor of familiar text search. Cold-start SFT therefore first maximizes the log-likelihood of target trajectories, so that the model learns each tool's calling protocol before RL begins.
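The SFT objective itself is not reproduced in this summary. A standard form of such a trajectory log-likelihood objective, written here as an assumption rather than the paper's exact equation, is:

\[
\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(q, I_0, \tau)\sim\mathcal{D}_{\text{SFT}}}\left[\sum_{t=1}^{|\tau|} m_t \,\log \pi_\theta\!\left(\tau_t \mid q, I_0, \tau_{<t}\right)\right]
\]

Here \(\mathcal{D}_{\text{SFT}}\) denotes the cold-start trajectory set, \(\tau_t\) the \(t\)-th token of a trajectory, and \(m_t \in \{0,1\}\) a mask that keeps only model-generated tokens (reasoning, tool calls, answer) and drops tool outputs. These symbols, and the masking in particular, are illustrative assumptions.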
The cold-start data is produced via a three-stage pipeline: Data Mining (merging FVQA, Pixel-Reasoner, and expert QA, then filtering for "hard" problems); Trajectory Synthesis (using Gemini-2.5-Flash to generate reasoning chains); and Quality Validation (using GPT-4o to check format, logic, and answers). SFT fine-tunes the LM while freezing the vision encoder and projector.
3. BN-GSPO: Batch Normalized Sequence-level Policy Optimization
Reinforcement learning over long trajectories with external rewards is sensitive to reward scales and task difficulties. BN-GSPO applies two-layer normalization to advantage estimation. First, a length-normalized, sequence-level importance ratio \(s_{b,g}\) is computed for each trajectory, as in GSPO.
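The ratio's formula is not preserved in this summary. Following the GSPO definition and the \((b, g)\) indexing used here, it would take the form below, where \(x_b\) (the \(b\)-th prompt) and \(y_{b,g}\) (its \(g\)-th sampled trajectory) are assumed notation:

\[
s_{b,g}(\theta) = \left(\frac{\pi_\theta(y_{b,g}\mid x_b)}{\pi_{\theta_{\text{old}}}(y_{b,g}\mid x_b)}\right)^{1/|y_{b,g}|}
= \exp\!\left(\frac{1}{|y_{b,g}|}\sum_{t=1}^{|y_{b,g}|}\log\frac{\pi_\theta(y_{b,g,t}\mid x_b, y_{b,g,<t})}{\pi_{\theta_{\text{old}}}(y_{b,g,t}\mid x_b, y_{b,g,<t})}\right)
\]

The exponent \(1/|y_{b,g}|\) is the length normalization: it turns the sequence-level likelihood ratio into a geometric mean of per-token ratios, which keeps long multi-turn trajectories from producing extreme values.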
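The first layer performs intra-group standardization as in GSPO: \(\bar{A}_{b,g} = (r_{b,g} - \text{mean}(\{r_{b,g'}\})) / \text{std}(\{r_{b,g'}\})\), where the mean and standard deviation are taken over the trajectories \(g'\) sampled for the same prompt \(b\). The second layer then standardizes these values again across the entire minibatch to obtain \(\tilde{A}_{b,g}\).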
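The second-layer equation is not preserved in this summary. Based on the description, a plausible form is the following, where the stabilizer \(\delta\) is an assumption added for numerical safety:

\[
\tilde{A}_{b,g} = \frac{\bar{A}_{b,g} - \text{mean}\left(\{\bar{A}_{b',g'}\}_{b',g'}\right)}{\text{std}\left(\{\bar{A}_{b',g'}\}_{b',g'}\right) + \delta}
\]

The mean and standard deviation now run over every prompt \(b'\) and every sample \(g'\) in the minibatch, rather than over a single prompt's group.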
This corrects for variance between different prompts within a batch. The policy is then trained with a clipped objective, \(\min(s_{b,g}\tilde{A}_{b,g}, \text{clip}(s_{b,g}, 1-\epsilon, 1+\epsilon)\tilde{A}_{b,g}) - \beta D_{\text{KL}}(\pi_\theta\|\pi_{\text{ref}})\); the ablation shows that this batch normalization is critical for stabilizing perception-heavy RL tasks.
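The two normalization layers and the clipped term can be sketched numerically as follows. This is an illustrative sketch, not the authors' code: the use of population standard deviation, the stabilizer `eps`, the clip range `clip_eps=0.2`, and the function names are assumptions, and the KL penalty is omitted.

```python
from statistics import fmean, pstdev
from typing import List


def standardize(values: List[float], eps: float = 1e-6) -> List[float]:
    """Zero-mean, unit-variance scaling; eps guards against zero variance."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def bn_gspo_advantages(rewards: List[List[float]], eps: float = 1e-6) -> List[List[float]]:
    """rewards[b][g] is the reward of the g-th trajectory sampled for prompt b."""
    # Layer 1: standardize within each prompt's group.
    group_norm = [standardize(group, eps) for group in rewards]
    # Layer 2: standardize again over every trajectory in the minibatch.
    flat = [a for group in group_norm for a in group]
    mu, sigma = fmean(flat), pstdev(flat)
    return [[(a - mu) / (sigma + eps) for a in group] for group in group_norm]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-trajectory surrogate: min(s * A, clip(s, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


# Two prompts with four samples each: one mixed group, one where every sample succeeds.
advantages = bn_gspo_advantages([[1.0, 0.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]])
```

In this toy batch the second group carries no learning signal (all rewards are equal, so its advantages stay at zero), and the batch-level pass rescales the first group so that the minibatch as a whole has unit variance.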
4. HR-MMSearch: The First High-Resolution, Search-Driven Benchmark
Existing benchmarks (FVQA, MMSearch) focus on whole-image understanding. HR-MMSearch uses 305 4K images of recent 2025 events to prevent data leakage. Questions are anchored to small objects or text occupying less than 5% of the image area, forcing the agent to crop/magnify before searching.
Key Experimental Results
Main Results
All models are built on Qwen2.5-VL-7B-Instruct. RL uses a global batch size of 128, a learning rate of 1e-6, and a maximum of \(T=10\) turns.
Search Benchmarks (Agentic Workflow, Pass@1):
| Benchmark | Qwen2.5-VL-7B (base) | MMSearch-R1 | SenseSearch-SFT | SenseSearch-RL |
|---|---|---|---|---|
| Average | 35.50 | 52.49 | 53.06 | 57.43 |
| MMSearch | 32.16 | 53.80 | 53.80 | 59.06 |
| HR-MMSearch | 19.34 | 20.33 | 29.80 | 38.52 |
| FVQA-test | 36.00 | 58.40 | 56.72 | 61.17 |
SenseSearch-RL outperforms MMSearch-R1 by 4.94 points on average (57.43 vs. 52.49) and is comparable to Gemini-2.5-Flash and GPT-4o. On HR-MMSearch, it exceeds the Qwen2.5-VL-7B base model run in the same agentic workflow by 19.18 points (38.52 vs. 19.34).
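The quoted gaps can be recomputed directly from the table. The snippet below only restates the table's numbers; the dictionary layout and the helper name `gain` are illustrative.

```python
# Scores copied from the search benchmark table (agentic workflow, Pass@1).
scores = {
    "Average":     {"base": 35.50, "MMSearch-R1": 52.49, "SFT": 53.06, "RL": 57.43},
    "MMSearch":    {"base": 32.16, "MMSearch-R1": 53.80, "SFT": 53.80, "RL": 59.06},
    "HR-MMSearch": {"base": 19.34, "MMSearch-R1": 20.33, "SFT": 29.80, "RL": 38.52},
    "FVQA-test":   {"base": 36.00, "MMSearch-R1": 58.40, "SFT": 56.72, "RL": 61.17},
}


def gain(benchmark: str, model: str, baseline: str) -> float:
    """Point difference between two columns on one benchmark, rounded to 2 decimals."""
    row = scores[benchmark]
    return round(row[model] - row[baseline], 2)


avg_gain_over_mmsearch_r1 = gain("Average", "RL", "MMSearch-R1")
hr_gain_over_base = gain("HR-MMSearch", "RL", "base")
hr_gain_over_mmsearch_r1 = gain("HR-MMSearch", "RL", "MMSearch-R1")
hr_gain_from_rl_stage = gain("HR-MMSearch", "RL", "SFT")
```

The same table also shows that on HR-MMSearch the gap to MMSearch-R1 is 18.19 points, and that the RL stage adds 8.72 points on top of the SFT checkpoint.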
Fine-grained Visual Understanding: SenseSearch-RL achieves an average of 72.8 on perception benchmarks (V*, HR-Bench), surpassing GPT-4o (71.5) and DeepEyes (72.5).
Ablation Study
BN-GSPO vs. GRPO/GSPO (Pure RL, Table 3):
| Algorithm | MMSearch | V* Bench | HR-Bench 4K |
|---|---|---|---|
| GRPO | 50.88 | 67.54 | 61.38 |
| GSPO | 53.80 | 53.93 | 44.50 |
| BN-GSPO | 56.72 | 79.05 | 69.12 |
Key Findings
- Batch normalization is key to stabilizing multi-tool RL: Standard GSPO collapses on perception tasks (V* 53.93 and HR-Bench 4K 44.50, both below GRPO). BN-GSPO stabilizes training by suppressing reward scale variance across heterogeneous trajectories, reaching 79.05 and 69.12 on the same benchmarks.
- Mixed data prevents overfitting: RL on perception data alone improves V* but degrades search. Combined training ensures balanced multi-tool strategies.
- RL improves tool-use efficiency: Average tool calls dropped from ~4 to ~2 during training, as the agent learned task-specific strategies (e.g., using only crop for V* and only search for MMSearch).
Highlights & Insights
- Unified action space bridges two disjoint research lines: Integrating "cropping" as an action equal to "searching" allows fine-grained perception and external knowledge retrieval to co-exist in one reasoning chain.
- BN-GSPO's second normalization layer is a lightweight stability trick: Minibatch-level normalization suppresses variance from heterogeneous prompts, offering a low-cost improvement for sequence-level RL.
- Clear division: SFT for protocol, RL for efficiency: SFT ensures rule compliance, while RL optimizes precision and reduces redundant actions.
- Benchmark constraints: The "5% area target + 2025 images" design ensures that models cannot rely on direct search or pre-training knowledge, truly testing coordination.
Limitations & Future Work
- Dependency on commercial judges and APIs: Reward signals rely on GPT-4o, and tools rely on Serper, affecting reproducibility and cost.
- Limited scale and turns: Only verified on 7B models with up to 10 turns; scalability to deeper reasoning chains is unexplored.
- Small benchmark size: 305 images across 8 domains may have limited statistical robustness.
- Future Work: Generalizing image operations (annotation, multi-image stitching), integrating tool-call costs into rewards, and adopting lightweight open-source judges.
Related Work & Insights
- vs. MMSearch-R1: SenseSearch adds image cropping and BN-GSPO, significantly improving performance on high-resolution targets (HR-MMSearch 38.52 vs 20.33).
- vs. Pixel Reasoner / DeepEyes: These focus on pixel-space operations without external knowledge. SenseSearch unifies both.
- vs. GSPO/GRPO: SenseSearch extends GSPO with batch normalization to handle the reward variance typical of multi-tool interaction.
Rating
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐