Asking like Socrates: Socrates helps VLMs understand remote sensing images

Conference: CVPR 2026
arXiv: 2511.22396
Code: https://geox-lab.github.io/Asking_like_Socrates
Area: Remote Sensing / Multimodal Reasoning
Keywords: Remote Sensing Image Understanding, Chain-of-Evidence Reasoning, Pseudo-reasoning, Socratic Method, Two-stage Reinforcement Learning

TL;DR

This work identifies a "pseudo-reasoning" phenomenon in remote sensing VLMs, in which generating explicit reasoning chains degrades performance. It attributes the problem to the "glance effect": the model perceives the image once, coarsely, and then reasons from insufficient visual evidence. The proposed remedy is RS-EoT (Evidence-of-Thought), an iterative evidence search paradigm. SocraticAgent self-play synthesizes reasoning trajectories for an SFT cold start, followed by two-stage progressive RL (grounding → VQA) for enhancement and generalization. RS-EoT-7B achieves SOTA on multiple remote sensing VQA and grounding benchmarks.

Background & Motivation

Background: Deep reasoning models (the DeepSeek-R1 style SFT-RL paradigm) have achieved breakthroughs in mathematics and code and have been extended to multimodal domains (Vision-R1, WeThink, R1-OneVision, etc.). On remote sensing tasks, however, the same paradigm behaves anomalously.

Limitations of Prior Work: Remote sensing VLMs generate explicit reasoning chains, but performance stagnates or decreases. Models merely "narrate the reasoning process" rather than performing "actual reasoning."

Glance Effect: Remote sensing images involve large spatial extents, significant scale variations, and sparse, subtle visual cues. Models start reasoning after a single coarse perception ("glance") \(\rightarrow\) based on incomplete visual evidence \(\rightarrow\) reasoning degrades into linguistically self-consistent narration rather than logic grounded in visual evidence.

Key Challenge: Remote sensing reasoning requires iterative, non-static evidence acquisition, yet existing models adopt a "glance-and-reason" paradigm. Human remote sensing analysts utilize repeated check-refinement loops.

Key Insight: RS-EoT — Let reasoning guide perception, dynamically searching for new visual evidence during the reasoning process (reasoning \(\rightarrow\) perception \(\rightarrow\) reasoning \(\rightarrow\) perception... loop), rather than relying on a fixed initial perspective.

Method

Overall Architecture

To address "pseudo-reasoning," RS-EoT replaces "glance-and-reason" with an iterative "reasoning \(\rightarrow\) evidence search \(\rightarrow\) re-reasoning" loop. The reasoning process drives perception to find new evidence as needed.

The pipeline produces RS-EoT-7B (base: Qwen2.5-VL-7B) in three steps. Step 1 is the SFT cold start: SocraticAgent, a three-role self-play system, synthesizes reasoning trajectories that feature iterative evidence search (the RS-EoT-4K dataset). Two-stage progressive RL follows: Stage 1 uses IoU rewards on grounding tasks to sharpen evidence search, and Stage 2 generalizes that capability to general remote sensing VQA.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
    IN["RS Image + Task Question<br/>(RGB / IR / SAR)"]
    subgraph SA["SocraticAgent: Tri-role Self-play Trajectory Synthesis"]
        direction TB
        R["Reasoner (GPT-5-mini)<br/>Text-only · Questioning · Integration"]
        P["Perceiver (Gemini-2.5-flash)<br/>Vision-only · Concise Answers"]
        R -->|Incremental Questions| P
        P -->|Visual Evidence| R
        V["Verifier (Doubao)<br/>Included only if Blind-Correct"]
    end
    IN --> SA
    SA --> D["RS-EoT-4K Reasoning Trajectories"]
    D --> SFT["SFT Cold Start<br/>Injecting Iterative Evidence Reasoning"]
    SFT --> RL1["Two-stage RL · Stage 1: Grounding<br/>IoU Reward for Evidence Search"]
    RL1 --> RL2["Two-stage RL · Stage 2: VQA<br/>MCQ Reconstruction + Tiered Reward"]
    RL2 --> OUT["RS-EoT-7B<br/>Reasoning-Perception-Reasoning Loop"]

Key Designs

1. SocraticAgent: Tri-role Self-play for Iterative Evidence Search Trajectories

The SocraticAgent synthesizes trajectories using three roles. The Reasoner (GPT-5-mini) reads only text and handles reasoning and questioning. The Perceiver (Gemini-2.5-flash) sees only the image and answers the specific questions it is asked. The Verifier (Doubao-seed-1.6-thinking) keeps a trajectory only if the Reasoner, which never sees the image, reaches the correct answer from the dialogue alone; this is taken as a sign that the collected evidence is reliable. This "information isolation" decouples reasoning from perception. Self-play prompts (e.g., telling the Reasoner that the Perceiver is "weak") force complex tasks to be decomposed into incremental questions. The resulting trajectories form the RS-EoT-4K dataset.
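The self-play loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the role signatures, the `("ask", ...)` / `("answer", ...)` protocol, the `max_turns` cap, and the toy stand-ins for the three models are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (sub-question, visual evidence)

@dataclass
class Trajectory:
    question: str
    turns: List[Turn] = field(default_factory=list)
    answer: Optional[str] = None

def synthesize(image, question: str, reference: str,
               reasoner: Callable[[str, List[Turn]], Tuple[str, str]],
               perceiver: Callable[[object, str], str],
               verifier: Callable[[str, str], bool],
               max_turns: int = 8) -> Optional[Trajectory]:
    """Run one self-play episode; return the trajectory only if it is verified."""
    traj = Trajectory(question)
    for _ in range(max_turns):
        # Reasoner is text-only: it sees the question and the dialogue, never the image.
        action, text = reasoner(question, traj.turns)
        if action == "answer":
            traj.answer = text
            break
        # Perceiver is vision-only: it answers one concrete sub-question about the image.
        traj.turns.append((text, perceiver(image, text)))
    # Keep the trajectory only if the "blind" Reasoner reached the reference answer.
    if traj.answer is not None and verifier(traj.answer, reference):
        return traj
    return None

# Toy stand-ins for the three models (hypothetical, for illustration only).
def toy_reasoner(question, turns):
    if len(turns) < 2:
        return "ask", f"sub-question {len(turns) + 1}"
    return "answer", "yes"

def toy_perceiver(image, sub_question):
    return f"evidence for {sub_question}"

def toy_verifier(answer, reference):
    return answer.strip().lower() == reference.strip().lower()

kept = synthesize(None, "Is there a ship?", "yes",
                  toy_reasoner, toy_perceiver, toy_verifier)
dropped = synthesize(None, "Is there a ship?", "no",
                     toy_reasoner, toy_perceiver, toy_verifier)
```

In this sketch `kept` holds a two-turn trajectory, while `dropped` is `None` because the blind answer does not match the reference. The filtering step is what ties trajectory quality to the evidence actually gathered in the dialogue.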

2. Two-stage Progressive RL: Grounding First, VQA Generalization Second

Stage 1 uses grounding tasks because they require precise, iterative visual evidence search. The reward combines an IoU score with a format reward, and the training data come from DIOR-RSVG and VRSBench. Stage 2 generalizes to general VQA. To prevent reward hacking on simple Yes/No questions, each question is reconstructed as a Multiple-Choice Question (MCQ), which forces the model to verify every option. The hierarchical (tiered) reward is defined as:

\[r_{qa} = 1 - \frac{1}{N}\sum_{i=1}^{N} |y_i - \hat{y}_i|\]
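A small sketch may make the two reward signals concrete. It assumes that \(N\) is the number of MCQ options and that \(y_i\) and \(\hat{y}_i\) are binary labels saying whether option \(i\) is (predicted to be) correct; the summary does not spell this out, so treat it as one plausible reading. The box format `(x1, y1, x2, y2)` and both function names are likewise assumptions, and the format reward is omitted because its exact form is not given.

```python
from typing import Sequence

def box_iou(pred: Sequence[float], gt: Sequence[float]) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gt = max(0.0, gt[2] - gt[0]) * max(0.0, gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0

def mcq_reward(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """r_qa = 1 - (1/N) * sum_i |y_i - y_hat_i| over the N options."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label vectors must be non-empty and of equal length")
    n = len(y_true)
    return 1.0 - sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# Stage 1 style signal: overlap 1, union 4 + 4 - 1 = 7.
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))

# Stage 2 style signal: one of four options judged wrongly -> 1 - 1/4.
r_qa = mcq_reward([1, 0, 0, 1], [1, 0, 1, 1])
print(r_qa)  # 0.75
```

Under this reading, a model that gets three of four options right earns partial credit (0.75) instead of the all-or-nothing signal of a Yes/No answer, which is presumably why the reward is described as tiered.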

3. RS-EoT Paradigm: Language-Driven Reasoning and On-demand Evidence

The paradigm follows two principles: reasoning is driven by natural language as a "control signal" for perception, and visual information serves as on-demand evidence. This replaces the "glance" with iterative evidence acquisition to eliminate pseudo-reasoning.

Loss & Training

SFT uses RS-EoT-4K (5 epochs, lr=3e-5). Both RL stages employ GRPO (2 epochs each, lr=1e-6, batch=512) based on Qwen2.5-VL-7B.
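For readers unfamiliar with GRPO, its core idea is that each sampled response is scored relative to the other responses drawn for the same prompt, so no separate value model is needed. The sketch below shows only that group-relative normalization; the function name, the `eps` term, and the use of the population standard deviation are choices made for this example, and the group size used in the paper is not stated here.

```python
from statistics import mean, pstdev
from typing import List, Sequence

def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt, scored by e.g. an IoU or MCQ reward.
advantages = group_relative_advantages([0.0, 0.5, 1.0, 0.5])
```

Responses that score above the group mean receive a positive advantage and those below it a negative one. If every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.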

Key Experimental Results

Main Results (RS VQA + Grounding)

Benchmark	Metric	RS-EoT-7B	Qwen2.5VL	WeThink	VL-Rethinker	Geo-R1
RSFG-VQA	Avg@5	67.85	62.45	55.04	58.80	45.03
RSFG-SC	Object@F1	56.52	36.78	38.35	34.84	20.82
VRSBench	Avg@5	63.09	62.45	62.17	55.04	57.00
RSVQA	Avg@5	75.16	67.20	40.74	65.57	34.50
DIOR-RSVG	mIoU	45.29	35.64	33.96	25.48	20.97
VRSBench-Ref	mIoU	48.04	21.99	34.07	25.29	4.51

RS-EoT-7B achieves SOTA across all listed VQA and grounding benchmarks. Relative to the Qwen2.5VL baseline, RSFG-SC Object@F1 rises from 36.78 to 56.52 (a +53.7% relative gain) and DIOR-RSVG mIoU rises from 35.64 to 45.29 (a +27.1% relative gain).
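The percentages quoted above are relative gains over the baseline, not absolute point differences. A quick check using the table values (the helper name is just for this example):

```python
def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0

print(round(relative_gain(56.52, 36.78), 1))  # 53.7  (RSFG-SC Object@F1)
print(round(relative_gain(45.29, 35.64), 1))  # 27.1  (DIOR-RSVG mIoU)
```

In absolute terms these correspond to gains of 19.74 and 9.65 points respectively.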

Ablation Study

Stage	RSFG-VQA	DIOR mIoU	Explanation
Qwen2.5-VL Baseline	62.45	35.64	No reasoning
+ SFT Cold Start	+ Gain	+ Gain	Injection of RS-EoT mode
+ RL-Grounding	+ Further Gain	Massive Gain	Enhanced evidence search
+ RL-VQA	Optimal	Maintained	Generalization to VQA

Key Findings

Quantification of Pseudo-reasoning: Reasoning models like WeThink performed worse than non-reasoning baselines on RS tasks, confirming the issue.
Attention map analysis shows clear alternating cycles of "reasoning \(\rightarrow\) evidence search \(\rightarrow\) reasoning."
Grounding RL exhibits positive transfer to VQA tasks.
MCQ reconstruction effectively avoids reward hacking.

Highlights & Insights

Diagnosis of Pseudo-reasoning: Systematically identifies and explains why reasoning can decrease performance in RS VLMs.
SocraticAgent Mechanism: The "mutual disparagement" prompt strategy, in which each role is told that its counterpart is weak, effectively decouples reasoning and perception for high-quality data synthesis.
Progressive RL Strategy: Starting with the most difficult evidence-seeking task (grounding) before generalizing to VQA follows an intuitive skill-learning curriculum.
MCQ Reconstruction: A practical solution to reward hacking in RS RL training with simple ground-truth labels.

Limitations & Future Work

The current loop is within the language domain; it does not yet explicitly crop image sub-regions.
SocraticAgent relies on expensive APIs for data synthesis.
Scaling effects for models larger than 7B are not yet verified.
Expansion to hyperspectral and other modalities is needed.

vs Geo-R1/VHM-RL: These use SFT-RL but rely on single global perception, leading to pseudo-reasoning in RS. RS-EoT solves this through iterative search.
vs EagleVision: RS-EoT shares the "reasoning-driven perception" philosophy but applies it to iterative local evidence search within single remote sensing images.

Rating

Novelty: ⭐⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐⭐
Value: ⭐⭐⭐⭐⭐