GeoExplorer: Active Geo-Localization with Curiosity-Driven Exploration

Conference: ICCV 2025 | arXiv: 2508.00152 | Code: Project Page | Area: Remote Sensing | Keywords: Active Geo-Localization, Curiosity-Driven Exploration, Reinforcement Learning, UAV Navigation, Multi-Modal Target

TL;DR

This paper proposes GeoExplorer, an active geo-localization (AGL) agent that combines goal-directed extrinsic rewards with curiosity-driven intrinsic rewards. By jointly modeling action-state dynamics and using the state-prediction error as an exploration signal within a reinforcement learning framework, GeoExplorer learns more robust UAV search strategies and generalizes better to unseen targets and environments.

Background & Motivation

Active Geo-Localization (AGL) refers to the task of navigating a UAV agent to a target location within a predefined search region. The target may be specified in multiple modalities (aerial image, ground-level image, or text), but its precise location remains unknown at inference time. This capability is critical for search-and-rescue operations.

Existing methods (e.g., GOMAA-Geo) rely on extrinsic rewards (distance-based rewards) and suffer from three key limitations:

Unreliable distance estimation: Because the target location is unknown at inference time, distance-based rewards cannot be computed directly, which reduces the reliability of the learned policies.

Insufficient environment modeling: These methods only predict action sequences without modeling state transitions, and thus fail to capture how actions alter the environment.

Weak generalization: Distance estimation tailored to training environments struggles to transfer to unseen targets and novel environments.

Core Insight: Curiosity-driven intrinsic rewards are introduced to complement goal-directed extrinsic rewards. The curiosity reward is based on the discrepancy between predicted and actual states, so it requires no knowledge of the target location and provides dense, target-agnostic, content-aware exploration guidance.

Method

Overall Architecture

GeoExplorer is trained in three stages:

  1. Feature Representation: aligned multi-modal encoders embed targets of different modalities.
  2. Action-State Dynamics Modeling (DM): supervised pre-training of a causal Transformer that jointly models actions and state transitions.
  3. Curiosity-Driven Exploration (CE): PPO-based Actor-Critic RL combining extrinsic and intrinsic rewards.

Key Designs

  1. Multi-Modal Feature Representation: Three aligned encoders are employed (see the encoding sketch after this list):

    • Aerial image encoder: Sat2Cap (ViT, aligned with CLIP)
    • Ground-level image encoder: CLIP_img
    • Text encoder: CLIP_text

    All encoders are frozen after pre-training, so targets from different modalities are embedded in a shared feature space.
  2. Joint Action-State Dynamics Modeling: A causal Transformer (Falcon-7B) simultaneously predicts:

    • the optimal action \(\hat{a}_t = \text{CausalTrans}(s_t | x_{t-1}, s_{goal})\), i.e., which action brings the agent closer to the target;
    • the state representation \(\hat{s}_t = \text{CausalTrans}(a_{t-1} | x_{t-1}, s_{goal})\), i.e., how actions change the environment.

    No architectural modifications are required; a state modeling loss \(\mathcal{L}_{State} = \sum_{t=1}^{N-1} \|\hat{s}_t - s_t\|^2_2\) is simply appended, giving the total DM loss \(\mathcal{L}_{DM} = \mathcal{L}_{Action} + \alpha \mathcal{L}_{State}\) (a loss sketch follows this list).
  3. Curiosity-Driven Intrinsic Reward: The discrepancy between the predicted next state \(\hat{s}_{t+1}\) and the actual next state \(s_{t+1}\), learned during the DM stage, serves as a measure of "surprise":

    • MSE formulation: \(r^{in}_t = \|\hat{s}_{t+1} - s_{t+1}\|^2_2\)
    • Cosine similarity formulation: \(r^{in}_t = -\cos(\hat{s}_{t+1}, s_{t+1})\)
    • Final reward: \(r^{CE}_t = r^{ex}_t + \beta r^{in}_t\), where \(\beta = 0.25\)

    Key: intrinsic rewards are normalized to \([-1, 1]\) before weighting (see the reward sketch below).
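A minimal sketch of the shared-space encoding in item 1, assuming the Hugging Face CLIP implementation; `encode_target` and the `sat2cap_encoder` argument are hypothetical names, and Sat2Cap loading is model-specific, so it is passed in as a callable.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

# Frozen CLIP backbone: ground-level images and text share its embedding space.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def encode_target(target, modality, sat2cap_encoder=None):
    """Embed a goal specification into the shared CLIP-aligned feature space."""
    if modality == "text":
        inputs = processor(text=[target], return_tensors="pt", padding=True)
        emb = clip.get_text_features(**inputs)
    elif modality == "ground":  # a PIL image of the target, seen from the ground
        inputs = processor(images=target, return_tensors="pt")
        emb = clip.get_image_features(**inputs)
    elif modality == "aerial":  # Sat2Cap is assumed to output CLIP-dim features
        emb = sat2cap_encoder(target)
    else:
        raise ValueError(f"unknown modality: {modality}")
    return F.normalize(emb, dim=-1)  # unit-norm so modalities are comparable
```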
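For item 2, the DM objective reduces to an action term plus the MSE state term; treating \(\mathcal{L}_{Action}\) as cross-entropy over the discrete grid actions is an assumption, since the list above only names it.

```python
import torch
import torch.nn.functional as F

def dm_loss(action_logits, gt_actions, pred_states, true_states, alpha=1.0):
    """L_DM = L_Action + alpha * L_State (mean-reduced variant of the sum above).

    action_logits: (B, N, A)   Transformer outputs read at state tokens
    gt_actions:    (B, N)      ground-truth shortest-path actions
    pred_states:   (B, N-1, D) states predicted at action tokens
    true_states:   (B, N-1, D) encoder features of the patches actually observed
    """
    l_action = F.cross_entropy(action_logits.flatten(0, 1), gt_actions.flatten())
    l_state = (pred_states - true_states).pow(2).sum(-1).mean()  # ||s_hat - s||^2
    return l_action + alpha * l_state
```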
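And for item 3, both intrinsic-reward formulations plus the combination with the extrinsic reward; the exact \([-1, 1]\) normalization scheme is not specified above, so the min-max rescaling over the batch is an assumption.

```python
import torch
import torch.nn.functional as F

def intrinsic_reward(pred_next, true_next, kind="mse"):
    # "Surprise" = discrepancy between predicted and observed next state.
    if kind == "mse":
        return (pred_next - true_next).pow(2).sum(dim=-1)       # squared L2
    return -F.cosine_similarity(pred_next, true_next, dim=-1)   # "cos" variant

def combined_reward(r_ex, r_in, beta=0.25):
    # Rescale intrinsic rewards to [-1, 1] before weighting (assumed min-max).
    r_in = 2 * (r_in - r_in.min()) / (r_in.max() - r_in.min() + 1e-8) - 1
    return r_ex + beta * r_in  # r^CE = r^ex + beta * r^in
```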

Loss & Training

  • DM stage: supervised pre-training on randomly generated {start, goal} trajectory pairs (see the sampling sketch after this list).
  • CE stage: the causal Transformer is frozen; only the Actor-Critic action prediction head (MLP) is trained.
  • Search grid: \(5\times5\); search budget \(B=10\); evaluation distances \(C\in\{4,5,6,7,8\}\).
  • Training is conducted exclusively on the Masa dataset; zero-shot transfer is evaluated on other datasets.
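A sketch of the DM-stage pair sampling referenced above; interpreting \(C\) as Manhattan distance is an assumption (it is consistent with the grid's maximum distance of 8), and `sample_pair` is a hypothetical helper.

```python
import random

GRID_SIZE, BUDGET = 5, 10  # 5x5 search grid, search budget B = 10 steps

def sample_pair(c):
    """Rejection-sample a {start, goal} pair exactly c apart on the grid."""
    while True:
        start = (random.randrange(GRID_SIZE), random.randrange(GRID_SIZE))
        goal = (random.randrange(GRID_SIZE), random.randrange(GRID_SIZE))
        if abs(start[0] - goal[0]) + abs(start[1] - goal[1]) == c:
            return start, goal

# Evaluation sweeps the start-goal distance C over {4, ..., 8}.
pairs = {c: sample_pair(c) for c in (4, 5, 6, 7, 8)}
```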

Key Experimental Results

Main Results (Tables)

Masa Dataset Validation Set (Success Rate, SR):

| Method | C=4 | C=5 | C=6 | C=7 | C=8 |
|---|---|---|---|---|---|
| Random | 0.141 | 0.058 | 0.064 | 0.025 | 0.024 |
| DiT | 0.201 | 0.296 | 0.357 | 0.422 | 0.456 |
| GOMAA-Geo | 0.409 | 0.506 | 0.717 | 0.803 | 0.785 |
| GeoExplorer | 0.432 | 0.532 | 0.816 | 0.923 | 0.950 |

Generalization to Unseen Targets (SwissViewMonuments):

| Method | C=4 | C=5 | C=6 | C=7 | C=8 |
|---|---|---|---|---|---|
| GOMAA-Geo | 0.403 | 0.383 | 0.627 | 0.728 | 0.783 |
| GeoExplorer (I) | 0.413 | 0.533 | 0.770 | 0.889 | 0.883 |
| GeoExplorer (G) | 0.416 | 0.517 | 0.773 | 0.878 | 0.783 |

Ablation Study (Table)

| Action Loss | State Loss | \(r^{ex}\) | \(r^{in}\) | C=4 | C=5 | C=6 | C=7 | C=8 |
|---|---|---|---|---|---|---|---|---|
| ✓ | | ✓ | | 0.409 | 0.506 | 0.717 | 0.803 | 0.785 |
| ✓ | ✓ | ✓ | | 0.398 | 0.494 | 0.761 | 0.841 | 0.865 |
| ✓ | | ✓ | MSE | 0.389 | 0.494 | 0.741 | 0.862 | 0.914 |
| ✓ | ✓ | ✓ | COS | 0.396 | 0.539 | 0.813 | 0.905 | 0.935 |
| ✓ | ✓ | ✓ | MSE | 0.432 | 0.532 | 0.816 | 0.923 | 0.950 |

The full combination of state modeling and curiosity reward yields the best performance; even without explicit supervision of state prediction (third row), the intrinsic reward alone provides measurable gains.

Key Findings

  • Improvements are most pronounced on long-horizon paths: at \(C=8\), GeoExplorer outperforms GOMAA-Geo by 0.1643 in success rate.
  • Cross-domain transfer: an average improvement of 0.0556 is observed on xBD-disaster (post-disaster environment with pre-disaster targets).
  • The curiosity reward is content-aware: transition patches from forest to urban scenes receive the highest intrinsic rewards.
  • More comprehensive exploration: at \(C=4\), GeoExplorer visits 30.79% of patches located inside the search region, compared to only 20.08% for GOMAA-Geo.
  • The performance advantage scales with larger search budgets (greater exploration capacity).

Highlights & Insights

  • Curiosity-driven RL is introduced into the AGL task for the first time, integrating seamlessly into the sequence modeling framework without additional components.
  • Joint action-state modeling requires no modification to the Transformer architecture — a single state loss suffices.
  • A new SwissView benchmark is proposed, with the SwissViewMonuments subset specifically designed to evaluate generalization to unseen targets.
  • Visualization analysis of curiosity rewards is highly intuitive: semantic transitions (e.g., forest → urban) receive high intrinsic rewards.

Limitations & Future Work

  • The search space is a discrete grid (\(5\times5\)); future work should extend to continuous state and action spaces.
  • Real-world deployment challenges such as UAV ego-pose noise and observation distortion are not addressed.
  • The optimal balance between intrinsic and extrinsic rewards warrants further investigation.
  • The training set (Masa) covers relatively homogeneous scenes, which may limit performance in more diverse environments.

Context & Comparison

  • Compared to GOMAA-Geo (action modeling only + extrinsic rewards), GeoExplorer introduces two additional dimensions: state modeling and intrinsic rewards.
  • Curiosity-driven RL has been extensively studied in classical control and game-playing tasks; this work applies it to geo-localization for the first time.
  • The "by-product" of state prediction is elegantly repurposed to construct the intrinsic reward.

Rating

  • Novelty: ⭐⭐⭐⭐ (first integration of curiosity-driven RL with AGL)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (4 benchmarks + new dataset + comprehensive ablations + rich visualizations)
  • Writing Quality: ⭐⭐⭐⭐ (clear structure, detailed supplementary material)
  • Value: ⭐⭐⭐⭐ (practical significance for search-and-rescue UAV deployment)