
Where am I? Cross-View Geo-localization with Natural Language Descriptions

Conference: ICCV 2025 arXiv: 2412.17007 Code: yejy53.github.io/CVG-Text Area: Autonomous Driving / Cross-View Geo-localization Keywords: Cross-View Geo-localization, Natural Language, Text-to-Image Retrieval, Satellite, OSM, LMM, Explainable Retrieval

TL;DR

This paper introduces a novel task of cross-view geo-localization via natural language descriptions, constructs the CVG-Text multimodal dataset covering 30,000+ coordinates across 3 cities (street-view + satellite + OSM + text), and proposes CrossText2Loc — a method employing Extended Positional Embedding for long-text handling and an Explainable Retrieval Module for localization rationale, achieving over 10% improvement in Top-1 Recall.

Background & Motivation

Problem Scenario

In GPS-denied environments (urban canyon occlusion, indoor-outdoor transitions, emergency calls, etc.), users need to determine their location by describing the surrounding environment in natural language. Examples include:

  • A taxi passenger verbally communicating their location to the driver
  • A pedestrian describing their surroundings during an emergency call so rescuers can locate them

Limitations of Prior Work

Cross-view localization focuses on image matching: Methods such as Sample4G and SAFA concentrate on street-view-to-satellite image retrieval, but in practice users may only be able to provide textual descriptions.

Text-based localization limited to point clouds: Text2Pose and Text2Loc perform text-based localization in 3D point clouds, but point cloud acquisition is costly, storage overhead is large, and global-scale deployment is impractical.

Satellite/OSM data are more practical: Satellite imagery and OpenStreetMap offer global coverage at low storage cost, yet text-to-satellite/OSM cross-view retrieval has not been previously studied.

Insufficient handling of long text: Scene descriptions are typically long (averaging 126 tokens), whereas models such as CLIP enforce a maximum sequence length of 77 tokens, truncating critical information.

Method

CVG-Text Dataset Construction

Data Collection

  • Covers 3 cities: New York (urban), Brisbane (suburban), and Tokyo (urban)
  • 30,000+ coordinate points, each containing:
    • Panoramic street-view images (2048×1024) and single-view street-view images
    • Satellite images (512×512, zoom level 20, resolution ~0.12 m)
    • OSM raster tiles (512×512, retaining POI annotations)

GPT-4o-Driven Text Generation

A progressive scene analysis strategy is adopted:

  1. OCR Preprocessing: PaddleOCR extracts textual information from street-view images (shop names, bus stop signs, etc.), helping GPT accurately capture key localization cues and reducing hallucinations (see the sketch after this list).
  2. Open-World Segmentation: Semantic segmentation is applied to street-view images to provide positional and semantic details; OCR results from moving objects (e.g., vehicles) are filtered out.
  3. Systematic Prompting: GPT-4o is guided to progressively describe scenes in the order of "road features → building landmarks → overall environment," using simple directional terms such as front, back, left, and right.
  4. Quality Control: Format filtering → GPT self-review → expert human annotation (20% of samples, 10 annotators × 100 hours, pass rate 77.6%).
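
The following is an illustrative sketch of how the OCR-seeded GPT-4o captioning step could be wired up, assuming PaddleOCR for text extraction and the OpenAI Python client for GPT-4o; the prompt wording, confidence threshold, and helper names are hypothetical and not the authors' implementation (PaddleOCR's return format also varies slightly across versions).

```python
# Illustrative sketch of the OCR-assisted caption step (not the authors' code).
# Assumes: paddleocr and openai are installed, and OPENAI_API_KEY is set.
import base64
from paddleocr import PaddleOCR
from openai import OpenAI

ocr_engine = PaddleOCR(lang="en")   # text detector + recognizer
client = OpenAI()                   # GPT-4o access via the OpenAI API

def extract_scene_text(image_path: str, min_conf: float = 0.8) -> list[str]:
    """Collect high-confidence OCR strings (shop names, bus stop signs, ...)."""
    results = ocr_engine.ocr(image_path)
    texts = []
    for line in results[0] or []:        # each line: [bbox, (text, confidence)]
        _, (text, conf) = line
        if conf >= min_conf:
            texts.append(text)
    return texts

def describe_scene(image_path: str) -> str:
    """Ask GPT-4o for a progressive road -> landmark -> environment description,
    seeded with the OCR cues to reduce hallucinated text."""
    ocr_cues = extract_scene_text(image_path)
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = (
        "Describe this street scene for localization, in the order: "
        "road features, building landmarks, overall environment. "
        "Use only front/back/left/right as directions. "
        f"Text visible in the scene: {', '.join(ocr_cues)}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```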

Text statistics: average length 126 tokens, type-token ratio (TTR) 0.76, low inter-text similarity (0.17), indicating high quality.
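
As a rough illustration of how such statistics can be computed (the exact similarity measure is not specified here, so TF-IDF cosine similarity is assumed for this sketch):

```python
# Illustrative computation of type-token ratio and mean pairwise text similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def type_token_ratio(text: str) -> float:
    """TTR = number of unique tokens / total number of tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_pairwise_similarity(corpus: list[str]) -> float:
    """Average TF-IDF cosine similarity over all distinct description pairs."""
    sims = cosine_similarity(TfidfVectorizer().fit_transform(corpus))
    n = len(corpus)
    return (sims.sum() - n) / (n * (n - 1))   # exclude the diagonal

descriptions = ["In front is a wide road ...", "To the left a red brick cafe ..."]
print(sum(map(type_token_ratio, descriptions)) / len(descriptions))
print(mean_pairwise_similarity(descriptions))
```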

CrossText2Loc Model

Image–Text Contrastive Learning

A dual-stream architecture consisting of a text encoder and a visual encoder aligns cross-domain features via contrastive learning:

\[L_{itc} = -\sum_{i=1}^n \log\frac{\exp(\text{sim}(v_i, t_i)/\tau)}{\sum_{k=1}^n \exp(\text{sim}(v_i, t_k)/\tau)}\]

where \(\tau\) is a learnable temperature parameter.
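
A minimal PyTorch sketch of this image-text contrastive (InfoNCE) objective, written as a generic CLIP-style loss rather than the authors' code; \(\tau\) is learnable in the paper but fixed here for brevity, and a symmetric two-direction variant is also common:

```python
# Generic image-text contrastive (InfoNCE) loss over a batch, as a sketch.
import torch
import torch.nn.functional as F

def itc_loss(image_feats: torch.Tensor, text_feats: torch.Tensor,
             tau: float = 0.07) -> torch.Tensor:
    """image_feats, text_feats: (n, d) paired embeddings from the two encoders."""
    v = F.normalize(image_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.T / tau                              # (n, n): sim(v_i, t_k) / tau
    labels = torch.arange(v.size(0), device=v.device)   # matched pairs on the diagonal
    return F.cross_entropy(logits, labels)              # -log softmax at (i, i), batch mean

# Example usage with random features for a batch of 8 pairs.
loss = itc_loss(torch.randn(8, 512), torch.randn(8, 512))
```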

Extended Positional Embedding (EPE)

Scene descriptions average 126 tokens, yet CLIP is limited to 77 tokens. Positional embeddings are extended to \(N=300\) tokens via linear interpolation:

\[P^*(x) = (1-(x-\lfloor x\rfloor)) \cdot P(\lfloor x\rfloor) + (x-\lfloor x\rfloor) \cdot P(\lceil x\rceil)\]

Unlike LongCLIP's knowledge-preserving stretching, full-sequence interpolation is adopted here, since GPT-generated descriptions do not begin with a salient short title whose original positions would need to be kept intact.
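
A minimal sketch of this full-sequence linear interpolation of CLIP's positional-embedding table (77 to 300 positions), assuming a PyTorch embedding matrix; the target length and embedding width are taken from the text, everything else is illustrative:

```python
# Stretch a (77, d) positional-embedding table to (300, d) by linear interpolation.
import torch

def extend_positional_embedding(pos_embed: torch.Tensor, new_len: int = 300) -> torch.Tensor:
    """pos_embed: (old_len, d) learned positional embeddings (e.g. CLIP's 77 x d)."""
    old_len, _ = pos_embed.shape
    # Map each new position to a fractional index x in the original table.
    x = torch.linspace(0, old_len - 1, new_len)
    lo = x.floor().long()
    hi = x.ceil().long()
    frac = (x - lo.float()).unsqueeze(1)                 # (new_len, 1)
    # P*(x) = (1 - frac) * P(floor(x)) + frac * P(ceil(x))
    return (1 - frac) * pos_embed[lo] + frac * pos_embed[hi]

# Example: extend a CLIP-like 77-token table of width 512 to 300 tokens.
new_table = extend_positional_embedding(torch.randn(77, 512), new_len=300)
print(new_table.shape)  # torch.Size([300, 512])
```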

Explainable Retrieval Module (ERM)

An optional inference-time module that enhances retrieval interpretability and trustworthiness:

  1. Attention Heatmap Generation: Non-negative gradient contributions are iteratively accumulated from starting layer \(s\) to output layer \(L\) (a minimal sketch follows this list):
\[R^{(l)} = R^{(l-1)} + \frac{1}{H}\sum_{h=1}^H \max(0, \nabla A_h^{(l)} \odot A_h^{(l)}) R^{(l-1)}\]
  2. LMM Explanation: Text and image heatmaps are fed to GPT-4o, which analyzes the key cues, performs comparative reasoning, and outputs a retrieval rationale together with a confidence score.
  3. Confidence-Based Re-ranking: ERM confidence scores are normalized and summed with the similarity scores to re-rank the Top-5 results.
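
A minimal sketch of the relevancy-accumulation rule and the confidence-based re-ranking, assuming per-layer attention maps and their gradients have already been collected; the identity initialization of \(R\) and the min-max normalization are assumptions of this sketch rather than the paper's exact choices:

```python
# Sketch of relevancy accumulation over attention layers and Top-5 re-ranking.
import torch

def accumulate_relevancy(attn_maps: list, attn_grads: list) -> torch.Tensor:
    """attn_maps, attn_grads: lists of (H, T, T) tensors for layers s..L, holding
    attention weights and their gradients w.r.t. the matching score."""
    T = attn_maps[0].shape[-1]
    R = torch.eye(T)                                        # R^(s-1): identity (assumed)
    for A, dA in zip(attn_maps, attn_grads):
        # Per-head non-negative gradient-weighted attention, averaged over H heads.
        contrib = torch.clamp(dA * A, min=0).mean(dim=0)    # (T, T)
        R = R + contrib @ R                                 # R^(l) = R^(l-1) + contrib R^(l-1)
    return R

def rerank_top5(sim_scores, erm_confidences) -> torch.Tensor:
    """Min-max normalize ERM confidences, add them to similarity scores, re-rank."""
    sim = torch.as_tensor(sim_scores, dtype=torch.float)
    conf = torch.as_tensor(erm_confidences, dtype=torch.float)
    conf = (conf - conf.min()) / (conf.max() - conf.min() + 1e-8)
    return torch.argsort(sim + conf, descending=True)       # re-ranked candidate indices
```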

Key Experimental Results

Main Results: Cross-View Text Retrieval Localization

| Method | NYC-Sat R@1 | NYC-OSM R@1 | Brisbane-Sat R@1 | Brisbane-OSM R@1 | Tokyo-Sat R@1 | Tokyo-OSM R@1 |
|---|---|---|---|---|---|---|
| CLIP-L/14 | 35.08 | 31.50 | 34.08 | 32.50 | 28.08 | 21.00 |
| SigLIP-SO400M | 33.50 | 27.75 | 34.25 | 29.75 | 28.42 | 17.50 |
| BLIP | 34.58 | 52.92 | 34.50 | 43.00 | 29.75 | 30.67 |
| Ours (w/o ERM) | 46.25 | 59.08 | 43.58 | 46.08 | 36.83 | 34.33 |
| Ours (w/ ERM) | 50.33 | 62.33 | 47.58 | 48.75 | 41.75 | 36.92 |

Key Findings:

  • CrossText2Loc outperforms the strongest baseline, BLIP, by 15.75% on satellite retrieval (New York) and 9.41% on OSM retrieval.
  • ERM re-ranking further improves R@1 by 4–5%, simulating user decision-making based on the provided rationales.
  • OSM retrieval performs best in New York (rich POI data such as bus stops and shop names) and worst in Tokyo (insufficient CLIP pretraining on Japanese text).

Ablation Study: EPE Module

| Method | Sat R@1 | OSM R@1 |
|---|---|---|
| CLIP | 35.08 | 31.50 |
| CLIP + EPE | 46.25 | 59.08 |
| SigLIP | 19.67 | 20.17 |
| SigLIP + EPE | 29.50 | 45.25 |

EPE yields significant improvements on both encoders, with OSM retrieval showing the most pronounced gain (+27.6%), demonstrating that fine-grained details in long descriptions are critical for POI matching.

Ablation Study: Text Generation Quality

| Text Source | Avg. Length (tokens) | TTR | Inter-text Simi. | OSM R@1 | Sat R@1 |
|---|---|---|---|---|---|
| Direct GPT generation | 108 | 0.74 | 0.22 | 25.17 | 38.00 |
| CVG-Text (OCR + Seg. + Chain Prompting) | 126 | 0.76 | 0.17 | 59.08 | 46.25 |

OCR assistance lets GPT accurately capture street-view text and reduces hallucinations, yielding a 33.9% improvement in OSM retrieval.

Text Augmentation for Cross-View Retrieval

| Query Modality | OSM R@1 | Sat R@1 |
|---|---|---|
| Street-view image only (Sample4G) | 27.10 | 91.70 |
| Text only | 59.08 | 46.25 |
| Image + Text fusion | 67.30 | 98.40 |

The text branch provides complementary information to conventional cross-view retrieval, improving OSM retrieval by 40.2%.

Highlights & Insights

  1. Novel Task Definition: This is the first work to formulate cross-view geo-localization via natural language descriptions, filling the gap in text-to-satellite/OSM retrieval research.
  2. High-Quality Dataset: CVG-Text employs a progressive pipeline of OCR + segmentation + GPT-4o, generating scene descriptions of substantially higher quality than direct GPT generation.
  3. Long-Text Handling: EPE extends positional embeddings via simple linear interpolation, yet achieves a 27.6% gain on OSM retrieval, confirming that fine-grained details in scene descriptions must not be discarded.
  4. Explainable Retrieval: ERM provides not only similarity scores but also natural language retrieval rationales and confidence estimates, simulating the decision-making process in real-world applications.
  5. Clear Practical Value: Text retrieval complements image retrieval; their fusion achieves 98.4% R@1 on satellite retrieval.

Limitations & Future Work

  1. Textual descriptions are generated by GPT-4o, which may diverge from real users' description habits (typically shorter and more ambiguous).
  2. Coverage is limited to 3 cities, with insufficient geographic diversity and limited consideration of regional street-view style variation.
  3. ERM relies on GPT-4o for inference-time explanation, introducing additional computational cost and latency.
  4. Degraded performance in Tokyo exposes CLIP's limitations in non-English scenarios.
  5. The retrieval scope is \(M=100\) (approximately 10 km²); scalability to large-area retrieval remains to be validated.
Related Work

  • Cross-View Geo-localization: CVUSA, VIGOR, Sample4G — image-to-image retrieval
  • Vision-Language Navigation and Localization: Text2Pose, Text2Loc — text-based localization in point clouds, with high acquisition cost
  • LMM-Based Data Synthesis: LatteCLIP — leveraging LMMs to synthesize text for unsupervised CLIP fine-tuning
  • Multimodal Alignment: CLIP, LongCLIP — text-image contrastive learning with sequence length limitations

Rating

  • Novelty: ⭐⭐⭐⭐⭐ (Pioneering definition of text-to-satellite/OSM cross-view geo-localization task)
  • Technical Depth: ⭐⭐⭐⭐ (EPE is simple yet effective; ERM enhances interpretability; overall methodology is engineering-oriented)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (Multi-city, multi-source, extensive ablations; lacks comparison with other text-based localization methods)
  • Practical Value: ⭐⭐⭐⭐⭐ (Clear application scenarios including emergency localization and pedestrian navigation)
  • Overall Recommendation: ⭐⭐⭐⭐ (A complete contribution introducing a new task and dataset with the potential to open a new research direction)