Where am I? Cross-View Geo-localization with Natural Language Descriptions¶
Conference: ICCV 2025 arXiv: 2412.17007 Code: yejy53.github.io/CVG-Text Area: Autonomous Driving / Cross-View Geo-localization Keywords: Cross-View Geo-localization, Natural Language, Text-to-Image Retrieval, Satellite, OSM, LMM, Explainable Retrieval
TL;DR¶
This paper introduces a novel task of cross-view geo-localization via natural language descriptions, constructs the CVG-Text multimodal dataset covering 30,000+ coordinates across 3 cities (street-view + satellite + OSM + text), and proposes CrossText2Loc — a method employing Extended Positional Embedding for long-text handling and an Explainable Retrieval Module for localization rationale, achieving over 10% improvement in Top-1 Recall.
Background & Motivation¶
Problem Scenario¶
In GPS-denied environments (urban canyon occlusion, indoor-outdoor transitions, emergency calls, etc.), users need to determine their location by describing the surrounding environment in natural language. Examples include:
- A taxi passenger verbally communicating their location to the driver
- A pedestrian describing surroundings during an emergency call for rescue localization
Limitations of Prior Work¶
Cross-view localization focuses on image matching: Methods such as Sample4Geo and SAFA concentrate on street-view-to-satellite image retrieval, but in practice users may only be able to provide textual descriptions.
Text-based localization limited to point clouds: Text2Pos and Text2Loc perform text-based localization in 3D point clouds, but point cloud acquisition is costly, storage overhead is large, and global-scale deployment is impractical.
Satellite/OSM data are more practical: Satellite imagery and OpenStreetMap offer global coverage at low storage cost, yet text-to-satellite/OSM cross-view retrieval has not been previously studied.
Insufficient handling of long text: Scene descriptions are typically long (averaging 126 tokens), whereas models such as CLIP enforce a maximum sequence length of 77 tokens, truncating critical information.
Method¶
CVG-Text Dataset Construction¶
Data Collection¶
- Covers 3 cities: New York (urban), Brisbane (suburban), and Tokyo (urban)
- 30,000+ coordinate points, each containing:
- Panoramic street-view images (2048×1024) and single-view street-view images
- Satellite images (512×512, zoom level 20, resolution ~0.12 m; see the note after this list)
- OSM raster tiles (512×512, retaining POI annotations)
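As a rough consistency check (not from the paper), the standard Web Mercator ground-sampling formula relates zoom level \(z=20\) to the stated ~0.12 m resolution at the three cities' latitudes \(\varphi\):

\[
\text{GSD} \approx \frac{156543.03 \cdot \cos\varphi}{2^{z}} \ \text{m/px}
\quad\Rightarrow\quad
\frac{156543.03 \cdot \cos 35.7^{\circ}}{2^{20}} \approx 0.12 \ \text{m/px (Tokyo)},
\]

with New York (~0.11 m/px) and Brisbane (~0.13 m/px) falling in the same range.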
GPT-4o-Driven Text Generation¶
A progressive scene analysis strategy is adopted:
- OCR Preprocessing: PaddleOCR extracts textual information from street-view images (shop names, bus stop signs, etc.), helping GPT accurately capture key localization cues and reducing hallucinations.
- Open-World Segmentation: Semantic segmentation is applied to street-view images to provide positional and semantic details; OCR results from moving objects (e.g., vehicles) are filtered out.
- Systematic Prompting: GPT-4o is guided to progressively describe scenes in the order of "road features → building landmarks → overall environment," using simple directional terms such as front, back, left, and right.
- Quality Control: Format filtering → GPT self-review → expert human annotation (20% of samples, 10 annotators × 100 hours, pass rate 77.6%).
Text statistics: average length 126 tokens, type-token ratio (TTR) 0.76, low inter-text similarity (0.17), indicating high quality.
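How these statistics are computed is not spelled out above; a minimal sketch, assuming TTR is the ratio of unique to total tokens and inter-text similarity is the mean pairwise cosine similarity of text embeddings (function names and the embedding source are illustrative):

```python
from itertools import combinations
import numpy as np

def type_token_ratio(tokens: list[str]) -> float:
    """TTR = distinct tokens / total tokens; higher means richer vocabulary."""
    return len(set(tokens)) / len(tokens)

def mean_pairwise_similarity(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all pairs of description embeddings;
    lower values (e.g. 0.17) suggest location-specific, non-templated text."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = [float(normed[i] @ normed[j]) for i, j in combinations(range(len(normed)), 2)]
    return float(np.mean(sims))
```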
CrossText2Loc Model¶
Image–Text Contrastive Learning¶
A dual-stream architecture consisting of a text encoder and a visual encoder aligns cross-domain features through a symmetric image–text contrastive loss with a learnable temperature parameter \(\tau\).
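The paper's exact formulation is not reproduced here; a standard CLIP-style symmetric InfoNCE objective consistent with this description, for a batch of \(B\) paired normalized text embeddings \(t_i\) and image embeddings \(v_i\), is

\[
\mathcal{L} = -\frac{1}{2B}\sum_{i=1}^{B}\left[\log\frac{\exp(t_i^{\top} v_i/\tau)}{\sum_{j=1}^{B}\exp(t_i^{\top} v_j/\tau)} + \log\frac{\exp(t_i^{\top} v_i/\tau)}{\sum_{j=1}^{B}\exp(t_j^{\top} v_i/\tau)}\right].
\]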
Extended Positional Embedding (EPE)¶
Scene descriptions average 126 tokens, yet CLIP is limited to 77 tokens. The pretrained positional embeddings are therefore extended to \(N=300\) positions via linear interpolation.
Unlike LongCLIP, since GPT-generated descriptions do not begin with a salient short title, full-sequence interpolation is adopted rather than LongCLIP's knowledge-preserved stretching.
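A minimal sketch of this idea, assuming PyTorch and a CLIP-style \(77 \times D\) positional-embedding table (function name and shapes are illustrative, not the authors' code):

```python
import torch
import torch.nn.functional as F

def extend_positional_embedding(pos_embed: torch.Tensor, target_len: int = 300) -> torch.Tensor:
    """Stretch a pretrained positional-embedding table (orig_len x dim)
    to target_len positions by linear interpolation over the full sequence."""
    x = pos_embed.t().unsqueeze(0)                      # (1, dim, orig_len)
    x = F.interpolate(x, size=target_len, mode="linear", align_corners=True)
    return x.squeeze(0).t()                             # (target_len, dim)

# Example: extend CLIP's 77-token table to 300 tokens.
clip_pos = torch.randn(77, 768)                          # placeholder weights
print(extend_positional_embedding(clip_pos).shape)       # torch.Size([300, 768])
```

Only the positional table changes; the extended model can then be trained with the contrastive objective above on the full-length descriptions.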
Explainable Retrieval Module (ERM)¶
An optional inference-time module that enhances retrieval interpretability and trustworthiness:
- Attention Heatmap Generation: Non-negative gradient contributions are accumulated iteratively from a starting layer \(s\) to the output layer \(L\), producing text and image attention heatmaps.
- LMM Explanation: Text and image heatmaps are fed to GPT-4o, which performs key-cue analysis → comparative reasoning → retrieval rationale and confidence output.
- Confidence-Based Re-ranking: ERM confidence scores are normalized and summed with the similarity scores to re-rank the Top-5 results (see the sketch below).
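A minimal sketch of the re-ranking step, assuming min-max normalization of both score types (the paper's exact normalization scheme is an assumption here):

```python
import numpy as np

def rerank_top5(similarities: np.ndarray, erm_confidences: np.ndarray) -> np.ndarray:
    """Fuse retrieval similarity with ERM confidence for the Top-5 candidates
    and return candidate indices sorted by the fused score (best first)."""
    def minmax(x: np.ndarray) -> np.ndarray:
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    fused = minmax(similarities) + minmax(erm_confidences)
    return np.argsort(-fused)

# Hypothetical scores: similarity alone ranks candidate 0 first,
# but GPT-4o's high confidence in candidate 2 moves it to the top.
sims = np.array([0.71, 0.69, 0.68, 0.55, 0.51])
conf = np.array([0.40, 0.30, 0.95, 0.20, 0.10])
print(rerank_top5(sims, conf))  # [2 0 1 3 4]
```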
Key Experimental Results¶
Main Results: Cross-View Text Retrieval Localization¶
| Method | NYC-Sat R@1 | NYC-OSM R@1 | Brisbane-Sat R@1 | Brisbane-OSM R@1 | Tokyo-Sat R@1 | Tokyo-OSM R@1 |
|---|---|---|---|---|---|---|
| CLIP-L/14 | 35.08 | 31.50 | 34.08 | 32.50 | 28.08 | 21.00 |
| SigLIP-SO400M | 33.50 | 27.75 | 34.25 | 29.75 | 28.42 | 17.50 |
| BLIP | 34.58 | 52.92 | 34.50 | 43.00 | 29.75 | 30.67 |
| Ours (w/o ERM) | 46.25 | 59.08 | 43.58 | 46.08 | 36.83 | 34.33 |
| Ours (w/ ERM) | 50.33 | 62.33 | 47.58 | 48.75 | 41.75 | 36.92 |
Key Findings:
- CrossText2Loc outperforms the strongest baseline BLIP by 15.75% on satellite retrieval (New York) and 9.41% on OSM retrieval.
- ERM re-ranking further improves R@1 by 4–5%, simulating user decision-making based on the provided rationales.
- OSM retrieval performs best in New York (rich POI data such as bus stops and shop names) and worst in Tokyo (insufficient CLIP pretraining on Japanese text).
Ablation Study: EPE Module¶
| Method | Sat R@1 | OSM R@1 |
|---|---|---|
| CLIP | 35.08 | 31.50 |
| CLIP + EPE | 46.25 | 59.08 |
| SigLIP | 19.67 | 20.17 |
| SigLIP + EPE | 29.50 | 45.25 |
EPE yields significant improvements on both encoders, with OSM retrieval showing the most pronounced gain (+27.6%), demonstrating that fine-grained details in long descriptions are critical for POI matching.
Ablation Study: Text Generation Quality¶
| Text Source | Avg. Len (tokens) | TTR | Inter-text Sim. | R@1-OSM | R@1-Sat |
|---|---|---|---|---|---|
| Direct GPT generation | 108 | 0.74 | 0.22 | 25.17 | 38.00 |
| CVG-Text (OCR + Seg. + Chain Prompting) | 126 | 0.76 | 0.17 | 59.08 | 46.25 |
OCR-assisted precise capture of street-view text → reduced GPT hallucinations → 33.9% improvement in OSM retrieval.
Text Augmentation for Cross-View Retrieval¶
| Query Modality | OSM R@1 | Sat R@1 |
|---|---|---|
| Street-view image only (Sample4Geo) | 27.10 | 91.70 |
| Text only | 59.08 | 46.25 |
| Image + Text fusion | 67.30 | 98.40 |
The text branch provides complementary information to conventional cross-view retrieval, improving OSM retrieval by 40.2%.
Highlights & Insights¶
- Novel Task Definition: This is the first work to formulate cross-view geo-localization via natural language descriptions, filling the gap in text-to-satellite/OSM retrieval research.
- High-Quality Dataset: CVG-Text employs a progressive pipeline of OCR + segmentation + GPT-4o, generating scene descriptions of substantially higher quality than direct GPT generation.
- Long-Text Handling: EPE extends positional embeddings via simple linear interpolation, yet achieves a 27.6% gain on OSM retrieval, confirming that fine-grained details in scene descriptions must not be discarded.
- Explainable Retrieval: ERM provides not only similarity scores but also natural language retrieval rationales and confidence estimates, simulating the decision-making process in real-world applications.
- Clear Practical Value: Text retrieval complements image retrieval; their fusion achieves 98.4% R@1 on satellite retrieval.
Limitations & Future Work¶
- Textual descriptions are generated by GPT-4o, which may diverge from real users' description habits (typically shorter and more ambiguous).
- Coverage is limited to 3 cities, with insufficient geographic diversity and limited consideration of regional street-view style variation.
- ERM relies on GPT-4o for inference-time explanation, introducing additional computational cost and latency.
- Degraded performance in Tokyo exposes CLIP's limitations in non-English scenarios.
- The retrieval scope is \(M=100\) (approximately 10 km²); scalability to large-area retrieval remains to be validated.
Related Work & Insights¶
- Cross-View Geo-localization: CVUSA, VIGOR, Sample4Geo — image-to-image retrieval
- Vision-Language Navigation and Localization: Text2Pos, Text2Loc — text-based localization in point clouds, with high acquisition cost
- LMM-Based Data Synthesis: LatteCLIP — leveraging LMMs to synthesize text for unsupervised CLIP fine-tuning
- Multimodal Alignment: CLIP, LongCLIP — text-image contrastive learning with sequence length limitations
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (Pioneering definition of text-to-satellite/OSM cross-view geo-localization task)
- Technical Depth: ⭐⭐⭐⭐ (EPE is simple yet effective; ERM enhances interpretability; overall methodology is engineering-oriented)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Multi-city, multi-source, extensive ablations; lacks comparison with other text-based localization methods)
- Practical Value: ⭐⭐⭐⭐⭐ (Clear application scenarios including emergency localization and pedestrian navigation)
- Overall Recommendation: ⭐⭐⭐⭐ (A complete contribution introducing a new task and dataset with the potential to open a new research direction)