AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization
Conference: ECCV 2024
arXiv: 2407.08156
Code: GitHub
Area: Multimodal VLM
Keywords: Image Address Localization, CLIP, Contrastive Learning, Manifold Learning, Vision-Language Models
TL;DR
The AddressCLIP framework is proposed, which models the Image Address Localization (IAL) problem as an end-to-end vision-language alignment task through two core components: image-text alignment (contrastive learning of address and scene descriptions) and image-geography matching (manifold learning based on GPS distance). It achieves Top-1 sub-street accuracy (SSA-1) of 80.39% to 86.32% on three self-constructed IAL datasets, including 85.92% on the largest one (SF-IAL-Large).
Background & Motivation
Traditional image address localization relies on a two-stage pipeline: first predicting GPS coordinates via image geo-localization, and then converting them into readable addresses via reverse geocoding. This approach suffers from three core limitations:
- Lack of semantic meaning in GPS coordinates: Numerical coordinates are unreadable to humans and cannot be directly used for downstream tasks such as recommendation and navigation.
- Ambiguity in GPS-to-address conversion: Images at street intersections can be attributed to different streets, leading to non-unique conversion results.
- Non-end-to-end pipeline: Constructing a retrieval database incurs heavy storage and retrieval overhead, and errors propagate and accumulate between the two stages.
The authors introduce a new task definition: Image Address Localization (IAL), aiming to predict human-readable textual addresses directly from images (e.g., "Forbes Avenue & Craig Street, Oakland") without intermediate GPS coordinates or a retrieval database.
Practical application scenarios for this task include social media geotagging, news image location verification, and location recommendation on travel platforms.
Method
Overall Architecture
AddressCLIP is based on the pretrained CLIP model, formulating IAL as a vision-text alignment problem. The framework incorporates three core loss functions:
- Image-address contrastive loss \(\mathcal{L}_{address}\): Learns the alignment between images and address texts.
- Image-caption contrastive loss \(\mathcal{L}_{caption}\): Leverages scene descriptions to supplement the lack of details in address texts.
- Image-geography matching loss \(\mathcal{L}_{geography}\): Constrains the spatial distribution of image features using GPS coordinates.
During inference, the similarity between the query image embedding and all candidate address text embeddings is computed, and the most similar address is selected as the prediction.
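The inference step described above can be sketched as a nearest-neighbor lookup over precomputed address embeddings. This is a minimal illustration in plain Python, assuming cosine similarity as the similarity measure; the function names and the toy 3-dimensional embeddings are hypothetical stand-ins for the outputs of the fine-tuned encoders, and every address other than the Forbes Avenue example is made up for the demonstration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict_address(image_emb, address_embs, k=1):
    """Return the k candidate addresses most similar to the image embedding."""
    ranked = sorted(
        address_embs,
        key=lambda addr: cosine(image_emb, address_embs[addr]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy embeddings; real ones would come from the text encoder.
candidates = {
    "Forbes Ave & Craig St, Oakland": [0.9, 0.1, 0.0],
    "Fifth Ave & Bigelow Blvd, Oakland": [0.1, 0.9, 0.1],
    "Penn Ave & 40th St, Lawrenceville": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # hypothetical image embedding
top1 = predict_address(query, candidates, k=1)
top2 = predict_address(query, candidates, k=2)
```

Because the candidate address set is fixed, the address embeddings can be computed once and cached, so each query costs one image forward pass plus one similarity ranking.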
Semantic Address Partitioning Strategy
Original administrative addresses present two challenges: (1) inconsistent street lengths, where long streets result in overly coarse localization; (2) ambiguous addresses at intersections.
The authors design a Semantic Address Partition strategy:
- Identify all cross streets for each main street.
- Partition main streets into sub-segments at junctions.
- Remove overly close intersections to prevent excessively short sub-segments.
- Merge overly short sub-segments (<5 location points) to alleviate long-tail issues.
The final address representation is formatted as: "Main Street Name + Cross Street Name(s) (1-2) + Neighborhood Name", for example, "Forbes Ave & Craig St, Oakland".
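The partitioning steps can be sketched on a simplified one-dimensional model of a street. This is a minimal illustration, not the authors' implementation: it assumes each location point and each junction is given as a distance in metres along the main street, the `min_gap` threshold of 50 m is a hypothetical value (the notes give none), short segments are merged into the preceding neighbor by assumption, and the `" & "` separator for a second cross street is a guess extrapolated from the single example above.

```python
def partition_street(points, junctions, min_gap=50.0, min_points=5):
    """Split a main street into sub-segments at its junctions.

    points:     positions (metres along the street) of image locations.
    junctions:  positions of intersections with cross streets.
    min_gap:    hypothetical threshold; junctions closer than this to the
                previously kept junction are dropped.
    min_points: segments with fewer location points are merged.
    """
    # Step 1: drop junctions that are too close to the previous kept one.
    kept = []
    for pos in sorted(junctions):
        if not kept or pos - kept[-1] >= min_gap:
            kept.append(pos)

    # Step 2: cut the street at the kept junctions.
    bounds = [float("-inf")] + kept + [float("inf")]
    ordered = sorted(points)
    segments = [[p for p in ordered if lo <= p < hi]
                for lo, hi in zip(bounds, bounds[1:])]

    # Step 3: merge short segments into the preceding segment.
    merged = []
    for seg in segments:
        if merged and len(seg) < min_points:
            merged[-1].extend(seg)
        else:
            merged.append(list(seg))
    # A short first segment has no predecessor, so merge it forward.
    if len(merged) > 1 and len(merged[0]) < min_points:
        merged[1] = merged[0] + merged[1]
        merged.pop(0)
    return merged

def format_address(main_street, cross_streets, neighborhood):
    """Build 'Main & Cross[ & Cross], Neighborhood' (separator is assumed)."""
    return f"{main_street} & {' & '.join(cross_streets)}, {neighborhood}"

# Toy data: 13 location points and 3 junctions, two of them only 5 m apart.
points = [0, 10, 20, 30, 40, 60, 70, 80, 90, 100, 110, 205, 210]
junctions = [50, 55, 200]
segments = partition_street(points, junctions)
label = format_address("Forbes Ave", ["Craig St"], "Oakland")
```

In this toy run the junction at 55 m is discarded for being too close to the one at 50 m, and the two-point tail segment after 200 m is folded into its neighbor, leaving two sub-segments of 5 and 8 points.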
Image-Text Alignment
Problem: Address texts are often too concise to provide contextual cues such as surrounding environments or landmarks.
Solution: Introduce scene descriptions generated by BLIP as a complement to addresses. Scene descriptions are generated using the prompt "A street view of", capturing visual details such as buildings and street signs.
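The image encoder \(\mathcal{V}\) and text encoder \(\mathcal{T}\) produce image features \(V_i = \mathcal{V}(I_i)\), address features \(T_i^A = \mathcal{T}(A_i)\), and caption features \(T_i^C = \mathcal{T}(C_i + A_i)\), where \(C_i + A_i\) denotes the caption text with the address appended.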
Image-Address Contrastive Loss:
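The equation itself is not reproduced in these notes. Assuming the standard symmetric CLIP (InfoNCE) objective over a batch of \(N\) image-address pairs, it would plausibly read as follows; the batch size \(N\), the temperature \(\tau\), and the \(1/(2N)\) normalization are assumptions and may differ from the paper's exact notation:

\[
\mathcal{L}_{address} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(V_i \cdot T_i^A / \tau)}{\sum_{j=1}^{N} \exp(V_i \cdot T_j^A / \tau)} + \log \frac{\exp(T_i^A \cdot V_i / \tau)}{\sum_{j=1}^{N} \exp(T_i^A \cdot V_j / \tau)} \right]
\]

The first term ranks the correct address among all addresses in the batch for each image; the second ranks the correct image among all images for each address.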
The Image-Caption Contrastive Loss \(\mathcal{L}_{caption}\) takes an identical form, with \(T^A\) replaced by \(T^C\).
Key Design: Scene descriptions are only used during training; only the address text encoder is required for inference. Empirical results demonstrate that appending address information to the scene description (\(C_i + A_i\)) outperforms descriptions alone (\(C_i\)), as the address provides explicit geospatial anchoring.
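A CLIP-style contrastive loss over image and text features can be sketched as below. This is a dependency-free illustration under stated assumptions rather than the authors' code: the symmetric InfoNCE form, the temperature `tau=0.07` (the value CLIP uses to initialize its learnable temperature), and the `", "` separator in `caption_with_address` are all assumptions, and the 2-dimensional features are toy values.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce(anchors, targets, tau):
    """One-directional InfoNCE: anchors[i] should match targets[i]."""
    loss = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, t) / tau for t in targets]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(anchors)

def symmetric_contrastive_loss(image_feats, text_feats, tau=0.07):
    """Average of image-to-text and text-to-image InfoNCE (assumed form)."""
    v = [l2_normalize(f) for f in image_feats]
    t = [l2_normalize(f) for f in text_feats]
    return 0.5 * (info_nce(v, t, tau) + info_nce(t, v, tau))

def caption_with_address(caption, address):
    """Text for the caption branch, C_i + A_i (separator is an assumption)."""
    return f"{caption}, {address}"

# Toy batch of two pairs: aligned texts vs. the same texts in swapped order.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_contrastive_loss(images, aligned_texts)
loss_swapped = symmetric_contrastive_loss(images, swapped_texts)
text = caption_with_address("A street view of a brick building",
                            "Forbes Ave & Craig St, Oakland")
```

The same function would serve both \(\mathcal{L}_{address}\) and \(\mathcal{L}_{caption}\); only the text features passed in change, which is consistent with the caption branch being dropped at inference time.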
Image-Geography Matching
Motivation: In an urban setting, address texts may be geolocated far apart yet share highly similar wording (e.g., streets with the same name in different districts), or be geographically close but textually distinct. It is challenging for text alignment alone to capture such spatial relationships.
Core Idea: From the perspective of manifold learning, the image embedding space should align with the geographic coordinate space; images that are geographically close should also reside close to each other in the feature space.
Geographic Distance Matrix: Min-max normalization is applied to the UTM coordinates of all images in a batch, followed by calculating the Manhattan distance:
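The formula is missing from these notes, but the description determines its shape. Writing \((x_i, y_i)\) for the UTM coordinates of image \(i\) and \((\hat{x}_i, \hat{y}_i)\) for their min-max normalized values within the batch (these symbols, the name \(D^G\), and the per-axis normalization are assumptions), the Manhattan distance matrix would be:

\[
\hat{x}_i = \frac{x_i - \min_k x_k}{\max_k x_k - \min_k x_k}, \qquad
\hat{y}_i = \frac{y_i - \min_k y_k}{\max_k y_k - \min_k y_k}, \qquad
D^G_{ij} = |\hat{x}_i - \hat{x}_j| + |\hat{y}_i - \hat{y}_j|
\]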
Feature Similarity Matrix:
Image-Geography Matching Loss:
This loss aligns the feature similarity matrix with the geographic distance matrix, thereby preserving the topological structure of the geographic space within the feature space.
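The idea can be illustrated with a small sketch. Only the geographic side (min-max normalization followed by Manhattan distance) is taken from the description above; the feature-side measure and the loss form are not recoverable from these notes, so the sketch assumes a cosine distance \(1 - \cos(V_i, V_j)\) between image features and a mean squared difference between the two matrices. Both choices are hypothetical and may differ from the paper.

```python
import math

def min_max(values):
    """Scale a list of numbers to [0, 1] within the batch."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero if all values coincide
    return [(v - lo) / span for v in values]

def geo_distance_matrix(utm_xy):
    """Manhattan distance between min-max normalized UTM coordinates."""
    xs = min_max([x for x, _ in utm_xy])
    ys = min_max([y for _, y in utm_xy])
    n = len(utm_xy)
    return [[abs(xs[i] - xs[j]) + abs(ys[i] - ys[j]) for j in range(n)]
            for i in range(n)]

def feature_distance_matrix(feats):
    """ASSUMPTION: cosine distance (1 - cosine similarity) between features."""
    def cos(u, v):
        d = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return d / (nu * nv)
    return [[1.0 - cos(u, v) for v in feats] for u in feats]

def geography_matching_loss(feats, utm_xy):
    """ASSUMPTION: mean squared gap between feature and geographic distances."""
    d_geo = geo_distance_matrix(utm_xy)
    d_feat = feature_distance_matrix(feats)
    n = len(feats)
    total = sum((d_feat[i][j] - d_geo[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)

# Toy batch: the first two images are 10 m apart, the third is far away.
coords = [(0.0, 0.0), (10.0, 0.0), (1000.0, 1000.0)]
consistent = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]    # neighbors look alike
inconsistent = [[1.0, 0.0], [0.0, 1.0], [0.99, 0.1]]  # far image looks alike
d_geo = geo_distance_matrix(coords)
loss_consistent = geography_matching_loss(consistent, coords)
loss_inconsistent = geography_matching_loss(inconsistent, coords)
```

Under these assumptions, features whose pairwise distances mirror the geographic layout receive a lower loss than features that place a distant image next to a nearby one, which is the topology-preserving behavior the section describes.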
Loss & Training
Total Objective Function:
Where \(\alpha=1, \beta=0.2, \gamma=0.8\).
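The objective itself is missing from these notes. Given three losses and three weights, the natural reading is a weighted sum; the pairing of \(\alpha\), \(\beta\), \(\gamma\) with the losses in the order they were introduced is an assumption:

\[
\mathcal{L} = \alpha \, \mathcal{L}_{address} + \beta \, \mathcal{L}_{caption} + \gamma \, \mathcal{L}_{geography}
\]

Under that pairing, the address loss dominates, the geography term is weighted at 0.8, and the caption term acts as a lighter auxiliary signal at 0.2.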
Training Details:
- Based on OpenAI CLIP (ViT-B/16 by default)
- Directly fine-tunes the image and text encoders of CLIP without introducing additional parameters
- Adam optimizer (\(\beta_1=0.9, \beta_2=0.98\)) with a cosine learning rate scheduler decaying from 2.4e-5 to 2.4e-8
- Input image size is 224×224, with a batch size of 32 per GPU, trained on 8 V100 GPUs for 100 epochs
- BLIP-Caption-Large is used to generate scene descriptions (prompt: "A street view of", length of 10-30 words)
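The learning rate schedule and the effective batch size implied by these settings can be sketched as follows. This assumes a plain cosine decay without warmup and without gradient accumulation, neither of which the notes specify; the function and constant names are illustrative.

```python
import math

LR_MAX = 2.4e-5   # initial learning rate from the training details
LR_MIN = 2.4e-8   # final learning rate from the training details

def cosine_lr(step, total_steps, lr_max=LR_MAX, lr_min=LR_MIN):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = step / max(1, total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Effective batch size, assuming plain data parallelism across the GPUs.
PER_GPU_BATCH = 32
NUM_GPUS = 8
global_batch = PER_GPU_BATCH * NUM_GPUS

total = 1000  # hypothetical number of optimizer steps
schedule = [cosine_lr(s, total) for s in range(total + 1)]
```

With these numbers the global batch holds 256 image-text pairs per step, which is also the number of in-batch negatives available to the contrastive losses and the size of the pairwise matrices used by the geography term.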
Key Experimental Results
Datasets
The authors constructed three IAL datasets based on the Pitts-250K and SF-XL datasets:
| Dataset | Scale | Train/Val | Test | Image Size | Covered Area | Sub-streets (Train/Test) |
|---|---|---|---|---|---|---|
| Pitts-IAL | 6.7GB / 234K | 234K | 19K | 480×640 | 20 km² | 428/327 |
| SF-IAL-Base | 6.8GB / 184K | 184K | 21K | 512×512 | 6 km² | 400/369 |
| SF-IAL-Large | 121GB / 1.96M | 1.96M | 280K | 512×512 | 170 km² | 3616/3406 |
Main Results
Evaluation metrics include SSA (Sub-Street level Accuracy) and SA (Street level Accuracy), with Top-1 and Top-5 reported.
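The two metrics can be sketched as Top-k accuracy computed at two label granularities. This is a minimal illustration: how the street-level label is derived is not stated in these notes, so the sketch assumes the main street is the part of the address string before the first `" & "`, and all predictions and labels below are toy values.

```python
def top_k_accuracy(ranked_predictions, labels, k):
    """Percentage of samples whose label appears in the top-k predictions."""
    hits = sum(1 for ranked, label in zip(ranked_predictions, labels)
               if label in ranked[:k])
    return 100.0 * hits / len(labels)

def to_street(address):
    """ASSUMPTION: the main street is the text before the first ' & '."""
    return address.split(" & ")[0]

# Toy ranked predictions (best first) for four query images.
ranked = [
    ["Forbes Ave & Craig St, Oakland", "Forbes Ave & Bellefield Ave, Oakland"],
    ["Forbes Ave & Bellefield Ave, Oakland", "Fifth Ave & Craig St, Oakland"],
    ["Fifth Ave & Craig St, Oakland", "Forbes Ave & Craig St, Oakland"],
    ["Fifth Ave & Craig St, Oakland", "Forbes Ave & Craig St, Oakland"],
]
labels = [
    "Forbes Ave & Craig St, Oakland",
    "Forbes Ave & Craig St, Oakland",
    "Fifth Ave & Craig St, Oakland",
    "Forbes Ave & Bellefield Ave, Oakland",
]

# SSA-1: exact sub-street match on the top prediction.
ssa1 = top_k_accuracy(ranked, labels, k=1)

# SA-1: compare only the main street of the top prediction and the label.
street_ranked = [[to_street(a) for a in r] for r in ranked]
street_labels = [to_street(a) for a in labels]
sa1 = top_k_accuracy(street_ranked, street_labels, k=1)
```

The second query shows why SA is never lower than SSA for the same predictions: the model picks the wrong sub-segment of Forbes Ave, which counts as a miss for SSA-1 but as a hit for SA-1.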
| Method | Pitts-IAL SSA-1 | Pitts-IAL SA-1 | SF-IAL-Base SSA-1 | SF-IAL-Base SA-1 | SF-IAL-Large SSA-1 | SF-IAL-Large SA-1 |
|---|---|---|---|---|---|---|
| Zero-shot CLIP | 0.85 | 1.28 | 1.25 | 2.80 | 0.26 | 0.50 |
| CLIP + address | 77.66 | 80.86 | 83.66 | 85.76 | 81.84 | 84.56 |
| CLIP + CoOp | 67.91 | 71.19 | 77.77 | 79.90 | 74.84 | 78.23 |
| CLIP + CoCoOp | 69.04 | 73.28 | 79.19 | 81.15 | 76.92 | 79.85 |
| CLIP + MaPLe | 72.98 | 76.04 | 81.46 | 83.69 | 79.63 | 82.34 |
| AddressCLIP | 80.39 | 82.62 | 86.32 | 87.44 | 85.92 | 88.10 |
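AddressCLIP improves SSA-1 over the best prompt learning method (MaPLe) by 7.41, 4.86, and 6.29 percentage points on Pitts-IAL, SF-IAL-Base, and SF-IAL-Large, respectively. Over the stronger directly fine-tuned baseline (CLIP + address), the gains are 2.73, 2.66, and 4.08 points.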
Ablation Study
Ablation of key components (Pitts-IAL / SF-IAL-Base, SSA-1):
| \(\mathcal{L}_{address}\) | \(\mathcal{L}_{caption}\) | \(\mathcal{L}_{geography}\) | Pitts-IAL SSA-1 | SF-IAL-Base SSA-1 |
|---|---|---|---|---|
| ✔ | | | 77.66 | 83.66 |
| | ✔ | | 69.27 | 75.85 |
| ✔ | ✔ | | 79.20 | 84.86 |
| ✔ | | ✔ | 79.27 | 85.54 |
| ✔ | ✔ | ✔ | 80.39 | 86.32 |
- Utilizing \(\mathcal{L}_{caption}\) alone is significantly weaker than utilizing \(\mathcal{L}_{address}\) alone, indicating that direct supervision of address info is more critical.
- Adding \(\mathcal{L}_{caption}\) or \(\mathcal{L}_{geography}\) on top of \(\mathcal{L}_{address}\) yields a gain of roughly 1.2 to 1.9 points each, and combining both brings about +2.7 points, demonstrating complementarity.
Encoder freezing strategy: Unfreezing only the image encoder yields about a 30% gain, while unfreezing only the text encoder offers limited improvements. Unfreezing both simultaneously yields the best performance.
Comparison with retrieval-based methods (Pitts-IAL, ResNet50):
| Method | Storage | Inference & Retrieval Time | SSA-1 |
|---|---|---|---|
| SALAD | 2.34 GB | 2.53 ms | 75.17 |
| AnyLoc | - | - | 74.83 |
| AddressCLIP | 0.34 GB | 3.46 ms | 77.01 |
AddressCLIP does not require a retrieval database, requiring only 0.34GB of storage and 0.64MB of inference memory, while outperforming the stronger retrieval baseline (SALAD) by 1.84 points in SSA-1. The trade-off is a slightly higher per-query time (3.46 ms vs. 2.53 ms).
Key Findings
- SF-IAL-Base outperforms Pitts-IAL: San Francisco street patterns are more regular, and street-view image collection is denser.
- SA performance is higher than SSA: Sub-street level label learning can further boost the main street identification capability.
- Performance remains at 85.92% on SF-IAL-Large (170 km²): This demonstrates the scalability of the proposed method to large-scale urban scenarios.
- Multi-city joint training: Joint training on Pitts and SF leads to a performance drop of only \(<0.8\%\), showing potential for cross-city scaling.
- Geographic coverage analysis: Denser coverage is not strictly necessary; with only 12.5% location coverage, about 75% of the original performance is retained.
Highlights & Insights
- Novel task definition: The study formulates image address localization as an independent task (IAL) for the first time, shifting from "coordinate prediction" to "readable address prediction", which is more aligned with human usage.
- Semantic address partitioning strategy: Partitioning streets into sub-segments at intersections mitigates long-tail distributions while resolving intersection ambiguities. This design is simple yet effective.
- Manifold learning constraint: The image-geography matching loss constrains the feature space from a topology-preserving perspective, serving as an effective complement to contrastive learning.
- Distinct end-to-end advantages: Compared to the two-stage pipeline, AddressCLIP eliminates the cumulative errors and ambiguities of GPS-to-address translation while substantially reducing storage overhead.
Limitations & Future Work
- Discriminative model limitations: Inference is bound by the candidate address set, making it unable to predict unseen addresses from outside the training set.
- City-level constraints: The evaluation is currently restricted to city-wide levels. Scaling to larger geographical ranges (cross-city/cross-country) requires further research.
- Exploration of generative approaches: Preliminary experiments with instruction tuning using LLaVA (LLaVA-IAL) demonstrate the feasibility of generative models for the IAL task.
- Diminishing returns of scene descriptions: Performance variance between descriptions generated by BLIP-Base and BLIP-Large is minimal, suggesting limited room for improvement purely from captions.
Related Work & Insights
- StreetCLIP / GeoCLIP: Pioneering works that leverage CLIP for geolocation, though they still output coordinates rather than addresses.
- CoOp / CoCoOp / MaPLe: General-purpose prompt learning methods demonstrate limited performance on the IAL task, indicating that generic transfer is inferior to task-specific design.
- Insight: Manifold learning constraints can be generalized to other vision-language tasks that require preservation of spatial topology.
Rating
| Dimension | Score (1-10) | Description |
|---|---|---|
| Novelty | 8 | Combination of a new task definition, semantic address partitioning, and geographic manifold constraint |
| Technical Depth | 7 | Neat methodology, but individual components are relatively straightforward without complex architectural design |
| Experimental Thoroughness | 9 | Three datasets, multi-dimensional ablations, qualitative visualizations, and comparison with retrieval-based methods |
| Writing Quality | 8 | Well-structured, with clear and well-articulated task motivation |
| Value | 8 | End-to-end address prediction has direct applications in social media, navigation, and related scenarios |
| Overall Score | 8.0 | Novel task definition, effective methodology, and comprehensive experiments |