UniGeoCLIP: Unified Geospatial Contrastive Learning

Conference: CVPR 2026
arXiv: 2604.11668
Code: https://gastruc.github.io/unigeoclip
Area: Self-Supervised Learning
Keywords: geospatial representation learning, contrastive learning, multimodal, coordinate encoding, unified embedding space

TL;DR

UniGeoCLIP is the first framework to align five complementary geospatial modalities (aerial imagery, street-view imagery, digital surface models, text, and GPS coordinates) in a unified embedding space through purely contrastive learning, and it introduces a multi-scale coordinate encoder that expands the spatial representation capacity of coordinate embeddings.

Background & Motivation

Background: Geospatial representation learning encompasses three paradigms — embedding fields (coordinates → vectors), multimodal fusion (multi-sensor → single representation), and contrastive alignment (e.g., GeoCLIP/SatCLIP aligning coordinates with satellite imagery).

Limitations of Prior Work: (1) Embedding fields are static snapshots that cannot model dynamics; (2) fusion models compress all modalities into a single representation, precluding cross-modal retrieval and comparison; (3) existing contrastive methods align only two modalities (typically coordinates + satellite imagery), neglecting text, street-view, and terrain modalities.

Key Challenge: Different geospatial modalities provide complementary information (aerial imagery for layout, street-view for facades, terrain for elevation, text for semantic description), yet no framework exists to unify them in a shared space.

Core Idea: All-to-all contrastive learning — five modalities are contrasted against one another (without a central pivot), constructing a truly unified embedding space. A novel multi-scale coordinate encoder further overcomes the expressive bottleneck of raw coordinate embeddings.

Method

Overall Architecture

Five modality-specific encoders (SigLIP-2 image/text encoders, a DSM ViT encoder, and a multi-scale GPS encoder) → all-to-all contrastive loss aligning \(\binom{5}{2}=10\) modality pairs → a unified \(D\)-dimensional embedding space.
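
The following is a minimal PyTorch sketch of this layout, not the paper's implementation: the encoder backbones are passed in as opaque modules, and the linear projection heads and the default embedding dimension are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UniGeoCLIPSketch(nn.Module):
    """Five modality-specific encoders projected into one shared D-dim space.

    The backbones are placeholders here; per the paper, image/text come from
    SigLIP-2, while the DSM ViT and the GPS encoder are trained from scratch.
    """

    def __init__(self, encoders: dict[str, nn.Module], enc_dims: dict[str, int], d: int = 512):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)
        # One linear projection head per modality into the shared D-dim space
        # (the head design is an assumption).
        self.heads = nn.ModuleDict({m: nn.Linear(enc_dims[m], d) for m in encoders})

    def forward(self, batch: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # Return L2-normalized D-dim embeddings, one tensor per modality,
        # so embeddings of any two modalities are directly comparable.
        return {
            m: F.normalize(self.heads[m](self.encoders[m](x)), dim=-1)
            for m, x in batch.items()
        }
```

With all five modalities present in `batch`, the returned embedding dict feeds directly into the all-to-all loss sketched under Loss & Training below.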

Key Designs

  1. All-to-All Contrastive Alignment:

    • Function: All modalities are treated as first-class citizens, requiring no central pivot.
    • Mechanism: A weighted sum of InfoNCE contrastive losses is computed over all modality pairs within each batch. Unlike ImageBind (which aligns modalities indirectly, using images as the pivot), direct pairwise contrastive alignment ensures that embeddings of any two modalities are directly comparable.
    • Design Motivation: Pivot-based methods suffer cascading degradation when the pivot modality is of poor quality; all-to-all alignment eliminates this dependency.
  2. Multi-Scale Coordinate Encoder (Scaled Lat-Lon Encoder):

    • Function: Encodes geographic coordinates at multiple frequencies to capture multi-scale spatial structure.
    • Mechanism: Latitude and longitude are first mapped to a plane via an equal-area projection, then encoded separately by multiple random Fourier feature matrices with different bandwidths \(\sigma\) (low \(\sigma\) for large-scale structure, high \(\sigma\) for fine-scale structure). Each frequency encoding is treated as a token; cross-scale interaction is achieved via self-attention, and the final \(D\)-dimensional embedding is obtained by average pooling (see the code sketch after this list).
    • Design Motivation: A single-\(\sigma\) Fourier feature captures either large-scale or small-scale structure but not both. The multi-scale pyramid design simultaneously covers spatial structures from continental to neighborhood level.
  3. DSM Encoder:

    • Function: Encodes digital surface models (terrain and building elevation information).
    • Mechanism: A ViT with register tokens trained from scratch; the CLS token serves as the modality embedding.
    • Design Motivation: DSMs provide geometric elevation information that is inaccessible to other visual modalities.
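
To make design 2 concrete, here is a minimal PyTorch sketch of the multi-scale coordinate encoder. The specific equal-area projection (Lambert cylindrical below), the bandwidth values, the number of Fourier frequencies, and the single self-attention layer are all assumptions; the paper fixes only the recipe of equal-area projection, per-\(\sigma\) random Fourier features, one token per scale, self-attention, and average pooling.

```python
import torch
import torch.nn as nn


class MultiScaleCoordEncoder(nn.Module):
    """Random Fourier features at several bandwidths, one token per scale,
    mixed with self-attention and average-pooled into a D-dim embedding."""

    def __init__(self, d: int = 512, sigmas=(0.1, 1.0, 10.0), n_freq: int = 128, n_heads: int = 8):
        super().__init__()
        self.sigmas = sigmas
        # One random frequency matrix per scale, rescaled by its bandwidth
        # sigma at encoding time (all hyperparameters here are assumptions).
        self.register_buffer("freqs", torch.randn(len(sigmas), 2, n_freq))
        self.proj = nn.Linear(2 * n_freq, d)  # cos/sin features -> token
        self.attn = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)

    @staticmethod
    def equal_area(latlon_deg: torch.Tensor) -> torch.Tensor:
        # Lambert cylindrical equal-area projection: x = lon (rad), y = sin(lat).
        lat = torch.deg2rad(latlon_deg[..., 0])
        lon = torch.deg2rad(latlon_deg[..., 1])
        return torch.stack([lon, torch.sin(lat)], dim=-1)

    def forward(self, latlon_deg: torch.Tensor) -> torch.Tensor:
        xy = self.equal_area(latlon_deg)                     # (B, 2)
        tokens = []
        for i, sigma in enumerate(self.sigmas):
            # Low sigma -> slowly varying (continental-scale) features;
            # high sigma -> rapidly varying (neighborhood-scale) features.
            z = xy @ (self.freqs[i] * sigma)                 # (B, n_freq)
            tokens.append(self.proj(torch.cat([z.cos(), z.sin()], dim=-1)))
        x = torch.stack(tokens, dim=1)                       # (B, n_scales, D)
        x = self.attn(x)                                     # cross-scale interaction
        return x.mean(dim=1)                                 # average pooling -> (B, D)
```

Treating each scale as a token lets self-attention learn, per location, how the bandwidths should interact, which is the cross-scale mechanism the paper contrasts with simple concatenation.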

Loss & Training

A weighted sum of InfoNCE contrastive losses over all 10 modality pairs. Image and text encoders are initialized from SigLIP-2; the DSM and GPS encoders are trained from scratch. Hard negative mining is employed during training.
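
In symbols, with normalized embeddings \(z_1,\dots,z_5\), the objective is \(\mathcal{L}=\sum_{i<j} w_{ij}\,\mathcal{L}_{\text{InfoNCE}}(z_i, z_j)\) over the 10 unordered pairs. Below is a minimal sketch assuming the symmetric (CLIP-style) InfoNCE form; the pair weights \(w_{ij}\) and the temperature value are assumptions, and the hard-negative-mining step is omitted.

```python
from itertools import combinations

import torch
import torch.nn.functional as F


def info_nce(za: torch.Tensor, zb: torch.Tensor, temp: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two L2-normalized embedding batches (B, D).
    Co-located samples (same batch row) are the positives; all other rows
    in the batch act as negatives."""
    logits = za @ zb.t() / temp                                  # (B, B) similarities
    labels = torch.arange(za.size(0), device=za.device)          # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def all_to_all_loss(embs: dict[str, torch.Tensor], weights: dict | None = None) -> torch.Tensor:
    """Weighted sum of InfoNCE losses over all C(5,2) = 10 modality pairs;
    no modality serves as a pivot."""
    total = torch.zeros((), device=next(iter(embs.values())).device)
    for a, b in combinations(sorted(embs), 2):                   # every unordered pair
        w = 1.0 if weights is None else weights[(a, b)]          # keys in sorted order
        total = total + w * info_nce(embs[a], embs[b])
    return total
```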

Key Experimental Results

Main Results

| Task | Metric | UniGeoCLIP | Baseline | Gain |
| --- | --- | --- | --- | --- |
| Land-use classification | Acc | Improved | GeoCLIP / SatCLIP | Consistently better |
| Cross-modal retrieval | Recall@K | Substantially better | Pairwise methods | New capability |
| Socioeconomic inference | — | Improved | Coordinate baseline | Significant |

Ablation Study

| Configuration | Classification Accuracy | Notes |
| --- | --- | --- |
| 5-modality all-to-all | Best | Full model |
| Pivot (image as pivot only) | Second | Indirect alignment loss |
| 2-modality (coordinates + aerial) | Degraded | Incomplete information |
| Single-scale coordinate encoding | Degraded | Limited spatial resolution |

Key Findings

  • Joint alignment of five modalities consistently outperforms naive combinations of pairwise alignments.
  • The gap between all-to-all and pivot-based alignment is most pronounced for weaker modalities (e.g., DSM).
  • The multi-scale coordinate encoder substantially outperforms standard Fourier features on geolocalization tasks.

Highlights & Insights

  • Truly unified embedding space: Any combination of modalities can be directly compared and retrieved, which is fundamentally beyond the capability of pure fusion models.
  • Multi-scale coordinate encoding: Cross-scale interaction via self-attention is more elegant than simple concatenation.

Limitations & Future Work

  • Requires co-located training data covering all five modalities simultaneously.
  • The temporal dimension is not modeled.
  • Future work may extend to temporal satellite imagery and dynamic monitoring scenarios.

Comparison with Prior Work

  • vs. GeoCLIP / SatCLIP: These methods align only coordinates with a single image modality; UniGeoCLIP aligns five modalities.
  • vs. ImageBind / UniBind: These rely on pivot-based indirect alignment; UniGeoCLIP adopts direct all-to-all alignment.

Rating

  • Novelty: ⭐⭐⭐⭐ First five-modality geospatial contrastive learning framework
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated across multiple downstream tasks
  • Writing Quality: ⭐⭐⭐⭐ Clear and well-structured presentation
  • Value: ⭐⭐⭐⭐ Provides a general-purpose representational foundation for geospatial AI