UniGeoCLIP: Unified Geospatial Contrastive Learning

Conference: CVPR 2026
arXiv: 2604.11668
Code: https://gastruc.github.io/unigeoclip
Area: Self-Supervised Learning
Keywords: geospatial representation learning, contrastive learning, multimodal, coordinate encoding, unified embedding space

TL;DR

UniGeoCLIP is the first framework to align five complementary geospatial modalities (aerial imagery, street-view imagery, digital surface models, text, and GPS coordinates) in a unified embedding space through purely contrastive learning, and it introduces a multi-scale coordinate encoder that expands the spatial representation capacity of coordinate embeddings.

Background & Motivation

Background: Geospatial representation learning encompasses three paradigms — embedding fields (coordinates → vectors), multimodal fusion (multi-sensor → single representation), and contrastive alignment (e.g., GeoCLIP/SatCLIP aligning coordinates with satellite imagery).

Limitations of Prior Work: (1) Embedding fields are static snapshots that cannot model dynamics; (2) fusion models compress all modalities into a single representation, precluding cross-modal retrieval and comparison; (3) existing contrastive methods align only two modalities (typically coordinates + satellite imagery), neglecting text, street-view, and terrain modalities.

Key Challenge: Different geospatial modalities provide complementary information (aerial imagery for layout, street-view for facades, terrain for elevation, text for semantic description), yet no framework exists to unify them in a shared space.

Core Idea: All-to-all contrastive learning — five modalities are contrasted against one another (without a central pivot), constructing a truly unified embedding space. A novel multi-scale coordinate encoder further overcomes the expressive bottleneck of raw coordinate embeddings.

Method

Overall Architecture

Five modality-specific encoders (SigLIP-2 image/text encoders, a DSM ViT encoder, and a multi-scale GPS encoder) → all-to-all contrastive loss aligning \(\binom{5}{2}=10\) modality pairs → a unified \(D\)-dimensional embedding space.
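
The following is a minimal PyTorch sketch of this layout, not the paper's implementation: the encoder backbones are passed in as opaque modules, and the linear projection heads and the default embedding dimension are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UniGeoCLIPSketch(nn.Module):
    """Five modality-specific encoders projected into one shared D-dim space.

    The backbones are placeholders here; per the paper, image/text come from
    SigLIP-2, while the DSM ViT and the GPS encoder are trained from scratch.
    """

    def __init__(self, encoders: dict[str, nn.Module], enc_dims: dict[str, int], d: int = 512):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)
        # One linear projection head per modality into the shared D-dim space
        # (the head design is an assumption).
        self.heads = nn.ModuleDict({m: nn.Linear(enc_dims[m], d) for m in encoders})

    def forward(self, batch: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        # Return L2-normalized D-dim embeddings, one tensor per modality,
        # so embeddings of any two modalities are directly comparable.
        return {
            m: F.normalize(self.heads[m](self.encoders[m](x)), dim=-1)
            for m, x in batch.items()
        }
```

With all five modalities present in `batch`, the returned embedding dict feeds directly into the all-to-all loss sketched under Loss & Training below.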

Key Designs

  1. All-to-All Contrastive Alignment:

    • Function: All modalities are treated as first-class citizens, requiring no central pivot.
    • Mechanism: A weighted sum of InfoNCE contrastive losses is computed over all modality pairs within each batch. Unlike ImageBind (which aligns modalities indirectly, using images as the pivot), direct pairwise contrastive alignment ensures that embeddings of any two modalities are directly comparable.
    • Design Motivation: Pivot-based methods suffer cascading degradation when the pivot modality is of poor quality; all-to-all alignment eliminates this dependency.
  2. Multi-Scale Coordinate Encoder (Scaled Lat-Lon Encoder):

    • Function: Encodes geographic coordinates at multiple frequencies to capture multi-scale spatial structure.
    • Mechanism: Latitude and longitude are first mapped to a plane via an equal-area projection, then encoded separately by multiple random Fourier feature matrices with different bandwidths \(\sigma\) (low \(\sigma\) for large-scale structure, high \(\sigma\) for fine-scale structure). Each frequency encoding is treated as a token; cross-scale interaction is achieved via self-attention, and the final \(D\)-dimensional embedding is obtained by average pooling (see the code sketch after this list).
    • Design Motivation: A single-\(\sigma\) Fourier feature captures either large-scale or small-scale structure but not both. The multi-scale pyramid design simultaneously covers spatial structures from continental to neighborhood level.
  3. DSM Encoder:

    • Function: Encodes digital surface models (terrain and building elevation information).
    • Mechanism: A ViT with register tokens trained from scratch; the CLS token serves as the modality embedding.
    • Design Motivation: DSMs provide geometric elevation information that is inaccessible to other visual modalities.
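
To make design 2 concrete, here is a minimal PyTorch sketch of the multi-scale coordinate encoder. The specific equal-area projection (Lambert cylindrical below), the bandwidth values, the number of Fourier frequencies, and the single self-attention layer are all assumptions; the paper fixes only the recipe of equal-area projection, per-\(\sigma\) random Fourier features, one token per scale, self-attention, and average pooling.

```python
import torch
import torch.nn as nn


class MultiScaleCoordEncoder(nn.Module):
    """Random Fourier features at several bandwidths, one token per scale,
    mixed with self-attention and average-pooled into a D-dim embedding."""

    def __init__(self, d: int = 512, sigmas=(0.1, 1.0, 10.0), n_freq: int = 128, n_heads: int = 8):
        super().__init__()
        self.sigmas = sigmas
        # One random frequency matrix per scale, rescaled by its bandwidth
        # sigma at encoding time (all hyperparameters here are assumptions).
        self.register_buffer("freqs", torch.randn(len(sigmas), 2, n_freq))
        self.proj = nn.Linear(2 * n_freq, d)  # cos/sin features -> token
        self.attn = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)

    @staticmethod
    def equal_area(latlon_deg: torch.Tensor) -> torch.Tensor:
        # Lambert cylindrical equal-area projection: x = lon (rad), y = sin(lat).
        lat = torch.deg2rad(latlon_deg[..., 0])
        lon = torch.deg2rad(latlon_deg[..., 1])
        return torch.stack([lon, torch.sin(lat)], dim=-1)

    def forward(self, latlon_deg: torch.Tensor) -> torch.Tensor:
        xy = self.equal_area(latlon_deg)                     # (B, 2)
        tokens = []
        for i, sigma in enumerate(self.sigmas):
            # Low sigma -> slowly varying (continental-scale) features;
            # high sigma -> rapidly varying (neighborhood-scale) features.
            z = xy @ (self.freqs[i] * sigma)                 # (B, n_freq)
            tokens.append(self.proj(torch.cat([z.cos(), z.sin()], dim=-1)))
        x = torch.stack(tokens, dim=1)                       # (B, n_scales, D)
        x = self.attn(x)                                     # cross-scale interaction
        return x.mean(dim=1)                                 # average pooling -> (B, D)
```

Treating each scale as a token lets self-attention learn, per location, how the bandwidths should interact, which is the cross-scale mechanism the paper contrasts with simple concatenation.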

Loss & Training

A weighted sum of InfoNCE contrastive losses over all 10 modality pairs. Image and text encoders are initialized from SigLIP-2; the DSM and GPS encoders are trained from scratch. Hard negative mining is employed during training.
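
In symbols, with normalized embeddings \(z_1,\dots,z_5\), the objective is \(\mathcal{L}=\sum_{i<j} w_{ij}\,\mathcal{L}_{\text{InfoNCE}}(z_i, z_j)\) over the 10 unordered pairs. Below is a minimal sketch assuming the symmetric (CLIP-style) InfoNCE form; the pair weights \(w_{ij}\) and the temperature value are assumptions, and the hard-negative-mining step is omitted.

```python
from itertools import combinations

import torch
import torch.nn.functional as F


def info_nce(za: torch.Tensor, zb: torch.Tensor, temp: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two L2-normalized embedding batches (B, D).
    Co-located samples (same batch row) are the positives; all other rows
    in the batch act as negatives."""
    logits = za @ zb.t() / temp                                  # (B, B) similarities
    labels = torch.arange(za.size(0), device=za.device)          # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def all_to_all_loss(embs: dict[str, torch.Tensor], weights: dict | None = None) -> torch.Tensor:
    """Weighted sum of InfoNCE losses over all C(5,2) = 10 modality pairs;
    no modality serves as a pivot."""
    total = torch.zeros((), device=next(iter(embs.values())).device)
    for a, b in combinations(sorted(embs), 2):                   # every unordered pair
        w = 1.0 if weights is None else weights[(a, b)]          # keys in sorted order
        total = total + w * info_nce(embs[a], embs[b])
    return total
```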

Key Experimental Results

Main Results

| Task | Metric | UniGeoCLIP | Baseline | Gain |
| --- | --- | --- | --- | --- |
| Land-use classification | Acc | Improved | GeoCLIP / SatCLIP | Consistently better |
| Cross-modal retrieval | Recall@K | Substantially better | Pairwise methods | New capability |
| Socioeconomic inference | — | Improved | Coordinate baseline | Significant |

Ablation Study

| Configuration | Classification Accuracy | Notes |
| --- | --- | --- |
| 5-modality all-to-all | Best | Full model |
| Pivot (image as pivot only) | Second | Indirect alignment loss |
| 2-modality (coordinates + aerial) | Degraded | Incomplete information |
| Single-scale coordinate encoding | Degraded | Limited spatial resolution |

Key Findings

  • Joint alignment of five modalities consistently outperforms naive combinations of pairwise alignments.
  • The gap between all-to-all and pivot-based alignment is most pronounced for weaker modalities (e.g., DSM).
  • The multi-scale coordinate encoder substantially outperforms standard Fourier features on geolocalization tasks.

Highlights & Insights

  • Truly unified embedding space: Any combination of modalities can be directly compared and retrieved, which is fundamentally beyond the capability of pure fusion models.
  • Multi-scale coordinate encoding: Cross-scale interaction via self-attention is more elegant than simple concatenation.

Limitations & Future Work

  • Requires co-located training data covering all five modalities simultaneously.
  • The temporal dimension is not modeled.
  • Future work may extend to temporal satellite imagery and dynamic monitoring scenarios.

Comparison with Prior Work

  • vs. GeoCLIP / SatCLIP: These methods align only coordinates with a single image modality; UniGeoCLIP aligns five modalities.
  • vs. ImageBind / UniBind: These rely on pivot-based indirect alignment; UniGeoCLIP adopts direct all-to-all alignment.

Rating

  • Novelty: ⭐⭐⭐⭐ First five-modality geospatial contrastive learning framework
  • Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated across multiple downstream tasks
  • Writing Quality: ⭐⭐⭐⭐ Clear and well-structured presentation
  • Value: ⭐⭐⭐⭐ Provides a general-purpose representational foundation for geospatial AI