WildSAT: Learning Satellite Image Representations from Wildlife Observations¶
- Conference: ICCV 2025
- arXiv: 2412.14428
- Code: https://github.com/cvl-umass/wildsat
- Area: Remote Sensing / Representation Learning
- Keywords: remote sensing representation learning, contrastive learning, wildlife observations, cross-modal, satellite imagery
TL;DR¶
This paper proposes WildSAT, which leverages millions of geotagged wildlife observations from citizen science platforms to align satellite images, species locations, and textual descriptions via contrastive learning, substantially improving remote sensing representation quality and enabling zero-shot text-based retrieval.
Background & Motivation¶
A core challenge in remote sensing representation learning is obtaining supervision signals. Existing approaches include:

- Self-supervised learning (SeCo, Prithvi): exploiting spatiotemporal invariances or masked autoencoders, but lacking semantic supervision
- Supervised learning (SatlasPretrain): large-scale multi-task labels, but with high annotation costs
- Cross-modal learning (GRAFT, TaxaBind, RemoteCLIP): aligning ground-level images or text, but primarily targeting anthropogenic features (roads, buildings)
The key insight of this work is that species distributions encode rich ecological and environmental information: mountain goats inhabit rugged terrain, and cactus wrens nest in desert cacti, so a species' habitat preferences directly reflect the characteristics of its local natural environment. Platforms such as iNaturalist make this information freely available at global scale, amounting to hundreds of millions of observations. Nevertheless, the potential of wildlife observations for improving remote sensing representations has remained largely unexplored.
Method¶
Overall Architecture¶
WildSAT adopts a multimodal contrastive learning framework, jointly training on three types of signals:

1. Satellite images: Sentinel-2 images of the same location at different times provide temporal augmentation
2. Species locations: latitude/longitude coordinates encoded into location vectors via the SINR model, incorporating environmental covariates (climate data)
3. Text descriptions: habitat and behavior descriptions from the Wikipedia pages of the corresponding species, encoded via GritLM
The image encoder \(f_\theta\) can be any architecture (ResNet50, ViT-B/16, etc.), with three separate linear projection heads producing embeddings for the image, text, and location modalities respectively.
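This layout can be sketched as a small PyTorch module; class and attribute names and the embedding dimension are our own illustrative choices, not the paper's:

```python
import torch
import torch.nn as nn

class WildSATHeads(nn.Module):
    """Sketch of the WildSAT projection layout: one image encoder f_theta
    feeding three separate linear heads, one per aligned modality.
    Names and dimensions are illustrative, not from the paper."""

    def __init__(self, backbone: nn.Module, feat_dim: int, embed_dim: int = 256):
        super().__init__()
        self.backbone = backbone                         # e.g. a ResNet50 trunk
        self.img_head = nn.Linear(feat_dim, embed_dim)   # image-image contrast
        self.txt_head = nn.Linear(feat_dim, embed_dim)   # image-text alignment
        self.loc_head = nn.Linear(feat_dim, embed_dim)   # image-location alignment

    def forward(self, x: torch.Tensor):
        f = self.backbone(x)  # (B, feat_dim) pooled features
        return self.img_head(f), self.txt_head(f), self.loc_head(f)

# tiny stand-in backbone so the sketch runs without a real ResNet50
model = WildSATHeads(nn.Identity(), feat_dim=16)
z_img, z_txt, z_loc = model(torch.randn(4, 16))
```

Keeping the heads separate lets each modality pull the shared features toward its own embedding space without forcing a single joint space.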
Key Designs¶
Three-way contrastive learning:

- \(\mathcal{L}_{img}\): satellite images of the same location at different times serve as positive pairs (with geometric augmentation)
- \(\mathcal{L}_{txt}\): satellite image embeddings are aligned with Wikipedia text embeddings
- \(\mathcal{L}_{loc}\): satellite image embeddings are aligned with SINR location embeddings
All losses are based on InfoNCE; the overall objective is their sum.
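A minimal numpy sketch of the summed objective, assuming a standard batch-wise InfoNCE with in-batch negatives (the temperature value and function names are our own assumptions):

```python
import numpy as np

def info_nce(za: np.ndarray, zb: np.ndarray, tau: float = 0.07) -> float:
    """Batch InfoNCE: za[i] and zb[i] form a positive pair; every other
    zb[j] in the batch serves as a negative. tau is an assumed temperature."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / tau                        # (B, B) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(za))
    return float(-log_probs[idx, idx].mean())       # cross-entropy on the diagonal

def wildsat_loss(z_img1, z_img2, z_txt, z_loc) -> float:
    """Overall objective: sum of the three contrastive terms."""
    return (info_nce(z_img1, z_img2)    # L_img: two timestamps, same location
            + info_nce(z_img1, z_txt)   # L_txt: image vs. Wikipedia text
            + info_nce(z_img1, z_loc))  # L_loc: image vs. SINR location
```

Aligned pairs drive the diagonal similarities up and the loss toward zero, while mismatched batches stay near \(\log B\).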
Parameter-efficient fine-tuning strategy:

- Out-of-domain pretrained models (e.g., ImageNet): ResNet50 uses Scale and Shift Fine-tuning (BatchNorm parameters only); ViT uses DoRA (attention parameters only)
- Randomly initialized or in-domain pretrained models: full fine-tuning
- This preserves existing domain knowledge while adapting to the new supervision signals
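The scale-and-shift variant can be sketched as freezing everything except the affine parameters of normalization layers; the function name and model are our own illustration, not the paper's code:

```python
import torch.nn as nn

def freeze_all_but_norm(model: nn.Module) -> int:
    """Scale-and-shift fine-tuning sketch: freeze every parameter, then
    re-enable only the affine weight/bias of normalization layers.
    Returns the number of trainable parameter tensors."""
    for p in model.parameters():
        p.requires_grad = False
    n_trainable = 0
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.LayerNorm)):
            for p in m.parameters():
                p.requires_grad = True
                n_trainable += 1
    return n_trainable

# toy stand-in for an ImageNet-pretrained backbone
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
n_trainable = freeze_all_but_norm(model)  # only the BatchNorm weight and bias remain trainable
```

Only a tiny fraction of the parameters receives gradients, which is what keeps the pretrained features intact.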
Data construction:

- The iNaturalist dataset provides 35.5 million observations across 47,375 species
- Corresponding Sentinel-2 satellite images (10 m/pixel, 512×512)
- Wikipedia text covers 127,484 paragraphs for 37,889 species
- Total of 980,376 training samples
Loss & Training¶
Each contrastive loss adopts standard InfoNCE:
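In generic form (our notation: \(z_i, z_i^{+}\) are a positive pair, \(\mathrm{sim}\) is cosine similarity, \(\tau\) the temperature, \(B\) the batch size):

```latex
\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{B} \sum_{i=1}^{B}
  \log \frac{\exp\!\big(\mathrm{sim}(z_i, z_i^{+}) / \tau\big)}
            {\sum_{j=1}^{B} \exp\!\big(\mathrm{sim}(z_i, z_j^{+}) / \tau\big)}
```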
During training, one text paragraph is randomly sampled per image–location pair.
Key Experimental Results¶
Main Results¶
Linear probing performance is evaluated on 7 downstream classification datasets and 2 segmentation datasets across 20 baseline models:
| Dataset | Base Avg. | +WildSAT Avg. | Gain |
|---|---|---|---|
| AID | 72.7 | 79.4 | +6.7 |
| EuroSAT | 88.9 | 94.3 | +5.4 |
| RESISC45 | 77.8 | 83.5 | +5.7 |
| So2Sat20k | 37.9 | 48.2 | +10.3 |
| UCM | 81.8 | 87.9 | +6.1 |
| BEN20k | 45.7 | 53.4 | +7.7 |
WildSAT achieves improvements in 108 out of 115 configurations, with an average gain of 4.3%–10.4%.
Comparison with CLIP-based methods (ViT-B/16):
| Method | Avg. Classification Performance |
|---|---|
| TaxaBind | 59.8% |
| GRAFT | 65.0% |
| RemoteCLIP | 71.0% |
| CLIP | 71.6% |
| WildSAT | 76.6% |
Ablation Study¶
Ablation of modality contributions across initializations (random vs. ImageNet) and backbones (ResNet50, ViT):
| loc | env | text | img-a | Random R50 | ImageNet R50 | Random ViT | ImageNet ViT |
|---|---|---|---|---|---|---|---|
|  |  |  |  | 24.3% | 93.2% | 25.2% | 84.4% |
| ✓ |  |  |  | 44.2% | 95.0% | 41.6% | — |
- Location signal alone brings a substantial +20% improvement to randomly initialized models
- The full four-modality combination yields the best performance
Segmentation results:
| Model | Cashew1k IoU (base → +WildSAT) | SAcrop3k IoU (base → +WildSAT) |
|---|---|---|
| Random | 40.1% → 72.6% | 18.0% → 20.3% |
| SatlasNet | 55.2% → 71.0% | 19.4% → 20.5% |
Key Findings¶
- Remote sensing pretrained models benefit the most: SeCo, SatlasNet, and similar models achieve gains of up to 10%, as WildSAT supplements habitat-relevant information
- ViT benefits more than CNN: the flexible attention mechanism of Transformers more readily adapts to multimodal fusion
- WildSAT reduces false positives for habitat-related categories: confusion matrix analysis on So2Sat20k shows improved true positive rates across all classes, primarily through reduced false positives for habitat categories such as "Scattered trees" and "Dense trees"
- Zero-shot text-based retrieval is supported (e.g., querying "desert" or "ibex" retrieves satellite images of corresponding landscapes)
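Such retrieval reduces to ranking image embeddings by cosine similarity against a text-query embedding; the sketch below uses random stand-in vectors rather than real WildSAT/GritLM outputs:

```python
import numpy as np

def retrieve(query_emb: np.ndarray, image_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Zero-shot retrieval sketch: return indices of the k satellite-image
    embeddings most cosine-similar to a text-query embedding
    (e.g. the encoding of "desert")."""
    q = query_emb / np.linalg.norm(query_emb)
    X = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = X @ q                     # cosine similarity per image
    return np.argsort(-scores)[:k]     # top-k indices, best first

# toy index of 10 "image" embeddings
rng = np.random.default_rng(1)
index = rng.standard_normal((10, 32))
top = retrieve(index[2].copy(), index, k=3)  # querying with image 2's own embedding
```

Because image and text embeddings share an aligned space after training, the same ranking works whether the query came from the text head or another image.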
Highlights & Insights¶
- A unique supervision signal: wildlife observation data constitute free, globally distributed, naturally generated ecological labels that are complementary to anthropogenic features
- General-purpose framework: WildSAT can serve as a continual pretraining stage to enhance existing models (SatlasNet, SeCo, and Prithvi all benefit)
- Zero-shot capability: text alignment enables semantic retrieval of geographic locations, a capability absent from prior remote sensing representation methods
- Complementarity with anthropogenic-feature methods (WikiSatNet): natural-environment information and built-structure information together yield a more comprehensive understanding of Earth's surface
Limitations & Future Work¶
- Species observation data exhibit geographic bias (high density in Europe and North America, sparse in Africa and Asia), potentially limiting global generalization
- Only RGB three-channel imagery is used; the multispectral advantages of Sentinel-2 are not fully exploited (preliminary multispectral experiments are included in the appendix)
- Wikipedia text quality is uneven; descriptions of some species may be inaccurate or missing
- Linear probing evaluation may underestimate the full representational capacity; results under full fine-tuning are not reported
Related Work & Insights¶
- SatlasPretrain: large-scale supervised remote sensing pretraining; WildSAT demonstrates that its representations can be further improved
- GRAFT: ground-level image–satellite image alignment, but primarily targeting anthropogenic features
- TaxaBind: the first multimodal method to use species location and satellite imagery, but focused on ecological tasks rather than remote sensing
- Takeaway: citizen science data (eBird, iNaturalist) are an underutilized source of supervision signals, with potential applicability to a broader range of Earth observation tasks
Rating¶
| Dimension | Score (1–5) |
|---|---|
| Novelty | 4.5 |
| Technical Depth | 3.5 |
| Experimental Thoroughness | 5 |
| Writing Quality | 4.5 |
| Value | 4.5 |
| Overall | 4.5 |