WildSAT: Learning Satellite Image Representations from Wildlife Observations¶
- Conference: ICCV 2025
- arXiv: 2412.14428
- Code: https://github.com/cvl-umass/wildsat
- Area: Remote Sensing / Representation Learning
- Keywords: remote sensing representation learning, contrastive learning, wildlife observations, cross-modal, satellite imagery
TL;DR¶
This paper proposes WildSAT, which leverages millions of geotagged wildlife observations from citizen science platforms to align satellite images, species locations, and textual descriptions via contrastive learning, substantially improving remote sensing representation quality and enabling zero-shot text-based retrieval.
Background & Motivation¶
A core challenge in remote sensing representation learning is obtaining supervision signals. Existing approaches include:

- Self-supervised learning (SeCo, Prithvi): exploiting spatiotemporal invariances or masked autoencoders, but lacking semantic supervision
- Supervised learning (SatlasPretrain): large-scale multi-task labels, but with high annotation costs
- Cross-modal learning (GRAFT, TaxaBind, RemoteCLIP): aligning ground-level images or text, but primarily targeting anthropogenic features (roads, buildings)
The key insight of this work is that species distributions encode rich ecological and environmental information: mountain goats inhabit rugged terrain, and cactus wrens nest in desert cacti, so a species' habitat preferences directly reflect the characteristics of its local natural environment. Platforms such as iNaturalist make this information freely available at global scale, amounting to hundreds of millions of observations. Nevertheless, the potential of wildlife observations for improving remote sensing representations has remained largely unexplored.
Method¶
Overall Architecture¶
WildSAT adopts a multimodal contrastive learning framework, jointly training on three types of signals:

1. Satellite images: Sentinel-2 images of the same location at different times provide temporal augmentation
2. Species locations: latitude/longitude coordinates encoded into location vectors via the SINR model, incorporating environmental covariates (climate data)
3. Text descriptions: habitat and behavior descriptions from the Wikipedia pages of the corresponding species, encoded via GritLM
The image encoder \(f_\theta\) can be any architecture (ResNet50, ViT-B/16, etc.), with three separate linear projection heads producing embeddings for the image, text, and location modalities respectively.
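This layout can be sketched as a small PyTorch module; class and attribute names and the embedding dimension are our own illustrative choices, not the paper's:

```python
import torch
import torch.nn as nn

class WildSATHeads(nn.Module):
    """Sketch of the WildSAT projection layout: one image encoder f_theta
    feeding three separate linear heads, one per aligned modality.
    Names and dimensions are illustrative, not from the paper."""

    def __init__(self, backbone: nn.Module, feat_dim: int, embed_dim: int = 256):
        super().__init__()
        self.backbone = backbone                         # e.g. a ResNet50 trunk
        self.img_head = nn.Linear(feat_dim, embed_dim)   # image-image contrast
        self.txt_head = nn.Linear(feat_dim, embed_dim)   # image-text alignment
        self.loc_head = nn.Linear(feat_dim, embed_dim)   # image-location alignment

    def forward(self, x: torch.Tensor):
        f = self.backbone(x)  # (B, feat_dim) pooled features
        return self.img_head(f), self.txt_head(f), self.loc_head(f)

# tiny stand-in backbone so the sketch runs without a real ResNet50
model = WildSATHeads(nn.Identity(), feat_dim=16)
z_img, z_txt, z_loc = model(torch.randn(4, 16))
```

Keeping the heads separate lets each modality pull the shared features toward its own embedding space without forcing a single joint space.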
Key Designs¶
Three-way contrastive learning:

- \(\mathcal{L}_{img}\): satellite images of the same location at different times serve as positive pairs (with geometric augmentation)
- \(\mathcal{L}_{txt}\): satellite image embeddings are aligned with Wikipedia text embeddings
- \(\mathcal{L}_{loc}\): satellite image embeddings are aligned with SINR location embeddings
All losses are based on InfoNCE; the overall objective is their sum.
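A minimal numpy sketch of the summed objective, assuming a standard batch-wise InfoNCE with in-batch negatives (the temperature value and function names are our own assumptions):

```python
import numpy as np

def info_nce(za: np.ndarray, zb: np.ndarray, tau: float = 0.07) -> float:
    """Batch InfoNCE: za[i] and zb[i] form a positive pair; every other
    zb[j] in the batch serves as a negative. tau is an assumed temperature."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / tau                        # (B, B) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(za))
    return float(-log_probs[idx, idx].mean())       # cross-entropy on the diagonal

def wildsat_loss(z_img1, z_img2, z_txt, z_loc) -> float:
    """Overall objective: sum of the three contrastive terms."""
    return (info_nce(z_img1, z_img2)    # L_img: two timestamps, same location
            + info_nce(z_img1, z_txt)   # L_txt: image vs. Wikipedia text
            + info_nce(z_img1, z_loc))  # L_loc: image vs. SINR location
```

Aligned pairs drive the diagonal similarities up and the loss toward zero, while mismatched batches stay near \(\log B\).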
Parameter-efficient fine-tuning strategy:

- Out-of-domain pretrained models (e.g., ImageNet): ResNet50 uses Scale and Shift Fine-tuning (BatchNorm parameters only); ViT uses DoRA (attention parameters only)
- Randomly initialized or in-domain pretrained models: full fine-tuning
- This preserves existing domain knowledge while adapting to the new supervision signals
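The scale-and-shift variant can be sketched as freezing everything except the affine parameters of normalization layers; the function name and model are our own illustration, not the paper's code:

```python
import torch.nn as nn

def freeze_all_but_norm(model: nn.Module) -> int:
    """Scale-and-shift fine-tuning sketch: freeze every parameter, then
    re-enable only the affine weight/bias of normalization layers.
    Returns the number of trainable parameter tensors."""
    for p in model.parameters():
        p.requires_grad = False
    n_trainable = 0
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.LayerNorm)):
            for p in m.parameters():
                p.requires_grad = True
                n_trainable += 1
    return n_trainable

# toy stand-in for an ImageNet-pretrained backbone
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
n_trainable = freeze_all_but_norm(model)  # only the BatchNorm weight and bias remain trainable
```

Only a tiny fraction of the parameters receives gradients, which is what keeps the pretrained features intact.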
Data construction:

- The iNaturalist dataset provides 35.5 million observations across 47,375 species
- Corresponding Sentinel-2 satellite images (10 m/pixel, 512×512)
- Wikipedia text covers 127,484 paragraphs for 37,889 species
- Total of 980,376 training samples
Loss & Training¶
Each contrastive loss adopts standard InfoNCE:
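In generic form (our notation: \(z_i, z_i^{+}\) are a positive pair, \(\mathrm{sim}\) is cosine similarity, \(\tau\) the temperature, \(B\) the batch size):

```latex
\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{B} \sum_{i=1}^{B}
  \log \frac{\exp\!\big(\mathrm{sim}(z_i, z_i^{+}) / \tau\big)}
            {\sum_{j=1}^{B} \exp\!\big(\mathrm{sim}(z_i, z_j^{+}) / \tau\big)}
```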
During training, one text paragraph is randomly sampled per image–location pair.
Key Experimental Results¶
Main Results¶
Linear probing performance is evaluated on 7 downstream classification datasets and 2 segmentation datasets across 20 baseline models:
| Dataset | Base Avg. | +WildSAT Avg. | Gain |
|---|---|---|---|
| AID | 72.7 | 79.4 | +6.7 |
| EuroSAT | 88.9 | 94.3 | +5.4 |
| RESISC45 | 77.8 | 83.5 | +5.7 |
| So2Sat20k | 37.9 | 48.2 | +10.3 |
| UCM | 81.8 | 87.9 | +6.1 |
| BEN20k | 45.7 | 53.4 | +7.7 |
WildSAT achieves improvements in 108 out of 115 configurations, with an average gain of 4.3%–10.4%.
Comparison with CLIP-based methods (ViT-B/16):
| Method | Avg. Classification Performance |
|---|---|
| TaxaBind | 59.8% |
| GRAFT | 65.0% |
| RemoteCLIP | 71.0% |
| CLIP | 71.6% |
| WildSAT | 76.6% |
Ablation Study¶
Ablation of modality contributions across initializations (random vs. ImageNet) and backbones (ResNet50, ViT):
| loc | env | text | img-a | Random R50 | ImageNet R50 | Random ViT | ImageNet ViT |
|---|---|---|---|---|---|---|---|
|  |  |  |  | 24.3% | 93.2% | 25.2% | 84.4% |
| ✓ |  |  |  | 44.2% | 95.0% | 41.6% | — |
- Location signal alone brings a substantial +20% improvement to randomly initialized models
- The full four-modality combination yields the best performance
Segmentation results:
| Model | Cashew1k IoU (base → +WildSAT) | SAcrop3k IoU (base → +WildSAT) |
|---|---|---|
| Random | 40.1% → 72.6% | 18.0% → 20.3% |
| SatlasNet | 55.2% → 71.0% | 19.4% → 20.5% |
Key Findings¶
- Remote sensing pretrained models benefit the most: SeCo, SatlasNet, and similar models achieve gains of up to 10%, as WildSAT supplements habitat-relevant information
- ViT benefits more than CNN: the flexible attention mechanism of Transformers more readily adapts to multimodal fusion
- WildSAT reduces false positives for habitat-related categories: confusion matrix analysis on So2Sat20k shows improved true positive rates across all classes, primarily through reduced false positives for habitat categories such as "Scattered trees" and "Dense trees"
- Zero-shot text-based retrieval is supported (e.g., querying "desert" or "ibex" retrieves satellite images of corresponding landscapes)
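Such retrieval reduces to ranking image embeddings by cosine similarity against a text-query embedding; the sketch below uses random stand-in vectors rather than real WildSAT/GritLM outputs:

```python
import numpy as np

def retrieve(query_emb: np.ndarray, image_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Zero-shot retrieval sketch: return indices of the k satellite-image
    embeddings most cosine-similar to a text-query embedding
    (e.g. the encoding of "desert")."""
    q = query_emb / np.linalg.norm(query_emb)
    X = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = X @ q                     # cosine similarity per image
    return np.argsort(-scores)[:k]     # top-k indices, best first

# toy index of 10 "image" embeddings
rng = np.random.default_rng(1)
index = rng.standard_normal((10, 32))
top = retrieve(index[2].copy(), index, k=3)  # querying with image 2's own embedding
```

Because image and text embeddings share an aligned space after training, the same ranking works whether the query came from the text head or another image.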
Highlights & Insights¶
- A unique supervision signal: wildlife observation data constitute free, globally distributed, naturally generated ecological labels that are complementary to anthropogenic features
- General-purpose framework: WildSAT can serve as a continual pretraining stage to enhance existing models (SatlasNet, SeCo, and Prithvi all benefit)
- Zero-shot capability: text alignment enables semantic retrieval of geographic locations, a capability absent from prior remote sensing representation methods
- Complementarity with anthropogenic-feature methods (WikiSatNet): natural-environment information and built-structure information together yield a more comprehensive understanding of Earth's surface
Limitations & Future Work¶
- Species observation data exhibit geographic bias (high density in Europe and North America, sparse in Africa and Asia), potentially limiting global generalization
- Only RGB three-channel imagery is used; the multispectral advantages of Sentinel-2 are not fully exploited (preliminary multispectral experiments are included in the appendix)
- Wikipedia text quality is uneven; descriptions of some species may be inaccurate or missing
- Linear probing evaluation may underestimate the full representational capacity; results under full fine-tuning are not reported
Related Work & Insights¶
- SatlasPretrain: large-scale supervised remote sensing pretraining; WildSAT demonstrates that its representations can be further improved
- GRAFT: ground-level image–satellite image alignment, but primarily targeting anthropogenic features
- TaxaBind: the first multimodal method to use species location and satellite imagery, but focused on ecological tasks rather than remote sensing
- Takeaway: citizen science data (eBird, iNaturalist) are an underutilized source of supervision signals, with potential applicability to a broader range of Earth observation tasks
Rating¶
| Dimension | Score (1–5) |
|---|---|
| Novelty | 4.5 |
| Technical Depth | 3.5 |
| Experimental Thoroughness | 5 |
| Writing Quality | 4.5 |
| Value | 4.5 |
| Overall | 4.5 |