Resampling Augmentation for Time Series Contrastive Learning: Application to Remote Sensing
Conference: ICML2025
arXiv: 2506.18587
Code: GitHub - ts_ssl
Area: Remote Sensing
Keywords: Contrastive Learning, Time Series Augmentation, Remote Sensing Time Series, Sentinel-2, Crop Classification
TL;DR
The paper proposes a resampling augmentation for time series contrastive learning, which constructs positive pairs through "upsampling + disjoint subsequence extraction + realigning back to the original timeline." This approach outperforms common augmentation strategies on multiple SITS agricultural classification tasks and yields leading results on S2-Agri100.
Background & Motivation
Key Challenge
Sentinel-2 covers the globe every 5 days, generating massive amounts of multispectral time series data. However, high-quality annotations are scarce and expensive, leaving a vast amount of unlabeled data underutilized.
Why Self-Supervised Contrastive Learning?
In settings with "abundant unlabeled data and scarce labels," contrastive learning is a natural fit for pre-training representations that are then transferred to downstream tasks. The problems are:
- Mature augmentations in the image domain (cropping, rotation, color jittering) are not necessarily suitable for time series.
- If a time series augmentation destroys key temporal structures, the resulting positive pairs are distorted.
Limitations of Prior Work
The paper compares masked modeling and contrastive learning methods:
- Masked methods incur high computational overhead on spatiotemporal data.
- Token definitions heavily depend on datasets, causing unstable transferability.
- Prior studies also suggest that "masking + contrastive learning" usually outperforms masking alone.
Therefore, the authors focus on a more fundamental question: how to design a simple, stable, and generalizable time series augmentation that enables contrastive learning to genuinely exploit the temporal structure information of SITS.
Method
Core Idea
Given an input time series \(S=\{s_1,...,s_T\}\in\mathbb{R}^{T\times C}\), two views that originate from the same sequence but differ in their sampled points are constructed as a positive pair.
The method consists of three steps:
1. Upsampling to a denser temporal grid.
2. Extracting two non-overlapping subsequences that cover the global temporal span.
3. Interpolating back to the original timeline length and alignment.
Step 1: Upsampling
First, the original sequence is expanded from \(T\) to \(T_{up}\) points via linear interpolation (commonly \(T_{up}=2T\) in the paper).
This provides finer-grained sampling candidates without losing the overall trend.
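The upsampling step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `upsample_linear`, the `factor` parameter, and the assumption of evenly spaced timestamps indexed `0..T-1` are all introduced here for the example.

```python
import numpy as np


def upsample_linear(series: np.ndarray, factor: int = 2) -> tuple[np.ndarray, np.ndarray]:
    """Linearly interpolate a (T, C) series onto a denser grid of factor * T points.

    Assumption (not from the paper): timestamps are evenly spaced and
    indexed 0..T-1, and the dense grid keeps both endpoints.
    Returns the dense time grid and the upsampled values of shape (T_up, C).
    """
    T, C = series.shape
    t_orig = np.arange(T, dtype=float)
    t_up = np.linspace(0.0, T - 1, factor * T)
    # np.interp works on 1-D data, so each channel is interpolated separately.
    up = np.stack([np.interp(t_up, t_orig, series[:, c]) for c in range(C)], axis=1)
    return t_up, up
```

With `factor=2`, a 60-step, 12-channel sequence becomes a 120-step sequence whose first and last rows equal those of the input.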
Step 2: Disjoint Subsequence Sampling
Two subsequences are sampled from \(S_{up}\), satisfying two key constraints:
- No overlap in temporal points.
- At least a certain number of points are sampled in each quarter of the full temporal span (ensuring temporal coverage).
The coverage constraint is crucial. It rules out augmentations that "only jitter in localized time intervals," ensuring that both views of a positive pair retain global temporal semantics.
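One way to satisfy both constraints at once is to reserve points per quarter first and only then draw the remainder. The sketch below is an assumption about how this could be done, not the procedure from the paper: `sample_disjoint_views`, `n_points`, and `min_per_quarter` are hypothetical names, and the paper may use a different sampling scheme (for example rejection sampling).

```python
import numpy as np


def sample_disjoint_views(
    t_up_len: int, n_points: int, min_per_quarter: int, rng: np.random.Generator
) -> tuple[np.ndarray, np.ndarray]:
    """Return two sorted, non-overlapping index sets into the upsampled grid.

    Each set has n_points indices and at least min_per_quarter of them
    in every quarter of the grid. All parameter names are illustrative.
    """
    if 2 * n_points > t_up_len:
        raise ValueError("two disjoint views need 2 * n_points <= t_up_len")
    if n_points < 4 * min_per_quarter or 2 * min_per_quarter > t_up_len // 4:
        raise ValueError("coverage constraint cannot be satisfied")

    picks_a, picks_b, leftovers = [], [], []
    for quarter in np.array_split(np.arange(t_up_len), 4):
        shuffled = rng.permutation(quarter)
        # Reserve the guaranteed points of each view inside this quarter.
        picks_a.append(shuffled[:min_per_quarter])
        picks_b.append(shuffled[min_per_quarter:2 * min_per_quarter])
        leftovers.append(shuffled[2 * min_per_quarter:])

    # Fill both views up to n_points from the unreserved points, without reuse.
    rest = rng.permutation(np.concatenate(leftovers))
    extra = n_points - 4 * min_per_quarter
    idx_a = np.sort(np.concatenate(picks_a + [rest[:extra]]))
    idx_b = np.sort(np.concatenate(picks_b + [rest[extra:2 * extra]]))
    return idx_a, idx_b
```

Because the reserved points are drawn per quarter, neither view can collapse onto a single season of the year, which is the failure mode the coverage constraint is meant to prevent.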
Step 3: Realigning to the Original Timeline
The two sampled subsequences are mapped and interpolated back to the original timestamp set, ultimately yielding two view sequences of the same length and alignment as the original input.
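The realignment step amounts to interpolating each subsequence back onto the original timestamps. The sketch below is illustrative only: `realign` is a hypothetical helper, linear interpolation is assumed for this step as well, and the even/odd split of the dense grid is a deterministic stand-in for the random disjoint sampling. Note that `np.interp` holds the edge value constant outside the sampled range, which is one possible boundary treatment and not necessarily the paper's.

```python
import numpy as np


def realign(t_sub: np.ndarray, values_sub: np.ndarray, t_target: np.ndarray) -> np.ndarray:
    """Interpolate a (K, C) subsequence observed at sorted times t_sub onto t_target."""
    n_channels = values_sub.shape[1]
    return np.stack(
        [np.interp(t_target, t_sub, values_sub[:, c]) for c in range(n_channels)], axis=1
    )


# Toy demo: a linear ramp with T = 6 time steps and C = 2 channels.
T = 6
series = np.stack([np.arange(T) * 2.0, np.arange(T) * -1.0], axis=1)
t_orig = np.arange(T, dtype=float)

# Upsample to a dense grid of 2 * T points (assumed evenly spaced timestamps).
t_up = np.linspace(0.0, T - 1, 2 * T)
up = np.stack([np.interp(t_up, t_orig, series[:, c]) for c in range(2)], axis=1)

# Stand-in for random disjoint sampling: even vs. odd points of the dense grid.
idx_a, idx_b = np.arange(0, 2 * T, 2), np.arange(1, 2 * T, 2)

# Both views end up with the same shape and timestamps as the original input.
view_a = realign(t_up[idx_a], up[idx_a], t_orig)
view_b = realign(t_up[idx_b], up[idx_b], t_orig)
```

For a purely linear signal the two views coincide with the input away from the boundaries, which makes the ramp a convenient sanity check; on real, curved signals the views differ locally because each one interpolates through a different set of points.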
Differences from Common Augmentations
- Jittering only adds noise to the observed values.
- Masking directly blanks out local observations.
- Simple resizing only stretches or compresses the sequence globally.
Resampling, by contrast, produces controllable local differences while "preserving global temporal coverage," which makes it better suited to constructing contrastive positive pairs.
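For reference, the three baseline augmentations can be written in a few lines each. These are generic textbook-style versions under assumed parameter names (`sigma`, `ratio`, `new_len`); the exact variants and hyperparameters used in the paper's comparison are not specified in this summary.

```python
import numpy as np


def jitter(x: np.ndarray, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Add i.i.d. Gaussian noise to every value of a (T, C) series."""
    return x + rng.normal(0.0, sigma, size=x.shape)


def mask(x: np.ndarray, ratio: float, rng: np.random.Generator) -> np.ndarray:
    """Zero out a random fraction of time steps (all channels of each chosen step)."""
    out = x.copy()
    n_mask = int(round(ratio * x.shape[0]))
    idx = rng.choice(x.shape[0], size=n_mask, replace=False)
    out[idx] = 0.0
    return out


def resize(x: np.ndarray, new_len: int) -> np.ndarray:
    """Stretch or compress the whole series to new_len steps by linear interpolation."""
    T, C = x.shape
    t_old = np.linspace(0.0, 1.0, T)
    t_new = np.linspace(0.0, 1.0, new_len)
    return np.stack([np.interp(t_new, t_old, x[:, c]) for c in range(C)], axis=1)
```

None of these functions controls which parts of the year remain represented in a view, which is the property the resampling augmentation adds through its per-quarter coverage constraint.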
Network & Training Configuration (Paper Settings)
- Encoder: Time series version of ResNet (first layer with 256 filters, outputting a 512-d embedding).
- Projection Head: 2-layer MLP (512 hidden -> 128 output).
- Contrastive Framework Comparison: SimCLR, MoCo, BYOL, VICReg.
- Group Aggregation of Multi-Temporal Samples: Randomly selecting \(G=4\) sequences during training for shared encoding and subsequent aggregation.
This design emphasizes "lightweight + plug-and-play methods," focusing on the augmentation strategy rather than a complex backbone.
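To make the group aggregation and the contrastive objective concrete, here is a NumPy sketch of both. Several things are assumptions made for illustration: mean pooling as the aggregation operator (the summary only says the encodings are aggregated), the `encode` callable standing in for the ResNet encoder and projection head, the temperature of 0.1, and the choice of the SimCLR-style NT-Xent loss as the single example among the four frameworks compared.

```python
import numpy as np


def aggregate_group(sample: np.ndarray, encode, G: int, rng: np.random.Generator) -> np.ndarray:
    """Embed one sample given as (N, T, C): N sequences that belong together.

    G sequences are drawn at random, each is passed through the shared
    encoder, and the embeddings are mean-pooled (assumed aggregation).
    """
    idx = rng.choice(sample.shape[0], size=G, replace=False)
    embeddings = np.stack([encode(sample[i]) for i in idx])  # (G, D)
    return embeddings.mean(axis=0)


def nt_xent(z_a: np.ndarray, z_b: np.ndarray, temperature: float = 0.1) -> float:
    """SimCLR-style NT-Xent loss for B positive pairs; z_a and z_b are (B, D)."""
    B = z_a.shape[0]
    z = np.concatenate([z_a, z_b])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # a view is never its own negative
    # Row i's positive is the other view of the same sample.
    pos = np.concatenate([np.arange(B, 2 * B), np.arange(B)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_norm
    return float(-log_prob[np.arange(2 * B), pos].mean())
```

In a training loop, `z_a` and `z_b` would hold the projected embeddings of the two resampled views of each sample in the batch; the loss falls as the two views of a sample move closer together than views of different samples.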
Key Experimental Results
Pre-training and Downstream Data Scale (Compiled from the Paper)
| Dataset | Sequences per Sample | Time Steps | Channels | Sample Size | Classes |
|---|---|---|---|---|---|
| FranceCrops | 100 | 60 | 12 | Approx. 5.8M | 20 |
| FranceCrops CVdL | 100 | 60 | 12 | - | 20 |
| PASTIS | 100 | 60 | 10 | Approx. 85k | 18 |
| SITS-Former | 25 | 24 | 10 | Approx. 1.6M | - |
| S2-Agri100 | 25 | 24 | 10 | Approx. 120k | 15 |
Main Results
| Comparison | Conclusion |
|---|---|
| Comparison with Jittering/Resizing/Masking | Resampling augmentation performs better |
| S2-Agri100 Downstream Classification | Achieves leading performance without relying on spatial information or temporal positional encodings |
| Comparison with Complex Masked Reconstruction SSL Frameworks | Simple contrastive learning + resampling can still outperform more complex schemes |
| Impact of Pre-training Data Distribution | Unlabeled pre-training on the target domain outperforms cross-domain pre-training, and can even enable a simple classification head to achieve stronger performance |
Key Findings
- The augmentation strategy itself is the performance bottleneck; a well-designed temporal augmentation matters more than stacking complex models.
- Unlabeled data in the target domain is highly valuable, often being more "cost-effective" than a small number of new annotations.
- The gap between linear evaluation and full fine-tuning is small, indicating that the quality of learned representations is high.
- This method is effective even without relying on a spatial branch, indicating that the temporal structure information is already sufficiently strong.
Highlights & Insights
- The augmentation design is highly "data-structure friendly." Rather than blind perturbations, it explicitly constrains temporal coverage to minimize semantic corruption.
- The method is concise with low replication and transfer costs, making it highly suitable as a strong baseline for time series contrastive learning.
- It holds practical significance for remote sensing applications. In label-scarce scenarios, priority can be given to "unlabeled target domain data collection + contrastive pre-training."
- It offers a point of reflection on the Foundation Model pathway. Not all scenarios require ultra-large models; appropriate data augmentation and training paradigms are equally critical.
- Judging from the paper's results, complex masked modeling is not the only viable path. On certain tasks, lightweight contrastive frameworks can deliver higher cost-effectiveness.
Limitations & Future Work
- Current experiments mainly focus on agricultural classification scenarios; the application tasks could be extended to change detection, disaster monitoring, etc.
- Although the resampling strategy is simple, hyperparameters (upsampling ratio, subsequence length, coverage constraints) still require tuning.
- The paper primarily focuses on the pixel/time-series level and has not yet integrated stronger spatial context.
- For extremely irregular sampling or sequences with high missing-data rates, the stability of interpolation remains to be further validated.
- Systematic, large-scale validation is still needed in scenarios involving ultra-long time series and multi-satellite multimodal fusion.
Related Work & Insights
- Compared to remote sensing SSL frameworks such as SeCo, SSL4EO-S12, and SkySense, this study emphasizes "temporal augmentation quality" rather than merely scaling up model size.
- Complementary to masked modeling approaches like SatMAE, Prithvi, and Presto: this work demonstrates that contrastive learning remains highly competitive in certain setups.
- Insights for future work:
    - Resampling can serve as a foundational augmentation, upon which a small number of semantic-constrained augmentations can be overlaid.
    - Research can be conducted on "task-adaptive augmentation strategy selection" to automatically choose augmentations based on data statistical characteristics.
    - Unlabeled pre-training on the target domain can be combined with active learning to optimize annotation budget allocation.
Rating
- Novelty: ⭐⭐⭐⭐☆ (4.0/5)
- Experimental Thoroughness: ⭐⭐⭐⭐☆ (4.5/5)
- Writing Quality: ⭐⭐⭐⭐☆ (4.0/5)
- Value: ⭐⭐⭐⭐⭐ (5.0/5)
Overall Evaluation: This is a fine paper presenting a "straightforward yet highly practical" method. It targets the most crucial but often underappreciated link in time series contrastive learning (augmentation design) and provides convincing empirical gains in remote sensing scenarios.