Divide and Contrast: Learning Robust Temporal Features Without Augmentation

Conference: ICML 2026
arXiv: 2605.21241
Code: To be confirmed
Area: Time Series / Self-supervised Learning
Keywords: Time Series Representation Learning, Contrastive Learning, Self-supervised, Augmentation-free, Sub-block Partitioning

TL;DR

Di-COT learns robust time series representations without data augmentation by randomly partitioning each sequence into overlapping sub-blocks and contrasting those sub-blocks with one another. It is reported to be 2.5x faster than TF-C while achieving higher accuracy than existing methods, with validation on 6 large-scale datasets plus 124 UCR and 28 UEA datasets.

Background & Motivation

Background: Self-supervised representation learning for time series has become a significant research direction, with contrastive learning being widely applied. Existing methods such as TNC, TS-TCC, and TS2Vec utilize temporal proximity or data augmentation to construct positive and negative pairs.

Limitations of Prior Work:

  • Requirement for complex data augmentation (time warping, magnitude transformation) leads to representation distortion.
  • Use of Dynamic Time Warping or multiple encoder forward passes results in high computational overhead.
  • The recent method CaTT avoids augmentation but assumes that temporal adjacency is equivalent to semantic similarity, which fails on UCR/UEA datasets.

Key Challenge: On datasets with high temporal volatility (frequent event transitions), step-wise contrastive learning generates false positives at temporal transitions. Methods relying solely on temporal proximity cannot handle such scenarios. Meanwhile, the computational complexity of existing loss functions is quadratic, \(O(T^2)\), in the sequence length \(T\), so they scale poorly to long sequences.

Goal: To design a self-supervised time series learning framework that requires neither data augmentation nor multiple encoder passes, and whose loss computation does not scale with the sequence length.

Key Insight: Instead of performing contrastive learning on individual time steps, it is better to partition the sequence into sub-block units with semantic integrity. This avoids false positives at temporal transitions while retaining sufficient learning signals.

Core Idea: Replace step-wise or augmentation-based contrastive learning with contrastive learning on dynamic overlapping sub-blocks, and reformulate it as a multi-class classification task to achieve efficient, length-independent computation.

Method

Overall Architecture

Three stages: (1) Sub-block Partitioning: The input sequence \(\mathbf{x}^{(i)} \in \mathbb{R}^{T \times D}\) is randomly partitioned into \(k\) overlapping sub-blocks, where \(k\) is uniformly sampled from \(\{k_{\min}, \ldots, k_{\max}\}\); (2) Encoding and Similarity Computation: Each sub-block is encoded and pooled to obtain \(k\) embeddings, followed by calculating a temperature-scaled similarity matrix \(\mathbf{S}^{(i)} \in \mathbb{R}^{k \times k}\); (3) Cross-entropy Contrastive Objective: Predicting adjacent sub-blocks is reformulated as multi-class classification, with each sub-block acting as an anchor to generate dense supervision.
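
Stage (1) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it uses the length and stride formulas stated under Key Designs, \(L = T / (1 + (k-1)(1-\rho))\) and \(s = \lfloor L(1-\rho) \rfloor\), and the function names `partition` and `sample_partition` are hypothetical. Flooring \(L\) to an integer, and dropping any tail left uncovered by that flooring, are assumptions of the sketch.

```python
import math
import random


def partition(seq, k, rho):
    """Split `seq` (a list of time steps) into k overlapping sub-blocks.

    Uses L = T / (1 + (k-1)(1-rho)) and s = floor(L * (1-rho)).
    Assumption of this sketch: L is floored to an integer, so a short tail
    of the sequence may be left uncovered.
    """
    T = len(seq)
    L = math.floor(T / (1 + (k - 1) * (1 - rho)))
    s = math.floor(L * (1 - rho))
    if L < 1 or s < 1:
        raise ValueError("sequence too short for this k and rho")
    # Block j starts at j*s; consecutive blocks share roughly rho*L steps.
    blocks = [seq[j * s : j * s + L] for j in range(k)]
    return blocks, L, s


def sample_partition(seq, k_min, k_max, rho, rng=random):
    """Draw k uniformly from {k_min, ..., k_max}, then partition."""
    k = rng.randint(k_min, k_max)  # randint is inclusive on both ends
    return partition(seq, k, rho)


if __name__ == "__main__":
    blocks, L, s = partition(list(range(100)), k=4, rho=0.5)
    print(L, s, [(b[0], b[-1]) for b in blocks])
```

With \(T = 100\), \(k = 4\) and \(\rho = 0.5\), the formulas give \(L = 40\) and \(s = 20\), so the four blocks start at steps 0, 20, 40 and 60 and each overlaps its neighbour by half its length.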

Key Designs

  1. Dynamic Overlapping Sub-block Partitioning:

    • Function: Decomposes long sequences into short sub-sequences with semantic integrity for efficient contrastive learning.
    • Mechanism: In each iteration, the number of sub-blocks \(k \ll T\) is sampled from \(\mathcal{U}\{k_{\min}, \ldots, k_{\max}\}\). The sub-block length is \(L = \frac{T}{1 + (k-1)(1-\rho)}\) and the stride is \(s = \lfloor L(1-\rho) \rfloor\), where \(\rho \in (0, 1)\) is the overlap ratio. Encoding all sub-blocks yields the embeddings \(\mathbf{z}^{(i)} = \{z_1^{(i)}, \ldots, z_k^{(i)}\}\), where \(z_j^{(i)} = f_\theta(\tilde{x}_j^{(i)}) \in \mathbb{R}^F\).
    • Design Motivation: The overlapping design ensures sufficient context sharing between adjacent sub-blocks to avoid artificial boundaries. Randomly sampling \(k\) per iteration exposes the model to various temporal granularities, implicitly learning multi-scale robustness. By reducing the granularity of the contrastive objective (from \(T\) to \(k \ll T\)), false positives at temporal transitions are naturally avoided.
  2. Cross-entropy Contrastive Objective:

    • Function: Reformulates traditional InfoNCE contrastive learning as a multi-class classification problem to achieve efficient, length-independent computation.
    • Mechanism: For each pair of sub-blocks \((j, p)\), the temperature-scaled similarity is computed as \(S_{j, p}^{(i)} = \frac{z_j^{(i)\top} z_p^{(i)}}{\tau}\). The previous sub-block is defined as the positive label, \(p^*(j) = j - 1\) (for \(j > 1\), with sub-blocks indexed \(1, \ldots, k\)), and the remaining sub-blocks act as negatives. The loss is \(\mathcal{L}_{\text{CE}} = -\frac{1}{B k} \sum_i \sum_j \log \frac{\exp(S_{j, p^*(j)}^{(i)})}{\sum_p \exp(S_{j, p}^{(i)})}\).
    • Design Motivation: Compared to the \(O(B T^2 d)\) complexity of CaTT, Di-COT achieves \(O(B k^2 d)\), where \(k \ll T\) and \(d\) is the embedding dimension (written \(F\) in the partitioning step). It also provides dense supervision: every sub-block acts as an anchor, generating \(B \times k\) positive pairs, whereas augmentation-based methods yield only \(2B\) pairs.
  3. Augmentation-free Strategy:

    • Function: Eliminates data augmentation, masking, and multiple encoder passes to prevent representation distortion and reduce computational overhead.
    • Mechanism: Contrastive learning is performed entirely within the original sequence space without generating any augmented views. Different parts (sub-blocks) of the sequence are compared instead of augmented versions, and all sub-blocks share the same semantic context.
    • Design Motivation: Augmentations for time series (jittering, permutation) may destroy valid temporal structures. Through sub-block overlap and multi-granularity sampling, the model naturally learns invariance to small displacements. This also avoids multiple encoder passes.
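
The cross-entropy objective from Key Design 2 can be sketched for a single sequence as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the function name `subblock_ce_loss` is hypothetical, embeddings are plain Python lists rather than encoder outputs, the first sub-block is skipped because it has no predecessor, and the `exclude_self` flag (whether an anchor is removed from its own candidate set) is an assumption, since the summary does not say how self-similarity is handled.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def subblock_ce_loss(z, tau=0.1, exclude_self=True):
    """Cross-entropy loss over the k sub-block embeddings of one sequence.

    Each anchor j (0-indexed, j >= 1) must classify its previous block j-1
    as the positive among the other sub-blocks.
    Assumptions of this sketch: block 0 is skipped (no predecessor), and the
    anchor is dropped from its own candidates when exclude_self is True.
    """
    k = len(z)
    if k < 2:
        raise ValueError("need at least two sub-blocks")
    losses = []
    for j in range(1, k):
        # Temperature-scaled similarities S[j, p] = <z_j, z_p> / tau.
        logits = {
            p: dot(z[j], z[p]) / tau
            for p in range(k)
            if not (exclude_self and p == j)
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Negative log-probability assigned to the positive class j-1.
        losses.append(log_denom - logits[j - 1])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    z_toy = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-0.6, 0.8]]
    print(subblock_ce_loss(z_toy, tau=0.5))
```

A batch loss would average this quantity over the \(B\) sequences. Because the similarity matrix is only \(k \times k\), the cost of the loop depends on the number of sub-blocks rather than on the sequence length \(T\).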

Key Experimental Results

Main Results (Linear Evaluation on 6 Large-scale Datasets)

| Dataset | Ours | CaTT | TS2Vec | TF-C | Gain vs CaTT |
|---|---|---|---|---|---|
| ECG | 85.28 | 80.89 | 71.83 | 74.67 | +4.39% |
| HARTH | 93.23 | 93.13 | 90.27 | 92.24 | +0.10% |
| PAMAP2 | 71.38 | 69.86 | 70.37 | 71.30 | +1.52% |
| SKODA | 99.41 | 94.87 | 98.96 | 98.23 | +4.54% |
| SLEEP | 85.21 | 85.17 | 84.81 | 85.18 | +0.04% |
| WISDM2 | 63.92 | 63.25 | 62.39 | 62.54 | +0.67% |
| Avg. Acc. | 83.07 | 81.20 | 79.77 | 80.69 | +1.87% |
| Training Time (h) | 2.88 | 3.47 | 3.28 | 6.52 | -17% |

Low-label Regime (1% Labeled Data)

| Dataset | Ours | TF-C | TNC | Supervised Baseline |
|---|---|---|---|---|
| ECG | 73.33 | 74.50 | 61.06 | 54.28 |
| HARTH | 87.23 | 78.00 | 83.04 | 75.37 |
| SKODA | 98.01 | 93.50 | 96.11 | 92.77 |
| Avg. Acc. | 76.36 | 73.55 | 72.73 | 70.39 |

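In the low-label setting, Di-COT improves average accuracy by 5.97 points over the supervised baseline (76.36 vs. 70.39) and is reported to be 2.5x faster than TF-C. The Avg. Acc. row does not equal the mean of the three datasets listed, so it presumably averages over datasets not shown here.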

Ablation Study

| Configuration | Large Datasets | UCR | UEA | Description |
|---|---|---|---|---|
| Full Model | 83.07 | 81.33 | 71.24 | Standard Di-COT |
| Remove Overlap (\(\rho=0\)) | 81.22 | 81.12 | 70.13 | -2.23% (large datasets) / -1.56% (UEA), relative to the full model |
| Remove Temperature | 82.47 | 81.33 | 70.79 | -0.72% relative on large datasets (small impact) |
| Fixed Global Partition | 82.80 | 81.32 | 69.69 | Random sampling is better |
| Contrast Shuffled Blocks | 81.80 | 81.19 | 70.09 | Temporal proximity is important |
| Non-linear Projection | 81.85 | 79.88 | 69.73 | Unlike in CV, a projection head does not help |
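
The percentage figures attached to the ablation rows appear to be relative changes with respect to the full model, not absolute differences in accuracy points. The short check below recomputes them from the accuracies in the table; the dictionary keys and the helper name `relative_drop` are illustrative choices, and only three ablation rows are included.

```python
# Accuracies copied from the ablation table (full model and three ablations).
full = {"large": 83.07, "UCR": 81.33, "UEA": 71.24}
ablations = {
    "no_overlap": {"large": 81.22, "UCR": 81.12, "UEA": 70.13},
    "no_temperature": {"large": 82.47, "UCR": 81.33, "UEA": 70.79},
    "shuffled_blocks": {"large": 81.80, "UCR": 81.19, "UEA": 70.09},
}


def relative_drop(full_acc, ablated_acc):
    """Percentage change of the ablated accuracy relative to the full model."""
    return 100.0 * (ablated_acc - full_acc) / full_acc


if __name__ == "__main__":
    for name, scores in ablations.items():
        drops = {b: round(relative_drop(full[b], a), 2) for b, a in scores.items()}
        print(name, drops)
```

On the large-scale datasets this gives about -2.23% without overlap, -0.72% without temperature, and -1.53% with shuffled blocks, matching the figures quoted in the table and in the key findings. The corresponding absolute gaps are smaller, for example 1.85 accuracy points for removing overlap.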

Key Findings

  • Sub-block overlap is most critical: Removing it causes the largest drop, especially on the large-scale datasets (-2.23% relative to the full model).
  • Temporal adjacency is key: Using temporally adjacent sub-blocks as positive pairs is clearly better than contrasting shuffled blocks (-1.53% relative on the large-scale datasets).
  • Backbone selection: InceptionTime outperforms ResNet (-3.56%) and FCN (-2.58%).
  • No need for non-linear projection: Unlike in SimCLR, adding a non-linear projection head decreases performance.

Highlights & Insights

  • Clever Granularity Scaling: Reducing the contrastive granularity from time steps (\(T\)) to sub-blocks (\(k \ll T\)) naturally avoids false positives at temporal transitions while retaining sufficient learning signal, which makes the method more robust than CaTT's step-wise adjacency assumption.
  • Length-independent Computation: Reducing loss complexity from \(O(B T^2 d)\) to \(O(B k^2 d)\) enables the processing of long sequences; the cross-entropy reformulation is more efficient than traditional InfoNCE.
  • Benefits of Augmentation-free: Completely discarding data augmentation prevents representation distortion and reduces computational overhead—suggesting that not all CV tricks are suitable for sequences.
  • Multi-granularity Robustness: Randomly sampling \(k\) per iteration allows the model to naturally learn multi-scale temporal features, implying multi-resolution learning without explicit design.
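
To make the length-independence claim concrete, the snippet below counts the pairwise similarity scores implied by the \(O(B T^2 d)\) and \(O(B k^2 d)\) terms, ignoring the shared factor \(d\). The values of `B`, `T` and `k` are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
def similarity_entries(batch, units):
    """Number of pairwise similarity scores per batch: batch * units^2."""
    return batch * units * units


# Hypothetical sizes, for illustration only.
B, T, k = 8, 3000, 10

stepwise = similarity_entries(B, T)  # one score per pair of time steps
subblock = similarity_entries(B, k)  # one score per pair of sub-blocks

if __name__ == "__main__":
    print(stepwise, subblock, stepwise // subblock)
```

Under these assumed sizes the step-wise formulation needs 72,000,000 scores per batch against 800 for the sub-block formulation, and the second number stays fixed if \(T\) grows while \(k\) is held constant.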

Limitations & Future Work

  • Di-COT is based on contrastive learning and learns discriminative representations, making it less suitable for time series forecasting tasks.
  • The method relies heavily on the "temporal proximity = semantic similarity" assumption; it may still fail on data with high-frequency state jumps.
  • The number of sub-blocks \(k\) and overlap ratio \(\rho\) require dataset-dependent tuning.
  • Improvements: Adaptive sub-block partitioning strategies; hybrid contrastive strategies; extending to non-sequential tasks to verify generalizability.

Related Work & Insights

  • vs CaTT (Shamba et al. 2025): CaTT also avoids augmentation and multiple encoder passes but contrasts all time steps; Di-COT avoids false positives via sub-blocks, has a loss complexity independent of sequence length, and achieves better performance.
  • vs TS2Vec (Yue et al. 2022): TS2Vec contrasts across views at the same timestamp, which requires two augmented views; Di-COT is more efficient and avoids augmentation bias.
  • vs Augmentation-based methods (TS-TCC, TF-C): Traditional methods rely on complex augmentations; Di-COT shows that a simple augmentation-free strategy with a sensible choice of granularity can outperform them on time series.
  • Insight: Caution should be exercised when applying successful CV methods to other domains; sometimes "less" (no augmentation) is better than "more" (complex augmentation).

Rating

  • Novelty: ⭐⭐⭐⭐ Resolves the core trade-off (efficiency vs. accuracy) in time series contrastive learning by improving granularity selection and loss formulation; the idea is direct yet effective.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 6 large-scale + 124 UCR + 28 UEA datasets across 5 downstream tasks with sufficient ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with sufficient differentiation from prior work; some paragraphs could be more concise.
  • Value: ⭐⭐⭐⭐⭐ High practical deployment value, being both fast and accurate, and directly applicable to various time series tasks; code availability is still listed as to be confirmed.