Learning Time-Scale Invariant Population-Level Neural Representations¶
Conference: NeurIPS 2025 arXiv: 2511.13022 Code: None Area: Time Series / Neural Signal Foundation Models Keywords: neural time series, foundation models, time-scale invariance, population-level representations, brain-computer interface
TL;DR¶
This paper proposes Time-Scale Augmented Pretraining (TSAP), a strategy that introduces data augmentation over multiple temporal window lengths during pretraining, enabling population-level neural signal foundation models to achieve invariance to input time scales and substantially improving decoding performance at both matched and unseen time scales.
Background & Motivation¶
Background: Building general-purpose representations of neural time series is a fundamental goal in neuroscience and brain-computer interface (BCI) research. High-fidelity neural recordings such as intracranial EEG (iEEG) capture complex activity patterns across multiple brain regions, yet modeling them remains highly challenging due to inter-subject and inter-session variability and limited dataset scale.
Limitations of Prior Work: Recent population-level pretraining methods (e.g., Population Transformer, PopT) learn spatially aggregated representations on top of frozen temporal encoders and achieve strong downstream decoding performance; however, these models are highly sensitive to preprocessing parameters—particularly time scale. Performance degrades substantially when the temporal window lengths used during pretraining differ from those used in downstream tasks.
Key Challenge: Neural recordings vary widely in duration across datasets and tasks (1 to 5 seconds in the settings studied here), yet existing models are pretrained on fixed-length temporal windows and do not generalize to inputs of other lengths.
Goal: To quantify the performance degradation caused by time-scale mismatch and to propose a strategy that lets a single model perform well across a range of input time scales.
Key Insight: A data augmentation perspective is adopted, exposing the model to data across multiple temporal window lengths during pretraining.
Core Idea: By mixing iEEG segments of multiple time scales during pretraining (TSAP), PopT is trained to learn time-scale invariant population-level representations.
Method¶
Overall Architecture¶
The method builds upon the Population Transformer (PopT) framework: for a given temporal interval of each electrode channel, a frozen temporal encoder (BrainBERT) first produces temporal embeddings; positional embeddings derived from 3D electrode coordinates are then added; finally, a Transformer encoder yields spatially contextualized channel representations and an aggregated [CLS] output for downstream decoding.
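As a rough illustration of this data flow, the pure-Python sketch below stubs out both the frozen temporal encoder and the spatial Transformer (all names, the embedding size, and the mean-pooling aggregation are illustrative stand-ins, not the paper's implementation):

```python
# Sketch of the PopT-style forward pass. BrainBERT and the spatial
# Transformer are replaced by trivial stubs purely to show the data flow.
import random

EMB_DIM = 8  # illustrative embedding size, not the paper's


def frozen_temporal_encoder(channel_signal):
    """Stand-in for frozen BrainBERT: maps one channel's signal to an embedding."""
    random.seed(len(channel_signal))  # deterministic stub
    return [random.random() for _ in range(EMB_DIM)]


def positional_embedding(xyz):
    """Stand-in for an embedding derived from 3D electrode coordinates."""
    x, y, z = xyz
    return [(x + y + z) / 3.0] * EMB_DIM


def popt_forward(channel_signals, channel_coords):
    # 1) Encode each channel independently with the frozen temporal encoder.
    temporal = [frozen_temporal_encoder(s) for s in channel_signals]
    # 2) Add positional embeddings derived from electrode coordinates.
    tokens = [[t + p for t, p in zip(te, positional_embedding(c))]
              for te, c in zip(temporal, channel_coords)]
    # 3) Prepend a [CLS] token and aggregate spatially. The real model uses a
    #    Transformer encoder; mean pooling here is only a placeholder.
    tokens = [[0.0] * EMB_DIM] + tokens
    return [sum(col) / len(tokens) for col in zip(*tokens)]


out = popt_forward([[0.1] * 100, [0.2] * 150], [(1, 2, 3), (4, 5, 6)])
```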
Key Designs¶
- Time-Scale Augmented Pretraining (TSAP):
  - Function: Modifies the data generation pipeline so that the model is exposed to iEEG signals of multiple temporal window lengths during pretraining.
  - Design Motivation: Prevents overfitting to any single time scale and establishes time-scale invariance.
  - Mechanism: Recording segments of length \(l \in \{1, 2, 4, 5\}\) seconds are sampled (3 seconds is held out), and each channel is independently encoded into BrainBERT embeddings. Segments of different window lengths overlap in signal content, yet the frozen temporal encoder maps them to distinct representations.
  - Novelty: The original PopT is pretrained solely on fixed 5-second windows, whereas TSAP encourages cross-scale generalization through multi-scale exposure.
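The multi-scale sampling step might look roughly like the following (helper names and the sampling rate are hypothetical; the paper modifies PopT's data generation pipeline, which is not reproduced here):

```python
# Sketch of TSAP-style multi-scale segment sampling (hypothetical helpers).
import random

TRAIN_SCALES = [1, 2, 4, 5]  # seconds; the 3 s scale is held out
SAMPLE_RATE = 2048           # illustrative iEEG sampling rate, not the paper's


def sample_multiscale_segment(recording, rng):
    """Draw one training segment at a randomly chosen time scale."""
    length_s = rng.choice(TRAIN_SCALES)
    n = length_s * SAMPLE_RATE
    start = rng.randrange(0, len(recording) - n)
    return length_s, recording[start:start + n]


rng = random.Random(0)
recording = [0.0] * (60 * SAMPLE_RATE)  # dummy 60-second recording
scales = {sample_multiscale_segment(recording, rng)[0] for _ in range(200)}
```

Each sampled segment would then be passed through the frozen BrainBERT encoder exactly as in standard PopT pretraining; only the window-length distribution changes.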
- Embedding Space Analysis (PCA + K-Means):
  - Function: Visualizes the distribution of temporal embeddings and [CLS] token representations across different time scales.
  - Design Motivation: Verifies whether TSAP genuinely eliminates time-scale-related clustering.
  - Mechanism: 100 samples are drawn from a single subject–session at each of the 1–5 second time scales, followed by 2D PCA projection and K-Means clustering.
  - Key Finding: PopT pretrained on 5-second windows produces strong time-scale clusters, whereas the TSAP model's clusters are substantially mixed, indicating stronger time-scale invariance.
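The spirit of this check can be sketched in pure Python: instead of the paper's 2D PCA visualization, the toy below runs a minimal K-Means and scores how strongly clusters align with time-scale labels (all data and helper names are invented for illustration):

```python
# Toy version of the embedding-space clustering check: high scale purity
# means embeddings cluster by time scale; purity near 1/k means well mixed.
import random


def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm; returns a cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: dist2(p, centers[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return assign


def scale_purity(points, scale_labels, k):
    """Fraction of points whose cluster's majority scale matches their own."""
    assign = kmeans(points, k)
    hits = 0
    for j in range(k):
        member_labels = [s for s, a in zip(scale_labels, assign) if a == j]
        if member_labels:
            majority = max(set(member_labels), key=member_labels.count)
            hits += member_labels.count(majority)
    return hits / len(points)


# Invented 2D "embeddings": scale-separated (like a 5s-pretrained PopT)
# versus scale-mixed (like a TSAP model).
rng = random.Random(1)
separated = [[s * 10 + rng.random(), rng.random()]
             for s in (1, 2) for _ in range(50)]
labels = [1] * 50 + [2] * 50
mixed = [[rng.random(), rng.random()] for _ in range(100)]
```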
Loss & Training¶
- Pretraining steps are doubled from 500,000 to 1,000,000 to accommodate the larger augmented dataset.
- The learning rate is fixed at \(1 \times 10^{-4}\) to improve training stability.
- The best checkpoint is selected based on validation loss.
- During downstream fine-tuning, 90 electrodes are randomly selected per subject, and each experiment is repeated over 5 random seeds.
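Collected as a config sketch for reference (key names are hypothetical; the values come from the training recipe above):

```python
# Hypothetical config layout mirroring the paper's reported training recipe.
PRETRAIN_CONFIG = {
    "pretrain_steps": 1_000_000,   # doubled from PopT's 500k for the larger augmented set
    "learning_rate": 1e-4,         # fixed, for training stability
    "checkpoint_selection": "best_validation_loss",
}
FINETUNE_CONFIG = {
    "electrodes_per_subject": 90,  # randomly selected per subject
    "num_seeds": 5,                # each experiment repeated over 5 seeds
}
```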
Key Experimental Results¶
Main Results¶
Experiments are conducted on the public BrainTreeBank dataset (10 subjects, 1,688 electrodes) across two auditory-language classification tasks: Word Onset and Sentence Onset.
| Model | 1s | 2s | 3s (held-out) | 4s | 5s |
|---|---|---|---|---|---|
| Non-Pretrained | 0.645 | 0.665 | 0.663 | 0.671 | 0.678 |
| 1s Pretrained | 0.770 | 0.807 | 0.809 | 0.817 | 0.819 |
| 5s Pretrained | 0.717 | 0.801 | 0.846 | 0.879 | 0.901 |
| TSAP | 0.777 | 0.843 | 0.866 | 0.893 | 0.907 |
Word Onset ROC-AUC (mean ± standard error across subjects and 5 seeds)
TSAP matches or surpasses the "optimal" baseline (i.e., models where pretraining and fine-tuning use the same time scale) at all time scales, including the held-out 3-second scale.
Statistical Significance¶
| Comparison | Statistical Significance (p-value) |
|---|---|
| TSAP vs. 1s Optimal (1s) | p = 0.017* |
| TSAP vs. 4s Optimal (4s) | p = 0.00005* |
| TSAP vs. 5s Optimal (5s) | p = 0.004* |
| TSAP vs. 3s Optimal (3s, held-out) | p = 0.442 |
Paired t-tests show that TSAP significantly outperforms the matched-scale baseline at most time scales; at the held-out 3-second scale the improvement does not reach significance, though TSAP remains on par with the baseline.
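The paired t-test statistic behind these comparisons can be sketched as follows (the per-subject scores below are toy numbers, not the paper's; in practice a library routine such as scipy.stats.ttest_rel would supply the p-value):

```python
# Paired t statistic over per-subject score differences (toy example).
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t = mean(d) / (stdev(d) / sqrt(n)) over per-pair differences d."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


# Invented per-subject ROC-AUCs for illustration only.
tsap_scores    = [0.91, 0.90, 0.92, 0.91, 0.90]
optimal_scores = [0.90, 0.89, 0.91, 0.91, 0.89]
t = paired_t_statistic(tsap_scores, optimal_scores)
# With n = 5 pairs (df = 4), |t| > 2.776 implies two-sided p < 0.05.
```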
Key Findings¶
- Time-scale mismatch leads to substantial performance degradation: for example, a model pretrained on 1-second windows performs considerably worse on 5-second inputs than a 5-second pretrained model.
- Even under mismatch, any pretrained model outperforms the non-pretrained baseline, indicating that pretraining still captures valuable information.
- TSAP not only recovers the performance lost due to mismatch but also surpasses the matched "optimal" baseline in most cases.
- PCA analysis confirms that TSAP substantially reduces time-scale clustering in the embedding space.
Highlights & Insights¶
- Simplicity and Effectiveness: TSAP is a purely data-augmentation-based strategy requiring no architectural modifications; mixing multiple time scales only during pretraining suffices.
- Clear Physical Intuition: Although different temporal windows share overlapping information, they produce markedly different embeddings after the temporal encoder, which is the root cause of performance degradation.
- Generalization to Held-Out Time Scales: The 3-second scale is never seen during pretraining, yet the model generalizes well, demonstrating that the invariance learned by TSAP is genuinely transferable.
- High Practical Value: In real-world BCI applications, using neural recordings of varying lengths across tasks and experimental paradigms is the norm rather than the exception.
Limitations & Future Work¶
- Validation is currently limited to iEEG data; other modalities such as EEG have not been tested.
- Only the data augmentation strategy is explored; integration with invariance methods at the temporal encoder level (e.g., frequency-domain approaches such as TF-C or BioFAME) remains unexplored.
- The range of time scales examined is limited (1–5 seconds); generalization over a wider range requires further investigation.
- Computational cost doubles (pretraining steps increase from 500K to 1M), though this overhead is acceptable given the performance gains.
Related Work & Insights¶
- PopT (chau2025population): Population-level Transformer; the base framework of this paper.
- BrainBERT (wang2023brainbert): Channel-independent temporal encoder providing frozen temporal embeddings.
- TS-Rep (somaiya2022ts): Encourages duration-agnostic representations via a triplet objective.
- TF-C (zhang2022self): Promotes time-scale invariance through frequency-domain consistency.
- Insight: Data augmentation constitutes a lightweight yet effective solution to preprocessing diversity and is broadly applicable to other sensor data domains.
Rating¶
- Novelty: ⭐⭐⭐ The method itself is straightforward multi-scale data augmentation, though the problem identification is valuable.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage of multiple time scales, two tasks, statistical testing, and embedding analysis.
- Writing Quality: ⭐⭐⭐⭐ Concise workshop paper with clear structure and rigorous argumentation.
- Value: ⭐⭐⭐⭐ Directly beneficial for the engineering deployment of neural signal foundation models.