STAMP: Spatial-Temporal Adapter with Multi-Head Pooling¶
- Conference: NeurIPS 2025
- arXiv: 2511.10848
- Code: https://github.com/autonlab/STAMP
- Area: EEG Signals / Foundation Model Adaptation
- Keywords: EEG Classification, TSFM Adapter, Spatial-Temporal Encoding, Multi-Head Pooling, Parameter Efficiency
TL;DR¶
STAMP introduces a lightweight spatial-temporal adapter with only 750K parameters for Time Series Foundation Models (TSFMs). Through three sets of positional encodings (token/spatial/temporal), cross-gated MLP mixing, and multi-head attention pooling, it enables a frozen TSFM (e.g., MOMENT, 385M parameters) to match or surpass EEG-specific models such as the 29M-parameter CBraMod across 8 EEG datasets, reaching 1.93× CBraMod's Kappa on BCIC-IV-2a.
Background & Motivation¶
Background: TSFMs (e.g., MOMENT, Chronos), pretrained across multiple domains, have demonstrated strong general-purpose representation capabilities. EEG-specific foundation models (CBraMod, LaBraM) perform well on EEG classification but require large parameter counts (29M/5.8M) and EEG-specific pretraining.
Limitations of Prior Work: TSFMs process univariate time series, whereas EEG is spatiotemporal data with 64 channels × 1000+ time steps, so direct application cannot capture the spatial dimension. Feeding each channel independently into a TSFM discards inter-channel spatial relationships.
Key Challenge: TSFMs possess strong temporal representations but lack spatial understanding; EEG models understand spatial structure but require large-scale EEG pretraining. The core challenge is enabling TSFMs to understand EEG's spatial dimension at minimal cost.
Goal: Design a lightweight adapter that enables general-purpose TSFMs to efficiently process spatiotemporal EEG data.
Key Insight: Freeze the TSFM and train only a 750K-parameter adapter — three sets of positional encodings inject spatial information, cross-gated MLP mixes spatial and temporal features, and multi-head attention pooling aggregates representations.
Core Idea: Triple positional encoding (token + spatial + temporal) + cross-gated MLP spatiotemporal mixing + multi-head pooling = 750K parameters enabling a frozen TSFM to process EEG spatiotemporal data.
Method¶
Overall Architecture¶
EEG data (\(S\) channels × \(T\) time steps) → frozen TSFM (e.g., MOMENT-L 385M) encodes to \(S \times T' \times D\) (downsampled 1024→128) → Positional Encoding (\(\tilde{e}_{ij} = e'_{ij} + p_{ij} + s_i + t_j\)) → CC-GMLP (decoupled spatial and temporal gated mixing) → MHAP (multi-head attention pooling to fixed-length vector) → Classification Head
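The pipeline above can be traced as a shape-level sketch. All dimensions here are illustrative, and the CC-GMLP and MHAP stages are replaced by simple placeholders (a linear layer and mean pooling) just to show the data flow, not the paper's actual modules:

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumed, not the paper's exact configuration):
# B batch, S channels, T' temporal patches after downsampling, D hidden dim.
B, S, Tp, D, n_classes = 2, 22, 128, 64, 4

e = torch.randn(B, S, Tp, D)                     # frozen-TSFM output e'_{ij}

# Triple positional encoding: token-wise p_{ij}, spatial s_i, temporal t_j,
# broadcast-added as in tilde{e}_{ij} = e'_{ij} + p_{ij} + s_i + t_j.
p = nn.Parameter(torch.zeros(S, Tp, D))
s = nn.Parameter(torch.zeros(S, 1, D))
t = nn.Parameter(torch.zeros(1, Tp, D))
tokens = e + p + s + t

mix = nn.Linear(D, D)                            # placeholder for CC-GMLP mixing
pooled = mix(tokens).flatten(1, 2).mean(dim=1)   # placeholder for MHAP -> (B, D)
logits = nn.Linear(D, n_classes)(pooled)         # classification head
print(logits.shape)                              # torch.Size([2, 4])
```

The key structural point is that the frozen TSFM sees each channel as an independent univariate series; everything that reintroduces spatial structure happens after it, in the adapter.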
Key Designs¶
- Three-Set Positional Encoding (PE-NST):
- Function: Injects spatial-temporal positional information into TSFM output tokens.
- Mechanism: Token-wise PE \(p_{ij} \in \mathbb{R}^D\) provides an independent embedding for each (channel, time) position; Spatial PE \(s_i\) encodes channel identity (e.g., C3/C4/Oz); Temporal PE \(t_j\) encodes temporal position. All three are summed.
- Design Motivation: Ablation studies demonstrate that all three PE components are necessary — token PE alone is insufficient (lacking general spatial/temporal structure), and spatial + temporal PE alone is insufficient (lacking position-specific embeddings).
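A quick back-of-the-envelope on the three PE sets (channel count, patch count, and embedding size are assumed values, not the paper's): the token-wise PE dominates the parameter budget, while the spatial and temporal PEs add shared, generalizable structure almost for free.

```python
# Assumed sizes for illustration: S=22 channels, T'=128 temporal patches, D=64.
S, Tp, D = 22, 128, 64

token_pe    = S * Tp * D   # p_{ij}: one vector per (channel, time) position
spatial_pe  = S * D        # s_i: one vector per channel
temporal_pe = Tp * D       # t_j: one vector per time step

print(token_pe, spatial_pe, temporal_pe)  # 180224 1408 8192
```

This asymmetry is consistent with the ablation result: the cheap spatial/temporal sets carry structure that the expensive token-wise set alone does not, and vice versa.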
- Cross-Channel GMLP (CC-GMLP):
- Function: Performs feature mixing along spatial and temporal dimensions respectively.
- Mechanism: Spatial gating \(g_S(Z) = Z_1 \odot (W \cdot Z_2)\) (mixing along the spatial dimension); temporal gating operates analogously along the temporal dimension. Both operate independently to preserve spatiotemporal disentanglement.
- Design Motivation: Transformers are parameter-heavy for spatiotemporal sequences; GMLP is more efficient, and the CC (cross-channel) variant further reduces parameters (0.74M vs. GMLP 0.79M) while achieving better performance.
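The spatial gating can be sketched as follows. The split-and-gate layout and all dimensions are assumptions based on the \(g_S(Z) = Z_1 \odot (W \cdot Z_2)\) formula above, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class SpatialGate(nn.Module):
    """Sketch of g_S(Z) = Z_1 ⊙ (W · Z_2): split the feature dim in half,
    mix one half along the channel axis with a learned W, and use it to
    gate the other half elementwise (assumed layout)."""
    def __init__(self, n_channels: int):
        super().__init__()
        self.mix = nn.Linear(n_channels, n_channels)  # W acts across channels

    def forward(self, z):                # z: (B, S, T', D)
        z1, z2 = z.chunk(2, dim=-1)      # split features: (B, S, T', D/2) each
        g = self.mix(z2.transpose(1, 3)).transpose(1, 3)  # mix along S
        return z1 * g                    # elementwise gate

B, S, Tp, D = 2, 22, 16, 8
out = SpatialGate(S)(torch.randn(B, S, Tp, D))
print(out.shape)                         # torch.Size([2, 22, 16, 4])
```

The temporal gate would be the mirror image, with the linear mixing applied along the T' axis instead of S; keeping the two separate is what preserves the spatial/temporal decoupling and the parameter savings.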
- Multi-Head Attention Pooling (MHAP):
- Function: Aggregates variable-length spatiotemporal tokens into a fixed-length classification vector.
- Mechanism: Multiple learnable query vectors aggregate token information via attention weights. Final classification: \(\hat{y} = \text{softmax}(W(\lambda z + (1-\lambda)\hat{e}))\)
- Design Motivation: More flexible than mean pooling — capable of learning to attend to different temporal segments and spatial regions.
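A minimal sketch of attention pooling with learnable queries: H query vectors attend over the flattened (channel × time) tokens and their outputs are concatenated into one fixed-length vector. The head count, query count, and single-head attention are illustrative choices, not the paper's exact MHAP:

```python
import torch
import torch.nn as nn

class MHAP(nn.Module):
    """Learnable-query attention pooling (assumed formulation)."""
    def __init__(self, d: int, n_queries: int):
        super().__init__()
        self.q = nn.Parameter(torch.randn(n_queries, d))
        self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)

    def forward(self, tokens):                  # tokens: (B, N, D), N = S * T'
        q = self.q.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)   # (B, n_queries, D)
        return out.flatten(1)                   # fixed length: (B, n_queries * D)

B, N, D = 2, 22 * 16, 8
z = MHAP(D, n_queries=4)(torch.randn(B, N, D))
print(z.shape)                                  # torch.Size([2, 32])
```

Unlike mean pooling, each query can learn its own attention pattern, e.g. one attending to early temporal segments and another to motor-cortex channels.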
Loss & Training¶
- Standard cross-entropy classification loss
- MOMENT-Large (385M) is frozen; adapter contains 750K trainable parameters
- Compatible with multiple TSFMs (MOMENT S/B/L, Chronos, TSPulse)
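The freeze-and-adapt setup can be sketched as below; the backbone and adapter here are tiny stand-ins (a single transformer layer and two linear layers), not MOMENT-L or STAMP's actual modules:

```python
import torch.nn as nn

# Tiny stand-ins: the backbone plays the role of the frozen TSFM (MOMENT-L in
# the paper), the adapter the role of STAMP's 750K trainable parameters.
backbone = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
adapter = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 4))

for p in backbone.parameters():
    p.requires_grad = False          # freeze the TSFM: no gradient updates

trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in backbone.parameters() if not p.requires_grad)
print(trainable)                     # 4420 = (64*64 + 64) + (64*4 + 4)
```

Only the adapter's parameters are passed to the optimizer, so memory and compute for backpropagation scale with the adapter (750K) rather than the backbone (385M).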
Key Experimental Results¶
Main Results (8 EEG Datasets)¶
| Dataset | STAMP (750K) | CBraMod (29M) | LaBraM (5.8M) | Result |
|---|---|---|---|---|
| SHU-MI | 0.660 AUC | 0.657 | 0.660 | Tie |
| MentalArith | 0.811 | 0.749 | 0.772 | STAMP wins |
| BCIC-IV-2a | 0.409 Kappa | 0.212 | 0.316 | STAMP wins (1.93×) |
| TUEV | 0.662 | 0.618 | 0.664 | Tie |
| SEED-V | 0.208 | 0.259 | 0.239 | CBraMod wins |
| FACED | 0.278 | 0.508 | 0.470 | CBraMod wins |
STAMP is competitive or superior on 6 of the 8 datasets (the table shows a representative subset), with relative weakness on emotion recognition tasks.
Ablation Study¶
| Variant | Description |
|---|---|
| PE-NST (all three sets) | Optimal |
| PE-ST (no token PE) | Performance degrades |
| CC-GMLP vs. Transformer | CC-GMLP outperforms on all 4 datasets with fewer parameters |
| MHAP vs. Mean Pooling | MHAP substantially better on BCIC-IV-2a; comparable elsewhere |
| Different TSFM backbones | MOMENT L > B > S; Chronos slightly better on emotion tasks; TSPulse stronger on event tasks |
Key Findings¶
- A 750K-parameter adapter enables a general-purpose TSFM to match a 29M-parameter EEG-specific model on most tasks — 39× improvement in parameter efficiency.
- Emotion recognition (SEED-V, FACED) is a weakness for TSFMs, likely due to the absence of emotion-relevant features in TSFM pretraining.
- CC-GMLP is more efficient than Transformer for spatiotemporal mixing — suggesting EEG spatiotemporal relationships are relatively simple and do not require full attention.
- The choice of TSFM backbone has limited impact — architectural design matters more than pretraining data.
Highlights & Insights¶
- Exceptional Parameter Efficiency: The combination of a 750K adapter and a frozen 385M TSFM outperforms a 29M EEG-specific model, validating that general temporal representations + lightweight spatiotemporal adaptation constitute a more efficient paradigm.
- Elegant CC-GMLP Design: Decoupled spatial and temporal gated mixing avoids the curse of dimensionality while preserving spatiotemporal interaction.
- Discovery of TSFM Capability Boundaries: Failure on emotion recognition reveals that TSFM pretraining signals lack affective semantics — EEG-specific pretraining may be necessary for such tasks.
Limitations & Future Work¶
- Poor performance on emotion recognition tasks — TSFM pretraining data does not contain emotion-related neural signal patterns.
- Evaluation is limited to classification; temporal forecasting and generation tasks — the primary use case of TSFMs — are not assessed, as EEG prediction tasks remain scarce.
- The complete design, with its three sets of positional encodings, introduces additional hyperparameters requiring tuning for different EEG devices.
- Despite being lightweight, the 750K adapter still requires separate training per dataset — zero-shot cross-dataset generalization is not validated.
- CC-GMLP assumes separability of spatial and temporal dimensions, which may be insufficient for tasks requiring joint spatiotemporal modeling (e.g., rapid-response brain-computer interfaces).
- Dependence on patch-level feature extraction from the TSFM may result in information loss for fine-grained events in raw EEG waveforms.
Related Work & Insights¶
- vs. CBraMod: A 29M-parameter EEG-specific model; STAMP achieves comparable performance with a 750K adapter and a general-purpose TSFM.
- vs. LaBraM: A 5.8M-parameter model; STAMP matches or exceeds it on most tasks.
- Insight: The paradigm of freezing large pretrained models and training lightweight adapters, already successful in NLP and CV, is validated here for the EEG domain.
Rating¶
- Novelty: ⭐⭐⭐⭐ First work to adapt TSFMs for EEG spatiotemporal data
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 8 datasets + multiple TSFMs + comprehensive ablations
- Writing Quality: ⭐⭐⭐⭐ Method description is clear and well-organized
- Value: ⭐⭐⭐⭐ Validates the feasibility of general-purpose TSFM + lightweight adaptation for EEG