Generalizable, Real-Time Neural Decoding with Hybrid State-Space Models¶
Conference: NeurIPS 2025 arXiv: 2506.05320 Code: Available Area: Medical Imaging / Brain-Computer Interface / Neural Decoding Keywords: State-Space Models, Brain-Computer Interface, Neural Decoding, Cross-Species Transfer, Real-Time Inference
TL;DR¶
POSSM proposes a hybrid SSM-attention architecture that combines spike-level tokenization with a recurrent state-space model backbone, achieving generalizable real-time neural decoding with inference speeds up to 9× faster than Transformers while maintaining comparable accuracy.
Background & Motivation¶
Neural decoding—mapping neural activity to behavioral or cognitive variables—is central to modern neuroscience and brain-computer interfaces (BCIs). An ideal neural decoder must simultaneously satisfy three requirements:
Accuracy: Robust and precise predictions
Real-Time Capability: Causal, low-latency inference suitable for online scenarios (≤10ms latency)
Generalizability: Flexible transfer to new subjects, tasks, and experimental settings
However, existing methods exhibit a fundamental tension among these three criteria:
| Method | Accuracy | Real-Time | Generalizability |
|---|---|---|---|
| RNN (GRU) | Medium | ✓ High | ✗ Poor (fixed input format) |
| Transformer (POYO/NDT-2) | High | ✗ Poor (\(O(n^2)\) complexity) | ✓ Good (flexible tokenization) |
| POSSM | High | ✓ High | ✓ Good |
RNNs are fast but struggle to generalize across sessions due to their dependence on fixed-size temporal binning; Transformers such as POYO achieve strong generalization through flexible spike tokenization but their quadratic complexity renders them infeasible for real-time use. POSSM resolves this tension through a hybrid architecture.
Method¶
Overall Architecture¶
The POSSM architecture, illustrated in Figure 2, consists of three modules:
- Input Cross-Attention: Inherited from POYO, compresses variable-length spike token sequences into a fixed-size latent representation.
- Recurrent Backbone: An SSM (or GRU/Mamba) updates hidden states across temporal chunks, preserving long-range context.
- Output Cross-Attention & Readout: Queries the most recent \(k=3\) hidden states to produce behavioral predictions.
Key insight: Input processing operates only on 50ms temporal chunks; the recurrent backbone updates hidden states in a streaming fashion without reprocessing historical data.
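To make the streaming pattern concrete, here is a minimal sketch (not the authors' code) of how a POSSM-style decoder could consume 50ms chunks online; `tokenize`, `input_cross_attention`, `backbone_step`, `readout`, and the stream interface are hypothetical placeholder names.

```python
import torch

@torch.no_grad()
def stream_decode(model, spike_stream, chunk_ms=50):
    """Streaming inference: each chunk is tokenized, compressed, and folded
    into a persistent hidden state, so past chunks are never reprocessed."""
    hidden = model.initial_state()                      # recurrent state carried across chunks
    predictions = []
    for chunk in spike_stream.iter_chunks(chunk_ms):    # assumed streaming interface
        tokens = model.tokenize(chunk)                  # spike tokens for this chunk only
        latent = model.input_cross_attention(tokens)    # fixed-size chunk summary
        hidden = model.backbone_step(latent, hidden)    # SSM/GRU hidden-state update
        predictions.append(model.readout(hidden))       # e.g., 2D hand velocity
    return torch.stack(predictions)
```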
Key Designs¶
Spike Tokenization: Following POYO, each spike is represented by two pieces of information:
- Neural unit identity: a learnable unit embedding
- Spike timestamp: relative timing encoded via RoPE (Rotary Position Embedding)
This tokenization scheme supports variable numbers of neurons and different sampling rates.
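A minimal sketch of this tokenization in PyTorch; the embedding size and the rotary helper below are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SpikeTokenizer(nn.Module):
    """Map (unit_id, timestamp) spike events to tokens: a learnable per-unit
    embedding, with the relative spike time kept for rotary encoding."""

    def __init__(self, num_units: int, dim: int = 128):
        super().__init__()
        self.unit_embedding = nn.Embedding(num_units, dim)   # one vector per neural unit

    def forward(self, unit_ids: torch.Tensor, times: torch.Tensor):
        # unit_ids: (n_spikes,) int64; times: (n_spikes,) seconds within the chunk
        return self.unit_embedding(unit_ids), times          # times injected later via RoPE

def rotary_encode(x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Toy rotary position encoding: rotate feature pairs by angles
    proportional to spike time (illustrative frequency schedule)."""
    d = x.shape[-1] // 2
    freqs = 1.0 / (10000 ** (torch.arange(d, dtype=x.dtype) / d))
    ang = t[:, None] * freqs[None, :]
    x1, x2 = x[..., :d], x[..., d:]
    return torch.cat([x1 * torch.cos(ang) - x2 * torch.sin(ang),
                      x1 * torch.sin(ang) + x2 * torch.cos(ang)], dim=-1)
```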
Input Cross-Attention: A PerceiverIO-style cross-attention mechanism compresses the \(N\) spike tokens of chunk \(t\) into a fixed-size summary using a learnable query \(\mathbf{q} \in \mathbb{R}^{1 \times M}\):

\[
\mathbf{z}^{(t)} = \operatorname{softmax}\!\left(\frac{\mathbf{q}\,{\mathbf{K}^{(t)}}^{\top}}{\sqrt{M}}\right)\mathbf{V}^{(t)},
\]

where \(\mathbf{K}^{(t)}, \mathbf{V}^{(t)} \in \mathbb{R}^{N \times M}\) are key and value projections of the chunk's spike tokens.
Recurrent Backbone: The chunk summary \(\mathbf{z}^{(t)}\) is passed to the recurrent backbone to update the hidden state:

\[
\mathbf{h}^{(t)} = f_{\theta}\left(\mathbf{h}^{(t-1)}, \mathbf{z}^{(t)}\right),
\]

where \(f_{\theta}\) is the chosen recurrence (for a linear SSM, \(\mathbf{h}^{(t)} = \bar{\mathbf{A}}\,\mathbf{h}^{(t-1)} + \bar{\mathbf{B}}\,\mathbf{z}^{(t)}\)).
Cross-attention captures local (within-50ms) temporal structure, while the SSM integrates global context across chunks. Three backbone variants are evaluated: S4D, GRU, and Mamba.
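A compact, self-contained sketch of one chunk step, assuming a GRU backbone (the S4D/Mamba variants would swap only the recurrence); the dimensions and module layout are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ChunkStep(nn.Module):
    """PerceiverIO-style compression of a chunk's spike tokens, followed by a
    recurrent state update (GRUCell stands in for the SSM backbone)."""

    def __init__(self, dim: int = 128, n_latents: int = 16, hidden: int = 256):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))    # learnable queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.rnn = nn.GRUCell(n_latents * dim, hidden)               # recurrent backbone stand-in
        self.readout = nn.Linear(hidden, 2)                          # e.g., 2D hand velocity

    def forward(self, spike_tokens: torch.Tensor, h: torch.Tensor):
        # spike_tokens: (1, n_spikes, dim) for one 50ms chunk; h: (1, hidden)
        q = self.latents.unsqueeze(0)                                # (1, n_latents, dim)
        z, _ = self.cross_attn(q, spike_tokens, spike_tokens)        # compress to fixed size
        h = self.rnn(z.flatten(1), h)                                # h_t = f(h_{t-1}, z_t)
        return self.readout(h), h

step = ChunkStep()
h = torch.zeros(1, 256)
tokens = torch.randn(1, 37, 128)          # 37 spikes in this chunk
velocity, h = step(tokens, h)             # streaming: only the new chunk is processed
```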
Generalization Strategies: Two fine-tuning approaches are employed (a minimal sketch follows below):
- Unit Identification (UI): Freezes model weights and trains only new unit/session embeddings (updating <1% of parameters).
- Full Finetuning (FT): Applies UI for several epochs, then unfreezes all parameters for end-to-end training.
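The sketch below illustrates UI-style finetuning, assuming the pretrained model exposes `unit_embedding` and `session_embedding` tables (hypothetical attribute names): freeze everything, then optimize only the embeddings added for the new session.

```python
import torch

def setup_unit_identification(model, lr=1e-3):
    """Freeze all pretrained weights and train only the unit/session embedding
    tables (typically <1% of parameters); in practice, new rows would first be
    appended to these tables for the unseen units and sessions."""
    for p in model.parameters():
        p.requires_grad = False
    new_params = []
    for table in (model.unit_embedding, model.session_embedding):   # assumed attributes
        for p in table.parameters():
            p.requires_grad = True
            new_params.append(p)
    return torch.optim.AdamW(new_params, lr=lr)

# Full finetuning (FT) would then re-enable requires_grad on every parameter
# after a few UI epochs and continue training end-to-end.
```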
Loss & Training¶
- NHP Motor Decoding: Mean Squared Error (MSE) loss predicting 2D hand velocity time series.
- Human Handwriting Decoding: Classification loss predicting intended characters/strokes.
- Human Speech Decoding: CTC (Connectionist Temporal Classification) loss predicting phoneme sequences; training proceeds in two stages, first reconstructing spike counts and then training the CTC decoder (loss setup sketched after this list).
- Training Details: Batch size 128–256, cosine learning-rate schedule, 500 epochs (NHP), and unit-dropout data augmentation.
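A hedged sketch of the two main decoding objectives using standard PyTorch losses; tensor shapes, the phoneme-inventory size, and the batching are illustrative assumptions.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()         # NHP motor decoding: regress 2D hand velocity per time step
ctc = nn.CTCLoss(blank=0)  # speech decoding: align frame-wise phoneme logits to targets

# Motor decoding: predicted and true velocities, shape (batch, time, 2).
pred_vel, true_vel = torch.randn(8, 100, 2), torch.randn(8, 100, 2)
motor_loss = mse(pred_vel, true_vel)

# Speech decoding: log-probs of shape (time, batch, n_phonemes + 1 blank class).
log_probs = torch.randn(120, 8, 41).log_softmax(-1)
targets = torch.randint(1, 41, (8, 30))                  # phoneme indices (0 = blank)
input_lengths = torch.full((8,), 120, dtype=torch.long)
target_lengths = torch.full((8,), 30, dtype=torch.long)
speech_loss = ctc(log_probs, targets, input_lengths, target_lengths)
```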
Key Experimental Results¶
Main Results¶
NHP Motor Decoding (\(R^2\), higher is better):
| Method | C-CO 2016 (2) | C-CO 2010 (5) | T-CO (6) | T-RT (6) | H-CO (1) |
|---|---|---|---|---|---|
| MLP | 0.921 | 0.695 | 0.798 | 0.701 | 0.418 |
| GRU | 0.938 | 0.731 | 0.819 | 0.717 | 0.693 |
| POYO (SS) | 0.929 | 0.753 | 0.831 | 0.729 | — |
| POSSM-GRU (SS) | 0.944 | 0.738 | 0.836 | 0.738 | 0.690 |
| o-POSSM-S4D (FT) | 0.951 | 0.781 | 0.856 | 0.769 | 0.760 |
Human Handwriting Decoding (Accuracy %):
| Method | Accuracy |
|---|---|
| PCA-KNN (baseline) | 81.36 ± 7.53 |
| POYO | 94.86 ± 3.53 |
| POSSM-GRU (from scratch) | 95.82 ± 3.41 |
| o-POSSM-S4D (NHP pretrain + finetune) | 97.73 ± 2.56 |
| o-POSSM-GRU (NHP pretrain + finetune) | 97.25 ± 2.30 |
Human Speech Decoding (Phoneme Error Rate PER %, lower is better):
| Method | PER (%) |
|---|---|
| GRU (no noise) | 39.16 |
| GRU (standard) | 30.06 |
| S4D | 35.99 |
| Mamba | 32.19 |
| POSSM-GRU (no noise) | 29.70 |
| POSSM-GRU (standard) | 27.32 |
Ablation Study¶
Inference Efficiency Comparison (Figure 4c):
| Model | Parameters | GPU Inference Time/Chunk |
|---|---|---|
| MLP | Smallest | ~0.3ms |
| GRU | Small | ~0.3ms |
| POSSM-SS | Smallest (0.41–0.68M) | ~0.5ms |
| o-POSSM | ~8M | ~1.5ms |
| POYO | Medium | ~8ms |
| NDT-2 | Larger | ~12ms |
POSSM also achieves ~2.44ms/chunk (single-session) and ~5.65ms/chunk (pretrained) on CPU, both within the real-time BCI requirement (≤10ms).
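For context on the ≤10ms budget, per-chunk latency can be estimated with a simple timing loop like the one below; this is a generic measurement sketch (assuming the chunk-step interface from the earlier sketch), not the paper's benchmark harness.

```python
import time
import torch

def chunk_latency_ms(model, example_chunk, state, n_iters=200):
    """Median wall-clock time per chunk on CPU; warm up first so the number
    reflects steady-state streaming inference."""
    model.eval()
    with torch.no_grad():
        for _ in range(10):                              # warm-up iterations
            _, state = model(example_chunk, state)
        times = []
        for _ in range(n_iters):
            t0 = time.perf_counter()
            _, state = model(example_chunk, state)
            times.append((time.perf_counter() - t0) * 1e3)
    return sorted(times)[len(times) // 2]
```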
Cross-Species Transfer Experiments:
- o-POSSM pretrained on macaque motor cortex data achieves 2–5% accuracy gains on the human handwriting task after fine-tuning.
- This represents the first successful cross-species neural decoding transfer using deep learning.
Key Findings¶
- Hybrid architecture is optimal: POSSM matches or surpasses pure Transformers (POYO) and pure RNNs across all tasks while achieving up to 9× faster inference.
- Pretraining is highly effective: o-POSSM substantially outperforms models trained from scratch in low-data regimes, supporting efficient cross-session and cross-subject transfer.
- Cross-species transfer succeeds: Monkey-to-human transfer learning yields roughly a 16-percentage-point accuracy improvement on the handwriting task over the PCA-KNN baseline.
- Long-sequence advantage: In long-context tasks such as human speech decoding, Transformers are limited by quadratic complexity, whereas POSSM efficiently handles variable-length sequences of 2–18 seconds.
- Noise robustness: POSSM maintains near-comparable performance without noise augmentation (PER 29.7 vs. 27.3), while the baseline GRU degrades substantially (39.2 vs. 30.1).
Highlights & Insights¶
- Elegant architectural design: Combines the Perceiver's flexible input processing with the SSM's efficient sequence modeling, resolving the core accuracy/latency/generalization tension in neural decoding.
- Pioneering significance of cross-species transfer: Demonstrates that motor cortical neural dynamics in monkeys and humans share transferable common features, providing a solution for data-scarce human BCI scenarios.
- Clinical practicality: Satisfies real-time BCI latency requirements (<10ms) on both GPU and CPU, enabling deployment.
- Modular design: The backbone is interchangeable (S4D/GRU/Mamba), accommodating diverse application requirements.
- Comprehensive experimental coverage: Spans monkey motor decoding to human handwriting and speech, covering the major application domains of neural decoding.
Limitations & Future Work¶
- Restricted to invasive recordings: Validation is limited to intracortical electrode array data; non-invasive modalities such as EEG have not been explored.
- Offline evaluation: Despite being designed for real-time scenarios, all experiments are conducted offline.
- No language model for speech decoding: The absence of a language model for post-processing constrains further reduction of PER.
- Motor cortex only: Decoding from other brain regions (e.g., visual cortex, hippocampus) has not been explored.
- Limited scope of cross-species transfer: Only monkey-to-human transfer on the handwriting task has been validated.
- No self-supervised pretraining: The current approach relies on supervised behavioral labels; self-supervised alternatives have not been explored.
Related Work & Insights¶
- POYO: The direct foundation of POSSM, providing spike tokenization and PerceiverIO encoding.
- NDT-2: A Transformer decoder using spatiotemporal patch tokenization, but with high computational cost.
- Mamba/S4: Modern SSM architectures; POSSM demonstrates their potential as neural decoding backbones.
- BRAND: A real-time BCI deployment platform; POSSM's inference efficiency makes it well-suited for integration.
- Insight: Hybrid attention-recurrent architectures, beyond their success in NLP (e.g., Jamba), demonstrate unique advantages in neural signal processing—enabling flexible local encoding combined with efficient global state propagation.
Rating¶
- Novelty: ★★★★★ — First hybrid SSM-attention architecture for neural decoding; cross-species transfer is pioneering.
- Technical Depth: ★★★★★ — Theoretically grounded architectural design with systematic comparison of multiple backbones.
- Experimental Thoroughness: ★★★★★ — Comprehensive evaluation across three task categories, multiple datasets, inference efficiency, sample efficiency, and cross-species transfer.
- Practicality: ★★★★★ — Meets real-time BCI latency requirements with direct clinical value.
- Overall Recommendation: ★★★★★