# SpikCommander: A High-Performance Spiking Transformer with Multi-View Learning for Efficient Speech Command Recognition

- Conference: AAAI 2026
- arXiv: 2511.07883
- Code: https://github.com/JackieWang9811/SCommander
- Area: Spiking Neural Networks / Speech Recognition
- Keywords: Spiking Neural Networks, Speech Command Recognition, Spiking Transformer, Multi-View Learning, Energy Efficiency

## TL;DR
This paper proposes SpikCommander, a fully spike-driven Transformer architecture that jointly enhances temporal and channel feature modeling via Multi-view Spike Temporal-Aware Self-Attention (MSTASA) and Spike Context Refinement MLP (SCR-MLP), surpassing state-of-the-art SNN methods on SHD/SSC/GSC benchmarks with fewer parameters.
## Background & Motivation
- Background: Spiking Neural Networks (SNNs) offer significant energy efficiency advantages due to their event-driven nature, making them well-suited for Speech Command Recognition (SCR) tasks. Spiking Transformers such as Spikformer and SDT have demonstrated progress on vision tasks.
- Limitations of Prior Work: (a) Existing SNN speech models struggle to capture rich temporal dependencies and contextual information, constrained by the sparsity of binary spike representations; (b) most existing spiking self-attention mechanisms adopt global attention with \(O(N^2)\) complexity, incurring high computational cost; (c) conventional channel-wise MLPs lack context refinement capability.
- Key Challenge: The binary sparsity of spikes limits effective feature extraction, and conventional continuous-valued attention operations are poorly suited to the spike domain.
- Goal: To design an efficient yet expressive fully spike-driven Transformer architecture specifically for speech command recognition.
- Key Insight: A multi-view learning framework that simultaneously captures complementary temporal information from three pathways: local (sliding window), global (long-range), and convolutional (shift-invariant).
- Core Idea: A three-branch complementary temporal-aware attention mechanism combined with a selective context refinement MLP achieves rich temporal modeling under full spike-driven constraints.
## Method
### Overall Architecture
The input speech signal is converted into spike representations by a Spike Embedding Extractor (SEE). The backbone stacks Transformer blocks in which MSTASA and SCR-MLP layers alternate. The classification head sums features across time steps and applies softmax to produce predictions. Training employs BPTT with surrogate gradients.
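To make the readout concrete, here is a minimal PyTorch sketch of the sum-over-time classification head. The tensor shapes, the single linear layer, and the class count (35, as in GSC) are illustrative assumptions, not the paper's exact head.

```python
import torch
import torch.nn as nn

class SpikeReadout(nn.Module):
    """Readout sketch: accumulate spike features over the T time steps,
    then project to class logits (softmax is folded into the loss)."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, spikes: torch.Tensor) -> torch.Tensor:
        rate = spikes.sum(dim=1)   # (B, D): sum over time steps
        return self.fc(rate)

x = torch.randint(0, 2, (4, 100, 64)).float()  # toy spike train: B=4, T=100, D=64
print(SpikeReadout(64, 35)(x).shape)           # torch.Size([4, 35])
```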
### Key Designs
- Spike Temporal-Aware Self-Attention (STASA) and Its Multi-View Extension (MSTASA)
  - Function: Captures complementary temporal dependencies in the spike domain with linear complexity \(O(ND)\).
  - Mechanism: STASA applies temporal masking to the spike-valued \(Q\) and \(K\), then aggregates along the time dimension: \(\hat{Q}_S = \sum_{t=1}^{T} Q'_S[:,t,:]\). The attention weights \(S_{attn} = \beta(\hat{Q}_S + \hat{K}_S)\) are passed through spiking neurons and broadcast onto the values \(V_S\) via element-wise multiplication. MSTASA comprises three branches:
    - Sliding Window STASA (SWA-STASA): restricts attention to a \(2w+1\) window to model local dependencies.
    - Long-Range STASA (LRA-STASA): attends over the full sequence to model global dependencies.
    - V-branch: injects shift-invariant positional patterns into the value representations via a depthwise convolution (9×1 kernel) followed by a pointwise convolution.
    The two STASA branches are fused via a dual-attention projection before being merged with the V-branch.
  - Design Motivation: Classical spiking self-attention (SSA) incurs \(O(N^2)\) complexity due to the \(QK^T\) matrix multiplication; STASA reduces this to \(O(ND)\) via temporal aggregation. The multi-view design captures complementary information: local detail, global context, and shift-invariant patterns. A minimal sketch of the long-range branch appears after this list.
- Spike Context Refinement MLP (SCR-MLP)
  - Function: Enhances channel mixing and temporal context modeling.
  - Mechanism: Three stages: (i) forward projection, where a PCBlock + LinBlock expands features to \(\alpha D\) dimensions (\(\alpha = 4\)); (ii) selective context refinement, where features are split evenly along the channel dimension, one half passes through a depthwise convolution (kernel size 31) to capture local temporal context while the other half passes through unchanged, and the two halves are concatenated; (iii) back projection, which compresses features back to \(D\) dimensions. All operations remain fully spike-driven through {Conv-BN-SN} blocks.
  - Design Motivation: Conventional channel-wise MLPs perform only channel mixing without temporal context. The selective split design, which applies the depthwise convolution to only half the channels, injects contextual information while reducing computational overhead. A sketch follows the list below.
- Spike Embedding Extractor (SEE)
  - Function: Converts the speech input into structured spike representations.
  - Mechanism: A depthwise separable convolution (pointwise 1D + depthwise 1D with kernel size 7) extracts local time-frequency features; a residual linear projection expands the channel dimension; all outputs are converted to spikes via spiking neurons. A minimal sketch appears after this list.
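To ground the attention description in the first item, below is a minimal PyTorch sketch of the long-range (LRA) branch of STASA under simplifying assumptions: a stateless threshold stands in for the LIF spiking neurons, temporal masking is omitted, and the projection layers and threshold values are illustrative rather than the paper's.

```python
import torch
import torch.nn as nn

def heaviside_spike(x: torch.Tensor, thresh: float = 1.0) -> torch.Tensor:
    """Stand-in spiking neuron: emit a spike where the input crosses the
    threshold. The paper uses stateful neurons trained with surrogate
    gradients; a plain threshold keeps this sketch self-contained."""
    return (x >= thresh).float()

class LRASTASASketch(nn.Module):
    """Minimal sketch of the long-range STASA branch (assumed shapes/params).

    Spike-valued Q and K are aggregated over the time axis, scaled by a
    learnable beta, re-spiked, and broadcast onto V, so no N x N attention
    matrix is ever formed.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.beta = nn.Parameter(torch.ones(dim))

    def forward(self, x_s: torch.Tensor) -> torch.Tensor:
        # x_s: binary spike tensor of shape (B, T, D)
        q_s = heaviside_spike(self.q(x_s))
        k_s = heaviside_spike(self.k(x_s))
        v_s = heaviside_spike(self.v(x_s))
        q_hat = q_s.sum(dim=1)                               # (B, D)
        k_hat = k_s.sum(dim=1)                               # (B, D)
        attn = heaviside_spike(self.beta * (q_hat + k_hat))  # (B, D), spike-valued
        return v_s * attn.unsqueeze(1)                       # broadcast over T

x = torch.randint(0, 2, (2, 100, 64)).float()
print(LRASTASASketch(64)(x).shape)  # torch.Size([2, 100, 64])
```

Because \(\hat{Q}_S\) and \(\hat{K}_S\) collapse the time axis before any interaction, the cost stays at \(O(ND)\) rather than \(O(N^2)\).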
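The selective split in SCR-MLP is likewise easy to sketch. In this assumed PyTorch version, the paper's {Conv-BN-SN} wrappers are collapsed into plain layers; only the channel split plus depthwise temporal convolution is shown.

```python
import torch
import torch.nn as nn

class SCRMLPSketch(nn.Module):
    """Sketch of selective context refinement (assumed simplification):
    expand D -> alpha*D, split channels in half, apply a depthwise temporal
    convolution (kernel 31) to one half only, concatenate, project back."""
    def __init__(self, dim: int, alpha: int = 4, kernel: int = 31):
        super().__init__()
        hidden = alpha * dim
        self.up = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv1d(hidden // 2, hidden // 2, kernel,
                                padding=kernel // 2, groups=hidden // 2)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D)
        h = self.up(x)
        a, b = h.chunk(2, dim=-1)                            # even channel split
        a = self.dwconv(a.transpose(1, 2)).transpose(1, 2)   # local temporal context
        return self.down(torch.cat([a, b], dim=-1))

x = torch.rand(2, 100, 64)
print(SCRMLPSketch(64)(x).shape)  # torch.Size([2, 100, 64])
```

Running the convolution over only half the channels halves that stage's cost while still letting every output channel see contextual features after the back projection.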
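Finally, a minimal sketch of the SEE front end, again with a plain threshold in place of the spiking neuron and with illustrative channel sizes.

```python
import torch
import torch.nn as nn

class SEESketch(nn.Module):
    """Sketch of the spike embedding front end (assumed channel sizes):
    pointwise 1D conv, then depthwise 1D conv (kernel 7) over time, plus a
    residual linear projection; a threshold stands in for the spiking neuron."""
    def __init__(self, in_ch: int, dim: int, kernel: int = 7):
        super().__init__()
        self.pw = nn.Conv1d(in_ch, dim, kernel_size=1)
        self.dw = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.proj = nn.Linear(in_ch, dim)     # residual channel projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) real-valued speech features
        h = self.dw(self.pw(x.transpose(1, 2))).transpose(1, 2)
        h = h + self.proj(x)                  # residual path
        return (h >= 1.0).float()             # spike conversion (stand-in)

x = torch.rand(2, 100, 40)                    # e.g. 40 mel-like input channels
print(SEESketch(40, 64)(x).shape)             # torch.Size([2, 100, 64])
```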
### Loss & Training
Standard cross-entropy loss; end-to-end training via BPTT with an ArcTan surrogate gradient, using \(T = 100\) time steps.
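For reference, a common form of the ArcTan surrogate (as popularized by SpikingJelly-style implementations) can be written as a custom autograd function; the sharpness constant `alpha` here is an assumed default, not necessarily the paper's setting.

```python
import torch

class ArcTanSpike(torch.autograd.Function):
    """Heaviside spike with an ArcTan surrogate gradient; alpha is an
    assumed sharpness constant, not necessarily the paper's value."""
    alpha = 2.0

    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v >= 0.0).float()   # spike when membrane potential >= threshold

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        a = ArcTanSpike.alpha
        # smooth pseudo-derivative replacing the Heaviside step's zero gradient
        surrogate = a / (2.0 * (1.0 + (torch.pi / 2.0 * a * v) ** 2))
        return grad_out * surrogate

v = torch.randn(8, requires_grad=True)
ArcTanSpike.apply(v).sum().backward()   # gradients flow through the surrogate
print(v.grad.shape)                     # torch.Size([8])
```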
## Key Experimental Results
### Main Results
| Dataset | Method | Params (M) | Time Steps | Accuracy (%) |
|---|---|---|---|---|
| SHD | SpikeSCR (1L) | 0.26 | 100 | 95.60 |
| SHD | Pfa-SNN | 0.20 | 100 | 96.26 |
| SHD | SpikCommander (1L) | 0.19 | 100 | 96.41 |
| SSC | SpikeSCR (2L) | 3.30 | 100 | 82.79 |
| SSC | SpikCommander (2L) | 2.13 | 100 | 83.49 |
| GSC | Spiking LMUFormer | 1.69 | - | 96.12 |
| GSC | d-cAdLIF (2L) | 0.61 | 100 | 95.69 |
| GSC | SpikCommander (2L) | 2.13 | 100 | 96.92 |
### Ablation Study
| Configuration | SHD Accuracy (relative) | Notes |
|---|---|---|
| LRA-STASA only | Lower | Lacks local information |
| SWA-STASA only | Lower | Lacks global information |
| MSTASA (three branches) | Best | Complementary gains |
| Standard MLP | Lower | No context refinement |
| SCR-MLP | Higher | Selective refinement effective |
### Key Findings
- SpikCommander surpasses all state-of-the-art SNN methods across three datasets with fewer parameters (e.g., SHD: 0.19M vs. SpikeSCR's 0.26M).
- On GSC, SpikCommander even outperforms the ANN model LMUFormer (96.92% vs. 96.53%), a rare achievement in the SNN literature.
- All three branches of the multi-view design contribute independently; removing any single branch degrades performance.
- The selective refinement in SCR-MLP is more efficient and effective than applying depthwise convolution to all channels.
## Highlights & Insights
- Linear-Complexity Spiking Attention: Temporal aggregation reduces complexity from \(O(N^2)\) to \(O(ND)\), making spiking Transformers practical for long sequences.
- SNN Surpassing ANN: Results on GSC demonstrate that a carefully designed SNN architecture can close the performance gap with ANNs and even surpass them, which carries significant implications for neuromorphic computing.
- Selective Channel-Split Refinement: Applying depthwise convolution to only half the channels elegantly balances computational efficiency and contextual modeling capacity.
## Limitations & Future Work
- Validation is limited to speech command recognition; generalization to more complex speech tasks (e.g., ASR) remains unexplored.
- Energy efficiency has not been validated on real neuromorphic hardware (e.g., Loihi, Tianjic).
- The time step is fixed at 100; adaptive time-step mechanisms could further reduce energy consumption.
- Implementation details regarding the dynamic coupling of sliding window size and input length are insufficiently discussed.
## Related Work & Insights
- vs. Spikformer/SDT: These methods rely on \(O(N^2)\) SSA/SDSA; SpikCommander's linear STASA is more computationally efficient.
- vs. SpikeSCR: Employs a hybrid attention-convolution design but lacks a multi-view architecture and uses more parameters.
- vs. DCLS-Delays: A delay-learning-based approach; SpikCommander replaces explicit delay modeling with attention mechanisms.
## Rating
- Novelty: ⭐⭐⭐⭐ Multi-view spiking attention + SCR-MLP design is original
- Experimental Thoroughness: ⭐⭐⭐⭐ Three standard benchmarks with comprehensive ablation
- Writing Quality: ⭐⭐⭐⭐ Clear architectural diagrams and thorough comparisons
- Value: ⭐⭐⭐⭐ Meaningful advancement for SNN-based speech processing