# SpikCommander: A High-Performance Spiking Transformer with Multi-View Learning for Efficient Speech Command Recognition

- Conference: AAAI 2026
- arXiv: 2511.07883
- Code: https://github.com/JackieWang9811/SCommander
- Area: Spiking Neural Networks / Speech Recognition
- Keywords: Spiking Neural Networks, Speech Command Recognition, Spiking Transformer, Multi-View Learning, Energy Efficiency

## TL;DR
This paper proposes SpikCommander, a fully spike-driven Transformer architecture that jointly enhances temporal and channel feature modeling via Multi-view Spike Temporal-Aware Self-Attention (MSTASA) and Spike Context Refinement MLP (SCR-MLP), surpassing state-of-the-art SNN methods on SHD/SSC/GSC benchmarks with fewer parameters.
## Background & Motivation
- Background: Spiking Neural Networks (SNNs) offer significant energy efficiency advantages due to their event-driven nature, making them well-suited for Speech Command Recognition (SCR) tasks. Spiking Transformers such as Spikformer and SDT have demonstrated progress on vision tasks.
- Limitations of Prior Work: (a) Existing SNN speech models struggle to capture rich temporal dependencies and contextual information, constrained by the sparsity of binary spike representations; (b) most existing spiking self-attention mechanisms adopt global attention with \(O(N^2)\) complexity, incurring high computational cost; (c) conventional channel-wise MLPs lack context refinement capability.
- Key Challenge: The binary sparsity of spikes limits effective feature extraction, and conventional continuous-valued attention operations are poorly suited to the spike domain.
- Goal: To design an efficient yet expressive fully spike-driven Transformer architecture specifically for speech command recognition.
- Key Insight: A multi-view learning framework that simultaneously captures complementary temporal information from three pathways: local (sliding window), global (long-range), and convolutional (shift-invariant).
- Core Idea: A three-branch complementary temporal-aware attention mechanism combined with a selective context refinement MLP achieves rich temporal modeling under full spike-driven constraints.
## Method
### Overall Architecture
The input speech signal is converted into spike representations by a Spike Embedding Extractor (SEE). The backbone stacks Transformer blocks in which MSTASA and SCR-MLP layers alternate. The classification head sums features across time steps and applies softmax to produce predictions. Training employs BPTT with surrogate gradients.
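To make the readout concrete, here is a minimal PyTorch sketch of the sum-over-time classification head. The tensor shapes, the single linear layer, and the class count (35, as in GSC) are illustrative assumptions, not the paper's exact head.

```python
import torch
import torch.nn as nn

class SpikeReadout(nn.Module):
    """Readout sketch: accumulate spike features over the T time steps,
    then project to class logits (softmax is folded into the loss)."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, spikes: torch.Tensor) -> torch.Tensor:
        rate = spikes.sum(dim=1)   # (B, D): sum over time steps
        return self.fc(rate)

x = torch.randint(0, 2, (4, 100, 64)).float()  # toy spike train: B=4, T=100, D=64
print(SpikeReadout(64, 35)(x).shape)           # torch.Size([4, 35])
```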
### Key Designs
- Spike Temporal-Aware Self-Attention (STASA) and Its Multi-View Extension (MSTASA)
  - Function: Captures complementary temporal dependencies in the spike domain with linear complexity \(O(ND)\).
  - Mechanism: STASA applies temporal masking to the spike-valued \(Q\) and \(K\), then aggregates along the time dimension: \(\hat{Q}_S = \sum_{t=1}^{T} Q'_S[:,t,:]\). The attention weights \(S_{attn} = \beta(\hat{Q}_S + \hat{K}_S)\) are passed through spiking neurons and broadcast onto the values \(V_S\) via element-wise multiplication. MSTASA comprises three branches:
    - Sliding Window STASA (SWA-STASA): restricts attention to a \(2w+1\) window to model local dependencies.
    - Long-Range STASA (LRA-STASA): attends over the full sequence to model global dependencies.
    - V-branch: injects shift-invariant positional patterns into the value representations via a depthwise convolution (9×1 kernel) followed by a pointwise convolution.
    The two STASA branches are fused via a dual-attention projection before being merged with the V-branch.
  - Design Motivation: Classical spiking self-attention (SSA) incurs \(O(N^2)\) complexity due to the \(QK^T\) matrix multiplication; STASA reduces this to \(O(ND)\) via temporal aggregation. The multi-view design captures complementary information: local detail, global context, and shift-invariant patterns. A minimal sketch of the long-range branch appears after this list.
- Spike Context Refinement MLP (SCR-MLP)
  - Function: Enhances channel mixing and temporal context modeling.
  - Mechanism: Three stages: (i) forward projection, where a PCBlock + LinBlock expands features to \(\alpha D\) dimensions (\(\alpha = 4\)); (ii) selective context refinement, where features are split evenly along the channel dimension, one half passes through a depthwise convolution (kernel size 31) to capture local temporal context while the other half passes through unchanged, and the two halves are concatenated; (iii) back projection, which compresses features back to \(D\) dimensions. All operations remain fully spike-driven through {Conv-BN-SN} blocks.
  - Design Motivation: Conventional channel-wise MLPs perform only channel mixing without temporal context. The selective split design, which applies the depthwise convolution to only half the channels, injects contextual information while reducing computational overhead. A sketch follows the list below.
- Spike Embedding Extractor (SEE)
  - Function: Converts the speech input into structured spike representations.
  - Mechanism: A depthwise separable convolution (pointwise 1D + depthwise 1D with kernel size 7) extracts local time-frequency features; a residual linear projection expands the channel dimension; all outputs are converted to spikes via spiking neurons. A minimal sketch appears after this list.
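To ground the attention description in the first item, below is a minimal PyTorch sketch of the long-range (LRA) branch of STASA under simplifying assumptions: a stateless threshold stands in for the LIF spiking neurons, temporal masking is omitted, and the projection layers and threshold values are illustrative rather than the paper's.

```python
import torch
import torch.nn as nn

def heaviside_spike(x: torch.Tensor, thresh: float = 1.0) -> torch.Tensor:
    """Stand-in spiking neuron: emit a spike where the input crosses the
    threshold. The paper uses stateful neurons trained with surrogate
    gradients; a plain threshold keeps this sketch self-contained."""
    return (x >= thresh).float()

class LRASTASASketch(nn.Module):
    """Minimal sketch of the long-range STASA branch (assumed shapes/params).

    Spike-valued Q and K are aggregated over the time axis, scaled by a
    learnable beta, re-spiked, and broadcast onto V, so no N x N attention
    matrix is ever formed.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.beta = nn.Parameter(torch.ones(dim))

    def forward(self, x_s: torch.Tensor) -> torch.Tensor:
        # x_s: binary spike tensor of shape (B, T, D)
        q_s = heaviside_spike(self.q(x_s))
        k_s = heaviside_spike(self.k(x_s))
        v_s = heaviside_spike(self.v(x_s))
        q_hat = q_s.sum(dim=1)                               # (B, D)
        k_hat = k_s.sum(dim=1)                               # (B, D)
        attn = heaviside_spike(self.beta * (q_hat + k_hat))  # (B, D), spike-valued
        return v_s * attn.unsqueeze(1)                       # broadcast over T

x = torch.randint(0, 2, (2, 100, 64)).float()
print(LRASTASASketch(64)(x).shape)  # torch.Size([2, 100, 64])
```

Because \(\hat{Q}_S\) and \(\hat{K}_S\) collapse the time axis before any interaction, the cost stays at \(O(ND)\) rather than \(O(N^2)\).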
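The selective split in SCR-MLP is likewise easy to sketch. In this assumed PyTorch version, the paper's {Conv-BN-SN} wrappers are collapsed into plain layers; only the channel split plus depthwise temporal convolution is shown.

```python
import torch
import torch.nn as nn

class SCRMLPSketch(nn.Module):
    """Sketch of selective context refinement (assumed simplification):
    expand D -> alpha*D, split channels in half, apply a depthwise temporal
    convolution (kernel 31) to one half only, concatenate, project back."""
    def __init__(self, dim: int, alpha: int = 4, kernel: int = 31):
        super().__init__()
        hidden = alpha * dim
        self.up = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv1d(hidden // 2, hidden // 2, kernel,
                                padding=kernel // 2, groups=hidden // 2)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D)
        h = self.up(x)
        a, b = h.chunk(2, dim=-1)                            # even channel split
        a = self.dwconv(a.transpose(1, 2)).transpose(1, 2)   # local temporal context
        return self.down(torch.cat([a, b], dim=-1))

x = torch.rand(2, 100, 64)
print(SCRMLPSketch(64)(x).shape)  # torch.Size([2, 100, 64])
```

Running the convolution over only half the channels halves that stage's cost while still letting every output channel see contextual features after the back projection.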
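Finally, a minimal sketch of the SEE front end, again with a plain threshold in place of the spiking neuron and with illustrative channel sizes.

```python
import torch
import torch.nn as nn

class SEESketch(nn.Module):
    """Sketch of the spike embedding front end (assumed channel sizes):
    pointwise 1D conv, then depthwise 1D conv (kernel 7) over time, plus a
    residual linear projection; a threshold stands in for the spiking neuron."""
    def __init__(self, in_ch: int, dim: int, kernel: int = 7):
        super().__init__()
        self.pw = nn.Conv1d(in_ch, dim, kernel_size=1)
        self.dw = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.proj = nn.Linear(in_ch, dim)     # residual channel projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) real-valued speech features
        h = self.dw(self.pw(x.transpose(1, 2))).transpose(1, 2)
        h = h + self.proj(x)                  # residual path
        return (h >= 1.0).float()             # spike conversion (stand-in)

x = torch.rand(2, 100, 40)                    # e.g. 40 mel-like input channels
print(SEESketch(40, 64)(x).shape)             # torch.Size([2, 100, 64])
```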
### Loss & Training
Standard cross-entropy loss; end-to-end training via BPTT with an ArcTan surrogate gradient, using \(T = 100\) time steps.
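For reference, a common form of the ArcTan surrogate (as popularized by SpikingJelly-style implementations) can be written as a custom autograd function; the sharpness constant `alpha` here is an assumed default, not necessarily the paper's setting.

```python
import torch

class ArcTanSpike(torch.autograd.Function):
    """Heaviside spike with an ArcTan surrogate gradient; alpha is an
    assumed sharpness constant, not necessarily the paper's value."""
    alpha = 2.0

    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v >= 0.0).float()   # spike when membrane potential >= threshold

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        a = ArcTanSpike.alpha
        # smooth pseudo-derivative replacing the Heaviside step's zero gradient
        surrogate = a / (2.0 * (1.0 + (torch.pi / 2.0 * a * v) ** 2))
        return grad_out * surrogate

v = torch.randn(8, requires_grad=True)
ArcTanSpike.apply(v).sum().backward()   # gradients flow through the surrogate
print(v.grad.shape)                     # torch.Size([8])
```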
## Key Experimental Results
### Main Results
| Dataset | Method | Params (M) | Time Steps | Accuracy (%) |
|---|---|---|---|---|
| SHD | SpikeSCR (1L) | 0.26 | 100 | 95.60 |
| SHD | Pfa-SNN | 0.20 | 100 | 96.26 |
| SHD | SpikCommander (1L) | 0.19 | 100 | 96.41 |
| SSC | SpikeSCR (2L) | 3.30 | 100 | 82.79 |
| SSC | SpikCommander (2L) | 2.13 | 100 | 83.49 |
| GSC | Spiking LMUFormer | 1.69 | - | 96.12 |
| GSC | d-cAdLIF (2L) | 0.61 | 100 | 95.69 |
| GSC | SpikCommander (2L) | 2.13 | 100 | 96.92 |
### Ablation Study
| Configuration | SHD Accuracy (relative) | Notes |
|---|---|---|
| LRA-STASA only | Lower | Lacks local information |
| SWA-STASA only | Lower | Lacks global information |
| MSTASA (three branches) | Best | Complementary gains |
| Standard MLP | Lower | No context refinement |
| SCR-MLP | Higher | Selective refinement effective |
### Key Findings
- SpikCommander surpasses all state-of-the-art SNN methods across three datasets with fewer parameters (e.g., SHD: 0.19M vs. SpikeSCR's 0.26M).
- On GSC, SpikCommander even outperforms the ANN model LMUFormer (96.92% vs. 96.53%), a rare achievement in the SNN literature.
- All three branches of the multi-view design contribute independently; removing any single branch degrades performance.
- The selective refinement in SCR-MLP is more efficient and effective than applying depthwise convolution to all channels.
## Highlights & Insights
- Linear-Complexity Spiking Attention: Temporal aggregation reduces complexity from \(O(N^2)\) to \(O(ND)\), making spiking Transformers practical for long sequences.
- SNN Surpassing ANN: Results on GSC demonstrate that a carefully designed SNN architecture can close the performance gap with ANNs and even surpass them, which carries significant implications for neuromorphic computing.
- Selective Channel-Split Refinement: Applying depthwise convolution to only half the channels elegantly balances computational efficiency and contextual modeling capacity.
## Limitations & Future Work
- Validation is limited to speech command recognition; generalization to more complex speech tasks (e.g., ASR) remains unexplored.
- Energy efficiency has not been validated on real neuromorphic hardware (e.g., Loihi, Tianjic).
- The time step is fixed at 100; adaptive time-step mechanisms could further reduce energy consumption.
- Implementation details regarding the dynamic coupling of sliding window size and input length are insufficiently discussed.
## Related Work & Insights
- vs. Spikformer/SDT: These methods rely on \(O(N^2)\) SSA/SDSA; SpikCommander's linear STASA is more computationally efficient.
- vs. SpikeSCR: Employs a hybrid attention-convolution design but lacks a multi-view architecture and uses more parameters.
- vs. DCLS-Delays: A delay-learning-based approach; SpikCommander replaces explicit delay modeling with attention mechanisms.
## Rating
- Novelty: ⭐⭐⭐⭐ Multi-view spiking attention + SCR-MLP design is original
- Experimental Thoroughness: ⭐⭐⭐⭐ Three standard benchmarks with comprehensive ablation
- Writing Quality: ⭐⭐⭐⭐ Clear architectural diagrams and thorough comparisons
- Value: ⭐⭐⭐⭐ Meaningful advancement for SNN-based speech processing