
SpikCommander: A High-Performance Spiking Transformer with Multi-View Learning for Efficient Speech Command Recognition

Conference: AAAI 2026
arXiv: 2511.07883
Code: https://github.com/JackieWang9811/SCommander
Area: Spiking Neural Networks / Speech Recognition
Keywords: Spiking Neural Networks, Speech Command Recognition, Spiking Transformer, Multi-View Learning, Energy Efficiency

TL;DR

This paper proposes SpikCommander, a fully spike-driven Transformer architecture that jointly enhances temporal and channel feature modeling via Multi-view Spike Temporal-Aware Self-Attention (MSTASA) and Spike Context Refinement MLP (SCR-MLP), surpassing state-of-the-art SNN methods on SHD/SSC/GSC benchmarks with fewer parameters.

Background & Motivation

  1. Background: Spiking Neural Networks (SNNs) offer significant energy efficiency advantages due to their event-driven nature, making them well-suited for Speech Command Recognition (SCR) tasks. Spiking Transformers such as Spikformer and SDT have demonstrated progress on vision tasks.
  2. Limitations of Prior Work: (a) Existing SNN speech models struggle to capture rich temporal dependencies and contextual information, constrained by the sparsity of binary spike representations; (b) most existing spiking self-attention mechanisms adopt global attention with \(O(N^2)\) complexity, incurring high computational cost; (c) conventional channel-wise MLPs lack context refinement capability.
  3. Key Challenge: The binary sparsity of spikes limits effective feature extraction, and conventional continuous-valued attention operations are poorly suited to the spike domain.
  4. Goal: To design an efficient yet expressive fully spike-driven Transformer architecture specifically for speech command recognition.
  5. Key Insight: A multi-view learning framework that simultaneously captures complementary temporal information from three pathways: local (sliding window), global (long-range), and convolutional (shift-invariant).
  6. Core Idea: A three-branch complementary temporal-aware attention mechanism combined with a selective context refinement MLP achieves rich temporal modeling under full spike-driven constraints.

Method

Overall Architecture

The input speech signal is converted into spike representations via a Spike Embedding Extractor (SEE). The backbone consists of Transformer blocks with alternating MSTASA and SCR-MLP layers. The classification head produces predictions by summing the per-time-step outputs over all time steps and applying a softmax. Training employs BPTT with surrogate gradients.
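
As a concrete illustration of the readout, here is a minimal sketch of a time-summed classification head; the shape convention [T, B, classes] and the function name are illustrative assumptions, not the authors' code.

```python
import torch

def readout(logits_per_step: torch.Tensor) -> torch.Tensor:
    """Time-summed readout: `logits_per_step` has shape [T, B, num_classes];
    summing over the T time steps yields the final class scores."""
    summed = logits_per_step.sum(dim=0)   # aggregate evidence across time
    return torch.softmax(summed, dim=-1)  # class probabilities
```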

Key Designs

  1. Spike Temporal-Aware Self-Attention (STASA) and Its Multi-View Extension (MSTASA)

    • Function: Captures complementary temporal dependencies in the spike domain with linear complexity \(O(ND)\).
    • Mechanism: STASA applies temporal masking to spike-valued Q and K, then aggregates along the time dimension: \(\hat{Q}_S = \sum_{t=1}^T Q'_S[:,t,:]\). Attention weights are computed as \(S_{attn} = \beta(\hat{Q}_S + \hat{K}_S)\), passed through spiking neurons, and broadcast onto values \(V_S\) via element-wise multiplication. MSTASA comprises three branches: (a) Sliding Window STASA (SWA-STASA) — restricts attention to a \(2w+1\) window to model local dependencies; (b) Long-Range STASA (LRA-STASA) — full-sequence attention for global dependency modeling; (c) V-branch — injects shift-invariant positional patterns into value representations via depthwise convolution (kernel=9×1) followed by pointwise convolution. The two STASA branches are fused via dual-attention projection before being merged with the V-branch.
    • Design Motivation: Classical spiking attention (SSA) incurs \(O(N^2)\) complexity due to the \(QK^T\) matrix multiplication. STASA reduces this to \(O(ND)\) via temporal aggregation. The multi-view design captures complementary information: local detail, global context, and shift-invariant patterns (a minimal STASA sketch follows this list).
  2. Spike Context Refinement MLP (SCR-MLP)

    • Function: Enhances channel mixing and temporal context modeling.
    • Mechanism: Three stages — (i) forward projection: PCBlock + LinBlock expands features to \(\alpha D\) dimensions (\(\alpha=4\)); (ii) selective context refinement: features are split evenly along the channel dimension, one half passes through a depthwise convolution with kernel=31 to capture local temporal context while the other half is passed through directly, followed by concatenation; (iii) back projection: compresses back to \(D\) dimensions. All operations maintain fully spike-driven computation through {Conv-BN-SN} blocks.
    • Design Motivation: Conventional channel-wise MLPs perform only channel mixing without temporal context. The selective split design, which applies depthwise convolution to only half the channels, injects contextual information while reducing computational overhead (an SCR-MLP sketch follows this list).
  3. Spike Embedding Extractor (SEE)

    • Function: Converts speech input into structured spike representations.
    • Mechanism: Depthwise separable convolution (pointwise 1D + depthwise 1D with kernel=7) extracts local time-frequency features; a residual linear projection enhances channel dimensionality; all outputs are converted to spikes via spiking neurons (an SEE sketch follows this list).
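
To make the temporal-aggregation trick concrete, below is a minimal PyTorch sketch of the long-range STASA branch under simplifying assumptions: a stateless Heaviside threshold stands in for the spiking neuron (a real model would use LIF dynamics trained with surrogate gradients), the temporal mask is omitted, and the \(\beta\) scale is an arbitrary choice. The sliding-window branch would differ only in masking Q and K to a \(2w+1\) window before aggregation.

```python
import torch
import torch.nn as nn

def heaviside_spike(x: torch.Tensor, thresh: float = 1.0) -> torch.Tensor:
    # Stand-in spiking nonlinearity: fires 1 when input exceeds threshold.
    return (x >= thresh).float()

class STASASketch(nn.Module):
    """Minimal sketch of linear-complexity spike temporal-aware attention.
    Shapes, the `beta` scale, and the spike function are illustrative
    assumptions, not the paper's implementation."""
    def __init__(self, dim: int, beta: float = 0.125):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.beta = beta

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: spike input of shape [B, T, D]
        q_s = heaviside_spike(self.q(s))      # spike-valued queries
        k_s = heaviside_spike(self.k(s))      # spike-valued keys
        v_s = heaviside_spike(self.v(s))      # spike-valued values
        # Temporal aggregation: sum Q and K over time instead of forming
        # the O(N^2) QK^T product, giving O(N*D) overall.
        q_hat = q_s.sum(dim=1, keepdim=True)  # [B, 1, D]
        k_hat = k_s.sum(dim=1, keepdim=True)  # [B, 1, D]
        attn = heaviside_spike(self.beta * (q_hat + k_hat))
        # Broadcast the spike attention onto values element-wise.
        return v_s * attn
```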
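
Similarly, a minimal sketch of SCR-MLP's split-refine-merge data flow, with the same Heaviside stand-in; the exact {Conv-BN-SN} composition is omitted, so this only illustrates the selective-refinement pattern.

```python
import torch
import torch.nn as nn

def spike(x: torch.Tensor, thresh: float = 1.0) -> torch.Tensor:
    # Stand-in spiking nonlinearity (a real model would use stateful
    # spiking neurons inside {Conv-BN-SN} blocks).
    return (x >= thresh).float()

class SCRMLPSketch(nn.Module):
    """Minimal sketch of selective context refinement; layer composition
    is an assumption, not the authors' implementation."""
    def __init__(self, dim: int, alpha: int = 4, kernel: int = 31):
        super().__init__()
        hidden = alpha * dim
        self.expand = nn.Linear(dim, hidden)   # forward projection to alpha*D
        # Depthwise conv over time, applied to half of the expanded channels.
        self.dwconv = nn.Conv1d(hidden // 2, hidden // 2, kernel,
                                padding=kernel // 2, groups=hidden // 2)
        self.project = nn.Linear(hidden, dim)  # back projection to D

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: spike features of shape [B, T, D]
        h = spike(self.expand(s))
        a, b = h.chunk(2, dim=-1)              # even channel split
        # Refine one half with local temporal context; pass the other through.
        a = spike(self.dwconv(a.transpose(1, 2)).transpose(1, 2))
        return spike(self.project(torch.cat([a, b], dim=-1)))
```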
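
Finally, a sketch of the SEE front-end under the same assumptions; the input layout, channel sizes, and residual placement are illustrative.

```python
import torch
import torch.nn as nn

def spike(x: torch.Tensor, thresh: float = 1.0) -> torch.Tensor:
    return (x >= thresh).float()  # stand-in spiking nonlinearity

class SEESketch(nn.Module):
    """Minimal sketch of the spike embedding front-end; channel sizes and
    residual placement are illustrative assumptions."""
    def __init__(self, in_ch: int, dim: int, kernel: int = 7):
        super().__init__()
        self.pointwise = nn.Conv1d(in_ch, dim, kernel_size=1)        # pointwise 1D
        self.depthwise = nn.Conv1d(dim, dim, kernel,
                                   padding=kernel // 2, groups=dim)  # depthwise 1D
        self.proj = nn.Linear(dim, dim)  # residual linear projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: speech features of shape [B, C_in, T] (e.g., mel bins x time)
        h = spike(self.depthwise(self.pointwise(x)))  # local time-frequency features
        h = h.transpose(1, 2)                         # -> [B, T, D]
        return spike(h + self.proj(h))                # channel enhancement + spikes
```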

Loss & Training

Standard cross-entropy loss, optimized end-to-end via BPTT with the ArcTan surrogate gradient; the number of simulation time steps is \(T=100\).
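
For reference, the ArcTan surrogate is commonly implemented as a custom autograd function whose forward pass is a hard threshold and whose backward pass substitutes a smooth arctan-style derivative; the width parameter below is an assumed setting, not necessarily the paper's.

```python
import math
import torch

class ArcTanSpike(torch.autograd.Function):
    """Heaviside spike with an arctan surrogate gradient, a standard
    construction for BPTT-trained SNNs."""
    alpha = 2.0  # assumed surrogate width hyperparameter

    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v >= 0.0).float()  # fire when membrane potential crosses threshold

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        a = ArcTanSpike.alpha
        # Smooth stand-in derivative: a / (2 * (1 + (pi/2 * a * v)^2))
        sg = a / (2.0 * (1.0 + (math.pi / 2.0 * a * v) ** 2))
        return grad_output * sg

# Usage inside a neuron update: spikes = ArcTanSpike.apply(membrane_potential)
```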

Key Experimental Results

Main Results

| Dataset | Method | Params (M) | Time Steps | Accuracy (%) |
|---------|--------|------------|------------|--------------|
| SHD | SpikeSCR (1L) | 0.26 | 100 | 95.60 |
| SHD | Pfa-SNN | 0.20 | 100 | 96.26 |
| SHD | SpikCommander (1L) | 0.19 | 100 | 96.41 |
| SSC | SpikeSCR (2L) | 3.30 | 100 | 82.79 |
| SSC | SpikCommander (2L) | 2.13 | 100 | 83.49 |
| GSC | Spiking LMUFormer | 1.69 | - | 96.12 |
| GSC | d-cAdLIF (2L) | 0.61 | 100 | 95.69 |
| GSC | SpikCommander (2L) | 2.13 | 100 | 96.92 |

Ablation Study

| Configuration | SHD Acc. | Notes |
|---------------|----------|-------|
| LRA-STASA only | Lower | Lacks local information |
| SWA-STASA only | Lower | Lacks global information |
| MSTASA (three branches) | Best | Complementary gains |
| Standard MLP | Lower | No context refinement |
| SCR-MLP | Higher | Selective refinement effective |

Key Findings

  • SpikCommander surpasses all state-of-the-art SNN methods across three datasets with fewer parameters (e.g., SHD: 0.19M vs. SpikeSCR's 0.26M).
  • On GSC, SpikCommander even outperforms the ANN model LMUFormer (96.92% vs. 96.53%), a rare achievement in the SNN literature.
  • All three branches of the multi-view design contribute independently; removing any single branch degrades performance.
  • The selective refinement in SCR-MLP is more efficient and effective than applying depthwise convolution to all channels.

Highlights & Insights

  • Linear-Complexity Spiking Attention: Temporal aggregation reduces complexity from \(O(N^2)\) to \(O(ND)\), making spiking Transformers practical for long sequences.
  • SNN Surpassing ANN: Results on GSC demonstrate that carefully designed SNN architectures can close and even exceed the performance gap with ANNs, which carries significant implications for neuromorphic computing.
  • Selective Channel-Split Refinement: Applying depthwise convolution to only half the channels elegantly balances computational efficiency and contextual modeling capacity.

Limitations & Future Work

  • Validation is limited to speech command recognition; generalization to more complex speech tasks (e.g., ASR) remains unexplored.
  • Energy efficiency has not been validated on real neuromorphic hardware (e.g., Loihi, Tianjic).
  • The time step is fixed at 100; adaptive time-step mechanisms could further reduce energy consumption.
  • Implementation details regarding the dynamic coupling of sliding window size and input length are insufficiently discussed.

Comparison with Related Methods

  • vs. Spikformer/SDT: These methods rely on \(O(N^2)\) SSA/SDSA; SpikCommander's linear STASA is more computationally efficient.
  • vs. SpikeSCR: Employs a hybrid attention-convolution design but lacks a multi-view architecture and uses more parameters.
  • vs. DCLS-Delays: A delay-learning-based approach; SpikCommander replaces explicit delay modeling with attention mechanisms.

Rating

  • Novelty: ⭐⭐⭐⭐ Multi-view spiking attention + SCR-MLP design is original
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three standard benchmarks with comprehensive ablation
  • Writing Quality: ⭐⭐⭐⭐ Clear architectural diagrams and thorough comparisons
  • Value: ⭐⭐⭐⭐ Meaningful advancement for SNN-based speech processing