Spiking Transformer with Spatial-Temporal Attention¶
Conference: CVPR 2025
arXiv: 2409.19764
Code: None
Area: Spiking Neural Networks / Efficient Inference
Keywords: Spiking Transformer, Spatial-Temporal Attention, SNN, Energy-Efficient Inference, Surrogate Gradient
TL;DR¶
The paper integrates spatially and temporally decoupled attention into the Spiking Transformer architecture. Combining this decoupled design with a spike-driven self-attention mechanism narrows the performance gap with ANNs while preserving the energy efficiency of SNNs, and the model reports SOTA results on multiple vision benchmarks.
Background & Motivation¶
Background¶
Background: Spiking neural networks (SNNs) have attracted significant attention due to their low power consumption and biological interpretability, but a considerable accuracy gap remains compared to artificial neural networks (ANNs). Recently, Spikformer and the Spike-driven Transformer have introduced attention mechanisms into SNNs, achieving notable progress.
Key Challenge: The softmax operation and floating-point multiplications in standard self-attention are incompatible with the binary spiking nature of SNNs. Transplanting standard attention directly sacrifices energy efficiency, while simplified attention mechanisms lose accuracy.
Mechanism: The attention mechanism is decoupled into spatial attention (capturing relationships between patches) and temporal attention (capturing dynamics across timesteps), both implemented using spike-compatible operations.
Proposed Solution¶
Goal: Build a Spiking Transformer that alternates spike-compatible spatial and temporal attention, narrowing the accuracy gap with ANNs while preserving the energy efficiency of SNNs.
Method¶
Overall Architecture¶
Image → Spike Encoding → Multi-layer Spiking Transformer (Alternating Spatial Attention + Temporal Attention) → Classification Output.
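The alternation of spatial and temporal attention can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the thresholding step, and the choice Q = K = V = input (no learned projections) are assumptions made for illustration. The same softmax-free attention routine is applied first across patch tokens within each timestep, then across timesteps for each token.

```python
def spike_attention(q, k, v, threshold=1):
    """Softmax-free attention over binary spike vectors (illustrative only).

    q, k, v: lists of L binary vectors, each of length D.
    Computes Q (K^T V) with AND and integer accumulation only, then
    re-binarizes with a threshold as a stand-in for a spiking neuron.
    """
    L, D = len(q), len(q[0])
    # kv[i][j] counts positions where K feature i and V feature j both spike.
    kv = [[sum(k[l][i] & v[l][j] for l in range(L)) for j in range(D)]
          for i in range(D)]
    out = []
    for l in range(L):
        # A binary query only selects rows of kv, so no multiplication is needed.
        acc = [sum(kv[i][j] for i in range(D) if q[l][i]) for j in range(D)]
        out.append([1 if a >= threshold else 0 for a in acc])
    return out


def spatial_then_temporal(x, threshold=1):
    """x[t][n] is a binary feature vector: T timesteps, N patch tokens."""
    T, N = len(x), len(x[0])
    # Spatial step: tokens attend to each other within each timestep.
    x = [spike_attention(x[t], x[t], x[t], threshold) for t in range(T)]
    # Temporal step: each token attends across its own timesteps.
    per_token = []
    for n in range(N):
        seq = [x[t][n] for t in range(T)]
        per_token.append(spike_attention(seq, seq, seq, threshold))
    # Transpose back to the [T][N][D] layout.
    return [[per_token[n][t] for n in range(N)] for t in range(T)]


# Tiny example: T = 2 timesteps, N = 2 tokens, D = 2 features.
x = [[[1, 0], [0, 1]],
     [[1, 1], [0, 0]]]
y = spatial_then_temporal(x)
```

In this toy input, token 0 at the first timestep picks up a spike in its second feature from the later timestep, which is the kind of cross-timestep interaction the temporal step is meant to capture. The sketch is not causal across time; whether the paper restricts temporal attention to past timesteps is not stated here.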
Key Designs¶
- Spike Spatial Attention: Uses a softmax-free linear attention approximation. Q, K, and V are generated by spiking convolutions, so they are binary and the attention products reduce to additions (accumulations) instead of floating-point multiplications.
- Spike Temporal Attention: Models the interaction between spike tokens across timesteps to capture temporal evolution patterns.
- Adaptive Membrane Potential: Each layer learns its own leaky integrate-and-fire (LIF) neuron parameters, mimicking the heterogeneity of biological neurons.
- Surrogate Gradient Training: The spike function is a non-differentiable step function, so its gradient is approximated during backpropagation with a rectangular window function.
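The LIF dynamics and the rectangular surrogate gradient mentioned in the list above can be sketched as follows. This is a minimal illustration under assumptions: the update rule is a common discrete LIF form with hard reset, and the names `tau`, `v_th`, `v_reset`, and `width` are hypothetical, not taken from the paper. In the paper's setting, parameters such as `tau` and `v_th` would be learnable per layer.

```python
def lif_forward(inputs, tau=2.0, v_th=1.0, v_reset=0.0):
    """Run one LIF neuron over a list of input currents.

    Returns the binary spike train and the membrane potentials
    recorded just before the spike/reset decision.
    """
    v = v_reset
    spikes, potentials = [], []
    for x in inputs:
        # Leaky integration: decay toward v_reset while integrating the input.
        v = v + (x - (v - v_reset)) / tau
        potentials.append(v)
        s = 1 if v >= v_th else 0  # Heaviside step: fire at or above threshold
        spikes.append(s)
        if s:
            v = v_reset  # hard reset after a spike
    return spikes, potentials


def rect_surrogate_grad(v, v_th=1.0, width=1.0):
    """Rectangular-window stand-in for d(spike)/d(v).

    Equals 1/width when v lies within width/2 of the threshold, else 0.
    """
    return 1.0 / width if abs(v - v_th) < width / 2 else 0.0


spikes, potentials = lif_forward([1.0, 1.0, 3.0, 0.0])
grads = [rect_surrogate_grad(v) for v in potentials]
```

With these inputs the neuron fires only on the third step. The surrogate gradient is nonzero only for potentials close to the threshold, so neurons far from firing (or far above threshold) contribute no gradient.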
Loss & Training¶
The training objective is cross-entropy loss plus a spike sparsity regularization term. The model is optimized with SGD and run for T = 4 timesteps.
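A minimal sketch of such an objective follows. The exact form of the sparsity regularizer is not given in this summary, so the mean firing rate used here, the weight `lam`, and all function names are assumptions for illustration.

```python
import math


def cross_entropy(logits, label):
    """Numerically stable cross-entropy for a single example."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def firing_rate(spike_trains):
    """Fraction of (neuron, timestep) slots that contain a spike."""
    slots = sum(len(train) for train in spike_trains)
    return sum(sum(train) for train in spike_trains) / slots


def total_loss(logits, label, spike_trains, lam=0.01):
    """Cross-entropy plus an assumed sparsity penalty on the mean firing rate."""
    return cross_entropy(logits, label) + lam * firing_rate(spike_trains)


# Two classes with equal logits, two neurons observed over T = 4 timesteps.
loss = total_loss([0.0, 0.0], 0, [[1, 0, 0, 1], [0, 0, 0, 0]])
```

Penalizing the firing rate pushes the network toward fewer spikes, which is what ties the regularizer to energy consumption: fewer spikes mean fewer accumulate operations at inference time.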
Key Experimental Results¶
Main Results¶
| Dataset | Architecture | T (timesteps) | Ours | Prev. SOTA | Gain (pp) |
|---|---|---|---|---|---|
| CIFAR-10 | ResNet-19 | 4 | 96.8% | 96.5% | +0.3 |
| CIFAR-100 | ResNet-19 | 4 | 81.5% | 80.1% | +1.4 |
| ImageNet | ResNet-34 | 4 | 69.8% | 67.7% | +2.1 |
Ablation Study¶
| Configuration | CIFAR-100 acc. | Gain vs. baseline (pp) |
|---|---|---|
| Baseline SNN (no attention) | 78.2% | n/a |
| + Spatial Attention | 79.5% | +1.3 |
| + Temporal Attention | 80.3% | +2.1 |
| Full Model | 81.5% | +3.3 |
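The gains in the ablation table can be checked with a few lines of arithmetic. The snippet below only recomputes differences from the accuracies listed above; the dictionary keys are labels chosen for this sketch.

```python
# CIFAR-100 accuracies copied from the ablation table (percent).
acc = {"baseline": 78.2, "spatial": 79.5, "temporal": 80.3, "full": 81.5}

# Gain of each configuration over the baseline, in percentage points.
gains = {name: round(value - acc["baseline"], 1)
         for name, value in acc.items() if name != "baseline"}

# Sum of the two single-component gains, for comparison with the full model.
sum_of_parts = round(gains["spatial"] + gains["temporal"], 1)
```

The two single-component gains sum to 3.4 points, against 3.3 points for the full model, so in this table the two attention types appear to contribute almost additively.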
Key Findings¶
- Temporal attention contributes more than spatial attention: on CIFAR-100 it adds 2.1 points over the baseline, versus 1.3 points for spatial attention.
- The model achieves optimal performance at 4 timesteps, with significantly lower energy consumption than ANNs.
- The 2.1-point gain on ImageNet exceeds the gains on CIFAR-10 (+0.3) and CIFAR-100 (+1.4), suggesting that the approach helps more on large-scale tasks.
Highlights & Insights¶
- Decoupling spatial-temporal attention modularizes the design, enabling independent validation of each component.
- Spike-compatible attention preserves the energy efficiency advantages of SNNs.
- Adaptive membrane parameters enhance biological fidelity.
Limitations & Future Work¶
- An accuracy gap with ANNs remains (approximately 6-7 percentage points on ImageNet).
- Whether the additional computational cost of attention is fully offset by the energy efficiency of the SNN requires more rigorous analysis.
- The effectiveness of temporal attention on video or event-driven data requires further validation.
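The energy question raised in the list above is often approached with a simple operation-counting model. The sketch below is one such estimate, not the paper's analysis: the per-operation energies are 45 nm CMOS figures commonly used in SNN papers (assumed here), and the function names and operation counts are hypothetical.

```python
# Assumed per-operation energies in picojoules (45 nm CMOS estimates often
# used in the SNN literature); they are not taken from this paper.
E_MAC_PJ = 4.6  # one 32-bit floating-point multiply-accumulate (ANN)
E_AC_PJ = 0.9   # one 32-bit floating-point accumulate (SNN, per spike)


def ann_energy_mj(macs):
    """Energy of an ANN forward pass in millijoules."""
    return macs * E_MAC_PJ * 1e-9  # 1 pJ = 1e-9 mJ


def snn_energy_mj(ops_per_step, firing_rate, timesteps):
    """Energy of an SNN: only synapses that receive a spike cost an accumulate."""
    return ops_per_step * firing_rate * timesteps * E_AC_PJ * 1e-9


def break_even_ops_ratio(firing_rate, timesteps):
    """How many times more operations the SNN may use before losing its advantage."""
    return E_MAC_PJ / (E_AC_PJ * firing_rate * timesteps)


ann = ann_energy_mj(1e9)             # 1 GMAC ANN
snn = snn_energy_mj(1e9, 0.2, 4)     # same op count, 20% firing rate, T = 4
ratio = break_even_ops_ratio(0.2, 4)
```

Under these assumed numbers, the SNN stays ahead as long as the extra attention operations do not inflate its operation count by more than the break-even ratio. A rigorous analysis would also need measured firing rates per layer and the cost of memory access, which this model ignores.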
Rating¶
- Novelty: ⭐⭐⭐⭐ The application of spatially-temporally decoupled attention in SNNs is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Thorough validation across multiple datasets, with detailed ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Clearly structured.
- Value: ⭐⭐⭐⭐ Facilitates bridging the gap between SNNs and ANNs.