# Spiking Meets Attention: Efficient Remote Sensing Image Super-Resolution with Attention Spiking Neural Networks
Conference: NeurIPS 2025 arXiv: 2503.04223 Code: https://github.com/XY-boy/SpikeSR Area: Image Restoration Keywords: Spiking Neural Networks, Remote Sensing Super-Resolution, Attention Mechanism, Deformable Similarity Attention, Energy-Efficient AI
## TL;DR
This paper proposes SpikeSR, the first attention-based spiking neural network (SNN) framework for remote sensing image super-resolution. By incorporating Spiking Attention Blocks (SAB) that combine Hybrid Dimensional Attention (HDA) and Deformable Similarity Attention (DSA), SpikeSR achieves state-of-the-art performance on AID/DOTA/DIOR while maintaining high computational efficiency.
## Background & Motivation
Background: High-resolution remote sensing images (RSI) are critical for downstream tasks, yet sensor-imposed resolution limits remain a fundamental constraint. Deep learning-based SR methods (CNN/Transformer) have achieved notable progress but incur substantial computational overhead, making large-scale deployment in remote sensing scenarios difficult.
Limitations of Prior Work:
- CNN-based SR methods (EDSR, RCAN, etc.) focus on network design but exhibit high computational complexity, particularly in exhaustive non-local modeling operations
- Transformer-based SR methods (SwinIR, HiT-SR, etc.) offer global modeling capacity but still carry large parameter counts and FLOPs
- SNNs, as third-generation neural networks, offer inherent energy efficiency advantages but remain almost entirely unexplored for pixel-level regression tasks such as SR
Key Challenge: The binary spike signals of SNNs inevitably cause per-pixel information loss (spiking degradation), and insufficiently optimized membrane potential dynamics limit the representational capacity of SNNs for SR.
Goal:
- Introduce SNNs into remote sensing SR to leverage their energy efficiency advantages
- Optimize membrane potentials via attention mechanisms to enhance SNN representational capacity
- Achieve or surpass ANN-level performance while maintaining low FLOPs
Key Insight: Even in severely degraded remote sensing images, LIF neurons maintain vigorous membrane potential fluctuations (an active learning state), suggesting that SNNs possess an inherent sensitivity to high-frequency information (Figure 1a).
Core Idea: Regulate SNN membrane potentials through attention mechanisms (temporal-channel and deformable spatial), enabling spiking neural networks to achieve state-of-the-art performance in remote sensing SR for the first time with greater efficiency.
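To make the LIF dynamics behind this observation concrete, here is a minimal NumPy sketch of a single leaky integrate-and-fire neuron over \(T\) time steps; the time constant `tau`, threshold `v_th`, and hard-reset behavior are illustrative assumptions, not the paper's exact neuron configuration.

```python
import numpy as np

def lif_neuron(inputs, tau=2.0, v_th=1.0, v_reset=0.0):
    """Simulate one LIF neuron over T time steps (illustrative parameters).

    inputs: array of shape (T,) -- input current at each step.
    Returns (spikes, potentials) as arrays of shape (T,).
    """
    v = v_reset
    spikes, potentials = [], []
    for x in inputs:
        # Leaky integration: the potential decays toward rest while
        # accumulating the input current.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_th:
            spikes.append(1)
            v = v_reset          # hard reset after firing
        else:
            spikes.append(0)
        potentials.append(v)
    return np.array(spikes), np.array(potentials)

spikes, pots = lif_neuron(np.array([1.5, 0.9, 1.8, 0.3]))
```

Even with a sub-threshold input stream, the membrane potential keeps fluctuating between steps, which is the "active learning state" the paper points to.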
## Method

### Overall Architecture
- Input: LR remote sensing image replicated along the temporal dimension for \(T\) steps (default \(T=4\))
- Shallow Feature Extraction: \(3\times3\) convolution
- Deep Feature Extraction: \(m\) Spiking Attention Groups (SAGs), each containing \(n\) SABs with residual connections
- Fusion: Fusion Block (FB) converts discrete spike sequences into continuous-valued features
- Reconstruction: PixelShuffle + \(3\times3\) convolution to generate the SR output
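The pipeline above can be traced at the shape level. In this sketch all learned layers are stand-ins (random 1×1 projections and a tanh nonlinearity), so it only illustrates how temporal replication, fusion, and PixelShuffle transform tensor shapes, not the actual SpikeSR weights.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r^2, H, W) -> (C, H*r, W*r), as in sub-pixel upsampling."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)       # (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

def spikesr_forward(lr, T=4, scale=4, channels=48):
    """Shape-level walkthrough of the SpikeSR pipeline (all convs stubbed)."""
    H, W = lr.shape[1:]
    t_in = np.repeat(lr[None], T, axis=0)                 # replicate along time: (T, 3, H, W)
    rng = np.random.default_rng(0)
    w_shallow = rng.standard_normal((channels, 3))        # 1x1 stand-in for the 3x3 conv
    feat = np.einsum('oc,tchw->tohw', w_shallow, t_in)    # shallow features: (T, C, H, W)
    feat = feat + 0.1 * np.tanh(feat)                     # stand-in for m SAGs of n SABs
    fused = feat.mean(axis=0)                             # stand-in for the Fusion Block
    w_up = rng.standard_normal((3 * scale**2, channels))  # pre-shuffle conv stand-in
    up = np.einsum('oc,chw->ohw', w_up, fused)            # (3*scale^2, H, W)
    return pixel_shuffle(up, scale)                       # (3, H*scale, W*scale)

sr = spikesr_forward(np.random.randn(3, 16, 16))          # -> shape (3, 64, 64)
```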
### Key Designs
- Spiking Attention Block (SAB):
- Function: Optimizes feature representation within the SNN framework
- Mechanism: Dual-branch parallel structure — Branch 1 uses two stacked SCBs (SNN Convolutional Blocks: LIF neuron → spiking convolution → tdBN); Branch 2 uses a standard CNN convolution. The two branches are summed and processed through HDA and DSA with a residual connection: \(\mathbf{X}^{t,n} = \mathbf{X}^{t,n-1} + \text{DSA}(\text{HDA}(\bar{\mathbf{X}}_1^{t,n} + \bar{\mathbf{X}}_2^{t,n}))\)
- Design Motivation: The CNN branch compensates for information loss caused by binary SNN signals (a core challenge in SNN-based SR); the attention modules optimize membrane potentials to make spiking activity more informative.
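A minimal sketch of the SAB computation above, with the SCB stack reduced to Heaviside spiking plus caller-supplied convolutions (tdBN omitted); `conv_snn1`, `conv_snn2`, `conv_cnn`, `hda`, and `dsa` are hypothetical placeholders for the paper's learned modules.

```python
import numpy as np

def heaviside(x, v_th=1.0):
    """Binary spike activation (surrogate-gradient training omitted)."""
    return (x >= v_th).astype(x.dtype)

def sab(x, conv_snn1, conv_snn2, conv_cnn, hda, dsa):
    """One Spiking Attention Block: dual-branch sum, attention, residual.

    x: features at one time step, shape (C, H, W).
    """
    # Branch 1: two stacked SCBs (LIF spike -> spiking conv; tdBN omitted).
    b1 = conv_snn2(heaviside(conv_snn1(heaviside(x))))
    # Branch 2: plain CNN convolution compensating for binary information loss.
    b2 = conv_cnn(x)
    # X^{t,n} = X^{t,n-1} + DSA(HDA(branch1 + branch2))
    return x + dsa(hda(b1 + b2))
```

With all submodules stubbed to zero, the block reduces to the identity, which is the residual property the update equation guarantees.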
- Hybrid Dimensional Attention (HDA):
- Function: Jointly modulates spike responses along the temporal and channel dimensions
- Mechanism: Employs temporal-channel joint attention (TJCA). Unlike prior approaches that treat temporal and channel attention independently, HDA bridges dependencies across both dimensions, enabling joint feature correlation learning.
- Design Motivation: SNN spike signals inherently possess a temporal dimension (\(T\) time steps), necessitating selective enhancement of useful signals simultaneously across temporal and channel dimensions.
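One plausible reading of TJCA is a squeeze-and-excitation-style gate computed over the flattened temporal-channel descriptor, so the two dimensions interact inside a shared MLP rather than being attended separately; the weight shapes `w1`/`w2` below are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tjca(x, w1, w2):
    """Temporal-channel joint attention (TJCA) sketch.

    x: spike features of shape (T, C, H, W). Spatially pool to a joint
    T*C descriptor, mix it with a small MLP so temporal and channel
    statistics interact, then rescale x with the resulting gate.
    """
    T, C, H, W = x.shape
    d = x.mean(axis=(2, 3)).reshape(T * C)     # joint temporal-channel descriptor
    a = sigmoid(w2 @ np.tanh(w1 @ d))          # (T*C,) gate in (0, 1)
    return x * a.reshape(T, C, 1, 1)           # broadcast over spatial dims
```

Because the MLP sees the full T*C vector at once, a weight in `w1` can couple any time step with any channel, which is the "joint" part that separate temporal and channel attention branches lack.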
- Deformable Similarity Attention (DSA):
- Function: Exploits global self-similarity in remote sensing images as an SR prior
- Mechanism: (1) Multi-scale feature pyramid via bilinear interpolation downsampling; (2) Patch-level self-similarity computation: average-pool each patch → reshape → dot-product similarity matrix → concatenate multi-scale similarity scores; (3) Deformable convolution to correct geometric misalignment among the most similar patches: \(\mathbf{F}^D(p_0) = \sum_{p_m \in \mathcal{R}} \omega(p_m) \cdot \mathbf{F}(p_0 + p_m + \Delta p_m)\); (4) Cross-attention fusion: \(Q\) from deformed features, \(K, V\) from original features.
- Design Motivation: Remote sensing images exhibit repetitive patterns of the same scene type (e.g., building clusters, farmland) across different spatial locations, making self-similarity a strong prior. However, exhaustive pixel-wise non-local attention is computationally prohibitive; patch-level operations are both efficient and effective. Deformable convolution handles geometric transformations between matched patches.
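Steps (1)-(2) of the DSA pipeline can be sketched as follows for a single scale; the multi-scale pyramid, deformable alignment, and cross-attention fusion are omitted, and the cosine normalization of descriptors is an assumption for numerical stability.

```python
import numpy as np

def patch_similarity(feat, p=4):
    """Patch-level self-similarity (single scale; deformable alignment and
    cross-attention fusion omitted).

    feat: (C, H, W) with H, W divisible by p.
    Returns, for each patch, the index of its most similar other patch.
    """
    C, H, W = feat.shape
    gh, gw = H // p, W // p
    # Average-pool each p x p patch down to a C-dim descriptor.
    d = feat.reshape(C, gh, p, gw, p).mean(axis=(2, 4))       # (C, gh, gw)
    d = d.reshape(C, gh * gw).T                               # (N, C), N patches
    d = d / (np.linalg.norm(d, axis=1, keepdims=True) + 1e-8)
    sim = d @ d.T                                             # (N, N) dot-product similarity
    np.fill_diagonal(sim, -np.inf)                            # exclude trivial self-matches
    return sim.argmax(axis=1)                                 # best match per patch
```

In the full method this argmax is relaxed with Gumbel-Softmax during training, and the matched patches are then geometrically aligned by the deformable convolution before cross-attention fusion.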
- Fusion Block (FB):
- Function: Adaptively aggregates discrete spike sequences into continuous pixel values
- Mechanism: First, temporal attention-weighted aggregation: \(\mathbf{Y}_1 = \sigma(\text{TA}(\mathbf{Y})) \otimes \mathbf{Y}\); then spatial attention processes residual information: \(\mathbf{Y}_2 = \sigma(\text{SA}(\mathbf{Y})) \otimes (1 - \mathbf{Y}_1)\); final output: \(\mathbf{Y}_1 + \mathbf{Y}_2\).
- Design Motivation: Naïve temporal averaging retains only first-order statistics; adaptive attention weighting preserves richer spatial-temporal details.
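A shape-level sketch of the FB equations, with the learned TA/SA layers replaced by mean-pooled logits; it reproduces only the weighting structure \(\mathbf{Y}_1 + \sigma(\text{SA}(\mathbf{Y})) \otimes (1 - \mathbf{Y}_1)\), not the paper's actual attention layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fusion_block(y):
    """Fusion Block (FB) sketch: aggregate spike features over time.

    y: (T, C, H, W). TA/SA stand-ins use mean-pooled logits in place of
    learned layers; the weighting structure follows the FB equations.
    """
    ta = y.mean(axis=(1, 2, 3), keepdims=True)     # per-step temporal logit: (T,1,1,1)
    y1 = (sigmoid(ta) * y).sum(axis=0)             # attention-weighted temporal aggregation
    sa = y.mean(axis=(0, 1), keepdims=True)[0]     # spatial logit map: (1, H, W)
    y2 = sigmoid(sa) * (1.0 - y1)                  # spatial attention on the residual
    return y1 + y2                                 # continuous-valued output: (C, H, W)
```

Unlike a plain temporal mean, the per-step gates let informative time steps dominate the aggregation, and the residual term recovers detail the temporal branch suppressed.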
### Loss & Training
- L1 loss (pixel-level reconstruction)
- Gumbel-Softmax to relax the non-differentiable argmax in DSA patch matching
- \(T=4\) time steps for training/inference (\(T=1\) used for fair FLOPs comparison)
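The Gumbel-Softmax relaxation used for the DSA argmax can be sketched as follows (forward pass only; in an autograd framework the straight-through trick routes gradients through the soft probabilities):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, hard=True, rng=None):
    """Gumbel-Softmax over 1-D logits (forward pass only)."""
    rng = rng if rng is not None else np.random.default_rng()
    # Sample Gumbel(0, 1) noise and perturb the logits.
    g = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    z = (logits + g) / tau
    soft = np.exp(z - z.max())                 # stabilized softmax
    soft /= soft.sum()
    if not hard:
        return soft
    # Hard one-hot in the forward pass; autograd frameworks would pass
    # gradients through `soft` via the straight-through estimator.
    one_hot = np.zeros_like(soft)
    one_hot[np.argmax(soft)] = 1.0
    return one_hot
```

As the temperature `tau` decreases, the soft distribution concentrates on the argmax, so patch selection becomes progressively more discrete while remaining differentiable during training.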
## Key Experimental Results

### Main Results — Remote Sensing SR Performance (×4)
| Method | Params | FLOPs | AID PSNR | DOTA PSNR | DIOR PSNR | Mean PSNR |
|---|---|---|---|---|---|---|
| EDSR | 1518K | 50.77G | 30.65 | 33.64 | 30.63 | 31.64 |
| SwinIR-light | 897K | 23.56G | 30.83 | 33.94 | 30.85 | 31.87 |
| HiT-SR | 792K | 21.04G | 30.87 | 33.93 | 30.89 | 31.90 |
| Omni-SR | 2803K | 70.98G | 30.89 | 33.94 | 30.89 | 31.91 |
| SpikeSR | 1042K | 33.05G | 30.91 | 33.98 | 30.95 | 31.95 |
| SpikeSR-S | 472K | 15.21G | 30.86 | 33.89 | 30.89 | 31.88 |
### Ablation Study
| Configuration | PSNR↑ | Notes |
|---|---|---|
| Full SpikeSR | 31.95 | Complete model |
| w/o CNN branch | Significant drop | Pure SNN suffers severe information loss |
| w/o HDA | Drop | Temporal-channel joint attention is important |
| w/o DSA | Drop | Global self-similarity prior is critical |
| w/o Deformable Conv | Drop | Geometric correction is necessary for patch matching |
### Key Findings
- SpikeSR comprehensively outperforms ANN methods: Achieves state-of-the-art on all three datasets (AID/DOTA/DIOR); mean PSNR of 31.95 surpasses Omni-SR (31.91) with only 47% of its FLOPs.
- SpikeSR-S approaches SOTA at minimal cost: With only 472K parameters and 15.21G FLOPs, it achieves 31.88 PSNR, competitive with SwinIR-light (31.87/23.56G) but with 35% fewer FLOPs.
- CNN branch is indispensable: Removing the CNN branch (pure SNN) leads to a substantial performance drop, confirming that CNN compensation is necessary to address SNN information loss.
- Value of deformable convolution in DSA: Patch matching without geometric correction introduces hallucinated textures.
## Highlights & Insights
- First successful application of SNNs to pixel-level regression: Prior SNN work mainly targets classification/detection; this paper demonstrates that, through attention-based membrane potential optimization, SNNs can match or exceed ANNs on pixel-level regression (SR), paving the way for applying SNNs to broader low-level vision tasks.
- Observation that spiking signals preserve high-frequency sensitivity: Figure 1a shows that while pixel intensities in degraded images are smooth, LIF neurons still exhibit vigorous fluctuations, providing intuitive justification for SNN-based SR.
- Patch-level non-local attention: Reduces exhaustive pixel-wise non-local attention to patch-level similarity computation followed by deformable convolution correction, substantially lowering computational cost while preserving self-similarity modeling capacity. This design is generalizable to other tasks requiring non-local priors.
- CNN-SNN hybrid architecture: Rather than a pure SNN, the CNN branch pragmatically compensates for SNN information loss—a practical and effective design choice.
## Limitations & Future Work
- Time step configuration: \(T=1\) minimizes FLOPs but sacrifices performance; \(T=4\) achieves SOTA but FLOPs scale linearly with time steps. The paper lacks sufficient analysis of the trade-off across different values of \(T\).
- Remote sensing datasets only: Generalizability is unknown, as the method is not validated on natural image SR benchmarks (e.g., DIV2K, Urban100).
- Absence of energy efficiency quantification: Energy efficiency is claimed as a key advantage of SNNs, but no actual power consumption or deployment results on neuromorphic hardware are reported.
- Only ×4 super-resolution: Performance at other upscaling factors (×2, ×8) is not evaluated.
- Future directions:
- Deploy on neuromorphic chips to empirically verify energy efficiency
- Extend to natural image SR benchmarks
- Explore larger SNN backbones (the current maximum of 1042K parameters is relatively small)
## Related Work & Insights
- vs. SwinIR/HiT-SR: Transformer-based SR methods offer strong global modeling but at high FLOPs. SpikeSR achieves better performance with lower FLOPs via SNN + DSA.
- vs. Efficient SR (IMDN/RFDN/FMEN): These methods reduce CNN overhead through pruning/distillation, but generally achieve lower performance than SpikeSR.
- vs. Direct ANN-to-SNN Conversion: Conversion methods suffer from accuracy gaps and high latency; SpikeSR employs direct training (surrogate gradients + BPTT), yielding superior results.
## Rating
- Novelty: ⭐⭐⭐⭐ First successful application of SNNs to remote sensing SR; SAB/DSA designs are innovative
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparison on 3 remote sensing datasets with detailed ablations and per-scene-category analysis across 30 scene types
- Writing Quality: ⭐⭐⭐⭐ Motivation is clear and method figures are detailed, though some formulations could be more concise
- Value: ⭐⭐⭐⭐ Opens a new direction for SNNs in low-level vision tasks with practical deployment value for the remote sensing community