# Spiking Meets Attention: Efficient Remote Sensing Image Super-Resolution with Attention Spiking Neural Networks
Conference: NeurIPS 2025 arXiv: 2503.04223 Code: https://github.com/XY-boy/SpikeSR Area: Image Restoration Keywords: Spiking Neural Networks, Remote Sensing Super-Resolution, Attention Mechanism, Deformable Similarity Attention, Energy-Efficient AI
## TL;DR
This paper proposes SpikeSR, the first attention-based spiking neural network (SNN) framework for remote sensing image super-resolution. By incorporating Spiking Attention Blocks (SAB) that combine Hybrid Dimensional Attention (HDA) and Deformable Similarity Attention (DSA), SpikeSR achieves state-of-the-art performance on AID/DOTA/DIOR while maintaining high computational efficiency.
## Background & Motivation
Background: High-resolution remote sensing images (RSI) are critical for downstream tasks, yet sensor-imposed resolution limits remain a fundamental constraint. Deep learning-based SR methods (CNN/Transformer) have achieved notable progress but incur substantial computational overhead, making large-scale deployment in remote sensing scenarios difficult.
Limitations of Prior Work:
- CNN-based SR methods (EDSR, RCAN, etc.) focus on network design but exhibit high computational complexity, particularly in exhaustive non-local modeling operations
- Transformer-based SR methods (SwinIR, HiT-SR, etc.) offer global modeling capacity but still carry large parameter counts and FLOPs
- SNNs, as third-generation neural networks, offer inherent energy efficiency advantages but remain almost entirely unexplored for pixel-level regression tasks such as SR
Key Challenge: The binary spike signals of SNNs inevitably cause per-pixel information loss (spiking degradation), and insufficiently optimized membrane potential dynamics limit the representational capacity of SNNs for SR.
Goal:
- Introduce SNNs into remote sensing SR to leverage their energy efficiency advantages
- Optimize membrane potentials via attention mechanisms to enhance SNN representational capacity
- Achieve or surpass ANN-level performance while maintaining low FLOPs
Key Insight: Even in severely degraded remote sensing images, LIF neurons maintain vigorous membrane potential fluctuations (an active learning state), suggesting that SNNs possess an inherent sensitivity to high-frequency information (Figure 1a).
Core Idea: Regulate SNN membrane potentials through attention mechanisms (temporal-channel and deformable spatial), enabling spiking neural networks to achieve state-of-the-art performance in remote sensing SR for the first time with greater efficiency.
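To make the LIF dynamics behind this observation concrete, here is a minimal NumPy sketch of a single leaky integrate-and-fire neuron over \(T\) time steps; the time constant `tau`, threshold `v_th`, and hard-reset behavior are illustrative assumptions, not the paper's exact neuron configuration.

```python
import numpy as np

def lif_neuron(inputs, tau=2.0, v_th=1.0, v_reset=0.0):
    """Simulate one LIF neuron over T time steps (illustrative parameters).

    inputs: array of shape (T,) -- input current at each step.
    Returns (spikes, potentials) as arrays of shape (T,).
    """
    v = v_reset
    spikes, potentials = [], []
    for x in inputs:
        # Leaky integration: the potential decays toward rest while
        # accumulating the input current.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_th:
            spikes.append(1)
            v = v_reset          # hard reset after firing
        else:
            spikes.append(0)
        potentials.append(v)
    return np.array(spikes), np.array(potentials)

spikes, pots = lif_neuron(np.array([1.5, 0.9, 1.8, 0.3]))
```

Even with a sub-threshold input stream, the membrane potential keeps fluctuating between steps, which is the "active learning state" the paper points to.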
## Method

### Overall Architecture
- Input: LR remote sensing image replicated along the temporal dimension for \(T\) steps (default \(T=4\))
- Shallow Feature Extraction: \(3\times3\) convolution
- Deep Feature Extraction: \(m\) Spiking Attention Groups (SAGs), each containing \(n\) SABs with residual connections
- Fusion: Fusion Block (FB) converts discrete spike sequences into continuous-valued features
- Reconstruction: PixelShuffle + \(3\times3\) convolution to generate the SR output
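The pipeline above can be traced at the shape level. In this sketch all learned layers are stand-ins (random 1×1 projections and a tanh nonlinearity), so it only illustrates how temporal replication, fusion, and PixelShuffle transform tensor shapes, not the actual SpikeSR weights.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r^2, H, W) -> (C, H*r, W*r), as in sub-pixel upsampling."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)       # (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

def spikesr_forward(lr, T=4, scale=4, channels=48):
    """Shape-level walkthrough of the SpikeSR pipeline (all convs stubbed)."""
    H, W = lr.shape[1:]
    t_in = np.repeat(lr[None], T, axis=0)                 # replicate along time: (T, 3, H, W)
    rng = np.random.default_rng(0)
    w_shallow = rng.standard_normal((channels, 3))        # 1x1 stand-in for the 3x3 conv
    feat = np.einsum('oc,tchw->tohw', w_shallow, t_in)    # shallow features: (T, C, H, W)
    feat = feat + 0.1 * np.tanh(feat)                     # stand-in for m SAGs of n SABs
    fused = feat.mean(axis=0)                             # stand-in for the Fusion Block
    w_up = rng.standard_normal((3 * scale**2, channels))  # pre-shuffle conv stand-in
    up = np.einsum('oc,chw->ohw', w_up, fused)            # (3*scale^2, H, W)
    return pixel_shuffle(up, scale)                       # (3, H*scale, W*scale)

sr = spikesr_forward(np.random.randn(3, 16, 16))          # -> shape (3, 64, 64)
```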
### Key Designs
- Spiking Attention Block (SAB):
- Function: Optimizes feature representation within the SNN framework
- Mechanism: Dual-branch parallel structure — Branch 1 uses two stacked SCBs (SNN Convolutional Blocks: LIF neuron → spiking convolution → tdBN); Branch 2 uses a standard CNN convolution. The two branches are summed and processed through HDA and DSA with a residual connection: \(\mathbf{X}^{t,n} = \mathbf{X}^{t,n-1} + \text{DSA}(\text{HDA}(\bar{\mathbf{X}}_1^{t,n} + \bar{\mathbf{X}}_2^{t,n}))\)
- Design Motivation: The CNN branch compensates for information loss caused by binary SNN signals (a core challenge in SNN-based SR); the attention modules optimize membrane potentials to make spiking activity more informative.
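A minimal sketch of the SAB computation above, with the SCB stack reduced to Heaviside spiking plus caller-supplied convolutions (tdBN omitted); `conv_snn1`, `conv_snn2`, `conv_cnn`, `hda`, and `dsa` are hypothetical placeholders for the paper's learned modules.

```python
import numpy as np

def heaviside(x, v_th=1.0):
    """Binary spike activation (surrogate-gradient training omitted)."""
    return (x >= v_th).astype(x.dtype)

def sab(x, conv_snn1, conv_snn2, conv_cnn, hda, dsa):
    """One Spiking Attention Block: dual-branch sum, attention, residual.

    x: features at one time step, shape (C, H, W).
    """
    # Branch 1: two stacked SCBs (LIF spike -> spiking conv; tdBN omitted).
    b1 = conv_snn2(heaviside(conv_snn1(heaviside(x))))
    # Branch 2: plain CNN convolution compensating for binary information loss.
    b2 = conv_cnn(x)
    # X^{t,n} = X^{t,n-1} + DSA(HDA(branch1 + branch2))
    return x + dsa(hda(b1 + b2))
```

With all submodules stubbed to zero, the block reduces to the identity, which is the residual property the update equation guarantees.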
- Hybrid Dimensional Attention (HDA):
- Function: Jointly modulates spike responses along the temporal and channel dimensions
- Mechanism: Employs temporal-channel joint attention (TJCA). Unlike prior approaches that treat temporal and channel attention independently, HDA bridges dependencies across both dimensions, enabling joint feature correlation learning.
- Design Motivation: SNN spike signals inherently possess a temporal dimension (\(T\) time steps), necessitating selective enhancement of useful signals simultaneously across temporal and channel dimensions.
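One plausible reading of TJCA is a squeeze-and-excitation-style gate computed over the flattened temporal-channel descriptor, so the two dimensions interact inside a shared MLP rather than being attended separately; the weight shapes `w1`/`w2` below are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tjca(x, w1, w2):
    """Temporal-channel joint attention (TJCA) sketch.

    x: spike features of shape (T, C, H, W). Spatially pool to a joint
    T*C descriptor, mix it with a small MLP so temporal and channel
    statistics interact, then rescale x with the resulting gate.
    """
    T, C, H, W = x.shape
    d = x.mean(axis=(2, 3)).reshape(T * C)     # joint temporal-channel descriptor
    a = sigmoid(w2 @ np.tanh(w1 @ d))          # (T*C,) gate in (0, 1)
    return x * a.reshape(T, C, 1, 1)           # broadcast over spatial dims
```

Because the MLP sees the full T*C vector at once, a weight in `w1` can couple any time step with any channel, which is the "joint" part that separate temporal and channel attention branches lack.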
- Deformable Similarity Attention (DSA):
- Function: Exploits global self-similarity in remote sensing images as an SR prior
- Mechanism: (1) Multi-scale feature pyramid via bilinear interpolation downsampling; (2) Patch-level self-similarity computation: average-pool each patch → reshape → dot-product similarity matrix → concatenate multi-scale similarity scores; (3) Deformable convolution to correct geometric misalignment among the most similar patches: \(\mathbf{F}^D(p_0) = \sum_{p_m \in \mathcal{R}} \omega(p_m) \cdot \mathbf{F}(p_0 + p_m + \Delta p_m)\); (4) Cross-attention fusion: \(Q\) from deformed features, \(K, V\) from original features.
- Design Motivation: Remote sensing images exhibit repetitive patterns of the same scene type (e.g., building clusters, farmland) across different spatial locations, making self-similarity a strong prior. However, exhaustive pixel-wise non-local attention is computationally prohibitive; patch-level operations are both efficient and effective. Deformable convolution handles geometric transformations between matched patches.
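Steps (1)-(2) of the DSA pipeline can be sketched as follows for a single scale; the multi-scale pyramid, deformable alignment, and cross-attention fusion are omitted, and the cosine normalization of descriptors is an assumption for numerical stability.

```python
import numpy as np

def patch_similarity(feat, p=4):
    """Patch-level self-similarity (single scale; deformable alignment and
    cross-attention fusion omitted).

    feat: (C, H, W) with H, W divisible by p.
    Returns, for each patch, the index of its most similar other patch.
    """
    C, H, W = feat.shape
    gh, gw = H // p, W // p
    # Average-pool each p x p patch down to a C-dim descriptor.
    d = feat.reshape(C, gh, p, gw, p).mean(axis=(2, 4))       # (C, gh, gw)
    d = d.reshape(C, gh * gw).T                               # (N, C), N patches
    d = d / (np.linalg.norm(d, axis=1, keepdims=True) + 1e-8)
    sim = d @ d.T                                             # (N, N) dot-product similarity
    np.fill_diagonal(sim, -np.inf)                            # exclude trivial self-matches
    return sim.argmax(axis=1)                                 # best match per patch
```

In the full method this argmax is relaxed with Gumbel-Softmax during training, and the matched patches are then geometrically aligned by the deformable convolution before cross-attention fusion.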
- Fusion Block (FB):
- Function: Adaptively aggregates discrete spike sequences into continuous pixel values
- Mechanism: First, temporal attention-weighted aggregation: \(\mathbf{Y}_1 = \sigma(\text{TA}(\mathbf{Y})) \otimes \mathbf{Y}\); then spatial attention processes residual information: \(\mathbf{Y}_2 = \sigma(\text{SA}(\mathbf{Y})) \otimes (1 - \mathbf{Y}_1)\); final output: \(\mathbf{Y}_1 + \mathbf{Y}_2\).
- Design Motivation: Naïve temporal averaging retains only first-order statistics; adaptive attention weighting preserves richer spatial-temporal details.
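A shape-level sketch of the FB equations, with the learned TA/SA layers replaced by mean-pooled logits; it reproduces only the weighting structure \(\mathbf{Y}_1 + \sigma(\text{SA}(\mathbf{Y})) \otimes (1 - \mathbf{Y}_1)\), not the paper's actual attention layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fusion_block(y):
    """Fusion Block (FB) sketch: aggregate spike features over time.

    y: (T, C, H, W). TA/SA stand-ins use mean-pooled logits in place of
    learned layers; the weighting structure follows the FB equations.
    """
    ta = y.mean(axis=(1, 2, 3), keepdims=True)     # per-step temporal logit: (T,1,1,1)
    y1 = (sigmoid(ta) * y).sum(axis=0)             # attention-weighted temporal aggregation
    sa = y.mean(axis=(0, 1), keepdims=True)[0]     # spatial logit map: (1, H, W)
    y2 = sigmoid(sa) * (1.0 - y1)                  # spatial attention on the residual
    return y1 + y2                                 # continuous-valued output: (C, H, W)
```

Unlike a plain temporal mean, the per-step gates let informative time steps dominate the aggregation, and the residual term recovers detail the temporal branch suppressed.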
### Loss & Training
- L1 loss (pixel-level reconstruction)
- Gumbel-Softmax to relax the non-differentiable argmax in DSA patch matching
- \(T=4\) time steps for training/inference (\(T=1\) used for fair FLOPs comparison)
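The Gumbel-Softmax relaxation used for the DSA argmax can be sketched as follows (forward pass only; in an autograd framework the straight-through trick routes gradients through the soft probabilities):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, hard=True, rng=None):
    """Gumbel-Softmax over 1-D logits (forward pass only)."""
    rng = rng if rng is not None else np.random.default_rng()
    # Sample Gumbel(0, 1) noise and perturb the logits.
    g = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    z = (logits + g) / tau
    soft = np.exp(z - z.max())                 # stabilized softmax
    soft /= soft.sum()
    if not hard:
        return soft
    # Hard one-hot in the forward pass; autograd frameworks would pass
    # gradients through `soft` via the straight-through estimator.
    one_hot = np.zeros_like(soft)
    one_hot[np.argmax(soft)] = 1.0
    return one_hot
```

As the temperature `tau` decreases, the soft distribution concentrates on the argmax, so patch selection becomes progressively more discrete while remaining differentiable during training.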
## Key Experimental Results

### Main Results — Remote Sensing SR Performance (×4)
| Method | Params | FLOPs | AID PSNR | DOTA PSNR | DIOR PSNR | Mean PSNR |
|---|---|---|---|---|---|---|
| EDSR | 1518K | 50.77G | 30.65 | 33.64 | 30.63 | 31.64 |
| SwinIR-light | 897K | 23.56G | 30.83 | 33.94 | 30.85 | 31.87 |
| HiT-SR | 792K | 21.04G | 30.87 | 33.93 | 30.89 | 31.90 |
| Omni-SR | 2803K | 70.98G | 30.89 | 33.94 | 30.89 | 31.91 |
| SpikeSR | 1042K | 33.05G | 30.91 | 33.98 | 30.95 | 31.95 |
| SpikeSR-S | 472K | 15.21G | 30.86 | 33.89 | 30.89 | 31.88 |
### Ablation Study
| Configuration | PSNR↑ | Notes |
|---|---|---|
| Full SpikeSR | 31.95 | Complete model |
| w/o CNN branch | Significant drop | Pure SNN suffers severe information loss |
| w/o HDA | Drop | Temporal-channel joint attention is important |
| w/o DSA | Drop | Global self-similarity prior is critical |
| w/o Deformable Conv | Drop | Geometric correction is necessary for patch matching |
### Key Findings
- SpikeSR comprehensively outperforms ANN methods: Achieves state-of-the-art on all three datasets (AID/DOTA/DIOR); mean PSNR of 31.95 surpasses Omni-SR (31.91) with only 47% of its FLOPs.
- SpikeSR-S approaches SOTA at minimal cost: With only 472K parameters and 15.21G FLOPs, it achieves 31.88 PSNR, competitive with SwinIR-light (31.87/23.56G) but with 35% fewer FLOPs.
- CNN branch is indispensable: Removing the CNN branch (pure SNN) leads to a substantial performance drop, confirming that CNN compensation is necessary to address SNN information loss.
- Value of deformable convolution in DSA: Patch matching without geometric correction introduces hallucinated textures.
## Highlights & Insights
- First successful application of SNNs to pixel-level regression: Prior SNN work mainly targets classification/detection; this paper demonstrates that, through attention-based membrane potential optimization, SNNs can match or exceed ANNs on pixel-level regression (SR), paving the way for applying SNNs to broader low-level vision tasks.
- Observation that spiking signals preserve high-frequency sensitivity: Figure 1a shows that while pixel intensities in degraded images are smooth, LIF neurons still exhibit vigorous fluctuations, providing intuitive justification for SNN-based SR.
- Patch-level non-local attention: Reduces exhaustive pixel-wise non-local attention to patch-level similarity computation followed by deformable convolution correction, substantially lowering computational cost while preserving self-similarity modeling capacity. This design is generalizable to other tasks requiring non-local priors.
- CNN-SNN hybrid architecture: Rather than a pure SNN, the CNN branch pragmatically compensates for SNN information loss—a practical and effective design choice.
## Limitations & Future Work
- Time step configuration: \(T=1\) minimizes FLOPs but sacrifices performance; \(T=4\) achieves SOTA but FLOPs scale linearly with time steps. The paper lacks sufficient analysis of the trade-off across different values of \(T\).
- Remote sensing datasets only: Generalizability is unknown, as the method is not validated on natural image SR benchmarks (e.g., DIV2K, Urban100).
- Absence of energy efficiency quantification: Energy efficiency is claimed as a key advantage of SNNs, but no actual power consumption or deployment results on neuromorphic hardware are reported.
- Only ×4 super-resolution: Performance at other upscaling factors (×2, ×8) is not evaluated.
- Future directions:
- Deploy on neuromorphic chips to empirically verify energy efficiency
- Extend to natural image SR benchmarks
- Explore larger SNN backbones (the current maximum of 1042K parameters is relatively small)
## Related Work & Insights
- vs. SwinIR/HiT-SR: Transformer-based SR methods offer strong global modeling but at high FLOPs. SpikeSR achieves better performance with lower FLOPs via SNN + DSA.
- vs. Efficient SR (IMDN/RFDN/FMEN): These methods reduce CNN overhead through pruning/distillation, but generally achieve lower performance than SpikeSR.
- vs. Direct ANN-to-SNN Conversion: Conversion methods suffer from accuracy gaps and high latency; SpikeSR employs direct training (surrogate gradients + BPTT), yielding superior results.
## Rating
- Novelty: ⭐⭐⭐⭐ First successful application of SNNs to remote sensing SR; SAB/DSA designs are innovative
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive comparison on 3 remote sensing datasets with detailed ablations and per-scene-category analysis across 30 scene types
- Writing Quality: ⭐⭐⭐⭐ Motivation is clear and method figures are detailed, though some formulations could be more concise
- Value: ⭐⭐⭐⭐ Opens a new direction for SNNs in low-level vision tasks with practical deployment value for the remote sensing community