QMambaBSR: Burst Image Super-Resolution with Query State Space Model
Conference: CVPR 2025
arXiv: 2408.08665
Code: None
Area: Image Super-Resolution / Multi-Frame Fusion
Keywords: Burst Image Super-Resolution, State Space Models, Sub-pixel Extraction, Adaptive Upsampling, Multi-frame Denoising
TL;DR
QMambaBSR uses a Query State Space Model (QSSM) to jointly extract sub-pixel information and suppress noise through inter-frame querying and intra-frame scanning. Combined with an adaptive upsampling module, it achieves SOTA performance on both synthetic and real burst image super-resolution benchmarks.
Background & Motivation
Burst Image Super-Resolution (BurstSR) aims to reconstruct a high-resolution image by fusing sub-pixel information from multiple handheld low-resolution frames, an important technique for overcoming the limitations of smartphone sensors and lenses. This area faces two core challenges:
- Difficulty in separating sub-pixels from noise: Burst RAW images contain both useful sub-pixel information and high-frequency random noise. Existing methods (e.g., weighted fusion, frame-by-frame cross-attention) process frames individually, failing to effectively exploit the key characteristic that "sub-pixels exhibit consistent spatial distribution across multiple frames, whereas noise appears randomly," leading to inaccurate extraction.
- Static upsampling cannot adapt to the scene: Existing SOTA methods (e.g., Burstormer, BIPNet) use fixed interpolation, transposed convolution, or PixelShuffle for upsampling. They cannot perceive the spatial distribution characteristics of sub-pixels in different scenes, leading to over-smoothed details.
The core observation of this paper is that effective sub-pixels maintain consistent intensity at corresponding locations across all frames, whereas noise occurs randomly only in specific frames. Therefore, simultaneously considering the entire burst sequence for fusion allows for more reliable extraction of consistent sub-pixels and suppression of noisy outliers.
Method
Overall Architecture
The pipeline of QMambaBSR consists of three stages: (1) Alignment stage: existing alignment modules align the current frames with the reference frame; (2) Fusion stage: the QSSM module performs joint sub-pixel extraction with inter-frame query and intra-frame scanning, and the Multi-Scale Fusion Module (MSFM) integrates sub-pixel information across scales; (3) Upsampling stage: the Adaptive Upsampling module (AdaUp) dynamically adjusts the upsampling kernel according to scene features to reconstruct high-quality high-resolution images.
Key Designs
- Query State Space Model (QSSM):
    - Function: Simultaneously extract sub-pixel information that matches the reference frame from all current frames, while suppressing random noise.
    - Mechanism: Modify the control matrix \(B\) and step size \(\Delta\) in the SSM so that they are generated by the reference frame instead of the input frames. Specifically, the reference frame is first concatenated with all current frames along the channel dimension and fused via an MLP (preliminary denoising). Then, the reference frame generates \(\Delta_{base}\) and \(B_{base}\) through a linear layer, acting as gating signals to control the impact of current frame features on the state. All current frame features are concatenated along the channel dimension and processed uniformly through a linear layer, enabling the reference frame to query all current frames at once. QSSM incorporates four-directional scanning and channel attention.
    - Design Motivation: Traditional cross-attention can only perform query operations frame-by-frame and position-by-position (\(O(N^2)\) complexity). In contrast, QSSM concurrently achieves inter-frame query and intra-frame information interaction through the recurrent structure of SSM, yielding lower complexity. Position \(t\) of the reference frame not only queries current frame information at its own position but also guides queries at neighboring positions via the forget/input gates, forming a progressively decaying receptive field.
- Multi-Scale Fusion Module (MSFM):
    - Function: Fuse sub-pixel information across different scales to enhance detail reconstruction capabilities.
    - Mechanism: A three-branch parallel design: a \(3 \times 3\) convolution processes local sub-pixel features, an SSM (horizontal + vertical scans) processes axial global features, and a channel Transformer enhances global perception. The outputs of the three branches are combined as a weighted sum with learnable weights.
    - Design Motivation: The decay characteristic of the SSM's \(A\) matrix limits its long-range perception, which is compensated for by the Transformer; local convolution captures fine-grained textures. The three branches complement each other to cover various scales.
- Adaptive Upsampling Module (AdaUp):
    - Function: Dynamically adjust the upsampling kernel according to the spatial distribution of sub-pixels in the current scene.
    - Mechanism: The channel-wise sub-pixel distribution \(L\) of the input features is first captured via adaptive pooling, and the output-channel distribution \(L_1\) is then obtained from it via a \(1 \times 1\) convolution. Both distributions are applied to the transposed convolution kernel \(W\) through broadcasting and element-wise multiplication: \(W_f = (W \odot L) \odot L_1\). Finally, transposed convolution upsampling is performed with the modulated kernel \(W_f\).
    - Design Motivation: Static upsampling kernels cannot adapt to the varying spatial arrangements of sub-pixels in different burst scenes. AdaUp makes the kernel scene-aware to better leverage sub-pixel information for detail reconstruction.
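
To make the QSSM mechanism from the first item above concrete, the following is a minimal NumPy sketch of a single-direction query scan, not the authors' implementation. It assumes the Mamba-style discretization \(\bar{A} = \exp(\Delta A)\), \(\bar{B} = \Delta B\), a softplus on \(\Delta\), and an output matrix \(C\) computed from the current-frame features; the function name `query_scan` and all weight names are hypothetical. The only point it illustrates is that \(\Delta\) and \(B\) are computed from the reference tokens while the state is driven by the current-frame tokens.

```python
import numpy as np

def softplus(z):
    # Numerically stable softplus; keeps the step size Delta positive.
    return np.logaddexp(0.0, z)

def query_scan(ref, cur, W_delta, W_B, W_C, A):
    """Single-direction scan in which the reference frame drives Delta and B.

    ref: (T, D) reference-frame tokens.
    cur: (T, D) fused current-frame tokens.
    W_delta: (D, D), W_B: (D, N), W_C: (D, N) hypothetical linear layers.
    A: (D, N) state matrix with negative entries (decay).
    """
    T, D = cur.shape
    N = A.shape[1]
    delta = softplus(ref @ W_delta)   # (T, D), generated from the reference
    B = ref @ W_B                     # (T, N), generated from the reference
    C = cur @ W_C                     # (T, N), from the input (assumption)
    h = np.zeros((D, N))
    y = np.empty((T, D))
    for t in range(T):
        A_bar = np.exp(delta[t][:, None] * A)       # forget gate in (0, 1)
        B_bar = delta[t][:, None] * B[t][None, :]   # input gate set by the reference
        h = A_bar * h + B_bar * cur[t][:, None]     # state absorbs current-frame features
        y[t] = h @ C[t]
    return y

# Toy usage with random tokens and weights.
rng = np.random.default_rng(0)
T, D, N = 8, 4, 3
ref = rng.normal(size=(T, D))
cur = rng.normal(size=(T, D))
W_delta = rng.normal(size=(D, D))
W_B = rng.normal(size=(D, N))
W_C = rng.normal(size=(D, N))
A = -np.exp(rng.normal(size=(D, N)))
y = query_scan(ref, cur, W_delta, W_B, W_C, A)
```

Because \(B\) depends only on the reference tokens, setting `W_B` to zero closes the input gate and the output is zero whatever the current frames contain. The four-directional scanning and channel attention of the full module are omitted here.
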
Loss & Training
Training setup: Trained from scratch on the Synthetic BurstSR dataset for 300 epochs using the AdamW optimizer (\(\beta_1\)=0.9, \(\beta_2\)=0.999) with cosine annealing scheduling to decay the learning rate from \(3 \times 10^{-4}\) to \(10^{-6}\). The training patch size is 48×48, batch size is 8, burst size is 14, using 8 V100 GPUs. For Real BurstSR, the model is fine-tuned for 60 epochs (lr=\(10^{-6}\), patch=56×56). For RealBSR-RAW/RGB, the model is trained from scratch for 100 epochs (patch=80×80).
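
The cosine annealing schedule above follows the standard closed form \(\eta_t = \eta_{min} + \tfrac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\pi t / T))\). A minimal sketch for the 300-epoch synthetic run, assuming the rate is updated once per epoch (the source does not state the step granularity) and using the hypothetical helper name `cosine_lr`:

```python
import math

def cosine_lr(epoch, total_epochs=300, lr_max=3e-4, lr_min=1e-6):
    """Closed-form cosine annealing from lr_max (epoch 0) to lr_min (last epoch)."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term

# Learning rate at the start, midpoint, and end of training.
schedule = [cosine_lr(e) for e in (0, 150, 300)]
```

The schedule starts at \(3 \times 10^{-4}\), passes through the midpoint of the two rates at epoch 150, and ends at \(10^{-6}\).
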
Key Experimental Results
Main Results
| Dataset | Metric | QMambaBSR | Burstormer (Prev. SOTA) | Gain |
|---|---|---|---|---|
| Synthetic BurstSR (\(\times 4\)) | PSNR | 43.12 | 42.83 | +0.29 dB |
| Synthetic BurstSR (\(\times 4\)) | SSIM | 0.97 | 0.97 | - |
| RealBSR-RAW (\(\times 4\)) | PSNR | 27.558 | 27.290 | +0.268 dB |
| RealBSR-RAW (\(\times 4\)) | SSIM | 0.820 | 0.816 | +0.004 |
| RealBSR-RAW (\(\times 4\)) | L-PSNR | 32.791 | 32.533 | +0.258 dB |
| RealBSR-RGB (\(\times 4\)) | PSNR | 31.401 | 31.197 | +0.204 dB |
| RealBSR-RGB (\(\times 4\)) | SSIM | 0.908 | 0.907 | +0.001 |
Ablation Study
| Configuration | PSNR↑ | SSIM↑ | Description |
|---|---|---|---|
| Baseline (w/o modules) | 39.81 | 0.93 | Base network |
| +MSFM | 41.15 | 0.94 | +1.34 dB |
| +MSFM+QSSM | 41.87 | 0.96 | +0.72 dB |
| +MSFM+QSSM+AdaUp | 42.13 | 0.96 | +0.26 dB |

| Fusion Method | PSNR↑ | Description |
|---|---|---|
| Concat | 39.85 | Simple concatenation |
| PBFF (BIPNet) | 40.57 | Channel fusion |
| NRFE (Burstormer) | 41.72 | Neighborhood interaction |
| QSSM+MSFM (Ours) | 42.13 | +0.41 dB vs NRFE |
Key Findings
- MSFM contributes the most (+1.34 dB), indicating that multi-scale sub-pixel fusion is key.
- QSSM yields an additional improvement of 0.72 dB on top of MSFM, verifying the effectiveness of joint inter-frame querying.
- AdaUp yields a 0.16 dB improvement compared to PixelShuffle, showing a clear advantage of adaptive upsampling.
- Internal ablation of MSFM shows that the combination of the three branches (Conv+SSM+Transformer) achieves the best performance (+0.56 dB compared to Conv-only).
- In the user study (20 volunteers), QMambaBSR receives an average score of 8.56/10, higher than both Burstormer and BIPNet.
Highlights & Insights
- Elegant query mechanism design of QSSM: By modifying the sources of SSM parameters \(B\) and \(\Delta\) (generating them from the reference frame), SSM is transformed from an autoregressive model into a cross-sequence query model, representing an innovative application of SSM in multi-frame tasks.
- Natural exploitation of "sub-pixel consistency vs. noise randomness": Joint multi-frame querying inherently utilizes this prior knowledge.
- Three-branch design of MSFM: Balances local (CNN), axial (SSM), and global (Transformer) receptive fields.
- Simple and effective AdaUp: Achieves scene-adaptive upsampling solely through channel-wise distribution modulation.
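
The channel-wise modulation behind AdaUp, \(W_f = (W \odot L) \odot L_1\) from Key Designs, can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' code: it assumes the adaptive pooling is a global average to \(1 \times 1\), that the kernel uses the `(C_in, C_out, k, k)` layout, and that the \(1 \times 1\) convolution reduces to a plain matrix `P`; the names `adaup_kernel`, `conv_transpose2d`, and `P` are hypothetical.

```python
import numpy as np

def adaup_kernel(x, W, P):
    """Modulate a transposed-conv kernel with channel statistics of x.

    x: (C_in, H, W_in) features; W: (C_in, C_out, k, k); P: (C_out, C_in).
    """
    L = x.mean(axis=(1, 2))   # (C_in,) input-channel distribution (assumed global average pooling)
    L1 = P @ L                # (C_out,) a 1x1 conv on a 1x1 map is a matrix product
    # Broadcast L over input channels and L1 over output channels.
    return W * L[:, None, None, None] * L1[None, :, None, None]

def conv_transpose2d(x, W, stride):
    """Naive transposed convolution: each input pixel stamps a (C_out, k, k) patch."""
    C_in, H, W_in = x.shape
    _, C_out, k, _ = W.shape
    out = np.zeros((C_out, (H - 1) * stride + k, (W_in - 1) * stride + k))
    for i in range(H):
        for j in range(W_in):
            patch = np.einsum('c,cokl->okl', x[:, i, j], W)
            out[:, i * stride:i * stride + k, j * stride:j * stride + k] += patch
    return out

# Toy usage: x4 upsampling of a 5x5 feature map with a 4x4 kernel.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 5, 5))
W = rng.normal(size=(3, 2, 4, 4))
P = rng.normal(size=(2, 3))
W_f = adaup_kernel(x, W, P)
y = conv_transpose2d(x, W_f, stride=4)
```

With a stride of 4 and a \(4 \times 4\) kernel, the \(5 \times 5\) input maps to a \(20 \times 20\) output. The static kernel `W` is shared across scenes, while `W_f` changes with the channel statistics of each input.
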
Limitations & Future Work
- The current approach mainly focuses on the fusion and upsampling stages; the alignment stage still relies on existing methods, leaving room for further optimization.
- The computational cost of training for 300 epochs on 8 V100 GPUs is relatively high.
- The method is only validated on \(\times 4\) super-resolution; its applicability to other scale factors is not fully demonstrated.
- The authors plan to apply SSMs to the alignment stage and extend the framework to other multi-frame restoration tasks such as burst denoising, HDR, etc.
Related Work & Insights
- vs Burstormer: Burstormer uses cross-attention to extract sub-pixels frame-by-frame, whereas QSSM can query all frames simultaneously and perform intra-frame information interaction, which is more efficient.
- vs BIPNet: BIPNet facilitates inter-frame information flow through channel shuffling but lacks an explicit sub-pixel extraction mechanism.
- vs RBSR: RBSR uses RNNs for frame-by-frame fusion, failing to distinguish between the distinct characteristics of sub-pixels and noise.
- vs MambaIR: MambaIR is the first to apply SSMs to image restoration but only handles single frames; QMambaBSR extends SSMs to multi-frame query scenarios.
Rating
- Novelty: ⭐⭐⭐⭐ QSSM innovatively adapts SSM into a cross-frame query model, and the channel-level kernel modulation of AdaUp is also a novel design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive validation conducted on four benchmarks (synthetic + real) with extensive ablation studies (module-level + component-level + user study).
- Writing Quality: ⭐⭐⭐⭐ The methodological derivations are clear, mathematical formulations are detailed, and the comparative analysis against cross-attention is highly convincing.
- Value: ⭐⭐⭐⭐ Achieves SOTA results across the evaluated benchmarks on the important problem of burst image super-resolution, providing valuable insights for applying SSMs to multi-frame tasks.