VSRM: A Robust Mamba-Based Framework for Video Super-Resolution
Conference: ICCV 2025
arXiv: 2506.22762
Code: N/A
Area: Image Restoration / Video Super-Resolution
Keywords: Video Super-Resolution, Mamba, State Space Model, Deformable Alignment, Frequency Loss
TL;DR
This work is the first to introduce Mamba into video super-resolution, proposing the Dual Aggregation Mamba Block (DAMB) for long-range spatiotemporal dependency modeling, the Deformable Cross-Mamba Alignment module (DCA) for more flexible inter-frame alignment, and the Frequency Charbonnier-like Loss (FCL) for improved high-frequency detail recovery, achieving state-of-the-art results on REDS4, Vid4, and Vimeo-90K.
Background & Motivation
Video super-resolution (VSR) aims to generate high-resolution frames from low-resolution video by exploiting complementary multi-frame information. Current approaches are primarily CNN- or Transformer-based:
- CNN-based methods (e.g., BasicVSR) are constrained by local receptive fields and cannot effectively capture long-range inter-frame information.
- Transformer-based methods (e.g., PSRT, IART) offer powerful attention mechanisms, but the quadratic complexity of full attention is impractical for long sequences, and window attention reduces complexity only at the cost of a limited receptive field.
- Alignment modules: existing methods commonly use bilinear/nearest-neighbor interpolation for spatial alignment, where fixed weights cause feature distortion; IART proposes attention-based implicit interpolation but computes within fixed reference windows, limiting flexibility.
- Loss functions: pixel-level losses produce over-smoothed results, perceptual losses introduce greater distortion, and the discrepancy between the reconstruction and the ground truth (GT) is particularly pronounced in the frequency domain.
Mamba's linear complexity, long-sequence modeling capability, and data-dependent parameterization make it well-suited for VSR. This paper is the first to explore Mamba in this setting.
Method
Overall Architecture
VSRM consists of two components: a feature extractor (Conv2d + Feature Propagation Block, FPB) and an upsampler (Reconstruction module). The FPB contains the Deformable Cross-Mamba Alignment (DCA) module and the Dual Aggregation Mamba Block (DAMB). The pipeline first aligns neighboring frame features with DCA, then extracts deep spatiotemporal features with DAMB, and finally generates the high-resolution output via the upsampler.
Key Designs
- Dual Aggregation Mamba Block (DAMB): Composed of \(N\) S2T-Mamba blocks (S2TMBs) and one T2S-Mamba block (T2SMB), jointly modeling long-range dependencies in both the spatial and temporal dimensions.
  - S2T-Mamba (Spatial-to-Temporal): Flattens the 3D video sequence into a 1D sequence with a spatial-first, temporal-second scan order, processed by bidirectional (forward and backward) SSMs. Formula: \(\text{S2T-Mamba}(x,z) = \text{Linear}(x_1 \odot z + x_2 \odot z)\), where \(x_1\) and \(x_2\) denote the outputs of the forward and backward scans and \(z\) is the gating branch.
  - T2S-Mamba (Temporal-to-Spatial): Uses a temporal-first, spatial-second scan order with a unidirectional (forward-only) scan. Experiments show that S2TMB is biased toward spatial information, while T2SMB explicitly prioritizes temporal information, making the two complementary.
  - TGFN (Temporal-Gated Feed-forward Network): Replaces the standard FFN with 3D depthwise separable convolutions and a gating mechanism to better model spatiotemporal neighborhood relationships and control information flow.
- Deformable Cross-Mamba Alignment (DCA): Optical flow is estimated with SpyNet, and a deformable window scheme is introduced in the compensation stage. The core steps are:
  - For each target pixel, the corresponding sampling location in the reference frame is found via optical flow.
  - A window \(w\) is constructed around the sampling location, and a reference region \(r\) is initialized.
  - A learnable offset network \(\mathcal{S}(w)\) predicts offsets \(\epsilon_r\), giving a dynamic reference region \(\bar{r} = \phi(w; r + \epsilon_r)\).
  - A cross-Mamba module fuses the reference features \(R\) and target features \(Q\) to complete the alignment: \(\bar{X}(x,y) = \text{cross-mamba}(R, Q)\), based on the SSM recurrence \(H_t = \bar{A}_R H_{t-1} + \bar{B}_R \bar{R}_t\), \(\bar{X}_t = C_Q H_t\).
- Frequency Charbonnier-like Loss (FCL): A Charbonnier loss is computed separately on the real and imaginary parts of the FFT-transformed images, rather than on amplitude and phase, which avoids the numerical instabilities of the square-root and arctan operations.

  \[\mathcal{L}_{FCL} = \sum_{i \in \{Re, Im\}} \lambda_i \sqrt{\|i\mathcal{F}(\mathbf{I}_{SR}) - i\mathcal{F}(\mathbf{I}_{HR})\|^2 + \epsilon^2}\]

  Here \(i\mathcal{F}(\cdot)\) with \(i \in \{Re, Im\}\) denotes the real or imaginary part of the Fourier transform \(\mathcal{F}\).
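The cross-Mamba recurrence given in the DCA description, \(H_t = \bar{A}_R H_{t-1} + \bar{B}_R \bar{R}_t\), \(\bar{X}_t = C_Q H_t\), can be illustrated with a minimal NumPy sketch. It assumes a diagonal state transition, a zero initial state, and that the discretized parameters are already given as arrays; how they are produced from the reference and target features is not covered in this summary. The function name `cross_ssm_scan` and all shapes are hypothetical, not taken from the paper's code.

```python
import numpy as np

def cross_ssm_scan(a_bar, b_bar, r, c_q):
    """Sequential scan of H_t = A_t * H_{t-1} + B_t * r_t, x_t = H_t @ c_t.

    a_bar, b_bar: (L, D, N) reference-conditioned parameters (diagonal A assumed)
    r:            (L, D)    reference tokens
    c_q:          (L, N)    target (query)-conditioned output projection
    returns:      (L, D)    aligned features
    """
    length, dim, state = a_bar.shape
    h = np.zeros((dim, state))            # assumed zero initial state
    out = np.empty((length, dim))
    for t in range(length):
        # The state update is driven by the reference branch only.
        h = a_bar[t] * h + b_bar[t] * r[t][:, None]
        # The readout is driven by the target branch.
        out[t] = h @ c_q[t]
    return out

# Sanity check: with A = B = 1 and an averaging readout, the scan
# reduces to a running sum of the reference tokens.
L, D, N = 4, 3, 2
r = np.arange(L * D, dtype=float).reshape(L, D)
ones = np.ones((L, D, N))
c_q = np.full((L, N), 1.0 / N)
running_sum = cross_ssm_scan(ones, ones, r, c_q)
```

The point of the sketch is the split of roles: the reference sequence writes into the hidden state, while the target decides how that state is read out. A practical implementation would use a parallel scan rather than this Python loop.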
Loss & Training
The total loss is a weighted combination of the spatial-domain Charbonnier loss and the frequency-domain FCL:
\[\mathcal{L} = \lambda \sqrt{\|\mathbf{I}_{SR} - \mathbf{I}_{HR}\|^2 + \epsilon^2} + \mathcal{L}_{FCL}\]
where \(\lambda = 1.0\), \(\lambda_{Re} = \lambda_{Im} = 0.02\), and \(\epsilon = 10^{-3}\). Training datasets: REDS and Vimeo-90K.
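The FCL from the Key Designs section and its combination with the spatial loss can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the squared norm is taken as a sum over all elements, the FFT is the unnormalized 2D FFT over the last two axes, and the total is assumed to be \(\lambda\) times the spatial Charbonnier term plus FCL. The function names are hypothetical.

```python
import numpy as np

def charbonnier(diff, eps=1e-3):
    """sqrt(||diff||^2 + eps^2); squared L2 norm summed over all elements (assumption)."""
    return float(np.sqrt(np.sum(diff ** 2) + eps ** 2))

def fcl(sr, hr, lam_re=0.02, lam_im=0.02, eps=1e-3):
    """Charbonnier-like loss on the real and imaginary parts of the 2D FFT."""
    # Unnormalized FFT over the spatial axes (normalization is an assumption).
    diff = np.fft.fft2(sr, axes=(-2, -1)) - np.fft.fft2(hr, axes=(-2, -1))
    # By linearity, Re(F(sr)) - Re(F(hr)) equals Re(F(sr) - F(hr)); same for Im.
    return lam_re * charbonnier(diff.real, eps) + lam_im * charbonnier(diff.imag, eps)

def total_loss(sr, hr, lam=1.0, eps=1e-3):
    """Assumed form: lam * spatial Charbonnier + FCL."""
    return lam * charbonnier(sr - hr, eps) + fcl(sr, hr, eps=eps)

# Toy usage on a 3-channel 8x8 "frame".
hr = np.ones((3, 8, 8))
sr = hr + 0.1
loss_identical = total_loss(hr, hr)   # only the epsilon terms remain
loss_offset = total_loss(sr, hr)
```

Working on the real and imaginary parts keeps every operation linear up to the final square root, which is offset by \(\epsilon^2\), so the loss never evaluates a square root or arctan at zero.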
Key Experimental Results
Main Results
| Method | Input Frames (REDS/Vimeo-90K) | Params (M) | REDS4 PSNR (dB) | REDS4 SSIM | Vid4 PSNR (dB) | Vid4 SSIM |
|---|---|---|---|---|---|---|
| BasicVSR++ | 30/14 | 7.3 | 32.39 | 0.9069 | 27.79 | 0.8400 |
| VRT | 16/7 | 35.6 | 32.19 | 0.9006 | 27.93 | 0.8425 |
| RVRT | 30/14 | 10.8 | 32.75 | 0.9113 | 27.99 | 0.8462 |
| PSRT-rec | 16/14 | 13.4 | 32.72 | 0.9106 | 28.07 | 0.8485 |
| IART | 16/7 | 13.4 | 32.90 | 0.9138 | 28.26 | 0.8517 |
| VSRM | 16/7 | 17.1 | 33.11 | 0.9162 | 28.44 | 0.8552 |
VSRM outperforms IART by 0.21 dB on REDS4 (16-frame setting) and 0.18 dB on Vid4, and also achieves the best result of 38.33 dB on Vimeo-90K-T.
Ablation Study
| Ablation | PSNR (dB) | Params (M) | FLOPs (G) |
|---|---|---|---|
| 3D DW-Conv (replaces Mamba) | 30.84 | 19.49 | 149.8 |
| Window Attention (replaces Mamba) | 30.97 | 7.68 | 152.4 |
| Full Attention (replaces Mamba) | 31.06 | 7.68 | 1018.1 |
| Mamba (ours) | 31.09 | 8.61 | 159.2 |
| w/o DCA alignment | 30.87 | 8.53 | 120.4 |
| FGDA alignment | 30.92 | 8.70 | 154.3 |
| IA alignment | 31.00 | 8.57 | 148.7 |
| DCA alignment (ours) | 31.09 | 8.61 | 159.2 |
| w/o T2SMB | 30.95 | 7.87 | 155.6 |
| T2SMB (bidirectional) | 31.02 | 8.65 | 162.2 |
| T2SMB (unidirectional, ours) | 31.09 | 8.61 | 159.2 |
| FFN | 30.90 | 8.68 | 136.2 |
| TGFN (ours) | 31.09 | 8.61 | 159.2 |
| w/o FCL (\(\lambda_{Re} = \lambda_{Im} = 0\)) | 30.97 | - | - |
| FCL (\(\lambda_{Re} = \lambda_{Im} = 0.02\)) | 31.09 | - | - |
Key Findings
- Mamba matches full attention in PSNR (31.09 vs. 31.06 dB) at roughly 1/6 the FLOPs (159.2 G vs. 1018.1 G).
- DCA outperforms FGDA and IA alignment by 0.17 dB and 0.09 dB, respectively, validating the advantage of the deformable window mechanism.
- T2SMB complements S2TMB's limited temporal information extraction (+0.14 dB), and unidirectional scanning outperforms bidirectional.
- Removing FCL causes a 0.12 dB drop, confirming the importance of frequency-domain regularization for high-frequency detail recovery.
- VSRM's effective receptive field (ERF) substantially exceeds that of CNN and Transformer methods.
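The margins quoted in these findings follow directly from the ablation table. The short script below recomputes them from the tabulated values; the dictionary keys are hypothetical labels chosen for this sketch.

```python
# PSNR (dB) and FLOPs (G) copied from the ablation table.
psnr = {
    "mamba": 31.09,
    "full_attention": 31.06,
    "fgda": 30.92,
    "ia": 31.00,
    "no_t2smb": 30.95,
    "t2smb_bidirectional": 31.02,
    "no_fcl": 30.97,
}
flops_g = {"mamba": 159.2, "full_attention": 1018.1}

def gain(baseline):
    """PSNR gain of the full model over a named ablation variant, in dB."""
    return round(psnr["mamba"] - psnr[baseline], 2)

# Full attention costs roughly 6.4 times the FLOPs of the Mamba variant.
flops_ratio = flops_g["full_attention"] / flops_g["mamba"]

for name in ("fgda", "ia", "no_t2smb", "t2smb_bidirectional", "no_fcl"):
    print(f"{name}: +{gain(name):.2f} dB")
print(f"full attention / Mamba FLOPs: {flops_ratio:.1f}x")
```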
Highlights & Insights
- First Mamba + VSR work: Successfully validates the feasibility of Mamba in video super-resolution, achieving both linear complexity and a global receptive field.
- Complementary S2T and T2S scanning: Combining spatial-first and temporal-first scanning strategies enables complete spatiotemporal feature extraction—a VSR-specific Mamba adaptation.
- DCA's deformable reference regions: Unlike fixed-window implicit alignment, DCA dynamically adjusts reference regions via learned offsets, better handling motion of varying magnitude.
- Simple and effective FCL design: Computing Charbonnier loss directly on real and imaginary parts avoids the numerical instability of amplitude/phase-based computation.
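The difference between the S2T and T2S scan orders is easiest to see on a toy clip. The sketch below assumes a `(T, H, W, C)` feature layout and reads "spatial-first" as visiting every pixel of a frame before moving to the next frame, and "temporal-first" as visiting every time step of a pixel before moving to the next pixel. Both the layout and the function names are assumptions made for illustration.

```python
import numpy as np

def s2t_flatten(video):
    """Spatial-first, temporal-second: all pixels of frame 0, then frame 1, ..."""
    t, h, w, c = video.shape
    return video.reshape(t * h * w, c)

def t2s_flatten(video):
    """Temporal-first, spatial-second: all time steps of pixel (0, 0), then (0, 1), ..."""
    t, h, w, c = video.shape
    # Move the time axis behind the spatial axes so it varies fastest.
    return video.transpose(1, 2, 0, 3).reshape(h * w * t, c)

# Toy clip: 2 frames of 2x2 pixels, one channel holding a running index
# (index = frame * 4 + row * 2 + column).
video = np.arange(2 * 2 * 2).reshape(2, 2, 2, 1)

s2t = s2t_flatten(video)[:, 0]   # order: 0 1 2 3 4 5 6 7
t2s = t2s_flatten(video)[:, 0]   # order: 0 4 1 5 2 6 3 7

# S2T is described as bidirectional, so it also scans the reversed sequence;
# T2S is described as forward-only.
s2t_backward = s2t[::-1]
```

In the S2T order, neighboring tokens are spatial neighbors within one frame; in the T2S order, neighboring tokens are the same pixel in consecutive frames, which is consistent with the reported temporal bias of T2SMB.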
Limitations & Future Work
- Parameter count (17.1 M) and inference time (223 ms) are higher than those of PSRT/IART (13.4 M, 173–180 ms); Mamba acceleration remains an open research direction.
- Only ×4 super-resolution is explored; other scale factors and degradation models are not evaluated.
- Mamba's hardware acceleration libraries and tooling for vision are less mature than those for Transformers.
- The framework is extensible to other low-level video tasks such as deblurring, denoising, and colorization.
Related Work & Insights
- The selective SSM mechanism of Mamba (S6) makes parameters input-dependent, overcoming limitations of classical SSMs.
- Unlike MambaIR, VSRM processes multi-frame 3D sequences; the S2T/T2S scanning strategy is transferable to other video tasks.
- Comparisons with frequency-domain losses (FFL, WHFL) confirm FCL's advantage in balancing low- and high-frequency components.
Rating
- Novelty: ⭐⭐⭐⭐ First application of Mamba to VSR; the complementary S2T/T2S scanning and DCA designs are innovative.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Ablations cover every module; multi-metric, multi-dataset comparisons.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with rich figures and tables.
- Value: ⭐⭐⭐⭐ Provides a solid Mamba-based baseline for low-level video vision.