VSRM: A Robust Mamba-Based Framework for Video Super-Resolution

Conference: ICCV 2025
arXiv: 2506.22762
Code: N/A
Area: Image Restoration / Video Super-Resolution
Keywords: Video Super-Resolution, Mamba, State Space Model, Deformable Alignment, Frequency Loss

TL;DR

This work is the first to introduce Mamba into video super-resolution, proposing the Dual Aggregation Mamba Block (DAMB) for long-range spatiotemporal dependency modeling, the Deformable Cross-Mamba Alignment module (DCA) for more flexible inter-frame alignment, and the Frequency Charbonnier-like Loss (FCL) for improved high-frequency detail recovery, achieving state-of-the-art results on REDS4, Vid4, and Vimeo-90K.

Background & Motivation

Video super-resolution (VSR) aims to generate high-resolution frames from low-resolution video by exploiting complementary multi-frame information. Current approaches are primarily CNN- or Transformer-based:

  • CNN-based methods (e.g., BasicVSR) are constrained by local receptive fields and cannot effectively capture long-range inter-frame information.
  • Transformer-based methods (e.g., PSRT, IART) offer powerful attention mechanisms, but the quadratic complexity of full attention is impractical for long sequences, while window attention reduces complexity at the cost of a limited receptive field.
  • Alignment modules: existing methods commonly use bilinear/nearest-neighbor interpolation for spatial alignment, where fixed weights cause feature distortion; IART proposes attention-based implicit interpolation but computes within fixed reference windows, limiting flexibility.
  • Loss functions: pixel-level losses produce over-smoothing; perceptual losses introduce greater distortion; reconstruction–GT discrepancies are particularly pronounced in the frequency domain.

Mamba's linear complexity, long-sequence modeling capability, and data-dependent parameterization make it well-suited for VSR. This paper is the first to explore Mamba in this setting.

Method

Overall Architecture

VSRM consists of two components: a feature extractor (Conv2d + Feature Propagation Block, FPB) and an upsampler (Reconstruction module). The FPB includes Deformable Cross-Mamba Alignment (DCA) and the Dual Aggregation Mamba Block (DAMB). The FPB first aligns neighboring-frame features with DCA and then extracts deep spatiotemporal features with DAMB; the upsampler finally generates the high-resolution output.

Key Designs

  1. Dual Aggregation Mamba Block (DAMB): Composed of \(N\) S2T-Mamba blocks (S2TMBs) and one T2S-Mamba block (T2SMB), jointly modeling long-range dependencies in both spatial and temporal dimensions.

    • S2T-Mamba (Spatial-to-Temporal): Flattens the 3D video sequence into a 1D sequence with a spatial-first, temporal-second scan order, processed by bidirectional (forward and backward) SSMs. Formula: \(S2T\text{-}Mamba(x,z) = Linear(x_1 \odot z + x_2 \odot z)\)
    • T2S-Mamba (Temporal-to-Spatial): Uses a temporal-first, spatial-second scan order with a unidirectional forward scan only. Experiments show S2TMB is biased toward spatial information, while T2SMB explicitly prioritizes temporal information, making the two complementary.
    • TGFN (Temporal-Gated Feed-forward Network): Replaces the standard FFN with 3D depthwise separable convolutions and a gating mechanism to better model spatiotemporal neighborhood relationships and optimize information flow.
  2. Deformable Cross-Mamba Alignment (DCA): Optical flow is estimated using SpyNet, and a deformable window scheme is introduced during the compensation stage. The core idea is:

    • For each target pixel, the corresponding sampling location is identified in the reference frame via optical flow.
    • A window \(w\) is constructed around the sampling location, and a reference region \(r\) is initialized.
    • A learnable offset network \(\mathcal{S}(w)\) predicts offsets \(\epsilon_r\) to obtain a dynamic reference region \(\bar{r} = \phi(w; r + \epsilon_r)\).
    • A cross-Mamba module fuses reference and target features to complete alignment: \(\bar{X}(x,y) = cross\text{-}mamba(R, Q)\), based on the SSM recurrence \(H_t = \bar{A}_R H_{t-1} + \bar{B}_R \bar{R}_t\), \(\bar{X}_t = C_Q H_t\).
  3. Frequency Charbonnier-like Loss (FCL): Charbonnier losses are computed separately on the real and imaginary parts of the FFT-transformed images, rather than on the amplitude/phase (avoiding numerical instabilities from square roots and arctan operations).

\[\mathcal{L}_{FCL} = \sum_{i \in \{Re, Im\}} \lambda_i \sqrt{\|i(\mathcal{F}(\mathbf{I}_{SR})) - i(\mathcal{F}(\mathbf{I}_{HR}))\|^2 + \epsilon^2}\]
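To make the cross-Mamba recurrence from design 2 concrete, the following minimal sketch unrolls \(H_t = \bar{A}_R H_{t-1} + \bar{B}_R \bar{R}_t\), \(\bar{X}_t = C_Q H_t\) step by step. It is illustrative only: the function name `cross_ssm_scan` is hypothetical, the hidden state is reduced to a scalar, and the per-step coefficients are passed in as plain lists because this summary does not specify how they are derived from the reference and query features.

```python
def cross_ssm_scan(a_bar, b_bar, r, c):
    """Unroll H_t = a_t * H_{t-1} + b_t * r_t and X_t = c_t * H_t.

    a_bar, b_bar: per-step coefficients tied to the reference tokens (assumed given).
    r:            reference token values.
    c:            per-step read-out coefficients tied to the query tokens (assumed given).
    """
    h = 0.0  # initial hidden state (assumed zero)
    outputs = []
    for a_t, b_t, r_t, c_t in zip(a_bar, b_bar, r, c):
        h = a_t * h + b_t * r_t   # state is written from the reference side
        outputs.append(c_t * h)   # state is read out with the query side
    return outputs


aligned = cross_ssm_scan(
    a_bar=[0.5, 0.5, 0.5],
    b_bar=[1.0, 1.0, 1.0],
    r=[1.0, 2.0, 3.0],
    c=[1.0, 0.0, 2.0],
)
print(aligned)
```

The point of the sketch is the asymmetry: the state accumulates information from the reference sequence only, while the query decides how much of that state reaches each output position. A query coefficient of zero suppresses the output at that step without erasing the state.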

Loss & Training

The total loss is a weighted combination of the spatial-domain Charbonnier loss and the frequency-domain FCL:

\[\mathcal{L}_{total} = \lambda \mathcal{L}_{CL} + \mathcal{L}_{FCL}\]

where \(\lambda = 1.0\), \(\lambda_{Re} = \lambda_{Im} = 0.02\), and \(\epsilon = 10^{-3}\). Training datasets: REDS and Vimeo-90K.
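The two loss terms can be sketched in NumPy with the weights stated above. This is a sketch under assumptions, not the authors' implementation: the function names are hypothetical, the norm is taken over all elements of a single-channel image, the spatial Charbonnier term is assumed to share the same form and \(\epsilon\), and the orthonormal FFT scaling is a choice made here because the summary does not specify one.

```python
import numpy as np

EPS = 1e-3         # epsilon from the text
LAMBDA_CL = 1.0    # weight on the spatial Charbonnier term
LAMBDA_RE = 0.02   # weight on the real-part term
LAMBDA_IM = 0.02   # weight on the imaginary-part term


def charbonnier(diff, eps=EPS):
    """sqrt(||diff||^2 + eps^2), with the norm over all elements (assumed reduction)."""
    return float(np.sqrt(np.sum(np.square(diff)) + eps ** 2))


def fcl(sr, hr):
    """Charbonnier-like penalty on the real and imaginary parts of the 2D FFT."""
    # norm="ortho" is an assumption; it keeps spectrum and image on the same scale.
    f_sr = np.fft.fft2(sr, norm="ortho")
    f_hr = np.fft.fft2(hr, norm="ortho")
    return (LAMBDA_RE * charbonnier(f_sr.real - f_hr.real)
            + LAMBDA_IM * charbonnier(f_sr.imag - f_hr.imag))


def total_loss(sr, hr):
    """Weighted sum of the spatial Charbonnier term and the frequency term."""
    return LAMBDA_CL * charbonnier(sr - hr) + fcl(sr, hr)


rng = np.random.default_rng(0)
hr = rng.random((8, 8))                        # stand-in ground-truth frame
sr = hr + 0.05 * rng.standard_normal((8, 8))   # stand-in reconstruction
loss_same = total_loss(hr, hr)
loss_diff = total_loss(sr, hr)
print(loss_same, loss_diff)
```

Because the real and imaginary parts are used directly, no square root of a near-zero amplitude or `arctan` of a phase ever appears. The only square root is the Charbonnier one, which \(\epsilon\) keeps away from zero.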

Key Experimental Results

Main Results

| Method | Input Frames | Params (M) | REDS4 PSNR (dB) | REDS4 SSIM | Vid4 PSNR (dB) | Vid4 SSIM |
|---|---|---|---|---|---|---|
| BasicVSR++ | 30/14 | 7.3 | 32.39 | 0.9069 | 27.79 | 0.8400 |
| VRT | 16/7 | 35.6 | 32.19 | 0.9006 | 27.93 | 0.8425 |
| RVRT | 30/14 | 10.8 | 32.75 | 0.9113 | 27.99 | 0.8462 |
| PSRT-rec | 16/14 | 13.4 | 32.72 | 0.9106 | 28.07 | 0.8485 |
| IART | 16/7 | 13.4 | 32.90 | 0.9138 | 28.26 | 0.8517 |
| VSRM | 16/7 | 17.1 | 33.11 | 0.9162 | 28.44 | 0.8552 |

VSRM outperforms IART by 0.21 dB on REDS4 (16-frame setting) and 0.18 dB on Vid4, and also achieves the best result of 38.33 dB on Vimeo-90K-T.

Ablation Study

| Ablation | PSNR (dB) | Params (M) | FLOPs (G) |
|---|---|---|---|
| 3D DW-Conv (replaces Mamba) | 30.84 | 19.49 | 149.8 |
| Window Attention (replaces Mamba) | 30.97 | 7.68 | 152.4 |
| Full Attention (replaces Mamba) | 31.06 | 7.68 | 1018.1 |
| Mamba (ours) | 31.09 | 8.61 | 159.2 |
| w/o DCA alignment | 30.87 | 8.53 | 120.4 |
| FGDA alignment | 30.92 | 8.70 | 154.3 |
| IA alignment | 31.00 | 8.57 | 148.7 |
| DCA alignment (ours) | 31.09 | 8.61 | 159.2 |
| w/o T2SMB | 30.95 | 7.87 | 155.6 |
| T2SMB (bidirectional) | 31.02 | 8.65 | 162.2 |
| T2SMB (unidirectional, ours) | 31.09 | 8.61 | 159.2 |
| FFN | 30.90 | 8.68 | 136.2 |
| TGFN (ours) | 31.09 | 8.61 | 159.2 |
| w/o FCL (\(\lambda_{Re} = \lambda_{Im} = 0\)) | 30.97 | - | - |
| FCL (\(\lambda_{Re} = \lambda_{Im} = 0.02\)) | 31.09 | - | - |

Key Findings

  • Mamba slightly exceeds full attention in PSNR (31.09 vs. 31.06 dB) while using roughly one-sixth of the FLOPs (159.2 G vs. 1018.1 G).
  • DCA outperforms FGDA and IA alignment by 0.17 dB and 0.09 dB, respectively, validating the advantage of the deformable window mechanism.
  • T2SMB complements S2TMB's limited temporal information extraction (+0.14 dB), and unidirectional scanning outperforms bidirectional.
  • Removing FCL causes a 0.12 dB drop, confirming the importance of frequency-domain regularization for high-frequency detail recovery.
  • VSRM's effective receptive field (ERF) substantially exceeds that of CNN and Transformer methods.
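The gains quoted in these findings follow directly from the ablation table. The short check below recomputes them; the dictionary keys are labels chosen here for illustration, and the numbers are copied from the table.

```python
# PSNR (dB) and FLOPs (G) copied from the ablation table.
psnr = {
    "full": 31.09,       # complete VSRM ablation configuration
    "fgda": 30.92,       # FGDA alignment instead of DCA
    "ia": 31.00,         # IA alignment instead of DCA
    "no_t2smb": 30.95,   # T2SMB removed
    "no_fcl": 30.97,     # FCL removed
}
flops = {"mamba": 159.2, "full_attention": 1018.1}

# Gain of the full configuration over each variant, rounded to table precision.
gains = {
    name: round(psnr["full"] - value, 2)
    for name, value in psnr.items()
    if name != "full"
}
flops_ratio = round(flops["full_attention"] / flops["mamba"], 1)

print(gains)
print(flops_ratio)
```

The differences come out to 0.17, 0.09, 0.14, and 0.12 dB, matching the text, and full attention costs about 6.4 times the FLOPs of the Mamba variant.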

Highlights & Insights

  • First Mamba + VSR work: Successfully validates the feasibility of Mamba in video super-resolution, achieving both linear complexity and a global receptive field.
  • Complementary S2T and T2S scanning: Combining spatial-first and temporal-first scanning strategies enables complete spatiotemporal feature extraction, a VSR-specific adaptation of Mamba.
  • DCA's deformable reference regions: Unlike fixed-window implicit alignment, DCA dynamically adjusts reference regions via learned offsets, better handling motion of varying magnitude.
  • Simple and effective FCL design: Computing Charbonnier loss directly on real and imaginary parts avoids the numerical instability of amplitude/phase-based computation.
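The difference between the two scan orders is easiest to see on a tiny clip. The sketch below enumerates token positions `(t, h, w)` for a clip of 2 frames of 2×3 pixels. It assumes that "spatial-first" means all pixels of one frame are visited before moving to the next frame, that "temporal-first" means all frames are visited at one pixel before moving to the next pixel, and that pixels are visited in row-major order; the variable names are illustrative.

```python
T, H, W = 2, 2, 3  # frames, height, width of a toy clip

# S2T: spatial-first, temporal-second. The inner loops run over space.
s2t_forward = [(t, h, w) for t in range(T) for h in range(H) for w in range(W)]
# S2T is scanned in both directions; the backward scan reverses the sequence.
s2t_backward = s2t_forward[::-1]

# T2S: temporal-first, spatial-second. The inner loop runs over time.
# Only the forward direction is used for T2S.
t2s_forward = [(t, h, w) for h in range(H) for w in range(W) for t in range(T)]

print(s2t_forward[:4])
print(t2s_forward[:4])
```

In the S2T order, neighboring tokens in the 1D sequence are mostly spatial neighbors within one frame, whereas in the T2S order they are the same pixel location in consecutive frames. This is consistent with the observation that S2TMB leans toward spatial information and T2SMB toward temporal information.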

Limitations & Future Work

  • Parameter count (17.1 M) and inference time (223 ms) are higher than those of PSRT/IART (13.4 M, 173–180 ms); Mamba acceleration remains an open research direction.
  • Only ×4 super-resolution is explored; other scale factors and degradation models are not evaluated.
  • Mamba's hardware acceleration libraries and tooling for vision are less mature than those for Transformers.
  • The framework is extensible to other low-level video tasks such as deblurring, denoising, and colorization.
  • The selective SSM mechanism of Mamba (S6) makes parameters input-dependent, overcoming limitations of classical SSMs.
  • Unlike MambaIR, VSRM processes multi-frame 3D sequences; the S2T/T2S scanning strategy is transferable to other video tasks.
  • Comparisons with frequency-domain losses (FFL, WHFL) confirm FCL's advantage in balancing low- and high-frequency components.

Rating

  • Novelty: ⭐⭐⭐⭐ First application of Mamba to VSR; the dual S2T/T2S scanning and DCA designs are innovative.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Ablations cover every module; multi-metric, multi-dataset comparisons.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with rich figures and tables.
  • Value: ⭐⭐⭐⭐ Provides a solid Mamba-based baseline for low-level video vision.