
VSRM: A Robust Mamba-Based Framework for Video Super-Resolution

Conference: ICCV 2025 arXiv: 2506.22762 Code: N/A Area: Image Restoration / Video Super-Resolution Keywords: Video Super-Resolution, Mamba, State Space Model, Deformable Alignment, Frequency Loss

TL;DR

This work is the first to introduce Mamba into video super-resolution (VSR), proposing the VSRM framework. It performs efficient spatiotemporal modeling via the Dual Aggregation Mamba Block, combined with Deformable Cross-Mamba Alignment and a frequency-domain loss, and achieves state-of-the-art performance on multiple benchmarks.

Background & Motivation

Video super-resolution requires processing long sequences and capturing inter-frame information over large receptive fields. Existing methods exhibit clear limitations:

  • CNN-based methods (e.g., BasicVSR): receptive fields are confined to local regions, limiting the capture of long-range inter-frame information.
  • Transformer-based methods (e.g., IART, PSRT): quadratic complexity of full attention is impractical for long sequences; window attention reduces complexity but sacrifices receptive field coverage.
  • Alignment modules: most existing methods rely on fixed-weight interpolation (e.g., bilinear) for alignment, causing feature distortion; attention-based implicit alignment is also constrained by fixed reference windows.
  • Loss functions: pixel-level losses produce over-smoothed outputs; perceptual losses introduce additional distortion; models also suffer from spectral bias.

Mamba's linear complexity, global receptive field, and data-dependent parameterization make it naturally suited for VSR, yet it had not been explored in this context prior to this work.

Method

Overall Architecture

VSRM consists of two main components: a feature extractor (Conv2d + Feature Propagation Block) and an upsampler (Reconstruction module). The Feature Propagation Block includes Deformable Cross-Mamba Alignment (DCA) and the Dual Aggregation Mamba Block (DAMB): it first aligns neighboring frame features via DCA, then extracts deep spatiotemporal features with DAMBs, and finally produces the high-resolution output through the upsampler.

Key Designs

  1. Dual Aggregation Mamba Block (DAMB): The core module, composed of \(N\) S2TMBs and one T2SMB.

    • S2TMB (Spatial-to-Temporal Mamba): Flattens the 3D sequence into 1D following a spatial-first, temporal-second order, and applies bidirectional (forward and backward) SSM scanning. Bidirectional scanning preserves spatial awareness while enabling temporal modeling. Formula: \(S2T\text{-}Mamba(x,z)=Linear(x_1 \odot z + x_2 \odot z)\)
    • T2SMB (Temporal-to-Spatial Mamba): Flattens the sequence in a temporal-first, spatial-second order and applies only a forward scan (experiments show unidirectional scanning is superior here), prioritizing temporal information extraction to complement S2TMB's spatial emphasis.
    • TGFN (Temporal-Gated Feed-forward Network): Incorporates 3D depthwise separable convolutions to model spatiotemporal neighborhood relationships, with a gating mechanism (channel splitting + GELU) to optimize information flow: \(TGFN(X)=W_p^2(W_d^1 LN(\hat{X}_1) \odot \sigma(W_d^2 LN(\hat{X}_2)))\)
  2. Deformable Cross-Mamba Alignment (DCA): Addresses inter-frame motion alignment.

    • Optical flow is estimated using a pretrained SpyNet.
    • A deformable window mechanism is introduced during the compensation stage: a window \(w\) is extracted from the reference frame, a reference region \(r\) is initialized, and a lightweight offset network learns offsets \(\epsilon_r\) to produce a dynamic reference region \(\bar{r}=\phi(w; r+\epsilon_r)\).
    • A cross-Mamba module fuses target and dynamic reference features: \(\bar{X}(x,y) = cross\text{-}mamba(R,Q)\), where \(H_t = \bar{A}_R H_{t-1} + \bar{B}_R \bar{R}_t\), \(\bar{X}_t = C_Q H_t\).
    • Compared to fixed-window alignment, DCA adapts more flexibly to motion of varying magnitude.
  3. Frequency Charbonnier-like Loss (FCL): Computes the loss in the frequency domain to recover high-frequency details.

    • FFT is applied to the images; Charbonnier losses are computed separately on the real and imaginary parts.
    • \(\mathcal{L}_{FCL}=\sum_{c\in\{Re,Im\}} \lambda_c \sqrt{\|\mathcal{F}_c(\mathbf{I}_{SR})-\mathcal{F}_c(\mathbf{I}_{HR})\|^2+\epsilon^2}\), where \(\mathcal{F}_{Re}\) and \(\mathcal{F}_{Im}\) denote the real and imaginary parts of the FFT.
    • Real and imaginary parts are used directly instead of amplitude/phase, avoiding discontinuities introduced by square roots and arctan operations.
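The two scan orders at the heart of DAMB can be sketched in a few lines of numpy. This is an illustrative toy, not the authors' implementation: a real selective scan (S6) uses learned, data-dependent parameters, whereas `ssm_scan` below is a fixed scalar recurrence.

```python
# Toy illustration of DAMB's two scan orders (not the authors' code).
# Shapes are (T, H, W, C); ssm_scan is a fixed scalar stand-in for S6.
import numpy as np

def flatten_s2t(x):
    """Spatial-first, temporal-second order (S2TMB): emit all H*W tokens
    of frame 0, then frame 1, and so on."""
    t, h, w, c = x.shape
    return x.reshape(t * h * w, c)

def flatten_t2s(x):
    """Temporal-first, spatial-second order (T2SMB): emit one pixel's
    T tokens across time, then move to the next spatial location."""
    x = np.transpose(x, (1, 2, 0, 3))   # (H, W, T, C)
    return x.reshape(-1, x.shape[-1])

def ssm_scan(u, a=0.9, b=0.1):
    """Minimal linear recurrence h_t = a*h_{t-1} + b*u_t, y_t = h_t."""
    h = np.zeros(u.shape[-1])
    out = np.empty_like(u)
    for i, u_t in enumerate(u):
        h = a * h + b * u_t
        out[i] = h
    return out

x = np.arange(8, dtype=float).reshape(2, 2, 2, 1)   # T=H=W=2, C=1
seq = flatten_s2t(x)
y_fwd = ssm_scan(seq)               # forward scan
y_bwd = ssm_scan(seq[::-1])[::-1]   # backward scan (bidirectional S2TMB)
```

The two flattenings place different tokens adjacent in the 1D sequence: S2T keeps each frame contiguous, while T2S keeps each pixel's temporal trajectory contiguous, which is why the two blocks extract complementary information.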
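DCA's two stages (deformable reference sampling and cross-Mamba fusion) admit a similar sketch. Everything here is a simplifying assumption: nearest-neighbour gathering stands in for the sampling function \(\phi\), and fixed scalars replace the learned, data-dependent \(\bar{A}_R, \bar{B}_R, C_Q\).

```python
# Hypothetical sketch of DCA (not the paper's implementation):
# nearest-neighbour gathering stands in for phi, and fixed scalars
# replace the learned A_R, B_R, C_Q parameters.
import numpy as np

def dynamic_reference(ref, base_xy, offsets):
    """Sample a dynamic reference region r_bar = phi(w; r + eps_r):
    grid positions plus learned offsets, clipped to the frame."""
    h, w = ref.shape[:2]
    pts = np.rint(base_xy + offsets).astype(int)
    pts[:, 0] = np.clip(pts[:, 0], 0, h - 1)
    pts[:, 1] = np.clip(pts[:, 1], 0, w - 1)
    return ref[pts[:, 0], pts[:, 1]]            # (K, C) features

def cross_mamba(R, Q, a=0.9, b=0.1):
    """Toy cross recurrence: the state is driven by reference tokens R,
    the readout is modulated by query tokens Q
    (H_t = a*H_{t-1} + b*R_t, X_t = Q_t * H_t)."""
    h = np.zeros(R.shape[-1])
    out = np.empty_like(Q)
    for t in range(len(R)):
        h = a * h + b * R[t]
        out[t] = Q[t] * h
    return out

rng = np.random.default_rng(0)
ref = rng.random((8, 8, 4))                     # reference-frame features
base = np.array([[2, 2], [2, 3], [3, 2], [3, 3]], dtype=float)  # window r
eps = np.array([[0.6, -0.2], [0.1, 0.4], [-0.3, 0.0], [0.2, 0.2]])
R = dynamic_reference(ref, base, eps)           # dynamic region r_bar
Q = rng.random((4, 4))                          # target (query) tokens
X = cross_mamba(R, Q)                           # fused features, (4, 4)
```

Because the offsets move the sampling points rather than interpolating with fixed weights, large and small motions can both be accommodated by the same mechanism.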

Loss & Training

Total loss: \(\mathcal{L}_{total} = \lambda \mathcal{L}_{CL} + \mathcal{L}_{FCL}\)

Hyperparameters: \(\lambda=1.0\), \(\lambda_{Re}=\lambda_{Im}=0.02\), \(\epsilon=10^{-3}\). Training datasets: REDS and Vimeo-90K; task: ×4 super-resolution.
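The objective above can be sketched in a few lines of numpy; the helper names (`charbonnier`, `fcl`, `total_loss`) are mine, not the paper's, and real training applies the losses per frame over mini-batches.

```python
# Minimal numpy sketch of the training objective; helper names are mine.
# Uses the paper's hyperparameters: lambda=1.0, lambda_Re=lambda_Im=0.02,
# eps=1e-3.
import numpy as np

def charbonnier(x, y, eps=1e-3):
    """Spatial Charbonnier loss: sqrt(||x - y||^2 + eps^2)."""
    return np.sqrt(np.sum((x - y) ** 2) + eps ** 2)

def fcl(sr, hr, lam_re=0.02, lam_im=0.02, eps=1e-3):
    """Frequency Charbonnier-like Loss: Charbonnier on the real and
    imaginary parts of the 2D FFT (not on amplitude/phase)."""
    f_sr, f_hr = np.fft.fft2(sr), np.fft.fft2(hr)
    loss_re = np.sqrt(np.sum((f_sr.real - f_hr.real) ** 2) + eps ** 2)
    loss_im = np.sqrt(np.sum((f_sr.imag - f_hr.imag) ** 2) + eps ** 2)
    return lam_re * loss_re + lam_im * loss_im

def total_loss(sr, hr, lam=1.0):
    return lam * charbonnier(sr, hr) + fcl(sr, hr)

rng = np.random.default_rng(0)
sr, hr = rng.random((16, 16)), rng.random((16, 16))
loss = total_loss(sr, hr)   # positive scalar
```

Working on real and imaginary parts keeps every term smooth in the network outputs, which is exactly the property the amplitude/phase decomposition sacrifices.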

Key Experimental Results

Main Results

| Method | Frames | Params (M) | REDS4 PSNR | REDS4 SSIM | Vimeo-90K-T PSNR | Vid4 PSNR | Vid4 SSIM |
|---|---|---|---|---|---|---|---|
| BasicVSR++ | 30/14 | 7.3 | 32.39 | 0.9069 | 37.79 | 27.79 | 0.8400 |
| VRT | 16/7 | 35.6 | 32.19 | 0.9006 | 38.20 | 27.93 | 0.8425 |
| RVRT | 30/14 | 10.8 | 32.75 | 0.9113 | 38.15 | 27.99 | 0.8462 |
| PSRT-recurrent | 16/14 | 13.4 | 32.72 | 0.9106 | 38.27 | 28.07 | 0.8485 |
| IART | 16/7 | 13.4 | 32.90 | 0.9138 | 38.14 | 28.26 | 0.8517 |
| VSRM | 16/7 | 17.1 | 33.11 | 0.9162 | 38.33 | 28.44 | 0.8552 |

Under the 6-frame setting, VSRM also outperforms IART by 0.28 dB (32.43 vs. 32.15).

Ablation Study

| Ablation | PSNR (dB) | Params (M) | FLOPs (G) | Notes |
|---|---|---|---|---|
| 3D DW-Conv (replaces Mamba) | 30.84 | 19.49 | 149.8 | Mamba shows clear advantage (+0.25 dB) |
| Window Attention (replaces Mamba) | 30.97 | 7.68 | 152.4 | Mamba outperforms (+0.12 dB) |
| Full Attention (replaces Mamba) | 31.06 | 7.68 | 1018.1 | Comparable PSNR at roughly 6.4× Mamba's FLOPs |
| Mamba (ours) | 31.09 | 8.61 | 159.2 | Best performance–efficiency trade-off |
| w/o alignment module | 30.87 | 8.53 | 120.4 | Alignment contributes +0.22 dB |
| FGDA alignment | 30.92 | 8.70 | 154.3 | DCA outperforms by +0.17 dB |
| IA alignment | 31.00 | 8.57 | 148.7 | DCA outperforms by +0.09 dB |
| w/o T2SMB | 30.95 | 7.87 | 155.6 | T2SMB contributes +0.14 dB |
| T2SMB (bidirectional) | 31.02 | 8.65 | 162.2 | Unidirectional scan gains +0.07 dB |
| w/o FCL | 30.97 | — | — | FCL contributes +0.12 dB |
| FFN (replaces TGFN) | 30.90 | 8.68 | 136.2 | TGFN contributes +0.19 dB |

Key Findings

  • VSRM outperforms IART by 0.21 dB on REDS4 and 0.18 dB on Vid4, demonstrating effectiveness under both large- and small-motion scenarios.
  • Mamba achieves performance comparable to full attention (1018.1 G FLOPs) with only 159.2 G FLOPs.
  • T2SMB with unidirectional forward scanning outperforms bidirectional scanning (31.09 vs. 31.02), indicating that redundant scanning is detrimental.
  • Effective receptive field (ERF) visualization confirms that VSRM achieves a global receptive field, substantially larger than CNN and Transformer counterparts.

Highlights & Insights

  • This work is the first to validate the effectiveness of Mamba in VSR, opening a new backbone option for low-level vision tasks.
  • The DCA module's design of "deformable windows + cross-Mamba" is elegant: deformable windows handle motion of varying magnitude, while cross-Mamba performs implicit alignment.
  • FCL directly computes Charbonnier loss on real and imaginary parts, which is simpler and more effective than prior frequency-domain losses such as FFL and WHFL.
  • The integration of 3D depthwise convolutions in TGFN enables the feed-forward network to model spatiotemporal information as well.

Limitations & Future Work

  • Parameter count (17.1 M) and inference time (223 ms) are slightly higher than PSRT/IART (13.4 M, ~175 ms).
  • Mamba acceleration and optimization remain an active area; further speedup is achievable.
  • Only ×4 super-resolution is evaluated; other scale factors are not explored.
  • The framework is extensible to other low-level video tasks such as deblurring, denoising, and colorization.
  • The selective SSM mechanism of Mamba (S6) makes parameters input-dependent, overcoming limitations of conventional SSMs.
  • Unlike MambaIR, VSRM must process multi-frame 3D sequences; the S2T/T2S scanning strategy is worth adapting for other video tasks.
  • The cross-Mamba alignment paradigm is applicable to other video tasks requiring inter-frame correspondence.

Rating

  • Novelty: ⭐⭐⭐⭐ First application of Mamba to VSR; the complementary S2T/T2S scanning design is well-motivated.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive ablations covering backbone, alignment, FFN, loss, and scanning direction; validated across multiple datasets.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with complete mathematical derivations.
  • Value: ⭐⭐⭐⭐ Provides a new efficient backbone choice for VSR; state-of-the-art results are convincing.
