Exploiting Temporal State Space Sharing for Video Semantic Segmentation

Conference: CVPR 2025
arXiv: 2503.20824
Code: https://github.com/Ashesham/TV3S
Area: Semantic Segmentation / Video Semantic Segmentation
Keywords: Video Semantic Segmentation, State Space Models, Mamba, Temporal Feature Sharing, Efficient Inference

TL;DR

This work proposes the TV3S (Temporal Video State Space Sharing) framework, which leverages Mamba state space models to achieve efficient temporal information sharing across video frames. By processing spatial patches independently and incorporating a shifted window mechanism, TV3S enables highly parallelized computation. It outperforms existing Transformer and RNN methods on the VSPW and Cityscapes datasets while maintaining a superior accuracy-efficiency trade-off.

Background & Motivation

Background: Video semantic segmentation (VSS) requires leveraging temporal information to improve consistency and accuracy beyond frame-level segmentation. Existing methods generally fall into three categories: (1) optical flow-based methods align inter-frame features by estimating pixel motion, but are computationally expensive and inaccurate in occluded or sudden-change scenarios; (2) RNN-based (e.g., ConvLSTM) methods capture temporal information but suffer from scalability and training stability issues on long video sequences; (3) Transformer-based methods (e.g., CFFM, MRCFA) capture global dependencies, but the quadratic complexity of the attention mechanism leads to heavy memory and computational overhead, typically restricting them to short temporal windows.

Limitations of Prior Work: All existing methods compromise to varying degrees on long video sequence scalability, computational/memory efficiency, and temporal consistency preservation. In particular, while Transformer-based methods perform well over short windows, they struggle to scale to global temporal modeling for long videos, whereas RNN-based approaches suffer from low sequence processing efficiency despite their recursive structure.

Key Challenge: Long-range temporal modeling capability and computational efficiency pull in opposite directions. An architecture is needed that can efficiently store and propagate long-range temporal information without incurring expensive attention computations.

Goal: Design a computationally efficient VSS architecture that supports long video sequences and leverages both local and global temporal information.

Key Insight: State Space Models (SSMs), particularly Mamba, feature linear complexity and high efficiency in long-sequence modeling. The authors introduce Mamba to VSS. Instead of simply running the SSM on full-frame features (which would become a bottleneck), they split the feature map into independent spatial patches. Each patch independently maintains and propagates its own hidden state along the temporal dimension, enabling a highly parallelized implementation.

Core Idea: Independent spatial patches are used with Mamba SSMs to propagate hidden states temporally, achieving parallelized temporal feature sharing.

Method

Overall Architecture

TV3S adopts an encoder-decoder architecture. The video frames \(\{I_{t-l}, ..., I_t\}\) are first processed by an image encoder (e.g., MiT or Swin) to extract spatial features \(\{E_{t-l}, ..., E_t\}\). These feature maps then sequentially pass through \(N=4\) TV3S Blocks for temporal information aggregation. Each TV3S Block contains two TSS (Temporal State Space) modules that process unshifted and shifted patches, respectively. The final aggregated features generate segmentation outputs through linear projection and interpolation. During inference, frames are processed sequentially, and the hidden state of each frame is stored and propagated to subsequent frames.

Key Designs

Independent Spatial Patch Processing & SSM Temporal Aggregation:
- Function: Independently propagate and aggregate information along the temporal dimension on a patch-by-patch basis.
- Mechanism: The encoded feature map \(E_t\) is divided into non-overlapping spatial patches of size \(w \times w\) (\(w=20\)), and each patch is flattened into a 1D sequence. Within each TSS module, a Mamba SSM updates the hidden state \(H_t^{i,j}\) of patch \((i,j)\) from the current patch input \(\hat{P}_t^{i,j}\) and the previous frame's hidden state \(H_{t-1}^{i,j}\). The state-space equation is \(H_t^{i,j} = f_A(\Delta, A_s)H_{t-1}^{i,j} + f_B(\Delta, A_s, B_s)\hat{P}_t^{i,j}\) with output \(F_t^{i,j} = C_s H_t^{i,j}\), where \(f_A\) and \(f_B\) discretize the continuous parameters \(A_s\) and \(B_s\) with step size \(\Delta\). Key advantage: the SSM updates of all patches are independent and can run in parallel, requiring only \(\frac{W}{w} \times \frac{H}{w}\) hidden states in total for a feature map of width \(W\) and height \(H\) (not to be confused with the hidden state \(H_t^{i,j}\)).
- Design Motivation: Unlike VisionMamba, which flattens the entire feature map into a single long sequence (tangling spatial and temporal dimensions), processing independent patches offers two major advantages: (1) the encoder has already thoroughly learned the spatial representations, so the patch processing only needs to focus on the temporal dimension; (2) frame-to-frame variations at the patch level are small, aligning perfectly with the incremental update nature of SSMs.
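The patch-wise recurrence described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the TV3S implementation: it assumes a diagonal, non-selective SSM with fixed `A`, `B`, `C_out`, and `delta` (in Mamba these are input-dependent), and the helper names `partition` and `scan_patches` are hypothetical.

```python
import numpy as np

def partition(feat, w):
    """Split a (H, W, C) feature map into non-overlapping w x w patches.

    Returns shape (H//w, W//w, w*w, C): one flattened token sequence per
    patch. Assumes H and W are multiples of w.
    """
    H, W, C = feat.shape
    x = feat.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H // w, W // w, w * w, C)

def scan_patches(state, patches, A, B, C_out, delta):
    """Scan the tokens of every patch, starting from the carried-over state.

    state:   (nH, nW, C, N) hidden state left by the previous frame
    patches: (nH, nW, L, C) tokens of the current frame
    A, B, C_out: (N,) diagonal SSM parameters (assumed fixed in this sketch)
    """
    A_bar = np.exp(delta * A)   # discretized state transition
    B_bar = delta * B           # simplified discretized input matrix
    outputs = np.empty_like(patches)
    for l in range(patches.shape[2]):
        x = patches[:, :, l, :, None]          # (nH, nW, C, 1)
        state = A_bar * state + B_bar * x      # all patches updated at once
        outputs[:, :, l, :] = state @ C_out    # read-out, (nH, nW, C)
    return state, outputs

# Toy clip: 3 frames of a 40 x 60 feature map with 8 channels.
rng = np.random.default_rng(0)
w, N = 20, 4
frames = rng.standard_normal((3, 40, 60, 8))
A = -np.arange(1.0, N + 1)      # negative values keep the recurrence stable
B = np.ones(N)
C_out = np.ones(N) / N

state = np.zeros((40 // w, 60 // w, 8, N))   # one hidden state per patch
for feat in frames:                           # streaming: state carries over
    state, out = scan_patches(state, partition(feat, w), A, B, C_out, 0.1)
```

No patch ever reads another patch's state, so the update over the `(nH, nW)` grid is a single vectorized operation. The loop over frames shows how the same state array would be carried forward during streaming inference.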

Shifted Window Mechanism:
- Function: Alleviate boundary effects caused by independent patch processing and enhance spatial context interaction between patches.
- Mechanism: Inspired by Swin Transformer, the second TSS module of each TV3S Block processes shifted feature maps. The shift parameter is \(s = w/2 = 10\), which merges the boundary regions of adjacent patches into a single patch after shifting. Incomplete boundary patches are subdivided into smaller sub-patches for processing. Consequently, the combination of two TSS modules (unshifted + shifted) simultaneously captures temporal dynamics both within and across patch boundaries.
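- Design Motivation: Strictly independent patch processing ignores motion information at the boundaries (e.g., objects crossing patch boundaries). Shifting the partition is a cheap way to expand the receptive field: the shift itself only re-partitions the feature map and introduces no new operator, although the second TSS module still has to be evaluated.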
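To make the effect of the shift concrete, the sketch below computes window boundaries along a single axis with and without a shift of \(s = w/2\). The function name `window_edges` is hypothetical, and the way TV3S handles the smaller border pieces in 2D may differ from this 1D illustration.

```python
def window_edges(size, w, shift=0):
    """Return [start, stop) intervals of the windows along one axis.

    With shift=0 the axis is cut every w positions. With 0 < shift < w the
    cuts move by `shift`, leaving smaller windows at the two borders.
    """
    first_cut = shift if shift > 0 else w
    cuts = [0] + list(range(first_cut, size, w)) + [size]
    return list(zip(cuts[:-1], cuts[1:]))

print(window_edges(60, 20))       # [(0, 20), (20, 40), (40, 60)]
print(window_edges(60, 20, 10))   # [(0, 10), (10, 30), (30, 50), (50, 60)]
```

In the shifted partition, the old boundaries at 20 and 40 fall in the middle of the windows `(10, 30)` and `(30, 50)`, so features on either side of an unshifted boundary share one hidden state in the second TSS module.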

Dual-Loss Training Strategy:
- Function: Jointly guarantee spatial feature quality and temporal aggregation effectiveness.
- Mechanism: The total loss is defined as \(\mathcal{L} = \lambda \sum_{k} \mathcal{L}_{CE}(\hat{O}_{t-k}, M_{t-k}) + \mathcal{L}_{CE}(O_t, M_t)\), where \(k\) runs over the sampled frames, \(\hat{O}_{t-k}\) represents intermediate predictions directly from the encoder (without temporal information), and \(O_t\) is the final prediction after the TV3S Blocks. \(\lambda = 0.5\) balances the two terms. During training, the input consists of 4 frames sampled with a stride of 3, i.e. \(\{t-9, t-6, t-3, t\}\).
- Design Motivation: The intermediate loss ensures that the encoder learns high-quality spatial representations (the foundation for TV3S inputs), while the final loss optimizes temporal aggregation. Without the intermediate loss, the encoder might degenerate to outputting oversimplified features tailored solely for the temporal module.
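A minimal NumPy sketch of this two-term loss follows. It makes two assumptions that the summary does not settle: the auxiliary sum covers all four sampled frames (including the target frame), and plain rather than class-weighted cross-entropy is used. The names `cross_entropy` and `dual_loss` are illustrative.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean pixel-wise cross-entropy.

    logits: (H, W, K) class scores, labels: (H, W) integer class ids.
    """
    z = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.mean()

def dual_loss(encoder_logits, final_logits, masks, lam=0.5):
    """lam * sum of encoder-only losses + loss of the final prediction.

    encoder_logits: per-frame predictions made without temporal information
    final_logits:   prediction for the target frame after temporal aggregation
    masks:          ground-truth label maps, last entry is the target frame
    """
    aux = sum(cross_entropy(o, m) for o, m in zip(encoder_logits, masks))
    return lam * aux + cross_entropy(final_logits, masks[-1])

# Toy example: 4 frames, 5 classes, random scores.
rng = np.random.default_rng(0)
K = 5
enc = [rng.standard_normal((4, 6, K)) for _ in range(4)]
gt = [rng.integers(0, K, size=(4, 6)) for _ in range(4)]
loss = dual_loss(enc, rng.standard_normal((4, 6, K)), gt)
```

With all-zero logits every pixel contributes \(\log K\), so the total would be \(0.5 \cdot 4 \log K + \log K = 3 \log K\), which is a convenient sanity check for the weighting.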

Loss & Training

A weighted cross-entropy loss is used with the AdamW optimizer and a "poly" learning rate policy with an initial learning rate of \(6 \times 10^{-5}\). Training uses 4-frame clips sampled with a stride of 3 frames, whereas inference can handle video sequences of arbitrary length: frames are processed one at a time and the hidden state is carried forward continuously.
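The schedule and the clip sampling can be illustrated as follows. The decay exponent `power=1.0` and the iteration count are assumptions (the summary gives only the initial learning rate), and both function names are hypothetical.

```python
def poly_lr(step, max_steps, base_lr=6e-5, power=1.0):
    """'Poly' policy: decay base_lr towards 0 as step approaches max_steps."""
    return base_lr * (1.0 - step / max_steps) ** power

def clip_indices(t, num_frames=4, stride=3):
    """Frame indices of a training clip that ends at frame t."""
    return [t - stride * k for k in range(num_frames - 1, -1, -1)]

print(clip_indices(9))   # [0, 3, 6, 9]
```

With these defaults the learning rate starts at \(6 \times 10^{-5}\), is halved at the midpoint of training, and reaches zero at the final step.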

Key Experimental Results

Main Results (VSPW Dataset)

Method	Backbone	mIoU↑	mVC8↑	mVC16↑	GFLOPs↓	FPS↑
SegFormer	MiT-B2	43.9	86.0	81.2	100.8	16.2
CFFM	MiT-B2	44.9	89.8	85.8	143.2	10.1
MRCFA	MiT-B2	45.3	90.3	86.2	127.9	10.7
TV3S	MiT-B2	46.3	91.5	88.35	53.9	21.9

TV3S with a MiT-B2 backbone outperforms MRCFA by 1.0 mIoU and 2.15 mVC16, yet consumes only 42% of MRCFA's GFLOPs while doubling the FPS.
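The figures quoted in this comparison can be re-derived from the MRCFA and TV3S rows of the table. The snippet is a plain arithmetic check; the dictionary layout is just for illustration.

```python
mrcfa = {"mIoU": 45.3, "mVC16": 86.2, "GFLOPs": 127.9, "FPS": 10.7}
tv3s = {"mIoU": 46.3, "mVC16": 88.35, "GFLOPs": 53.9, "FPS": 21.9}

miou_gain = round(tv3s["mIoU"] - mrcfa["mIoU"], 2)       # 1.0
mvc16_gain = round(tv3s["mVC16"] - mrcfa["mVC16"], 2)    # 2.15
flops_ratio = round(tv3s["GFLOPs"] / mrcfa["GFLOPs"], 3) # 0.421
fps_ratio = round(tv3s["FPS"] / mrcfa["FPS"], 2)         # 2.05
```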

Ablation Study

Configuration	mIoU↑	mVC8↑	Description
Without TV3S (SegFormer baseline)	36.5	84.7	Pure frame-level segmentation
TV3S (1 block)	38.4	89.2	Single block temporal enhancement
TV3S (4 blocks)	40.0	90.7	Standard configuration
TV3S w/o shift	39.1	89.6	Shifted window removed
TV3S w/ shift	40.0	90.7	Full model

Key Findings

TV3S achieves the best mIoU and mVC across all backbones (MiT-B1/B2/B5, Swin-T/S), while yielding significantly lower GFLOPs compared to Transformer-based methods.
The improvement in temporal consistency (mVC16) is particularly notable: 88.35 vs. 86.2 (MRCFA) on MiT-B2, suggesting that the hidden state transmission of SSMs is well suited to temporal consistency modeling.
The shifted window contributes +0.9 mIoU and +1.1 mVC8 (39.1 → 40.0 and 89.6 → 90.7), indicating that temporal information at patch boundaries matters.
On the Swin-S backbone, mIoU reaches 50.6, far above MPVSS's 40.4, at 94.1 GFLOPs. MPVSS is cheaper at 47.3 GFLOPs, so TV3S roughly doubles the compute in exchange for about 10 points of mIoU.
During inference, the full video sequence can be processed (utilizing the hidden states of all historical frames), which the Transformer-based methods above cannot do because attention cost restricts them to short temporal windows.

Highlights & Insights

Paradigm Advantage of SSMs in Video Understanding: Compared to the restricted attention window of Transformers, SSMs can in principle carry temporal information forward indefinitely through the hidden state, at a constant cost per frame (linear in sequence length overall). This allows the model to leverage all historical information from the first frame during inference, presenting an inherent architectural advantage.
Efficient Spatio-Temporal Decoupled Design: Unlike VideoMamba, which flattens the entire video tensor into a single sequence, TV3S relies on the encoder for spatial representations and restricts the SSM to temporal aggregation within independent patches, greatly reducing computation.
Flexible Training-Inference Scheme: The model is trained on only 4 frames but can process arbitrarily long sequences during inference. The continuous propagation of hidden states naturally supports streaming setups, making it highly practical for real-world video analysis tasks like autonomous driving and video surveillance.

Limitations & Future Work

The selection of hidden state dimensions and patch size \(w\) impacts performance; currently, \(w\) is fixed to 20 without adaptive adjustments.
Validation is limited to two benchmarks (VSPW, Cityscapes); performance under highly dynamic conditions (e.g., fast motion, heavy occlusion) has not been analyzed separately.
The selective gating mechanism of the SSM lacks detailed visualization; how the model learns specific temporal propagation patterns remains opaque.
Although FPS is higher than in Transformer approaches, it has not reached real-time speeds (e.g., only 14 FPS with the MiT-B5 backbone).
The frame stride is fixed to 3 during training, which might require adjustment for videos with widely varying frame rates.

Comparison with Related Work

vs. CFFM/MRCFA: These Transformer methods use multi-resolution cross-frame attention, leading to heavy computational demands (127-143 GFLOPs). TV3S achieves superior performance with only 54 GFLOPs using SSM.
vs. MPVSS: MPVSS uses a memory-augmented Transformer, which has lower GFLOPs but also lower accuracy; TV3S spends more compute and reaches substantially higher accuracy.
vs. VideoMamba: VideoMamba is designed for video classification and flattens the entire spatio-temporal sequence, making it suboptimal for dense segmentation. The patch-independent processing in TV3S is better suited for pixel-level tasks.
vs. VM-RNN: While both combine Mamba and temporal modeling, VM-RNN uses LSTM for temporal integration, whereas TV3S propagates the SSM hidden state across frames directly, without a separate recurrent cell.

Rating

Novelty: ⭐⭐⭐⭐ First to apply Mamba SSM in a patch-independent manner to video semantic segmentation, with a cleverly integrated shifted window mechanism.
Experimental Thoroughness: ⭐⭐⭐⭐ Solid multi-backbone validation and comprehensive ablation studies, though in-depth analysis on very long videos is slightly lacking.
Writing Quality: ⭐⭐⭐⭐ Clear architectural diagram, but some formula typesetting could be more standardized.
Value: ⭐⭐⭐⭐ Provides an efficient new paradigm for VSS, demonstrating that the potential of SSMs in dense video prediction is worth further exploring.