MS-Temba: Multi-Scale Temporal Mamba for Understanding Long Untrimmed Videos

Conference: CVPR 2026
Paper: CVF Open Access
Code: https://mstemba.github.io (Project Page)

Area: Video Understanding
Keywords: Temporal Action Detection, State Space Models, Mamba, Multi-scale Modeling, Long Video

TL;DR

MS-Temba transforms the Mamba State Space Model into a "Multi-Scale Dilated SSM" by stacking parallel branches with varying temporal dilation rates into a hierarchical structure, then uses a lightweight Mamba fuser to unify multi-scale features. With only 17M parameters, it achieves SOTA in Temporal Action Detection (TAD) on 40-minute-long, densely annotated daily activity videos, reducing parameters by 5x compared to Transformer-based solutions.

Background & Motivation

Background: Temporal Action Detection (TAD) in long untrimmed videos requires a model to identify "which actions are occurring + when they start and end" at every timestep. Activities of Daily Living (ADL, e.g., home care, smart homes) are particularly challenging; a video can reach 40 minutes, where atomic actions lasting seconds (e.g., "drinking water") are interleaved with activities lasting tens of minutes (e.g., "using a laptop"), often with multiple concurrent actions (e.g., "walking while using a phone").

Limitations of Prior Work: Mainstream TAD frameworks based on Temporal CNNs or Transformers face bottlenecks. Temporal convolutions (PDAN, TGM) have limited receptive fields and struggle to capture long-range dependencies across whole videos even with dilated kernels. Transformers (MS-TCT, MLAD) model long-range dependencies via global self-attention, but their quadratic complexity is prohibitive for ultra-long sequences, and the models are heavy (MS-TCT has 87M parameters). State Space Models (SSMs) like Mamba offer linear complexity but typically scan at a single temporal scale, mixing instantaneous fine-grained actions with background noise and losing fine temporal structure.

Key Challenge: A single-scale scan with a fixed receptive field cannot simultaneously handle "precision for short actions" and "breadth for long actions," which is critical for dense, overlapping ADL scenarios. Models are either too granular to see long-range structure or too broad to distinguish short action boundaries.

Goal: Restore sensitivity to multiple temporal scales while retaining the linear efficiency of Mamba, addressing three challenges: long-range dependencies (Challenge 1), cross-scale actions and intra-class temporal variance (Challenge 2), and dense overlapping actions (Challenge 3).

Key Insight: Instead of scanning the video with a fixed receptive field, allow multiple SSM branches to scan at different temporal dilation rates (\(\eta\)). Short-stride branches focus on instantaneous fine-grained actions, while long-stride branches capture long-range dependencies. This brings the concept of "dilation" from dilated convolutions into the SSM scanning process for the first time.

Core Idea: Replace single-scale scanning with "Dilated SSMs," stacked into hierarchical Temba blocks with increasing dilation, followed by a Mamba fuser for additive fusion and SSM refinement, maintaining linear efficiency while restoring temporal resolution.

Method

Overall Architecture

MS-Temba takes a long untrimmed video and outputs multi-label action predictions for each segment. The pipeline consists of four steps: first, a frozen visual backbone extracts a token sequence; second, a stack of Temba Blocks uses parallel dilated SSMs to learn representations across different temporal scales, with deeper blocks having larger dilation and wider feature dimensions; third, a Multi-Scale Mamba Fuser (MS-Fuser) projects multi-scale features to a common dimension for additive fusion and SSM refinement; finally, a Classification Head generates multi-label predictions per segment.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Untrimmed Long Video<br/>Split into T segments"] --> B["Visual Backbone<br/>(I3D / CLIP, Frozen)"]
    B --> C["Dilated SSM Hierarchical Blocks<br/>Temba Block k (η=k)"]
    C --> D["Scale-Aware Aux Supervision<br/>Per-block Head + Consistency Loss"]
    D --> C
    C --> E["Multi-Scale Mamba Fuser<br/>Project→Sum→SSM Refine"]
    E --> F["Classification Head<br/>Per-segment Multi-label Prediction"]

Key Designs

1. Dilated SSM: Parallel Multi-scale Receptive Fields in One Block

The Dilated SSM addresses the mixing of short actions and long backgrounds. Given an input sequence \(x_k \in \mathbb{R}^{B\times T\times D_k}\), a non-parametric, invertible mapping \(\Phi_\eta\) splits the time axis into \(\eta\) non-overlapping sub-sequences based on stride \(\eta\). The \(i\)-th sub-sequence takes tokens at positions \(t = i + j\eta\) (\(0\le j < \lceil T/\eta\rceil\)). This splits the video into \(\eta\) temporal streams with different "dilation phases."

Each sub-sequence is processed by an independently parameterized SSM branch \((A^{(i)}, B^{(i)}, C^{(i)})\):

\[h_t^{(i)} = A^{(i)} h_{t-1}^{(i)} + B^{(i)} x_t^{(i)}, \quad y_t^{(i)} = C^{(i)} h_t^{(i)}\]

These branches learn complementary receptive fields. The outputs are reassembled into the original order via \(\Phi_\eta^{-1}\). Since \(\Phi_\eta^{-1}(\Phi_\eta(x))=x\), this step is bijective and preserves temporal fidelity, embedding multi-scale receptive fields into the linear Mamba scan.
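The rearrange-scan-restore pattern described above can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: each branch is a scalar linear recurrence with fixed `(a, b, c)` rather than a selective Mamba layer, and the function names `split_by_dilation`, `merge_by_dilation`, `scan_ssm`, and `dilated_ssm` are hypothetical.

```python
def split_by_dilation(tokens, eta):
    """Phi_eta: stream i holds the tokens at positions i, i + eta, i + 2*eta, ..."""
    return [tokens[i::eta] for i in range(eta)]


def merge_by_dilation(streams, length):
    """Inverse of split_by_dilation: interleave the streams back into original order."""
    eta = len(streams)
    merged = [None] * length
    for i, stream in enumerate(streams):
        merged[i::eta] = stream
    return merged


def scan_ssm(xs, a, b, c):
    """Toy scalar SSM (assumption): h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def dilated_ssm(xs, branch_params):
    """Scan each dilation phase with its own (a, b, c), then restore time order."""
    eta = len(branch_params)
    streams = split_by_dilation(xs, eta)
    outputs = [scan_ssm(s, *p) for s, p in zip(streams, branch_params)]
    return merge_by_dilation(outputs, len(xs))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
print(split_by_dilation(xs, 2))                          # two phases of the sequence
print(dilated_ssm(xs, [(0.5, 1.0, 1.0), (0.0, 1.0, 2.0)]))
```

Because the split only permutes positions, the total work stays linear in \(T\): every token is scanned exactly once, by exactly one branch.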

2. Projection Consistency Alignment: Preventing Semantic Drift

Because the branches scan asynchronously with independent parameters, "scale drift" (inconsistent activation patterns for the same action across branches) may occur. The authors argue that if an action activates at time \(t_s\) in one branch, the neighboring segments \(t_{s-1}, t_{s+1}\), which fall in adjacent branches, should show similar patterns. A pairwise consistency loss is therefore applied to the branches' output projection matrices, written \(C_i\) for branch \(i\):

\[\mathcal{L}_{cons} = 1 - \mathrm{sim}(\hat{C}_i, \hat{C}_j), \quad i \neq j\]

where \(\hat{C}_i, \hat{C}_j\) are flattened and \(\ell_2\)-normalized projection matrices, and \(\mathrm{sim}\) is cosine similarity. This encourages semantic alignment across scales while preserving temporal diversity.
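As a minimal sketch of this loss, the snippet below flattens two projection matrices, computes their cosine similarity, and returns one minus that value. The helper names are hypothetical, and averaging over all branch pairs in `mean_pairwise_consistency` is an assumption; the summary above only gives the per-pair form.

```python
import math
from itertools import combinations


def flatten(matrix):
    """Flatten a list-of-rows matrix into a single vector."""
    return [v for row in matrix for v in row]


def cosine_similarity(u, v):
    """Cosine similarity, equal to the dot product of the l2-normalized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consistency_loss(c_i, c_j):
    """L_cons for one pair of branches: 1 - sim(C_i, C_j)."""
    return 1.0 - cosine_similarity(flatten(c_i), flatten(c_j))


def mean_pairwise_consistency(matrices):
    """Assumption: average the pairwise loss over all i < j branch pairs."""
    pairs = list(combinations(matrices, 2))
    return sum(consistency_loss(a, b) for a, b in pairs) / len(pairs)


aligned = consistency_loss([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]])
orthogonal = consistency_loss([[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]])
print(aligned, orthogonal)
```

The loss depends only on the direction of the flattened matrices, so two branches whose projections differ by a scale factor incur (numerically) zero penalty.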

3. Scale-Aware Auxiliary Supervision + Hierarchy: Forcing Scale Specialization

Multiple Temba blocks are stacked with increasing dilation: the \(k\)-th block uses \(\eta = k\), so that \(z_k = \mathrm{Temba}_{\eta=k}(z_{k-1})\). Each block first applies a linear projection \(x_k = W_k z_{k-1} + \beta_k\), which expands the feature dimension by a ratio \(\gamma\) in deeper blocks. To ensure each block focuses on its assigned scale, each block includes a lightweight head producing a block-level prediction \(\hat{Y}_k\), supervised by a binary cross-entropy auxiliary loss:

\[\mathcal{L}_{aux}^{(k)} = \mathrm{BCE}(\hat{Y}_k, Y)\]

This explicit supervision maintains the inductive bias of each block towards its specific dilation rate.
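For concreteness, a multi-label binary cross-entropy between one block's per-segment predictions and the ground truth could be computed as below. This is a sketch under assumptions: predictions are taken to be probabilities (after a sigmoid), the loss is averaged over all segment-class entries, and `bce` and `eps` are hypothetical names.

```python
import math


def bce(predictions, targets, eps=1e-7):
    """Mean binary cross-entropy over a T x C grid of probabilities and 0/1 labels."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(predictions, targets):
        for p, y in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count


# Two segments, two classes; several classes may be active at once (multi-label).
block_prediction = [[0.9, 0.2], [0.6, 0.7]]
labels = [[1, 0], [1, 1]]
print(bce(block_prediction, labels))
```

Applying the same function to the prediction of every block gives one auxiliary term per block, all against the same labels.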

4. Multi-Scale Mamba Fuser (MS-Fuser): Additive Fusion and SSM Refinement

The MS-Fuser projects block outputs \(z_k\) to a unified dimension \(E\) (\(\tilde{z}_k = W_k^f z_k + \beta_k^f\)) and performs direct summation: \(z_f = \sum_{k=1}^{K} \tilde{z}_k\). The fused feature passes through an SSM for cross-scale refinement:

\[h_t^f = A^f h_{t-1}^f + B^f z_t^f, \quad y_t^f = C^f h_t^f\]

This combines fine-grained cues from early blocks with coarse-grained background from later blocks. Ablations show "Summation + SSM" outperforms "Concatenation + Projection" by aligning features in the same space.
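The project-sum-refine sequence can be illustrated with a small pure-Python sketch. It is not the authors' code: the refinement here is a fixed per-channel scalar recurrence standing in for a Mamba layer, and `project`, `fuse`, and `refine` are hypothetical names.

```python
def project(z, weight, bias):
    """Apply W z_t + bias at every timestep; z is T x D, weight is E x D."""
    return [
        [sum(w * x for w, x in zip(row, z_t)) + b for row, b in zip(weight, bias)]
        for z_t in z
    ]


def fuse(block_outputs, projections):
    """Project every block output to the shared width E, then sum element-wise."""
    projected = [project(z, w, b) for z, (w, b) in zip(block_outputs, projections)]
    steps, width = len(projected[0]), len(projected[0][0])
    return [
        [sum(p[t][e] for p in projected) for e in range(width)]
        for t in range(steps)
    ]


def refine(z_f, a, b, c):
    """Toy per-channel SSM (assumption): h_t = a*h_{t-1} + b*z_t, y_t = c*h_t."""
    h, ys = [0.0] * len(z_f[0]), []
    for z_t in z_f:
        h = [a * h_e + b * z_e for h_e, z_e in zip(h, z_t)]
        ys.append([c * h_e for h_e in h])
    return ys


z1 = [[1.0], [2.0]]              # block 1 output: T=2, D_1=1
z2 = [[1.0, 1.0], [0.0, 2.0]]    # block 2 output: T=2, D_2=2
projections = [([[2.0]], [0.0]), ([[1.0, 1.0]], [1.0])]  # both map to E=1
fused = fuse([z1, z2], projections)
print(fused, refine(fused, 0.5, 1.0, 1.0))
```

Summation requires every block to keep the same number of timesteps, which is consistent with the finding below that native temporal resolution is maintained in every block.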

Loss & Training

The total objective weights three losses:

\[\mathcal{L} = \mathcal{L}_{BCE} + \alpha \mathcal{L}_{cons} + \frac{\beta}{K}\sum_{k=1}^{K}\mathcal{L}_{aux}^{(k)}\]

where \(\mathcal{L}_{BCE}\) is the main loss, \(\alpha=100.0\), and \(\beta=1\). Training uses \(K=3\) Temba blocks, expansion ratio \(\gamma=1.5\), and a state dimension of 16. Backbones (I3D or CLIP-L/14) remain frozen.
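To make the weighting concrete, the objective can be evaluated for given scalar loss values. The weights \(\alpha=100.0\), \(\beta=1\) and \(K=3\) come from the text above; the function name `total_loss` and the example loss values are purely illustrative.

```python
def total_loss(main_bce, consistency, aux_losses, alpha=100.0, beta=1.0):
    """L = L_BCE + alpha * L_cons + (beta / K) * sum of the K per-block aux losses."""
    k = len(aux_losses)
    return main_bce + alpha * consistency + (beta / k) * sum(aux_losses)


# Illustrative values only: a small consistency term is scaled up by alpha=100.
loss = total_loss(main_bce=0.5, consistency=0.001, aux_losses=[0.6, 0.5, 0.4])
print(loss)
```

With these made-up numbers the consistency term contributes 0.1 and the averaged auxiliary term 0.5, which shows how a large \(\alpha\) brings a numerically small cosine-based loss onto a scale comparable to the BCE terms.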

Key Experimental Results

Main Results

Performance on TSU (51 classes, dense concurrency, 21 min avg) and Charades (157 classes):

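| Dataset | Backbone | Method | Params (M) | mAP |
|---|---|---|---|---|
| TSU | I3D | MS-TCT | 87 | 33.7 |
| TSU | I3D | DualDETR | 21 | 34.8 |
| TSU | I3D | Ours | 17 | 36.1 |
| TSU | CLIP | MS-TCT | 87 | 40.6 |
| TSU | CLIP | Ours | 17 | 44.0 |
| Charades | I3D | MS-TCT | 87 | 25.4 |
| Charades | I3D | Ours | 17 | 25.4 |
| Charades | CLIP | MS-TCT | 87 | 31.9 |
| Charades | CLIP | Ours | 17 | 33.6 |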

Gain: Using a CLIP backbone, MS-Temba reaches 44.0 mAP on TSU (vs 40.6 for MS-TCT) with 5x fewer parameters. Localization accuracy is higher across all tIoU thresholds, indicating superior boundary alignment.
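The "5x" figure and the per-setting gains follow directly from the numbers in the table; the short script below just redoes that arithmetic against MS-TCT (variable names are illustrative).

```python
PARAMS_MS_TCT_M = 87    # MS-TCT parameters, in millions
PARAMS_MS_TEMBA_M = 17  # MS-Temba parameters, in millions

# (dataset, backbone) -> (MS-TCT mAP, MS-Temba mAP), copied from the table.
results = {
    ("TSU", "I3D"): (33.7, 36.1),
    ("TSU", "CLIP"): (40.6, 44.0),
    ("Charades", "I3D"): (25.4, 25.4),
    ("Charades", "CLIP"): (31.9, 33.6),
}

param_ratio = PARAMS_MS_TCT_M / PARAMS_MS_TEMBA_M
gains = {key: round(ours - base, 1) for key, (base, ours) in results.items()}

print(f"parameter ratio: {param_ratio:.1f}x")
for (dataset, backbone), gain in gains.items():
    print(f"{dataset} / {backbone}: {gain:+.1f} mAP")
```

The gain is largest on TSU with CLIP features, while on Charades with I3D the two methods tie at 25.4 mAP, so the advantage there is parameter count only.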

Ablation Study

| Configuration | TSU mAP | Charades mAP | Description |
|---|---|---|---|
| No temporal modeling | 24.7 | 22.9 | CLIP baseline |
| + Mamba encoder | 40.2 | 32.4 | Massive gain |
| + Dilated SSM | 42.5 | 32.5 | TSU +2.3 |
| + MS-Fuser (Full) | 44.0 | 33.6 | Final gain |

| \(\mathcal{L}_{cons}\) | \(\mathcal{L}_{aux}\) | Avg mAP | Description |
|---|---|---|---|
|  |  | 41.9 | BCE only |
| ✓ |  | 41.9 | Consistency alone lacks impact |
|  | ✓ | 42.8 | Aux supervision is primary driver |
| ✓ | ✓ | 43.8 | Best combined (+1.9) |

Key Findings

  • Dilated SSM role specialization: Temba Block 1 (small dilation) is stronger for actions <10s, while Block 3 (large dilation) is stronger for actions >20s.
  • Auxiliary loss importance: \(\mathcal{L}_{aux}\) improves performance by ensuring intermediate representations align with temporal semantics.
  • Optimal depth: 3 blocks provide the best trade-off. A 4th block leads to performance degradation.
  • No temporal scaling: Pooling or strided convolutions hurt performance. Maintaining native temporal resolution in every block is optimal.
  • Summation > Concatenation: Summation aligns multi-scale features in a unified space for better SSM refinement.

Highlights & Insights

  • Clean migration of "Dilation" to SSM: The "rearrange-scan-restore" pattern using \(\Phi_\eta\) allows multi-scale receptive fields without breaking Mamba's linear scanning logic. This can be adapted to any SSM-based multi-scale task.
  • Parameter Efficiency: 17M vs 87M parameters shows that TAD's bottleneck is often the "temporal inductive bias" rather than raw parameter count.
  • Effective Auxiliary Supervision: Direct supervision on intermediate blocks forces each scale to learn specialized features, a strategy broadly applicable to hierarchical networks.
  • Generalization: MS-Temba also achieves SOTA on video summarization (SumMe, TVSum), suggesting that it works as a general framework for long video understanding rather than a TAD-specific design.

Limitations & Future Work

  • Heuristic Dilation: Setting \(\eta=k\) is simple but may not be optimal for all datasets compared to learnable dilation.
  • Depth Constraints: Performance drops beyond 3 blocks, which may limit long-range coverage for extremely long sequences.
  • Frozen Backbone: The two-stage training prevents joint optimization of visual features and temporal modeling.
  • Hyperparameter Sensitivity: The high weight \(\alpha=100.0\) suggests the consistency loss may be sensitive and may require careful tuning across datasets.

Comparison with Prior Work

  • vs MS-TCT: MS-TCT uses CNNs and attention. MS-Temba matches or exceeds it with much higher parameter efficiency and better scalability for ultra-long videos.
  • vs Video Mamba Suite: Existing video SSMs usually focus on short clips of about 3 minutes with global classification. MS-Temba is presented as the first to apply Mamba to 40-minute untrimmed videos for dense temporal localization.
  • vs Temporal Convolutions: Unlike dilated CNNs that have local kernels, Dilated SSMs propagate information across the entire sequence within the state space, improving long-action modeling.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐