Efficient Video Object Segmentation and Tracking with Recurrent Dynamic Submodel
Conference: CVPR 2026
Paper: CVF Open Access
Code: TBD
Area: Video Object Segmentation
Keywords: Video Object Segmentation, Dynamic Networks, SAM2 Acceleration, Block Skipping, LoRA Fine-tuning
TL;DR
To address the slow inference of large video segmentation models such as SAM2, this paper introduces a "Predictive-Aware Router" that takes the previous segmentation mask and the current visual features and activates only a subset of blocks per frame. Combined with "Importance-Aware LoRA", which fine-tunes only critical blocks, it achieves a 1.3× real-world speedup on DAVIS 2017 with a J&F drop of 0.4 points (90.0 → 89.6), while training only about 3% of the parameters.
Background & Motivation
Background: State-of-the-art (SOTA) Video Object Segmentation and Tracking (VOST) increasingly relies on vision foundation models like SAM2. While these models offer strong zero-shot generalization and stable tracking, their massive computational overhead makes real-time deployment difficult in resource-constrained scenarios.
Limitations of Prior Work: Existing efficiency methods follow two main paths, both with drawbacks. First, static pruning (e.g., SlimSAM, PBD) removes fixed channels or blocks regardless of the specific frame; however, different tasks and frames prefer different "key blocks." As shown in Figure 1(a), two static models with identical pruning rates may have similar average accuracy but vastly different per-frame performance, indicating that a "one-size-fits-all" approach is suboptimal. Second, dynamic networks (e.g., AdaViT, DyT) assign a lightweight router to each block to decide activation based on intermediate features. While theoretical FLOPs decrease, the cumulative overhead of "dense routing" often results in slower real-world inference (Figure 1(b)).
Key Challenge: Static methods ignore temporal heterogeneity, while dynamic methods struggle with dense routing overhead. Furthermore, most methods treat videos as a collection of independent images, failing to leverage the temporal correlation between adjacent frames. SAM2’s own memory mechanism demonstrates that the previous frame's prediction is a powerful prior for the current one.
Goal: (1) Enable frame-adaptive block selection with minimal routing overhead for real-world speedup; (2) Explicitly incorporate temporal priors into routing decisions; (3) Minimize the parameter count required to adapt to this dynamic architecture.
Key Insight: The authors observe that SAM2 segmentation is conditioned on historical memory (\(M_t = D(E(I_t, M_{t-1}))\)). The previous mask \(M_{t-1}\) already indicates the target's location and appearance, so it serves as a free temporal routing signal. Additionally, block importance is non-uniform: a "core subset" of blocks is critical across samples.
Core Idea: Replace multiple routers with a single Predictive-Aware Router that fuses the previous mask and current features to output a complete block activation plan for the entire model. Use Importance-Aware LoRA (I-LoRA) to inject trainable parameters only into core blocks, minimizing adaptation costs.
Method
Overall Architecture
RDS (Recurrent Dynamic Submodel) adds two components to a frozen SAM2 encoder. During inference on each frame \(I_t\), the current patch embeddings \(X_t\) and the previous mask \(M_{t-1}\) are fused into a spatio-temporal feature \(Y_t\). This feature is fed into a single Predictive-Aware Router (PAR), which outputs a binary routing vector of length \(L\) (\(L=48\) for SAM2) that decides which blocks to activate and which to skip. Skipped blocks act as identity mappings, so only a "submodel" is executed per frame. On the training side, Importance-Aware LoRA (I-LoRA) identifies the most critical blocks offline and restricts fine-tuning to the adapters in those blocks and the router.
The system is a recurrent structure where the output mask serves as the prior for the next frame's routing decision.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Current Frame I_t<br/>+ Prev. Mask M_{t-1}"] --> B["Predictive-Aware Router PAR<br/>Fuses temporal+visual, single routing"]
B -->|Binary routing vector r_t| C["Dynamic Submodel<br/>Activates selected blocks h_i, skips others via identity"]
C --> D["Segmentation Mask M_t"]
D -->|As next frame temporal prior| A
E["Importance-Aware LoRA I-LoRA<br/>Dual-perspective key block selection"] -.Inject adapters during training.-> C
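The recurrent loop in the diagram can be sketched in a few lines. This is a toy illustration only: features and masks are scalars, and `track`, `run_submodel`, the router, and the decoder are hypothetical stand-ins rather than SAM2 or RDS code. It shows the two structural points made above: the router is called once per frame, and the predicted mask is fed back as the prior for the next frame.

```python
from typing import Callable, List, Sequence, Tuple

Block = Callable[[float], float]


def run_submodel(x: float, blocks: Sequence[Block], routing: Sequence[int]) -> float:
    """Run only the active blocks; a skipped block is an identity mapping."""
    for block, active in zip(blocks, routing):
        if active:
            x = block(x)
    return x


def track(
    frames: Sequence[float],
    blocks: Sequence[Block],
    router: Callable[[float, float], List[int]],
    decoder: Callable[[float], float],
    init_mask: float,
) -> Tuple[List[float], int]:
    """Recurrent loop: the mask predicted for frame t routes frame t+1."""
    mask, masks, router_calls = init_mask, [], 0
    for frame in frames:
        routing = router(frame, mask)  # one routing pass per frame, not per block
        router_calls += 1
        features = run_submodel(frame, blocks, routing)
        mask = decoder(features)  # M_t becomes the temporal prior for the next frame
        masks.append(mask)
    return masks, router_calls


# Toy stand-ins (all hypothetical): scalar "features" and four +1 blocks.
blocks = [lambda x: x + 1.0] * 4
router = lambda frame, mask: [1, 0, 1, 1] if mask > 0.5 else [1, 1, 1, 1]
decoder = lambda features: 1.0 if features >= 3.0 else 0.0

masks, router_calls = track([0.0, 0.5, 1.0], blocks, router, decoder, init_mask=1.0)
print(masks, router_calls)  # [1.0, 1.0, 1.0] 3
```

In a real model the blocks would be the encoder's transformer blocks and the decoder would be SAM2's mask decoder; only the control flow is meant to carry over.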
Key Designs
1. Predictive-Aware Router (PAR): Using Mask as Temporal Prior with Single-Pass Routing
PAR addresses the overhead of dense routing and the neglect of temporal cues. It takes two inputs: the current patch embeddings \(X_t \in \mathbb{R}^{H\times W\times C}\) and the previous binary mask \(M_{t-1} \in \{0,1\}^{H\times W\times 1}\). The two are fused into the spatio-temporal feature \(Y_t\) using alignment convolutions \(f, g\) and a learnable scalar weight \(\alpha\).
The fused feature \(Y_t\) then passes through a router \(\phi\) (convolutions + pooling + linear) to produce logits \(l_t \in \mathbb{R}^L\), one per block.
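A form consistent with these definitions, given only as an assumption since the paper's exact equation may weight or combine the terms differently, is additive fusion followed by the router:

\[
Y_t = f(X_t) + \alpha \cdot g(M_{t-1}), \qquad l_t = \phi(Y_t) \in \mathbb{R}^{L}
\]

Here \(g(M_{t-1})\) is assumed to project the single-channel mask to the same \(H\times W\times C\) shape as \(f(X_t)\), so that the sum is well defined.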
To enable end-to-end training of discrete decisions, the authors use a Straight-Through Estimator (STE) with Bernoulli sampling. Logits are normalized using a target activation ratio \(\beta\in(0,1]\) to obtain per-block activation probabilities \(p^i_t\).
This normalization constrains the expected number of active blocks to approximately \(\beta L\), eliminating the need for a separate sparsity penalty loss. A small subset of "core blocks" \(K\) is kept permanently active for stability; binary decisions \(r^i_t\) are sampled for the remaining blocks. Feature propagation for block \(i\) is defined as \(X^{i+1}_t = r^i_t \cdot h_i(X^i_t) + (1-r^i_t)\cdot X^i_t\), so a skipped block (\(r^i_t = 0\)) reduces to the identity. Unlike prior dense-routing work, the router is executed only once per frame for the whole model, reducing routing latency from ~17.2 ms to ~0.5 ms (Table 4).
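The sketch below illustrates the budgeted sampling idea. The normalization used here (rescaling sigmoid scores so they sum to \(\beta L\), then clipping at 1) is an assumed form chosen to satisfy the stated property, not necessarily the paper's formula. The function names are hypothetical, and the STE backward pass is omitted because it requires an autograd framework.

```python
import math
import random
from typing import Callable, List, Sequence, Set


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def normalized_probs(logits: Sequence[float], beta: float) -> List[float]:
    """Assumed normalization: rescale sigmoid scores to sum to beta * L, clip at 1."""
    scores = [sigmoid(l) for l in logits]
    scale = beta * len(logits) / sum(scores)
    return [min(1.0, s * scale) for s in scores]


def sample_routing(probs: Sequence[float], core: Set[int], rng: random.Random) -> List[int]:
    """Bernoulli-sample each decision r_t^i; core blocks are always active."""
    return [1 if i in core else int(rng.random() < p) for i, p in enumerate(probs)]


def propagate(x: float, blocks: Sequence[Callable[[float], float]], routing: Sequence[int]) -> float:
    """X^{i+1} = r * h_i(X^i) + (1 - r) * X^i; the block call is skipped when r = 0."""
    for block, r in zip(blocks, routing):
        x = block(x) if r else x
    return x


logits = [2.0, 0.0, -1.0, 1.0, 0.5, -0.5]  # toy logits for L = 6 blocks
probs = normalized_probs(logits, beta=0.5)
routing = sample_routing(probs, core={1}, rng=random.Random(0))

print(round(sum(probs), 6))  # 3.0, i.e. beta * L expected active blocks
print(routing)               # block 1 is always 1 because it is in the core set
```

Because the probabilities already sum to the budget, no extra loss term is needed to push the router toward the target sparsity, which is the point the paragraph above makes.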
2. Importance-Aware LoRA (I-LoRA): Dual-Perspective Identification of Core Blocks
To minimize adaptation costs, only the most critical blocks receive LoRA adapters. Importance is measured via two complementary perspectives:
- Global Importance \(I^i_G\): Measures the impact of removing block \(i\) on the final J&F score: \(I^i_G = \mathbb{E}[\text{J\&F}(O) - \text{J\&F}(O^i)]\).
- Local Importance \(I^i_L\): Measures the feature transformation magnitude via cosine similarity: \(I^i_L = 1 - \mathbb{E}\!\left[\frac{\langle X^i, X^{i+1}\rangle}{\|X^i\|\cdot\|X^{i+1}\|}\right]\).
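Blocks are ranked by \(s(i) = -(r_G(i)+r_L(i))\), where \(r_G(i)\) and \(r_L(i)\) are block \(i\)'s ranks under the global and local measures, so blocks that rank near the top on both views score highest. LoRA adapters are injected into the attention projections \(W_Q, W_K, W_V, W_O\) and the MLP weights \(W_U, W_D\) of the top-ranked blocks. PAR thus targets fast inference, while I-LoRA targets efficient training.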
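The dual-perspective ranking can be sketched as follows. The helper names and toy numbers are hypothetical, and the convention that rank 1 means "most important" is an assumption chosen so that \(s(i) = -(r_G(i)+r_L(i))\) is largest for blocks that rank highly on both views.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def local_importance(inputs: Sequence[Sequence[float]], outputs: Sequence[Sequence[float]]) -> float:
    """I_L = 1 - mean cosine similarity between a block's input and output features."""
    sims = [cosine_similarity(x, y) for x, y in zip(inputs, outputs)]
    return 1.0 - sum(sims) / len(sims)


def global_importance(full_scores: Sequence[float], ablated_scores: Sequence[float]) -> float:
    """I_G = mean J&F drop when the block is removed."""
    drops = [full - ablated for full, ablated in zip(full_scores, ablated_scores)]
    return sum(drops) / len(drops)


def ranks(values: Sequence[float]) -> List[int]:
    """Rank 1 = largest value (assumed convention: most important block)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def select_core_blocks(global_imp: Sequence[float], local_imp: Sequence[float], k: int) -> List[int]:
    """Pick the k blocks with the highest s(i) = -(r_G(i) + r_L(i))."""
    scores = [-(g + l) for g, l in zip(ranks(global_imp), ranks(local_imp))]
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return sorted(top)


# Toy importance values for four blocks (made up for illustration).
global_imp = [0.5, 3.0, 0.1, 2.0]
local_imp = [0.02, 0.30, 0.05, 0.40]
print(select_core_blocks(global_imp, local_imp, k=2))  # [1, 3]
```

Summing ranks rather than raw values avoids having to put a J&F difference and a cosine distance on a common scale.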
3. Training Strategy: Normalization Over Sparsity Penalties
Training involves two steps: (i) I-LoRA identification, which uses a small subset of data to locate the core blocks; and (ii) dynamic component fine-tuning, which jointly trains the router and the adapters. Keeping \(\beta\) fixed during training sets an explicit compute budget of about \(\beta L\) active blocks per frame. At inference, routing decisions come from either Bernoulli sampling (stochastic) or thresholding (deterministic, \(\tau=0.5\)).
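The two inference modes can be sketched as a single function. The name `route`, its parameters, and the choice of `>=` at the threshold boundary are assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence, Set


def route(
    probs: Sequence[float],
    core: Set[int],
    deterministic: bool = True,
    tau: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Turn activation probabilities into a binary routing vector."""
    rng = rng or random.Random()
    decisions = []
    for i, p in enumerate(probs):
        if i in core:
            decisions.append(1)                      # core blocks are always active
        elif deterministic:
            decisions.append(int(p >= tau))          # thresholding
        else:
            decisions.append(int(rng.random() < p))  # Bernoulli sampling
    return decisions


print(route([0.9, 0.2, 0.6, 0.4], core={3}))  # [1, 0, 1, 1]
```

Thresholding gives reproducible outputs for deployment, while sampling matches the stochastic behavior seen during training.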
Key Experimental Results
Main Results
Comparing RDS with other strategies on SAM2 for DAVIS 2017:
| Method | Tunable Params (M) | Training Data | FLOPs (G)↓ | FPS↑ | J&F↑ |
|---|---|---|---|---|---|
| SAM2 (Base) | 224 | 100% | 819 | 29.8 | 90.0 |
| SlimSAM | 147 | >5% | 547 | 33.4 | 87.8 |
| PBD | 7.6 | <0.03% | 500 | 39.8 | 85.6 |
| DyT | 20.5 | <0.03% | 629 | 22.3 | 90.2 |
| AdaViT | 9.6 | <0.03% | 528 | 26.3 | 88.3 |
| RDS (Ours) | 7.6 | <0.03% | 500 | 38.2 | 89.6 |
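RDS reaches 38.2 FPS (about a 1.3× speedup over SAM2's 29.8 FPS) with only 7.6M tunable parameters (about 3% of the total), while J&F stays at 89.6. DyT attains a slightly higher J&F (90.2) but, at 22.3 FPS, is slower than the original SAM2 because of routing overhead; AdaViT (26.3 FPS) is slower as well. PBD is marginally faster (39.8 FPS) at the same 500 GFLOPs but loses 4.4 J&F points relative to SAM2.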
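The headline ratios follow directly from the table. The short check below uses only the numbers reported above; treating the 224M figure as SAM2's total parameter count is taken from the "3% of total" statement.

```python
# Values copied from the main results table (SAM2 base vs. RDS).
sam2 = {"params_m": 224.0, "flops_g": 819.0, "fps": 29.8, "jf": 90.0}
rds = {"params_m": 7.6, "flops_g": 500.0, "fps": 38.2, "jf": 89.6}

speedup = rds["fps"] / sam2["fps"]
flops_ratio = rds["flops_g"] / sam2["flops_g"]
param_fraction = rds["params_m"] / sam2["params_m"]
jf_drop = sam2["jf"] - rds["jf"]

print(f"speedup:        {speedup:.2f}x")        # 1.28x
print(f"FLOPs retained: {flops_ratio:.2f}")     # 0.61
print(f"tunable params: {param_fraction:.1%}")  # 3.4%
print(f"J&F drop:       {jf_drop:.1f} points")  # 0.4 points
```

The FLOPs reduction (about 39%) is larger than the wall-clock gain (about 28%), which is a reminder that theoretical savings do not translate one-to-one into FPS.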
Cross-model Transferability:
| Model | Config | DAVIS J&F | SA-V G | YTVOS | FPS |
|---|---|---|---|---|---|
| SAM2 | Base | 90.0 | 78.5 | 88.5 | 29.8 |
| SAM2 | w/ RDS₀.₆ | 89.6 | 74.3 | 86.0 | 38.2 |
| SAMURAI | Base | 90.1 | 78.8 | 88.3 | 28.5 |
| SAMURAI | w/ RDS₀.₆ | 90.1 | 74.7 | 85.6 | 35.7 |
Ablation Study
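To make the transfer results easier to compare, the per-benchmark changes can be computed from the table. The dictionary layout and key names below are illustrative; the numbers are the reported ones.

```python
# Values copied from the cross-model transferability table.
results = {
    "SAM2": {
        "base": {"davis": 90.0, "sav": 78.5, "ytvos": 88.5, "fps": 29.8},
        "rds": {"davis": 89.6, "sav": 74.3, "ytvos": 86.0, "fps": 38.2},
    },
    "SAMURAI": {
        "base": {"davis": 90.1, "sav": 78.8, "ytvos": 88.3, "fps": 28.5},
        "rds": {"davis": 90.1, "sav": 74.7, "ytvos": 85.6, "fps": 35.7},
    },
}

summary = {}
for model, cfg in results.items():
    base, rds = cfg["base"], cfg["rds"]
    summary[model] = {
        "davis_delta": round(rds["davis"] - base["davis"], 1),
        "sav_delta": round(rds["sav"] - base["sav"], 1),
        "ytvos_delta": round(rds["ytvos"] - base["ytvos"], 1),
        "speedup": round(rds["fps"] / base["fps"], 2),
    }
    print(model, summary[model])
```

For both backbones the accuracy change is small on DAVIS and several points larger on SA-V and YTVOS, while the speedup stays between 1.25× and 1.28×.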
| Config | Result | Insight |
|---|---|---|
| Single vs. Dense Router | Latency 17.2ms → 0.5ms | Critical for real acceleration (Table 4) |
| Input: Image only | J&F drop | Susceptible to background clutter |
| Input: Mask only | J&F drop | Fails to capture appearance changes |
| Input: Fusion (Ours) | 89.6 | Best balance of temporal prior + visual cues |
| I-LoRA: Dual-view | 89.6 | Better than using Global or Local alone |
Key Findings
- Real speedup depends on routing frequency: Reducing routing from a per-block task to a per-model task is the key to converting theoretical FLOP savings into wall-clock speedup.
- Short-term temporal priors are sufficient: Using only the previous frame \(M_{t-1}\) is nearly as effective as using a longer window (\(M_{t-1..t-4}\)).
- Block importance is depth-independent: Active blocks are distributed throughout the model, confirming that key blocks are a sparse core subset rather than concentrated in shallow/deep layers.
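A back-of-the-envelope model shows why routing latency decides whether FLOP savings become real speedup. It rests on two simplifying assumptions that are not from the paper: frame time scales linearly with FLOPs, and routing latency simply adds to it. The inputs are the reported SAM2 frame rate, the 500/819 GFLOPs ratio, and the Table 4 routing latencies.

```python
# Reported numbers; the linear cost model itself is an assumption.
BASE_FPS = 29.8
FLOPS_RATIO = 500.0 / 819.0
DENSE_ROUTING_MS = 17.2   # per-block routers (Table 4)
SINGLE_ROUTING_MS = 0.5   # single-pass router (Table 4)


def estimated_fps(routing_ms: float) -> float:
    """Estimate FPS as scaled backbone time plus additive routing latency."""
    base_frame_ms = 1000.0 / BASE_FPS
    compute_ms = base_frame_ms * FLOPS_RATIO
    return 1000.0 / (compute_ms + routing_ms)


dense_fps = estimated_fps(DENSE_ROUTING_MS)
single_fps = estimated_fps(SINGLE_ROUTING_MS)

print(f"dense routing : {dense_fps:.1f} FPS")   # roughly 26.5, below the 29.8 baseline
print(f"single routing: {single_fps:.1f} FPS")  # roughly 47.6, an optimistic upper bound
```

Under this rough model, dense routing eats all of the FLOP savings and ends up slower than the unmodified model, while single-pass routing keeps most of them. The measured 38.2 FPS for RDS is lower than the estimate, so the linear-in-FLOPs assumption is clearly optimistic.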
Highlights & Insights
- Diagnosing the "Theoretical vs. Real" Gap: The paper explicitly measures routing latency, identifying it as the primary bottleneck in dynamic networks.
- Compute Constraints via Normalization: Normalizing probabilities based on \(\beta\) simplifies training by removing the sensitive coefficients of a sparsity loss.
- "Zero-cost" Temporal Prior: Reusing the previous frame's mask—which is already generated by SAM2—provides an effective routing signal without extra overhead.
Limitations & Future Work
- Generalization on Large Datasets: RDS shows a larger performance drop on SA-V (78.5 → 74.3) and YTVOS (88.5 → 86.0) than on DAVIS (90.0 → 89.6), suggesting limitations on complex long-form videos.
- Static vs. Adaptive \(\beta\): The current activation ratio \(\beta\) is fixed per run; making this adaptive (activating more blocks for hard frames) could further optimize the efficiency-accuracy frontier.
- Threshold Calibration: Deterministic deployment relies on threshold \(\tau\), which may require domain-specific recalibration.
Related Work & Insights
- vs. Static Pruning (SlimSAM/PBD): Static methods are frame-agnostic. RDS adapts per frame and achieves higher accuracy for the same compute budget (89.6 vs. 85.6 J&F against PBD at 500 GFLOPs).
- vs. Dynamic Block/Token Skipping: RDS avoids the "negative acceleration" of dense routing by using a single-pass router, and it uses masks as temporal priors.
- vs. Full Fine-tuning: I-LoRA reduces GPU memory and training time by identifying and adapting only the core functional blocks.
Rating
- Novelty: ⭐⭐⭐⭐ (Single-pass router with temporal priors effectively solves dynamic network overhead)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Extensive ablations on routing latency and cross-model transfer)
- Writing Quality: ⭐⭐⭐⭐ (Clear motivation and well-structured evidence)
- Value: ⭐⭐⭐⭐ (Practical utility for deploying SAM2-based models on edge devices)