Segment Any Motion in Videos

Conference: CVPR 2025
arXiv: 2503.22268
Code: https://motion-seg.github.io/
Area: Video Understanding / Segmentation
Keywords: Moving Object Segmentation, Long-range Trajectories, SAM2, Semantic-Motion Decoupling, Video Understanding

TL;DR

This paper proposes a moving object segmentation method that combines motion cues from long-range point trajectories, DINO semantic features, and SAM2, which densifies sparse point predictions into pixel-level masks. Using spatio-temporal trajectory attention and motion-semantic decoupled embedding, it outperforms optical flow-based methods by roughly 10 points on DAVIS16-Moving and DAVIS2016 and by more than 25 points in fine-grained multi-object segmentation on DAVIS17-Moving, while trailing the best baselines on SegTrackv2 and FBMS-59.

Background & Motivation

Background: Moving Object Segmentation (MOS) is a core task in video understanding, requiring the separation of independently moving objects in a video from the background and camera motion. Traditional methods heavily rely on optical flow to provide motion cues, but optical flow is essentially short-range two-frame matching, making it susceptible to occlusions, motion blur, and illumination changes.

Limitations of Prior Work: Optical flow-based methods suffer from three main issues: (1) Short-range limitation: optical flow only covers adjacent frames, rendering it ineffective against slow motion or long-term occlusions; (2) Depth ambiguity: it is difficult to distinguish independent object motion from parallax-induced motion caused by depth differences; (3) Poor segmentation quality: the generated masks lack geometric completeness. Trajectory-based methods, on the other hand, mainly rely on spectral clustering of affinity matrices and struggle to capture global consistency and complex motion patterns.

Key Challenge: Motion cues and semantic cues both have their pros and cons—pure motion cues cannot easily distinguish camera motion from object motion (especially in scenes with intense camera motion), while pure semantic cues fail to distinguish moving and static objects within the same category. Effectively fusing these two types of information is a critical challenge.

Goal: Go beyond optical flow by using long-range point trajectories as motion cues, combining them with DINO semantic features as complementary information, and leveraging the powerful segmentation capabilities of SAM2 to generate high-quality per-object masks.

Key Insight: Long-range point trajectories (generated by BootsTAP) naturally possess resilience to occlusion and deformation; the self-supervised nature of DINO features ensures cross-domain generalization; SAM2 can perform segmentation and tracking based on point prompts. A clever combination of these three elements forms a robust unified framework.

Core Idea: Train a transformer model that takes long-range trajectories and DINO features as input, predicts dynamic labels for each trajectory through motion-semantic decoupled embedding, uses these trajectories as point prompts for SAM2, and generates fine-grained per-object masks via an iterative prompting strategy.

Method

Overall Architecture

The overall pipeline consists of three phases:
1. Motion Pattern Encoding: Long-range 2D trajectories are generated using BootsTAP and depth maps are estimated with Depth-Anything, followed by motion feature extraction using a spatio-temporal trajectory attention encoder.
2. Per-trajectory Motion Prediction: Combining DINO features, a transformer decoder with motion-semantic decoupled embedding predicts whether each trajectory corresponds to a moving object.
3. SAM2 Iterative Prompting: Trajectories identified as dynamic are used as point prompts for SAM2 to generate pixel-level per-object masks through a two-stage iterative strategy.

Key Designs

Spatio-Temporal Trajectory Attention:
- Function: Alternately executes spatial and temporal attention within the encoder to capture spatial relationships among trajectories and temporal dynamics within a single trajectory.
- Mechanism: The input is an augmented representation of long-range trajectories, including normalized pixel coordinates \((u_i, v_i)\), frame differences \((\Delta u_i, \Delta v_i)\), depth \(d_i\), depth differences \(\Delta d_i\), visibility \(\rho_i\), and confidence \(c_i\). NeRF-style frequency encoding is applied to coordinates to prevent oversmoothing. Attention layers operate alternately along the trajectory dimension (spatial) and the temporal dimension, followed by max-pooling over the temporal dimension to obtain a single feature vector for each trajectory.
- Design Motivation: Standard attention mechanisms cannot efficiently model spatial relationships and temporal dynamics simultaneously, which is addressed by this alternating attention design. Frequency encoding prevents oversmoothing of features for spatially adjacent sampling points.
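
The input construction described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the dictionary keys, the number of frequency bands, the zero difference at the first frame, and pooling the raw features (the paper pools after the attention layers) are all assumptions made for the example.

```python
import math

def frequency_encode(x, num_bands=4):
    """NeRF-style encoding: sin and cos of 2^k * pi * x for k < num_bands."""
    out = []
    for k in range(num_bands):
        freq = (2.0 ** k) * math.pi
        out.append(math.sin(freq * x))
        out.append(math.cos(freq * x))
    return out

def augment_track(track, num_bands=4):
    """Build per-frame features for one trajectory.

    `track` is a list of dicts with keys u, v (normalized pixel coordinates),
    d (depth), rho (visibility), c (confidence). Key names are illustrative.
    """
    feats = []
    prev = track[0]  # assumption: the first frame has zero differences
    for obs in track:
        du = obs["u"] - prev["u"]
        dv = obs["v"] - prev["v"]
        dd = obs["d"] - prev["d"]
        row = frequency_encode(obs["u"], num_bands) + frequency_encode(obs["v"], num_bands)
        row += [du, dv, obs["d"], dd, obs["rho"], obs["c"]]
        feats.append(row)
        prev = obs
    return feats

def temporal_max_pool(feats):
    """Element-wise max over frames, giving one vector per trajectory."""
    return [max(col) for col in zip(*feats)]
```

With `num_bands=2`, each frame yields 8 encoded coordinate values plus 6 raw values, and `temporal_max_pool` reduces any number of frames to a single 14-dimensional vector.
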

Motion-Semantic Decoupled Embedding:
- Function: Balances the utilization of motion and semantic information in the decoder, prioritizing motion cues while using semantics as auxiliary support.
- Mechanism: The encoder layer performs attention only on embedded trajectories (containing only motion information). After computing attention-weighted features, DINO features are concatenated and passed through a feed-forward layer. In the decoder layers, self-attention still operates exclusively on motion features, while multi-head attention queries a memory bank containing semantic information. Finally, a sigmoid layer outputs the dynamic probability for each trajectory.
- Design Motivation: Simply feeding DINO features as input causes the model to rely excessively on semantics—such as misclassifying static objects of the same category as dynamic. The decoupled design ensures the model remains "motion-dominant, semantically assisted," preventing semantic overfitting.
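
To illustrate the decoupling idea, the sketch below runs self-attention over motion features only and appends the semantic (DINO-like) vector afterwards. It is a toy under stated assumptions: single head, identity projections, no feed-forward layer, and hypothetical function names. The paper's actual layers are learned and more elaborate.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(feats):
    """Single-head scaled dot-product self-attention with identity projections."""
    scale = math.sqrt(len(feats[0]))
    out = []
    for q in feats:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in feats]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, feats))
                    for j in range(len(q))])
    return out

def decoupled_encoder_layer(motion, semantic):
    """Attention sees motion only; semantic features are concatenated afterwards.

    `motion` and `semantic` are lists of per-trajectory vectors (hypothetical layout).
    """
    attended = self_attention(motion)
    return [m + s for m, s in zip(attended, semantic)]  # per-trajectory list concatenation
```

Because the attention weights are computed from `motion` alone, changing `semantic` leaves the first `len(motion[0])` entries of every output row unchanged, which is the property the decoupled design relies on.
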

SAM2 Iterative Prompting:
- Function: Converts sparse dynamic trajectory points into pixel-level dense masks for each object.
- Mechanism: A two-stage process. Stage 1 (Object Grouping): Identifies the densest points in the frame with the most visible points to serve as initial prompts for SAM2. Once a mask is generated, its boundary is dilated, and points inside the mask are marked as belonging to the same object. These grouped points are excluded, and the process repeats for the remaining points until all points are grouped. Stage 2 (Mask Refinement): Uses the densest point and the two furthest points from each grouped trajectory as prompts, prompting SAM2 at intervals to prevent long-distance tracking loss. Post-processing is finally applied to merge overlapping masks.
- Design Motivation: SAM2 requires an object ID for each prompt, and assigning a single ID to all dynamic points would not separate distinct moving objects. The iterative grouping strategy solves this multi-object differentiation problem.
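
The Stage 1 grouping loop can be sketched as a greedy procedure over the dynamic points of one frame. This is only an illustration of the control flow: `segment_fn` is a hypothetical stand-in for "prompt SAM2, dilate the mask, and test membership", the neighbour-count density measure and the `radius` parameter are assumptions, and Stage 2 refinement is not shown.

```python
import math

def densest_point(points, radius):
    """Return the point with the most neighbours within `radius` (first one on ties)."""
    def count(p):
        return sum(1 for q in points if math.dist(p, q) <= radius)
    return max(points, key=count)

def group_points(points, segment_fn, radius=2.0):
    """Greedy object grouping over 2D points given as (x, y) tuples.

    `segment_fn(prompt)` must return a callable `inside(point) -> bool`.
    It is a placeholder for the real mask predictor, not a SAM2 API.
    """
    remaining = list(points)
    groups = []
    while remaining:
        seed = densest_point(remaining, radius)
        inside = segment_fn(seed)
        members = [p for p in remaining if inside(p)]
        if seed not in members:
            # Guarantee progress even if the mask misses its own prompt.
            members.append(seed)
        groups.append(members)
        remaining = [p for p in remaining if p not in members]
    return groups
```

Each iteration removes at least the seed point, so the loop terminates after at most one pass per point; the number of groups returned plays the role of the object count.
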

Loss & Training

A weighted binary cross-entropy loss is used to train the trajectory classification model. Ground truth labels are generated by checking whether trajectory sampling points fall within the dynamic mask.
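
A minimal sketch of the labeling rule and the loss follows. The source does not specify the weighting scheme or how many samples must fall inside the mask, so the majority-vote rule, the `pos_weight` parameter, and the data layout are assumptions chosen for illustration.

```python
import math

def trajectory_label(samples, dynamic_masks):
    """Return 1 if most visible samples of a trajectory lie inside the dynamic mask.

    `samples` is a list of (frame, row, col, visible) tuples and
    `dynamic_masks[frame][row][col]` is truthy for moving pixels.
    The majority-vote threshold is an assumption.
    """
    hits = [bool(dynamic_masks[f][r][c]) for f, r, c, vis in samples if vis]
    return 1 if hits and sum(hits) * 2 > len(hits) else 0

def weighted_bce(probs, labels, pos_weight=1.0, eps=1e-7):
    """Mean binary cross-entropy with an extra weight on positive (dynamic) labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)
```

Setting `pos_weight` above 1 penalizes missed dynamic trajectories more heavily, which is one common way to counter the imbalance between background and moving points.
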

The training data is a mixture of three datasets: Kubric (synthetic, 35%), Dynamic Replica (synthetic, 35%), and HOI4D (real-world, 30%). To accelerate training, 300 frames from Dynamic Replica are subsampled at 1/4 intervals while preserving patterns of large-range camera motion.
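
One plausible way to realize such a mixture is weighted sampling of the source dataset for each training example. The sketch below assumes that reading, and it interprets "1/4 intervals" as keeping every fourth frame; both are assumptions, as are all names.

```python
import random

# Mixture ratios as stated in the text.
MIXTURE = {"Kubric": 0.35, "Dynamic Replica": 0.35, "HOI4D": 0.30}

def sample_sources(n, rng):
    """Draw the source dataset for each of n training examples."""
    names = list(MIXTURE)
    return rng.choices(names, weights=[MIXTURE[k] for k in names], k=n)

def subsample(frames, step=4):
    """Keep every `step`-th frame (assumed meaning of sampling at 1/4 intervals)."""
    return frames[::step]
```

Under this reading, a 300-frame sequence is reduced to 75 frames.
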

Key Experimental Results

Main Results

Moving Object Segmentation (MOS, combined foreground evaluation):

Method	Motion Cues	Semantic Cues	DAVIS16-Moving \(\mathcal{J\&F}\)↑	SegTrackv2 \(\mathcal{J}\)↑	FBMS-59 \(\mathcal{J}\)↑	DAVIS2016 \(\mathcal{J\&F}\)↑
RCF-All	Optical Flow	DINO	79.6	79.6	72.4	80.7
OCLR-TTA	Optical Flow	RGB	78.5	72.3	69.9	78.8
ABR	Optical Flow	DINO	72.0	76.6	81.9	72.5
Ours	Trajectory	DINO	89.5	76.3	78.3	90.9

Fine-grained MOS (DAVIS17-Moving, per-object evaluation):

Method	MOS \(\mathcal{J\&F}\)↑	Fine-grained \(\mathcal{J}\)↑	Fine-grained \(\mathcal{F}\)↑
ABR	74.6/75.2	50.9	51.2
OCLR-TTA	76.0/75.3	48.4	49.9
Ours	90.0/89.0	77.4	83.6

Ablation Study

Configuration	DAVIS17-Moving \(\mathcal{J\&F}\)↑	DAVIS16-Moving \(\mathcal{J\&F}\)↑	Description
Ours-full	80.5	89.1	Full Model
w/o Tracks	19.6	20.9	Performance collapses without trajectories, proving trajectories are core
w/o DINO	65.0	75.5	Significant drop, showing semantic information is important
w/o MSDE	63.0	78.2	Feature fusion without decoupling performs poorly
w/o MOE	72.0	81.8	Fusing DINO in both encoder and decoder leads to semantic overreliance
w/o ST-ATT	65.5	78.3	Spatio-temporal attention contributes significantly
w/o Depth	69.2	82.5	Depth information is helpful but not critical
w/o PE	66.4	82.0	Frequency positional encoding makes a certain contribution

Key Findings

Trajectories are the most critical input: Removing trajectories collapses \(\mathcal{J\&F}\) to about 20 (19.6 on DAVIS17-Moving, 20.9 on DAVIS16-Moving), whereas removing DINO or depth costs 15.5 and 11.3 points on DAVIS17-Moving, respectively. This validates the central role of long-range trajectories as motion cues.
The semantic fusion mechanism is crucial: On DAVIS17-Moving, w/o MSDE (direct concatenation without decoupled embedding) scores below w/o DINO (no semantics at all), 63.0 vs. 65.0, indicating that poorly designed fusion can be worse than no fusion. On DAVIS16-Moving the order reverses (78.2 vs. 75.5), so the effect is dataset-dependent.
Large advantage in fine-grained segmentation: On DAVIS17-Moving, the per-object \(\mathcal{J}\) reaches 77.4 vs. 50.9 for the second-best method (ABR), a gain of 26.5 points.
The model remains stable in extreme scenarios, such as intense camera motion, water reflections, and camouflage.

Highlights & Insights

The combination of long-range trajectories and SAM2 is elegant: Trajectories provide cross-frame consistent motion cues and naturally serve as point prompts for SAM2, offering a fundamental advantage in temporal consistency over the traditional optical flow \(\rightarrow\) segmentation paradigm.
The design concept of motion-semantic decoupled embedding has broad transfer value: In any task requiring the fusion of two potentially conflicting sources of information, allowing primary signals to flow through the backbone while injecting auxiliary signals via cross-attention is an effective paradigm.
The "find densest point \(\rightarrow\) generate mask \(\rightarrow\) exclude \(\rightarrow\) iterate" pipeline in the iterative prompting strategy cleverly translates the object counting problem into a greedy grouping task.

Limitations & Future Work

Dependency on tracker quality: The precision of BootsTAP directly affects final results.
Objects that appear and disappear rapidly: If an object only appears for a few frames and moves very fast, the generated trajectories might be too short, which could cause the method to fail.
Dominant motion overshadowing subtle motion: In scenes containing objects with large motion, objects with subtle motion tend to be neglected.
Part-segmentation issue: SAM2 sometimes segments only a part of an object (e.g., a person's clothes instead of the whole body), and the post-processing merge strategy cannot always resolve this.
Future work can explore more robust trajectory estimators or introduce 3D trajectory information.

vs. Optical Flow Methods (OCLR, RCF): Optical flow is short-range, two-frame matching that degrades under occlusion and cannot track over long time spans. Long-range trajectories extend this matching along the temporal dimension, overcoming the short-range limitation.
vs. Affinity Matrix Spectral Clustering Methods: Traditional trajectory methods cluster on a pairwise affinity matrix, which captures only local similarities and lacks global consistency. This paper uses a transformer to directly model trajectory features, learning global motion patterns.
vs. ABR: ABR also uses DINO features but achieves unstable performance because it fuses motion and semantics in separate stages. The decoupled embedding in this paper enables closer and more robust integration.

Rating

Novelty: ⭐⭐⭐⭐ The combination of long-range trajectories + SAM2 + decoupled embedding is novel, although individual components are not entirely new.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation on multiple datasets, rich ablations, and extensive qualitative comparisons.
Writing Quality: ⭐⭐⭐⭐ Clearly structured with effective illustrations and honest limitations.
Value: ⭐⭐⭐⭐⭐ Highly impressive performance in fine-grained motion segmentation with high practical applicability.