MoSiC: Optimal-Transport Motion Trajectory for Dense Self-Supervised Learning¶
Conference: ICCV 2025 | arXiv: 2506.08694 | Code: github.com/SMSD75/MoSiC | Area: Self-Supervised Learning / Video Representation Learning | Keywords: dense self-supervised learning, optimal transport, motion trajectories, spatiotemporal consistency, video segmentation
TL;DR¶
MoSiC extracts long-range motion trajectories with an offline point tracker and propagates cluster assignments along them over time via an optimal-transport (Sinkhorn-Knopp) clustering mechanism. This yields spatially and temporally consistent dense representations learned from video, improving DINOv2 by 1%–6% across multiple image and video benchmarks while training on video alone.
Background & Motivation¶
Dense self-supervised learning has achieved significant progress in pixel/patch-level representation learning, yet extending it to video remains challenging. The core difficulties are:
- Failure of static augmentations: Color transformations and similar image-domain augmentations implicitly preserve pixel correspondences, but object motion, camera displacement, and deformation in video break this assumption.
- Occlusion issues: Prior methods such as TimeTuning establish inter-frame correspondences via mask propagation, but temporary occlusions introduce propagation errors.
- Drift accumulation: Errors accumulate frame by frame during long-range tracking, degrading the learned feature representations.
The core insight of MoSiC draws from the Gestalt psychology principle — "points that move together belong together" — and extends this principle to a finer-grained patch level.
Method¶
Overall Architecture¶
MoSiC adopts a teacher-student framework with the following pipeline:

1. Patchify the frames of a video clip \(X \in \mathbb{R}^{h \times w \times c \times T}\);
2. The student network processes randomly masked patches, while the teacher network processes the original patches;
3. Extract long-range motion trajectories with an offline point tracker (CoTracker-v3);
4. Cluster the first frame's features via the Sinkhorn-Knopp optimal transport algorithm;
5. Propagate the cluster assignments to subsequent frames along the motion trajectories;
6. Align the student's and teacher's cluster assignments with a cross-entropy loss.
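To make the pipeline concrete, here is a schematic of one training step in PyTorch style. Every name here (`tracker`, `random_mask`, `cosine_sim`, `sample_at_tracks`, `sinkhorn_knopp`, `mosic_loss`) is a hypothetical placeholder for a component described above, not the authors' code; the helpers are sketched in the subsections below.

```python
import torch
import torch.nn.functional as F

def cosine_sim(x: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between features (..., d) and prototypes (K, d) -> (..., K)."""
    return F.normalize(x, dim=-1) @ F.normalize(prototypes, dim=-1).t()

def train_step(video, student, teacher, tracker, prototypes, optimizer):
    """One schematic MoSiC step. `student`/`teacher` map a (T, c, h, w) clip
    to dense feature maps (T, d, H, W); `tracker` wraps a frozen CoTracker-v3
    and returns tracks (T, N, 2) plus visibility flags (T, N). `random_mask`
    is a placeholder for patch masking."""
    tracks, visible = tracker(video)
    with torch.no_grad():
        f_teach = teacher(video)                                  # unmasked frames
        z_t0 = sample_at_tracks(f_teach, tracks, 'bilinear')[0]   # (N, d), first frame
        q_t0 = sinkhorn_knopp(cosine_sim(z_t0, prototypes))       # (N, K) OT assignments
    f_stu = student(random_mask(video))                           # student sees masked patches
    scores = cosine_sim(sample_at_tracks(f_stu, tracks, 'nearest'), prototypes)  # (T, N, K)
    loss = mosic_loss(scores, q_t0, visible)     # visibility-masked cross-entropy
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.detach()    # the teacher is then EMA-updated from the student
```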
Key Designs¶
- Motion Trajectory Extraction: \(N\) points are sampled on a uniform grid over the first frame, and the frozen CoTracker-v3 tracks them throughout the video clip, yielding trajectories \(\text{Traj} \in \mathbb{R}^{T \times N \times 2}\). A key advantage of CoTracker is its long-range tracking capability and its robustness to object re-appearance after occlusion (object permanence).
- OT-based Clustering: Given cluster prototypes \(P \in \mathbb{R}^{K \times d}\) and patch features \(F \in \mathbb{R}^{N \times d}\), the transport cost between patches and prototypes is the negative cosine similarity, and the entropy-regularized optimal transport problem is solved via the Sinkhorn-Knopp algorithm:

  \[
  \mathcal{Q}^{*} = \arg\max_{\mathcal{Q} \in \mathcal{M}} \operatorname{Tr}\!\left(\mathcal{Q}^{\top} F P^{\top}\right) + \epsilon H(\mathcal{Q}),
  \qquad
  \mathcal{M} = \left\{ \mathcal{Q} \in \mathbb{R}_{+}^{N \times K} \;\middle|\; \mathcal{Q}\mathbf{1}_{K} = \tfrac{1}{N}\mathbf{1}_{N},\; \mathcal{Q}^{\top}\mathbf{1}_{N} = \tfrac{1}{K}\mathbf{1}_{K} \right\}
  \]

  where \(H(\mathcal{Q})\) is the entropy regularization term and \(\epsilon\) controls the smoothness of the assignments. The uniform marginal constraints prevent collapse. This procedure is applied independently to both the student and teacher networks (a minimal solver sketch follows this list).
- Cluster Propagation: After obtaining cluster assignments on the first frame \(t_0\) via OT, the teacher's assignments are propagated to subsequent frames along the motion trajectories: \(\mathcal{Q}_t^{\text{teach},i} = \mathcal{Q}_{t_0}^{\text{teach},i}\). This ensures that points moving along the same trajectory keep the same cluster identity even when their appearance changes due to viewpoint variation. The teacher samples features at continuous trajectory coordinates with bilinear interpolation, while the student uses nearest-neighbor interpolation (see the sampling sketch after this list).
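For reference, a standard SwAV-style Sinkhorn-Knopp solver with uniform marginals, which solves the problem above; this is a common implementation pattern, not necessarily the authors' exact code, and the defaults (\(\epsilon = 0.05\), 3 iterations) are the usual SwAV choices, assumed here rather than taken from the paper.

```python
import torch

@torch.no_grad()
def sinkhorn_knopp(scores: torch.Tensor, eps: float = 0.05, n_iters: int = 3) -> torch.Tensor:
    """Entropy-regularized OT with uniform marginals (SwAV-style).

    scores: (N, K) similarities between N trajectory features and K prototypes.
    Returns Q: (N, K) soft assignments; each row sums to 1, and mass is
    balanced across prototypes, which is what prevents cluster collapse.
    """
    Q = torch.exp(scores / eps).t()          # (K, N); eps = smoothness of assignments
    Q /= Q.sum()
    K, N = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)      # rows: each prototype holds 1/K of the mass
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)      # cols: each point holds 1/N of the mass
        Q /= N
    return (Q * N).t()                       # (N, K); each row sums to 1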
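The propagation step only needs per-frame features at the tracked coordinates. A minimal sampling helper, assuming PyTorch feature maps and tracks given in the feature map's pixel coordinates (shapes are assumptions, not the paper's notation):

```python
import torch
import torch.nn.functional as F

def sample_at_tracks(feats: torch.Tensor, tracks: torch.Tensor, mode: str) -> torch.Tensor:
    """Sample dense feature maps at continuous track coordinates.

    feats:  (T, d, H, W) per-frame feature maps.
    tracks: (T, N, 2) coordinates (x, y) from the point tracker, in the
            feature map's resolution.
    mode:   'bilinear' (teacher side) or 'nearest' (student side).
    Returns (T, N, d) features along the trajectories.
    """
    T, N, _ = tracks.shape
    _, _, H, W = feats.shape
    grid = tracks.clone()
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0   # x -> [-1, 1] for grid_sample
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0   # y -> [-1, 1]
    out = F.grid_sample(feats, grid.view(T, N, 1, 2), mode=mode, align_corners=True)
    return out.squeeze(-1).permute(0, 2, 1)             # (T, d, N, 1) -> (T, N, d)
```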
Loss & Training¶
The training loss takes a cross-entropy form, computed between the student's cluster scores along the trajectories \(S_{t}^{\text{stu},k,i}\) (after softmax) and the teacher's propagated one-hot cluster assignments:

\[
\mathcal{L} = -\sum_{t=1}^{T} \sum_{i=1}^{N} v_{t,i} \sum_{k=1}^{K} \mathcal{Q}_{t}^{\text{teach},k,i} \log S_{t}^{\text{stu},k,i}
\]
Key details:

- \(v_{t,i}\) is a visibility flag; the loss is computed only over visible trajectory points, enhancing robustness to occlusion;
- only simple augmentations (cropping + masking) are used, without complex augmentations such as color jitter or grayscale conversion;
- the model is initialized from DINOv2 pretrained weights and trained on YouTube-VOS.
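A direct transcription of the loss, with the teacher assignments broadcast over time (the propagation \(\mathcal{Q}_t = \mathcal{Q}_{t_0}\)) and a visibility mask; shapes match the sketches above, and the softmax temperature `tau` is an assumed hyperparameter, not a value from the paper:

```python
import torch

def mosic_loss(student_scores: torch.Tensor,
               teacher_q_t0: torch.Tensor,
               visibility: torch.Tensor,
               tau: float = 0.1) -> torch.Tensor:
    """Visibility-masked cross-entropy between student cluster scores and the
    teacher's first-frame assignments propagated along trajectories.

    student_scores: (T, N, K) student similarities to the prototypes.
    teacher_q_t0:   (N, K) teacher OT assignments on the first frame
                    (one-hot or soft; both work with this form).
    visibility:     (T, N) bool flags from the tracker; occluded points are skipped.
    """
    log_p = torch.log_softmax(student_scores / tau, dim=-1)    # (T, N, K)
    target = teacher_q_t0.unsqueeze(0).expand_as(log_p)        # propagate over T
    ce = -(target * log_p).sum(dim=-1)                         # (T, N) per-point CE
    mask = visibility.float()
    return (ce * mask).sum() / mask.sum().clamp(min=1.0)       # mean over visible points
```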
Key Experimental Results¶
Main Results¶
| Benchmark | Dataset | Metric | MoSiC-S14 | DINOv2-S14 | Gain |
|---|---|---|---|---|---|
| In-context Scene Understanding | Pascal VOC (1/128) | mIoU | 62.5 | 56.0 | +6.5 |
| In-context Scene Understanding | Pascal VOC (1/1) | mIoU | 78.2 | 77.0 | +1.2 |
| In-context Scene Understanding | ADE20K (1/1) | mIoU | 40.7 | 38.8 | +1.9 |
| Unsupervised Video Semantic Seg. | DAVIS (F-Clustering) | mIoU | 58.9 | 57.4 | +1.5 |
| Unsupervised Video Semantic Seg. | YTVOS (F-Clustering) | mIoU | 60.6 | 56.3 | +4.3 |
| Frozen Clustering | Pascal VOC (K=500) | mIoU | 60.2 | 58.6 | +1.6 |
| Linear Segmentation | Pascal VOC | mIoU | 79.7 | 78.9 | +0.8 |
| Linear Segmentation | ADE20K | mIoU | 39.6 | 37.9 | +1.7 |
| Semantic Segmentation | Pascal VOC | mIoU | 51.2 | 37.5 | +13.7 |
MoSiC-B14 (85M parameters) further improves results: Pascal VOC in-context (1/128) reaches 65.5, and (1/1) reaches 80.5.
Ablation Study¶
| Configuration | Pascal VOC (mIoU) | ADE20K (mIoU) | Notes |
|---|---|---|---|
| No masking | 51.1 | 18.2 | Baseline |
| 10% masking ratio | 51.5 | 18.6 | Best |
| 40% masking ratio | 49.9 | 18.5 | Excessive masking is harmful |
| Without EMA teacher | 50.5 | 18.2 | EMA teacher is beneficial |
| With EMA teacher | 51.5 | 18.6 | Default setting |
| 8×8 grid | 49.2 | 17.5 | Sparse grid insufficient |
| 16×16 grid | 51.5 | 18.6 | Default setting |
Key Findings¶
- MoSiC generalizes to various visual foundation models (DINO, EVA-CLIP, DINOv2-R), consistently delivering 2%–7% improvements;
- Gains are more pronounced in low-data regimes (+6.5% at 1/128 data);
- Compared to TimeTuning, MoSiC achieves +8.7% clustering improvement on DAVIS and +9.4% on YTVOS.
Highlights & Insights¶
- Motion trajectories as implicit supervision: The Gestalt psychology principle — "points that move together belong together" — is elegantly translated into an optimizable objective at the patch level;
- Improving image representations using only video: Without image annotations, temporal signals from video can enhance dense representation quality for static images;
- Visibility mask mechanism: Computing the loss only over visible trajectories gracefully addresses the pseudo-label noise caused by occlusions, offering greater robustness than TimeTuning's mask propagation;
- Plug-and-play: The method can be applied to various visual foundation models as a post-pretraining fine-tuning stage.
Limitations & Future Work¶
- The approach relies on an offline point tracker (CoTracker-v3); tracker quality directly affects the accuracy of cluster propagation;
- Training requires video data (YouTube-VOS) and cannot be applied in purely image-based settings;
- The tracker itself may fail under extremely fast motion or severe camera shake;
- The number of clusters \(K\) must be set manually, with no adaptive mechanism.
Related Work & Insights¶
- The key distinction from TimeTuning lies in replacing mask propagation with a robust long-range point tracker, avoiding the accumulation of propagation errors under occlusion;
- The method is complementary to image-domain approaches such as NeCo and CrIBo, which exploit cross-image consistency, whereas MoSiC exploits cross-frame temporal consistency;
- The OT clustering mechanism is inherited from DINO/SwAV, with the novel contribution of propagating cluster assignments along motion trajectories.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of motion trajectories and OT-based cluster propagation is a first in dense self-supervised learning
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Six datasets, four evaluation benchmarks, generalization across multiple backbones, and comprehensive ablations
- Writing Quality: ⭐⭐⭐⭐ Clear logical flow, well-formulated equations, and intuitive illustrations
- Value: ⭐⭐⭐⭐ Provides an effective solution for leveraging video data to enhance image representations