SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning
Conference: CVPR 2025
arXiv: 2504.00527
Code: https://github.com/fmthoker/SMILE
Area: Self-Supervised Learning / Video Understanding
Keywords: Masked Video Modeling, Synthetic Motion Augmentation, CLIP Feature Target, Trajectory Masking, VideoMAE
TL;DR
This paper proposes SMILE, which enhances masked video modeling with synthetic motion augmentation (segmented objects overlaid on videos and moved along random trajectories) and CLIP features as reconstruction targets. Combined with a trajectory-guided masking strategy, it raises K400 linear-probing accuracy to 56.2%, 8.7 points above the previous SOTA of 47.5%.
Background & Motivation
Background: In self-supervised video representation learning, masked video modeling (e.g., VideoMAE) learns spatio-temporal features through mask-and-reconstruct pretext tasks. However, videos are highly redundant in time: adjacent frames are almost identical, so a model can "cheat" by copying from neighboring frames without truly understanding motion.
Limitations of Prior Work: (1) Pixel reconstruction targets focus on low-level textures rather than high-level semantics; (2) object motion in natural videos lacks diversity, making it difficult for models to learn rich motion patterns; (3) random tube masking is not targeted at motion regions.
Key Challenge: Masked video learning requires rich motion signals, but most regions in natural videos consist of static backgrounds.
Key Insight: Synthetic motion can be injected artificially. Objects are generated with Stable Diffusion, segmented using X-Paste, and overlaid onto video frames along random smooth trajectories, which forces motion signals into the training data. This is combined with CLIP features replacing pixels as the reconstruction target.
Core Idea: Synthetic object motion augmentation + CLIP feature reconstruction + trajectory-guided masking = motion-aware self-supervised video learning.
Method
Key Designs
- Synthetic Motion Augmentation:
  - Function: Artificially increases motion diversity in videos.
  - Mechanism: Object images generated by Stable Diffusion are segmented using X-Paste and overlaid onto video frames along random quadratic Bézier trajectories, with scaling and rotation applied, so that the model has to track these moving objects.
  - Design Motivation: The ablation shows that synthetic motion adds 3.1 points on K400 with pixel targets, indicating that it helps break temporal redundancy.
- CLIP Feature Reconstruction Target:
  - Function: Replaces low-level pixels with high-level semantic features as the reconstruction target.
  - Mechanism: A pre-trained CLIP image encoder extracts features for each frame, which serve as the reconstruction target for masked tokens: \(\mathcal{L} = \frac{1}{|\mathcal{T}^{mask}|}\sum_{i \in \mathcal{T}^{mask}} \|f_i' - Y_i\|_2^2\), where \(\mathcal{T}^{mask}\) is the set of masked tokens, \(f_i'\) is the feature predicted for token \(i\), and \(Y_i\) is the corresponding CLIP target.
  - Design Motivation: CLIP targets reach a K400 linear-probing accuracy 12.1 points higher than pixel targets (56.2% vs. 44.1%), indicating that semantic-level targets are far stronger than pixel-level ones.
- Trajectory-Guided Masking:
  - Function: Applies additional masking along the motion trajectories of synthetic objects.
  - Mechanism: On top of standard tube masking, additional tokens along the motion trajectory of synthetic objects are masked, forcing the model to predict semantic features along the trajectory.
  - Design Motivation: Masking the most active motion regions maximizes the requirement for motion reasoning.
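The trajectory and masking steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the uniform sampling of control points, the square object footprint, the 16-pixel patch size, the tubelet length of 2, and the 0.9 tube-mask ratio are all assumptions chosen for illustration. It covers only trajectory sampling and token selection, not object rendering, scaling, or rotation.

```python
import random


def sample_bezier_trajectory(num_frames, width, height, rng):
    """Evaluate a random quadratic Bezier curve once per frame."""
    # Three control points drawn uniformly inside the frame (an assumed scheme).
    p0, p1, p2 = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(3)]
    points = []
    for k in range(num_frames):
        t = k / (num_frames - 1) if num_frames > 1 else 0.0
        # B(t) = (1 - t)^2 * P0 + 2 * (1 - t) * t * P1 + t^2 * P2
        w0, w1, w2 = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
        points.append((w0 * p0[0] + w1 * p1[0] + w2 * p2[0],
                       w0 * p0[1] + w1 * p1[1] + w2 * p2[1]))
    return points


def trajectory_tokens(points, obj_size, width, height, patch=16, tubelet=2):
    """Return the (t, row, col) tokens overlapped by a square object footprint."""
    rows, cols = height // patch, width // patch
    half = obj_size / 2
    tokens = set()
    for frame_idx, (cx, cy) in enumerate(points):
        t_idx = frame_idx // tubelet
        # Patch range covered by the object's bounding box, clipped to the grid.
        c0 = max(0, int((cx - half) // patch))
        c1 = min(cols - 1, int((cx + half) // patch))
        r0 = max(0, int((cy - half) // patch))
        r1 = min(rows - 1, int((cy + half) // patch))
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                tokens.add((t_idx, r, c))
    return tokens


def tube_tokens(rows, cols, num_t, ratio, rng):
    """Tube masking: the same random spatial positions are masked at every time step."""
    chosen = rng.sample(range(rows * cols), int(ratio * rows * cols))
    return {(t, i // cols, i % cols) for t in range(num_t) for i in chosen}


rng = random.Random(0)
points = sample_bezier_trajectory(16, 224, 224, rng)
traj = trajectory_tokens(points, obj_size=64, width=224, height=224)
tube = tube_tokens(rows=14, cols=14, num_t=8, ratio=0.9, rng=rng)
mask = tube | traj  # trajectory tokens are masked in addition to the tube mask
print(len(tube), len(traj), len(mask))
```

Taking the union means the trajectory tokens can only add to the tube mask, which matches the description of trajectory masking as an addition on top of standard tube masking.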
Loss & Training
The objective is an L2 reconstruction loss on CLIP features. The backbone is ViT-B, trained for 600 epochs on K400. Training is progressive: the model is first pre-trained with synthetic motion and CLIP targets, then fine-tuned on the original videos.
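As a small worked example of the masked L2 objective, the sketch below averages the squared L2 distance between predicted and target features over the masked tokens only. The function name and the toy two-dimensional features are illustrative assumptions; real CLIP features are much higher-dimensional and would be handled with a tensor library.

```python
def masked_feature_loss(pred, target, masked_idx):
    """Mean over masked tokens of the squared L2 distance ||pred_i - target_i||^2."""
    if not masked_idx:
        raise ValueError("at least one masked token is required")
    total = 0.0
    for i in masked_idx:
        total += sum((p - y) ** 2 for p, y in zip(pred[i], target[i]))
    return total / len(masked_idx)


# Toy example: three tokens, of which tokens 0 and 2 are masked.
pred = [[1.0, 2.0], [0.0, 0.0], [3.0, 1.0]]
target = [[1.0, 0.0], [5.0, 5.0], [0.0, 1.0]]
loss = masked_feature_loss(pred, target, masked_idx=[0, 2])
print(loss)  # (4.0 + 9.0) / 2 = 6.5; the unmasked token 1 does not contribute
```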
Key Experimental Results
Main Results
| Method | K400 Linear Probing | UCF-101 | SSv2 |
|---|---|---|---|
| VideoMAE | 40.2% | 73.1% | 17.8% |
| SIGMA (Prev. SOTA) | 47.5% | 80.7% | 21.7% |
| SMILE | 56.2% | 83.8% | 23.7% |
Ablation Study
| Configuration | K400 Linear Probing |
|---|---|
| Pixel Target | 44.1% |
| + Synthetic Motion | 47.2% (+3.1%) |
| CLIP Target | 56.2% (+12.1%) |
| CLIP + Synthetic Motion | 56.2% |
| + Trajectory Masking | +~1% |
Key Findings
- Large gap between CLIP and pixel targets: +12.1 points on K400 linear probing, making the semantic-level reconstruction target the largest contributing factor.
- Synthetic motion is more effective under pixel targets: +3.1 points with pixel targets, while the gain saturates with CLIP targets.
- SSv2 (motion-sensitive) also improves: 23.7% vs. 21.7% for SIGMA, suggesting that motion understanding is indeed enhanced.
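The margins quoted in these findings follow directly from the two tables. The short sketch below recomputes them in percentage points; the dictionary layout and the `gain` helper are illustrative, and the numbers are copied from the tables above.

```python
# Accuracy (%) copied from the main results table.
main = {
    "VideoMAE": {"K400": 40.2, "UCF-101": 73.1, "SSv2": 17.8},
    "SIGMA": {"K400": 47.5, "UCF-101": 80.7, "SSv2": 21.7},
    "SMILE": {"K400": 56.2, "UCF-101": 83.8, "SSv2": 23.7},
}
# K400 linear probing (%) copied from the ablation table.
ablation = {"pixel": 44.1, "pixel+synthetic": 47.2, "clip": 56.2}


def gain(new, old):
    """Difference in percentage points, rounded to one decimal place."""
    return round(new - old, 1)


for dataset in ("K400", "UCF-101", "SSv2"):
    print(dataset, gain(main["SMILE"][dataset], main["SIGMA"][dataset]))
# K400 8.7
# UCF-101 3.1
# SSv2 2.0

print(gain(ablation["clip"], ablation["pixel"]))             # 12.1
print(gain(ablation["pixel+synthetic"], ablation["pixel"]))  # 3.1
```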
Highlights & Insights
- A simple way to break temporal redundancy: "pasting" moving objects onto videos is significantly simpler than collecting new data.
- A decisive advantage for CLIP features: switching the target from pixels to CLIP alone yields a 12-point boost, suggesting that the bottleneck of VideoMAE lies in the reconstruction target rather than the masking strategy.
Limitations & Future Work
- Synthetic object motion is not realistic enough (moving mechanically along trajectories).
- Evaluation is primarily on action recognition, with insufficient validation on temporal reasoning tasks.
- CLIP features inherit the biases of image-text alignment.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of synthetic motion and CLIP targets is effective and novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multiple datasets and detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear.
- Value: ⭐⭐⭐⭐ Significantly pushes the state of the art in self-supervised video learning.