A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning
Conference: CVPR 2025
arXiv: 2503.06960
Code: https://github.com/CVMI-Lab/SlotMIM
Area: Robotics / Visual Pre-training
Keywords: Pre-trained Vision Models, Robot Learning, Object-Centric Representations, Self-Supervised Learning, Data Types
TL;DR
A systematic evaluation finds that DINO and iBOT outperform MAE on robot tasks, but their performance degrades on non-object-centric (NOC) data because they lose the ability to learn object-centric representations. The paper proposes SlotMIM, which combines a semantic bottleneck (fewer prototypes, to encourage objectness to emerge), cross-view consistency regularization, and slot-level contrastive learning. With these, the model learns object-centric representations from NOC data and, using only 241K samples, outperforms MVP and VC-1, which are pre-trained on more than 1M samples.
Background & Motivation
Background: Pre-trained vision models (PVMs) have become fundamental building blocks for robot learning. The prevailing approach (e.g., MVP, VC-1) utilizes MAE pre-training on ego-centric data.
Limitations of Prior Work: Prior work rests on two unquestioned assumptions: (1) MAE is the optimal pre-training method; (2) ego-centric data is the best choice. Experiments show that neither holds: DINO and iBOT consistently outperform MAE on both manipulation and perception tasks, and ImageNet (object-centric) data outperforms ego-centric data.
Key Challenge: Although DINO and iBOT perform best on object-centric data, their performance degrades severely on non-object-centric (NOC) data such as scene-centric or ego-centric data. They struggle to learn object-centric representations from NOC data, and objectness is highly correlated with successful manipulation (correlation coefficient of \(0.72\)).
Core Idea: SlotMIM promotes the emergence of objectness by reducing the number of prototypes (creating a semantic bottleneck), learns semantic prototypes via cross-view consistency, and improves discriminability with slot-level contrastive learning. This allows PVMs to learn object-centric representations from any type of pre-training data.
Method
Overall Architecture
SlotMIM extends iBOT in three ways: (1) it reduces the number of prototypes (\(8192 \to 512\)) to form an information bottleneck, after which patch clustering reveals emerging objectness; (2) it adds a cross-view patch consistency loss, which gives the prototypes semantic meaning; (3) it pools patches into slots according to their prototype assignments and applies slot-level MoCo contrastive learning.
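
To make step (1) concrete, here is a minimal NumPy sketch of assigning patches to a reduced set of prototypes. It is an illustration under stated assumptions, not the authors' implementation: the cosine similarity, the temperature of 0.1, the 64-dimensional embeddings, the 14×14 patch grid, and the function name `assign_to_prototypes` are all hypothetical, and the random inputs only stand in for real features.

```python
import numpy as np

def assign_to_prototypes(patches, prototypes, temperature=0.1):
    """Soft-assign each patch embedding to K prototypes.

    patches:    (N, D) patch embeddings
    prototypes: (K, D) prototype vectors
    returns:    (N, K) assignment probabilities, each row summing to 1
    """
    # Assumption: cosine similarity followed by a temperature-scaled softmax.
    z = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    c = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = z @ c.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 64))     # toy stand-in for a 14x14 patch grid
prototypes = rng.normal(size=(512, 64))  # K = 512, the reduced prototype count

probs = assign_to_prototypes(patches, prototypes)
# Hard cluster id per patch; this is the kind of map one would visualize
# to check whether clusters follow textures or objects.
cluster_map = probs.argmax(axis=1).reshape(14, 14)
```

With 8192 prototypes the same code yields a much finer partition of the patches; the bottleneck is simply the smaller `K`.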
Key Designs
- Representation Bottleneck:
    - Function: Drastically reduces the number of clustering prototypes.
    - Key Finding: iBOT uses 8192 prototypes to capture fine-grained patterns but lacks semantics. Reducing this to 512 allows objectness to naturally emerge (Fig. 4a: transitioning from texture-level clustering to object-level clustering).
    - Design Motivation: The information bottleneck forces the model to learn compositional object concepts rather than low-level patterns.
- Cross-view Consistency:
    - Function: Forces corresponding patches from different augmented views of the same image to share the same prototype.
    - Mechanism: Overlap regions of the two views are aligned using ROIAlign, computing cross-view cross-entropy for matched patch pairs: \(\mathcal{L}_{patch}^{cross}\).
    - Design Motivation: iBOT's within-view MIM loss provides no view-invariance guidance, leading to a lack of semantic consistency in prototypes.
- Slot-level Contrastive Learning:
    - Function: Pools patches into object-level slot features based on prototype assignment, and performs MoCo contrastive learning between slots.
    - Mechanism: \(\mathbf{s}_{\theta,i} = h_\theta(\sum_j p_\theta(\mathbf{v}_j)_i \mathbf{z}_{\theta,j})\), where slots of the same prototype across two views form positive pairs.
    - Design Motivation: Patch-level learning is insufficient to distinguish objects, while slot-level contrastive learning enhances object-level discriminability.
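
The cross-view loss and the slot pooling formula above can be sketched in a few lines of NumPy. This is a hedged illustration, not the released code: it assumes the patches of the two views have already been matched (the ROIAlign step is omitted), it omits the projection head \(h_\theta\), and the rule that a slot is "present" in a view when its prototype wins at least one patch is an assumption made for this example. All function names are hypothetical.

```python
import numpy as np

def cross_view_cross_entropy(target_probs, pred_probs, eps=1e-8):
    """Mean cross-entropy over M matched patch pairs.

    target_probs, pred_probs: (M, K) prototype assignments of the same
    image regions seen in two different views.
    """
    return float(-(target_probs * np.log(pred_probs + eps)).sum(axis=1).mean())

def pool_slots(probs, feats):
    """Slot i = sum_j p(v_j)_i * z_j (projection head h omitted).

    probs: (N, K) patch-to-prototype assignments
    feats: (N, D) patch features
    returns: (K, D) one pooled feature per prototype
    """
    return probs.T @ feats

def shared_slot_ids(probs_a, probs_b):
    """Prototype ids occupied in both views; these index the positive pairs.

    Assumption: a prototype is 'occupied' if it is the argmax for some patch.
    """
    ids_a = np.unique(probs_a.argmax(axis=1))
    ids_b = np.unique(probs_b.argmax(axis=1))
    return np.intersect1d(ids_a, ids_b)

# Toy example: 6 patches, K = 4 prototypes, D = 3, one-hot assignments.
probs_a = np.eye(4)[[0, 0, 1, 1, 2, 2]]
probs_b = np.eye(4)[[1, 1, 2, 2, 3, 3]]
feats_a = np.arange(18, dtype=float).reshape(6, 3)
feats_b = feats_a + 0.5

slots_a = pool_slots(probs_a, feats_a)       # (4, 3)
slots_b = pool_slots(probs_b, feats_b)       # (4, 3)
positives = shared_slot_ids(probs_a, probs_b)  # prototypes 1 and 2

loss_same = cross_view_cross_entropy(probs_a, probs_a)  # near zero
loss_diff = cross_view_cross_entropy(probs_a, probs_b)  # large: every pair disagrees
```

In a contrastive step, `slots_a[i]` and `slots_b[i]` for each `i` in `positives` would be pulled together, with the other slots serving as negatives.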
Key Findings: Inverse Scaling
MAE performance improves as data scales up, but DINO/iBOT performance paradoxically drops when the data grows from 241K to 1.28M. The explanation given is that these objectives compress representations too aggressively and discard low-level visual information that manipulation tasks need. SlotMIM avoids the problem: on ego-centric data it learns fine-grained parts rather than coarse-grained objects, and so avoids over-compression.
Key Experimental Results
Main Results: 241K Pre-training Comparison
| Method | Data | Franka Kitchen | Meta-World | VOC Jacc | ADE mIoU |
|---|---|---|---|---|---|
| MAE | Ego-241K | 34.2 | 36.0 | 37.1 | 40.3 |
| DINO | INet-241K | 38.5 | 42.8 | 42.2 | 44.5 |
| iBOT | INet-241K | 40.1 | 43.3 | 43.2 | 48.2 |
| SlotMIM | COCO+-241K | 42.0 | 44.1 | 43.9 | 49.1 |
Ablation Study: Contribution of Each Component
| Configuration | mask | cross | within | slot | k-NN | ADE | Jacc |
|---|---|---|---|---|---|---|---|
| DINO cross-view only | ✗ | ✓ | ✗ | ✗ | 45.1 | 47.4 | 42.5 |
| +MIM | ✓ | ✓ | ✗ | ✗ | 44.9 | 48.6 | 42.3 |
| +within-view+slot | ✓ | ✓ | ✓ | ✓ | 46.2 | 49.1 | 43.9 |
Key Findings
- 241K SlotMIM > 1M+ MVP/VC-1: With 241K samples, less than a quarter of the data, SlotMIM surpasses SOTA methods pre-trained on million-scale ego-centric data.
- Data types should match the task: Ego-centric data fits manipulation best, scene-centric data fits navigation, and object-centric data fits perception, so a single one-size-fits-all dataset is suboptimal.
- Objectness correlates with manipulation performance (coefficient 0.72): This makes it a useful indicator for selecting pre-training methods.
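
For readers who want to run the same kind of check on their own models, a correlation between an objectness score and a manipulation success rate can be computed as below. This summary does not say which correlation coefficient the paper uses; the sketch assumes Pearson's r. The numbers are a synthetic, perfectly linear sanity check and are not results from the paper.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Synthetic placeholder values (NOT from the paper): one entry per model.
toy_objectness = [1.0, 2.0, 3.0, 4.0]
toy_success = [3.0, 5.0, 7.0, 9.0]  # exactly 2 * objectness + 1

r = pearson_r(toy_objectness, toy_success)  # 1.0 up to floating-point error
```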
Highlights & Insights
- Challenging two popular assumptions: MAE is not the optimal PVM, and ego-centric data is not always the best choice. This offers significant guidance to the robot learning community.
- Discovery and explanation of "inverse scaling": DINO/iBOT performance decreases with more data, which the paper attributes to over-compression in self-supervised learning. This adds an important caveat to scaling laws.
- Adaptability of SlotMIM: It learns objectness at different granularities depending on the data type (object-centric \(\to\) coarse-grained objects; ego-centric \(\to\) fine-grained parts), demonstrating the flexibility of the method.
Limitations & Future Work
- The number of slots (i.e., prototypes) must be manually configured, and the optimal values differ across datasets (ImageNet: 1024, COCO: 512).
- Cross-view matching relies on ROIAlign to align crops, which may not generalize well to drastic deformations.
- Although the evaluation across 6 tasks is diverse, the experimental scale of each task is limited (3 seeds \(\times\) limited demos).
Rating
- Novelty: ⭐⭐⭐⭐⭐ Systematic research perspective + SlotMIM methodological innovation (bottleneck + cross-view + slot contrastive), challenging existing domain assumptions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 4 data types \(\times\) 5 methods \(\times\) 6 tasks \(\times\) multiple scales, highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐⭐ Figs 1, 2, 4, 6 are highly informative, with a clear logical chain from systematic analysis to method design.
- Value: ⭐⭐⭐⭐⭐ Provides a novel, data-centric perspective and an effective methodology for the PVM + robot learning community.