COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

Conference: CVPR 2025
arXiv: 2412.01814
Code: https://github.com/ExplainableML/cosmos
Area: Image Segmentation
Keywords: Vision-Language Pre-training, Self-Distillation, Cross-Modality Learning, Semantic Segmentation, Contrastive Learning

TL;DR

COSMOS proposes a cross-modality self-distillation framework that learns fine-grained cross-modality representations in a student-teacher architecture using a text-cropping strategy and a cross-attention module. Pre-trained on only 30M image-text pairs, it consistently outperforms CLIP-like baselines across zero-shot retrieval, classification, and semantic segmentation, and on several of these tasks it even surpasses OpenCLIP trained on billions of samples.

Background & Motivation

Vision-Language Models (VLMs) like CLIP use a global contrastive loss to match entire images with texts, achieving significant progress on various vision and language tasks. However, such global contrastive learning has an inherent limitation: models tend to focus on the dominant foreground objects in images while ignoring other critical information. This leads to the "feature suppression" phenomenon, in which the model learns only the most salient features in the data and discards other valuable discriminative details. It manifests in three ways: (1) poor performance on dense prediction tasks (e.g., semantic segmentation); (2) difficulty distinguishing images that look similar overall but differ in their fine visual patterns; (3) a text encoder that treats text as a bag of words, neglecting word order and compositional semantics.

Prior studies mainly introduced self-supervised learning on the image encoder (e.g., SLIP and SILC), which improves image representations but leaves text representations unimproved. The key challenge is how to enhance the fine-grained representations of both images and texts simultaneously, instead of focusing on a single modality.

Core Idea of COSMOS: Generalize the multi-crop augmentation strategy from self-supervised learning to multi-modal scenarios by introducing the concept of "text-cropping". Combined with a cross-attention module, this simultaneously distills both image and text encoders within a student-teacher framework to learn cross-modality fine-grained representations.

Method

Overall Architecture

COSMOS adopts a student-teacher self-distillation framework. The student and teacher models share the same VLM architecture (image encoder + text encoder), where the teacher model is updated via Exponential Moving Average (EMA) of the student's parameters. During training, multi-modal augmentation is applied to image-text pairs to generate global and local views. All views pass through the student, whereas only global views pass through the teacher. The student additionally contains a cross-attention module to fuse cross-modality information. The total loss consists of the standard CLIP contrastive loss and the COSMOS cross-modality self-distillation loss.

Key Designs

Text-Cropping Strategy:
- Inspired by multi-crop augmentation in the image domain, this strategy extends it to the text domain.
- Utilizing long synthetic descriptions (comprising multiple sentences) generated by MLLMs, fragments of varying lengths are randomly sampled.
- Global text view: Randomly samples 1-5 sentences, covering a larger description area of the image.
- Local text view: Samples only 1 sentence, focusing on a localized description area of the image.
- Image cropping and text cropping are performed independently; a global/local crop does not necessarily correspond to the same region, which is an intentional design choice.
- This design enables self-distillation to optimize both the text and image encoders simultaneously.
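
The sampling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names `split_sentences` and `crop_text`, the regex-based sentence splitter, and the choice to keep sampled sentences in their original order are all assumptions made for the example.

```python
import random
import re


def split_sentences(caption: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p for p in parts if p]


def crop_text(caption: str, rng: random.Random, max_global: int = 5) -> tuple[str, str]:
    """Return (global_view, local_view) sampled from one long synthetic caption."""
    sentences = split_sentences(caption)
    if not sentences:
        raise ValueError("caption must contain at least one sentence")
    # Global view: 1 to max_global sentences, capped by how many are available.
    k = rng.randint(1, min(max_global, len(sentences)))
    # Sort the sampled indices so the global view keeps the caption order (assumption).
    picked = sorted(rng.sample(range(len(sentences)), k))
    global_view = " ".join(sentences[i] for i in picked)
    # Local view: exactly one sentence, drawn independently of the global view.
    local_view = rng.choice(sentences)
    return global_view, local_view


caption = (
    "A brown dog runs across a grassy field. A red ball lies near a fence. "
    "Trees line the background. The sky is overcast."
)
global_view, local_view = crop_text(caption, random.Random(0))
print("global:", global_view)
print("local:", local_view)
```

Because the text views are drawn without reference to the image crops, the same function can be called once per text view, mirroring the independent image and text cropping described above.
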
Cross-Attention Module:
- Only added to the student model, consisting of two sub-modules: \(C^T_\theta\) and \(C^I_\theta\).
- \(C^T_\theta\): Uses the image [CLS] token as the query and text tokens as keys/values to generate the cross-modality image embedding \(h_I\).
- \(C^I_\theta\): Uses the text [EOT] token as the query and image tokens as keys/values to generate the cross-modality text embedding \(h_T\).
- The outputs are added back to the original tokens via residual connections: \(h_I = C^T_\theta(q=[\text{cls}], kv=\text{txt-tok}) + [\text{cls}]\).
- This allows the distillation signal to flow into both encoders simultaneously, promoting bidirectional grounding of vision and language.
- In practice, tokens from the global crop are used as keys/values.
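
As a rough illustration of the two sub-modules, the sketch below implements single-head, single-query cross-attention with a residual connection in NumPy. The projection matrices, the dimensions, the single-head form, and the convention that row 0 is [CLS] and the last text row is [EOT] are assumptions for the example; the actual module may differ (e.g., multi-head attention, normalization layers).

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_attend(query: np.ndarray, tokens: np.ndarray,
                 w_q: np.ndarray, w_k: np.ndarray, w_v: np.ndarray) -> np.ndarray:
    """Single-head cross-attention with a residual connection.

    query:  (d,)   summary token of one modality ([CLS] or [EOT])
    tokens: (n, d) token sequence of the other modality (keys/values)
    """
    d = query.shape[-1]
    q = query @ w_q                          # (d,)
    k = tokens @ w_k                         # (n, d)
    v = tokens @ w_v                         # (n, d)
    weights = softmax(k @ q / np.sqrt(d))    # (n,) attention over the other modality
    return weights @ v + query               # residual: add back the original query token


rng = np.random.default_rng(0)
d = 8
img_tokens = rng.normal(size=(5, d))   # hypothetical global-crop image tokens; row 0 plays [CLS]
txt_tokens = rng.normal(size=(7, d))   # hypothetical global-crop text tokens; last row plays [EOT]

# Separate (hypothetical) projection weights for the two sub-modules.
w_t = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]
w_i = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]

h_img = cross_attend(img_tokens[0], txt_tokens, *w_t)    # h_I: image [CLS] queries text tokens
h_txt = cross_attend(txt_tokens[-1], img_tokens, *w_i)   # h_T: text [EOT] queries image tokens
print(h_img.shape, h_txt.shape)
```

Since each output depends on tokens from both encoders, a loss placed on `h_img` or `h_txt` sends gradients into the image and text encoders at the same time.
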
Student-Teacher Self-Distillation Framework:
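- Teacher parameters are updated via EMA: \(\theta_t \leftarrow \lambda \theta_t + (1-\lambda) \theta_s\), where \(\theta_s\) and \(\theta_t\) are the student and teacher parameters and \(\lambda\) is the momentum coefficient.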
- The student processes all crops (global + local), while the teacher only processes global crops.
- This asymmetric design encourages the student to predict the teacher's global context from local features.
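
The EMA update and the asymmetric view routing can be sketched as follows. Parameters are modeled as a plain dictionary of floats to keep the example dependency-free; the function names, the `(kind, payload)` view encoding, and the momentum value `0.99` are illustrative assumptions, not values from the paper.

```python
def ema_update(teacher: dict[str, float], student: dict[str, float], lam: float) -> None:
    """In-place update: theta_t <- lam * theta_t + (1 - lam) * theta_s."""
    for name, s_val in student.items():
        teacher[name] = lam * teacher[name] + (1.0 - lam) * s_val


def route_views(views: list[tuple[str, str]]) -> tuple[list[str], list[str]]:
    """Split (kind, payload) views: the student sees all, the teacher only 'global' ones."""
    student_inputs = [payload for _, payload in views]
    teacher_inputs = [payload for kind, payload in views if kind == "global"]
    return student_inputs, teacher_inputs


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
ema_update(teacher, student, lam=0.99)   # lam is an illustrative value, not taken from the paper
print(teacher)

views = [("global", "g1"), ("global", "g2"), ("local", "l1"), ("local", "l2")]
student_inputs, teacher_inputs = route_views(views)
print(len(student_inputs), len(teacher_inputs))
```

With a momentum close to 1, the teacher changes slowly and acts as a smoothed copy of the student, which is the usual motivation for EMA teachers in self-distillation.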

Loss & Training

CLIP Contrastive Loss \(\mathcal{L}_{CLIP}\): Standard symmetric InfoNCE loss within the student model, computed across all crops.
COSMOS Cross-Modality Self-Distillation Loss \(\mathcal{L}_{COSMOS}\): A four-way symmetric InfoNCE loss that matches the student's cross-modality embeddings (\(h_I\), \(h_T\)) against the teacher's [CLS] and [EOT] tokens.
Total Loss: \(\mathcal{L}_{total} = \mathcal{L}_{CLIP} + \mathcal{L}_{COSMOS}\).
An important advantage: the two loss terms need no additional scaling hyperparameter; the best reported results use direct, equal weighting.
Training runs for 32 epochs using ViT-B/16 as the vision encoder.
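
A minimal NumPy sketch of how the two terms might be combined is given below. Several details are assumptions made for illustration: the fixed temperature of 0.07, random vectors standing in for encoder outputs, and the reading of "four-way" as the four student-teacher pairings (\(h_I\) and \(h_T\) each matched against the teacher's [CLS] and [EOT]), averaged into one term.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def log_softmax(x: np.ndarray) -> np.ndarray:
    z = x - x.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def symmetric_info_nce(a: np.ndarray, b: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE for a batch in which a[i] and b[i] form the positive pair."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature   # (B, B) similarity matrix
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()     # a -> b direction
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()   # b -> a direction
    return float(0.5 * (loss_ab + loss_ba))


rng = np.random.default_rng(0)
B, d = 4, 16
s_img, s_txt = rng.normal(size=(B, d)), rng.normal(size=(B, d))   # student [CLS] / [EOT]
h_img, h_txt = rng.normal(size=(B, d)), rng.normal(size=(B, d))   # student cross-modality embeddings
t_cls, t_eot = rng.normal(size=(B, d)), rng.normal(size=(B, d))   # teacher global-view tokens

loss_clip = symmetric_info_nce(s_img, s_txt)
# Assumed pairing: every student cross-modality embedding against every teacher token.
loss_cosmos = np.mean([symmetric_info_nce(h, t) for h in (h_img, h_txt) for t in (t_cls, t_eot)])
loss_total = loss_clip + loss_cosmos   # equal weighting, no extra scaling coefficient
print(float(loss_total))
```

In a real training loop the teacher tensors would be treated as constants (no gradient), so only the student and its cross-attention module are updated by this loss.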

Key Experimental Results

Main Results

Dataset	Metric	Ours (Merged-30M)	DreamLIP (30M)	CLIP (30M)	Gain (vs CLIP)
MSCOCO I2T	R@1	68.0	62.3	63.2	+4.8
MSCOCO T2I	R@1	52.5	44.9	48.2	+4.3
Flickr30K I2T	R@1	92.9	89.9	90.5	+2.4
Flickr30K T2I	R@1	80.3	73.3	75.9	+4.4
ImageNet	Top-1 Acc	57.6	58.4	50.0	+7.6
Semantic Segmentation (8 benchmarks)	Avg mIoU	20.0	-	-	-

Ablation Study

Configuration	Key Metric	Description
CLIP loss only	Baseline	Standard CLIP contrastive learning
+ Text cropping	Improvements in both retrieval/segmentation	Text augmentation is a key innovation
+ Cross-attention	Further improvements	Cross-modality fusion enhances representation
COSMOS (30M) vs OpenCLIP (1B)	20.0 vs 16.5 avg mIoU	30M data outperforms 1B data
COSMOS w/ SCLIP	37.8 avg mIoU	Approaches SCLIP trained on 400M data (38.2)

Key Findings

Using only 30M samples, COSMOS surpasses the retrieval performance of Llip trained on 2.5B samples (MSCOCO I2T R@1: 68.0 vs 63.4).
In semantic segmentation, COSMOS trained on 30M samples clearly outperforms OpenCLIP trained on 1B samples (Cityscapes: 13.9 vs 8.5, roughly a 1.6x improvement).
Significant advantages are also observed on compositional reasoning benchmarks like SugarCrepe and SVO, with an average score of 86.6 vs 81.8 (DreamLIP).
Visualization of cross-attention indicates that the model can effectively localize corresponding regions in both images and text.

Highlights & Insights

Clever Design of Text-Cropping: Extending the well-established multi-crop self-distillation strategy from the image domain to the text domain, with synthetic long descriptions used to construct global/local text views, is a simple yet effective innovation.
Bidirectional Distillation of Cross-Modality Embeddings: In contrast to preceding methods that improve only the image encoder, COSMOS routes distillation signals into both encoders via its cross-attention modules.
No Hyperparameter Scaling Required: The clean and simple design of equally weighting the two loss terms avoids the hassle of grid-searching for the optimal loss ratio.
Exceptional Data Efficiency: The model trained on 30M samples surpasses models trained on billions of samples (e.g., OpenCLIP at 1B and Llip at 2.5B) on retrieval and segmentation, suggesting that the gains come from the method design rather than from data scale.

Limitations & Future Work

The absolute performance on classification tasks is still lower than that of models trained on billions of samples, indicating that classification relies more heavily on data scale.
The approach depends on synthetic long descriptions generated by MLLMs; the quality of the descriptions directly impacts the effectiveness of text-cropping.
The cross-attention module introduces additional computational overhead for the student model.
Evaluated only on ViT-B/16; whether it scales effectively to larger models remains to be verified.
Text-cropping and image-cropping are conducted independently; whether aligning the cropped regions is beneficial has not been explored.

COSMOS is closely aligned with the self-distillation philosophy of DINO, which it extends to multi-modal settings.
DreamLIP provides a long synthetic description dataset, on top of which COSMOS introduces text-cropping augmentation.
SILC only performs local-to-global matching on the image encoder, whereas COSMOS extends this to dual modalities.
Insights for open-vocabulary segmentation: Improving fine-grained representation learning during the pre-training phase can significantly elevate downstream dense prediction task performance.

Rating

Novelty: ⭐⭐⭐⭐ The combination of text-cropping and cross-modality self-distillation is novel, though the overall framework remains within the mature paradigm of DINO+CLIP.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluations across retrieval, classification, and segmentation, alongside compositionality and visual grounding assessments, plus sufficient ablations.
Writing Quality: ⭐⭐⭐⭐ Clear structure, informative figures and tables, and well-explained methodology.
Value: ⭐⭐⭐⭐ Offers a valuable paradigm for data-efficient VLM pre-training; the text-cropping strategy in particular is worth generalizing.