SuperCLIP: CLIP with Simple Classification Supervision¶
Conference: NeurIPS 2025 · arXiv: 2512.14480 · Code: GitHub (hustvl/SuperCLIP) · Area: Information Retrieval · Keywords: CLIP, vision-language pretraining, classification supervision, fine-grained alignment, contrastive learning
TL;DR¶
SuperCLIP augments the CLIP contrastive learning framework with an extremely simple classification loss — requiring only a lightweight linear layer that increases total FLOPs by merely 0.077% — to recover fine-grained textual supervision that CLIP underutilizes, achieving consistent improvements on zero-shot classification, image-text retrieval, and vision-only tasks.
Background & Motivation¶
CLIP aligns images and text into a shared embedding space via contrastive learning, achieving strong performance on zero-shot classification and retrieval. However, recent work reveals a noteworthy phenomenon:
CLIP fails to fully exploit the rich supervisory signals present in text. This manifests in the following ways:
- Inherent limitations of contrastive learning: CLIP optimizes only global image-text similarity, neglecting word- and phrase-level fine-grained semantics. For instance, CLIP may confuse statues with real people (object state) or struggle to distinguish a bear inside versus outside a river (spatial relation).
- Sparsity of web data: An analysis of 10 million captions from DataComp-1B shows that "man + newspaper" appears 333 times, but "man + newspaper + real/statue" appears only 6 times, and "bear + river + in/out" is nearly absent. Such low-frequency fine-grained combinations rarely form effective contrastive pairs within the same batch.
- Richer captions actually degrade CLIP performance: Replacing the training data with more detailed captions regenerated by LLaMA-3 (Recap-DataComp) and training CLIP from scratch leads to a performance drop. This indicates that the contrastive learning paradigm cannot effectively leverage richer textual descriptions; the added complexity even interferes with learning.
- Strong dependence on batch size: CLIP requires large batches (typically 16K+) to form diverse positive and negative pairs within a batch. Performance degrades sharply under small batch sizes.
Method¶
Overall Architecture¶
SuperCLIP adds only a single lightweight linear layer to the CLIP framework, mapping the average-pooled features from the visual encoder to text-based classification targets. The classification loss is jointly optimized with the contrastive loss, requiring no additional annotated data; the training data, visual encoder, and text encoder are all reused directly from CLIP.
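A minimal sketch of the added head, assuming a PyTorch-style implementation; the module name, `embed_dim`, and `vocab_size` are illustrative and not taken from the released code:

```python
import torch
import torch.nn as nn

class TokenClassificationHead(nn.Module):
    """Hypothetical sketch: the single linear layer SuperCLIP adds on top of the
    visual encoder, mapping average-pooled patch features to logits over the
    CLIP tokenizer vocabulary."""

    def __init__(self, embed_dim: int, vocab_size: int):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, vocab_size)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        pooled = patch_tokens.mean(dim=1)   # (B, L, D) -> (B, D) average pooling
        return self.classifier(pooled)      # (B, V) vocabulary logits
```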
Key Designs¶
- Text tokens as classification labels: Each caption is tokenized via CLIP's subword tokenizer to obtain a token ID set \(\mathcal{C}\), which is encoded as a \(V\)-dimensional K-hot vector \(\mathbf{y} \in \mathbb{R}^V\) (where \(V\) is the vocabulary size). Unlike conventional classification, the "classes" here are raw text tokens, requiring no manual filtering or vocabulary construction.
- IDF weighting: Using K-hot labels directly would let high-frequency stop words dominate learning. Each token is therefore weighted by its inverse document frequency (IDF), precomputed over the training captions, and the weighted K-hot vector is renormalized into a label distribution. This focuses the model on informationally dense vocabulary (e.g., "zebra", "skateboarding") while reducing overemphasis on function words such as "the" and "is".
- Classification loss: A linear layer is applied to the average-pooled visual features to obtain vocabulary-sized logits \(x_c\), and a weighted cross-entropy is computed between these logits and the normalized IDF-weighted label distribution (see the sketch below).
- Total loss: The contrastive loss is combined with the classification loss, the latter scaled by a weight \(\lambda\) (ablated below).
Since the classification loss does not rely on in-batch negatives, it is inherently insensitive to batch size, mitigating the performance degradation of CLIP under small batch settings.
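A minimal sketch of the label construction and the joint objective, under the assumptions above (K-hot targets over CLIP token IDs, precomputed IDF weights, and a total loss of the form contrastive + \(\lambda\,\times\) classification); helper names are hypothetical, not the authors' code:

```python
import torch
import torch.nn.functional as F

def khot_idf_target(token_ids: torch.Tensor, idf: torch.Tensor) -> torch.Tensor:
    """Build a normalized, IDF-weighted K-hot label distribution for one caption.
    token_ids: 1-D tensor of CLIP subword IDs; idf: (V,) precomputed IDF weights.
    Hypothetical helper."""
    y = torch.zeros_like(idf)
    y[token_ids.unique()] = 1.0            # K-hot over the caption's tokens
    w = y * idf                            # down-weight frequent function words
    return w / w.sum().clamp(min=1e-8)     # normalize into a distribution

def superclip_loss(sim: torch.Tensor, cls_logits: torch.Tensor,
                   targets: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Contrastive loss plus IDF-weighted classification loss (sketch).
    sim: (B, B) scaled image-text similarity; cls_logits: (B, V) head outputs;
    targets: (B, V) stacked outputs of khot_idf_target."""
    labels = torch.arange(sim.size(0), device=sim.device)
    l_con = 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))
    # Soft-label cross-entropy against the normalized K-hot targets.
    l_cls = -(targets * F.log_softmax(cls_logits, dim=-1)).sum(dim=-1).mean()
    return l_con + lam * l_cls
```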
Loss & Training¶
- Pretrained on a subset of DataComp-1B (approximately 1.3 billion image-text pairs)
- Image resolution 224×224, AdamW optimizer, cosine learning rate schedule
- Default batch size 16K (consistent with CLIP for fair comparison)
- The linear layer introduces only 0.051 GFLOPs (L-size), accounting for 0.077% of total computation
- Supports DualCaption mode: contrastive loss uses short captions; classification loss uses long captions
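A sketch of how DualCaption mode could be wired into a training step, reusing the hypothetical helpers from the sketch above; the model API (`model(images, short_caps)` returning pooled features plus patch tokens, and `model.logit_scale`) is an assumption for illustration:

```python
import torch

def dual_caption_step(model, head, images, short_caps, long_caption_ids, idf,
                      lam: float = 1.0) -> torch.Tensor:
    """Hypothetical training step: contrastive loss on short captions,
    classification targets built from the long caption's token IDs."""
    img_feats, txt_feats, patch_tokens = model(images, short_caps)  # assumed API
    sim = model.logit_scale.exp() * img_feats @ txt_feats.t()
    cls_logits = head(patch_tokens)
    targets = torch.stack([khot_idf_target(ids, idf) for ids in long_caption_ids])
    return superclip_loss(sim, cls_logits, targets.to(cls_logits.device), lam)
```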
Key Experimental Results¶
Main Results (Various Model Sizes)¶
| Model | ImageNet val (%) | ImageNet v2 (%) | COCO Image Retrieval (%) | Flickr Image Retrieval (%) |
|---|---|---|---|---|
| CLIP B-512M | 60.5 | 53.0 | 29.0 | 54.5 |
| SuperCLIP B-512M | 63.5 (+3.0) | 55.2 (+2.2) | 31.3 (+2.3) | 56.9 (+2.4) |
| CLIP L-512M | 66.1 | 57.4 | 32.7 | 57.0 |
| SuperCLIP L-512M | 70.1 (+4.0) | 62.5 (+5.1) | 35.9 (+3.2) | 62.4 (+5.4) |
| CLIP L-12.8B | 79.0 | 72.0 | 43.9 | 72.7 |
| SuperCLIP L-12.8B | 80.0 (+1.0) | 72.8 (+0.8) | 45.5 (+1.6) | 74.2 (+1.5) |
Recovering Rich Textual Supervision (Mixed Caption Experiments)¶
| Model | Caption Ratio | 38-dataset Avg. Classification (%) | COCO Image Retrieval (%) | Flickr Text Retrieval (%) |
|---|---|---|---|---|
| CLIP-L (1.0/0.0) | Short 100% | 45.7 | 32.7 | 76.4 |
| CLIP-L (0.0/1.0) | Long 100% | 30.0 | 26.2 | 65.9 |
| CLIP-L (0.8/0.2) | Short 80% / Long 20% | 46.8 | 37.0 | 78.8 |
| SuperCLIP-L (Dual) | Contrastive=Short / Class=Long | 49.5 (+2.7) | 37.6 | 82.5 |
Training CLIP with 100% long captions causes a substantial performance drop (45.7→30.0), whereas SuperCLIP's DualCaption mode effectively exploits the rich semantics of long captions.
Ablation Study¶
| Configuration | ImageNet (%) | COCO Image Retrieval (%) | Flickr Text Retrieval (%) |
|---|---|---|---|
| λ=0.4 | 44.1 | 41.3 | 58.3 |
| λ=1.0 | 47.1 | 44.0 | 61.0 |
| λ=1.6 | 47.2 | 44.2 | 62.0 |
| Without IDF | 44.8 | (31.6, 51.7) | (48.0, 71.1) |
| With IDF | 47.1 | (33.2, 54.7) | (48.9, 73.1) |
Generalization Validation¶
| Framework | ImageNet val (%) | ImageNet v2 (%) | COCO Image Retrieval (%) | Flickr Text Retrieval (%) |
|---|---|---|---|---|
| SigLIP | 60.4 | 52.8 | 29.8 | 73.2 |
| SuperSigLIP | 64.1 (+3.7) | 55.9 (+3.1) | 32.5 (+2.7) | 75.9 (+2.7) |
| FLIP | 58.1 | 50.1 | 27.5 | 66.7 |
| SuperFLIP | 61.3 (+3.2) | 53.5 (+3.4) | 30.1 (+2.6) | 72.0 (+5.3) |
Key Findings¶
- Word–image similarity analysis shows that CLIP's Top-20 words are exclusively object category words (zebras, kites), whereas SuperCLIP successfully elevates state words (blurry), spatial words (inside), and action words (stands) to high-ranking positions.
- SuperCLIP yields consistent improvements on vision-only tasks: linear probing +1.3–1.5%, semantic segmentation +2.1–4.1%, with gains on depth estimation as well.
- When integrated into LLaVA-1.5, SuperCLIP outperforms the CLIP encoder on multiple multimodal benchmarks, including VQAv2 (+1.8) and MMBench (+6.8).
- When batch size is reduced from 32K to 1K, SuperCLIP exhibits substantially smaller performance degradation compared to CLIP.
Highlights & Insights¶
- Extreme simplicity: Adding a single linear layer and a classification loss suffices to address CLIP's fine-grained alignment deficiency — an exemplary case of solving a real problem with minimal intervention.
- Incisive problem diagnosis: Quantitative keyword co-occurrence statistics from DataComp-1B provide a rigorous explanation for why contrastive learning struggles to capture fine-grained semantics.
- Elegant DualCaption strategy: Using short captions for the contrastive loss preserves coarse-grained alignment, while using long captions for the classification loss extracts fine-grained semantics, avoiding the need to carefully tune mixing ratios.
- Batch size robustness: The classification loss is inherently independent of batch size, offering a practical solution for resource-constrained training scenarios.
Limitations & Future Work¶
- Classification supervision is applied in only one direction, with text tokens supervising the visual encoder; the symmetric setup, in which image-derived targets supervise the text encoder, remains unexplored.
- IDF weights are precomputed before training and cannot dynamically adapt to shifts in semantic distribution during training.
- The linear classification head may limit the modeling capacity for more complex semantic relationships.
- Improvements on certain specialized datasets (e.g., DSprites, SmallNORB, and other synthetic benchmarks) are marginal.
Related Work & Insights¶
- Orthogonal to methods such as RegionCLIP (region-level supervision) and Long-CLIP (long-text understanding), and can be combined with them.
- The idea of classification supervision draws from Classification-based Supervision (Huang et al. 2024); this work scales it to CLIP pretraining.
- Provides a general enhancement scheme applicable to other contrastive learning variants such as SigLIP and FLIP.
- Implication: the information bottleneck in contrastive learning may be more severe than commonly assumed; supplementing with classification signals is a simple and effective remedy.
Rating¶
- Novelty: ⭐⭐⭐⭐ The method is extremely simple yet the problem diagnosis is insightful; IDF weighting and DualCaption demonstrate genuine ingenuity.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multi-scale models, cross-framework generalization, 38-dataset evaluation, MLLM integration, and batch size analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ Motivation is thoroughly analyzed and well supported by quantitative statistics.
- Value: ⭐⭐⭐⭐⭐ Zero-overhead integration into any CLIP training pipeline; open-source code available.