SuperCLIP: CLIP with Simple Classification Supervision

Conference: NeurIPS 2025 · arXiv: 2512.14480 · Code: GitHub (hustvl/SuperCLIP) · Area: Information Retrieval · Keywords: CLIP, vision-language pretraining, classification supervision, fine-grained alignment, contrastive learning

TL;DR

SuperCLIP augments the CLIP contrastive learning framework with an extremely simple classification loss — requiring only a lightweight linear layer that increases total FLOPs by merely 0.077% — to recover fine-grained textual supervision that CLIP underutilizes, achieving consistent improvements on zero-shot classification, image-text retrieval, and vision-only tasks.

Background & Motivation

CLIP aligns images and text into a shared embedding space via contrastive learning, achieving strong performance on zero-shot classification and retrieval. However, recent work reveals a noteworthy phenomenon:

CLIP fails to fully exploit the rich supervisory signals present in text. This manifests in four aspects:

Inherent limitations of contrastive learning: CLIP optimizes only global image-text similarity, neglecting word- and phrase-level fine-grained semantics. For instance, CLIP may confuse statues with real people (object state) or struggle to distinguish a bear inside versus outside a river (spatial relation).

Sparsity of web data: An analysis of 10 million captions from DataComp-1B shows that "man + newspaper" appears 333 times, but "man + newspaper + real/statue" appears only 6 times, and "bear + river + in/out" is nearly absent. Such low-frequency fine-grained combinations rarely form effective contrastive pairs within the same batch.

Richer captions actually degrade CLIP performance: Replacing training data with more detailed captions regenerated by LLaMA-3 (Recap-DataComp) and training CLIP from scratch leads to a performance drop. This indicates that the contrastive learning paradigm cannot effectively leverage richer textual descriptions — the added complexity even interferes with learning.

Strong dependence on batch size: CLIP requires large batches (typically 16K+) to form diverse positive and negative pairs within a batch. Performance degrades sharply under small batch sizes.

Method

Overall Architecture

SuperCLIP adds only a single lightweight linear layer to the CLIP framework, mapping the average-pooled features from the visual encoder to text-based classification targets. The classification loss is jointly optimized with the contrastive loss, requiring no additional annotated data; the training data, visual encoder, and text encoder are all reused directly from CLIP.

Key Designs

  1. Text tokens as classification labels: Each caption is tokenized via CLIP's subword tokenizer to obtain a token ID set \(\mathcal{C}\), which is encoded as a \(V\)-dimensional K-hot vector \(\mathbf{y} \in \mathbb{R}^V\) (where \(V\) is the vocabulary size). Unlike conventional classification, the "classes" here are raw text tokens, requiring no manual filtering or vocabulary construction.

  2. IDF weighting: Using K-hot labels directly would allow high-frequency stop words to dominate learning. Inverse document frequency (IDF) weighting is therefore introduced:

\[w_c = \log\left(\frac{|\mathcal{D}|}{1 + \text{df}(c)}\right)\]

The normalized weighted label distribution is:

\[\hat{y}_c = \frac{w_c y_c}{\sum_{c'=1}^V w_{c'} y_{c'}}\]

This focuses the model on informationally dense vocabulary (e.g., "zebra", "skateboarding") while reducing overemphasis on function words such as "the" and "is".

  3. Classification loss: A linear layer is applied to the average-pooled visual features to obtain logits \(x_c\), and weighted cross-entropy is computed as (see the PyTorch sketch below):
\[\mathcal{L}_{\text{Class}} = -\sum_{c=1}^{V} \hat{y}_c \log\left(\frac{e^{x_c}}{\sum_{c'=1}^{V} e^{x_{c'}}}\right)\]
  4. Total loss: the contrastive and classification losses are simply summed; the ablation below also varies a weight \(\lambda\) on the classification term, with \(\lambda = 1\) as the default:
\[\mathcal{L}_{\text{Total}} = \mathcal{L}_{\text{CLIP}} + \lambda\,\mathcal{L}_{\text{Class}}\]

Since the classification loss does not rely on in-batch negatives, it is inherently insensitive to batch size, mitigating the performance degradation of CLIP under small batch settings.
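
For concreteness, here is a minimal PyTorch sketch of the target construction and classification loss described above. The function names, the precomputed `idf` tensor, and the `head` linear layer are illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def classification_targets(token_ids, idf):
    """IDF-weighted, normalized K-hot targets from caption token IDs.

    token_ids: list of 1-D LongTensors of CLIP subword IDs, one per caption
    idf:       precomputed (V,) tensor of IDF weights w_c (V ~ 49k for CLIP's BPE)
    """
    y = torch.zeros(len(token_ids), idf.numel())
    for i, ids in enumerate(token_ids):
        y[i, ids.unique()] = 1.0                   # K-hot: which tokens appear
    w = y * idf                                    # w_c * y_c
    return w / w.sum(dim=-1, keepdim=True).clamp_min(1e-8)  # normalized \hat{y}_c

def classification_loss(pooled_visual, head, targets):
    """Weighted cross-entropy over the vocabulary (L_Class above)."""
    logits = head(pooled_visual)                   # linear head on avg-pooled features -> (B, V)
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```

The Python loop over captions is only for clarity; in practice the K-hot matrix can be built in one shot with a `scatter_` on padded token-ID tensors.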

Loss & Training

  • Pretrained on a subset of DataComp-1B (approximately 1.3 billion image-text pairs)
  • Image resolution 224×224, AdamW optimizer, cosine learning rate schedule
  • Default batch size 16K (consistent with CLIP for fair comparison)
  • The linear layer introduces only 0.051 GFLOPs (L-size), accounting for 0.077% of total computation
  • Supports DualCaption mode: contrastive loss uses short captions; classification loss uses long captions
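
Building on the helpers from the sketch above, here is a hedged sketch of one DualCaption training step: the contrastive loss consumes short captions while the classification loss consumes long ones. The encoder interface (an `image_encoder` returning both the projected embedding and the average-pooled feature) is an assumption about the architecture, not the released code:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_embed, txt_embed, temperature=0.07):
    """Standard symmetric InfoNCE over in-batch image-text pairs."""
    img = F.normalize(img_embed, dim=-1)
    txt = F.normalize(txt_embed, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def training_step(image_encoder, text_encoder, head, images,
                  short_tokens, long_token_ids, idf):
    img_embed, pooled = image_encoder(images)    # projected embedding + avg-pooled feature
    txt_embed = text_encoder(short_tokens)       # short captions -> contrastive pairs
    loss_clip = clip_contrastive_loss(img_embed, txt_embed)

    targets = classification_targets(long_token_ids, idf)  # long captions -> K-hot targets
    loss_cls = classification_loss(pooled, head, targets)
    return loss_clip + loss_cls                  # L_Total = L_CLIP + L_Class
```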

Key Experimental Results

Main Results (Various Model Sizes)

| Model | ImageNet val (%) | ImageNet v2 (%) | COCO Image Retrieval (%) | Flickr Image Retrieval (%) |
|---|---|---|---|---|
| CLIP B-512M | 60.5 | 53.0 | 29.0 | 54.5 |
| SuperCLIP B-512M | 63.5 (+3.0) | 55.2 (+2.2) | 31.3 (+2.3) | 56.9 (+2.4) |
| CLIP L-512M | 66.1 | 57.4 | 32.7 | 57.0 |
| SuperCLIP L-512M | 70.1 (+4.0) | 62.5 (+5.1) | 35.9 (+3.2) | 62.4 (+5.4) |
| CLIP L-12.8B | 79.0 | 72.0 | 43.9 | 72.7 |
| SuperCLIP L-12.8B | 80.0 (+1.0) | 72.8 (+0.8) | 45.5 (+1.6) | 74.2 (+1.5) |

Recovering Rich Textual Supervision (Mixed Caption Experiments)

| Model | Caption Ratio | 38-dataset Avg. Classification (%) | COCO Image Retrieval (%) | Flickr Text Retrieval (%) |
|---|---|---|---|---|
| CLIP-L (1.0/0.0) | Short 100% | 45.7 | 32.7 | 76.4 |
| CLIP-L (0.0/1.0) | Long 100% | 30.0 | 26.2 | 65.9 |
| CLIP-L (0.8/0.2) | Short 80% / Long 20% | 46.8 | 37.0 | 78.8 |
| SuperCLIP-L (Dual) | Contrastive = Short / Class. = Long | 49.5 (+2.7) | 37.6 | 82.5 |

Training CLIP with 100% long captions causes a substantial performance drop (45.7→30.0), whereas SuperCLIP's DualCaption mode effectively exploits the rich semantics of long captions.

Ablation Study

| Configuration | ImageNet (%) | COCO Image Retrieval (%) | Flickr Text Retrieval (%) |
|---|---|---|---|
| λ = 0.4 | 44.1 | 41.3 | 58.3 |
| λ = 1.0 | 47.1 | 44.0 | 61.0 |
| λ = 1.6 | 47.2 | 44.2 | 62.0 |
| Without IDF | 44.8 | (31.6, 51.7) | (48.0, 71.1) |
| With IDF | 47.1 | (33.2, 54.7) | (48.9, 73.1) |

Generalization Validation

| Framework | ImageNet val (%) | ImageNet v2 (%) | COCO Image Retrieval (%) | Flickr Text Retrieval (%) |
|---|---|---|---|---|
| SigLIP | 60.4 | 52.8 | 29.8 | 73.2 |
| SuperSigLIP | 64.1 (+3.7) | 55.9 (+3.1) | 32.5 (+2.7) | 75.9 (+2.7) |
| FLIP | 58.1 | 50.1 | 27.5 | 66.7 |
| SuperFLIP | 61.3 (+3.2) | 53.5 (+3.4) | 30.1 (+2.6) | 72.0 (+5.3) |

Key Findings

  • Word–image similarity analysis shows that CLIP's Top-20 words are exclusively object category words (zebras, kites), whereas SuperCLIP successfully elevates state words (blurry), spatial words (inside), and action words (stands) to high-ranking positions.
  • SuperCLIP yields consistent improvements on vision-only tasks: linear probing +1.3–1.5%, semantic segmentation +2.1–4.1%, and depth estimation also improves.
  • When integrated into LLaVA-1.5, SuperCLIP outperforms the CLIP encoder on multiple multimodal benchmarks, including VQAv2 (+1.8) and MMBench (+6.8).
  • When batch size is reduced from 32K to 1K, SuperCLIP exhibits substantially smaller performance degradation compared to CLIP.

Highlights & Insights

  1. Extreme simplicity: Adding a single linear layer and a classification loss suffices to address CLIP's fine-grained alignment deficiency — an exemplary case of solving a real problem with minimal intervention.
  2. Incisive problem diagnosis: Quantitative keyword co-occurrence statistics from DataComp-1B provide a rigorous explanation for why contrastive learning struggles to capture fine-grained semantics.
  3. Elegant DualCaption strategy: Using short captions for the contrastive loss preserves coarse-grained alignment, while using long captions for the classification loss extracts fine-grained semantics, avoiding the need to carefully tune mixing ratios.
  4. Batch size robustness: The classification loss is inherently independent of batch size, offering a practical solution for resource-constrained training scenarios.

Limitations & Future Work

  • Classification supervision flows only from text to the visual encoder; the symmetric direction, supervising the text encoder with image-derived targets, remains unexplored.
  • IDF weights are precomputed before training and cannot dynamically adapt to shifts in semantic distribution during training.
  • The linear classification head may limit the modeling capacity for more complex semantic relationships.
  • Improvements on certain specialized datasets (e.g., dSprites, smallNORB, and other synthetic benchmarks) are marginal.
  • Orthogonal to methods such as RegionCLIP (region-level supervision) and Long-CLIP (long-text understanding), and can be combined with them.
  • The idea of classification supervision draws from Classification-based Supervision (Huang et al. 2024); this work scales it to CLIP pretraining.
  • Provides a general enhancement scheme applicable to other contrastive learning variants such as SigLIP and FLIP.
  • Implication: the information bottleneck in contrastive learning may be more severe than commonly assumed; supplementing with classification signals is a simple and effective remedy.

Rating

  • Novelty: ⭐⭐⭐⭐ The method is extremely simple yet the problem diagnosis is insightful; IDF weighting and DualCaption demonstrate genuine ingenuity.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multi-scale models, cross-framework generalization, 38-dataset evaluation, MLLM integration, and batch size analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ Motivation is thoroughly analyzed and well supported by quantitative statistics.
  • Value: ⭐⭐⭐⭐⭐ Zero-overhead integration into any CLIP training pipeline; open-source code available.