SuperCLIP: CLIP with Simple Classification Supervision

Conference: NeurIPS 2025 · arXiv: 2512.14480 · Code: GitHub (hustvl/SuperCLIP) · Area: Information Retrieval · Keywords: CLIP, vision-language pretraining, classification supervision, fine-grained alignment, contrastive learning

TL;DR

SuperCLIP augments the CLIP contrastive learning framework with an extremely simple classification loss — requiring only a lightweight linear layer that increases total FLOPs by merely 0.077% — to recover fine-grained textual supervision that CLIP underutilizes, achieving consistent improvements on zero-shot classification, image-text retrieval, and vision-only tasks.

Background & Motivation

CLIP aligns images and text into a shared embedding space via contrastive learning, achieving strong performance on zero-shot classification and retrieval. However, recent work reveals a noteworthy phenomenon:

CLIP fails to fully exploit the rich supervisory signals present in text. This manifests in four aspects:

Inherent limitations of contrastive learning: CLIP optimizes only global image-text similarity, neglecting word- and phrase-level fine-grained semantics. For instance, CLIP may confuse statues with real people (object state) or struggle to distinguish a bear inside versus outside a river (spatial relation).

Sparsity of web data: An analysis of 10 million captions from DataComp-1B shows that "man + newspaper" appears 333 times, but "man + newspaper + real/statue" appears only 6 times, and "bear + river + in/out" is nearly absent. Such low-frequency fine-grained combinations rarely form effective contrastive pairs within the same batch.

Richer captions actually degrade CLIP performance: Replacing training data with more detailed captions regenerated by LLaMA-3 (Recap-DataComp) and training CLIP from scratch leads to a performance drop. This indicates that the contrastive learning paradigm cannot effectively leverage richer textual descriptions — the added complexity even interferes with learning.

Strong dependence on batch size: CLIP requires large batches (typically 16K+) to form diverse positive and negative pairs within a batch. Performance degrades sharply under small batch sizes.

Method

Overall Architecture

SuperCLIP adds only a single lightweight linear layer to the CLIP framework, mapping the average-pooled features from the visual encoder to text-based classification targets. The classification loss is jointly optimized with the contrastive loss, requiring no additional annotated data; the training data, visual encoder, and text encoder are all reused directly from CLIP.

Key Designs

  1. Text tokens as classification labels: Each caption is tokenized via CLIP's subword tokenizer to obtain a token ID set \(\mathcal{C}\), which is encoded as a \(V\)-dimensional K-hot vector \(\mathbf{y} \in \mathbb{R}^V\) (where \(V\) is the vocabulary size). Unlike conventional classification, the "classes" here are raw text tokens, requiring no manual filtering or vocabulary construction.

  2. IDF weighting: Using K-hot labels directly would allow high-frequency stop words to dominate learning. Inverse document frequency (IDF) weighting is therefore introduced:

\[w_c = \log\left(\frac{|\mathcal{D}|}{1 + \text{df}(c)}\right)\]

The normalized weighted label distribution is:

\[\hat{y}_c = \frac{w_c y_c}{\sum_{c'=1}^V w_{c'} y_{c'}}\]

This focuses the model on informationally dense vocabulary (e.g., "zebra", "skateboarding") while reducing overemphasis on function words such as "the" and "is".

  3. Classification loss: A linear layer is applied to the average-pooled visual features to obtain logits \(x_c\), and weighted cross-entropy is computed as (see the PyTorch sketch below):
\[\mathcal{L}_{\text{Class}} = -\sum_{c=1}^{V} \hat{y}_c \log\left(\frac{e^{x_c}}{\sum_{c'=1}^{V} e^{x_{c'}}}\right)\]
  4. Total loss: the contrastive and classification losses are simply summed; the ablation below also varies a weight \(\lambda\) on the classification term, with \(\lambda = 1\) as the default:
\[\mathcal{L}_{\text{Total}} = \mathcal{L}_{\text{CLIP}} + \lambda\,\mathcal{L}_{\text{Class}}\]

Since the classification loss does not rely on in-batch negatives, it is inherently insensitive to batch size, mitigating the performance degradation of CLIP under small batch settings.
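
For concreteness, here is a minimal PyTorch sketch of the target construction and classification loss described above. The function names, the precomputed `idf` tensor, and the `head` linear layer are illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def classification_targets(token_ids, idf):
    """IDF-weighted, normalized K-hot targets from caption token IDs.

    token_ids: list of 1-D LongTensors of CLIP subword IDs, one per caption
    idf:       precomputed (V,) tensor of IDF weights w_c (V ~ 49k for CLIP's BPE)
    """
    y = torch.zeros(len(token_ids), idf.numel())
    for i, ids in enumerate(token_ids):
        y[i, ids.unique()] = 1.0                   # K-hot: which tokens appear
    w = y * idf                                    # w_c * y_c
    return w / w.sum(dim=-1, keepdim=True).clamp_min(1e-8)  # normalized \hat{y}_c

def classification_loss(pooled_visual, head, targets):
    """Weighted cross-entropy over the vocabulary (L_Class above)."""
    logits = head(pooled_visual)                   # linear head on avg-pooled features -> (B, V)
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```

The Python loop over captions is only for clarity; in practice the K-hot matrix can be built in one shot with a `scatter_` on padded token-ID tensors.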

Loss & Training

  • Pretrained on a subset of DataComp-1B (approximately 1.3 billion image-text pairs)
  • Image resolution 224×224, AdamW optimizer, cosine learning rate schedule
  • Default batch size 16K (consistent with CLIP for fair comparison)
  • The linear layer introduces only 0.051 GFLOPs (L-size), accounting for 0.077% of total computation
  • Supports DualCaption mode: contrastive loss uses short captions; classification loss uses long captions
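
Building on the helpers from the sketch above, here is a hedged sketch of one DualCaption training step: the contrastive loss consumes short captions while the classification loss consumes long ones. The encoder interface (an `image_encoder` returning both the projected embedding and the average-pooled feature) is an assumption about the architecture, not the released code:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_embed, txt_embed, temperature=0.07):
    """Standard symmetric InfoNCE over in-batch image-text pairs."""
    img = F.normalize(img_embed, dim=-1)
    txt = F.normalize(txt_embed, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def training_step(image_encoder, text_encoder, head, images,
                  short_tokens, long_token_ids, idf):
    img_embed, pooled = image_encoder(images)    # projected embedding + avg-pooled feature
    txt_embed = text_encoder(short_tokens)       # short captions -> contrastive pairs
    loss_clip = clip_contrastive_loss(img_embed, txt_embed)

    targets = classification_targets(long_token_ids, idf)  # long captions -> K-hot targets
    loss_cls = classification_loss(pooled, head, targets)
    return loss_clip + loss_cls                  # L_Total = L_CLIP + L_Class
```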

Key Experimental Results

Main Results (Various Model Sizes)

| Model | ImageNet val (%) | ImageNet v2 (%) | COCO Image Retrieval (%) | Flickr Image Retrieval (%) |
|---|---|---|---|---|
| CLIP B-512M | 60.5 | 53.0 | 29.0 | 54.5 |
| SuperCLIP B-512M | 63.5 (+3.0) | 55.2 (+2.2) | 31.3 (+2.3) | 56.9 (+2.4) |
| CLIP L-512M | 66.1 | 57.4 | 32.7 | 57.0 |
| SuperCLIP L-512M | 70.1 (+4.0) | 62.5 (+5.1) | 35.9 (+3.2) | 62.4 (+5.4) |
| CLIP L-12.8B | 79.0 | 72.0 | 43.9 | 72.7 |
| SuperCLIP L-12.8B | 80.0 (+1.0) | 72.8 (+0.8) | 45.5 (+1.6) | 74.2 (+1.5) |

Recovering Rich Textual Supervision (Mixed Caption Experiments)

| Model | Caption Ratio | 38-dataset Avg. Classification (%) | COCO Image Retrieval (%) | Flickr Text Retrieval (%) |
|---|---|---|---|---|
| CLIP-L (1.0/0.0) | Short 100% | 45.7 | 32.7 | 76.4 |
| CLIP-L (0.0/1.0) | Long 100% | 30.0 | 26.2 | 65.9 |
| CLIP-L (0.8/0.2) | Short 80% / Long 20% | 46.8 | 37.0 | 78.8 |
| SuperCLIP-L (Dual) | Contrastive = Short / Class. = Long | 49.5 (+2.7) | 37.6 | 82.5 |

Training CLIP with 100% long captions causes a substantial performance drop (45.7→30.0), whereas SuperCLIP's DualCaption mode effectively exploits the rich semantics of long captions.

Ablation Study

| Configuration | ImageNet (%) | COCO Image Retrieval (%) | Flickr Text Retrieval (%) |
|---|---|---|---|
| λ = 0.4 | 44.1 | 41.3 | 58.3 |
| λ = 1.0 | 47.1 | 44.0 | 61.0 |
| λ = 1.6 | 47.2 | 44.2 | 62.0 |
| Without IDF | 44.8 | (31.6, 51.7) | (48.0, 71.1) |
| With IDF | 47.1 | (33.2, 54.7) | (48.9, 73.1) |

Generalization Validation

| Framework | ImageNet val (%) | ImageNet v2 (%) | COCO Image Retrieval (%) | Flickr Text Retrieval (%) |
|---|---|---|---|---|
| SigLIP | 60.4 | 52.8 | 29.8 | 73.2 |
| SuperSigLIP | 64.1 (+3.7) | 55.9 (+3.1) | 32.5 (+2.7) | 75.9 (+2.7) |
| FLIP | 58.1 | 50.1 | 27.5 | 66.7 |
| SuperFLIP | 61.3 (+3.2) | 53.5 (+3.4) | 30.1 (+2.6) | 72.0 (+5.3) |

Key Findings

  • Word–image similarity analysis shows that CLIP's Top-20 words are exclusively object category words (zebras, kites), whereas SuperCLIP successfully elevates state words (blurry), spatial words (inside), and action words (stands) to high-ranking positions.
  • SuperCLIP yields consistent improvements on vision-only tasks: linear probing +1.3–1.5%, semantic segmentation +2.1–4.1%, and depth estimation also improves.
  • When integrated into LLaVA-1.5, SuperCLIP outperforms the CLIP encoder on multiple multimodal benchmarks, including VQAv2 (+1.8) and MMBench (+6.8).
  • When batch size is reduced from 32K to 1K, SuperCLIP exhibits substantially smaller performance degradation compared to CLIP.

Highlights & Insights

  1. Extreme simplicity: Adding a single linear layer and a classification loss suffices to address CLIP's fine-grained alignment deficiency — an exemplary case of solving a real problem with minimal intervention.
  2. Incisive problem diagnosis: Quantitative keyword co-occurrence statistics from DataComp-1B provide a rigorous explanation for why contrastive learning struggles to capture fine-grained semantics.
  3. Elegant DualCaption strategy: Using short captions for the contrastive loss preserves coarse-grained alignment, while using long captions for the classification loss extracts fine-grained semantics, avoiding the need to carefully tune mixing ratios.
  4. Batch size robustness: The classification loss is inherently independent of batch size, offering a practical solution for resource-constrained training scenarios.

Limitations & Future Work

  • Classification supervision flows only from text to the visual encoder; the symmetric direction, supervising the text encoder with image-derived targets, remains unexplored.
  • IDF weights are precomputed before training and cannot dynamically adapt to shifts in semantic distribution during training.
  • The linear classification head may limit the modeling capacity for more complex semantic relationships.
  • Improvements on certain specialized datasets (e.g., dSprites, smallNORB, and other synthetic benchmarks) are marginal.
  • Orthogonal to methods such as RegionCLIP (region-level supervision) and Long-CLIP (long-text understanding), and can be combined with them.
  • The idea of classification supervision draws from Classification-based Supervision (Huang et al. 2024); this work scales it to CLIP pretraining.
  • Provides a general enhancement scheme applicable to other contrastive learning variants such as SigLIP and FLIP.
  • Implication: the information bottleneck in contrastive learning may be more severe than commonly assumed; supplementing with classification signals is a simple and effective remedy.

Rating

  • Novelty: ⭐⭐⭐⭐ The method is extremely simple yet the problem diagnosis is insightful; IDF weighting and DualCaption demonstrate genuine ingenuity.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multi-scale models, cross-framework generalization, 38-dataset evaluation, MLLM integration, and batch size analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ Motivation is thoroughly analyzed and well supported by quantitative statistics.
  • Value: ⭐⭐⭐⭐⭐ Zero-overhead integration into any CLIP training pipeline; open-source code available.